Speech Recognition — Technology Wiki

Overview

Direct Answer

Speech recognition, also called automatic speech recognition (ASR) or speech-to-text, is a technology that converts spoken audio into written text by analysing acoustic and linguistic features. It is a core component of voice interfaces and accessibility systems in both enterprise and consumer applications.

How It Works

The process typically combines acoustic modelling, which maps sound wave characteristics to phonetic units, with language modelling, which predicts probable word sequences. Modern implementations use deep neural networks to extract features from audio spectrograms; a decoding algorithm then outputs the most likely text sequence given the acoustic and linguistic constraints.
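The way a decoder weighs acoustic evidence against language-model evidence can be illustrated with a small sketch. This is a toy example, not a production decoder: the candidate transcripts, the log-likelihood values, the bigram probabilities, and the names `ACOUSTIC_LOGP`, `BIGRAM_P`, `lm_logp`, and `decode` are all hypothetical, chosen only to show how a language model can overturn a choice that the acoustic scores alone would make.

```python
import math

# Hypothetical acoustic log-likelihoods, log P(audio | words), for two
# candidate transcripts of the same utterance. Values are illustrative only.
ACOUSTIC_LOGP = {
    ("recognise", "speech"): -12.0,
    ("wreck", "a", "nice", "beach"): -11.5,
}

# Hypothetical bigram language model: P(word | previous word).
BIGRAM_P = {
    ("<s>", "recognise"): 0.01,
    ("recognise", "speech"): 0.2,
    ("<s>", "wreck"): 0.001,
    ("wreck", "a"): 0.1,
    ("a", "nice"): 0.01,
    ("nice", "beach"): 0.005,
}


def lm_logp(words, floor=1e-6):
    """Sum of bigram log-probabilities; unseen bigrams get a small floor."""
    prev = "<s>"
    total = 0.0
    for word in words:
        total += math.log(BIGRAM_P.get((prev, word), floor))
        prev = word
    return total


def decode(candidates, lm_weight=1.0):
    """Return the candidate with the best combined acoustic + LM score."""
    def score(words):
        return ACOUSTIC_LOGP[words] + lm_weight * lm_logp(words)
    return max(candidates, key=score)


candidates = list(ACOUSTIC_LOGP)
acoustic_only = decode(candidates, lm_weight=0.0)
with_language_model = decode(candidates, lm_weight=1.0)
print(" ".join(acoustic_only))
print(" ".join(with_language_model))
```

With the language model switched off (`lm_weight=0.0`), the acoustically closer but implausible candidate wins; with it switched on, the more probable word sequence is selected. Real systems search over a far larger hypothesis space (for example with beam search) rather than scoring a fixed list.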

Why It Matters

Organisations deploy this technology to reduce transcription labour costs, enable hands-free device control in safety-critical environments, and improve accessibility for users with mobility impairments. Accuracy gains from deep learning models have made deployment economically viable in customer service, medical documentation, and voice command systems.

Common Applications

Virtual assistants use it for command processing, contact centres use it for call transcription and quality assurance, and healthcare providers use it for clinical note generation. Telecommunications companies integrate it into voicemail-to-text services, whilst accessibility tools use it to provide real-time captioning for deaf and hard-of-hearing users.

Key Considerations

Accuracy degrades significantly with background noise, accents that are under-represented in the training data, and domain-specific terminology, so careful dataset curation and model fine-tuning are usually required. Latency requirements vary by application: real-time systems demand optimised inference, whilst batch transcription permits more computationally intensive approaches.
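Both considerations are usually tracked with simple metrics. The sketch below shows two common ones, assuming whitespace-tokenised transcripts: word error rate (word-level edit distance divided by the number of reference words) and real-time factor (processing time divided by audio duration). The function names `word_error_rate` and `real_time_factor` and the sample sentences are illustrative, not taken from any particular toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Values below 1.0 mean transcription runs faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


wer = word_error_rate("the patient has a fever", "the patient has fever")
rtf = real_time_factor(processing_seconds=3.0, audio_seconds=12.0)
print(f"WER: {wer:.2f}, RTF: {rtf:.2f}")
```

Measuring word error rate separately on noisy, accented, and domain-specific test sets makes it easier to see which of the factors above is driving errors, whilst the real-time factor indicates whether a model is fast enough for a live application.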

Related in Speech & Audio

Speech Synthesis

The artificial production of human speech from text, also known as text-to-speech.

Text-to-Speech

Technology that converts written text into natural-sounding spoken audio using neural networks, enabling voice interfaces, accessibility tools, and content narration.

Speech-to-Text

The automatic transcription of spoken language into written text using acoustic and language models, foundational to voice assistants and meeting transcription systems.

More in Natural Language Processing

Relation Extraction

Parsing & Structure

Identifying semantic relationships between entities mentioned in text.

Multilingual Model

Semantics & Representation

A language model trained on text from dozens or hundreds of languages simultaneously, enabling cross-lingual understanding and generation without language-specific fine-tuning.

Sentiment Analysis

Text Analysis

The computational study of people's opinions, emotions, and attitudes expressed in text.

Cross-Lingual Transfer

Core NLP

The application of models trained in one language to perform tasks in another language, leveraging shared multilingual representations learned during pre-training.

Grounding

Semantics & Representation

Connecting language model outputs to real-world knowledge, facts, or data sources to improve factual accuracy.

Hallucination Detection

Semantics & Representation

Techniques for identifying when AI language models generate plausible but factually incorrect or unsupported content.

Code Generation

Semantics & Representation

The automated production of source code from natural language specifications or partial code context, powered by large language models trained on programming repositories.

Text-to-SQL

Generation & Translation

The task of automatically converting natural language questions into executable SQL queries, enabling non-technical users to interrogate databases through conversational interfaces.