5. Audio Processing: Speech Recognition and Generation

Transformers.js extends its capabilities beyond text and vision to include audio processing tasks. This chapter will cover two fundamental audio tasks: Automatic Speech Recognition (ASR) to convert spoken words into text, and Text-to-Speech (TTS) to generate natural-sounding speech from text.

5.1. Automatic Speech Recognition (ASR)

ASR allows applications to transcribe spoken language into written text. This is crucial for voice assistants, dictation tools, and transcribing audio recordings.

5.1.1. Detailed Explanation

An ASR pipeline takes audio input (from a microphone, an audio file, or a URL) and outputs the transcribed text. Models like OpenAI’s Whisper are prominent in this field, known for their accuracy across many languages and domains. The audio must be sampled at the rate the model expects (16 kHz for Whisper) and converted into the model’s input features before inference. Transformers.js handles most of this automatically when you use the pipeline: pass a URL and it fetches, decodes, and resamples the audio for you; pass raw samples and it runs feature extraction on them directly.
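
Before wiring up a microphone, here is a minimal sketch of the pipeline call on its own: it transcribes a short clip by URL. The jfk.wav sample from the Transformers.js docs dataset is used as an assumed example input; any short audio URL works.

// asr-quickstart.js (standalone sketch, separate from the app built below)
import { pipeline } from "https://esm.sh/@huggingface/transformers";

// Create the ASR pipeline; the model is downloaded and cached on first use.
const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');

// The pipeline accepts a URL (or a Float32Array of 16 kHz samples) and handles
// fetching, decoding, resampling, and feature extraction internally.
const output = await transcriber('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav');
console.log(output.text); // e.g. " And so my fellow Americans, ask not what your country can do for you..."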

5.1.2. Code Examples: Live Voice Transcriber

Let’s build a simple transcriber that takes audio input from the user’s microphone.

<!-- index.html -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Voice Transcriber</title>
    <style>
        body { font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; margin: 20px; background-color: #f4f7f6; color: #333; display: flex; flex-direction: column; align-items: center; }
        .container { background-color: #ffffff; padding: 30px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); width: 100%; max-width: 600px; text-align: center; }
        h1 { color: #2c3e50; margin-bottom: 20px; }
        button { background-color: #d32f2f; color: white; padding: 12px 25px; border: none; border-radius: 5px; cursor: pointer; font-size: 17px; transition: background-color 0.3s ease; margin-bottom: 15px; }
        button:hover:not(:disabled) { background-color: #b71c1c; }
        button:disabled { background-color: #cccccc; cursor: not-allowed; }
        button.recording { background-color: #4CAF50; }
        button.recording:hover { background-color: #45a049; }
        #output { margin-top: 25px; padding: 20px; border: 1px solid #e0e0e0; border-radius: 8px; background-color: #ffebee; text-align: left; }
        #loadingSpinner { border: 4px solid #f3f3f3; border-top: 4px solid #d32f2f; border-radius: 50%; width: 20px; height: 20px; animation: spin 1s linear infinite; display: inline-block; margin-right: 10px; vertical-align: middle; display: none; }
        @keyframes spin { 0% { transform: rotate(0deg); } 100% { transform: rotate(360deg); } }
        #status { font-style: italic; color: #555; margin-bottom: 15px; }
    </style>
</head>
<body>
    <div class="container">
        <h1>Live Voice Transcriber</h1>
        <p id="status">Click "Start Recording" to begin.</p>
        <button id="recordButton">
            <span id="loadingSpinner"></span> Start Recording
        </button>
        <div id="output">
            <h3>Transcription:</h3>
            <p id="transcriptionResult"></p>
        </div>
    </div>

    <script type="module" src="./app.js"></script>
</body>
</html>
// app.js
import { pipeline } from "https://esm.sh/@huggingface/transformers";

document.addEventListener('DOMContentLoaded', async () => {
    const recordButton = document.getElementById('recordButton');
    const transcriptionResult = document.getElementById('transcriptionResult');
    const loadingSpinner = document.getElementById('loadingSpinner');
    const statusText = document.getElementById('status');

    statusText.textContent = "Loading ASR model... this may take a moment.";
    recordButton.disabled = true;
    loadingSpinner.style.display = 'inline-block';

    let transcriber;
    try {
        // Using a small Whisper model for ASR. 'tiny.en' is good for English, 'tiny' for multilingual.
        transcriber = await pipeline(
            'automatic-speech-recognition',
            'Xenova/whisper-tiny.en',
            {
                device: 'webgpu',
                dtype: 'q8',
            }
        );
        statusText.textContent = "Model loaded. Click 'Start Recording' to speak.";
        recordButton.disabled = false;
        loadingSpinner.style.display = 'none';
    } catch (error) {
        console.error("Failed to load ASR model:", error);
        statusText.textContent = "Error loading ASR model. Check console.";
        loadingSpinner.style.display = 'none';
        return; // Prevent further execution if model fails to load
    }

    let mediaRecorder;
    let audioChunks = [];
    let isRecording = false;

    recordButton.addEventListener('click', async () => {
        if (!isRecording) {
            // Start recording
            try {
                const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
                mediaRecorder = new MediaRecorder(stream);
                audioChunks = [];

                mediaRecorder.ondataavailable = (event) => {
                    audioChunks.push(event.data);
                };

                mediaRecorder.onstop = async () => {
                    recordButton.disabled = true;
                    loadingSpinner.style.display = 'inline-block';
                    statusText.textContent = "Transcribing audio...";
                    recordButton.classList.remove('recording');
                    recordButton.textContent = "Processing...";

                    const audioBlob = new Blob(audioChunks, { type: 'audio/webm' });

                    try {
                        // Decode the recorded blob into raw samples at 16 kHz, the rate
                        // Whisper expects. Creating the AudioContext with
                        // { sampleRate: 16000 } makes the browser resample while decoding.
                        const arrayBuffer = await audioBlob.arrayBuffer();
                        const audioContext = new AudioContext({ sampleRate: 16000 });
                        const decoded = await audioContext.decodeAudioData(arrayBuffer);
                        const audio = decoded.getChannelData(0); // mono Float32Array
                        audioContext.close();

                        // Perform transcription
                        const output = await transcriber(audio);
                        transcriptionResult.textContent = output.text;
                        statusText.textContent = "Transcription complete. Click 'Start Recording' again.";

                    } catch (error) {
                        console.error("Error during transcription:", error);
                        transcriptionResult.textContent = "Error transcribing audio. Please try again.";
                        statusText.textContent = "Error. Click 'Start Recording' to try again.";
                    } finally {
                        recordButton.disabled = false;
                        loadingSpinner.style.display = 'none';
                        recordButton.textContent = "Start Recording";
                    }

                    // Stop microphone stream tracks
                    stream.getTracks().forEach(track => track.stop());
                };

                mediaRecorder.start();
                isRecording = true;
                recordButton.textContent = "Stop Recording";
                recordButton.classList.add('recording');
                statusText.textContent = "Recording... Click 'Stop Recording' to transcribe.";

            } catch (err) {
                console.error('Error accessing microphone:', err);
                statusText.textContent = "Microphone access denied or failed. Please allow microphone access.";
            }
        } else {
            // Stop recording
            mediaRecorder.stop();
            isRecording = false;
        }
    });
});

5.1.3. Exercises/Mini-Challenges: ASR Refinement

  1. Audio File Upload: Instead of live recording, add an input for users to upload an audio file (e.g., .wav, .mp3) and transcribe it. Read the File with file.arrayBuffer() and decode it with AudioContext.decodeAudioData, just as the recorded blob is handled above (see the sketch after this list).
  2. Streaming Transcription (Advanced): Implement a real-time, streaming ASR. This is complex and would involve:
    • Capturing audio in small chunks (e.g., using AudioWorkletNode).
    • Continuously feeding these chunks to the ASR model, potentially using a tokenizer and model directly instead of pipeline for finer control over input/output.
    • Displaying partial transcriptions as they are generated.
  3. Language Detection + ASR: For multilingual Whisper models (e.g., Xenova/whisper-tiny), first detect the language of the spoken audio (Whisper predicts it when none is specified), then pass that language, or one selected by the user, to the transcriber via its language option (e.g., transcriber(audio, { language: 'french' })) to guide the transcription for better accuracy.
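
For the first challenge, a rough sketch of the decoding step is shown below. It assumes a hypothetical <input type="file" id="audioFile"> element in the page and reuses the transcriber pipeline created in app.js above.

// Sketch for challenge 1: transcribe an uploaded audio file.
// Assumes a hypothetical <input type="file" id="audioFile" accept="audio/*"> in index.html
// and the `transcriber` pipeline created earlier in app.js.
document.getElementById('audioFile').addEventListener('change', async (event) => {
    const file = event.target.files[0];
    if (!file) return;

    // Decode the file to raw samples at 16 kHz, the rate Whisper expects.
    const arrayBuffer = await file.arrayBuffer();
    const audioContext = new AudioContext({ sampleRate: 16000 });
    const decoded = await audioContext.decodeAudioData(arrayBuffer);
    const audio = decoded.getChannelData(0); // mono Float32Array
    audioContext.close();

    const output = await transcriber(audio);
    console.log(output.text);
});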

5.2. Text-to-Speech (TTS)

TTS converts written text into synthesized human-like speech. This is essential for screen readers, voice interfaces, and generating audio content.

5.2.1. Detailed Explanation

A TTS pipeline takes text as input and generates an audio waveform, typically a Float32Array of samples together with a sampling rate. The waveform can be played directly with the Web Audio API, or encoded (for example as WAV) so it can be loaded into a standard HTML <audio> element. Models like SpeechT5 additionally accept speaker embeddings, which let you synthesize speech in a specific voice.
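
As a minimal sketch of that round trip (reusing the speaker-embeddings URL from the code example that follows, and playing the result directly through the Web Audio API instead of encoding it to WAV):

// tts-quickstart.js (standalone sketch, separate from the app built below)
import { pipeline } from "https://esm.sh/@huggingface/transformers";

const synthesizer = await pipeline('text-to-speech', 'Xenova/speecht5_tts', {
    dtype: 'fp32', // full-precision weights tend to give cleaner audio (larger download)
});

// SpeechT5 needs a speaker embedding (an x-vector); a URL to a .bin of Float32 values works.
const speaker_embeddings = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/speaker_embeddings/cmu_us_slt_arctic-wav-arctic_a0005.bin';

const output = await synthesizer('Hello from Transformers.js!', { speaker_embeddings });
// output.audio is a Float32Array of samples; output.sampling_rate is its sample rate.

// Play it directly with the Web Audio API.
// Note: browsers may require a user gesture (e.g., a click) before audio can start.
const ctx = new AudioContext({ sampleRate: output.sampling_rate });
const buffer = ctx.createBuffer(1, output.audio.length, output.sampling_rate);
buffer.copyToChannel(output.audio, 0);
const source = ctx.createBufferSource();
source.buffer = buffer;
source.connect(ctx.destination);
source.start();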

5.2.2. Code Examples: Text-to-Speech Synthesizer

<!-- index.html -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Text-to-Speech</title>
    <style>
        body { font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; margin: 20px; background-color: #f4f7f6; color: #333; display: flex; flex-direction: column; align-items: center; }
        .container { background-color: #ffffff; padding: 30px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); width: 100%; max-width: 600px; text-align: center; }
        h1 { color: #2c3e50; margin-bottom: 20px; }
        textarea { width: calc(100% - 20px); height: 120px; margin-bottom: 15px; padding: 10px; border: 1px solid #ddd; border-radius: 5px; font-size: 16px; resize: vertical; }
        button { background-color: #3f51b5; color: white; padding: 12px 25px; border: none; border-radius: 5px; cursor: pointer; font-size: 17px; transition: background-color 0.3s ease; margin-bottom: 15px; }
        button:hover:not(:disabled) { background-color: #303f9f; }
        button:disabled { background-color: #cccccc; cursor: not-allowed; }
        #output { margin-top: 25px; padding: 20px; border: 1px solid #e0e0e0; border-radius: 8px; background-color: #e8eaf6; text-align: left; }
        #loadingSpinner { border: 4px solid #f3f3f3; border-top: 4px solid #3f51b5; border-radius: 50%; width: 20px; height: 20px; animation: spin 1s linear infinite; display: inline-block; margin-right: 10px; vertical-align: middle; display: none; }
        @keyframes spin { 0% { transform: rotate(0deg); } 100% { transform: rotate(360deg); } }
        #audioPlayer { width: 100%; margin-top: 15px; }
    </style>
</head>
<body>
    <div class="container">
        <h1>Text-to-Speech Synthesizer</h1>
        <textarea id="ttsText" placeholder="Enter text to convert to speech..."></textarea>
        <button id="speakButton">
            <span id="loadingSpinner"></span> Synthesize Speech
        </button>
        <div id="output">
            <h3>Generated Audio:</h3>
            <p id="ttsStatus">Enter text and click "Synthesize Speech".</p>
            <audio id="audioPlayer" controls></audio>
        </div>
    </div>

    <script type="module" src="./app.js"></script>
</body>
</html>
// app.js
import { pipeline } from "https://esm.sh/@huggingface/transformers";

document.addEventListener('DOMContentLoaded', async () => {
    const ttsText = document.getElementById('ttsText');
    const speakButton = document.getElementById('speakButton');
    const ttsStatus = document.getElementById('ttsStatus');
    const audioPlayer = document.getElementById('audioPlayer');
    const loadingSpinner = document.getElementById('loadingSpinner');

    ttsStatus.textContent = "Loading Text-to-Speech model...";
    speakButton.disabled = true;
    loadingSpinner.style.display = 'inline-block';

    let synthesizer;
    try {
        // SpeechT5 is a good choice for high-quality TTS. It can also use speaker embeddings.
        synthesizer = await pipeline(
            'text-to-speech',
            'Xenova/speecht5_tts',
            {
                device: 'webgpu',
                // Per-module dtypes for SpeechT5 for better quality/performance balance
                dtype: {
                    embed_tokens: "fp16", // for faster token embedding
                    decoder_model_merged: "q4", // for the main decoder
                },
            }
        );

        // Load a default speaker embedding (an x-vector, not raw audio); SpeechT5 requires one.
        // The pipeline accepts a URL to a binary file of Float32 values and fetches it for us,
        // and the matching 'Xenova/speecht5_hifigan' vocoder is loaded automatically.
        // Stash the URL on the pipeline object so the click handler below can reuse it;
        // it is passed as a generation option, not as part of the pipeline config.
        synthesizer.speaker_embeddings = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/speaker_embeddings/cmu_us_slt_arctic-wav-arctic_a0005.bin';

        ttsStatus.textContent = "Model loaded. Enter text and synthesize speech!";
        speakButton.disabled = false;
        loadingSpinner.style.display = 'none';

    } catch (error) {
        console.error("Failed to load Text-to-Speech model:", error);
        ttsStatus.textContent = "Error loading TTS model. Check console.";
        loadingSpinner.style.display = 'none';
        return;
    }

    speakButton.addEventListener('click', async () => {
        const text = ttsText.value.trim();
        if (text === "") {
            ttsStatus.textContent = "Please enter some text to synthesize.";
            audioPlayer.src = '';
            return;
        }

        speakButton.disabled = true;
        loadingSpinner.style.display = 'inline-block';
        ttsStatus.textContent = "Synthesizing speech...";
        audioPlayer.src = ''; // Clear previous audio

        try {
            // SpeechT5 requires speaker_embeddings. Ensure it's passed here.
            const output = await synthesizer(text, { speaker_embeddings: synthesizer.speaker_embeddings });

            // 'output' contains the audio samples (Float32Array) and their sampling_rate
            const audioData = output.audio;
            const sampleRate = output.sampling_rate;

            // Encode the raw Float32Array samples as a WAV Blob in a worker so the
            // result can be played (and saved) from the <audio> element.
            const worker = new Worker('./audioRecorderWorker.js'); // We'll create this worker below
            worker.postMessage({
                command: 'init',
                config: { sampleRate: sampleRate }
            });
            worker.postMessage({
                command: 'record',
                buffer: audioData
            });
            worker.postMessage({
                command: 'exportWAV',
                type: 'audio/wav'
            });

            worker.onmessage = (e) => {
                if (e.data.type === 'audio/wav') {
                    const audioUrl = URL.createObjectURL(e.data.blob);
                    audioPlayer.src = audioUrl;
                    audioPlayer.play();
                    ttsStatus.textContent = "Speech synthesized and playing!";
                    worker.terminate(); // one-shot worker; free it once the WAV is ready
                }
            };


        } catch (error) {
            console.error("Error during speech synthesis:", error);
            ttsStatus.textContent = "Error synthesizing speech. Please try again.";
            audioPlayer.src = '';
        } finally {
            speakButton.disabled = false;
            loadingSpinner.style.display = 'none';
        }
    });
});

audioRecorderWorker.js (create this file in the same directory as app.js):

// audioRecorderWorker.js
let recLength = 0,
    recBuffer = [],
    sampleRate;

self.onmessage = function(e){
    switch(e.data.command){
        case 'init':
            init(e.data.config);
            break;
        case 'record':
            record(e.data.buffer);
            break;
        case 'exportWAV':
            exportWAV(e.data.type);
            break;
        case 'clear':
            clear();
            break;
    }
};

function init(config){
    sampleRate = config.sampleRate;
    recBuffer = [];
    recLength = 0;
}

function record(inputBuffer){
    recBuffer.push(inputBuffer);
    recLength += inputBuffer.length;
}

function exportWAV(type){
    let buffer = mergeBuffers(recBuffer, recLength);
    let dataview = encodeWAV(buffer, sampleRate);
    let audioBlob = new Blob([dataview], { type: type });

    self.postMessage({
        type: type,
        blob: audioBlob
    });
}

function clear(){
    recLength = 0;
    recBuffer = [];
}

function mergeBuffers(recBuffers, recLength){
    let result = new Float32Array(recLength);
    let offset = 0;
    for (let i = 0; i < recBuffers.length; i++){
        result.set(recBuffers[i], offset);
        offset += recBuffers[i].length;
    }
    return result;
}

function writeString(view, offset, string){
    for (let i = 0; i < string.length; i++){
        view.setUint8(offset + i, string.charCodeAt(i));
    }
}

function encodeWAV(samples, sampleRate){
    let buffer = new ArrayBuffer(44 + samples.length * 2);
    let view = new DataView(buffer);

    /* RIFF identifier */
    writeString(view, 0, 'RIFF');
    view.setUint32(4, 36 + samples.length * 2, true);
    writeString(view, 8, 'WAVE');
    /* FMT chunk */
    writeString(view, 12, 'fmt ');
    view.setUint32(16, 16, true);
    view.setUint16(20, 1, true);
    view.setUint16(22, 1, true);
    view.setUint32(24, sampleRate, true);
    view.setUint32(28, sampleRate * 2, true);
    view.setUint16(32, 2, true);
    view.setUint16(34, 16, true);
    /* data chunk */
    writeString(view, 36, 'data');
    view.setUint32(40, samples.length * 2, true);

    floatTo16BitPCM(view, 44, samples);

    return view;
}

function floatTo16BitPCM(output, offset, input){
    for (let i = 0; i < input.length; i++, offset+=2){
        let s = Math.max(-1, Math.min(1, input[i]));
        output.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
    }
}

5.2.3. Exercises/Mini-Challenges: TTS Voice Control

  1. Multiple Speaker Embeddings: SpeechT5 can synthesize in different voices using speaker embeddings. Find several speaker_embeddings files on the Hugging Face Hub (or generate your own) and provide a dropdown or buttons in your UI to allow the user to select different voices. Each time a new voice is selected, update synthesizer.speaker_embeddings.
  2. Adjust Speech Speed/Pitch: Speed and pitch are not exposed as model parameters in Transformers.js. One alternative is the browser’s native SpeechSynthesisUtterance API, which offers rate, pitch, and voice selection but uses the browser’s built-in voices rather than the model’s. For finer control over the generated waveform itself you would need Web Audio API processing to change pitch and tempo, which is an advanced audio task. For this exercise, integrate SpeechSynthesisUtterance as an alternative TTS method (see the sketch after this list) and compare its capabilities with the model’s output.
  3. Real-time Read Aloud: Create a feature where, as the user types into the textarea, the application reads aloud the last completed sentence or paragraph. This requires sentence-boundary detection and a queue of speech requests so that new input does not interrupt speech that is already playing.
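
For the second challenge, here is a small sketch of the native Web Speech API baseline you can compare against. The speakNatively helper is a hypothetical name; available voices depend on the browser and operating system.

// Sketch for challenge 2: native browser TTS via the Web Speech API (no model involved).
function speakNatively(text, { rate = 1.0, pitch = 1.0, voiceName } = {}) {
    const utterance = new SpeechSynthesisUtterance(text);
    utterance.rate = rate;   // 0.1 to 10, default 1
    utterance.pitch = pitch; // 0 to 2, default 1

    // Optionally pick a specific installed voice by name.
    if (voiceName) {
        const voice = speechSynthesis.getVoices().find(v => v.name === voiceName);
        if (voice) utterance.voice = voice;
    }
    speechSynthesis.speak(utterance);
}

// Example usage: faster and slightly higher-pitched than the default voice.
speakNatively('This is the browser speaking, not SpeechT5.', { rate: 1.2, pitch: 1.1 });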