Audio Basics

Overview

LlmTornado provides comprehensive audio capabilities including speech-to-text transcription, text-to-speech synthesis, and audio translation. These features enable you to build voice-enabled applications, transcribe audio content, and create natural-sounding speech from text. The library supports multiple AI providers and audio formats.

Quick Start

Here's a simple transcription example:

csharp
using LlmTornado;
using LlmTornado.Audio;
using LlmTornado.Audio.Models;

TornadoApi api = new TornadoApi("your-api-key");

// Load audio file
byte[] audioData = await File.ReadAllBytesAsync("recording.wav");

// Transcribe audio to text
TranscriptionResult? transcription = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.OpenAi.Whisper.V2,
    ResponseFormat = AudioTranscriptionResponseFormats.Text
});

Console.WriteLine(transcription?.Text);

Prerequisites

Before using audio features, ensure you have:

  1. The LlmTornado package installed
  2. A valid API key with audio endpoint access
  3. Audio files in supported formats (mp3, mp4, wav, webm, etc.)
  4. Basic understanding of async programming

Detailed Explanation

Audio Transcription

Audio transcription converts spoken words in audio files into text. LlmTornado supports multiple transcription models, including OpenAI's Whisper, Mistral's Voxtral, and Groq-hosted Whisper variants.

Supported Audio Formats

  • WAV (.wav)
  • MP3 (.mp3)
  • MP4 (.mp4)
  • WebM (.webm)
  • And more standard audio formats (an extension-mapping sketch follows this list)
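
When loading files dynamically, it helps to map the file extension to the matching AudioFileTypes value before building the request. A minimal sketch, assuming AudioFileTypes exposes members such as Mp3, Mp4, and Webm alongside the Wav member used throughout this guide:

csharp
// Hypothetical helper mapping a file extension to an AudioFileTypes value.
// Assumes AudioFileTypes exposes Mp3, Mp4, and Webm alongside Wav.
static AudioFileTypes GetAudioFileType(string path) => Path.GetExtension(path).ToLowerInvariant() switch
{
    ".wav" => AudioFileTypes.Wav,
    ".mp3" => AudioFileTypes.Mp3,
    ".mp4" => AudioFileTypes.Mp4,
    ".webm" => AudioFileTypes.Webm,
    _ => throw new NotSupportedException($"Unsupported audio format: {path}")
};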

Response Formats

  • Text - Plain text transcription
  • JSON - Structured response with metadata
  • Verbose JSON - Detailed response with timestamps and segments
  • SRT - Subtitle format with timestamps
  • VTT - WebVTT subtitle format

Text-to-Speech (TTS)

Convert text into natural-sounding speech with various voice options and configurations.

Available Voices

LlmTornado supports multiple voice options including the following (a sampling sketch follows this list):

  • Alloy - Neutral and balanced voice
  • Echo - Clear and professional voice
  • Fable - Warm and expressive voice
  • Onyx - Deep and authoritative voice
  • Nova - Youthful and energetic voice
  • Shimmer - Soft and gentle voice
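
To pick a voice for your application, generate the same line with each option and compare the results. A minimal sketch using the CreateSpeech call shown later in this guide:

csharp
// Render the same sentence with each voice for side-by-side comparison.
SpeechVoice[] voices = [ SpeechVoice.Alloy, SpeechVoice.Echo, SpeechVoice.Fable,
                         SpeechVoice.Onyx, SpeechVoice.Nova, SpeechVoice.Shimmer ];

foreach (SpeechVoice voice in voices)
{
    SpeechTtsResult? sample = await api.Audio.CreateSpeech(new SpeechRequest
    {
        Input = "This is a short voice sample.",
        Model = AudioModel.OpenAi.Gpt4.Gpt4OMiniTts,
        ResponseFormat = SpeechResponseFormat.Mp3,
        Voice = voice
    });

    if (sample is not null)
    {
        // File name uses the voice's string representation, e.g. sample_Alloy.mp3
        await sample.SaveAndDispose($"sample_{voice}.mp3");
    }
}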

Basic Usage

Simple Transcription

csharp
byte[] audioData = await File.ReadAllBytesAsync("audio.wav");

TranscriptionResult? transcription = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.OpenAi.Whisper.V2,
    ResponseFormat = AudioTranscriptionResponseFormats.Text
});

if (transcription is not null)
{
    Console.WriteLine(transcription.Text);
}

Transcription with Different Providers

csharp
// OpenAI Whisper
TranscriptionResult? openAiResult = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.OpenAi.Whisper.V2,
    ResponseFormat = AudioTranscriptionResponseFormats.Text
});

// Mistral Voxtral
TranscriptionResult? mistralResult = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.Mistral.Free.VoxtralMini2507,
    ResponseFormat = AudioTranscriptionResponseFormats.Text
});

// Groq Whisper
TranscriptionResult? groqResult = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.Groq.OpenAi.WhisperV3Turbo,
    ResponseFormat = AudioTranscriptionResponseFormats.VerboseJson
});
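
The examples above route to different providers through a single TornadoApi instance, which therefore needs a key for each provider. One common setup pattern, assuming LlmTornado's ProviderAuthentication and LLmProviders types:

csharp
// Register one key per provider; the chosen Model then routes each request.
TornadoApi api = new TornadoApi(new List<ProviderAuthentication>
{
    new ProviderAuthentication(LLmProviders.OpenAi, "your-openai-key"),
    new ProviderAuthentication(LLmProviders.Mistral, "your-mistral-key"),
    new ProviderAuthentication(LLmProviders.Groq, "your-groq-key")
});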

JSON Response Format

Get structured transcription data:

csharp
byte[] audioData = await File.ReadAllBytesAsync("sample.wav");

TranscriptionResult? transcription = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.OpenAi.Whisper.V2,
    ResponseFormat = AudioTranscriptionResponseFormats.Json
});

if (transcription is not null)
{
    Console.WriteLine($"Text: {transcription.Text}");
    Console.WriteLine($"Language: {transcription.Language}");
    Console.WriteLine($"Duration: {transcription.Duration}");
}

Text-to-Speech Synthesis

Generate speech from text:

csharp
SpeechTtsResult? result = await api.Audio.CreateSpeech(new SpeechRequest
{
    Input = "Hi, how are you?",
    Model = AudioModel.OpenAi.Gpt4.Gpt4OMiniTts,
    ResponseFormat = SpeechResponseFormat.Mp3,
    Voice = SpeechVoice.Alloy,
    Instructions = "You are a very sad, tired person."
});

if (result is not null)
{
    await result.SaveAndDispose("output.mp3");
}

Advanced Usage

Streaming Transcription

Stream transcription results in real-time:

csharp
byte[] audioData = await File.ReadAllBytesAsync("audio.wav");

await api.Audio.StreamTranscriptionRich(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.OpenAi.Gpt4.Gpt4OTranscribe,
    ResponseFormat = AudioTranscriptionResponseFormats.Text
}, new TranscriptionStreamEventHandler
{
    ChunkHandler = (chunk) =>
    {
        Console.Write(chunk);
        return ValueTask.CompletedTask;
    },
    BlockHandler = (block) =>
    {
        Console.WriteLine();
        return ValueTask.CompletedTask;
    }
});
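
As the handler names suggest, ChunkHandler receives each incremental piece of text as it arrives, while BlockHandler fires when a completed block of the transcript is available; that is why the example writes chunks inline and a newline per block.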

Transcription with Timestamps

Get detailed timing information:

csharp
byte[] audioData = await File.ReadAllBytesAsync("audio.wav");

TranscriptionResult? transcription = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.OpenAi.Whisper.V2,
    ResponseFormat = AudioTranscriptionResponseFormats.VerboseJson,
    TimestampGranularities = [ 
        TimestampGranularities.Segment, 
        TimestampGranularities.Word 
    ]
});

if (transcription is not null)
{
    Console.WriteLine("Transcript");
    Console.WriteLine("--------------------------");
    Console.WriteLine(transcription.Text);
    Console.WriteLine();
    
    Console.WriteLine("Segments");
    Console.WriteLine("--------------------------");
    foreach (TranscriptionSegment segment in transcription.Segments)
    {
        Console.WriteLine($"[{segment.Start:F2}s - {segment.End:F2}s]: {segment.Text}");
    }
    
    Console.WriteLine();
    Console.WriteLine("Words");
    Console.WriteLine("--------------------------");
    foreach (TranscriptionWord word in transcription.Words)
    {
        Console.WriteLine($"[{word.Start:F2}s]: {word.Word}");
    }
}

Transcription with Log Probabilities

Get confidence scores for transcription:

csharp
byte[] audioData = await File.ReadAllBytesAsync("audio.wav");

TranscriptionResult? transcription = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.OpenAi.Gpt4.Gpt4OTranscribe,
    ResponseFormat = AudioTranscriptionResponseFormats.Json,
    Include = [ TranscriptionRequestIncludeItems.Logprobs ]
});

if (transcription is not null)
{
    Console.WriteLine("Transcript");
    Console.WriteLine("--------------------------");
    Console.WriteLine(transcription.Text);
    Console.WriteLine();
    
    Console.WriteLine("Logprobs (Confidence Scores)");
    Console.WriteLine("--------------------------");
    if (transcription.Logprobs is not null)
    {
        foreach (TranscriptionLogprob logprob in transcription.Logprobs)
        {
            Console.WriteLine($"Token: {logprob.Token}, Probability: {logprob.Logprob:F4}");
        }   
    }
}
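
Note that Logprob is a log probability; exponentiating it yields a linear confidence between 0 and 1. A minimal sketch reusing the loop above:

csharp
// Convert each log probability to a linear probability in [0, 1].
// Assumes transcription.Logprobs is not null, as checked above.
foreach (TranscriptionLogprob logprob in transcription.Logprobs)
{
    double probability = Math.Exp(logprob.Logprob);
    Console.WriteLine($"Token: {logprob.Token}, Confidence: {probability:P1}");
}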

SRT Subtitle Generation

Generate SRT subtitles for videos:

csharp
byte[] audioData = await File.ReadAllBytesAsync("video_audio.wav");

TranscriptionResult? transcription = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.OpenAi.Whisper.V2,
    ResponseFormat = AudioTranscriptionResponseFormats.Srt
});

if (transcription is not null)
{
    // Save SRT file
    await File.WriteAllTextAsync("subtitles.srt", transcription.Text);
    Console.WriteLine("Subtitles saved to subtitles.srt");
}

Customized TTS

Create expressive speech with custom instructions:

csharp
SpeechTtsResult? result = await api.Audio.CreateSpeech(new SpeechRequest
{
    Input = "Welcome to our service. We're glad to have you here!",
    Model = AudioModel.OpenAi.Gpt4.Gpt4OMiniTts,
    ResponseFormat = SpeechResponseFormat.Mp3,
    Voice = SpeechVoice.Nova,
    Instructions = "You are an enthusiastic, welcoming host. Speak with energy and warmth.",
    Speed = 1.0 // Normal speed
});

if (result is not null)
{
    await result.SaveAndDispose("welcome.mp3");
}

Best Practices

1. Choose Appropriate Audio Format

  • Use WAV for highest quality transcription
  • Use MP3 for smaller file sizes and web delivery
  • Ensure audio quality is sufficient for accurate transcription

2. Handle Large Audio Files

  • Split very long audio files into chunks (see the sketch after this list)
  • Consider streaming for real-time transcription
  • Monitor API rate limits and quotas
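
A minimal sequential sketch of the chunking idea; SplitIntoChunks is a hypothetical helper, since real chunking must cut on valid audio boundaries (for example by re-encoding segments with an audio library), not by slicing the raw byte array:

csharp
// Transcribe a long recording chunk by chunk and stitch the text together.
// SplitIntoChunks is hypothetical: it must produce valid standalone audio files.
List<string> parts = [];

foreach (byte[] chunk in SplitIntoChunks("long_recording.wav", maxSeconds: 600))
{
    TranscriptionResult? part = await api.Audio.CreateTranscription(new TranscriptionRequest
    {
        File = new AudioFile(chunk, AudioFileTypes.Wav),
        Model = AudioModel.OpenAi.Whisper.V2,
        ResponseFormat = AudioTranscriptionResponseFormats.Text
    });

    if (part is not null)
    {
        parts.Add(part.Text);
    }
}

string fullTranscript = string.Join(" ", parts);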

3. Optimize Transcription Accuracy

  • Use high-quality audio recordings
  • Minimize background noise
  • Use appropriate models for your language/domain

4. Manage TTS Resources

  • Dispose of audio streams properly
  • Cache frequently used audio when appropriate
  • Consider audio compression for storage

5. Error Handling

  • Validate audio file format and size
  • Handle network errors gracefully
  • Implement retry logic for transient failures (see the sketch below)
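
A minimal retry sketch with exponential backoff, assuming failures surface as exceptions (adjust the caught types to the errors you actually observe):

csharp
// Retry a transcription up to three times, backing off 1s then 2s.
TranscriptionResult? transcription = null;

for (int attempt = 1; attempt <= 3; attempt++)
{
    try
    {
        transcription = await api.Audio.CreateTranscription(request); // request built as in the examples above
        break;
    }
    catch (HttpRequestException) when (attempt < 3)
    {
        // Transient network failure: wait, then try again.
        await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
    }
}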

Common Issues

Poor Transcription Accuracy

  • Issue: Inaccurate or garbled transcription
  • Solution: Improve audio quality, reduce background noise
  • Prevention: Use high-quality recording equipment

Unsupported Audio Format

  • Issue: API rejects audio file format
  • Solution: Convert to supported format (WAV, MP3, etc.)
  • Prevention: Validate format before sending

Timeout on Large Files

  • Issue: Transcription times out for long audio
  • Solution: Split into smaller chunks or use streaming
  • Prevention: Implement chunking for files over 10 minutes

Rate Limiting

  • Issue: Too many audio requests
  • Solution: Implement request queuing and rate limiting
  • Prevention: Monitor usage and implement backoff strategies

API Reference

TranscriptionRequest

  • AudioFile File - Audio file to transcribe
  • AudioModel Model - Transcription model to use
  • AudioTranscriptionResponseFormats ResponseFormat - Output format
  • TimestampGranularities[] TimestampGranularities - Timestamp detail level
  • TranscriptionRequestIncludeItems[] Include - Additional data to include

TranscriptionResult

  • string Text - Transcribed text
  • string Language - Detected language
  • double Duration - Audio duration in seconds
  • List<TranscriptionSegment> Segments - Time-stamped segments
  • List<TranscriptionWord> Words - Individual words with timestamps
  • List<TranscriptionLogprob> Logprobs - Confidence scores

SpeechRequest

  • string Input - Text to convert to speech
  • AudioModel Model - TTS model to use
  • SpeechResponseFormat ResponseFormat - Audio output format
  • SpeechVoice Voice - Voice to use
  • string Instructions - Style/emotion instructions
  • double Speed - Playback speed (0.25 to 4.0)

AudioModel

  • OpenAi.Whisper.V2 - OpenAI Whisper model
  • OpenAi.Gpt4.Gpt4OTranscribe - GPT-4O transcription
  • OpenAi.Gpt4.Gpt4OMiniTts - GPT-4O Mini TTS
  • Mistral.Free.VoxtralMini2507 - Mistral Voxtral
  • Groq.OpenAi.WhisperV3Turbo - Groq Whisper Turbo