# Audio Basics

## Overview
LlmTornado provides comprehensive audio capabilities including speech-to-text transcription, text-to-speech synthesis, and audio translation. These features enable you to build voice-enabled applications, transcribe audio content, and create natural-sounding speech from text. The library supports multiple AI providers and audio formats.
## Quick Start
Here's a simple transcription example:
```csharp
using LlmTornado;
using LlmTornado.Audio;
using LlmTornado.Audio.Models;

TornadoApi api = new TornadoApi("your-api-key");

// Load audio file
byte[] audioData = await File.ReadAllBytesAsync("recording.wav");

// Transcribe audio to text
TranscriptionResult? transcription = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.OpenAi.Whisper.V2,
    ResponseFormat = AudioTranscriptionResponseFormats.Text
});

Console.WriteLine(transcription?.Text);
```

## Prerequisites
Before using audio features, ensure you have:
- The LlmTornado package installed
- A valid API key with audio endpoint access
- Audio files in supported formats (mp3, mp4, wav, webm, etc.)
- Basic understanding of async programming
## Detailed Explanation

### Audio Transcription
Audio transcription converts spoken words in audio files into text. LlmTornado supports multiple transcription models including OpenAI's Whisper and Mistral's Voxtral.
#### Supported Audio Formats
- WAV (.wav)
- MP3 (.mp3)
- MP4 (.mp4)
- WebM (.webm)
- And more standard audio formats
#### Response Formats
- Text - Plain text transcription
- JSON - Structured response with metadata
- Verbose JSON - Detailed response with timestamps and segments
- SRT - Subtitle format with timestamps
- VTT - WebVTT subtitle format
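Switching between these formats only requires changing the response format on the request. A minimal sketch for WebVTT output, assuming a `Vtt` member on `AudioTranscriptionResponseFormats` (the SRT variant is shown later in this guide):

```csharp
byte[] audioData = await File.ReadAllBytesAsync("audio.wav");

TranscriptionResult? transcription = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.OpenAi.Whisper.V2,
    ResponseFormat = AudioTranscriptionResponseFormats.Vtt // assumed member
});

// Subtitle formats are returned as plain text in the Text property
await File.WriteAllTextAsync("subtitles.vtt", transcription?.Text ?? "");
```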
### Text-to-Speech (TTS)
Convert text into natural-sounding speech with various voice options and configurations.
#### Available Voices
LlmTornado supports multiple voice options including:
- Alloy - Neutral and balanced voice
- Echo - Clear and professional voice
- Fable - Warm and expressive voice
- Onyx - Deep and authoritative voice
- Nova - Youthful and energetic voice
- Shimmer - Soft and gentle voice
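To compare voices by ear, the sketch below generates one short sample per voice. It assumes all six values exist on `SpeechVoice`; only Alloy and Nova appear in the examples later in this guide.

```csharp
// Generate a short MP3 sample for each voice so they can be compared.
(string Name, SpeechVoice Voice)[] voices =
[
    ("alloy", SpeechVoice.Alloy),
    ("echo", SpeechVoice.Echo),      // assumed member
    ("fable", SpeechVoice.Fable),    // assumed member
    ("onyx", SpeechVoice.Onyx),      // assumed member
    ("nova", SpeechVoice.Nova),
    ("shimmer", SpeechVoice.Shimmer) // assumed member
];

foreach ((string name, SpeechVoice voice) in voices)
{
    SpeechTtsResult? sample = await api.Audio.CreateSpeech(new SpeechRequest
    {
        Input = "This is a short voice sample.",
        Model = AudioModel.OpenAi.Gpt4.Gpt4OMiniTts,
        ResponseFormat = SpeechResponseFormat.Mp3,
        Voice = voice
    });

    if (sample is not null)
    {
        await sample.SaveAndDispose($"sample_{name}.mp3");
    }
}
```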
## Basic Usage

### Simple Transcription
```csharp
byte[] audioData = await File.ReadAllBytesAsync("audio.wav");

TranscriptionResult? transcription = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.OpenAi.Whisper.V2,
    ResponseFormat = AudioTranscriptionResponseFormats.Text
});

if (transcription is not null)
{
    Console.WriteLine(transcription.Text);
}
```

### Transcription with Different Providers
```csharp
// OpenAI Whisper
TranscriptionResult? openAiResult = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.OpenAi.Whisper.V2,
    ResponseFormat = AudioTranscriptionResponseFormats.Text
});

// Mistral Voxtral
TranscriptionResult? mistralResult = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.Mistral.Free.VoxtralMini2507,
    ResponseFormat = AudioTranscriptionResponseFormats.Text
});

// Groq Whisper
TranscriptionResult? groqResult = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.Groq.OpenAi.WhisperV3Turbo,
    ResponseFormat = AudioTranscriptionResponseFormats.VerboseJson
});
```

### JSON Response Format
Get structured transcription data:
```csharp
byte[] audioData = await File.ReadAllBytesAsync("sample.wav");

TranscriptionResult? transcription = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.OpenAi.Whisper.V2,
    ResponseFormat = AudioTranscriptionResponseFormats.Json
});

if (transcription is not null)
{
    Console.WriteLine($"Text: {transcription.Text}");
    Console.WriteLine($"Language: {transcription.Language}");
    Console.WriteLine($"Duration: {transcription.Duration}");
}
```

### Text-to-Speech Synthesis
Generate speech from text:
csharp
SpeechTtsResult? result = await api.Audio.CreateSpeech(new SpeechRequest
{
Input = "Hi, how are you?",
Model = AudioModel.OpenAi.Gpt4.Gpt4OMiniTts,
ResponseFormat = SpeechResponseFormat.Mp3,
Voice = SpeechVoice.Alloy,
Instructions = "You are a very sad, tired person."
});
if (result is not null)
{
await result.SaveAndDispose("output.mp3");
}Advanced Usage
### Streaming Transcription
Stream transcription results in real-time:
```csharp
byte[] audioData = await File.ReadAllBytesAsync("audio.wav");

await api.Audio.StreamTranscriptionRich(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.OpenAi.Gpt4.Gpt4OTranscribe,
    ResponseFormat = AudioTranscriptionResponseFormats.Text
}, new TranscriptionStreamEventHandler
{
    ChunkHandler = (chunk) =>
    {
        Console.Write(chunk);
        return ValueTask.CompletedTask;
    },
    BlockHandler = (block) =>
    {
        Console.WriteLine();
        return ValueTask.CompletedTask;
    }
});
```

### Transcription with Timestamps
Get detailed timing information:
```csharp
byte[] audioData = await File.ReadAllBytesAsync("audio.mp3");

TranscriptionResult? transcription = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Mp3),
    Model = AudioModel.OpenAi.Whisper.V2,
    ResponseFormat = AudioTranscriptionResponseFormats.VerboseJson,
    TimestampGranularities = [
        TimestampGranularities.Segment,
        TimestampGranularities.Word
    ]
});

if (transcription is not null)
{
    Console.WriteLine("Transcript");
    Console.WriteLine("--------------------------");
    Console.WriteLine(transcription.Text);
    Console.WriteLine();

    Console.WriteLine("Segments");
    Console.WriteLine("--------------------------");
    foreach (TranscriptionSegment segment in transcription.Segments)
    {
        Console.WriteLine($"[{segment.Start:F2}s - {segment.End:F2}s]: {segment.Text}");
    }
    Console.WriteLine();

    Console.WriteLine("Words");
    Console.WriteLine("--------------------------");
    foreach (TranscriptionWord word in transcription.Words)
    {
        Console.WriteLine($"[{word.Start:F2}s]: {word.Word}");
    }
}
```

### Transcription with Log Probabilities
Get confidence scores for transcription:
```csharp
byte[] audioData = await File.ReadAllBytesAsync("audio.wav");

TranscriptionResult? transcription = await api.Audio.CreateTranscription(new TranscriptionRequest
{
    File = new AudioFile(audioData, AudioFileTypes.Wav),
    Model = AudioModel.OpenAi.Gpt4.Gpt4OTranscribe,
    ResponseFormat = AudioTranscriptionResponseFormats.Json,
    Include = [ TranscriptionRequestIncludeItems.Logprobs ]
});

if (transcription is not null)
{
    Console.WriteLine("Transcript");
    Console.WriteLine("--------------------------");
    Console.WriteLine(transcription.Text);
    Console.WriteLine();

    Console.WriteLine("Logprobs (Confidence Scores)");
    Console.WriteLine("--------------------------");

    if (transcription.Logprobs is not null)
    {
        foreach (TranscriptionLogprob logprob in transcription.Logprobs)
        {
            // Note: these are log probabilities, not raw probabilities
            Console.WriteLine($"Token: {logprob.Token}, Logprob: {logprob.Logprob:F4}");
        }
    }
}
```

### SRT Subtitle Generation
Generate SRT subtitles for videos:
csharp
byte[] audioData = await File.ReadAllBytesAsync("video_audio.wav");
TranscriptionResult? transcription = await api.Audio.CreateTranscription(new TranscriptionRequest
{
File = new AudioFile(audioData, AudioFileTypes.Wav),
Model = AudioModel.OpenAi.Whisper.V2,
ResponseFormat = AudioTranscriptionResponseFormats.Srt
});
if (transcription is not null)
{
// Save SRT file
await File.WriteAllTextAsync("subtitles.srt", transcription.Text);
Console.WriteLine("Subtitles saved to subtitles.srt");
}Customized TTS
Create expressive speech with custom instructions:
```csharp
SpeechTtsResult? result = await api.Audio.CreateSpeech(new SpeechRequest
{
    Input = "Welcome to our service. We're glad to have you here!",
    Model = AudioModel.OpenAi.Gpt4.Gpt4OMiniTts,
    ResponseFormat = SpeechResponseFormat.Mp3,
    Voice = SpeechVoice.Nova,
    Instructions = "You are an enthusiastic, welcoming host. Speak with energy and warmth.",
    Speed = 1.0 // Normal speed
});

if (result is not null)
{
    await result.SaveAndDispose("welcome.mp3");
}
```

## Best Practices
1. **Choose Appropriate Audio Format**
   - Use WAV for the highest transcription quality
   - Use MP3 for smaller file sizes and web delivery
   - Ensure audio quality is sufficient for accurate transcription
2. **Handle Large Audio Files**
   - Split very long audio files into chunks
   - Consider streaming for real-time transcription
   - Monitor API rate limits and quotas
3. **Optimize Transcription Accuracy**
   - Use high-quality audio recordings
   - Minimize background noise
   - Use appropriate models for your language/domain
4. **Manage TTS Resources**
   - Dispose of audio streams properly
   - Cache frequently used audio when appropriate
   - Consider audio compression for storage
5. **Error Handling**
   - Validate audio file format and size
   - Handle network errors gracefully
   - Implement retry logic for transient failures (see the sketch after this list)
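A minimal retry sketch with exponential backoff, wrapping the `CreateTranscription` call shown throughout this guide. It retries on any exception; production code should inspect the failure before retrying:

```csharp
// Retry a transcription with exponential backoff (2s, 4s, 8s, ...).
async Task<TranscriptionResult?> TranscribeWithRetry(TornadoApi api, TranscriptionRequest request, int maxAttempts = 3)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await api.Audio.CreateTranscription(request);
        }
        catch (Exception) when (attempt < maxAttempts)
        {
            // Transient failure: wait, then try again
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
        }
    }
}
```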
## Common Issues

### Poor Transcription Accuracy
- Issue: Inaccurate or garbled transcription
- Solution: Improve audio quality, reduce background noise
- Prevention: Use high-quality recording equipment
### Unsupported Audio Format
- Issue: API rejects audio file format
- Solution: Convert to supported format (WAV, MP3, etc.)
- Prevention: Validate format before sending
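A minimal validation sketch: map the file extension to an `AudioFileTypes` value before uploading. Members other than `Wav` and `Mp3` (the ones used elsewhere in this guide) are assumptions here; check the enum for the actual list:

```csharp
// Resolve an extension to an AudioFileTypes value, or null if unsupported.
static AudioFileTypes? ResolveAudioType(string path) =>
    Path.GetExtension(path).ToLowerInvariant() switch
    {
        ".wav"  => AudioFileTypes.Wav,
        ".mp3"  => AudioFileTypes.Mp3,
        ".mp4"  => AudioFileTypes.Mp4,  // assumed member
        ".webm" => AudioFileTypes.Webm, // assumed member
        _ => null // unsupported: convert before sending
    };

AudioFileTypes? type = ResolveAudioType("input.webm");
if (type is null)
{
    Console.WriteLine("Unsupported format - convert to WAV or MP3 first.");
}
```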
### Timeout on Large Files
- Issue: Transcription times out for long audio
- Solution: Split into smaller chunks or use streaming
- Prevention: Implement chunking for files over 10 minutes
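One way to chunk is to segment the file externally and transcribe each piece. A minimal sketch, assuming ffmpeg is installed and on PATH (its segment muxer splits without re-encoding):

```csharp
using System.Diagnostics;

// Split the recording into 10-minute chunks: chunk000.wav, chunk001.wav, ...
Process? ffmpeg = Process.Start("ffmpeg",
    "-i long_recording.wav -f segment -segment_time 600 -c copy chunk%03d.wav");
if (ffmpeg is not null)
{
    await ffmpeg.WaitForExitAsync();
}

// Transcribe the chunks in order and stitch the text together.
string[] chunks = Directory.GetFiles(".", "chunk*.wav");
Array.Sort(chunks);

foreach (string chunk in chunks)
{
    byte[] data = await File.ReadAllBytesAsync(chunk);

    TranscriptionResult? part = await api.Audio.CreateTranscription(new TranscriptionRequest
    {
        File = new AudioFile(data, AudioFileTypes.Wav),
        Model = AudioModel.OpenAi.Whisper.V2,
        ResponseFormat = AudioTranscriptionResponseFormats.Text
    });

    Console.WriteLine(part?.Text);
}
```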
### Rate Limiting
- Issue: Too many audio requests
- Solution: Implement request queuing and rate limiting
- Prevention: Monitor usage and implement backoff strategies
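A minimal client-side throttling sketch: cap concurrent audio requests with a `SemaphoreSlim` (the limit of 2 is arbitrary; tune it to your provider's quota):

```csharp
// Allow at most 2 transcription requests in flight at once.
SemaphoreSlim throttle = new SemaphoreSlim(2);

async Task<TranscriptionResult?> TranscribeThrottled(TornadoApi api, TranscriptionRequest request)
{
    await throttle.WaitAsync();
    try
    {
        return await api.Audio.CreateTranscription(request);
    }
    finally
    {
        throttle.Release();
    }
}
```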
## API Reference

### TranscriptionRequest

- `AudioFile File` - Audio file to transcribe
- `AudioModel Model` - Transcription model to use
- `AudioTranscriptionResponseFormats ResponseFormat` - Output format
- `TimestampGranularities[] TimestampGranularities` - Timestamp detail level
- `TranscriptionRequestIncludeItems[] Include` - Additional data to include
### TranscriptionResult

- `string Text` - Transcribed text
- `string Language` - Detected language
- `double Duration` - Audio duration in seconds
- `List<TranscriptionSegment> Segments` - Time-stamped segments
- `List<TranscriptionWord> Words` - Individual words with timestamps
- `List<TranscriptionLogprob> Logprobs` - Confidence scores (log probabilities)
### SpeechRequest

- `string Input` - Text to convert to speech
- `AudioModel Model` - TTS model to use
- `SpeechResponseFormat ResponseFormat` - Audio output format
- `SpeechVoice Voice` - Voice to use
- `string Instructions` - Style/emotion instructions
- `double Speed` - Playback speed (0.25 to 4.0)
### AudioModel

- `OpenAi.Whisper.V2` - OpenAI Whisper model
- `OpenAi.Gpt4.Gpt4OTranscribe` - GPT-4o transcription
- `OpenAi.Gpt4.Gpt4OMiniTts` - GPT-4o Mini TTS
- `Mistral.Free.VoxtralMini2507` - Mistral Voxtral
- `Groq.OpenAi.WhisperV3Turbo` - Groq Whisper Turbo
## Related Topics
- Chat Basics - Core chat functionality
- Vision - Image understanding
- Files - File management
- Agents - Building AI agents