Vision Basics
Overview
Vision capabilities in LlmTornado allow AI models to understand and analyze images. You can send images to AI models and ask questions about them, extract information, describe content, or perform visual reasoning tasks. This multimodal capability enables building applications that can understand both text and images together.
Quick Start
Here's a simple vision example:
```csharp
using LlmTornado;
using LlmTornado.Chat;
using LlmTornado.Chat.Models;

TornadoApi api = new TornadoApi("your-api-key");

// Analyze an image from a URL
ChatResult? result = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(new Uri("https://example.com/image.jpg")),
        new ChatMessagePart("What is in this image?")
    ])
], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 256);

Console.WriteLine(result?.Choices?[0].Message?.Content);
```

Prerequisites
Before using vision capabilities, ensure you have:
- The LlmTornado package installed
- A valid API key with vision model access
- Understanding of Chat Messages
- Images accessible via URL or as local files
Detailed Explanation
Vision-Enabled Models
Several models support vision capabilities:
OpenAI
- GPT-4 Vision - Original vision model
- GPT-4o - Multimodal "omni" model
- GPT-4 Turbo with Vision - Fast vision processing
Google Gemini
- Gemini Pro Vision - Google's vision model
- Gemini 1.5 Pro - Advanced multimodal understanding
- Gemini 2.0 Flash - Fast vision processing
Anthropic
- Claude 3 Opus - Highest capability vision
- Claude 3 Sonnet - Balanced performance
- Claude 3 Haiku - Fast and efficient
Image Input Formats
Images can be provided in two ways:
URL-Based Images
Direct links to publicly accessible images:
```csharp
new ChatMessagePart(new Uri("https://example.com/image.jpg"))
```

Base64-Encoded Images
Local files or binary data encoded as base64:
```csharp
byte[] bytes = await File.ReadAllBytesAsync("image.jpg");
string base64 = $"data:image/jpeg;base64,{Convert.ToBase64String(bytes)}";
new ChatMessagePart(base64, ImageDetail.Auto)
```

Image Detail Levels
Control processing detail and cost:
- Auto - Model automatically chooses detail level
- Low - Faster, less detailed analysis
- High - Slower, more detailed analysis
Basic Usage
Analyze Image from URL
```csharp
ChatResult? result = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(new Uri("https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcSGfpQ3m-QWiXgCBJJbrcUFdNdWAhj7rcUqjeNUC6eKcXZDAtWm")),
        new ChatMessagePart("What is in this image?")
    ])
], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 256);

Console.WriteLine(result?.Choices?[0].Message?.Content);
```

Analyze Local Image File
```csharp
byte[] bytes = await File.ReadAllBytesAsync("path/to/image.jpg");
string base64 = $"data:image/jpeg;base64,{Convert.ToBase64String(bytes)}";

ChatResult? result = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(base64, ImageDetail.Auto),
        new ChatMessagePart("What is in this image?")
    ])
], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 256);

Console.WriteLine(result?.Choices?[0].Message?.Content);
```

Multiple Images in One Request
Analyze multiple images together:
```csharp
ChatResult? result = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(new Uri("https://example.com/image1.jpg")),
        new ChatMessagePart(new Uri("https://example.com/image2.jpg")),
        new ChatMessagePart("Compare these two images and describe the differences.")
    ])
], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 512);

Console.WriteLine(result?.Choices?[0].Message?.Content);
```

Using Different Models
```csharp
Uri imageUrl = new Uri("https://example.com/image.jpg");

// OpenAI GPT-4o
ChatResult? gpt4o = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(imageUrl),
        new ChatMessagePart("Describe this image in detail.")
    ])
], ChatModel.OpenAi.Gpt4.O, maxTokens: 500);

// Google Gemini
ChatResult? gemini = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(imageUrl),
        new ChatMessagePart("Describe this image in detail.")
    ])
], ChatModel.Google.Gemini.Gemini15Pro, maxTokens: 500);

// Anthropic Claude
ChatResult? claude = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(imageUrl),
        new ChatMessagePart("Describe this image in detail.")
    ])
], ChatModel.Anthropic.Claude3Opus, maxTokens: 500);
```

Advanced Usage
Image Detail Control
Optimize cost and quality by controlling detail level:
```csharp
// High detail for complex images
byte[] bytes = await File.ReadAllBytesAsync("complex_diagram.jpg");
string base64 = $"data:image/jpeg;base64,{Convert.ToBase64String(bytes)}";

ChatResult? result = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(base64, ImageDetail.High),
        new ChatMessagePart("Extract all text and details from this diagram.")
    ])
], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 1000);
```

Multi-Turn Vision Conversation
Build conversations with image context:
```csharp
Conversation conversation = api.Chat.CreateConversation(new ChatRequest
{
    Model = ChatModel.OpenAi.Gpt4.O,
    MaxTokens = 500
});

// First turn - introduce the image
byte[] imageBytes = await File.ReadAllBytesAsync("photo.jpg");
string base64Image = $"data:image/jpeg;base64,{Convert.ToBase64String(imageBytes)}";

conversation.Messages.Add(new ChatMessage(ChatMessageRoles.User, [
    new ChatMessagePart(base64Image, ImageDetail.Auto),
    new ChatMessagePart("What objects can you see in this image?")
]));

ChatRichResponse response1 = await conversation.GetResponseRich();
Console.WriteLine($"AI: {response1.Content}");

// Second turn - ask a follow-up without re-sending the image
conversation.AddUserMessage("Can you count how many people are in the image?");
ChatRichResponse response2 = await conversation.GetResponseRich();
Console.WriteLine($"AI: {response2.Content}");

// Third turn - more specific questions
conversation.AddUserMessage("What are they doing?");
ChatRichResponse response3 = await conversation.GetResponseRich();
Console.WriteLine($"AI: {response3.Content}");
```

OCR and Text Extraction
Extract text from images:
```csharp
byte[] documentBytes = await File.ReadAllBytesAsync("document.png");
string base64 = $"data:image/png;base64,{Convert.ToBase64String(documentBytes)}";

ChatResult? result = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.System,
        "You are an OCR assistant. Extract all visible text from images accurately."),
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(base64, ImageDetail.High),
        new ChatMessagePart("Extract all text from this document.")
    ])
], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 1000);

Console.WriteLine(result?.Choices?[0].Message?.Content);
```

Image Classification
Classify or categorize images:
```csharp
Uri imageUrl = new Uri("https://example.com/image.jpg");
string[] categories = ["landscape", "portrait", "food", "technology", "nature"];

ChatResult? result = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.System,
        $"Classify the image into one of these categories: {string.Join(", ", categories)}"),
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(imageUrl),
        new ChatMessagePart("What category does this image belong to?")
    ])
], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 50);

Console.WriteLine($"Category: {result?.Choices?[0].Message?.Content}");
```

Visual Question Answering
Answer specific questions about images:
```csharp
byte[] imageBytes = await File.ReadAllBytesAsync("scene.jpg");
string base64 = $"data:image/jpeg;base64,{Convert.ToBase64String(imageBytes)}";

string[] questions = [
    "How many people are in this image?",
    "What is the weather like?",
    "What time of day does this appear to be?",
    "What activities are people doing?"
];

foreach (string question in questions)
{
    ChatResult? result = await api.Chat.CreateChatCompletion([
        new ChatMessage(ChatMessageRoles.User, [
            new ChatMessagePart(base64, ImageDetail.Auto),
            new ChatMessagePart(question)
        ])
    ], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 100);

    Console.WriteLine($"Q: {question}");
    Console.WriteLine($"A: {result?.Choices?[0].Message?.Content}");
    Console.WriteLine();
}
```

Structured Data Extraction
Extract structured information from images:
```csharp
byte[] receiptBytes = await File.ReadAllBytesAsync("receipt.jpg");
string base64 = $"data:image/jpeg;base64,{Convert.ToBase64String(receiptBytes)}";

ChatResult? result = await api.Chat.CreateChatCompletion(new ChatRequest
{
    Model = ChatModel.OpenAi.Gpt4.VisionPreview,
    ResponseFormat = ChatRequestResponseFormats.Json,
    MaxTokens = 500,
    Messages = [
        new ChatMessage(ChatMessageRoles.System,
            "Extract receipt information and return as JSON with fields: merchant, date, total, items."),
        new ChatMessage(ChatMessageRoles.User, [
            new ChatMessagePart(base64, ImageDetail.High),
            new ChatMessagePart("Extract the receipt data.")
        ])
    ]
});

Console.WriteLine(result?.Choices?[0].Message?.Content);
// Output: {"merchant": "...", "date": "...", "total": "...", "items": [...]}
```

Best Practices
1. Optimize Image Size
- Resize large images before encoding
- Use appropriate image formats (JPEG for photos, PNG for screenshots)
- Compress images while maintaining quality
- Consider image detail level vs cost tradeoff
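As one way to apply the points above, here is a sketch of downscaling and re-encoding an image before base64-encoding it. It assumes the third-party SixLabors.ImageSharp package (not part of LlmTornado), and the 1024 px cap is an illustrative choice, not a provider limit:

```csharp
// Sketch: shrink and re-encode an image before sending it to a vision model.
// Assumes the SixLabors.ImageSharp NuGet package; swap in your preferred
// imaging library if you use another.
using SixLabors.ImageSharp;
using SixLabors.ImageSharp.Processing;

using Image image = await Image.LoadAsync("large_photo.jpg");

// Cap the longest side at 1024 px while preserving aspect ratio.
if (image.Width > 1024 || image.Height > 1024)
{
    image.Mutate(x => x.Resize(new ResizeOptions
    {
        Mode = ResizeMode.Max,
        Size = new Size(1024, 1024)
    }));
}

// Re-encode as JPEG and build the data URI expected by ChatMessagePart.
using MemoryStream ms = new MemoryStream();
await image.SaveAsJpegAsync(ms);
string base64 = $"data:image/jpeg;base64,{Convert.ToBase64String(ms.ToArray())}";
```

Smaller inputs encode faster, cost fewer tokens at high detail, and stay under provider size limits.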
2. Write Effective Prompts
- Be specific about what you want to know
- Ask one clear question per request
- Provide context in system messages
- Use structured output formats when needed
3. Handle Multiple Images
- Group related images in single requests when appropriate
- Reference specific images when asking questions
- Consider context window limits
- Balance detail level with number of images
4. Manage Costs
- Use Low detail for simple tasks
- Use High detail only when necessary
- Cache vision results when appropriate
- Batch similar image analyses
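The caching point above can be sketched as simple in-memory memoization keyed by an image hash plus the question, so repeated analyses of the same image are served without another API call. `AnalyzeCachedAsync` is a hypothetical helper name; adjust the model and detail level to your task:

```csharp
// Sketch: memoize vision results by SHA-256 of the image bytes + question.
// Illustrative only - use a real cache (with eviction) in production.
using System.Security.Cryptography;

Dictionary<string, string> visionCache = new();

async Task<string?> AnalyzeCachedAsync(TornadoApi api, byte[] imageBytes, string question)
{
    string key = Convert.ToHexString(SHA256.HashData(imageBytes)) + "|" + question;
    if (visionCache.TryGetValue(key, out string? cached))
    {
        return cached; // served from memory, no API cost
    }

    string base64 = $"data:image/jpeg;base64,{Convert.ToBase64String(imageBytes)}";
    ChatResult? result = await api.Chat.CreateChatCompletion([
        new ChatMessage(ChatMessageRoles.User, [
            new ChatMessagePart(base64, ImageDetail.Low), // Low detail keeps cost down
            new ChatMessagePart(question)
        ])
    ], ChatModel.OpenAi.Gpt4.O, maxTokens: 256);

    string? answer = result?.Choices?[0].Message?.Content;
    if (answer is not null)
    {
        visionCache[key] = answer;
    }
    return answer;
}
```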
5. Error Handling
- Validate image URLs are accessible
- Check image file sizes and formats
- Handle vision model unavailability
- Implement fallbacks for unsupported images
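A minimal sketch of the URL-validation point, using an HTTP `HEAD` request before handing the URL to the model. `IsImageUrlAccessibleAsync` is a hypothetical helper; note that some servers reject `HEAD`, so a fallback `GET` may be needed:

```csharp
// Sketch: check that an image URL is reachable and served with an image/*
// content type before including it in a vision request.
using HttpClient http = new HttpClient();

async Task<bool> IsImageUrlAccessibleAsync(Uri url)
{
    try
    {
        using HttpRequestMessage request = new(HttpMethod.Head, url);
        using HttpResponseMessage response = await http.SendAsync(request);
        return response.IsSuccessStatusCode
            && response.Content.Headers.ContentType?.MediaType?.StartsWith("image/") == true;
    }
    catch (HttpRequestException)
    {
        return false; // DNS failure, unreachable host, TLS error, etc.
    }
}
```

Failing fast here gives a clearer error than letting the provider reject an inaccessible URL mid-request.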
Common Issues
Image Not Accessible
- Issue: Model cannot access image URL
- Solution: Ensure URL is publicly accessible or use base64
- Prevention: Test URLs before sending to API
Image Too Large
- Issue: Image file size exceeds limits
- Solution: Resize or compress image before sending
- Prevention: Check and limit image dimensions
Poor Recognition Quality
- Issue: Model doesn't recognize image content well
- Solution: Use higher detail level or better quality image
- Prevention: Test with clear, high-quality images
Token Limit Exceeded
- Issue: Response truncated due to token limit
- Solution: Increase maxTokens parameter
- Prevention: Set appropriate token limits for task
API Reference
ChatMessagePart (Image)
- `ChatMessagePart(Uri imageUrl)` - Create from URL
- `ChatMessagePart(string base64, ImageDetail detail)` - Create from base64
- `ImageUrl` - Image URL or base64 string
- `Detail` - Processing detail level
ImageDetail
- `Auto` - Automatic detail selection
- `Low` - Low detail processing
- `High` - High detail processing
Vision-Capable Models
- `ChatModel.OpenAi.Gpt4.VisionPreview` - GPT-4 Vision
- `ChatModel.OpenAi.Gpt4.O` - GPT-4o multimodal
- `ChatModel.OpenAi.Gpt4.Turbo` - GPT-4 Turbo with vision
- `ChatModel.Google.Gemini.Gemini15Pro` - Gemini 1.5 Pro
- `ChatModel.Anthropic.Claude3Opus` - Claude 3 Opus
- `ChatModel.Anthropic.Claude3Sonnet` - Claude 3 Sonnet
Related Topics
- Chat Messages - Multipart message structure
- Chat Basics - Core chat functionality
- Images - Image generation
- Files - File management