Vision Basics

Overview

Vision capabilities in LlmTornado allow AI models to understand and analyze images. You can send images to AI models and ask questions about them, extract information, describe content, or perform visual reasoning tasks. This multimodal capability enables building applications that can understand both text and images together.

Quick Start

Here's a simple vision example:

```csharp
using LlmTornado;
using LlmTornado.Chat;
using LlmTornado.Chat.Models;

TornadoApi api = new TornadoApi("your-api-key");

// Analyze an image from URL
ChatResult? result = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(new Uri("https://example.com/image.jpg")),
        new ChatMessagePart("What is in this image?")
    ])
], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 256);

Console.WriteLine(result?.Choices?[0].Message?.Content);
```

Prerequisites

Before using vision capabilities, ensure you have:

  1. The LlmTornado package installed
  2. A valid API key with vision model access
  3. Understanding of Chat Messages
  4. Images accessible via URL or as local files

Detailed Explanation

Vision-Enabled Models

Several models support vision capabilities:

OpenAI

  • GPT-4 Vision - Original vision model
  • GPT-4o - Optimized multimodal model
  • GPT-4 Turbo with Vision - Fast vision processing

Google Gemini

  • Gemini Pro Vision - Google's vision model
  • Gemini 1.5 Pro - Advanced multimodal understanding
  • Gemini 2.0 Flash - Fast vision processing

Anthropic

  • Claude 3 Opus - Highest capability vision
  • Claude 3 Sonnet - Balanced performance
  • Claude 3 Haiku - Fast and efficient

Image Input Formats

Images can be provided in two ways:

URL-Based Images

Direct links to publicly accessible images:

```csharp
new ChatMessagePart(new Uri("https://example.com/image.jpg"))
```

Base64-Encoded Images

Local files or binary data encoded as base64:

```csharp
byte[] bytes = await File.ReadAllBytesAsync("image.jpg");
string base64 = $"data:image/jpeg;base64,{Convert.ToBase64String(bytes)}";
new ChatMessagePart(base64, ImageDetail.Auto)
```

Image Detail Levels

Control processing detail and cost:

  • Auto - Model automatically chooses detail level
  • Low - Faster, less detailed analysis
  • High - Slower, more detailed analysis
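
For quick, inexpensive checks the Low level is often sufficient. A minimal sketch (the file path and token budget are placeholders):

```csharp
byte[] bytes = await File.ReadAllBytesAsync("photo.jpg");
string base64 = $"data:image/jpeg;base64,{Convert.ToBase64String(bytes)}";

// Low detail trades accuracy for speed and a smaller token cost
ChatResult? result = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(base64, ImageDetail.Low),
        new ChatMessagePart("Briefly, what does this image show?")
    ])
], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 64);
```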

Basic Usage

Analyze Image from URL

```csharp
ChatResult? result = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(new Uri("https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcSGfpQ3m-QWiXgCBJJbrcUFdNdWAhj7rcUqjeNUC6eKcXZDAtWm")),
        new ChatMessagePart("What is in this image?")
    ])
], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 256);

Console.WriteLine(result?.Choices?[0].Message?.Content);
```

Analyze Local Image File

```csharp
byte[] bytes = await File.ReadAllBytesAsync("path/to/image.jpg");
string base64 = $"data:image/jpeg;base64,{Convert.ToBase64String(bytes)}";

ChatResult? result = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(base64, ImageDetail.Auto),
        new ChatMessagePart("What is in this image?")
    ])
], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 256);

Console.WriteLine(result?.Choices?[0].Message?.Content);
```

Multiple Images in One Request

Analyze multiple images together:

```csharp
ChatResult? result = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(new Uri("https://example.com/image1.jpg")),
        new ChatMessagePart(new Uri("https://example.com/image2.jpg")),
        new ChatMessagePart("Compare these two images and describe the differences.")
    ])
], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 512);

Console.WriteLine(result?.Choices?[0].Message?.Content);
```

Using Different Models

```csharp
// A shared example image for all three providers
Uri imageUrl = new Uri("https://example.com/image.jpg");

// OpenAI GPT-4o
ChatResult? gpt4o = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(imageUrl),
        new ChatMessagePart("Describe this image in detail.")
    ])
], ChatModel.OpenAi.Gpt4.O, maxTokens: 500);

// Google Gemini
ChatResult? gemini = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(imageUrl),
        new ChatMessagePart("Describe this image in detail.")
    ])
], ChatModel.Google.Gemini.Gemini15Pro, maxTokens: 500);

// Anthropic Claude
ChatResult? claude = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(imageUrl),
        new ChatMessagePart("Describe this image in detail.")
    ])
], ChatModel.Anthropic.Claude3Opus, maxTokens: 500);
```

Advanced Usage

Image Detail Control

Optimize cost and quality by controlling detail level:

```csharp
// High detail for complex images
byte[] bytes = await File.ReadAllBytesAsync("complex_diagram.jpg");
string base64 = $"data:image/jpeg;base64,{Convert.ToBase64String(bytes)}";

ChatResult? result = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(base64, ImageDetail.High),
        new ChatMessagePart("Extract all text and details from this diagram.")
    ])
], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 1000);
```

Multi-Turn Vision Conversation

Build conversations with image context:

```csharp
Conversation conversation = api.Chat.CreateConversation(new ChatRequest
{
    Model = ChatModel.OpenAi.Gpt4.O,
    MaxTokens = 500
});

// First turn - introduce the image
byte[] imageBytes = await File.ReadAllBytesAsync("photo.jpg");
string base64Image = $"data:image/jpeg;base64,{Convert.ToBase64String(imageBytes)}";

conversation.Messages.Add(new ChatMessage(ChatMessageRoles.User, [
    new ChatMessagePart(base64Image, ImageDetail.Auto),
    new ChatMessagePart("What objects can you see in this image?")
]));

ChatRichResponse response1 = await conversation.GetResponseRich();
Console.WriteLine($"AI: {response1.Content}");

// Second turn - ask follow-up without re-sending image
conversation.AddUserMessage("Can you count how many people are in the image?");
ChatRichResponse response2 = await conversation.GetResponseRich();
Console.WriteLine($"AI: {response2.Content}");

// Third turn - more specific questions
conversation.AddUserMessage("What are they doing?");
ChatRichResponse response3 = await conversation.GetResponseRich();
Console.WriteLine($"AI: {response3.Content}");
```

OCR and Text Extraction

Extract text from images:

```csharp
byte[] documentBytes = await File.ReadAllBytesAsync("document.png");
string base64 = $"data:image/png;base64,{Convert.ToBase64String(documentBytes)}";

ChatResult? result = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.System,
        "You are an OCR assistant. Extract all visible text from images accurately."),
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(base64, ImageDetail.High),
        new ChatMessagePart("Extract all text from this document.")
    ])
], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 1000);

Console.WriteLine(result?.Choices?[0].Message?.Content);
```

Image Classification

Classify or categorize images:

```csharp
Uri imageUrl = new Uri("https://example.com/image.jpg");
string[] categories = ["landscape", "portrait", "food", "technology", "nature"];

ChatResult? result = await api.Chat.CreateChatCompletion([
    new ChatMessage(ChatMessageRoles.System,
        $"Classify the image into one of these categories: {string.Join(", ", categories)}"),
    new ChatMessage(ChatMessageRoles.User, [
        new ChatMessagePart(imageUrl),
        new ChatMessagePart("What category does this image belong to?")
    ])
], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 50);

Console.WriteLine($"Category: {result?.Choices?[0].Message?.Content}");
```

Visual Question Answering

Answer specific questions about images:

```csharp
byte[] imageBytes = await File.ReadAllBytesAsync("scene.jpg");
string base64 = $"data:image/jpeg;base64,{Convert.ToBase64String(imageBytes)}";

string[] questions = [
    "How many people are in this image?",
    "What is the weather like?",
    "What time of day does this appear to be?",
    "What activities are people doing?"
];

foreach (string question in questions)
{
    ChatResult? result = await api.Chat.CreateChatCompletion([
        new ChatMessage(ChatMessageRoles.User, [
            new ChatMessagePart(base64, ImageDetail.Auto),
            new ChatMessagePart(question)
        ])
    ], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 100);

    Console.WriteLine($"Q: {question}");
    Console.WriteLine($"A: {result?.Choices?[0].Message?.Content}");
    Console.WriteLine();
}
```

Structured Data Extraction

Extract structured information from images:

```csharp
byte[] receiptBytes = await File.ReadAllBytesAsync("receipt.jpg");
string base64 = $"data:image/jpeg;base64,{Convert.ToBase64String(receiptBytes)}";

ChatResult? result = await api.Chat.CreateChatCompletion(new ChatRequest
{
    Model = ChatModel.OpenAi.Gpt4.VisionPreview,
    ResponseFormat = ChatRequestResponseFormats.Json,
    MaxTokens = 500,
    Messages = [
        new ChatMessage(ChatMessageRoles.System,
            "Extract receipt information and return as JSON with fields: merchant, date, total, items."),
        new ChatMessage(ChatMessageRoles.User, [
            new ChatMessagePart(base64, ImageDetail.High),
            new ChatMessagePart("Extract the receipt data.")
        ])
    ]
});

Console.WriteLine(result?.Choices?[0].Message?.Content);
// Output: {"merchant": "...", "date": "...", "total": "...", "items": [...]}
```

Best Practices

1. Optimize Image Size

  • Resize large images before encoding
  • Use appropriate image formats (JPEG for photos, PNG for screenshots)
  • Compress images while maintaining quality
  • Consider image detail level vs cost tradeoff
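
The resize step above can be sketched with an imaging library. This example assumes the SixLabors.ImageSharp package (an assumption; any library with resize and JPEG encoding works):

```csharp
using SixLabors.ImageSharp;
using SixLabors.ImageSharp.Processing;

// Downscale to at most 1024px on the long edge and re-encode as JPEG
// before base64-encoding, to keep the request payload small.
using Image image = await Image.LoadAsync("large_photo.jpg");
image.Mutate(x => x.Resize(new ResizeOptions
{
    Mode = ResizeMode.Max,          // preserve aspect ratio
    Size = new Size(1024, 1024)
}));

using MemoryStream ms = new MemoryStream();
await image.SaveAsJpegAsync(ms);
string base64 = $"data:image/jpeg;base64,{Convert.ToBase64String(ms.ToArray())}";
```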

2. Write Effective Prompts

  • Be specific about what you want to know
  • Ask one clear question per request
  • Provide context in system messages
  • Use structured output formats when needed

3. Handle Multiple Images

  • Group related images in single requests when appropriate
  • Reference specific images when asking questions
  • Consider context window limits
  • Balance detail level with number of images

4. Manage Costs

  • Use Low detail for simple tasks
  • Use High detail only when necessary
  • Cache vision results when appropriate
  • Batch similar image analyses
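
Caching can be as simple as keying answers by an image hash plus the question. The dictionary cache and hashing scheme below are illustrative, not part of LlmTornado:

```csharp
using System.Collections.Concurrent;
using System.Security.Cryptography;

// Cache answers keyed by (image hash, question) so repeated analyses cost nothing
ConcurrentDictionary<string, string> visionCache = new();

byte[] bytes = await File.ReadAllBytesAsync("photo.jpg");
string key = Convert.ToHexString(SHA256.HashData(bytes)) + "|What is in this image?";

if (!visionCache.TryGetValue(key, out string? answer))
{
    string base64 = $"data:image/jpeg;base64,{Convert.ToBase64String(bytes)}";
    ChatResult? result = await api.Chat.CreateChatCompletion([
        new ChatMessage(ChatMessageRoles.User, [
            new ChatMessagePart(base64, ImageDetail.Low),
            new ChatMessagePart("What is in this image?")
        ])
    ], ChatModel.OpenAi.Gpt4.VisionPreview, maxTokens: 256);

    answer = result?.Choices?[0].Message?.Content ?? "";
    visionCache[key] = answer;
}

Console.WriteLine(answer);
```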

5. Error Handling

  • Validate image URLs are accessible
  • Check image file sizes and formats
  • Handle vision model unavailability
  • Implement fallbacks for unsupported images
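
A pre-flight check along these lines can catch inaccessible or oversized images before they reach the API. The size limit here is illustrative; check your provider's documented maximums:

```csharp
Uri imageUrl = new Uri("https://example.com/image.jpg");
const long MaxBytes = 20 * 1024 * 1024; // illustrative 20 MB cap

using HttpClient http = new HttpClient();
using HttpRequestMessage head = new HttpRequestMessage(HttpMethod.Head, imageUrl);
using HttpResponseMessage probe = await http.SendAsync(head);

if (!probe.IsSuccessStatusCode)
{
    // URL is not publicly reachable; fall back to base64 upload instead
    Console.WriteLine($"Image not accessible: {probe.StatusCode}");
}
else if (probe.Content.Headers.ContentLength > MaxBytes)
{
    Console.WriteLine("Image too large; resize or compress it first.");
}
```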

Common Issues

Image Not Accessible

  • Issue: Model cannot access image URL
  • Solution: Ensure URL is publicly accessible or use base64
  • Prevention: Test URLs before sending to API

Image Too Large

  • Issue: Image file size exceeds limits
  • Solution: Resize or compress image before sending
  • Prevention: Check and limit image dimensions

Poor Recognition Quality

  • Issue: Model doesn't recognize image content well
  • Solution: Use higher detail level or better quality image
  • Prevention: Test with clear, high-quality images

Token Limit Exceeded

  • Issue: Response truncated due to token limit
  • Solution: Increase maxTokens parameter
  • Prevention: Set appropriate token limits for task

API Reference

ChatMessagePart (Image)

  • ChatMessagePart(Uri imageUrl) - Create from URL
  • ChatMessagePart(string base64, ImageDetail detail) - Create from base64
  • ImageUrl - Image URL or base64 string
  • Detail - Processing detail level

ImageDetail

  • Auto - Automatic detail selection
  • Low - Low detail processing
  • High - High detail processing

Vision-Capable Models

  • ChatModel.OpenAi.Gpt4.VisionPreview - GPT-4 Vision
  • ChatModel.OpenAi.Gpt4.O - GPT-4o multimodal
  • ChatModel.OpenAi.Gpt4.Turbo - GPT-4 Turbo with vision
  • ChatModel.Google.Gemini.Gemini15Pro - Gemini 1.5 Pro
  • ChatModel.Anthropic.Claude3Opus - Claude 3 Opus
  • ChatModel.Anthropic.Claude3Sonnet - Claude 3 Sonnet