Introduction
I am diving into the vision capabilities of large language models (LLMs) to see whether they can reliably classify images, specifically by spotting the position of a door handle to tell whether the door is locked or unlocked. The experiment consists of basic tests measuring accuracy, speed, and token usage, offering an initial comparison across models.
Scenario
Imagine using a webcam to monitor door security, providing images of door handles in different lighting conditions (day and night). The system's goal is to classify the handle's position, vertical (locked) or horizontal (unlocked), and report the status in a parseable format such as JSON. This could be a valuable feature in home automation systems. A traditional machine learning model trained specifically for this task would likely perform better, but this experiment explores how far general-purpose LLMs can get without any task-specific training.
Approach
First, I took some pictures and fed them to the leading LLMs, with no code at all, through their web interfaces (Claude 3 Opus, OpenAI GPT-4), to see if they could accurately classify the door handle positions. Was this approach viable, or would it be a waste of time?
The initial results were encouraging, but I needed to verify if the models could consistently perform well. With a binary classifier, there’s a 50% chance of guessing correctly, so I wanted to ensure the accuracy was truly meaningful.
To make the outputs as deterministic as possible, I set the temperature to 0.0. To save on tokens and improve processing speed, I resized the images with the following ImageMagick command:
```bash
# Scale the image to fit within 200x200 pixels, preserving the aspect ratio
convert original_image.jpg -resize 200x200 resized.jpg
```
Next, I wrote a script to access Anthropic models, comparing the classification results to the actual positions indicated by the image filenames (v for vertical, h for horizontal).
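The core of the script is a single call to the Anthropic Messages API. I won't paste the whole thing, but it looks roughly like this minimal sketch (using the official @anthropic-ai/sdk package; the prompt wording and function name are illustrative, not the exact script):

```js
// Minimal sketch: classify one image with the Anthropic Messages API.
// Assumes ANTHROPIC_API_KEY is set in the environment.
import fs from "node:fs";
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

async function classifyDoorHandle(path) {
  const response = await anthropic.messages.create({
    model: "claude-3-haiku-20240307", // swap in a Sonnet model id to compare
    max_tokens: 50,
    temperature: 0.0, // aiming for deterministic answers
    messages: [{
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/jpeg",
            data: fs.readFileSync(path).toString("base64"),
          },
        },
        {
          type: "text",
          text: 'Is the door handle vertical or horizontal? Answer only with JSON: {"position": "vertical"} or {"position": "horizontal"}',
        },
      ],
    }],
  });
  // The reply is a single text block containing the JSON answer.
  return JSON.parse(response.content[0].text).position;
}

// The ground truth is encoded in the filename (v = vertical, h = horizontal).
console.log(await classifyDoorHandle("images/test01_v.jpg"));
```

From there it is just a matter of comparing the answer with the letter in the filename and accumulating token counts and timings.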
```
./locks_classifier.js -m Haiku -v
🤖 Haiku
images/test01_v.jpg ✅
📊 In: 202 tkn Out: 11 Time: 794 ms
images/test02_v.jpg ✅
📊 In: 202 tkn Out: 11 Time: 1073 ms
images/test03_h.jpg ❌
📊 In: 202 tkn Out: 11 Time: 604 ms
...
Correct Responses: (12 / 20) 60%
Total In Tokens: 3976
Total Out Tokens: 220
Avg Time: 598 ms
```
The results for Haiku were somewhat underwhelming, while Sonnet performed even worse, albeit with similar speed.
I experimented with few-shot examples embedded in the prompt, but this did not improve the results.
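For the record, the few-shot variant just prepends labeled example images to the same user message, roughly like this (a sketch; the example filenames are hypothetical, not files from my test set):

```js
import fs from "node:fs";
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Hypothetical helper: wrap a JPEG file as an Anthropic image content block.
const toImageBlock = (path) => ({
  type: "image",
  source: {
    type: "base64",
    media_type: "image/jpeg",
    data: fs.readFileSync(path).toString("base64"),
  },
});

// Few-shot message: labeled example images first, the image to classify last.
const response = await anthropic.messages.create({
  model: "claude-3-haiku-20240307",
  max_tokens: 50,
  temperature: 0.0,
  messages: [{
    role: "user",
    content: [
      toImageBlock("examples/locked_v.jpg"),
      { type: "text", text: 'Example answer: {"position": "vertical"}' },
      toImageBlock("examples/unlocked_h.jpg"),
      { type: "text", text: 'Example answer: {"position": "horizontal"}' },
      toImageBlock("images/test01_v.jpg"),
      { type: "text", text: "Classify this image. Answer only with JSON, as in the examples." },
    ],
  }],
});

console.log(response.content[0].text);
```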
Out of curiosity, I also tested OpenAI models, adapting my scripts to accommodate their slightly different APIs (it’s frustrating that there isn’t a standard yet, right?).
The results with OpenAI models were significantly better: although noticeably slower, they were much more accurate.
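The adaptation itself is small. The same classification, sketched against the official openai package (again, the prompt wording and function name are illustrative, not the exact script):

```js
// Minimal sketch: the same classification via the OpenAI Chat Completions API.
// Assumes OPENAI_API_KEY is set in the environment.
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

async function classifyDoorHandle(path) {
  // OpenAI expects images as URLs, so the file is embedded as a data URL.
  const dataUrl = `data:image/jpeg;base64,${fs.readFileSync(path).toString("base64")}`;
  const response = await openai.chat.completions.create({
    model: "gpt-4o", // or "gpt-4-turbo"
    max_tokens: 50,
    temperature: 0.0,
    messages: [{
      role: "user",
      content: [
        { type: "image_url", image_url: { url: dataUrl } },
        {
          type: "text",
          text: 'Is the door handle vertical or horizontal? Answer only with JSON: {"position": "vertical"} or {"position": "horizontal"}',
        },
      ],
    }],
  });
  return JSON.parse(response.choices[0].message.content).position;
}

console.log(await classifyDoorHandle("images/test01_v.jpg"));
```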
GPT-4-Turbo:
```
./locks_classifier.js -m GPT4 -v
Responses: (16 / 20) 80%
In Tokens: 6360 Out Tokens: 240
Avg Time: 2246 ms
```
The just-released GPT-4o:
```
./locks_classifier.js -m GPT4o -v
Responses: (20 / 20) 100%
In Tokens: 6340 Out Tokens: 232
Avg Time: 1751 ms
```
What I learnt
1) LLM Performance: I was curious to see how the models would perform, and I am quite impressed by GPT-4o. It delivered high accuracy and reasonable speed. On the other hand, Haiku’s performance was somewhat disappointing, although its lower cost and faster response time make it appealing for many applications. There’s definitely potential to explore Haiku further.
2) Temperature 0.0: I was surprised to see varying responses even with the temperature set to 0.0, which should in theory produce consistent results; apparently other factors influence the outputs. Any ideas on why this might be happening? Here are three back-to-back runs over the same images (a small sketch for measuring this drift follows them):
```
🤖 Haiku *Run #1*
Correct Responses: (5 / 11) 45%
In Tokens: 2222 Out Tokens: 121
Avg Time: 585 ms

🤖 Haiku *Run #2*
Correct Responses: (7 / 11) 64%
In Tokens: 2222 Out Tokens: 121
Avg Time: 585 ms

🤖 Haiku *Run #3*
Correct Responses: (4 / 11) 36%
In Tokens: 2222 Out Tokens: 121
Avg Time: 583 ms
```
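To put a number on this drift, one option is simply to re-run the classification several times on the same image and tally the answers, reusing the classifyDoorHandle() helper from the Anthropic sketch above (the measureDrift() helper is mine, not part of the original script):

```js
// Sketch: call the same classification N times on one image and tally the
// answers, to see how often a temperature-0.0 run changes its mind.
async function measureDrift(path, runs = 10) {
  const tally = {};
  for (let i = 0; i < runs; i++) {
    const answer = await classifyDoorHandle(path);
    tally[answer] = (tally[answer] ?? 0) + 1;
  }
  return tally; // e.g. { vertical: 8, horizontal: 2 }
}

console.log(await measureDrift("images/test03_h.jpg"));
```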
3) Variability in Tokenization: There is significant variability in the number of tokens different models report for the same input image. This directly affects cost estimates and efficiency, since token usage drives the expense of using these models (a quick cost-per-image calculation follows the table).
Model | In Tks/Image | Out Tks/Image | $/M In Tks | $/M Out Tks | Images per $1 |
---|---|---|---|---|---|
Haiku | 202 | 11 | $0.25 | $1.25 | 15,563 |
Sonnet | 156 | 11 | $3.00 | $15.00 | 1,579 |
GPT-4 | 318 | 12 | $10.00 | $30.00 | 282 |
GPT-4o | 317 | 12 | $5.00 | $15.00 | 566 |
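The last column is essentially just arithmetic on the per-image token counts and the list prices, along these lines (a quick sketch; the imagesPerDollar() helper is mine, not from the script):

```js
// Rough cost math behind the "Images per $1" column: price one classification
// from its token counts and the per-million-token rates, then invert.
function imagesPerDollar(inTokens, outTokens, inPricePerM, outPricePerM) {
  const costPerImage = (inTokens * inPricePerM + outTokens * outPricePerM) / 1_000_000;
  return Math.floor(1 / costPerImage);
}

console.log(imagesPerDollar(156, 11, 3.0, 15.0)); // Sonnet -> 1579
```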
4) Variability in Response Time: I did not expect the same model, given the same input size, to have such a wide range of response times. This variability suggests that there are underlying factors affecting the inference speed.
Model | Avg Res Time (ms) | Min Res Time (ms) | Max Res Time (ms) |
---|---|---|---|
Haiku | 598 | 351 | 1073 |
Sonnet | 605 | 468 | 1011 |
GPT-4 | 2246 | 1716 | 6037 |
GPT-4o | 1751 | 1172 | 4559 |
Overall, while the accuracy and results are interesting, they can vary significantly depending on the images used. For instance, would larger images improve the performance of models like Haiku and Sonnet?
Next steps
Here are a few ideas to dive deeper into:
1. Explore Different Challenges: Consider swapping the current challenge with a different task to further test the capabilities of LLMs in various scenarios.
2. Test Local Vision-Enabled Models: Evaluate models like LLaVA 1.5 7B running locally on platforms such as LM Studio or Ollama. Would a local LLM be a viable option? (See the sketch after this list.)
3. Compare with Traditional ML Models: Conduct tests against more traditional machine learning models to see how many sample images are needed to achieve similar or better accuracy.
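For the local-model idea, the call against Ollama would look something like this sketch (assuming Ollama is running on its default port with the llava model pulled; prompt wording is again illustrative):

```js
// Sketch: the same classification against a local LLaVA model served by Ollama
// (default endpoint http://localhost:11434), using its /api/generate endpoint.
import fs from "node:fs";

async function classifyWithLlava(path) {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llava",
      prompt: 'Is the door handle vertical or horizontal? Answer only with JSON: {"position": "vertical"} or {"position": "horizontal"}',
      images: [fs.readFileSync(path).toString("base64")], // base64-encoded image
      stream: false,
      options: { temperature: 0.0 },
    }),
  });
  const data = await res.json();
  return data.response; // the raw text answer from the model
}

console.log(await classifyWithLlava("images/test01_v.jpg"));
```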
Let me know if you have any comments or questions. I’d love to hear your suggestions on where to go next and what tests you’d like to see conducted!