Last updated on October 24, 2024
Claude 3.5 Haiku and the upgraded Claude 3.5 Sonnet are two newly released models from Anthropic. Both showcase improved reasoning, coding, and visual-processing capabilities, and they introduce notable advancements in tool use and autonomous task completion.
The upgraded Claude 3.5 Sonnet model's standout feature is its ability to use computers, interpreting screenshots of graphical user interfaces (GUIs) and generating tool calls to perform tasks. This innovation enables Claude to interact with websites and applications, completing multi-step tasks.
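The screenshot-to-action cycle described above can be sketched as a simple agent loop. The code below is an illustrative assumption of how such a loop is structured, not Anthropic's actual API: `query_model`, `take_screenshot`, and `execute_action` are hypothetical stand-ins for the real model call and a real GUI-automation layer.

```python
# Illustrative sketch of a screenshot -> action agent loop.
# All three helpers are hypothetical placeholders, not a real API.

def take_screenshot() -> bytes:
    return b""  # placeholder: capture the current screen


def query_model(screenshot: bytes, goal: str) -> dict:
    # Placeholder: a real implementation would send the screenshot to the
    # model and parse a proposed action (click, type, etc.) from its reply.
    return {"type": "done"}


def execute_action(action: dict) -> None:
    pass  # placeholder: move the mouse, click, type, scroll, etc.


def run_agent(goal: str, max_steps: int = 15) -> int:
    """Loop: screenshot -> model proposes an action -> execute -> repeat.

    Returns the number of steps taken before the model signals completion.
    """
    for step in range(1, max_steps + 1):
        action = query_model(take_screenshot(), goal)
        if action["type"] == "done":
            return step
        execute_action(action)
    return max_steps
```

The `max_steps` budget matters in practice: as the results below show, allowing more steps gives the model more room to self-correct.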
Claude 3.5 Haiku, on the other hand, is a text-only model that excels at reasoning and instruction following, achieving results comparable to the original Claude 3.5 Sonnet.
- **Enhanced Visual Processing:** The upgraded Sonnet surpasses earlier models by processing visual data directly from screenshots, unlike prior models that relied solely on text-based inputs.
- **Autonomous Tool Use:** While earlier Claude models demonstrated reasoning and coding skills, the new Sonnet extends these abilities to autonomous GUI interaction, making it unique in handling real-world tasks on a computer interface.
- **Agentic Task Completion:** Both models show improved performance on agentic tasks (tasks requiring decision-making and self-correction), particularly in coding and complex workflow automation.
- **Claude 3.5 Haiku's Refinement:** Despite being text-only, Claude 3.5 Haiku matches or surpasses previous models on tasks requiring structured reasoning and instruction adherence.
An example computer-use prompt:

> Please fill out the vendor request form for Ant Equipment Co. using data from either the vendor spreadsheet or search portal tabs in window one. List and verify each field as you complete the form in window two.

Success rates by task category, with 95% confidence intervals:
Category | Claude 3.5 Sonnet (New), 15 steps: Success Rate [95% CI] | Claude 3.5 Sonnet (New), 50 steps: Success Rate [95% CI] | Human Success Rate |
---|---|---|---|
OS | 54.2% [34.3, 74.1]% | 41.7% [22.0, 61.4]% | 75.00% |
Office | 7.7% [2.9, 12.5]% | 17.9% [11.0, 24.8]% | 71.79% |
Daily | 16.7% [8.4, 25.0]% | 24.4% [14.9, 33.9]% | 70.51% |
Professional | 24.5% [12.5, 36.5]% | 40.8% [27.0, 54.6]% | 73.47% |
Workflow | 7.9% [2.6, 13.2]% | 10.9% [4.9, 17.0]% | 73.27% |
Overall | 14.9% [11.3, 18.5]% | 22.0% [17.8, 26.2]% | 72.36% |
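The confidence intervals above are consistent with a standard normal-approximation (Wald) interval for a binomial success rate. As a sketch, the function below reproduces the overall 50-step interval; the sample size of roughly 369 tasks is an assumption, not stated in the table.

```python
import math


def wald_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) 95% confidence interval for a
    binomial success rate p observed over n trials."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return (p - half_width, p + half_width)


# Overall success rate of 22% over n = 369 tasks (n is an assumption here).
lo, hi = wald_ci(0.22, 369)
print(f"[{lo:.1%}, {hi:.1%}]")  # -> [17.8%, 26.2%], matching the table
```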
Task | Claude 3.5 Sonnet (New) | Claude 3.5 Sonnet | Claude 3 Opus | Claude 3 Sonnet | GPT-4o | Gemini 1.5 Pro |
---|---|---|---|---|---|---|
Visual Question Answering | 70.4% | 68.3% | 59.4% | 53.1% | 69.1% | 65.9% |
MathVista (Testmini) | 70.7% | 67.7% | 50.5% | 47.9% | 63.8% | 68.1% |
AI2D (Test) | 95.3% | 94.7% | 88.1% | 88.7% | 94.2% | — |
ChartQA (Test, Relaxed Accuracy) | 90.8% | 90.8% | 80.8% | 81.1% | 85.7% | — |
DocVQA (Test, ANLS Score) | 94.2% | 95.2% | 89.3% | 89.5% | 92.8% | — |
Task | Claude 3.5 Sonnet (New) | Claude 3.5 Sonnet | Claude 3 Opus | Claude 3 Sonnet | GPT-4o | Gemini 1.5 Pro | Llama 3.1 (405B) |
---|---|---|---|---|---|---|---|
Graduate Level Q&A, GPQA (Diamond) (0-shot CoT) | 65.0% | 59.4% | 50.4% | 40.4% | 53.6% | 59.1% | 51.1% |
MMLU (5-shot CoT) | 90.5% | 90.4% | 88.2% | 81.5% | — | — | — |
MMLU Pro (0-shot CoT) | 78.0% | 75.1% | 67.9% | 54.9% | — | 75.8% | 73.3% |
MATH (0-shot CoT) | 78.3% | 71.1% | 60.1% | 43.1% | 76.6% | 86.5% | 73.8% |
HumanEval (Python Coding Tasks) | 93.7% | 92.0% | 84.9% | 73.0% | 90.2% | — | 89.0% |
Task | Claude 3.5 Haiku | Claude 3 Haiku | GPT-4o mini | Gemini 1.5 Flash |
---|---|---|---|---|
Graduate Level Q&A (0-shot CoT) | 41.6% | 33.3% | 40.2% | 51.0% |
MMLU (General Reasoning 5-shot CoT) | 80.9% | 76.7% | — | — |
MMLU (General Reasoning 5-shot) | 77.6% | 75.2% | — | — |
MMLU (General Reasoning 0-shot CoT) | 80.3% | 74.0% | 82.0% | — |
MMLU Pro (General Reasoning 0-shot CoT) | 65.0% | 49.0% | — | 67.3% |
MATH (Mathematical Problem Solving 0-shot CoT) | 69.2% | 38.9% | 70.2% | 77.9% (4-shot CoT) |
HumanEval (Python Coding Tasks 0-shot) | 88.1% | 75.9% | 87.2% | — |
MGSM (Multilingual Math 0-shot CoT) | 85.6% | 75.1% | 87.0% | — |
DROP (Reading Comprehension, Arithmetic F1 Score, 3-shot) | 83.1 | 78.4 | 79.7 | — |
BIG-Bench Hard (Mixed Evaluations 3-shot CoT) | 86.6% | 73.7% | — | — |
AIME 2024 (High School Math Competition, 0-shot CoT) | 5.3% | 0.8% | — | — |
AIME 2024 (Maj@64, 0-shot CoT) | 10.1% | 0.4% | — | — |
IFEval (Instruction Following) | 85.9% | 77.2% | — | — |
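The Maj@64 row above reports majority voting: the model answers each problem 64 times and the most frequent answer is graded. A minimal sketch of that aggregation step (the sample counts below are illustrative):

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Maj@k aggregation: return the most common answer among k samples."""
    return Counter(answers).most_common(1)[0][0]


# With 64 samples, scattered wrong answers are outvoted by the modal answer.
samples = ["34"] * 40 + ["17"] * 20 + ["51"] * 4
print(majority_vote(samples))  # -> 34
```

Majority voting can raise accuracy when the model is right more often than it is wrong on a problem, which is consistent with the jump from 5.3% to 10.1% for Claude 3.5 Haiku on AIME 2024.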