Claude 3.7 Sonnet vs. Grok 3 vs. o3-mini-high

Just a week after Grok 3’s release, we now have Claude 3.7 Sonnet, which has certainly eaten into Grok’s hype pie. Grok 3 was definitely one of the best models for coding, and now, with the new Claude, the equation might change.
Anthropic has been clear about where to position its model: to become the best at coding, for the developers, by the developers, of the developers.

However, I have been using Grok 3 and o3-mini-high, which have performed well on most coding tasks. So, I wanted to see how the new Sonnet compares with Grok 3 and OpenAI’s o3-mini-high.

This is in no way a thorough benchmark; it’s mostly a vibe check, but it’s enough to gauge each model’s coding potential.
So, let’s dive in.

TL;DR

If you want to jump straight to the conclusion: of the three models compared, Claude 3.7 Sonnet is the clear winner at writing code.

Grok 3 and o3-mini-high are somewhat similar, but if I had to pick between them, Grok 3 generates slightly better code than o3-mini-high.

Brief on Claude 3.7 Sonnet

This AI model was released a few days ago and is already the talk of the tech community. I’m emphasizing tech because this model is widely regarded as the best AI model for code, at least for now.

Claude 3.7 Sonnet supports up to 128K output tokens (beta), which is over 15x longer than before. This is especially handy when generating longer and high-quality code.

It beats all the other AI models on the SWE-bench benchmark with an accuracy of 62.3%. Its reported accuracy can even exceed 70%, the highest of any AI model. This represents a gap of 13 to 20 percentage points compared to top OpenAI models, the previous Anthropic model Claude 3.5, and open-source models like DeepSeek R1.

Despite this power, Claude 3.7 has an 18% reduction in total costs compared to its earlier models. It maintains consistent token-based pricing at $3 per million input tokens and $15 per million output tokens.
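To put that pricing in perspective, here’s a quick back-of-the-envelope sketch in Python. The function name `estimate_cost` and the example token counts are my own assumptions for illustration; only the two per-million-token prices come from the numbers above.

```python
# Prices quoted above, in US dollars per million tokens.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated dollar cost of a single request (hypothetical helper)."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    return input_cost + output_cost


# Example: a 2,000-token prompt that uses the full 128K output window.
print(f"${estimate_cost(2_000, 128_000):.2f}")  # prints "$1.93"
```

So even a response that fills the entire 128K output window would cost roughly two dollars at these rates, with almost all of that coming from output tokens.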

Not only that, but Anthropic released Claude Code along with it, an agentic AI CLI that understands your codebase, helps fix issues, answers questions, and, with its Git integration, gives you an idea of your project history.


Comparing Coding Abilities

I have high hopes for Claude 3.7 Sonnet. Let’s see if we can spot any significant differences in code between Sonnet and the other two models.

Let’s start with something interesting:

1. Build simple Minecraft using Pygame

The task is simple: all three LLMs are asked to build a simple Minecraft game using Pygame.

Prompt: Build me a very simple Minecraft game using Pygame in Python.

Response from Claude 3.7 Sonnet

Here’s the code it generated:

Here’s the output of the program:

Response from Grok 3

Here’s the code it generated:

Here’s the output of the program:

The program we got is pretty disappointing and definitely not what I expected. Nothing works as it should, except player movement. It looks far less like Minecraft and more like a snake game.

Response from OpenAI o3-mini-high

Here’s the code it generated:

Here’s the output of the program:

The output by the o3-mini-high model is pretty disappointing; we only got a blank screen with the added background colour.

Final Verdict: It’s pretty fair to say Claude 3.7 wins by a huge margin. Everything works just as expected. The overall game it built includes almost all the features I thought of.

2. Multiple Balls in a Spinning Hexagon

Let’s quickly test all these models with a standard question for judging different LLMs.

This is a modified version of the usual question, which puts only a single ball inside a spinning hexagon.

Prompt: Write me a Python script for 10 balls inside a fast-spinning hexagon.

Response from Claude 3.7 Sonnet

Here’s the code it generated:

The code has just a few minor issues. For example, it unpacks values from the normal returned by the check_ball_collision method without first checking whether that value is None.

Other than that, everything looks quite good.
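To show what that missing None check could look like, here’s a minimal sketch. Only the name `check_ball_collision` comes from the generated code; its signature, the single-wall setup, and the `bounce` helper are my own assumptions for illustration and not what Claude actually wrote.

```python
from typing import Optional, Tuple

Vec = Tuple[float, float]


def check_ball_collision(pos: Vec, radius: float, wall_x: float) -> Optional[Vec]:
    """Hypothetical stand-in: return the wall normal on contact, else None."""
    # One vertical wall at x = wall_x, with the ball kept on its left side.
    if pos[0] + radius >= wall_x:
        return (-1.0, 0.0)
    return None


def bounce(pos: Vec, vel: Vec, radius: float, wall_x: float) -> Vec:
    """Reflect the velocity only when a collision normal was actually returned."""
    normal = check_ball_collision(pos, radius, wall_x)
    if normal is None:
        # No contact: unpacking here would raise a TypeError, so bail out early.
        return vel
    nx, ny = normal
    dot = vel[0] * nx + vel[1] * ny
    if dot >= 0:
        # Already moving away from the wall, so leave the velocity alone.
        return vel
    return (vel[0] - 2 * dot * nx, vel[1] - 2 * dot * ny)
```

The early return is the whole point: without it, `nx, ny = normal` fails as soon as a ball is not touching anything.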

Here’s the output of the program:

There’s an issue with Grok 3’s output: the balls are not supposed to end up outside the hexagon.

When I asked it the standard question of one ball inside the spinning hexagon, it answered easily, but it couldn’t handle it when I tweaked it just a bit.

It is fair to say that this model could not handle this question well.
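For context, one common way to keep balls inside a convex shape is to measure each ball’s signed distance to every edge and push it back in when it gets too close. Here’s a rough sketch of that idea. The function names `hexagon_vertices` and `confine_ball` are hypothetical, this is not the code any of the models generated, and it treats the hexagon as a static snapshot for one frame (a spinning version would recompute the vertices every frame and ideally account for the wall’s own velocity).

```python
import math


def hexagon_vertices(cx, cy, size, angle):
    """Vertices of a regular hexagon in increasing-angle order, rotated by `angle` radians."""
    return [
        (cx + size * math.cos(angle + i * math.pi / 3),
         cy + size * math.sin(angle + i * math.pi / 3))
        for i in range(6)
    ]


def confine_ball(pos, vel, radius, vertices):
    """Push a ball back inside the polygon and reflect its velocity off the wall."""
    x, y = pos
    vx, vy = vel
    n = len(vertices)
    for i in range(n):
        ax, ay = vertices[i]
        bx, by = vertices[(i + 1) % n]
        ex, ey = bx - ax, by - ay
        length = math.hypot(ex, ey)
        # Inward unit normal for vertices produced by hexagon_vertices.
        nx, ny = -ey / length, ex / length
        # Signed distance from the ball's centre to this edge (positive = inside).
        dist = (x - ax) * nx + (y - ay) * ny
        if dist < radius:
            # Move the centre so the ball just touches the wall.
            x += (radius - dist) * nx
            y += (radius - dist) * ny
            dot = vx * nx + vy * ny
            if dot < 0:
                # Reflect only if the ball is moving into the wall.
                vx -= 2 * dot * nx
                vy -= 2 * dot * ny
    return (x, y), (vx, vy)
```

Calling `confine_ball` for every ball once per frame, after updating positions, is usually enough for a vibe-check demo like this one. A single pass can still leave a ball slightly off near a corner, so a more careful version might repeat the loop a couple of times.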

Response from OpenAI o3-mini-high

Here’s the generated code:

Final Verdict: I was pretty surprised that both Claude 3.7 and o3-mini-high got this one right while Grok 3 did not. o3-mini-high clearly outperformed Grok 3 here. However, Sonnet just blew them both out of the water.

3. Build a real-time Browser-Based Markdown Editor

Considering how good Claude has been in both the game and animation tests, and given Anthropic’s strong claims about coding, let’s do a quick web app test on all three models.

Prompt: Build a tiny browser-based Markdown editor with syntax highlighting, export-to-PDF functionality, and a minimal UI using Tailwind in Next.js, with all the changes in a single file.

Response from Claude 3.7 Sonnet

The code is all good, except that code highlighting does not work correctly. This seems to be because marked recently introduced breaking changes to how code highlighting is set up.

The model may not be trained on the most recent data for the module.

Here’s the exported PDF:

Response from Grok 3

We have a couple of issues, the first one being that the headings don’t really work. There are also issues with the font contrast, and the exported PDF doesn’t render the Markdown correctly.

In the exported PDF, we get raw text instead of formatted Markdown, with no emoji support.

Response from OpenAI o3-mini-high

Here’s the generated code:

Verdict: Here, as well, Claude 3.7 is the clear winner. Almost everything worked in its version, whereas the other two models couldn’t get it right: their versions had issues with text contrast and Markdown rendering on the site and in the PDF.

4. Create a Code Diff Viewer

Let’s try a simple web application example to see if all of them get it right.

This is a pretty standard question and somewhat easy to implement. I have high hopes that all three will get it right, even o3-mini-high, which has somewhat disappointed me so far.

Prompt: Write a simple web application for a code diff viewer, a tool that takes two text inputs and highlights differences side by side.

Response from Claude 3.7 Sonnet

Response from Grok 3

Everything else seems to be working fine with this output as well, but the diff does not take each line’s indentation into account.

Response from OpenAI o3-mini-high

Here, we have an interesting result. It decided to use an external library, diff, and to highlight differences per character rather than per line.
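To make the per-line versus per-character distinction concrete, here’s a small sketch using Python’s standard `difflib` module. This is a swapped-in illustration, not the library o3-mini-high used, and the helper names `line_diff` and `char_diff` are my own.

```python
import difflib


def line_diff(old: str, new: str) -> list[str]:
    """Line-level diff. Leading whitespace is kept, so indentation changes show up."""
    lines = difflib.ndiff(old.splitlines(), new.splitlines())
    # Drop ndiff's "? " hint lines and keep only "  ", "- " and "+ " entries.
    return [line for line in lines if not line.startswith("? ")]


def char_diff(old: str, new: str) -> list[tuple[str, str]]:
    """Character-level diff as (tag, text) pairs: 'equal', 'delete' or 'insert'."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.append(("equal", old[i1:i2]))
            continue
        # A 'replace' opcode is split into a delete followed by an insert.
        if i1 != i2:
            result.append(("delete", old[i1:i2]))
        if j1 != j2:
            result.append(("insert", new[j1:j2]))
    return result


print(char_diff("cat", "cut"))
# [('equal', 'c'), ('delete', 'a'), ('insert', 'u'), ('equal', 't')]
```

A per-line diff flags the whole line as changed, which is easier to scan for code. A per-character diff pinpoints the exact edit, at the cost of noisier output when lines change a lot.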

Final Verdict: Functionality-wise, all three models got this problem correct, at least from surface-level testing (there might be edge cases, though). I must say the overall code quality and output of o3-mini-high is comparatively better than that of both Claude 3.7 Sonnet and Grok 3.

5. Manim code for Square to Pyramid Animation

Let’s end our test with a final Manim question. Most LLMs pretty much suck at writing Manim code, and it gets even harder when it’s a 3D scene.

Prompt: Create a Manim animation in Python where a 2D square smoothly lifts into 3D space to form a pyramid. Then, animate the pyramid shrinking back into a square.

Response from Claude 3.7 Sonnet

Response from Grok 3

Output

Response from OpenAI o3-mini-high

Here, this model really struggled with the 3D projection and failed to transform the square into a pyramid.

Final Verdict: The Claude 3.7 Sonnet and Grok 3 models got it completely correct, but in terms of animation, I prefer the output from the Grok 3 model. o3-mini-high failed completely and couldn’t even reach the solution.

Conclusion

It’s fair to say that Claude 3.7 is what it claims to be. Of the five questions we compared, it was the clear winner in three and stayed competitive in the other two.

It doesn’t necessarily mean that Claude 3.7 is the answer to everything, but it is definitely better at one thing, and that is coding.

The race between the AI models will never stop, and the game’s always on!
