Qwen 2 72B
ChatGPT 4o

Testing the limits of the new open-source star - Qwen 2.
How does it fare against the industry standard GPT 4o?

Benchmarks and specs


Let's get straight to business - knowledge cutoff, number of parameters, release dates, and the context windows for Qwen 2 and ChatGPT 4o.

| Specification | ChatGPT 4o | Qwen 2 72B Instruct |
| --- | --- | --- |
| Context window | 128K | 128K |
| Knowledge cutoff | October 2023 | 2023 (month unspecified) |
| Number of parameters in the LLM | > 175B | 72B |
| Release date | May 13, 2024 | June 7, 2024 |

The newer Qwen 2 boasts an equally impressive context window, as expected for an Instruct model. ChatGPT 4 omni's knowledge cutoff is pretty early for a cutting-edge LLM. Still, its parameter count is much higher.


The benchmark figures below combine the official release notes for both models (GPT 4 omni and Qwen 2) with some open benchmarks.

| Benchmark | ChatGPT 4 omni | Qwen 2 72B Instruct |
| --- | --- | --- |
| Undergraduate level knowledge | 88.7 | 82.3 |
| Graduate level reasoning | 53.6 | 42.4 |
| Human Eval | 90.2 | 86.0 |
| Grade school math | 90.5 | 91.1 |
| Math problem-solving | 76.6 | 59.7 |
| MixEval | 87.9 | 86.7 |

As usual with new models - the harder the benchmark, the wider the performance gap. Graduate level reasoning on the GPQA benchmark comes in at 53.6 for GPT 4o against 42.4 for Qwen 2, and math problem-solving on MATH at 76.6 against 59.7. The rest is pretty even.
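To make the gaps easier to see, here is a minimal Python sketch that subtracts the Qwen 2 score from the GPT 4o score for each row of the table above. The scores are copied from the table; the names `scores` and `score_gaps` are our own, chosen for illustration.

```python
# Scores copied from the benchmark table: (GPT 4o, Qwen 2 72B Instruct).
scores = {
    "Undergraduate level knowledge": (88.7, 82.3),
    "Graduate level reasoning": (53.6, 42.4),
    "Human Eval": (90.2, 86.0),
    "Grade school math": (90.5, 91.1),
    "Math problem-solving": (76.6, 59.7),
    "MixEval": (87.9, 86.7),
}


def score_gaps(table):
    """Return {benchmark: GPT 4o score minus Qwen 2 score}, rounded to one decimal."""
    return {name: round(gpt - qwen, 1) for name, (gpt, qwen) in table.items()}


gaps = score_gaps(scores)

# Print the benchmarks from the widest gap to the narrowest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:32s} {gap:+.1f}")
```

A positive number means GPT 4o is ahead. Grade school math is the only row where the sign flips in favor of Qwen 2.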

Practical tests

We're diving deep into testing AI in tough, real-world-like scenarios. We'll throw in prompts covering chat, coding, and image analysis to see how each model stacks up.

Prompt 1:

Give me 10 sarcastic jokes similar to the one below:

Dad: Why are your eyes red son?
Son: I've been smoking weed
Dad: Don't lie, you have been crying because you're coding in Javascript!
Example joke is taken from a Fireship retweet.

ChatGPT 4o output

Sure, here are 5 sarcastic jokes in a similar style:

1. **Dad:** Why are you so tired?
   **Daughter:** I was at a party all
   **Dad:** Don't lie, you've been
   dealing with CSS alignment issues
## we edited next outputs
## to just the punchlines

2. ... too busy fixing
       your spaghetti code ...

3. ... pulling your hair out over
       a merge conflict ...

4. ... too busy dealing
       with memory leaks
       in your Java program ...

5. ... trying to make sense of
       that legacy code ...

Correct Answers: ✅
Good understanding of nuance, also kept the initial structure.

Qwen 2 output

# here, the setups are different so we kept all of them
1. Mom: Why are you so tired today?
   Son: I stayed up all night
   playing video games.
   Mom: Don't lie, you've been up all night
   debugging your code because you thought
   "for" meant "four" in Python.

2. Dad: Why do you have band-aids all over your fingers?
   Son: I got into a fight with a cactus.
   Dad: Don't lie, you've been trying to
   close a bracket in your code and accidentally
   punched the keyboard too many times.

3. Mom: Why are your glasses all fogged up?
   Daughter: I just came in from the cold.
   Mom: Don't lie, you've been crying because
   you accidentally deleted your entire
   codebase and have to start from scratch.

4. Dad: Why do you have a bruise on your forehead?
   Son: I walked into a door.
   Dad: Don't lie, you've been banging your head
   against the wall because you can't figure out
   why your code isn't working.

5. Dad: Why do you have a bandage on your finger?
   Son: I cut myself while cooking.
   Dad: Don't lie, you've been trying
   to slice a string in Python and accidentally
   sliced your finger instead.

Correct Answers: 8/10 ✅
Absurd, yet funny. Some details are on point. That's avant-garde art.

Prompt 2:

A man has 53 socks in his drawer: 
21 identical blue, 15 identical black and 17 identical red. 
The lights are out and he is completely in the dark. 

How many socks must he take out 
to make 100 percent certain he has at least one pair of black socks?

ChatGPT 4o output

To ensure that the man has at least one pair of black socks, we need to consider the worst-case scenario. In the worst-case scenario, he could potentially pick out all the socks of the other colors before picking any black socks. Here are the steps to determine the number of socks he needs to take out to be 100 percent certain of having at least one pair of black socks:

1. **Count the total number of non-black socks**:
   - Blue socks: 21
   - Red socks: 17
   - Total non-black socks = 21 (blue) + 17 (red) = 38

2. **Worst-case scenario**:
   - In the worst-case scenario, he could pick all 38 non-black socks first.

3. **Picking black socks**:
   - After picking the 38 non-black socks, every additional sock he picks will be black because there are only black socks left.

4. **Ensuring a pair**:
   - To ensure he has at least one pair of black socks, he needs to pick 2 black socks.
   - Thus, he needs to pick 38 (non-black socks) + 2 (black socks) = 40 socks.

Therefore, the man must take out **40 socks** to be 100 percent certain he has at least one pair of black socks.

Correct Answer: ✅
Well-detailed, with clear explanation.

Qwen 2 output

To ensure 100% certainty of having at least one pair of black socks, we need to consider the worst-case scenario when picking socks from the drawer. Here's how we approach this:

1. **Worst-case scenario for picking socks without getting a pair of black socks:**
   - Pick all the blue socks: 21 socks.
   - Pick all the red socks: 17 socks.
   - Pick one black sock: 1 sock.

So far, 21 (blue) + 17 (red) + 1 (black) = 39 socks have been picked without forming a pair of black socks.

2. **Ensuring a pair of black socks:**
   - After picking all the blue and red socks and one black sock, to ensure a pair of black socks, he needs to pick one more black sock.

So, he must pick: 39 (from step 1) + 1 (additional black sock) = **40 socks** in total to make 100% certain he has at least one pair of black socks.

Correct Answer: ✅
Straight to the point, and the direction is clear from the first step.
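Both answers rely on the same worst-case argument: every sock of the other colors could come out first, and only then two black ones. A minimal Python sketch of that reasoning is below. The function name `draws_to_guarantee_pair` is hypothetical, introduced here only for illustration.

```python
def draws_to_guarantee_pair(counts, target):
    """Worst-case number of draws needed to hold two socks of color `target`."""
    if counts.get(target, 0) < 2:
        raise ValueError("the drawer needs at least two socks of the target color")
    # In the worst case, every sock of every other color is drawn first.
    others = sum(n for color, n in counts.items() if color != target)
    # After that, the next two draws must both be the target color.
    return others + 2


drawer = {"blue": 21, "black": 15, "red": 17}
print(draws_to_guarantee_pair(drawer, "black"))  # 21 + 17 + 2 = 40
```

Changing the target color shows how the answer depends only on the socks of the other colors, not on how many target socks there are beyond two.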

Prompt 4:

Write me code for the snake game. 
Give me a score counter, 
an option to restart on C and quit on Q.
//the next line was accomplished only by gpt 4o
Add the menu to choose difficulty.

ChatGPT 4o output

[Screenshot: snake game with a score counter]

Correct Answer: ✅
Rose to the occasion with a pleasant design and proper task execution.

Qwen 2 output

[Screenshot: snake game]

Correct Answer: 🤷
Not too bad, nice controls - yet the prompt had to be cut short (dropping the difficulty menu) before we got the results.

Additional Features

The Qwen 2 model was pressure-tested against high-risk activity (such as explicit materials and fraud) across multiple languages. It demonstrated safety performance on par with the renowned GPT-4o.

Moreover, Qwen 2's capabilities were also tested for long-form context comprehension using the Needle in a Haystack test. The model was able to accurately extract details from text spanning up to 128K tokens without missing any critical information.
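For readers unfamiliar with the test, here is a rough Python sketch of how a Needle in a Haystack prompt is commonly built: a short fact (the needle) is buried at a chosen depth inside long filler text, and the model is then asked to retrieve it. This is not the harness used for Qwen 2. The function names, the filler text, and the needle are all made up for illustration, and length is counted in words as a crude stand-in for tokens.

```python
def build_haystack_prompt(filler, needle, question, total_words, depth):
    """Build a long prompt with `needle` inserted at a relative `depth` (0.0 to 1.0).

    Length is measured in whitespace-separated words, a rough proxy for tokens.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    filler_words = filler.split()
    if not filler_words:
        raise ValueError("filler must contain at least one word")
    # Repeat the filler until it covers the requested length, then trim.
    repeats = total_words // len(filler_words) + 1
    haystack = (filler_words * repeats)[:total_words]
    # Insert the needle at the requested relative position.
    position = int(len(haystack) * depth)
    haystack[position:position] = needle.split()
    return " ".join(haystack) + "\n\nQuestion: " + question


def needle_found(answer, expected):
    """Case-insensitive check that the model's answer contains the expected detail."""
    return expected.lower() in answer.lower()


prompt = build_haystack_prompt(
    filler="The quick brown fox jumps over the lazy dog.",
    needle="The secret code is 7421.",
    question="What is the secret code?",
    total_words=2000,
    depth=0.5,
)
```

In a full run, `prompt` would be sent to the model at several lengths and depths, and `needle_found` would score each reply.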

ChatGPT 4o is around 1.5 times faster than Qwen 2 and also has vision and speech recognition capabilities - an important advantage. If you want a recap of the GPT 4o announcement, check out this article here.


GPT 4o beats Qwen 2 in the benchmarks, and although most tests were close, coding showed some important differences. Most likely, Qwen 2 72B Instruct will find its place in the research community thanks to its good comprehension across the whole 128K-token context.

Pricing is given in AI/ML API tokens. As Qwen 2 is open-source, setting it up and maintaining it locally would come at a different cost.

| 1k AI/ML Tokens | Qwen 2 72B Instruct | ChatGPT 4o |
| --- | --- | --- |
| Input price | $0.00117 | $0.0065 |
| Output price | $0.00117 | $0.0195 |

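To see what these prices mean for a single request, here is a small Python sketch that estimates the cost from the table's per-1k-token prices. The request size (10,000 input tokens and 2,000 output tokens) is an arbitrary example, and the names `PRICES_PER_1K` and `request_cost` are ours.

```python
# Prices in USD per 1k AI/ML tokens, copied from the pricing table.
PRICES_PER_1K = {
    "Qwen 2 72B Instruct": {"input": 0.00117, "output": 0.00117},
    "ChatGPT 4o": {"input": 0.0065, "output": 0.0195},
}


def request_cost(model, input_tokens, output_tokens):
    """Estimate the cost in USD of one request for the given model."""
    price = PRICES_PER_1K[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1000


# Hypothetical request: 10,000 input tokens and 2,000 output tokens.
for model in PRICES_PER_1K:
    print(f"{model}: ${request_cost(model, 10_000, 2_000):.5f}")
```

For this example the estimate is about $0.014 for Qwen 2 and about $0.104 for ChatGPT 4o, so the difference grows quickly for output-heavy workloads.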
Compare for yourself

Now that we've highlighted some key strengths and run the tests, it's time for you to try the models on your own use case. Below, you'll find a code snippet that pits Qwen 2 against ChatGPT 4o using our API Key. Throw anything at them, and see how robust they are in your field!

from openai import OpenAI

def main():
    # The client arguments are placeholders: supply your own API key
    # and the base URL of the API you are calling.
    client = OpenAI(
        api_key='<YOUR_API_KEY>',
        base_url='<API_BASE_URL>',
    )

    # Specify the two models you want to compare
    model1 = 'gpt-4o'
    model2 = 'Qwen/Qwen2-72B-Instruct'
    selected_models = [model1, model2]

    system_prompt = 'You are an AI assistant that only responds with jokes.'
    user_prompt = 'Why is the sky blue?'
    results = {}

    # Send the same prompts to each model and collect the replies
    for model in selected_models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {'role': 'system', 'content': system_prompt},
                    {'role': 'user', 'content': user_prompt},
                ],
            )
            message = response.choices[0].message.content
            results[model] = message
        except Exception as error:
            print(f"Error with model {model}:", error)

    # Compare the results
    print('Comparison of models:')
    print(f"{model1}: {results.get(model1, 'No response')}")
    print(f"{model2}: {results.get(model2, 'No response')}")

if __name__ == "__main__":
    main()

Qwen 2 couldn't keep up with ChatGPT 4o, and unlike Llama 3, its speed couldn't compensate for the drawbacks. Still, it's a good open-source alternative. The lower output price certainly makes the model worth considering.

No matter which model you go with, prepare to be blown away by its capabilities. If you're feeling good after our comparison, go for ChatGPT 4o. If the pricing model sealed the deal for you, Qwen 2 is your best bet.

You can check our model lineup here - try any of them for yourself with our API Key.

