Let's get straight to business: knowledge cutoffs, parameter counts, release dates, and context windows for Qwen 2 and ChatGPT 4o.
The more recent Qwen 2 boasts an equally impressive context window, as expected for an Instruct model. GPT-4o's knowledge cutoff is pretty early for a cutting-edge LLM; still, its parameter count is much higher.
This benchmark comparison combines the official release notes for both models (GPT-4o and Qwen 2) with several open benchmarks.
As usual with new models, the harder the benchmark, the wider the performance gap. On the graduate-level GPQA benchmark, GPT-4o scores 53.6 against 42.4 for Qwen 2. The rest is pretty even, except for MATH.
We're diving deep into testing both models in tough, real-world-like scenarios. We'll throw in prompts covering chat, coding, and image analysis to see how each one stacks up.
Give me 10 sarcastic jokes similar to the one below:
Dad: Why are your eyes red son?
Son: I've been smoking weed
Dad: Don't lie, you have been crying because you're coding in Javascript!
The example joke is taken from a Fireship retweet.
Sure, here are 5 sarcastic jokes in a similar style:
1. **Dad:** Why are you so tired?
**Daughter:** I was at a party all night.
**Dad:** Don't lie, you've been dealing with CSS alignment issues again.
## we edited the next outputs down to just the punchlines
2. ... too busy fixing your spaghetti code ...
3. ... pulling your hair out over a merge conflict ...
4. ... too busy dealing with memory leaks in your Java program ...
5. ... trying to make sense of that legacy code ...
## here, the setups are different, so we kept all of them
1. Mom: Why are you so tired today?
Son: I stayed up all night playing video games.
Mom: Don't lie, you've been up all night debugging your code because you thought "for" meant "four" in Python.
2. Dad: Why do you have band-aids all over your fingers?
Son: I got into a fight with a cactus.
Dad: Don't lie, you've been trying to close a bracket in your code and accidentally punched the keyboard too many times.
3. Mom: Why are your glasses all fogged up?
Daughter: I just came in from the cold.
Mom: Don't lie, you've been crying because you accidentally deleted your entire codebase and have to start from scratch.
4. Dad: Why do you have a bruise on your forehead?
Son: I walked into a door.
Dad: Don't lie, you've been banging your head against the wall because you can't figure out why your code isn't working.
5. Dad: Why do you have a bandage on your finger?
Son: I cut myself while cooking.
Dad: Don't lie, you've been trying to slice a string in Python and accidentally sliced your finger instead.
A man has 53 socks in his drawer: 21 identical blue, 15 identical black and 17 identical red. The lights are out and he is completely in the dark. How many socks must he take out to make 100 percent certain he has at least one pair of black socks?
To ensure that the man has at least one pair of black socks, we need to consider the worst-case scenario. In the worst-case scenario, he could potentially pick out all the socks of the other colors before picking any black socks. Here are the steps to determine the number of socks he needs to take out to be 100 percent certain of having at least one pair of black socks:
1. **Count the total number of non-black socks**:
- Blue socks: 21
- Red socks: 17
- Total non-black socks = 21 (blue) + 17 (red) = 38
2. **Worst-case scenario**:
- In the worst-case scenario, he could pick all 38 non-black socks first.
3. **Picking black socks**:
- After picking the 38 non-black socks, every additional sock he picks will be black because there are only black socks left.
4. **Ensuring a pair**:
- To ensure he has at least one pair of black socks, he needs to pick 2 black socks.
- Thus, he needs to pick 38 (non-black socks) + 2 (black socks) = 40 socks.
## THE ANSWER
Therefore, the man must take out **40 socks** to be 100 percent certain he has at least one pair of black socks.
To ensure 100% certainty of having at least one pair of black socks, we need to consider the worst-case scenario when picking socks from the drawer. Here's how we approach this:
1. **Worst-case scenario for picking socks without getting a pair of black socks:**
- Pick all the blue socks: 21 socks.
- Pick all the red socks: 17 socks.
- Pick one black sock: 1 sock.
So far, 21 (blue) + 17 (red) + 1 (black) = 39 socks have been picked without forming a pair of black socks.
2. **Ensuring a pair of black socks:**
- After picking all the blue and red socks and one black sock, to ensure a pair of black socks, he needs to pick one more black sock.
## THE ANSWER
So, he must pick: 39 (from step 1) + 1 (additional black sock) = **40 socks** in total to make 100% certain he has at least one pair of black socks.
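Both models land on the same answer. For the skeptical, here's a quick sanity check of the worst-case reasoning - a tiny brute-force snippet of our own, not part of either model's output:

```python
# Simulate the worst-case draw order: all 21 blue and all 17 red socks
# come out first, then black socks one at a time. Count how many picks
# it takes before two black socks are in hand.
worst_case_order = ["blue"] * 21 + ["red"] * 17 + ["black"] * 15

picked = []
for sock in worst_case_order:
    picked.append(sock)
    if picked.count("black") == 2:
        break

print(len(picked))  # prints 40
```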
Write me code for the snake game. Give me a score counter, an option to restart on C and quit on Q.
// the next line was accomplished only by GPT-4o
Add the menu to choose difficulty.
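For reference, here's a minimal sketch of what the prompt asks for - our own illustration built on Python's standard-library curses module (Unix terminals; Windows needs the windows-curses package), not either model's actual output:

```python
import curses
import random

def pick_difficulty(stdscr):
    # Simple difficulty menu; returns the frame delay in milliseconds.
    stdscr.nodelay(False)
    stdscr.clear()
    options = [("1 - Easy", 200), ("2 - Medium", 120), ("3 - Hard", 60)]
    stdscr.addstr(0, 0, "Choose difficulty:")
    for i, (label, _) in enumerate(options):
        stdscr.addstr(i + 1, 2, label)
    while True:
        key = stdscr.getch()
        if key in (ord('1'), ord('2'), ord('3')):
            return options[key - ord('1')][1]

def play(stdscr, delay):
    # Runs one game; returns 'restart' (C) or 'quit' (Q).
    curses.curs_set(0)
    stdscr.timeout(delay)  # getch() doubles as the frame timer
    h, w = stdscr.getmaxyx()
    snake = [(h // 2, w // 4 + i) for i in range(3)]  # head is snake[0]
    direction = (0, -1)  # start moving left
    food = (h // 2, w // 2)
    score = 0
    turns = {curses.KEY_UP: (-1, 0), curses.KEY_DOWN: (1, 0),
             curses.KEY_LEFT: (0, -1), curses.KEY_RIGHT: (0, 1)}
    while True:
        stdscr.clear()
        stdscr.border()
        stdscr.addstr(0, 2, f" Score: {score}  (Q = quit, C = restart) ")
        stdscr.addstr(food[0], food[1], "*")
        for y, x in snake:
            stdscr.addstr(y, x, "#")
        key = stdscr.getch()
        if key == ord('q'):
            return 'quit'
        if key == ord('c'):
            return 'restart'
        if key in turns:
            new_dir = turns[key]
            # Ignore a turn that would reverse the snake onto itself.
            if (new_dir[0] + direction[0], new_dir[1] + direction[1]) != (0, 0):
                direction = new_dir
        head = (snake[0][0] + direction[0], snake[0][1] + direction[1])
        # Game over on hitting the border or the snake's own body.
        if head in snake or head[0] in (0, h - 1) or head[1] in (0, w - 1):
            stdscr.timeout(-1)  # block until the player chooses
            stdscr.addstr(h // 2, max(0, w // 2 - 5), "GAME OVER")
            while True:
                key = stdscr.getch()
                if key == ord('q'):
                    return 'quit'
                if key == ord('c'):
                    return 'restart'
        snake.insert(0, head)
        if head == food:
            score += 1
            # A fuller version would avoid spawning food on the snake itself.
            food = (random.randint(1, h - 2), random.randint(1, w - 2))
        else:
            snake.pop()  # move forward without growing

def main(stdscr):
    while True:
        if play(stdscr, pick_difficulty(stdscr)) == 'quit':
            break

if __name__ == "__main__":
    curses.wrapper(main)
```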
Analyze the following image:
Clever trick! You still have 4 marbles, but they're no longer in the cup because you turned it upside down! They're probably scattered around on the floor or counter now!
You still have 4 marbles in the cup, even though it is now upside down and in the freezer.
The Qwen 2 model was pressure-tested against high-risk content (such as explicit materials and fraud) across multiple languages. It demonstrated safety performance on par with the renowned GPT-4o.
Moreover, Qwen 2's long-form context comprehension was tested using the Needle in a Haystack test. The model accurately extracted details from text spanning up to 128K tokens without missing any critical information.
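If you're curious what such a test looks like in practice, here's a minimal sketch of a Needle-in-a-Haystack prompt builder - our own illustration, not Qwen's actual evaluation harness; the filler text, needle wording, and depth are arbitrary choices:

```python
# Build a long document with a "needle" fact buried at a chosen depth,
# then ask the model to retrieve it. Scale total_chars up toward the
# model's 128K-token context window to stress long-context recall.
def build_haystack_prompt(needle: str, filler: str, total_chars: int, depth: float) -> str:
    haystack = (filler * (total_chars // len(filler) + 1))[:total_chars]
    position = int(total_chars * depth)  # e.g. 0.5 buries the needle mid-document
    document = haystack[:position] + " " + needle + " " + haystack[position:]
    return (
        "Read the document below and answer the question.\n\n"
        f"{document}\n\n"
        "Question: What is the secret passphrase mentioned in the document?"
    )

prompt = build_haystack_prompt(
    needle="The secret passphrase is 'blue-octopus-42'.",
    filler="The quick brown fox jumps over the lazy dog. ",
    total_chars=20_000,
    depth=0.5,
)
```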
ChatGPT 4o is around 1.5x faster than Qwen 2 and also has vision and speech recognition capabilities - an important feat. If you want a recap of the GPT-4o announcement, check out this article here.
Qwen 2 is beaten by GPT-4o in benchmarks, and although most tests were close, coding showed some important differences. Most likely, Qwen2-72B-Instruct will find its place in the research community thanks to its solid comprehension of the full 128K-token context.
The pricing model is given in AI/ML API tokens. As Qwen 2 is open-source, setting it up and maintaining it locally would have different pricing.
Now that we've highlighted some key strengths and run the tests, it's time for you to try the models on your own use case. Below, you'll find a code snippet that pits Qwen 2 against ChatGPT 4o using our API key. Throw anything at them and see how robust they are in your field!
from openai import OpenAI

def main():
    # Point the OpenAI client at the AI/ML API endpoint
    client = OpenAI(
        api_key='<YOUR_API_KEY>',
        base_url="https://api.aimlapi.com",
    )

    # Specify the two models you want to compare
    model1 = 'gpt-4o'
    model2 = 'Qwen/Qwen2-72B-Instruct'
    selected_models = [model1, model2]

    system_prompt = 'You are an AI assistant that only responds with jokes.'
    user_prompt = 'Why is the sky blue?'

    # Collect each model's answer to the same prompt
    results = {}
    for model in selected_models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {'role': 'system', 'content': system_prompt},
                    {'role': 'user', 'content': user_prompt}
                ],
            )
            message = response.choices[0].message.content
            results[model] = message
        except Exception as error:
            print(f"Error with model {model}:", error)

    # Compare the results
    print('Comparison of models:')
    print(f"{model1}: {results.get(model1, 'No response')}")
    print(f"{model2}: {results.get(model2, 'No response')}")

if __name__ == "__main__":
    main()
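A quick note on running it: the snippet assumes the official openai Python package (pip install openai, version 1.x), and the base_url argument is what redirects requests to the AI/ML API servers instead of OpenAI's.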
Qwen 2 couldn't keep up with ChatGPT 4o, and unlike Llama 3, its speed couldn't compensate for the drawbacks. Still, it's a good open-source alternative, and the cheaper output price certainly makes the model worth considering.
No matter which model you go with, prepare to be blown away by its capabilities. If raw performance in our comparison won you over, go for ChatGPT 4o. If the pricing sealed the deal for you, Qwen 2 is your best bet.
You can check our model lineup here - try any of them for yourself with our API key.