Let's get straight to business: knowledge cutoffs, parameter counts, release dates, and context windows for Qwen 2 and ChatGPT 4o.
The more recent Qwen 2 boasts an equally impressive context window, as expected for an Instruct model. GPT-4o's knowledge cutoff is pretty early for a cutting-edge LLM; still, its parameter count is much higher.
This benchmark comparison combines the official release notes for both models (GPT-4o and Qwen 2) with several open benchmarks.
As usual with new models, the harder the benchmark, the wider the performance gap. On the graduate-level GPQA benchmark, GPT-4o scores 53.6 against 42.4 for Qwen 2. The rest is pretty even, except for MATH.
We're diving deep into testing both models in tough, real-world-like scenarios. We'll throw in prompts covering chat, coding, and image analysis to see how each one stacks up.
Give me 10 sarcastic jokes similar to the one below:
Dad: Why are your eyes red son?
Son: I've been smoking weed
Dad: Don't lie, you have been crying because you're coding in Javascript!
The example joke is taken from a Fireship retweet.
Sure, here are 5 sarcastic jokes in a similar style:
1. **Dad:** Why are you so tired?
**Daughter:** I was at a party all night.
**Dad:** Don't lie, you've been dealing with CSS alignment issues again.
## we edited the next outputs down to just the punchlines
2. ... too busy fixing your spaghetti code ...
3. ... pulling your hair out over a merge conflict ...
4. ... too busy dealing with memory leaks in your Java program ...
5. ... trying to make sense of that legacy code ...
## here, the setups are different, so we kept all of them
1. Mom: Why are you so tired today?
Son: I stayed up all night playing video games.
Mom: Don't lie, you've been up all night debugging your code because you thought "for" meant "four" in Python.
2. Dad: Why do you have band-aids all over your fingers?
Son: I got into a fight with a cactus.
Dad: Don't lie, you've been trying to close a bracket in your code and accidentally punched the keyboard too many times.
3. Mom: Why are your glasses all fogged up?
Daughter: I just came in from the cold.
Mom: Don't lie, you've been crying because you accidentally deleted your entire codebase and have to start from scratch.
4. Dad: Why do you have a bruise on your forehead?
Son: I walked into a door.
Dad: Don't lie, you've been banging your head against the wall because you can't figure out why your code isn't working.
5. Dad: Why do you have a bandage on your finger?
Son: I cut myself while cooking.
Dad: Don't lie, you've been trying to slice a string in Python and accidentally sliced your finger instead.
A man has 53 socks in his drawer: 21 identical blue, 15 identical black and 17 identical red. The lights are out and he is completely in the dark. How many socks must he take out to make 100 percent certain he has at least one pair of black socks?
To ensure that the man has at least one pair of black socks, we need to consider the worst-case scenario. In the worst-case scenario, he could potentially pick out all the socks of the other colors before picking any black socks. Here are the steps to determine the number of socks he needs to take out to be 100 percent certain of having at least one pair of black socks:
1. **Count the total number of non-black socks**:
- Blue socks: 21
- Red socks: 17
- Total non-black socks = 21 (blue) + 17 (red) = 38
2. **Worst-case scenario**:
- In the worst-case scenario, he could pick all 38 non-black socks first.
3. **Picking black socks**:
- After picking the 38 non-black socks, every additional sock he picks will be black because there are only black socks left.
4. **Ensuring a pair**:
- To ensure he has at least one pair of black socks, he needs to pick 2 black socks.
- Thus, he needs to pick 38 (non-black socks) + 2 (black socks) = 40 socks.
## THE ANSWER
Therefore, the man must take out **40 socks** to be 100 percent certain he has at least one pair of black socks.
To ensure 100% certainty of having at least one pair of black socks, we need to consider the worst-case scenario when picking socks from the drawer. Here's how we approach this:
1. **Worst-case scenario for picking socks without getting a pair of black socks:**
- Pick all the blue socks: 21 socks.
- Pick all the red socks: 17 socks.
- Pick one black sock: 1 sock.
So far, 21 (blue) + 17 (red) + 1 (black) = 39 socks have been picked without forming a pair of black socks.
2. **Ensuring a pair of black socks:**
- After picking all the blue and red socks and one black sock, to ensure a pair of black socks, he needs to pick one more black sock.
## THE ANSWER
So, he must pick: 39 (from step 1) + 1 (additional black sock) = **40 socks** in total to make 100% certain he has at least one pair of black socks.
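Both models land on the same answer. For the skeptical, here's a quick sanity check of the worst-case reasoning - a tiny brute-force snippet of our own, not part of either model's output:

```python
# Simulate the worst-case draw order: all 21 blue and all 17 red socks
# come out first, then black socks one at a time. Count how many picks
# it takes before two black socks are in hand.
worst_case_order = ["blue"] * 21 + ["red"] * 17 + ["black"] * 15

picked = []
for sock in worst_case_order:
    picked.append(sock)
    if picked.count("black") == 2:
        break

print(len(picked))  # prints 40
```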
Write me code for the snake game. Give me a score counter, an option to restart on C and quit on Q.
// the next line was accomplished only by GPT-4o
Add the menu to choose difficulty.
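For reference, here's a minimal sketch of what the prompt asks for - our own illustration built on Python's standard-library curses module (Unix terminals; Windows needs the windows-curses package), not either model's actual output:

```python
import curses
import random

def pick_difficulty(stdscr):
    # Simple difficulty menu; returns the frame delay in milliseconds.
    stdscr.nodelay(False)
    stdscr.clear()
    options = [("1 - Easy", 200), ("2 - Medium", 120), ("3 - Hard", 60)]
    stdscr.addstr(0, 0, "Choose difficulty:")
    for i, (label, _) in enumerate(options):
        stdscr.addstr(i + 1, 2, label)
    while True:
        key = stdscr.getch()
        if key in (ord('1'), ord('2'), ord('3')):
            return options[key - ord('1')][1]

def play(stdscr, delay):
    # Runs one game; returns 'restart' (C) or 'quit' (Q).
    curses.curs_set(0)
    stdscr.timeout(delay)  # getch() doubles as the frame timer
    h, w = stdscr.getmaxyx()
    snake = [(h // 2, w // 4 + i) for i in range(3)]  # head is snake[0]
    direction = (0, -1)  # start moving left
    food = (h // 2, w // 2)
    score = 0
    turns = {curses.KEY_UP: (-1, 0), curses.KEY_DOWN: (1, 0),
             curses.KEY_LEFT: (0, -1), curses.KEY_RIGHT: (0, 1)}
    while True:
        stdscr.clear()
        stdscr.border()
        stdscr.addstr(0, 2, f" Score: {score}  (Q = quit, C = restart) ")
        stdscr.addstr(food[0], food[1], "*")
        for y, x in snake:
            stdscr.addstr(y, x, "#")
        key = stdscr.getch()
        if key == ord('q'):
            return 'quit'
        if key == ord('c'):
            return 'restart'
        if key in turns:
            new_dir = turns[key]
            # Ignore a turn that would reverse the snake onto itself.
            if (new_dir[0] + direction[0], new_dir[1] + direction[1]) != (0, 0):
                direction = new_dir
        head = (snake[0][0] + direction[0], snake[0][1] + direction[1])
        # Game over on hitting the border or the snake's own body.
        if head in snake or head[0] in (0, h - 1) or head[1] in (0, w - 1):
            stdscr.timeout(-1)  # block until the player chooses
            stdscr.addstr(h // 2, max(0, w // 2 - 5), "GAME OVER")
            while True:
                key = stdscr.getch()
                if key == ord('q'):
                    return 'quit'
                if key == ord('c'):
                    return 'restart'
        snake.insert(0, head)
        if head == food:
            score += 1
            # A fuller version would avoid spawning food on the snake itself.
            food = (random.randint(1, h - 2), random.randint(1, w - 2))
        else:
            snake.pop()  # move forward without growing

def main(stdscr):
    while True:
        if play(stdscr, pick_difficulty(stdscr)) == 'quit':
            break

if __name__ == "__main__":
    curses.wrapper(main)
```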
Analyze the following image:
Clever trick! You still have 4 marbles, but they're no longer in the cup because you turned it upside down! They're probably scattered around on the floor or counter now!
You still have 4 marbles in the cup, even though it is now upside down and in the freezer.
The Qwen 2 model was pressure-tested against high-risk content (such as explicit materials and fraud) across multiple languages. It demonstrated safety performance on par with the renowned GPT-4o.
Moreover, Qwen 2's long-form context comprehension was tested using the Needle in a Haystack test. The model accurately extracted details from text spanning up to 128K tokens without missing any critical information.
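If you're curious what such a test looks like in practice, here's a minimal sketch of a Needle-in-a-Haystack prompt builder - our own illustration, not Qwen's actual evaluation harness; the filler text, needle wording, and depth are arbitrary choices:

```python
# Build a long document with a "needle" fact buried at a chosen depth,
# then ask the model to retrieve it. Scale total_chars up toward the
# model's 128K-token context window to stress long-context recall.
def build_haystack_prompt(needle: str, filler: str, total_chars: int, depth: float) -> str:
    haystack = (filler * (total_chars // len(filler) + 1))[:total_chars]
    position = int(total_chars * depth)  # e.g. 0.5 buries the needle mid-document
    document = haystack[:position] + " " + needle + " " + haystack[position:]
    return (
        "Read the document below and answer the question.\n\n"
        f"{document}\n\n"
        "Question: What is the secret passphrase mentioned in the document?"
    )

prompt = build_haystack_prompt(
    needle="The secret passphrase is 'blue-octopus-42'.",
    filler="The quick brown fox jumps over the lazy dog. ",
    total_chars=20_000,
    depth=0.5,
)
```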
ChatGPT 4o is around 1.5x faster than Qwen 2 and also has vision and speech recognition capabilities - an important feat. If you want a recap of the GPT-4o announcement, check out this article here.
Qwen 2 is beaten by GPT-4o in benchmarks, and although most tests were close, coding showed some important differences. Most likely, Qwen2-72B-Instruct will find its place in the research community thanks to its solid comprehension of the full 128K-token context.
The pricing model is given in AI/ML API tokens. As Qwen 2 is open-source, setting it up and maintaining it locally would have different pricing.
Now that we've highlighted some key strengths and run the tests, it's time for you to try the models on your own use case. Below, you'll find a code snippet that pits Qwen 2 against ChatGPT 4o using our API key. Throw anything at them and see how robust they are in your field!
from openai import OpenAI

def main():
    # Point the OpenAI client at the AI/ML API endpoint
    client = OpenAI(
        api_key='<YOUR_API_KEY>',
        base_url="https://api.aimlapi.com",
    )

    # Specify the two models you want to compare
    model1 = 'gpt-4o'
    model2 = 'Qwen/Qwen2-72B-Instruct'
    selected_models = [model1, model2]

    system_prompt = 'You are an AI assistant that only responds with jokes.'
    user_prompt = 'Why is the sky blue?'

    # Collect each model's answer to the same prompt
    results = {}
    for model in selected_models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {'role': 'system', 'content': system_prompt},
                    {'role': 'user', 'content': user_prompt}
                ],
            )
            message = response.choices[0].message.content
            results[model] = message
        except Exception as error:
            print(f"Error with model {model}:", error)

    # Compare the results
    print('Comparison of models:')
    print(f"{model1}: {results.get(model1, 'No response')}")
    print(f"{model2}: {results.get(model2, 'No response')}")

if __name__ == "__main__":
    main()
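A quick note on running it: the snippet assumes the official openai Python package (pip install openai, version 1.x), and the base_url argument is what redirects requests to the AI/ML API servers instead of OpenAI's.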
Qwen 2 couldn't keep up with ChatGPT 4o, and unlike Llama 3, its speed couldn't compensate for the drawbacks. Still, it's a good open-source alternative, and the cheaper output price certainly makes the model worth considering.
No matter which model you go with, prepare to be blown away by its capabilities. If raw performance in our comparison won you over, go for ChatGPT 4o. If the pricing sealed the deal for you, Qwen 2 is your best bet.
You can check our model lineup here - try any of them for yourself with our API key.