Has GPT-4 Really Gotten Worse? Evaluating the Performance of OpenAI’s Latest AI Model

Anecdotal evidence suggests GPT-4’s quality has declined recently, but empirical analysis shows a mixed picture, with abilities fluctuating across different tasks over time.

OpenAI’s Generative Pre-trained Transformer 4 (GPT-4) language model has rapidly become one of the most widely used AI systems since its launch in March 2023.

As the latest iteration of OpenAI’s GPT models, GPT-4 leverages massive datasets and computational power to achieve incredibly fluent and coherent text generation capabilities.

Many tech enthusiasts eagerly awaited GPT-4’s launch to see just how intelligent this new AI could be. Initially, it impressed users with its eloquent responses and nuanced comprehension of complex prompts.

However, in recent months some users have subjectively felt that the quality of GPT-4’s responses seems to be declining. 

Has this fantastically powerful AI really gotten worse over time? Let’s take a close look at the anecdotal experiences and empirical research on GPT-4 to find out.

Users Complain of Deteriorating Performance

On forums and social media, many users who actively interact with GPT-4 have complained that its responses appear lower quality and more inconsistent now compared to a few months ago.

For example:

  • On Hacker News, a popular discussion board for tech professionals, one user shared their experience:
    “Earlier releases [of GPT-4] through work at Microsoft Research were used to draw a unicorn in TikZ. Prompts like “draw a unicorn jumping over a rainbow” resulted in good drawings at first, but they degraded as OpenAI started focusing more on censorship and safety.”
  • On Reddit and Twitter, many observers argued that GPT-4 seems “dumber” and makes simple mistakes it did not make previously. Users feel it has a harder time remembering information provided earlier in a conversation.
  • Some analysts testing GPT-4 lament that it now seems “lazier”, frequently looping or repeating responses rather than generating novel text.

Overall, these subjective anecdotes paint a concerning picture of deteriorating performance in recent GPT-4 releases. 

However, is this impression accurate when evaluated rigorously?

Empirical Analysis Shows Mixed and Unpredictable Changes

While subjective feelings can be useful, objective empirical analysis is required to truly understand an AI system’s capabilities.

A recent preprint study from researchers at Stanford and UC Berkeley carefully analyzed multiple versions of GPT-4 to quantify changes over time.

Specifically, they evaluated GPT-4 snapshots from March 2023 versus June 2023 on a diverse set of tasks:

  • Math problems: Identifying prime numbers, and counting happy numbers.
  • Answering sensitive questions: Responding to inappropriate or unethical queries.
  • Opinion surveys: Providing opinions on subjective questions.
  • Multi-hop question answering: Answering complex questions requiring multiple inference steps using a standardized dataset.
  • Code generation: Generating Python code for simple programming problems.
  • Medical exams: Answering multiple choice questions from USMLE medical licensing exams.
  • Visual reasoning: Completing abstract visual reasoning problems from a standardized test.

The results were mixed, with GPT-4’s abilities fluctuating unpredictably across tasks between March and June.

Math Performance Declined Substantially

GPT-4’s performance on the mathematical reasoning tasks declined notably:

  • Its accuracy at identifying prime numbers dropped from ~85% down to just 35% between March and June.
  • When instructed to show step-by-step reasoning, GPT-4 ignored these prompts in June but followed them correctly in March.
  • For counting “happy numbers”, GPT-4’s accuracy similarly decreased from 84% down to 35% from March to June.
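
The two math tasks themselves are easy to pin down in code. As a point of reference for what was being tested (this is a minimal illustrative sketch, not the study’s own evaluation harness), here is what a primality check and a happy-number check look like in Python:

```python
def is_prime(n: int) -> bool:
    """Trial-division primality check."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def is_happy(n: int) -> bool:
    """A number is 'happy' if repeatedly summing the squares of its
    digits eventually reaches 1 (otherwise the sequence enters a cycle)."""
    seen = set()
    while n != 1 and n not in seen:
        seen.add(n)
        n = sum(int(d) ** 2 for d in str(n))
    return n == 1
```

Both checks are deterministic and unambiguous, which is precisely why these tasks make clean benchmarks: there is exactly one right answer for the model to produce.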

Clearly, GPT-4’s mathematical skills deteriorated substantially during this timeframe. The researchers speculated that diminished adherence to instructions like showing step-by-step work explained part of these declines.

Multi-hop Question Performance Improved

However, GPT-4 demonstrated the opposite trend on the complex multi-hop question dataset:

  • In March, its responses exactly matched the ground-truth answers only 1% of the time.
  • By June, this metric increased to 38% – a substantial improvement.

So GPT-4 seems to have enhanced its ability to synthesize information from multiple documents to answer intricate questions.
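
The “exact match” metric here is simply the fraction of model responses that equal the ground-truth answer, typically after light normalization. The study’s precise normalization rules are an assumption on my part, but the idea can be sketched in a few lines of Python:

```python
def normalize(text: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(text.lower().split())

def exact_match_rate(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    matches = sum(
        normalize(p) == normalize(r)
        for p, r in zip(predictions, references)
    )
    return matches / len(references)
```

Note how strict this metric is: a correct answer phrased differently from the reference still counts as a miss, which is part of why the March score could be as low as 1%.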

Safety Measures Strengthened, But Explanations Declined

When responding to inappropriate or unethical questions:

  • GPT-4 directly answered these sensitive queries 21% of the time in March, but only 5% of the time in June – potentially indicating strengthened safety measures.
  • However, the AI’s explanations for refusing to answer were much more cursory in June. Rather than a detailed justification, it simply responded “Sorry, I can’t assist with that”.

Other Skills Remained Stable or Improved Slightly

On the remaining tasks, GPT-4 exhibited smaller performance changes:

  • Its accuracy on USMLE medical exams declined slightly from 87% to 82%.
  • GPT-4 improved marginally on the visual reasoning problems, with exact match accuracy increasing from 25% to 27%.
  • However, it became less likely to strictly follow formatting instructions when generating code in June compared to March.
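
One plausible reading of that last point (an assumption here, since the article does not spell out the failure mode) is that the June model more often wrapped its answers in markdown code fences, making the raw output non-executable even when the code inside was correct. A defensive post-processing step is cheap to add on the application side:

```python
import re

def strip_code_fences(response: str) -> str:
    """If the response is wrapped in a markdown code fence (``` or
    ```python), return only the code inside it; otherwise return the
    response unchanged."""
    match = re.search(r"```(?:\w+)?\n(.*?)```", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()
```

Wrappers like this illustrate a broader point from the study: small formatting drifts between model versions can break downstream pipelines that assumed the old behavior.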

Interpreting GPT-4’s Fluctuating Abilities

Given these mixed results, how should we interpret claims that GPT-4 has generally gotten worse?

The researchers concluded that it is overly simplistic to make broad claims about uniformly improving or declining AI quality over time.

Instead, they underscored how an LLM’s capabilities can change drastically across different use cases within short time periods.

But what factors drive these unpredictable fluctuations?

Modifying the Model’s Parameters Can Have Unintended Consequences

Like other machine learning models, GPT-4 has many internal parameters that determine its behavior. The researchers speculate that OpenAI likely modifies these parameters between releases to try and enhance particular capabilities.

However, similar to altering a complex organism’s DNA, these changes can have unintended side effects. 

Improving GPT-4’s strength in one area may inadvertently hurt unrelated skills controlled by the same parameters.

Training Data Quality Affects Performance

GPT-4 is likely fine-tuned on new data regularly to expand its knowledge. But if this additional data is noisy or skewed, it could degrade performance on certain tasks.

For example, GPT-4 may have been fine-tuned extensively to avoid toxic content recently at the cost of mathematical reasoning capabilities. Without transparency into OpenAI’s private training process, the exact causes are impossible to confirm.

In summary, while anecdotal reports suggest GPT-4 has gotten worse, empirical analysis reveals a nuanced, unpredictable picture. We cannot make sweeping claims that the model is uniformly improving or declining.

The Risks of Black Box AI – Monitoring Critical for Reliability

The opacity around OpenAI’s training and fine-tuning procedures exacerbates these issues. GPT-4 is a proprietary black box system – not open for inspection or auditing.

This lack of transparency makes it almost impossible for users to determine why GPT-4’s behavior changes over time. We can merely speculate based on externally visible outputs.

Moreover, unpredictably fluctuating performance poses challenges for reliably integrating black box AI like GPT-4 into real-world products and services.

If an LLM’s responses to the same prompt change drastically within months, any applications relying on it may suddenly break. Continuous monitoring of production systems built atop LLMs is critical.

Researchers emphasized the need for rigorous, ongoing analysis of AI behavior using diverse test suites. Only such vigilance can surface these issues early enough to avoid disruptions.
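
As a concrete illustration of what such monitoring can look like, a minimal regression check might run a fixed suite of prompts with known answers on a schedule and flag any drift. Here, `call_model` is a hypothetical stand-in for whatever LLM client an application actually uses:

```python
def check_for_drift(call_model, test_suite):
    """Run a fixed suite of (prompt, expected_answer) pairs against the
    model and return the cases whose answers no longer match.

    call_model: callable taking a prompt string, returning the model's
    answer as a string (hypothetical stand-in for a real LLM client).
    """
    regressions = []
    for prompt, expected in test_suite:
        answer = call_model(prompt).strip().lower()
        if answer != expected.strip().lower():
            regressions.append((prompt, answer, expected))
    return regressions
```

Running a suite like this after every model update, and alerting when the regression list is non-empty, is one simple way to catch behavioral shifts before users do.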

Looking Ahead

It is clear that assuming large language models like GPT-4 are perfect, static systems is incorrect. Their capabilities can shift substantially within weeks or months.

Anecdotal impressions certainly provide valuable feedback on where change is happening. But we must complement opinions with rigorous empirical analysis to truly understand the nuances of how these AI systems evolve.

Only with rigorous testing can we separate hype from reality and make informed decisions about when advanced AI like GPT-4 is ready for real-world deployment.

The fluctuations in GPT-4 highlight that today’s largest language models remain imperfect works in progress. Moving forward, the AI community must prioritize transparency, rigorous monitoring, and setting appropriate expectations around emerging generative AI.


Saiful Emon

Emon is a tech enthusiast who loves to explore and write about the latest gadgets and innovations. Now he uses his passion and knowledge to cover topics like artificial intelligence, gaming, wearables, and the potential of computers. When he is not writing, he enjoys playing video games, watching sci-fi movies, and discovering new places.