GPT-4o IELTS Writing Evaluation Limits: What OpenAI Misses

GPT-4o · Rubric gaps · May 2026

Direct answer

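GPT-4o is the most-used IELTS Writing evaluator and one of the least calibrated. It produces detailed commentary on all four criteria: Task Response (TR), Coherence and Cohesion (CC), Lexical Resource (LR), and Grammatical Range and Accuracy (GRA). But it systematically over-rewards fluent grammar, under-penalises partial answers to the prompt, often skips the Task 1 overview requirement, and frequently gives Band 7 to essays an examiner would score at Band 6. The band for the same essay also drifts from session to session. Use GPT-4o for ideas and structure, not for deciding when to book the exam.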

Five evaluation limits

Politeness bias: avoids the harsh TR penalties examiners apply
Grammar weighting: polished sentences mask off-topic body paragraphs
Task 1 blindness: overview and key-feature selection often go unchecked
Band drift: the same essay is rescored differently across sessions
Template blindness: memorised frames are scored as "good structure"
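
Band drift is easy to measure for yourself. The sketch below assumes you have pasted the same essay into several separate sessions and written down the band each time; the function name and the sample scores are hypothetical and only illustrate the bookkeeping.

```python
from statistics import mean

def band_drift(scores):
    """Summarise how far repeated band scores for one essay spread apart."""
    if not scores:
        raise ValueError("need at least one score")
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": round(mean(scores), 2),
        "spread": max(scores) - min(scores),
    }

# Hypothetical bands recorded from four separate sessions on the same essay.
session_scores = [7.0, 6.5, 7.5, 7.0]
summary = band_drift(session_scores)
print(summary)
```

A spread of half a band or more on an unchanged essay suggests the number should not be treated as a stable score.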

GPT-4o vs examiner scoring

Scenario           | GPT-4o typical  | Examiner typical
Partial TR essay   | Band 7          | Band 6, capped by TR
Task 1 no overview | Band 6.5+       | Task Achievement capped at ~5–6
Connector-heavy CC | "Good cohesion" | Band 6 if the logic is weak
Under 250 words    | Often ignored   | TR development penalty
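
Since the length issue in the last row often goes unflagged, it can be checked before asking for any feedback. This is a minimal sketch that counts whitespace-separated words against the published minimums (150 for Task 1, 250 for Task 2); the function name and the placeholder essay are illustrative assumptions.

```python
# Minimum word counts stated in the IELTS Writing task instructions.
MIN_WORDS = {"task1": 150, "task2": 250}

def word_count_check(essay, task):
    """Count whitespace-separated words and report any shortfall."""
    count = len(essay.split())
    minimum = MIN_WORDS[task]
    return {"words": count, "minimum": minimum, "short_by": max(0, minimum - count)}

# Placeholder text standing in for a real essay.
sample = "word " * 230
print(word_count_check(sample, "task2"))
```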

See ChatGPT vs BAND9AI and why ChatGPT scores feel inaccurate.

Safer GPT-4o prompt pattern

  1. "List Task Response gaps only—no band score."
  2. "Did I address every part of the prompt? Quote missing parts."
  3. "Task 1: is there a clear overview sentence?"
  4. Cross-check the answers on an IELTS-calibrated tool.
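
The first three steps can be kept as a reusable template so no band score is requested by accident. The sketch below only builds the prompt strings; it does not call any API, and the function name and parameters are hypothetical.

```python
def build_feedback_prompts(task_prompt, essay, task="task2"):
    """Return one prompt string per question, none of which asks for a band."""
    questions = [
        "List Task Response gaps only, no band score.",
        "Did I address every part of the prompt? Quote missing parts.",
    ]
    if task == "task1":
        # The overview check only applies to Task 1 reports.
        questions.append("Task 1: is there a clear overview sentence?")
    context = f"IELTS Writing prompt:\n{task_prompt}\n\nEssay:\n{essay}"
    return [f"{context}\n\n{question}" for question in questions]

# Illustrative placeholders; paste the real prompt and essay instead.
prompts = build_feedback_prompts("<task prompt here>", "<essay here>", task="task1")
for prompt in prompts:
    print(prompt)
    print("-" * 40)
```

Sending each question separately keeps the reply focused on one gap at a time; step 4 still has to happen outside GPT-4o.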

Key takeaways

  • GPT-4o commentary ≠ examiner band.
  • Fluency and grammar bias inflates scores by 0.5–1.5 bands.
  • Task 1 overview and partial TR are the biggest misses.
  • Ask for feedback without band scores; validate on calibrated mocks.
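
If you already have a GPT-4o band, the inflation estimate above can be turned into a rough range. This is simple arithmetic based on this article's 0.5–1.5 figure, not an official conversion; the function name is hypothetical.

```python
def likely_examiner_range(gpt_band, low_inflation=0.5, high_inflation=1.5):
    """Subtract the assumed inflation range from a GPT-4o band."""
    lower = max(0.0, gpt_band - high_inflation)
    upper = max(0.0, gpt_band - low_inflation)
    return (lower, upper)

print(likely_examiner_range(7.0))  # (5.5, 6.5)
```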

FAQ

GPT-4o gives plausible commentary, not calibrated bands; its scores are often 0.5–1.5 bands optimistic.
The main causes are politeness and grammar bias; it under-penalises TR gaps and memorised templates.
Use it for brainstorming and TR-gap lists only, and validate on IELTS mocks before booking.
