Why AI and Examiner IELTS Scores Disagree

Score calibration · Examiner rubrics · May 2026

Direct answer

AI and examiner scores disagree because they do not measure the same construct under the same constraints. AI scoring optimizes transcript-level proxies (length, vocabulary diversity, grammar error rate), while examiners apply holistic band descriptors, penalize memorization, and score performance on unfamiliar prompts under time pressure. Psychologically, students anchor on whichever number feels better. The fix is to map which criterion each scorer actually evaluated.

Score disagreement psychology: why both numbers feel true

When AI says Band 7 and a mock examiner says Band 6, your brain seeks coherence. Confirmation bias makes you trust the higher score if you felt fluent and the lower one if you felt shaky. Neither feeling is evidence.

Optimism anchor: AI praise after practice becomes your "real level."
Threat anchor: One harsh mock outweighs ten mild AI checks.
Resolution error: Averaging AI + human hides the leaking criterion.
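
The resolution error is easiest to see with numbers. The sketch below uses invented per-criterion bands (the criterion names and every score are illustrative assumptions, not data from real tests) and a plain arithmetic mean rather than official IELTS rounding. It shows how a blended overall score can look reassuring while one criterion sits well below the AI's estimate.

```python
# Hypothetical per-criterion bands; names and numbers are invented for illustration.
ai_scores = {"fluency": 7.0, "vocabulary": 7.0, "grammar": 7.0, "task_response": 7.0}
examiner_scores = {"fluency": 7.0, "vocabulary": 7.0, "grammar": 6.5, "task_response": 5.5}


def mean(values):
    """Plain arithmetic mean (official IELTS rounding is deliberately ignored)."""
    values = list(values)
    return sum(values) / len(values)


ai_overall = mean(ai_scores.values())              # 7.0
examiner_overall = mean(examiner_scores.values())  # 6.5
blended = (ai_overall + examiner_overall) / 2      # 6.75

# Per-criterion gaps (AI minus examiner) show where the disagreement lives.
gaps = {name: ai_scores[name] - examiner_scores[name] for name in ai_scores}
leaking = max(gaps, key=gaps.get)

print(f"blended overall: {blended}")
print(f"largest gap: {leaking} ({gaps[leaking]} bands)")
```

In this made-up case the blended 6.75 suggests a small disagreement, while the per-criterion view shows a 1.5-band gap concentrated in task response.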

What each scorer is actually measuring

Dimension       AI proxy          Examiner focus
Fluency         Pace, fillers     Development, spontaneity
Task response   Keyword overlap   Position, coverage
Penalties       Often absent      Memorization, templates

See why AI overestimates band scores.

Protocol when scores disagree by 0.5–1.0 band

  1. Re-score once per criterion before looking at overall band.
  2. Repeat on a fresh prompt; structural gaps persist.
  3. Log delivery praise vs task-response flags.
  4. Calibrate with anchor scripts.
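
Steps 1 and 2 of the protocol can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a scoring tool: the function names, the criterion keys, and the sample scores are all hypothetical, and the 1.0-band threshold is the article's own rule of thumb for a structural mismatch rather than an official cutoff. Steps 3 and 4 are not covered.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, float]

# Rule-of-thumb threshold from this article: a gap of 1.0+ bands that recurs
# on fresh tasks is treated as structural. Not an official IELTS cutoff.
STRUCTURAL_GAP = 1.0


def criterion_gaps(ai: Scores, examiner: Scores) -> Scores:
    """Step 1: compare per criterion (AI minus examiner), ignoring overall bands."""
    return {name: ai[name] - examiner[name] for name in ai if name in examiner}


def persistent_gaps(attempts: List[Tuple[Scores, Scores]],
                    threshold: float = STRUCTURAL_GAP) -> List[str]:
    """Step 2: keep only criteria whose gap reaches the threshold on every prompt."""
    if not attempts:
        return []
    per_attempt = [criterion_gaps(ai, examiner) for ai, examiner in attempts]
    # Only criteria scored by both sides on every attempt can be compared.
    shared = set(per_attempt[0]).intersection(*per_attempt[1:])
    return sorted(
        name for name in shared
        if all(abs(gaps[name]) >= threshold for gaps in per_attempt)
    )


# Hypothetical logs: (AI scores, examiner scores) for the original and a fresh prompt.
attempts = [
    ({"fluency": 7.0, "task_response": 7.0}, {"fluency": 6.5, "task_response": 6.0}),
    ({"fluency": 7.0, "task_response": 7.0}, {"fluency": 7.0, "task_response": 5.5}),
]
print(persistent_gaps(attempts))  # ['task_response']
```

Requiring the gap on every attempt is a deliberately strict choice: the 0.5-band fluency gap on the first prompt disappears on the second, so only task response is flagged.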

Key takeaways

  • Disagreement is usually criterion mismatch, not random harshness.
  • Psychology pushes you toward the number that matches your mood.
  • Never average AI and human; diagnose per descriptor.
  • Fresh-prompt retests reveal true calibration gaps.

FAQ

No—see false AI confidence and isolate criteria.
Assume certified mocks are closer until blind re-tests prove otherwise.
±0.5 after one task is common; 1.0+ on fresh tasks signals structural mismatch.
