Why AI and Examiner IELTS Scores Disagree
Score calibration · Examiner rubrics · May 2026
AI and examiner scores disagree because they are not measuring the same construct under the same constraints. AI scorers optimize transcript-level proxies (length, vocabulary diversity, grammar error rate), while examiners apply holistic band descriptors, penalize memorization, and score performance under novelty and time pressure. Psychologically, students anchor on whichever number feels better. The fix is to map which criterion each scorer actually evaluated.
Score disagreement psychology: why both numbers feel true
When AI says Band 7 and a mock examiner says Band 6, your brain seeks coherence. Confirmation bias makes you trust the higher score if you felt fluent, and the lower one if you felt shaky. Neither feeling is evidence.
What each scorer is actually measuring
| Dimension | AI proxy | Examiner focus |
|---|---|---|
| Fluency | Pace, fillers | Development, spontaneity |
| Task response | Keyword overlap | Position, coverage |
| Penalties | Often absent | Memorization, templates |
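To make the "AI proxy" column concrete, here is a minimal sketch of what transcript-level proxies could look like. It assumes the proxies resemble word count, vocabulary diversity (type-token ratio), and filler rate; the function name, the filler list, and the sample sentence are hypothetical and do not describe any real scoring product.

```python
import re

# Hypothetical filler list; a real scorer's list is unknown.
FILLERS = {"um", "uh", "er", "erm"}

def transcript_proxies(transcript: str) -> dict:
    """Compute surface-level proxies from a transcript (illustrative only)."""
    words = re.findall(r"[a-z']+", transcript.lower())
    total = len(words)
    if total == 0:
        return {"word_count": 0, "type_token_ratio": 0.0, "filler_rate": 0.0}
    return {
        "word_count": total,
        # Share of distinct words: a rough stand-in for vocabulary diversity.
        "type_token_ratio": len(set(words)) / total,
        # Share of words that are fillers: a rough stand-in for hesitation.
        "filler_rate": sum(w in FILLERS for w in words) / total,
    }

print(transcript_proxies("Um I think I think cities are um crowded"))
```

Nothing in these numbers shows whether the answer was memorized or whether it addressed the prompt, which is where the examiner column of the table comes in.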
Protocol when scores disagree by 0.5–1.0 band
- Re-score once per criterion before looking at the overall band.
- Repeat the comparison on a fresh prompt; structural gaps persist across prompts.
- Log where feedback praises delivery but flags task response.
- Calibrate against anchor scripts (sample responses with agreed band scores).
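The per-criterion comparison in this protocol can be sketched as a small script. This is an illustration under stated assumptions: the criterion names, the 0.5-band threshold, and all scores below are made up, and `criterion_gaps` is a hypothetical helper, not part of any scoring tool.

```python
def criterion_gaps(ai: dict, examiner: dict, threshold: float = 0.5) -> list:
    """Return (criterion, ai_band, examiner_band, gap) for every shared
    criterion where the scorers differ by at least `threshold`,
    largest absolute gap first (ties broken by criterion name)."""
    gaps = []
    for criterion in ai.keys() & examiner.keys():
        gap = ai[criterion] - examiner[criterion]
        if abs(gap) >= threshold:
            gaps.append((criterion, ai[criterion], examiner[criterion], gap))
    return sorted(gaps, key=lambda g: (-abs(g[3]), g[0]))

# Hypothetical scores, invented for illustration only.
ai_scores = {"fluency": 7.0, "task_response": 7.0, "vocabulary": 7.0, "grammar": 7.0}
examiner_scores = {"fluency": 7.0, "task_response": 5.5, "vocabulary": 6.5, "grammar": 6.5}

for criterion, ai_band, ex_band, gap in criterion_gaps(ai_scores, examiner_scores):
    print(f"{criterion}: AI {ai_band} vs examiner {ex_band} (gap {gap:+.1f})")
```

In this made-up case the largest gap sits in task response, while fluency matches exactly. Averaging the two overall bands would hide that pattern, which is why the comparison runs per criterion first.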
Key takeaways
- Disagreement is usually criterion mismatch, not random harshness.
- Psychology pushes you toward the number that matches your mood.
- Never average the AI and human scores; diagnose per descriptor.
- Fresh-prompt retests reveal true calibration gaps.