Why AI IELTS Scores Vary Between Attempts
AI scoring · Calibration · May 2026
Direct answer
AI IELTS scores vary between attempts because most tools are not calibrated examiners. They are probabilistic language models that re-interpret the same text under different prompts, temperatures, and implicit rubrics. Resubmitting an unchanged essay to ChatGPT can swing the Task Response score by ±1.0 band. Criterion-locked systems reduce drift, but their scores still move if you change word count, task type, or model version. Treat swings above ±0.5 as noise until you fix a specific descriptor leak on fresh work.
Four drivers of score swing
- Stochastic sampling: Higher temperature = more generous or harsh adjectives on the same grammar.
- Prompt drift: "Grade my essay" vs "Band 7 Task 2" activates different implicit standards.
- Rubric anchoring: Models without IELTS descriptors default to school-essay or TOEFL norms.
- Input micro-changes: Pasting with/without the question stem shifts Task Response weight.
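The stochastic-sampling driver can be illustrated with a toy model. The sketch below is an assumption-heavy simplification: real chat models sample tokens rather than bands, and the `logits` values are invented for illustration. It only shows the general mechanism, that a higher temperature spreads probability across neighbouring outcomes, so repeated runs on the same text agree less often.

```python
import math

def band_distribution(logits, temperature):
    """Temperature-scaled softmax over hypothetical per-band preferences."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return {band: w / total for band, w in zip(logits, weights)}

# Invented preferences of a model for one unchanged essay (not real data).
logits = {6.0: 1.0, 6.5: 2.0, 7.0: 1.5}

cold = band_distribution(logits, temperature=0.2)  # near-deterministic
warm = band_distribution(logits, temperature=1.5)  # more spread out

for name, dist in (("cold", cold), ("warm", warm)):
    print(name, {band: round(p, 2) for band, p in dist.items()})
```

In this toy setup the low-temperature run concentrates most of its probability on one band, while the high-temperature run leaves substantial probability on the bands either side of it.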
What swing ranges mean in practice
| Swing size | Likely cause | Action |
|---|---|---|
| ±0.5 band | Normal model noise on generic chat | Track criterion comments, not headline number |
| ±1.0 band | Prompt or rubric changed between runs | Lock one tool + one prompt template |
| 2+ bands same text | No rubric; model hallucinating bands | Switch to criterion-based AI |
See calibration drift in AI mocks and comparing multiple AI scores.
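The swing table can be turned into a small triage helper. This is a minimal sketch under stated assumptions: the function name `classify_swing` is hypothetical, "swing" is read as the gap between the highest and lowest score across repeated runs on identical text, and values that fall between table rows (for example 1.5) are grouped with the ±1.0 row.

```python
def classify_swing(scores):
    """Map the spread of repeated scores on one essay to a row of the table."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to measure a swing")
    spread = max(scores) - min(scores)
    if spread <= 0.5:
        return "normal model noise: track criterion comments, not the headline number"
    if spread < 2.0:
        # Assumption: anything between the first and last row counts as drift.
        return "prompt or rubric drift: lock one tool and one prompt template"
    return "no rubric: switch to criterion-based AI"

print(classify_swing([6.5, 7.0, 6.5]))
print(classify_swing([6.0, 7.0]))
print(classify_swing([5.5, 7.5]))
```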
Stable scoring protocol
- One tool, one rubric template, zero temperature where available.
- Always include the full Task 2 question in the submission.
- Score fresh essays weekly—never chase re-runs on identical text.
- Validate with a human mock before trusting a headline jump.
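The first two protocol steps can be sketched as a locked submission builder. Everything here is illustrative: `TEMPLATE`, `SETTINGS`, `build_submission`, and `template_fingerprint` are hypothetical names, the template wording is an assumption, and the result is a plain dictionary rather than a call to any particular tool's API. The point is only that the prompt text and temperature are fixed in one place and the question stem is always required.

```python
import hashlib

# Assumed wording; what matters is that it never changes between runs.
TEMPLATE = (
    "You are scoring an IELTS Writing Task 2 essay.\n"
    "Score each criterion separately: Task Response, Coherence and Cohesion, "
    "Lexical Resource, Grammatical Range and Accuracy.\n\n"
    "Question:\n{question}\n\nEssay:\n{essay}\n"
)

# Zero temperature where the tool exposes it.
SETTINGS = {"temperature": 0.0}

def build_submission(question, essay):
    """Return the locked prompt and settings; refuse a missing question stem."""
    if not question.strip():
        raise ValueError("include the full Task 2 question")
    prompt = TEMPLATE.format(question=question.strip(), essay=essay.strip())
    return {"prompt": prompt, "settings": dict(SETTINGS)}

def template_fingerprint():
    """Short hash of the template, logged next to each score to detect drift."""
    return hashlib.sha256(TEMPLATE.encode("utf-8")).hexdigest()[:12]
```

Recording the fingerprint alongside each weekly score makes it easy to spot a result that was produced with an edited template and should not be compared with earlier ones.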
Key takeaways
- Same-text resubmits measure model noise, not improvement.
- ±0.5 swings are common on generic chat; ±1.0+ signals rubric drift.
- Lock prompt, task stem, and tool before tracking progress.
- Compare AI bands to examiner reality—see AI vs examiner disagreement.
FAQ
Yes, on generic chat tools with no fixed rubric. On criterion-locked systems, swings above ±0.5 usually signal prompt or input inconsistency, not real improvement.
No—re-scoring the same text chases model noise. Fix one criterion leak, write a fresh essay, then compare.
Trained examiners anchor to public band descriptors and inter-rater checks. AI without calibration drifts more—see why AI and examiner scores disagree.