Why AI IELTS Scores Vary Between Attempts

AI scoring · Calibration · May 2026

Direct answer

AI IELTS scores vary between attempts because most tools are not calibrated examiners. They are probabilistic language models that re-interpret the same text under different prompts, temperatures, and implicit rubrics. Resubmitting an unchanged essay to ChatGPT can swing the Task Response score by ±1.0 band. Criterion-locked systems reduce drift, but their scores still move if you change word count, task type, or model version. Treat swings above ±0.5 as noise, not progress, until you have fixed a specific descriptor leak and tested the fix on fresh work.

Four drivers of score swing

Stochastic sampling: Higher temperature makes the wording of each run more variable, so the same grammar can be described generously in one run and harshly in the next.
Prompt drift: "Grade my essay" and "Band 7 Task 2" activate different implicit standards.
Rubric anchoring: Models without IELTS descriptors default to school-essay or TOEFL norms.
Input micro-changes: Pasting the essay with or without the question stem shifts how much weight Task Response gets.
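
The stochastic sampling driver can be illustrated with a toy model. This is a minimal sketch, not how any particular tool scores essays: real language models sample word tokens rather than band numbers, and the `BANDS` and `TOY_LOGITS` values below are hypothetical numbers chosen only to show the effect of temperature.

```python
import math
import random

# Hypothetical values: candidate bands and the model's raw preference for each.
BANDS = [5.5, 6.0, 6.5, 7.0, 7.5]
TOY_LOGITS = [0.5, 2.0, 3.0, 2.2, 0.4]


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the peak."""
    scaled = [value / temperature for value in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_band(temperature, rng):
    """Pick one band. Temperature 0 is treated as greedy (always the top choice)."""
    if temperature == 0:
        return BANDS[TOY_LOGITS.index(max(TOY_LOGITS))]
    probs = softmax_with_temperature(TOY_LOGITS, temperature)
    return rng.choices(BANDS, weights=probs, k=1)[0]


def band_spread(temperature, runs=200, seed=0):
    """Gap between the highest and lowest band seen across repeated runs."""
    rng = random.Random(seed)
    samples = [sample_band(temperature, rng) for _ in range(runs)]
    return max(samples) - min(samples)
```

In this toy setup, `band_spread(0)` is always zero because the top-ranked band is chosen every time, while a temperature of 1.0 lets lower-ranked bands appear across repeated runs on the same input.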

What swing ranges mean in practice

Swing size | Likely cause | Action
±0.5 band | Normal model noise on generic chat | Track criterion comments, not headline number
±1.0 band | Prompt or rubric changed between runs | Lock one tool + one prompt template
2+ bands same text | No rubric; model hallucinating bands | Switch to criterion-based AI
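
The swing ranges above can be turned into a small lookup. This is a sketch under stated assumptions: swing is measured as the largest deviation from the first run on identical text, and a swing of 1.5 bands, which the ranges above do not cover, is placed in the middle category. The function name `classify_swing` is illustrative only.

```python
def classify_swing(scores):
    """Classify repeated scores for the SAME text.

    scores: list of band scores, first run first.
    Returns (swing, likely_cause, action).
    """
    if len(scores) < 2:
        raise ValueError("Need at least two runs on the same text")

    baseline = scores[0]
    # Assumption: swing = largest absolute deviation from the first run.
    swing = max(abs(score - baseline) for score in scores[1:])

    if swing <= 0.5:
        return (swing,
                "Normal model noise on generic chat",
                "Track criterion comments, not headline number")
    if swing < 2.0:
        # Assumption: 1.5 is grouped with the 1.0 category.
        return (swing,
                "Prompt or rubric changed between runs",
                "Lock one tool + one prompt template")
    return (swing,
            "No rubric; model hallucinating bands",
            "Switch to criterion-based AI")
```

For example, `classify_swing([6.5, 7.0, 6.5])` falls in the first category, while `classify_swing([6.0, 8.0])` falls in the last.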

See calibration drift in AI mocks and comparing multiple AI scores.

Stable scoring protocol

  1. One tool, one rubric template, zero temperature where available.
  2. Always include the full Task 2 question in the submission.
  3. Score fresh essays weekly—never chase re-runs on identical text.
  4. Validate with a human mock before trusting a headline jump.
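
The first two steps of the protocol can be sketched as a locked submission builder. Everything here is an assumption for illustration: `ScoringConfig`, `build_submission`, and the `{question}` and `{essay}` placeholders are hypothetical names, and no real scoring tool's API is called. The fingerprint simply makes it visible when the tool, template, or temperature has changed between weekly runs.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ScoringConfig:
    """One tool, one rubric template, one temperature (hypothetical structure)."""
    tool: str
    rubric_template: str  # must contain {question} and {essay} placeholders
    temperature: float = 0.0

    def fingerprint(self) -> str:
        """Short hash of the settings; it changes if any setting changes."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def build_submission(config: ScoringConfig, question: str, essay: str) -> dict:
    """Assemble a submission that always includes the full task question."""
    if not question.strip():
        raise ValueError("Include the full Task 2 question in the submission")
    prompt = config.rubric_template.format(
        question=question.strip(), essay=essay.strip()
    )
    return {
        "prompt": prompt,
        "temperature": config.temperature,
        "config_id": config.fingerprint(),
    }
```

Storing `config_id` next to each weekly score means two scores are only compared when their identifiers match; a mismatch suggests the change in band may come from the setup rather than the writing.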

Key takeaways

  • Same-text resubmits measure model noise, not improvement.
  • ±0.5 swings are common on generic chat; ±1.0+ signals rubric drift.
  • Lock prompt, task stem, and tool before tracking progress.
  • Compare AI bands to examiner reality—see AI vs examiner disagreement.

FAQ

Score swings between attempts are normal on generic chat tools with no fixed rubric. On criterion-locked systems, swings above ±0.5 usually signal prompt or input inconsistency, not real improvement.
Resubmitting the same essay does not help: re-scoring the same text chases model noise. Fix one criterion leak, write a fresh essay, then compare.
Trained examiners anchor to public band descriptors and inter-rater checks, which keeps their scores more stable. AI without calibration drifts more; see why AI and examiner scores disagree.
