AI Grading and Feedback: What to Hand Off and What to Sign
A Sunday stack of 55 essays with AI in the loop: rubric-anchored feedback prompts, an anonymization workflow, and the grading line AI never crosses.
Can AI grade student work?
AI can draft feedback on student writing — comments anchored to your rubric that you edit and sign. It should not assign the grade. Automated essay scoring has documented reliability and bias problems, and a grade is a professional judgment with a name attached: yours.
Our AI for teachers guide draws this line in one sentence: AI can draft the comment; it doesn't decide the number. This article is that sentence stretched over a worked example. Take a realistic stack: 55 ninth-grade persuasive essays, second drafts, due back Monday, one Sunday evening to get through them. Each section below is a stage of that stack going down: rubric setup, anonymization, the AI first pass, your read, the grade decision. A little over four hours, instead of the usual two weeknights plus a stolen planning period.
Stage one: the rubric goes in before any essay does
The difference between AI feedback that helps and AI feedback that wastes a Sunday is whether the model has your actual rubric. Without it you get universal writing advice — "consider strengthening your thesis" — which fits every essay ever written and improves none of them. So the session starts with no student work at all:
Rubric setup: "Here is my 4-band rubric for a ninth-grade persuasive essay [paste the full text of every band descriptor, not a summary]. Before I give you any essays: restate each band in one sentence, and tell me where bands 2 and 3 will be hardest to tell apart. Don't evaluate anything yet."
That last demand is the useful one. If the model can't articulate where your bands blur, its comments won't respect the boundary either — better to learn that now than thirty essays in. Correct anything it gets wrong, the way you'd calibrate with a co-teacher before splitting a stack.
If the prompt mechanics feel unfamiliar — why the full descriptor text goes in, how to push back on a bad response — the fundamentals live in our ChatGPT for teachers guide and carry over directly.
Stage two: how to anonymize student essays before AI sees them
A persuasive essay is one of the most identifying documents a fourteen-year-old produces. The name is in the MLA header, the school shows up in the hook, a cousin appears in the anecdote — and the topic itself (the essay arguing for cheaper insulin, written by the kid who wears a glucose monitor) can identify a student as surely as a name. None of that belongs in a general-purpose AI tool. Before any essay gets pasted:
- Work on copies, never originals. One folder, named so you can't confuse it with the real set.
- Find-and-replace each student's name with a code — E01 through E55 in gradebook order — and keep the key in your gradebook or on paper. Never inside the AI tool.
- Then catch what find-and-replace misses: headers and footers, self-references in the text ("as a freshman at Jefferson"), and other people's names, whether classmates, teachers, or siblings. Swap in [my school], [a friend], [a family member].
- Some essays stay out entirely. The one about a parent's arrest, the one disclosing a diagnosis — coded or not, those get handled by hand. In a class set of 55, expect one or two.
Why the caution runs deeper than habit — data terms, district requirements, the legal lines — is covered in our guide to AI and student data privacy. For a class set of digital essays, budget about twenty-five minutes for this pass; longer if you're working from scans.
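If you're comfortable running a short script, the find-and-replace part of this pass can be sketched in Python. This is a minimal sketch, not a privacy tool: the roster, the essay text, the swap list, and the function names build_key and anonymize are all made up for illustration. It only catches strings you tell it about, so every copy still needs your own read for topic-level details.

```python
import re

def build_key(roster):
    """Map each name to a code (E01, E02, ...) in the order given."""
    return {name: f"E{i:02d}" for i, name in enumerate(roster, start=1)}

def anonymize(text, student_name, code, swaps=None):
    """Return a copy of text with the student's name and listed details replaced."""
    # Full name first, then each part, so a bare first or last name is caught too.
    # Matching ignores case, so a name that is also a common word ("Will", "May")
    # will over-match; check those copies by hand.
    for part in [student_name] + student_name.split():
        text = re.sub(r"\b" + re.escape(part) + r"\b", code, text, flags=re.IGNORECASE)
    # Details you spotted while reading: school, teachers, classmates, siblings.
    for detail, placeholder in (swaps or {}).items():
        text = re.sub(re.escape(detail), placeholder, text, flags=re.IGNORECASE)
    return text

# Hypothetical roster in gradebook order. The key stays with you, never in the AI tool.
roster = ["Maria Lopez", "Dev Patel"]
key = build_key(roster)

# Hypothetical essay copy and the swaps chosen for it.
essay = (
    "Maria Lopez\nMs. Rivera\nEnglish 9\n\n"
    "At Jefferson High, lunch is too short. Maria's plan is simple."
)
swaps = {"Jefferson High": "[my school]", "Ms. Rivera": "[a teacher]"}

clean = anonymize(essay, "Maria Lopez", key["Maria Lopez"], swaps)
print(clean)
```

The script works on a string you hand it, so it fits the copies-only rule: feed it text from the copy folder and save the result back there, never over the originals.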
Stage three: the first pass — where AI grading actually saves time
With the rubric loaded and the names gone, calibrate on one essay you've already graded by hand:
Calibration pass: "Here's one essay (E07). Write three comments: one strength and one priority fix, each tied to a named rubric band and quoting the line it's about, plus the one thing this writer should do in the first ten minutes of revision. Write to the student in plain second person. Do not suggest a score."
Compare the output against the comments you actually wrote on E07. If the register is off — too formal, too gushing, hedged into mush — fix it once and make it stick:
Register fix: "Too stiff. I write to students in second person, I ask one genuine question per essay, and I never praise anything I can't quote. Redo E07's comments in that voice and keep it for the rest of this session."
Then run the stack in batches:
Per-essay comments: "Same format for every essay I paste: one strength tied to a named band, quoting the line; one priority fix tied to a band; one next step the writer can act on in the first ten minutes of revision. No score, no grade, ever — if you produce a number I will ignore it."
Fifty-five essays' worth of first-pass comments takes about fifty minutes of pasting and skimming. The comments come back consistent, genuinely rubric-tied, and quoting real lines; that's the value. They also inflate praise (by essay twenty, everything is "compelling"), and they converge: two essays with very different evidence, a CDC citation and an uncle's story, can draw near-identical evidence comments. The grammar feedback is the least trustworthy part: expect it to flag intentional fragments as errors and slide right past real comma splices. Don't try to prompt all of that away; catching it is what the next stage is for.
Stage four: your read
This is the stage that doesn't compress, only speeds up. Every essay gets read — actually read — with the draft comment beside it, and each resolves one of three ways: sign it, fix it, or scrap it. Expect a split along these lines: about half the comments go out as drafted, a third get edited, a handful get rewritten from scratch — including the essays that never touched the tool.
What the edits catch is what a model can't. The writer who abandoned the five-paragraph structure on purpose — and made it work — gets an AI "priority fix" that would coach the risk right back out of her. The draft with two comma splices, down from eleven, makes "watch your comma splices" accurate and wrong; the comment worth signing names the improvement first. The model has only ever seen this draft. You've seen the last three.
Before closing the session, harvest the stack for Tuesday:
Pattern scan: "Across all the essays I pasted, what are the four most common weaknesses, each tied to a rubric band, with a count of how many essays show it? I'm planning a fifteen-minute revision mini-lesson."
When it turns out half the stack never addressed a counterargument, that's Tuesday's mini-lesson, found in ninety seconds.
Stage five: why AI never assigns the grade — the accuracy evidence
The grade is the one stage where AI gets no vote — and not just as district-policy caution. Automated essay scoring is the oldest, best-studied corner of AI in education, and its track record is why the line exists.
In a 2012 study in Applied Measurement in Education, ETS researchers Bridgeman, Trapani, and Attali compared the e-rater scoring engine against human raters across roughly 132,000 GRE and TOEFL essays. On average, machine and human scores agreed about as well as two humans do — and underneath that average, the machine scored essays from some language groups measurably higher than human raters did, and others lower. The aggregate looked fine while subgroups drifted. A teacher can be biased too, but a teacher can be asked why; a scoring model just returns the number.
Newer models haven't closed the case. A 2024 study in Heliyon tested GPT-4, GPT-3.5, and Claude 2 on 1,730 seventh-grade essays: even GPT-4's best agreement with human scorers (a quadratic weighted kappa of about 0.57) fell short of the human-to-human baseline (0.66), and its weakest dimension by far was conventions — grammar and mechanics, at 0.22 — the very thing that looks most automatable. Which matches what the comma splices showed in stage three. Even ETS, by its own description, builds review into machine scoring: e-rater automatically flags off-topic or inconsistent responses "so that they can be set aside for review."
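For readers who want to know what a quadratic weighted kappa measures: it compares two sets of scores on the same essays, penalizes each disagreement by the square of its distance (off by two bands costs four times as much as off by one), and scales the result so that 1.0 means perfect agreement and 0 means no better than chance. Below is a minimal sketch of the standard formula in plain Python. The function name is ours, and the cited studies may have computed their figures with different tooling.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two lists of integer scores on the same essays."""
    n_cat = max_score - min_score + 1
    n = len(rater_a)

    # observed[i][j]: number of essays rater A put in band i and rater B in band j.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # How often each rater used each band.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]

    disagreement = 0.0  # weighted disagreement actually seen
    by_chance = 0.0     # weighted disagreement expected if the raters were independent
    for i in range(n_cat):
        for j in range(n_cat):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            disagreement += weight * observed[i][j]
            by_chance += weight * hist_a[i] * hist_b[j] / n

    if by_chance == 0:
        # Both raters gave every essay the same single score.
        return 1.0
    return 1.0 - disagreement / by_chance

# Sanity checks on a 4-band scale with made-up scores.
same = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
opposite = quadratic_weighted_kappa([1, 2, 3, 4], [4, 3, 2, 1], 1, 4)
print(same, opposite)
```

Read against that scale, 0.57 and 0.66 are both well short of perfect agreement, and 0.22 on conventions sits much closer to chance than to a second human reader.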
So the number works like this: for each essay, you've now read the writing and a comment that quotes its evidence against the bands. You place it in a band yourself, and the grade goes in the gradebook under your name. The AI made that judgment faster by pre-surfacing evidence. It never made the judgment.
What you tell them Monday
Say it plainly, before the essays go back: "Your written feedback started as an AI draft against our rubric. I read every essay and every comment, changed about half, and every word you're getting is one I signed. The grade came from me."
The case for disclosure isn't only ethics — hiding it forfeits the best AI lesson you'll teach all year. Most of these students are required to disclose their own AI use on this very assignment; if their teacher grades with undisclosed AI while collecting their disclosure forms, the lesson that actually lands is that the rule is theater. And showing them one comment you rewrote — here's what the model said, here's why it was wrong about your essay — is a live demonstration of reviewing AI output instead of swallowing it, which is the exact skill their own essays keep failing to show.
Expect the hand: "So a robot graded me?" No — a robot drafted some sentences, and the person who has read your writing since September decided what was true.
When fast beats perfect — and when it doesn't
The stack in this walkthrough is second drafts, with a final due in ten days. That detail decides the whole trade.
For formative work — drafts, revisions, anything a student will act on — feedback has a shelf life. Comments that arrive Tuesday, while the revision is live, get used; immaculate comments that arrive after the final is submitted get recycled with the paper they're stapled to. AI-drafted, teacher-signed comments in 48 hours beat handcrafted ones in two weeks, not because they're better but because they exist in time to matter. (For the even faster end of formative — daily checks that aren't essays at all — see AI exit tickets.)
For summative work, the trade flips. On a final draft, most students read the number, maybe the first comment, and stop — so comment speed buys almost nothing, and judgment accuracy is the entire job. Grades of record, portfolio pieces, anything contested in a parent conference: slow read, your own comments, no pressure to involve AI at all. If you're rationing energy, spend the AI on the drafts and yourself on the finals.
How the stack ends
Add up the Sunday: thirty minutes of rubric setup and calibration, twenty-five minutes of anonymizing, fifty minutes of first-pass batching, and about two and a half hours of reading, which is roughly three minutes per essay instead of the usual seven or eight. Fifty-five grades, all assigned by hand against the rubric. About half the comments signed as drafted, a third edited, a handful written by hand, two essays never pasted anywhere. The E-numbers go back to being names, the grades go into the gradebook, and the evening ends around 11:30 instead of bleeding into Wednesday. The essays go back Monday morning.
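To check the arithmetic against your own stack, the tally can be written out as a few lines of Python. The stage times and the seven-to-eight-minute hand-grading range are the figures from this walkthrough; swap in your own numbers, since yours will differ.

```python
ESSAYS = 55

# Minutes per stage, as given in this walkthrough.
stages = {
    "rubric setup and calibration": 30,
    "anonymizing": 25,
    "first-pass batching": 50,
    "reading": 150,
}

total_minutes = sum(stages.values())
hours, minutes = divmod(total_minutes, 60)
reading_per_essay = stages["reading"] / ESSAYS

# Hand-grading comparison at seven and eight minutes per essay.
by_hand_low = ESSAYS * 7
by_hand_high = ESSAYS * 8

print(f"With AI drafts: {hours} h {minutes} min total")
print(f"Reading time per essay: {reading_per_essay:.1f} min")
print(f"By hand: {by_hand_low} to {by_hand_high} min")
```

The total comes to 4 hours 15 minutes, against 385 to 440 minutes (about six and a half to seven and a third hours) for the same stack commented entirely by hand.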
Frequently asked questions
Can teachers use AI to grade essays?
AI can draft feedback comments on essays — anchored to your rubric and edited by you — but it should not assign the grade. Automated essay scoring has documented reliability and bias problems, and a grade of record is a professional judgment a teacher signs, not a model output.
Is AI grading accurate?
Not reliably enough to trust with grades. A 2024 study in Heliyon found GPT-4's agreement with human essay scores stayed below the agreement between two human raters, and it was weakest on grammar and mechanics. A 2012 study of ETS's e-rater found the machine scored some language groups measurably differently than human raters did, even while averages looked fine.
Should I tell students if AI helped write their feedback?
Yes. A one-sentence disclosure works: the comments started as AI drafts against the rubric, you read and revised every one, and the grade came from you. If you require students to disclose their AI use, grading their work with undisclosed AI teaches them the rule isn't real.
How do I anonymize student essays before using AI?
Work on copies, find-and-replace each name with a code (E01, E02...), and keep the key in your gradebook — never in the AI tool. Then check what find-and-replace misses: names in MLA headers, self-identifying details inside the essay, and classmates named in anecdotes. Deeply personal essays should stay out of the tool entirely.
Is grading essays with AI faster than grading by hand?
For a full class set, yes — but not at first. Rubric setup, calibration, and anonymizing cost 45–60 minutes up front. The savings come per essay: reading with a draft comment already on screen and editing it is roughly half the time of composing every comment from scratch, so the break-even is around one class set.