Monday, March 16, 2026

Essay Grading - Human vs. Machine

For my daughter’s high school, I volunteered to read and score scholarship applications submitted by graduating seniors. There were 10 categories of applications, including Academic Excellence, Arts, Athletics, Leadership, and School Service. All applications consisted of an essay, and some categories required supplemental information such as photos or videos to support the applicant’s scholarship candidacy. Parent volunteers were placed in groups of 3, with each group asked to review 4 or 5 applications. Parents were given a grading rubric and asked to evaluate each student’s submission independently. To reduce the chance of bias, parents who knew a student were asked to request reassignment to another group.

The grading rubric consisted of 5 dimensions for a total of 20 points:

Followed Directions
2 points – Followed most or all directions
1 point – Followed some directions
0 points – Followed no directions

Answered Essay Prompt
3 points – Answered the prompt completely
2 points – Mostly answered the prompt
1 point – Somewhat answered the prompt
0 points – Essay has nothing to do with the prompt

Well-Written and Use of Good Grammar
5 points – Essay is well-written and almost all of the grammar is correct
4 points – Essay is somewhat well-written and most of the grammar is correct
3 points – Essay is adequately written and the grammar is somewhat correct
2 points – Essay is sloppily written and has numerous grammatical errors
1 point – Essay is poorly written and has many grammatical errors
0 points – Essay is incomprehensible

Provided Examples of Supporting Evidence
5 points – Completely supported essay with examples of evidence
4 points – Mostly supported essay with examples of evidence
3 points – Somewhat supported essay with examples of evidence
2 points – Provided a few examples to support essay
1 point – Did not provide enough examples to support essay
0 points – Provided no examples to support essay

Impact of Essay
5 points – Essay was outstanding and made the reader feel invested in the student’s essay
4 points – Essay was good and the reader felt connected to the student’s essay
3 points – Essay was okay and the reader understood what the student was trying to express
2 points – Essay had a point and the reader didn’t lose interest while reading the essay
1 point – Essay was poor and the reader had to work to engage with the essay
0 points – Essay was disjointed and the reader was unable to connect with the essay

Up to 2 bonus points were also given for applications that required supplemental information, but I’ve omitted those criteria for brevity.
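As a quick sanity check, the rubric can be expressed as a small data structure that enforces each dimension’s range and confirms the 20-point maximum. This is a sketch of my own, not part of the official scoring guidelines; the dimension names simply mirror the rubric above:

```python
# Maximum points per rubric dimension, as listed above (bonus points excluded).
RUBRIC_MAX = {
    "Followed Directions": 2,
    "Answered Essay Prompt": 3,
    "Well-Written and Use of Good Grammar": 5,
    "Provided Examples of Supporting Evidence": 5,
    "Impact of Essay": 5,
}

def validate_scores(scores: dict[str, int]) -> int:
    """Check each score is a whole number within its dimension's range,
    then return the student's total (0-20)."""
    for dim, pts in scores.items():
        if not (0 <= pts <= RUBRIC_MAX[dim]):
            raise ValueError(f"{dim}: {pts} is outside 0-{RUBRIC_MAX[dim]}")
    return sum(scores.values())

# The five dimensions add up to the rubric's stated 20-point total.
assert sum(RUBRIC_MAX.values()) == 20  # 2 + 3 + 5 + 5 + 5
```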

After submitting my scores, I wondered how my scores compared to those of other parents. Because I was the first volunteer in my group to complete my assignment, I did not have visibility into how the other 2 parents scored the students’ applications. However, I was able to externally validate my scores against those of various large language models (LLMs).

METHODS

There are too many LLMs to count nowadays, so I consulted the 7 I was most familiar with. For each, I’ve listed the most probable model it used as of this writing; some LLMs are more transparent than others about the identification and versioning of their free and paid models. For all 7, I used the free tier.

  • ChatGPT: Default model: GPT-5.2 Instant; Fallback model: GPT-5.2 Mini or similar lightweight version if you exceed limits
  • Claude: Sonnet 4.6
  • Copilot: Copilot model, built by Microsoft
  • DeepSeek: DeepSeek-V3.2
  • Gemini: Gemini 3
  • Grok: Grok 4.20 beta, Auto (Fast or Expert)
  • Perplexity: model not shown or configurable on free plan

I used the exact same prompt for all 7 LLMs and all 4 students:

You are a parent of a high school student who has volunteered to evaluate scholarship applications. Students who apply for a scholarship under the category of SCHOOL SERVICE are given the following essay prompt: “What contributions have you made to our high school as someone who serves this community?” Students who apply for a scholarship under the category of LEADERSHIP are given the following essay prompt: “Would others consider you a leader and why?” OR “What is your definition of a leader and how do you embody those characteristics?”

The grading rubric is provided in the attached “Essay Scoring Guidelines.pdf” file. Provide scores as whole numbers for the following dimensions in accordance with the scoring guidelines:

1. Followed Directions (0-2 points)
2. Answered Essay Prompt (0-3 points)
3. Well-Written and Use of Good Grammar (0-5 points)
4. Provided Examples of Supporting Evidence (0-5 points)
5. Impact of Essay (0-5 points)

Ignore the “Bonus Points” dimension in the scoring guidelines because the scoring of that dimension may involve evaluation of photos or videos. The student’s essay is attached. Provide the score for each of the 5 dimensions along with a brief justification for each score.

For each model, I pasted the prompt and attached the essay scoring guidelines as a PDF file, along with a PDF file of the essay for student 1. I continued in the same chat thread, so for students 2-4 I attached only the PDF files of their essays, as re-attaching the scoring guidelines for each student would have been redundant. For privacy reasons, I have de-identified the student names and am not sharing the actual student essays.

RESULTS

My ratings, along with those of the 7 LLMs, are as follows:

Although the LLMs did provide brief justifications for their scores, I’ve included only the numeric results; I can furnish the complete LLM responses upon request.

Overall, there was general agreement between my ratings and the average ratings from the 7 LLMs. In terms of rank order, I gave the highest scores to Student 1 (19 points), followed by Student 4 (17), Student 2 (15), and Student 3 (12). Using the average of all 7 LLMs, the highest score went to Student 1 (19.7), followed by a 2-way tie between Students 2 and 4 (19.1), and then Student 3 (15.4). In other words, the LLMs agreed with my ratings for the best and worst applications, although they did not draw a distinction between the two applications in the middle of the pack.
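For anyone who wants to reproduce the rank ordering, here is a minimal sketch using only the totals reported above. Ties, such as the one between Students 2 and 4, are preserved because Python’s sort is stable:

```python
# Totals reported above: my scores and the average across the 7 LLMs.
my_scores = {"Student 1": 19, "Student 2": 15, "Student 3": 12, "Student 4": 17}
llm_avg   = {"Student 1": 19.7, "Student 2": 19.1, "Student 3": 15.4, "Student 4": 19.1}

def rank_order(scores: dict[str, float]) -> list[str]:
    """Sort students from highest to lowest total score.
    Equal scores keep their original (insertion) order."""
    return sorted(scores, key=scores.get, reverse=True)

print(rank_order(my_scores))  # ['Student 1', 'Student 4', 'Student 2', 'Student 3']
print(rank_order(llm_avg))    # ['Student 1', 'Student 2', 'Student 4', 'Student 3']
```

Note that the LLM averages leave Students 2 and 4 deadlocked at 19.1, matching the tie described above.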

Across the board, I was equally or more critical of the essays than the LLMs, as the LLMs generally gave the same or higher scores in each of the 5 dimensions of the grading rubric. Upon examining the total number of points allocated across LLMs, the 3 most “lenient” graders were Grok (78 total points awarded), Perplexity (77), and Copilot (76), while the “strictest” graders were Claude (69), ChatGPT (70), and Gemini (70).

DISCUSSION

All 7 LLMs were up to the task of grading the essays in accordance with the grading rubric. I considered the possibility that some LLMs might not completely follow directions, but all of them adhered precisely to the grading criteria and listed scores that were concordant with the criteria. Some LLMs even tallied up the total scores for each student even though I did not specifically request it in my prompt, and when they did so, they performed addition without any errors.

There are several possible explanations for the differences between my ratings and the LLM ratings. First, it is possible that I’m a tough grader. I went into this activity thinking that these were all brilliant students, and it would not be helpful if all the students clustered around near-perfect scores. In fact, this is exactly the outcome observed with the LLMs, as Students 2 and 4 were deadlocked in a tie. Second, it is possible that the LLMs were lenient graders. After all, sycophancy in LLMs has been well documented and researched, and many companies have made concerted efforts to tone down the level of sycophancy as they introduced new versions of their models.

This experiment validates that LLMs can be used to assess the quality of written text against a custom rubric. That is probably not surprising to readers who have already engaged with LLMs in similar ways, as I have; however, this is the first time I’ve quantified my findings. Another key takeaway is that LLMs can critically appraise a body of written text so the author has a chance to make revisions based on the feedback. In academic settings, the mere use of LLMs is not tantamount to cheating; it is the way in which an LLM is used that determines whether it serves as a learning aid or a means of cheating. In work settings, I encourage professionals to take full advantage of LLMs to enhance learning, spark creativity, and optimize productivity. As long as LLMs do not substitute for critical thinking, I think we have a lot to gain.
