Monday, March 16, 2026

Essay Grading - Human vs. Machine

For my daughter’s high school, I volunteered to read and score scholarship applications submitted by graduating seniors. There were 10 categories of applications, including Academic Excellence, Arts, Athletics, Leadership, and School Service. All applications consisted of an essay, and some categories required supplemental information such as photos or videos to support the applicant’s candidacy. Parent volunteers were placed in groups of 3, with each group asked to review 4 or 5 applications. Parents were provided with a grading rubric and asked to independently evaluate each student’s submission. To reduce the chance of bias, parents who knew a student were asked to request reassignment to another group.

The grading rubric consisted of 5 dimensions for a total of 20 points:

Followed Directions
2 points – Followed most or all directions
1 point – Followed some directions
0 points – Followed no directions

Answered Essay Prompt
3 points – Answered the prompt completely
2 points – Mostly answered the prompt
1 point – Somewhat answered the prompt
0 points – Essay has nothing to do with the prompt

Well-Written and Use of Good Grammar
5 points – Essay is well-written and almost all of the grammar is correct
4 points – Essay is somewhat well-written and most of the grammar is correct
3 points – Essay is adequately written and the grammar is somewhat correct
2 points – Essay is sloppily written and has numerous grammatical errors
1 point – Essay is poorly written and has many grammatical errors
0 points – Essay is incomprehensible

Provided Examples of Supporting Evidence
5 points – Completely supported essay with examples of evidence
4 points – Mostly supported essay with examples of evidence
3 points – Somewhat supported essay with examples of evidence
2 points – Provided a few examples to support essay
1 point – Did not provide enough examples to support essay
0 points – Provided no examples to support essay

Impact of Essay
5 points – Essay was outstanding and made the reader feel invested in the student’s essay
4 points – Essay was good and the reader felt connected to the student’s essay
3 points – Essay was okay and the reader understood what the student was trying to express
2 points – Essay had a point and the reader didn’t lose interest while reading the essay
1 point – Essay was poor and the reader had to work to engage with the essay
0 points – Essay was disjointed and the reader was unable to connect with the essay

Up to 2 bonus points were also given for applications that required supplemental information, but I’ve omitted those criteria for brevity.
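For anyone tallying scores programmatically, the rubric’s maximum points can be captured in a small lookup table. This is just a sketch using the dimension names above; it simply confirms that the per-dimension maxima add up to the 20-point total.

```python
# Maximum points per rubric dimension, as listed above (bonus points omitted).
rubric_max = {
    "Followed Directions": 2,
    "Answered Essay Prompt": 3,
    "Well-Written and Use of Good Grammar": 5,
    "Provided Examples of Supporting Evidence": 5,
    "Impact of Essay": 5,
}

print(sum(rubric_max.values()))  # 20
```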

After submitting my scores, I wondered how they compared to those of other parents. Because I was the first volunteer in my group to complete my assignment, I did not have visibility into how the other 2 parents scored the students’ applications. However, I was able to externally validate my scores against those of various large language models (LLMs).

METHODS

There are too many LLMs to count nowadays, so I consulted the 7 I was most familiar with; the list below shows the model each one most likely used as of this writing. Some LLMs are more transparent than others about the identification and versioning of their free and paid models. For all 7, I used the free tier.

  • ChatGPT: Default model: GPT-5.2 Instant; Fallback model: GPT-5.2 Mini or similar lightweight version if you exceed limits
  • Claude: Sonnet 4.6
  • Copilot: Copilot model, built by Microsoft
  • DeepSeek: DeepSeek-V3.2
  • Gemini: Gemini 3
  • Grok: Grok 4.20 beta, Auto (Fast or Expert)
  • Perplexity: model not shown or configurable on free plan

I used the exact same prompt for all 7 LLMs and all 4 students:

You are a parent of a high school student who has volunteered to evaluate scholarship applications. Students who apply for a scholarship under the category of SCHOOL SERVICE are given the following essay prompt: “What contributions have you made to our high school as someone who serves this community?” Students who apply for a scholarship under the category of LEADERSHIP are given the following essay prompt: “Would others consider you a leader and why?” OR “What is your definition of a leader and how do you embody those characteristics?”

The grading rubric is provided in the attached “Essay Scoring Guidelines.pdf” file. Provide scores as whole numbers for the following dimensions in accordance with the scoring guidelines:

1. Followed Directions (0-2 points)
2. Answered Essay Prompt (0-3 points)
3. Well-Written and Use of Good Grammar (0-5 points)
4. Provided Examples of Supporting Evidence (0-5 points)
5. Impact of Essay (0-5 points)

Ignore the “Bonus Points” dimension in the scoring guidelines because the scoring of that dimension may involve evaluation of photos or videos. The student’s essay is attached. Provide the score for each of the 5 dimensions along with a brief justification for each score.

For each model, I pasted the prompt and attached the essay scoring guidelines in a PDF file along with a PDF of the essay for Student 1. I continued in the same chat thread, so for Students 2–4 I attached only the PDFs of their essays, as re-attaching the scoring guidelines for each student would have been redundant. For privacy reasons, I have de-identified the student names and am not sharing the actual student essays.

RESULTS

My ratings, along with those of the 7 LLMs, are as follows:

Although the LLMs did provide brief justifications for their scores, I’ve included only the numeric results; I could easily furnish the complete LLM responses upon request.

Overall, there was general agreement between my ratings and the average ratings from the 7 LLMs. In terms of rank order, I gave the highest score to Student 1 (19 points), followed by Student 4 (17), Student 2 (15), and Student 3 (12). Using the average of all 7 LLMs, the highest score went to Student 1 (19.7), followed by a 2-way tie between Students 2 and 4 (19.1 each), and then Student 3 (15.4). In other words, the LLMs agreed with my ratings for the best and worst applications, although they did not draw a distinction between the two applications in the middle of the pack.
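As a sanity check, the rank ordering can be reproduced from the totals quoted above with a few lines of Python (the student labels are the de-identified placeholders used in this post):

```python
# Totals quoted in this post: my scores and the 7-LLM average for each student.
my_totals = {"Student 1": 19, "Student 2": 15, "Student 3": 12, "Student 4": 17}
llm_avg = {"Student 1": 19.7, "Student 2": 19.1, "Student 3": 15.4, "Student 4": 19.1}

def rank_order(scores):
    """Return students sorted from highest to lowest score (ties keep their order)."""
    return sorted(scores, key=scores.get, reverse=True)

print(rank_order(my_totals))  # ['Student 1', 'Student 4', 'Student 2', 'Student 3']
print(rank_order(llm_avg))    # Student 1 first, Student 3 last, Students 2 and 4 tied
```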

Across the board, I was equally or more critical of the essays than the LLMs, as the LLMs generally gave the same or higher scores in each of the 5 dimensions of the grading rubric. Upon examining the total number of points allocated across LLMs, the 3 most “lenient” graders were Grok (78 total points awarded), Perplexity (77), and Copilot (76), while the “strictest” graders were Claude (69), ChatGPT (70), and Gemini (70).

DISCUSSION

All 7 LLMs were up to the task of grading the essays in accordance with the grading rubric. I considered the possibility that some LLMs might not completely follow directions, but all of them adhered precisely to the grading criteria and listed scores concordant with those criteria. Some LLMs tallied the total score for each student even though I did not specifically request it in my prompt, and when they did so, they performed the addition without errors.

There are several possible explanations for the differences between my ratings and the LLM ratings. First, it is possible that I’m a tough grader. I went into this activity thinking that these were all brilliant students, and it would not be helpful if they all clustered around near-perfect scores. In fact, this clustering is exactly what was observed with the LLMs, as Students 2 and 4 were deadlocked in a tie. Second, it is possible that the LLMs were lenient graders. After all, sycophancy in LLMs has been well documented and researched, and many companies have made concerted efforts to tone it down as they introduce new versions of their models.

This experiment validates that LLMs can be used to assess the quality of written text against a custom rubric. This is probably not surprising to readers, myself included, who have already engaged with LLMs in similar ways; however, this is the first time I’ve quantified my findings. Another key takeaway is that LLMs can critically appraise a body of written text so the author has a chance to revise based on the feedback. In academic settings, the mere use of an LLM is not tantamount to cheating; it’s the way the LLM is used that determines whether it serves as a learning aid or a means to cheat. In work settings, I encourage professionals to take full advantage of LLMs to enhance learning, spark creativity, and optimize productivity. As long as LLMs do not substitute for critical thinking, I think we have a lot to gain.

Wednesday, March 4, 2026

Chicken Al Pastor and Oxford Commas

I was driving my wife home after she had a medical procedure, and she asked me to buy her some food from Chipotle. I listened with apprehension as she rattled off a litany of food items and ingredient customizations, as I knew there would be no way I’d get all the details right. You see, my wife has very particular preferences when it comes to food. So, to ensure that I had the best chance of getting her order correct, I asked her to text the instructions to me. She initially refused, saying that I make no effort to remember her preferences. I said that if I had to remember more than 2 or 3 things about her order, I would screw it up and she would be upset. Besides, I was driving and trying to find the restaurant, so I wasn’t able to pay enough attention to commit her customizations to memory. So she relented and sent me the following text message (verbatim and therefore in quotes):

“Chicken Al pastor, brown black beans corn, green sauce and salsa on side”

And while I was still driving, she verbally told me to take her phone and redeem an offer for free queso by scanning a QR code provided in the app and also scan her rewards number so she could earn points. I felt that I could remember those last 2 instructions because they were the last things she mentioned, and all the other details were in the text message. I had never heard of chicken al pastor, nor did I know that Chipotle had that on their menu. It turns out that it is a time-limited offer. Note that the link may not work when the offer expires, but here’s a screenshot from that site.

After I struggled to find parking, my wife stayed in the car while I entered the store and read her text message to the server. I was asked, “Burrito or bowl?” and requested a bowl, which is what my wife usually gets (BTW, I received no credit for knowing the answer to this question). I also had the intuition to know that “brown black beans” meant “brown rice and black beans” despite the instructions being technically incomplete/illogical (no credit for that one either). After carefully crafting my wife’s gourmet meal, I scanned the QR code for the free queso offer, scanned the QR code for her rewards program, and paid. Mission accomplished, or so I thought (foreshadowing).

When we got home, my wife asked where the green sauce was. I told her that I saw them put the green sauce in the bowl. She complained that she wanted the green sauce on the side, NOT IN THE BOWL.

----- Begin Side Conversation About Oxford Comma -----

One could argue that there was no Oxford comma in her text message, so it should have been clear that both the green sauce and salsa needed to be put on the side. However, if you read the entire text message, the punctuation is wildly inconsistent, so no reasonable person could definitively conclude that the absence of an Oxford comma necessarily meant that the green sauce should have been put on the side. Plus, I am pretty sure that my wife does not know what an Oxford comma is.

----- End Side Conversation About Oxford Comma -----

Anyway, the situation was quite upsetting to her, as she continued to complain that I never try to understand her. I actually found the situation somewhat amusing, because not only did I anticipate that this would happen, I called it out and tried to prevent it, and it happened anyway. It’s not that I don’t try to understand my wife as a person; I just have a low tolerance for complexity when it comes to fast food, so I try to shift the burden of perfecting an order back onto her. On that criticism, I think she is partially correct. Also, I think I have been conditioned to just accept that whatever I do, it will be wrong, and I will be blamed anyway.

When I order food, I’ll usually accept whatever normally comes with the dish, or in the case of a build-your-own dish scenario, I’ll just have everything. Honestly, I don’t really care that much if I get white or brown rice, brown or black beans, or green or red salsa. I certainly don’t need things put on the side; just dump everything in and save a plastic container from taking up space in landfills. Besides, I will eventually mix it all together, and everything will come out the other end looking the same regardless of how it was prepared. And if someone orders food for me, I will say “thank you” and happily eat the food. No complaints, no drama.

I am not saying that people should not have detailed food preferences. I just think they should not impose their expectations on others and get upset when people fall short of those expectations. Also, a clearer text message such as this one could have prevented the snafu:

“Chicken al pastor in bowl, brown rice, black beans, corn, green sauce on side, salsa on side. In Chipotle app, redeem offer for free queso and scan rewards code.”

It is specific and understandable, and I just demonstrated how an Oxford comma in combination with other clear communication could have saved the day. Oh what could have been!