For my daughter’s high school, I volunteered to read and score scholarship applications submitted by graduating seniors. There were 10 categories of applications, including Academic Excellence, Arts, Athletics, Leadership, School Service, and others. All applications consisted of an essay, and some categories required supplemental information such as photos, videos, or other materials to support the applicant’s scholarship candidacy. Parent volunteers were placed in groups of 3, with each group asked to review 4 or 5 applications. Parents were provided with a grading rubric and were asked to independently evaluate each student’s submission. To reduce the chance of bias, parents who knew an applicant were asked to move to another group.
The grading rubric consisted of 5 dimensions for a total of 20 points:
Followed Directions
- 2 points – Followed most or all directions
- 1 point – Followed some directions
- 0 points – Followed no directions

Answered Essay Prompt
- 3 points – Answered the prompt completely
- 2 points – Mostly answered the prompt
- 1 point – Somewhat answered the prompt
- 0 points – Essay has nothing to do with the prompt

Well-Written and Use of Good Grammar
- 5 points – Essay is well-written and almost all of the grammar is correct
- 4 points – Essay is somewhat well-written and most of the grammar is correct
- 3 points – Essay is adequately written and the grammar is somewhat correct
- 2 points – Essay is sloppily written and has numerous grammatical errors
- 1 point – Essay is poorly written and has many grammatical errors
- 0 points – Essay is incomprehensible

Provided Examples of Supporting Evidence
- 5 points – Completely supported essay with examples of evidence
- 4 points – Mostly supported essay with examples of evidence
- 3 points – Somewhat supported essay with examples of evidence
- 2 points – Provided a few examples to support essay
- 1 point – Did not provide enough examples to support essay
- 0 points – Provided no examples to support essay

Impact of Essay
- 5 points – Essay was outstanding and made the reader feel invested in the student’s essay
- 4 points – Essay was good and the reader felt connected to the student’s essay
- 3 points – Essay was okay and the reader understood what the student was trying to express
- 2 points – Essay had a point and the reader didn’t lose interest while reading the essay
- 1 point – Essay was poor and the reader had to work to engage with the essay
- 0 points – Essay was disjointed and the reader was unable to connect with the essay
Up to 2 bonus points were also given for applications that required supplemental information, but I’ve omitted those criteria for brevity.
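For readers who want to tabulate scores the same way, the rubric above reduces to a small data structure plus a range check. This is just an illustrative Python sketch; the dimension names are taken from the rubric, but the function names are my own:

```python
# Maximum points for each rubric dimension (bonus points omitted, as above).
RUBRIC_MAX = {
    "Followed Directions": 2,
    "Answered Essay Prompt": 3,
    "Well-Written and Use of Good Grammar": 5,
    "Provided Examples of Supporting Evidence": 5,
    "Impact of Essay": 5,
}

def validate_scores(scores: dict) -> bool:
    """True if every dimension is present with a whole-number score in range."""
    return (scores.keys() == RUBRIC_MAX.keys()
            and all(isinstance(scores[d], int) and 0 <= scores[d] <= mx
                    for d, mx in RUBRIC_MAX.items()))

def total(scores: dict) -> int:
    """Sum the five dimension scores (0-20)."""
    return sum(scores.values())
```

A perfect application scores `total(...) == 20`, matching the rubric’s stated maximum.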
After submitting my scores, I wondered how they compared to those of other parents. Because I was the first volunteer in my group to complete my assignment, I did not have visibility into how the other 2 parents scored the students’ applications. However, I was able to externally validate my scores against those of various large language models (LLMs).
METHODS
There are too many LLMs to count nowadays, so I consulted the 7 I was most familiar with; for each, I’ve listed the model it most likely used as of this writing. Some LLMs are more transparent than others about the identification and versioning of their free and paid models. For all 7, I used the free tier.
- ChatGPT: Default model: GPT-5.2 Instant; Fallback model: GPT-5.2 Mini or similar lightweight version if you exceed limits
- Claude: Sonnet 4.6
- Copilot: Copilot model, built by Microsoft
- DeepSeek: DeepSeek-V3.2
- Gemini: Gemini 3
- Grok: Grok 4.20 beta, Auto (Fast or Expert)
- Perplexity: model not shown or configurable on free plan
I used the exact same prompt for all 7 LLMs and all 4 students:
You are a parent of a high school student who has volunteered to evaluate scholarship applications. Students who apply for a scholarship under the category of SCHOOL SERVICE are given the following essay prompt: “What contributions have you made to our high school as someone who serves this community?” Students who apply for a scholarship under the category of LEADERSHIP are given the following essay prompt: “Would others consider you a leader and why?” OR “What is your definition of a leader and how do you embody those characteristics?”
The grading rubric is provided in the attached “Essay Scoring Guidelines.pdf” file. Provide scores as whole numbers for the following dimensions in accordance with the scoring guidelines:
1. Followed Directions (0-2 points)
2. Answered Essay Prompt (0-3 points)
3. Well-Written and Use of Good Grammar (0-5 points)
4. Provided Examples of Supporting Evidence (0-5 points)
5. Impact of Essay (0-5 points)

Ignore the “Bonus Points” dimension in the scoring guidelines because the scoring of that dimension may involve evaluation of photos or videos. The student’s essay is attached. Provide the score for each of the 5 dimensions along with a brief justification for each score.
For each model, I pasted the prompt and attached the essay scoring guidelines in a PDF file, along with a PDF of the essay for Student 1. I continued using the same chat thread, so for Students 2-4 I attached only the PDFs of their essays, as re-attaching the scoring guidelines for each student would have been redundant. For privacy reasons, I have de-identified the student names and am not sharing the actual student essays.
RESULTS
My ratings, along with those of the 7 LLMs, are as follows:
Although the LLMs did provide brief justifications for their scores, I’ve included only the numeric results; I could easily furnish the complete LLM responses upon request.
Overall, there was general agreement between my ratings and the average ratings from the 7 LLMs. In terms of rank order, I gave the highest scores to Student 1 (19 points), followed by Student 4 (17), Student 2 (15), and Student 3 (12). Using the average of all 7 LLMs, the highest score went to Student 1 (19.7), followed by a 2-way tie between Students 2 and 4 (19.1), and then Student 3 (15.4). In other words, the LLMs agreed with my ratings for the best and worst applications, although they did not draw a distinction between the two applications in the middle of the pack.
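The rank ordering above can be reproduced mechanically from the totals reported in this section. A minimal Python sketch (the numbers are the ones stated above; the function name is my own):

```python
# Totals I assigned vs. the averages across the 7 LLMs, as reported above.
my_scores = {"Student 1": 19, "Student 2": 15, "Student 3": 12, "Student 4": 17}
llm_avgs  = {"Student 1": 19.7, "Student 2": 19.1, "Student 3": 15.4, "Student 4": 19.1}

def rank_order(scores: dict) -> list:
    """Students from highest to lowest score (Python's sort is stable, so ties keep dict order)."""
    return sorted(scores, key=scores.get, reverse=True)

print(rank_order(my_scores))  # Student 1 > Student 4 > Student 2 > Student 3
print(rank_order(llm_avgs))   # Student 1 first, Students 2 and 4 tied, Student 3 last
```

Both orderings put Student 1 first and Student 3 last; the disagreement is confined to the middle two, where the LLM averages are deadlocked at 19.1.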
Across the board, I was equally or more critical of the essays than the LLMs, as the LLMs generally gave the same or higher scores in each of the 5 dimensions of the grading rubric. Upon examining the total number of points allocated across LLMs, the 3 most “lenient” graders were Grok (78 total points awarded), Perplexity (77), and Copilot (76), while the “strictest” graders were Claude (69), ChatGPT (70), and Gemini (70).
DISCUSSION
All 7 LLMs were up to the task of grading the essays in accordance with the grading rubric. I considered the possibility that some LLMs might not completely follow directions, but all of them adhered precisely to the grading criteria and listed scores that were concordant with the criteria. Some LLMs even tallied up the total scores for each student even though I did not specifically request it in my prompt, and when they did so, they performed addition without any errors.
There are several possible explanations for the differences between my ratings and the LLM ratings. First, it is possible that I’m a tough grader. I went into this activity thinking that these were all brilliant students, and it would not be helpful if all the students clustered around near-perfect scores. In fact, this is exactly the outcome observed with the LLMs, as Students 2 and 4 were deadlocked in a tie. Second, it is possible that the LLMs were lenient graders. After all, sycophancy in LLMs has been well documented and researched, and many companies have made concerted efforts to tone down the level of sycophancy as they introduce new versions of their models.
This experiment validates that LLMs can be used to assess the quality of written text against a custom rubric. This will probably not surprise readers who have already engaged with LLMs in similar ways, as I have; however, this is the first time I’ve quantified my findings. Another key takeaway is that LLMs can critically appraise a body of written text so the author has a chance to make revisions based on the feedback. In academic settings, the mere usage of LLMs is not tantamount to cheating; it is the way an LLM is used that determines whether it serves as a learning aid or a means of cheating. In work settings, I encourage professionals to take full advantage of LLMs to enhance learning, spark creativity, and optimize productivity. As long as LLMs are used in a way that does not substitute for critical thinking, I think we have a lot to gain.