AI Favors Texts Written by Other AIs, Even When They're Worse Than Human Ones

by hubie from SoylentNews on (#723BF)

upstart writes:

AI favors texts written by other AIs, even when they're worse than human ones:

As many of you already know, I'm a university professor. Specifically, I teach artificial intelligence at UPC.

Each semester, students must complete several projects in which they develop different AI systems to solve specific problems. Along with the code, they must submit a report explaining what they did, the decisions they made, and a critical analysis of their results.

Obviously, most of my students use ChatGPT to write their reports.

So this semester, for the first time, I decided to use a language model myself to grade their reports.

The results were catastrophic in two ways:

  1. The LLM wasn't able to follow my grading criteria. It applied whatever criteria it felt like, ignoring my prompts, so it wasn't very helpful.
  2. The LLM loved the reports clearly written with ChatGPT, rating them above the higher-quality reports the students wrote themselves.

In this post, I'll share my thoughts on both points. The first one is quite practical; if you're a teacher, you'll find it useful. I'll include some strategies and tricks to encourage good use of LLMs, detect misuse, and grade more accurately.

The second one... is harder to categorize and would probably require a deeper study, but I think my preliminary observations are fascinating on their own.

[...] If you're a teacher and you're thinking of using LLMs to grade assignments or exams, it's worth understanding their limitations.

We should think of a language model as a "very smart intern": fresh out of college, with plenty of knowledge, but not yet sure how to apply it in the real world to solve problems. So we must be extremely detailed in our prompts and patient in correcting its mistakes, just as we would be if we asked a real person to help us grade.

In my tests, I included the full project description, a detailed grading rubric, and several elements of my personal judgment to help it understand what I look for in an evaluation.
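
For concreteness, here is a minimal sketch of what such a grading setup might look like using the OpenAI Python client. The model name, rubric items, and helper function are placeholders for illustration, not the author's actual configuration.

```python
# A minimal sketch of rubric-driven grading with the OpenAI Python client.
# The model name, rubric text, and scoring scale are placeholders, not the
# author's actual setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """\
1. Problem analysis and design decisions (0-4 points)
2. Experimental methodology (0-3 points)
3. Critical discussion of results (0-3 points)
Quote the report verbatim when justifying each score.
"""

def grade_report(report_text: str) -> str:
    """Ask the model to score one report strictly against the rubric."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You are a grading assistant. Apply ONLY the rubric "
                        "below. Do not invent additional criteria.\n" + RUBRIC},
            {"role": "user", "content": report_text},
        ],
        temperature=0,  # reduce run-to-run variance in the scores
    )
    return response.choices[0].message.content
```

Even with a setup along these lines, as the author describes next, the model may still drift away from the rubric.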

[...] The usual hallucinations began, the kind I thought were mostly solved in newer model versions. But apparently not: it was completely making up citations from the reports.

[...] Soon after, it started inventing its own grading criteria. I couldn't get it to follow my rubric at all. I gave up and decided to treat its feedback simply as an extra pair of eyes, to make sure I wasn't missing anything.

[...] Instead of asking the LLM to identify AI-written texts, which it doesn't do very well, I decided to compare my own quality ratings of each project with the LLM's ratings. Basically, I wanted to see how aligned our criteria were.
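
One simple way to quantify that alignment is a rank correlation between the two sets of scores. The sketch below assumes SciPy is available; the scores shown are invented for illustration, not the author's data.

```python
# Compare human and LLM scores for the same projects with a rank correlation.
# The score lists are hypothetical placeholders.
from scipy.stats import spearmanr

human_scores = [8.5, 6.0, 9.0, 5.5, 7.0]  # professor's grades (illustrative)
llm_scores   = [7.0, 8.5, 7.5, 9.0, 8.0]  # model's grades for the same reports

rho, p_value = spearmanr(human_scores, llm_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A low or negative rho indicates the two sets of criteria diverge.
```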

And I found a fascinating pattern: the AI gives artificially high scores to reports written with AI.

The models perceive LLM-written reports as more professional and of higher quality. They prioritize form over substance.
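
That pattern can be checked with nothing more than a group comparison: split the LLM's scores by whether the grader judged each report AI-written, then compare the group means. Everything in this sketch is hypothetical data, standing in for the author's observations.

```python
# Split LLM scores by the grader's own AI/human judgment and compare means.
# Report labels and scores are made up for illustration.
from statistics import mean

llm_scores = {"r1": 9.0, "r2": 8.5, "r3": 6.5, "r4": 9.2, "r5": 7.0}
ai_written = {"r1", "r2", "r4"}  # the professor's judgment, not the LLM's

ai_mean = mean(s for r, s in llm_scores.items() if r in ai_written)
human_mean = mean(s for r, s in llm_scores.items() if r not in ai_written)
print(f"LLM mean for AI-written reports:    {ai_mean:.2f}")
print(f"LLM mean for human-written reports: {human_mean:.2f}")
```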

Read more of this story at SoylentNews.
