What is good writing?
Is the quality of writing determined by how the piece ripples through one’s soul, or how the ideas ripple through society? What can a lone evaluator—gazing through the fogs of uncertainty—even tell about the latter? What is good writing, and what do we want good writing to be?
Someone recently decided to start a writing competition where the judging is done by LLMs. The idea is to test how well LLMs can replicate human judgment, and whether they can effectively be used as a filtering mechanism to get more eyes on writing that people will appreciate.
The early experiments they ran suggested that LLMs are quite capable of doing this, which is perfectly in alignment with my personal intuitions. Assuming that they are, this actually makes them very interesting windows into what constitutes good writing from a human perspective.
So I did the obvious thing: I copy-pasted a bunch of essays, had LLMs judge them, and then examined their judgments through additional prompting. The initial prompt I used was the same one Philosophy Bear mentioned in his original experiment:
You are a judge in a writing contest. You will be shown two essays and asked to score each one.
Score each essay on a scale of 0-10, where 5 represents a typical, competent contest submission — neither impressive nor poor. Scores below 5 are for essays that fall short of typical contest quality in some way. Scores above 5 are for essays that exceed it. Scores of 8 or above should be rare and reserved for essays that genuinely surprised you. Scores of 2 or below should likewise be rare, reserved for essays with serious problems.
You must use the full range. In a large contest, some essays will score 2 or below and some will score 8 or above. Clustering your scores between 5 and 7 is a scoring error.
Reply in this exact format and no other:
Essay A: [score]
Essay B: [score]
The actual strength of this prompt is how unconstrained it is: it simply mentions a writing contest. No additional scoring criteria significantly bias the LLM, aside from the line about genuine surprise, which probably results in overvaluing novelty. Apart from that, the interpretation of what constitutes “good writing” in a writing competition is left entirely up to the LLM. This might be seen as a weakness, but if your goal is to mimic human judgment, and so to get a window into it, I actually think it’s a strength. The LLM is trained on human data, so by not specifying criteria you let the actual messy associations, the compressed signals from the full corpus, do the weighting. There are naturally “corrupting” forces in play: models are fed curated data to emphasize certain things, this gets further amplified by various RL methods, and the data contains many things other than content on “good writing”. One could certainly produce a cleaner window into what constitutes good writing by building a model specifically for that, but a general model is still a decent compression of the behemoth that is everything that goes into answering the question “what is good writing?”
So I played around with it, looking at how different models, mostly Gemini Flash/Pro and Claude Sonnet, evaluated some essays. I looked at how toggling between standard and extended thinking affected Gemini’s assessments, noticed some patterns, and then started prompting it for justifications to see how well its rationalizations tracked the patterns that seemed to emerge in the evaluations.
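For anyone repeating this, the reply format the prompt demands is easy to check mechanically. Below is a minimal sketch, assuming the judge replies have already been collected as plain strings. The names `parse_scores` and `mean_scores` are hypothetical helpers made up for illustration, not part of the original experiment, and accepting decimal scores is an assumption, since the prompt does not say whether scores must be whole numbers.

```python
import re
from statistics import mean

# Matches lines such as "Essay A: 7" or "Essay B: 6.5".
# Assumption: decimal scores are tolerated; the prompt does not forbid them.
SCORE_LINE = re.compile(r"^Essay ([AB]):[ \t]*(\d+(?:\.\d+)?)[ \t]*$", re.MULTILINE)


def parse_scores(reply: str) -> dict[str, float]:
    """Extract both scores from one judge reply, or raise if the format is off."""
    scores = {label: float(value) for label, value in SCORE_LINE.findall(reply)}
    if set(scores) != {"A", "B"}:
        raise ValueError(f"unexpected reply format: {reply!r}")
    for label, value in scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"score for essay {label} is outside 0-10: {value}")
    return scores


def mean_scores(replies: list[str]) -> dict[str, float]:
    """Average the scores for each essay across several judge replies."""
    parsed = [parse_scores(reply) for reply in replies]
    return {label: mean(p[label] for p in parsed) for label in ("A", "B")}
```

Averaging over several replies for the same pair of essays, for instance one per model or thinking setting, gives a rough picture of where the judges agree and where they split.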
The most obvious pattern that emerged was a split between essays as a form of art and essays as analytical work. The smaller the model, and the fewer thinking resources allocated to it, the more it leaned toward essays as art. The larger models, which are specifically optimized for reasoning, predictably displayed more of a bias toward analytical essays, but interestingly not enough to override the general association between “writing competition” and the essay as a form of art. They still gave heavy weight to the complexity and expressivity of the prose itself, the language doing work beyond the actual argument. We’re not just talking about elegance of sentence structure here, but also about emotional resonance and the form itself: the writing being an actual instantiation of the argument.
This is, undeniably, craftsmanship. But it introduces an interesting point of friction. This style of writing is perfect for making relatively simple ideas resonate deeply within the reader, for demonstrating just how complex and profoundly meaningful even simple observations can be. The intellectual task here, in other words, is generating complexity from simplicity. The written word rippling through the reader, hooking its roots in the dark cavities of the conscious mind, resonating beyond the grasp of reason itself. This is art, this is craftsmanship, but it is only one side of an essay.
The other side, which we might refer to as analytical work, faces the opposite pressure. Here the job of the prose is to make the complex seem simple. When the ideas themselves are complex and hard to follow, the last thing you want is for the writing to complicate things further. Emotional resonance obfuscates, complex sentence structures distract, and most formalization is axiomatically atemporal, so rhythm be damned.
The rhythm part is, of course, an exaggerated joke. But in the context of LLMs evaluating more analytically rigorous writing, it is simply the case that they exhibit a massive “formalization boner”. There is nothing more intellectually profound, epistemically calibrated—or indeed linguistically impressive—than formalization. A simple idea competently formalized is simply in a different league from a complex idea explored competently without formalization. It is an amusing quirk, a quite reliable bias in these reasoning models, that can absolutely be exploited as a hack to make them rate the writing higher.
But how about analytical writing that does not formalize? This is the more interesting middle ground, since the “hacks” aren’t being exploited. It is precisely where we can see the substantial bias in associating “writing competition” with writing as an artform. The models will penalize analytical work for simplistic language, even though it would, ironically, be incompetent to use complex, poetic language in a very dense analytical piece. Naturally, this is not to say the two can never be married. You can certainly do both, but they are in friction, and the areas where they actually add to each other are quite narrow. The last thing I want to do when explaining the messy entanglements of heritability studies is use evocative language. Human intuition is, in fact, the enemy here. The goal is to strip intuitions away and force the reader to look exclusively at the mechanics, to dismantle the errors in the intuitive leaps. The writing in this case is not dry due to a lack of craftsmanship; it’s necessarily dry because wetness is the very enemy being fought.
So we’re left with a fascinating pressure point: the more complex and counterintuitive the idea, the drier the prose generally has to be. This is not absolutely true. There are exceptions where evocative prose is actually necessary, because what is being demonstrated is the limits of reasoning, the paradoxical nature of an insight that is nevertheless useful. That is a pretty specific class of problems, though, and such problems are by definition not mechanistic. What I have in mind here are arguments where the very cogs and wheels that turn the machinery are the focus. The more intricate the machinery, the more counterintuitive its workings, the more stripped the language must be.
The friction here, as far as evaluating good writing goes, is that there is nothing pleasurable about it. The content is intellectually demanding: to follow the argument, one must keep track of a host of variables simultaneously, many of which are counterintuitive. The reward only arrives if you successfully meet the cognitive demand; otherwise you are left confused. It is only a profound journey if you can reach the destination; otherwise it’s pedantic torture. Some people enjoy pedantic torture, where the puzzle is the reward, but for the most part we want the puzzle to feel meaningful before reaching the destination. We want an engaging, profound journey that crescendos into an even more resonant destination.
But notice what is actually happening here: writing as an artform is unquestionably winning out. I think this is accurate for how we work, and that the LLM judgments are, at least partially, true reflections of what pragmatically constitutes good writing to us. However, when one summarizes the insight into a sentence such as:
“Good writing is making simple ideas resonate as profound truths.”
We wince.
This is not what we want good writing to mean. The fact that an article that turns out to be paradigm-shifting could accurately be considered a worse essay than a brilliant show of resonant craftsmanship just doesn’t sit right. We want it to strike a balance between both, for the writing to be truly evaluated based on its function, but alas, we are imperfect judges.
So what is good writing to us really? I don’t think there’s a simple answer to that question, but I think the truth sits closer to artistic resonance than it does to competent exploration of complex ideas. I suspect this remains true even in communities largely focused on analytical rigor.
Finally, I’d just like to sit with the irony of this piece. It is, in many ways, an incredibly unnecessarily convoluted exploration of what is, to most, a quite obvious and simple truth. We are biased towards poetic, emotional resonance? What a shocking and profound statement indeed!

