How AI Detectors Actually Work: Burstiness, Perplexity, and Beyond

In this article

What is burstiness and why do detectors measure it?
How does perplexity reveal machine-generated text?
Why vocabulary diversity patterns matter to detection systems
What role does text length play in detection accuracy?
How do detectors handle paraphrasing and humanized text?
Which detectors use different methodologies, and what does that mean?
Can human-written text be flagged as AI, and vice versa?
Closing thoughts

AI detectors work by counting, not reading. Tools like GPTZero, Turnitin, and Originality.ai measure statistical properties of your text, chiefly burstiness (how predictably each word follows the last), perplexity (how surprised a language model is by your word choices), and vocabulary diversity, then compare the numbers against patterns learned from known AI output. I spent months living inside these signals while building the detector that runs in HumanizeAI, and your writing has probably been through at least one system like it already: teachers run them, content platforms run them, publishers run them. What follows is the actual methodology, signal by signal, including the part vendors mention less often: detection gets unreliable on short texts, false positives hit real people, and the whole field is locked in an update race with the models it polices.

What is burstiness and why do detectors measure it?

Burstiness describes how predictably one word follows another, and it is the first thing most detectors measure.

A language model like GPT-4 writes by prediction: given everything so far, pick a likely next token, then repeat. The output flows because every token was a statistically safe choice. Humans don't write that way. We backtrack, reach for an obscure adjective mid-sentence, follow a textbook phrase with slang. Our predictability spikes and crashes; the model's stays level.

What a burstiness check is actually looking at

Human passage: predictability spikes and crashes; typos, regional phrasing, and stylistic swerves; high-probability runs get interrupted constantly.

Model passage: long runs of statistically safe tokens; every choice from the same probability machinery; smooth, level, uninterrupted.

Detectors quantify this by checking how often high-probability tokens appear in a row. A long run of tokens that all sat in the top slice of a model's probability distribution is a strong machine signal, because human writing interrupts those runs constantly, with typos, regional phrasing, or a deliberate stylistic swerve. GPTZero has published methodology showing it analyzes token probabilities to catch exactly this clustering. When I implemented a burstiness check myself, the surprise was how little code it takes; the signal really is that loud.
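As a rough illustration of the run-counting idea, here is a minimal Python sketch. It assumes you already have a per-token probability list from some reference model; the function names, the sample numbers, and the 0.5 cutoff are all hypothetical choices for the example, not GPTZero's or any other vendor's actual method.

```python
from typing import List


def high_probability_runs(token_probs: List[float], threshold: float = 0.5) -> List[int]:
    """Return the lengths of consecutive runs of tokens whose probability
    is at or above `threshold` (an assumed cutoff, not a vendor setting)."""
    runs: List[int] = []
    current = 0
    for p in token_probs:
        if p >= threshold:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:  # close a run that reaches the end of the passage
        runs.append(current)
    return runs


def longest_run_fraction(token_probs: List[float], threshold: float = 0.5) -> float:
    """Share of the passage covered by its single longest high-probability run."""
    if not token_probs:
        return 0.0
    runs = high_probability_runs(token_probs, threshold)
    return max(runs, default=0) / len(token_probs)


# Made-up probabilities: the first list is interrupted often, the second is not.
human_like = [0.9, 0.8, 0.05, 0.7, 0.02, 0.9, 0.6, 0.01]
model_like = [0.9, 0.8, 0.7, 0.85, 0.9, 0.6, 0.75, 0.3]

print(high_probability_runs(human_like), longest_run_fraction(human_like))  # [2, 1, 2] 0.25
print(high_probability_runs(model_like), longest_run_fraction(model_like))  # [7] 0.875
```

A real detector would work from a reference model's actual token probabilities and would likely combine several run statistics. The sketch only shows why long uninterrupted runs stand out.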

One constraint matters for anyone interpreting a score: the signal needs volume. A 50-word answer gives the statistics almost nothing to grip; a 1,500-word essay gives them plenty. Burstiness-based detection is meaningfully more reliable on essays and articles than on tweets and short answers.

How does perplexity reveal machine-generated text?

Perplexity is the partner metric: how surprised a language model is when it reads your text. It is computed from the probability the model assigns to each token, averaged in log space across the passage and then exponentiated. Low perplexity means predictable. High means surprising.
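For readers who want the arithmetic, perplexity is conventionally the exponential of the average negative log-probability per token. The sketch below assumes the per-token probabilities have already been obtained from a reference model; the function name and the sample values are illustrative only.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-probability per token).

    `token_probs` holds the probability a reference model assigned to each
    token that actually appeared in the text (assumed input).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must be in (0, 1]")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


# If every token had probability 0.25, the model is as unsure as a
# uniform four-way guess, so perplexity comes out at about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Because the average is taken over logs, one very unlikely token raises perplexity more than a plain average of probabilities would suggest.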

The naive assumption that AI text must be low-perplexity because an AI wrote it misses the real signature. Models train on messy human internet text, so their output isn't uniformly bland. What distinguishes machine writing is consistency. A human passage swings: one sentence is utterly predictable, the next contains a rare construction or a weird verb. AI text holds a moderate, even perplexity from start to finish.

The tell: flatness, not blandness

Detectors run your text through a reference model, record per-word probabilities, and examine the variance, not just the average. They are hunting machine steadiness, and steadiness is very hard to fake one synonym at a time.

That is why detectors typically use a reference model (GPT-2-class transformers are common) and study the shape of the probability curve. Originality.ai and Copyleaks both fold perplexity analysis into their scoring. Building HumanizeAI's perplexity score taught me the same lesson from the other side: when I tried to cheat my own metric with synonym swaps, the curve barely moved.
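To make "variance, not just the average" concrete, here is a small sketch that measures the spread of per-token surprisal (negative log2 probability, in bits). The two probability lists are invented so that both passages share the same average surprise and differ only in how much it swings. This is not HumanizeAI's or any other tool's scoring code.

```python
import math
import statistics
from typing import Dict, Sequence


def surprisal_profile(token_probs: Sequence[float]) -> Dict[str, float]:
    """Mean and population standard deviation of per-token surprisal in bits."""
    bits = [-math.log2(p) for p in token_probs]
    return {
        "mean_bits": statistics.fmean(bits),
        "stdev_bits": statistics.pstdev(bits),
    }


# Hypothetical inputs: same average surprise, very different shape.
steady = [0.25] * 8                  # every token costs 2 bits
swingy = [0.5] * 4 + [0.125] * 4     # tokens cost 1 bit or 3 bits

print(surprisal_profile(steady))
print(surprisal_profile(swingy))
```

Both lists average 2 bits per token, so a detector looking only at the mean would call them identical. The standard deviation is what separates the flat passage from the one that swings.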

Why vocabulary diversity patterns matter to detection systems

The third signal is the easiest to grasp: who repeats what, and how.

Humans repeat carelessly. You might use "important" five times in an essay and never notice, or lean on "in order to" in every third paragraph. Those habits are individual, which is exactly why they read as human: your repetition pattern differs from everyone else's.

Language models repeat in the opposite direction. Trained to avoid obvious duplication, they rotate through "crucial," "vital," "essential," and "paramount": four words doing one job. The variety is technically impressive and slightly artificial. A person would just say "important" twice and move on.

Detectors also watch for vocabulary that is too uniformly correct. Real writers misuse "literally," botch a semicolon, write "the reason is because." Machine text deploys sophisticated vocabulary with eerie consistency and almost no errors. Systems trained on authentic human writing learn that a certain level of imperfection is normal, and its complete absence is itself a signal.
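One simple, long-established way to put a number on repetition is the type-token ratio: distinct words divided by total words. The sketch below is only an illustration of that idea. The tokenizer is a crude regex and the sample sentences are invented; real detectors presumably use richer features than this.

```python
import re
from collections import Counter
from typing import Any, Dict


def vocabulary_profile(text: str) -> Dict[str, Any]:
    """Count total words, distinct words, and the most repeated words."""
    words = re.findall(r"[a-z']+", text.lower())  # crude tokenizer (assumption)
    counts = Counter(words)
    total = len(words)
    return {
        "tokens": total,
        "types": len(counts),
        "type_token_ratio": len(counts) / total if total else 0.0,
        "top_repeats": counts.most_common(3),
    }


# Invented examples: careless repetition versus synonym rotation.
repeats = "This is important. That is important too."
rotates = "This is crucial. That is vital too."

print(vocabulary_profile(repeats))
print(vocabulary_profile(rotates))
```

The rotating version scores a higher ratio on the same number of words. Type-token ratio falls as text gets longer, so it is only comparable between passages of similar length.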

What role does text length play in detection accuracy?

Length is the variable nobody advertises: longer text is easier to classify, and the difference is large.

Every signal above is statistical, and statistics need samples. A single GPT-4 sentence is often statistically indistinguishable from a human one; the burstiness and perplexity patterns only become reliable with volume.

Under 100 words: most detectors perform poorly. Treat any score here with heavy skepticism.

500 words: enough volume for the statistics to grip. Scores start meaning something.

2,000 words: high-confidence territory for every major detector.

This is why I built HumanizeAI to warn on very short inputs: the math genuinely cannot see them.

Turnitin's history illustrates the constraint. When it launched AI detection in April 2023, it acknowledged lower confidence on brief submissions, and teachers reported incorrect flags on student work during the early deployment. The system improved with updates, but no update repeals the underlying math: five sentences cannot establish a pattern.

The practical rule: trust detector verdicts roughly in proportion to text length. A score on a discussion-board post or a short answer deserves heavy skepticism. A score on a full article is working with enough data to mean something.
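The sample-size point can be sketched with the textbook standard error of a mean, which shrinks with the square root of the number of samples. The example idealizes heavily: it treats tokens as independent draws and assumes a per-token spread of 3 bits. Both are assumptions made for illustration, not measured values.

```python
import math


def standard_error(stdev: float, n: int) -> float:
    """Standard error of a sample mean: stdev / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return stdev / math.sqrt(n)


ASSUMED_STDEV_BITS = 3.0  # hypothetical per-token spread, for illustration only

for n_tokens in (50, 500, 2000):
    se = standard_error(ASSUMED_STDEV_BITS, n_tokens)
    print(f"{n_tokens:>5} tokens -> average surprisal uncertain by about {se:.3f} bits")
```

Under these assumptions, going from 50 to 2,000 tokens narrows the uncertainty by a factor of roughly six. That is the arithmetic behind trusting long-text scores more than short-text ones.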

How do detectors handle paraphrasing and humanized text?

Here is where it gets practical: what happens when someone deliberately breaks the patterns?

Manual editing works on the same dials the detectors read. Rewrite AI sentences in your own words and you inject your own quirks, raising perplexity where the model was smooth and breaking up runs of high-probability tokens. The more substantive the edit, the weaker the AI signal.

Humanizer tools industrialize the process. I build one, so I can describe the mechanism without mystique: find the statistical tells, then vary sentence structure, plant rare vocabulary, and shift pacing until the numbers move. The mechanism is sound; whether it beats a given detector on a given day is genuinely open, because detectors retrain on examples of humanized text and learn its tells too. Some systems now specifically flag unnatural vocabulary insertions, the scar tissue of crude humanization.

It is an arms race with no permanent winner. Detectors update, humanizers adapt, detectors update again. I watch this cycle from both sides of the fence, and it is the reason no honest tool promises that your text will pass a specific detector indefinitely, and the reason any tool that does promise it should worry you.

Which detectors use different methodologies, and what does that mean?

Detectors disagree with each other, and the disagreement is structural, not sloppy.

GPTZero weights burstiness and perplexity heavily and has published research on how passage entropy compares against baseline language models. Turnitin runs ensemble methods inside its plagiarism platform, almost certainly trained on real student submissions, a corpus nobody else has. Originality.ai builds for content publishers and uses proprietary scoring. Copyleaks, Sapling, and ZeroGPT each mix their own signals, training data, and thresholds.

85%: what one detector can score a passage as likely-AI.

30%: what another detector can score the exact same passage.

Run more than one detector before trusting any verdict that matters. Same text, different training data, different thresholds. Single-detector verdicts are weak evidence.

So the same passage can score 85 percent likely-AI on one tool and 30 percent on another. That spread doesn't make detection useless; it makes single-detector verdicts weak evidence. If a decision rides on the result, run more than one tool and compare.
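If you do collect scores from several tools, a small helper can make the disagreement explicit. The detector names, the 0 to 100 scale, and the 50-point cutoff below are all hypothetical; real tools report on different scales and with different thresholds, so treat this as a sketch of the comparison step only.

```python
from typing import Any, Dict


def summarize_scores(scores: Dict[str, float], threshold: float = 50.0) -> Dict[str, Any]:
    """Summarize how much several detectors disagree about one passage.

    `scores` maps a detector name to its likely-AI score on a 0-100 scale
    (an assumed common scale); `threshold` is an assumed cutoff.
    """
    if not scores:
        raise ValueError("need at least one score")
    values = list(scores.values())
    flagged = [name for name, score in scores.items() if score >= threshold]
    return {
        "spread": max(values) - min(values),
        "flagged_by": flagged,
        "unanimous": len(flagged) in (0, len(values)),
    }


# The 85 versus 30 split, with placeholder detector names.
print(summarize_scores({"detector_a": 85.0, "detector_b": 30.0}))
```

A wide spread or a non-unanimous result is a reason to look at the text itself before drawing any conclusion.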

Transparency helps here too. A binary flagged/clean verdict teaches you nothing, which is why I built HumanizeAI.chat to expose the breakdown (burstiness, vocabulary, and perplexity scored separately) so you can see which pattern triggered the flag and judge for yourself whether it's legitimate or a false alarm.

Can human-written text be flagged as AI, and vice versa?

Both failure modes are real, and both have victims.

False positives (human writing flagged as AI) cluster in predictable places. Formulaic styles trip the metrics: technical writing values precision over flair, so it reads statistically flat. Non-native English speakers often write in the careful, grammatically conservative register that language models were trained toward, and get flagged for it. Anyone who drafts in smooth, template-driven prose can replicate AI patterns by accident. I hear from these people regularly: users who paste their own writing into my detector and watch sentences they typed themselves light up as machine-like. It is the single most common confused-and-upset message I get.

False negatives run the other way. Heavily edited machine text, output from smaller models with different statistical signatures, and text from evasion-focused tools all slip through. Even casual cherry-picking (regenerating a passage five times and keeping the most natural one) can reduce detectability.

The policy stakes are concrete. A teacher who treats a detector score as sole proof of cheating will eventually punish an innocent student; a platform that trusts detection to block AI content will let plenty through. No detector is 100 percent accurate, mine included. The responsible reading, increasingly adopted by educators, is to treat a detector score as one signal among many, never a verdict on its own.

Closing thoughts

AI detectors measure three things: how predictable your token sequences are (burstiness), how evenly surprising your word choices run (perplexity), and how your vocabulary distributes. GPTZero, Turnitin, Originality.ai, and Copyleaks each weight those signals differently, which is why one passage can score 85 on one tool and 30 on another. The same math defines the limits: scores mean little under 100 words, false positives hit technical and non-native writers hardest, and every detector is mid-update in a race with the models it polices. I build these scoring functions for a living, and my honest advice is to read any single score as evidence, not verdict. If you want to see the signals on your own writing, HumanizeAI.chat shows live burstiness, perplexity, and vocabulary scores as you work, the same breakdown this post just explained, applied to your actual text.
