GEO · 8 min read
We Re-Weighted Our GEO Score Against the Evidence, and llms.txt Lost
Most GEO tools score on folklore. We rebuilt our scoring around published evidence from Google, Zyppy, and Dejan. Here is what changed, and why llms.txt got demoted.
There is a quiet problem running through most of the new GEO tools, the ones that promise to score how ready your site is for AI search. Almost none of them will show you why their score is weighted the way it is. You get a number, a color, maybe a checklist. What you do not get is the evidence behind any of it.
We think that is backwards, especially for a category built on top of systems nobody fully understands yet. So we did something a little uncomfortable. We went back through our own GEO scoring, checked each weight against the best published evidence we could find, and changed the ones that did not hold up. This post is what we changed and why.
The headline: the most popular GEO "best practice" of the last year, the llms.txt file, lost most of its weight. Here is how we got there.
The principle: no folklore in the score
When we started, GEO advice was mostly vibes. Someone would notice they got cited in ChatGPT, guess at why, write it up, and it would spread. A lot of today's "AI optimization checklists" are still built on that chain of guesses.
We made one rule for our score: every weighted check has to be tied to something observable, not to folklore. That sounds obvious until you apply it, because when you actually go looking for evidence, some of the most repeated advice has nothing underneath it.
To pressure-test our weights we leaned on three sources that do real work rather than repeat each other:
- Google's own guidance on optimizing for its generative AI features, published and updated this year, which is as close to a primary source as this field has.
- Zyppy's evidence-based analysis, which scored AI citation factors across 54 experiments, patents, and case studies.
- Dejan's grounding research, which reverse-engineers how the models actually retrieve and cite, down to what they can and cannot see when they fetch a page.
Where those three agreed, we trusted it. Where our score disagreed with all three, our score was the thing that was wrong.
What lost weight
llms.txt
This is the big one. The llms.txt file, an emerging idea that you add a special text file to tell AI models what matters on your site, had become one of the most recommended GEO tactics around. Half the tools in the space will generate one for you.
The evidence for it is essentially nonexistent. Zyppy's analysis rated llms.txt at the very bottom of the factors it looked at, finding no credible experiments showing it influences AI citations at all. And Google could not have been blunter. In its own guidance, it lists creating unnecessary AI text files like llms.txt among the things you can simply ignore, alongside other tactics it calls GEO hacks.
When the company whose AI surfaces you are trying to win and an independent analysis of 54 sources both tell you the same thing, that is not a close call. We dropped llms.txt from a heavily weighted check to a near-zero, informational one. We will still tell you if you have it, because it does no harm. But it will no longer meaningfully move your score, because the evidence says it should not.
FAQ schema
FAQ schema was carrying more weight in our score than the evidence justified. Structured data does appear to help. Across the studies that look at it, the relationship between schema and citations is consistently positive. But it is also consistently small, and Google's own guidance is careful to say structured data is not required for generative AI search and there is no special markup you must add, even while recommending you keep using it for regular rich results.
A small but consistent positive effect deserves a small but real weight, not a large one. So we trimmed FAQ schema down to match the size of the effect the evidence actually shows. It still counts. It just stopped counting for more than it earns.
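To make the effect of a re-weighting concrete, here is a minimal sketch of a weighted score. Every check name and number in it is an illustrative assumption, not Potatometer's actual weights; `other_checks` is a hypothetical stand-in for everything else the score measures.

```python
# Hypothetical weights for illustration only; not real product values.
OLD_WEIGHTS = {"llms_txt": 25, "faq_schema": 25, "other_checks": 50}
NEW_WEIGHTS = {"llms_txt": 1, "faq_schema": 9, "other_checks": 90}


def geo_score(results: dict[str, bool], weights: dict[str, float]) -> float:
    """Return a 0-100 score: the weighted share of checks that passed."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    # A check missing from `results` is treated as failed.
    earned = sum(w for name, w in weights.items() if results.get(name, False))
    return round(100 * earned / total, 1)


# A site that has llms.txt and FAQ schema but fails everything else.
site = {"llms_txt": True, "faq_schema": True, "other_checks": False}
old_score = geo_score(site, OLD_WEIGHTS)
new_score = geo_score(site, NEW_WEIGHTS)
print(old_score, new_score)
```

Under these made-up numbers, the same site drops from 50.0 to 10.0. The site did not change; the score simply stopped rewarding checks the evidence does not support.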
What gained weight, or got added
Re-weighting is not only about taking away. The same evidence that demoted llms.txt pointed clearly at things we were underrating.
Whether AI crawlers can actually reach you
The single highest-value factor in Zyppy's analysis was URL accessibility, simply whether the AI crawlers can get to your page at all. This sounds basic, but it has become a real and growing problem, because more sites are blocking AI crawlers at the CDN or firewall level, often without realizing it. A plain robots.txt check misses this entirely, since the block happens above robots.txt.
So we made per-crawler reachability a first-class, heavily weighted check. Not "does your robots.txt mention GPTBot," but "can each major AI crawler actually fetch this page." If an assistant's crawler cannot reach you, nothing else you do matters, which is exactly why it deserves the most weight.
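A rough way to approximate this yourself is to request the page once per crawler identity and compare the responses. The sketch below is an assumption-heavy illustration, not how our check is implemented: the User-Agent strings are simplified placeholders (real crawlers send longer strings, documented by each vendor), and the function names are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable

# Simplified placeholder User-Agent strings (assumptions, not the exact
# strings the real crawlers send).
AI_CRAWLER_AGENTS = {
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "ClaudeBot": "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "PerplexityBot": "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
}


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET sent with `user_agent`, or 0 on network failure."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 403 from a firewall rule
    except OSError:
        return 0  # DNS failure, timeout, connection reset, and similar


def crawler_reachability(
    url: str, fetch: Callable[[str, str], int] = fetch_status
) -> dict[str, bool]:
    """Map each crawler name to whether a request with its User-Agent got a 2xx."""
    return {
        name: 200 <= fetch(url, agent) < 300
        for name, agent in AI_CRAWLER_AGENTS.items()
    }
```

Calling `crawler_reachability("https://example.com/")` returns one pass or fail per crawler. Sending a crawler's User-Agent from your own machine only catches blocks keyed on the User-Agent; rules keyed on IP ranges will not show up, so treat a pass as a hint rather than proof.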
Whether your answer is near the top of the page
This one comes straight from Dejan's grounding research, and it is the kind of finding you only get from looking at the machinery. When some models go to ground an answer, they do not read your whole page. They work from a shallow slice: the title, the URL, and a short snippet a few hundred characters long. Zyppy's work points the same direction, rating "answer near the top" as a high-evidence factor.
The implication is concrete. If the actual answer to the question lives halfway down your page, under three paragraphs of preamble, the model may never see it. So we added a check for whether a clear, complete answer shows up near the top of the page, in the first stretch of content rather than buried below the fold. It is a deterministic thing to look for, and it reflects how retrieval actually works rather than how we wish it worked.
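One deterministic way to look for this is to take the first stretch of visible text and test whether the key terms of the answer appear in it. The sketch below is a simplified assumption, not our production check: the 500-character window, the skipped tags, and the idea of passing in `answer_terms` by hand are all illustrative choices.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping tags that rarely hold the answer."""

    SKIP = {"script", "style", "nav"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def leading_text(html: str, window: int = 500) -> str:
    """Return the first `window` characters of whitespace-normalized text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return text[:window]


def answer_near_top(html: str, answer_terms: list[str], window: int = 500) -> bool:
    """True if every answer term appears within the leading window."""
    lead = leading_text(html, window).lower()
    return all(term.lower() in lead for term in answer_terms)


filler = "Welcome to our blog. " * 40  # roughly 840 characters of preamble
answer = "The boiling point of water is 100 C at sea level."
buried = f"<html><body><p>{filler}</p><p>{answer}</p></body></html>"
up_top = f"<html><body><p>{answer}</p><p>{filler}</p></body></html>"
print(answer_near_top(buried, ["boiling point", "100 C"]))
print(answer_near_top(up_top, ["boiling point", "100 C"]))
```

The two pages contain identical content, yet only the second passes, because in the first the answer sits beyond the window a shallow retrieval step would see.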
Whether your canonical tags are actually consistent
Glenn Gabe's work on canonical tags surfaced a failure mode we were not catching: canonical conflicts cascade into AI search. A canonical tag that points somewhere Google overrules can lead to the wrong page being cited downstream. Our old check only confirmed a canonical tag existed. We extended it to catch the obvious conflicts, like a canonical pointing to a page that does not return a clean response, or pointing off to a different domain.
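The two conflicts named above can be sketched as a small function. This is an illustration under stated assumptions rather than our implementation: "clean response" is taken to mean HTTP 200, a different hostname counts as a different domain (so `www` and non-`www` would be flagged), only the first canonical tag is examined, and `status_of` is a hypothetical callable you supply to look up a URL's status.

```python
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urljoin, urlsplit


class _CanonicalFinder(HTMLParser):
    """Collect href values from <link rel="canonical"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if "canonical" in rel and attr.get("href"):
            self.hrefs.append(attr["href"])


def canonical_issues(
    page_url: str, html: str, status_of: Callable[[str], int]
) -> list[str]:
    """Return a list of problems with the page's first canonical tag."""
    finder = _CanonicalFinder()
    finder.feed(html)
    finder.close()
    if not finder.hrefs:
        return ["missing canonical tag"]

    # Resolve relative hrefs against the page URL.
    target = urljoin(page_url, finder.hrefs[0])
    issues = []
    if urlsplit(target).hostname != urlsplit(page_url).hostname:
        issues.append(f"canonical points to a different domain: {target}")
    if status_of(target) != 200:
        issues.append(f"canonical target does not return 200: {target}")
    return issues
```

For example, `canonical_issues(url, html, status_of=my_status_lookup)` returns an empty list when the canonical resolves to a same-host URL that answers with 200. In real use `status_of` would make an HTTP request, ideally without following redirects, so that a canonical pointing at a redirecting URL is also reported.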
Why we are telling you all this
The honest reason is that this is the whole point of a trust-first tool. If we are going to ask you to act on a score, you should be able to see why the score is built the way it is, and you should be able to push back when the evidence changes. The evidence in this field will keep changing. When it does, the weights should move again, and we will write up that change too.
The less honest but also true reason is that almost nobody else in this space will do this, because doing it means admitting which of your checks were folklore. We would rather show our work and be corrected than hand you a confident number with nothing underneath it.
If you want to see your own site scored this way, on weights tied to published evidence rather than guesses, that is what Potatometer is for. And if you think we got a weight wrong, tell us. That is sort of the entire idea.