GEO / Search Strategy · 7 min read · 2026-06-22

We scanned every YC Spring 2026 batch startup. AI can reach them; it can't tell what they are.

We ran all 197 companies in YC's Spring 2026 batch through our live scanner. The scary headline you would expect, that they are all invisible single-page apps, is false. The real gap is quieter and far more common: AI can reach most of them, but they do not label what they are in machine-readable terms.

To analyze the batch we first had to list it, so we fetched YC's own directory at ycombinator.com/companies. It came back as about 33 kilobytes of HTML containing one line of visible text: the page title. Every company, every link, loads afterward as JavaScript. To a program that reads HTML and does not run JavaScript, the YC directory is a blank page.

That is a fitting place to start, because a small slice of YC's own companies have exactly that problem. Most do not. We scanned all 197 to find out precisely how many, and which, and the answer splits into two very different questions a founder can ask about their own site.

Can an AI reach you? Mostly yes. Can an AI tell what you are? Mostly no.

Most of the batch is reachable. That part is fine.

Here is the result up front, because the honest version is more useful than the alarming one. Of the 195 sites we could evaluate (two returned server errors at scan time and were set aside), 164 serve real content directly in their HTML. A crawler that does not run JavaScript, which describes most of what feeds AI answers today, reads a median of 682 words on these sites before a single script runs. They are readable. The "everyone is an invisible SPA" story is simply not true for this batch.

1 in 11 is a real site a crawler cannot see

Seventeen of the 195, about 1 in 11, are empty shells. Their front-door HTML is near empty, but their rendered page is full. The content exists. It is locked behind client-side JavaScript, invisible to any crawler that does not execute it. One site serves a single word of text to a crawler while its rendered homepage runs past 900 words. Another serves nine and renders past 1,200. These are not parked domains. They are real products whose copy an AI crawler reading raw HTML would never reach.

We counted a site as a shell only when it is JavaScript-rendered, serves under 50 words at the front door, and renders 100 or more once the script runs. That last condition is the point: the content is really there, just hidden. We measured it with the scanner's own front-door-versus-rendered word count, the same number any user of the scanner sees. The gap between the two is exactly what a non-JS crawler misses. It is a deterministic fact, not a prediction about ranking.
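For readers who want to reproduce the front-door number on their own page, here is a minimal sketch of a raw-HTML word count. It is not the scanner's implementation: the function name `front_door_words` and the choice to ignore text inside `script`, `style`, and `template` tags are assumptions made for illustration, and a different tokenizer would give slightly different counts.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts whitespace-separated words outside script/style/template tags."""

    # Assumption: these tags never contribute text a non-JS crawler would read as copy.
    SKIPPED = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that sits outside the skipped tags.
        if not self._skip_depth:
            self.words += len(data.split())


def front_door_words(html: str) -> int:
    """Word count of the HTML as served, with no JavaScript executed."""
    parser = VisibleTextCounter()
    parser.feed(html)
    parser.close()
    return parser.words
```

Run it on the HTML your server returns to a plain HTTP client, then compare with the word count of the page after it renders in a browser. A large gap between the two is the shell pattern described above.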

A separate handful simply has not launched

Eleven more sites are content-light both at the front door and after rendering. These are not a crawler problem. They are pre-launch: coming-soon pages, waitlists, stealth splash screens. We counted them apart and kept them out of the figure above on purpose, because folding them in would inflate it. Three more were thin static pages and two were off-host redirects, also set aside. The shell number counts only real content that is hidden, nothing else.

Now the harder question: can an AI tell what you are?

Reaching a page is not the same as understanding it. An AI that fetches one of these homepages gets a wall of prose and, usually, no machine-readable labels telling it whether it is looking at a fintech API, a healthcare app, or a dev tool. The text is there to read. The structured signals that let a machine sort, quote, and act on it mostly are not.

What the batch labels, and what it skips

Share of 195 sites, each axis a single measured check: Crawler access 91%, HTTPS 97%, Sitemap 68%, Canonical 56%, Image alt 54%, Schema (any) 50%, On-page FAQ 28%, FAQ schema 19%.

The top of the wheel (reach: crawler access, HTTPS) is strong. The bottom (labeling: schema, FAQ markup) is where the batch thins out. Every value is a yes/no check counted across 195 sites, not a composite score.

The labeling gap, in plain numbers. Half the batch, 50 percent, carries any structured-data markup at all; only 41 percent carry a type a machine can actually use. FAQ markup, the most answer-friendly format there is, appears on 19 percent. A canonical tag, which tells a crawler which URL is the real one, is set on 56 percent. None of this is exotic, and none of it is hard. It is the machine-readable labeling that turns "a page an AI can read" into "a page an AI can quote correctly."
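A rough way to see these labels on a single page is to pull the canonical link and the JSON-LD types out of the served HTML. The sketch below is an illustration under stated assumptions, not the scanner's logic: `scan_labels` is a hypothetical name, it reads only JSON-LD (not Microdata or RDFa), and it treats malformed JSON-LD as absent.

```python
import json
from html.parser import HTMLParser


class LabelScanner(HTMLParser):
    """Collects the canonical URL and every JSON-LD @type found in the page."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.schema_types = set()
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and "canonical" in (a.get("rel") or "").lower().split():
            self.canonical = a.get("href")
        elif tag == "script" and (a.get("type") or "").lower() == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self._collect(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # assumption: unparseable JSON-LD counts as no markup

    def _collect(self, node):
        # Walk nested objects and arrays (including @graph) and gather @type values.
        if isinstance(node, dict):
            t = node.get("@type")
            if isinstance(t, str):
                self.schema_types.add(t)
            elif isinstance(t, list):
                self.schema_types.update(x for x in t if isinstance(x, str))
            for value in node.values():
                self._collect(value)
        elif isinstance(node, list):
            for item in node:
                self._collect(item)


def scan_labels(html: str) -> dict:
    scanner = LabelScanner()
    scanner.feed(html)
    scanner.close()
    return {
        "canonical": scanner.canonical,
        "schema_types": sorted(scanner.schema_types),
        "has_faq_schema": "FAQPage" in scanner.schema_types,
    }
```

An empty `schema_types` list with `canonical` set to `None` is the unlabeled case the numbers above describe: readable prose, nothing a machine is told outright.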

To be precise about the claim, since it is easy to overstate: a modern model can read prose perfectly well and infer a great deal from it. Missing schema does not make a page incomprehensible. It makes a machine guess at things it could instead be told: the product type, the FAQ, the price, the answer. That matters more as AI shopping and answer agents start acting on that information rather than just summarizing it.

The rest of the picture

A few other things we checked, since every site was already loaded.

Blocking AI crawlers. Eighteen sites, about 9 percent, block at least one major AI crawler in robots.txt. GPTBot and ClaudeBot are blocked most often, PerplexityBot close behind, and it is the same eighteen sites each time, a wholesale block rather than a considered one. None block everything. In most of these cases the block reads like an unreviewed framework or template default rather than a decision, which is worth a thirty-second look if you are one of them.
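The thirty-second look can be scripted with Python's standard-library robots.txt parser. This is a sketch, not the scanner's check: the helper name `blocked_ai_crawlers` and the probe URL are assumptions, and the three user agents are simply the ones named above.

```python
from urllib.robotparser import RobotFileParser

# The three crawlers named in this post; extend the tuple as needed.
AI_CRAWLERS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def blocked_ai_crawlers(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """Return the AI crawlers that robots.txt forbids from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, url)]
```

Paste in the text of your own robots.txt and pass your homepage URL. Any name in the returned list is a crawler that cannot read the page, whether or not anyone decided that on purpose.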

The basics, mostly handled. Ninety-seven percent serve HTTPS. Crawlers can reach 91 percent without a robots block. Two-thirds publish a sitemap. The weak spots inside the basics are canonical tags at 56 percent and image alt text at 54 percent, both easy, both half-done.

One number to read carefully. Only 4 percent of the sites we had performance data for clear the strict Core Web Vitals bar (largest-contentful-paint under 2.5 seconds, plus low layout shift and low blocking time, all three at once). That sounds catastrophic and mostly is not. These are lab measurements taken on a cold load with no caching, which run pessimistic against what real users experience, and it is a single low-weight check in our scoring. Read it as "almost nobody clears the strict lab threshold," not "96 percent of these sites feel slow."

The five that did both

We are naming the best, not the worst. These five reach and label: they serve substantial content directly to crawlers, set a clean canonical, and ship real structured data, while skipping the accidental own-goals in robots.txt. Results are shown for the exact URL we scanned, so if you check one yourself, scan the same address. Each scored identically on the apex and www forms, so there is nothing to explain away.

| # | Company | Scanned URL | What it gets right |
|---|---------|-------------|--------------------|
| 1 | Silmaril | silmaril.dev | Full content served to crawlers, complete structured data, clean canonical |
| 2 | Tasklet | tasklet.ai | Content-rich front door, strong schema coverage |
| 3 | Trellis | trellistech.com | Served content plus complete schema coverage |
| 4 | BentoLabs AI | bentolabs.ai | Content served directly, structured data present |
| 5 | RentAHuman | rentahuman.ai | Content served, structured data across multiple page types |

The common thread is dull, and that is the point. They render their content where a crawler can see it, describe it with schema, and do not close the door in robots.txt. The one thing nearly all of them still trip on is the strict Core Web Vitals bar, the same check almost the entire batch fails. None of this is hard. It is the default you get from treating the marketing site like a document rather than an app.

The afternoon fix list

If you are in the batch, or building anything like it, this is the short version. Most of it is an afternoon.

  • If your front-door HTML is a shell, pre-render or server-render the marketing pages so the page a crawler reads is the page a person reads.
  • Add Organization plus your real product type (SoftwareApplication or Product) schema, the single biggest gap in the batch.
  • Add a genuine FAQ with FAQPage markup, the most answer-friendly format and the rarest, at 19 percent.
  • Grep your robots.txt for an accidental GPTBot or ClaudeBot block.
  • Put your actual answer in the first 150 words.
  • If you have been told to ship an llms.txt for Google's sake, do not ship it for that reason: Google has said on the record it does not use the file for Search, and named it directly in its 2026 AI guidance as a tactic that does not help. The file itself is cheap, agent-readable infrastructure that some coding tools and assistants do fetch, so it is not worthless. It is just not a Google lever.
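As a starting point for the schema items on this list, the sketch below builds one JSON-LD script tag holding an Organization, a SoftwareApplication, and a FAQPage. Everything concrete in it is a placeholder: the function name `build_jsonld`, its parameters, and the sample values are hypothetical, and only a minimal set of schema.org properties is shown. A given search engine's rich-result rules may ask for more.

```python
import json


def build_jsonld(org_name, url, product_name, category, faqs):
    """Return a <script type="application/ld+json"> tag for the marketing page.

    `faqs` is a list of (question, answer) pairs taken from the visible page copy.
    """
    graph = [
        {"@type": "Organization", "name": org_name, "url": url},
        {
            "@type": "SoftwareApplication",
            "name": product_name,
            "applicationCategory": category,
            "url": url,
        },
        {
            "@type": "FAQPage",
            "mainEntity": [
                {
                    "@type": "Question",
                    "name": question,
                    "acceptedAnswer": {"@type": "Answer", "text": answer},
                }
                for question, answer in faqs
            ],
        },
    ]
    payload = json.dumps({"@context": "https://schema.org", "@graph": graph}, indent=2)
    # Escape "</" so an answer containing "</script>" cannot close the tag early.
    payload = payload.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"
```

Render the returned tag into the server-side HTML of the page, not through client-side JavaScript, so it is present in the front-door response. Keep the FAQ text identical to the questions and answers a visitor actually sees.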

How we did this, and what to distrust

This is the part that matters most, so read it before you argue with the numbers.

Every figure comes from one source: our live scanning engine, the same one anyone can run on their own site. No number here was estimated, modeled, or written by an AI. The engine fetches each site, renders JavaScript-heavy ones with a real headless browser, and reports deterministic facts: how many words a crawler sees, whether robots.txt blocks a given bot, which schema types are present. Re-run it and you get the same answer within a point or two; the only run-to-run variable is third-party performance data.

The batch. YC Spring 2026, enumerated from YC's own directory on 2026-06-22. 197 companies, each scanned at the exact website it lists in that directory. We evaluated 195; two (Gravy and Drip) returned server errors at scan time, so we marked them not evaluated rather than guess.

The thresholds, stated so you can disagree. A site is content-light if a crawler sees fewer than 50 words at the front door, content-rich if the rendered page has at least 100. The 1-in-11 shell figure counts only sites that are JavaScript-rendered, content-light at the front door, and content-rich once rendered, that is, real content that is hidden. Move the thresholds and the exact count shifts a little. The shape does not.
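The shell rule above is small enough to write down as code. This is a sketch of the stated definition, not the engine itself: the function and constant names are hypothetical, and the thresholds are exposed as parameters so you can move them and see what changes.

```python
FRONT_DOOR_MAX = 50   # content-light: fewer than 50 words at the front door
RENDERED_MIN = 100    # content-rich: at least 100 words once rendered


def is_shell(front_door_words: int, rendered_words: int, js_rendered: bool,
             front_door_max: int = FRONT_DOOR_MAX,
             rendered_min: int = RENDERED_MIN) -> bool:
    """True only when real content exists but is hidden behind client-side JS."""
    content_light = front_door_words < front_door_max
    content_rich = rendered_words >= rendered_min
    return js_rendered and content_light and content_rich
```

A site that is content-light both before and after rendering fails the third condition, which is how pre-launch pages stay out of the shell count.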

The limits. We scanned each company's listed homepage, not its whole site, and scores are specific to the URL scanned: a site that serves different content on its apex and www forms can score differently on each. Where rendering failed we marked the site not evaluated rather than score a blank. Core Web Vitals are lab metrics and run pessimistic against real-user data. That is the whole method.

If you want to know which of these groups your own site is in, scan it at potatometer.com. It takes about thirty seconds and shows you exactly what a crawler sees before any JavaScript runs.

Sources

  • Y Combinator startup directory (Spring 2026 batch)
  • Google Search Central, 2026 AI features optimization guidance (on llms.txt)
  • Potatometer scanner (the engine behind every number here)
