White paper · May 7, 2026 · 18 min read

Crawlfix white paper: SEO for the rendered web

A reference for engineers and founders shipping React, Next.js, Vue, and other JavaScript-rendered sites. Covers the mechanics of two-pass indexing, AI search eligibility, and a practical detection checklist.

By Crawlfix Labs

This paper covers how modern search and AI answer engines actually consume JavaScript-rendered sites, the most common failure modes we see in production, and a concrete detection and remediation checklist. Citations are to Google's published documentation and to industry research on AI answer engines. Where the evidence is mixed, we say so.

1. Why this paper exists

Search has split into two systems with overlapping but not identical requirements. Traditional Google search still drives the largest share of organic traffic for most sites. AI answer engines, including Google's own AI Overviews, ChatGPT, Perplexity, and Bing's chat experience, drive an increasing share of high-intent queries. Both systems crawl the web. Both systems index. The signals they weight are similar but not the same.

For sites built as single-page applications, the technical baseline of "is this page indexable" is no longer obvious. The page returns 200. The bundle hydrates in the browser. Whether the content ever reaches the index, and in what form, depends on a chain of decisions made by the search system after the first fetch.

This paper is the reference we wish existed when we were first triaging these failures. It is written for engineers who own deploys, founders who own the bottom line, and anyone in between trying to understand why a healthy-looking site is invisible.

2. The mechanics of two-pass indexing

Modern search systems index JavaScript-rendered content using a two-pass model. The mechanics are well-documented by Google and consistent with what we observe in production.

2.1 First pass: raw HTML

When a search system fetches a URL, the first response is the raw HTTP body returned by the origin. This pass is fast, cheap, and run against essentially every URL the system discovers. The first pass is what most legacy SEO tools simulate when they "crawl" a site.

For a server-rendered page, the raw HTML contains the rendered content. The first pass alone is sufficient to extract the title, description, headings, body text, and structured data. The page is eligible for indexing immediately.

For a client-rendered page, the raw HTML contains the application shell: a small wrapper, a few script tags, and a placeholder element. The actual content does not exist yet. The first pass cannot extract anything meaningful.
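The difference between the two cases is easy to demonstrate. Below is a minimal sketch, using regex heuristics rather than a production HTML parser, of what a first-pass fetch can extract from a server-rendered page versus an application shell. The sample HTML strings are illustrative:

```python
import re

def first_pass_view(raw_html: str) -> dict:
    """Approximate what a first-pass (no-JavaScript) fetch can extract."""
    title = re.search(r"<title>(.*?)</title>", raw_html, re.S)
    # Strip tags from the body to estimate visible text.
    body = re.search(r"<body[^>]*>(.*)</body>", raw_html, re.S)
    text = re.sub(r"<[^>]+>", " ", body.group(1)) if body else ""
    return {"title": title.group(1).strip() if title else None,
            "word_count": len(text.split())}

# Server-rendered: the content is in the raw response.
ssr = ("<html><head><title>Pricing - Acme</title></head>"
       "<body><h1>Pricing</h1><p>Plans start at $9 a month.</p></body></html>")
# Client-rendered shell: a root node and a script tag, nothing to index.
shell = ("<html><head><title>App</title></head>"
         "<body><div id='root'></div><script src='/bundle.js'></script></body></html>")

print(first_pass_view(ssr))    # real title, real words
print(first_pass_view(shell))  # fallback title, zero words
```

Against the shell, the first pass recovers a fallback title and an empty body; there is simply nothing there to index until the second pass runs.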

2.2 Second pass: rendering queue

Pages whose raw HTML is insufficient are queued for rendering. The system runs the page in a real browser, waits for hydration, and then re-evaluates the rendered DOM. Indexing decisions are made against the rendered output.

The rendering queue is not free. Google has stated publicly that rendering capacity is finite and prioritized. Pages with strong external signals (backlinks, brand authority, internal link weight) are rendered faster. Pages with weak signals can wait days or weeks. Some never finish.

This is the source of most "invisible site" incidents. The page is not blocked. It is not excluded. It is queued, and the queue is full, or the render failed silently, or the system decided the rendered output was not worth the cost of re-evaluating.

2.3 Implications

Three implications follow directly from the two-pass model.

  1. The first-pass HTML is load-bearing. Anything you want indexed quickly should be present in the raw response.
  2. Pages with weak external signals depend disproportionately on raw HTML, because they will not get rendered fast.
  3. Server-side rendering or static generation is not a "performance optimization." It is the difference between guaranteed first-pass indexability and best-effort second-pass eligibility.

3. AI answer engines: the same problem, different stakes

AI answer engines crawl the web much as traditional search does, but they are generally less tolerant of pages that require rendering: the evidence is mixed across vendors, but several prominent AI crawlers appear to execute little or no JavaScript, so content that exists only after hydration may never reach them at all. The signals AI engines weight when deciding whether to cite a passage are also not identical to traditional ranking signals.

Based on Google's published guidance and on industry research from teams studying answer-engine behavior, the picture is consistent.

3.1 What AI engines look for

  • Direct answers in the first passage of a section. Models extract passages, not pages. The first sentence after a heading carries most of the weight.
  • Question-shaped headings. Headings phrased as user questions match against retrieval queries with higher precision.
  • Verifiable trust signals. Named authors, bios, citations to primary sources, visible publish and update dates.
  • Entity consistency. The same name for the brand, product, and key concepts across the page.
  • Structured data where it fits. Article, FAQ, Product, and Organization schemas help models map content to entities.
  • Content reachable in raw HTML, or via a fast and reliable render.
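Some of these signals are scriptable. As an illustration, here is a rough heuristic for spotting question-shaped H2/H3 headings; the question-word list is our own assumption, not a published specification:

```python
import re

# Illustrative list of words that open question-shaped headings.
QUESTION_WORDS = ("how", "what", "why", "when", "which", "who", "where",
                  "can", "does", "is", "should")

def question_shaped_headings(html: str) -> list:
    """Return H2/H3 heading texts phrased as user questions."""
    headings = re.findall(r"<h[23][^>]*>(.*?)</h[23]>", html, re.S | re.I)
    shaped = []
    for h in headings:
        text = re.sub(r"<[^>]+>", "", h).strip()
        words = text.split()
        first = words[0].lower() if words else ""
        if first in QUESTION_WORDS or text.endswith("?"):
            shaped.append(text)
    return shaped

page = """<h2>How does two-pass indexing work?</h2>
<h2>Pricing</h2>
<h3>Why raw HTML matters</h3>"""
print(question_shaped_headings(page))
```

A check like this is a lint, not a mandate: "Pricing" is a fine heading; the point is to know which sections are shaped for retrieval and which are not.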

3.2 What does not work

Several categories of "AI optimization" advice circulate that are not supported by either Google's documentation or by observed answer-engine behavior. Examples include AI-specific meta tags, hidden prompt-injection text, and attempts to game AI-detection scores. These either do nothing or, in cases where they are detected, harm eligibility.

3.3 Implication

The technical and editorial requirements for AI search eligibility are a strict superset of traditional SEO requirements. A page that fails first-pass indexing in traditional search will almost always fail AI citation as well. Solving the rendering problem solves both surfaces at once.

4. The failure modes we see most often

The following are the most common failure patterns we encounter in scans. They are listed in approximate order of impact.

4.1 Title and meta description set client-side

The page returns a fallback <title> (often "App", the project name, or the domain) in the raw HTML. The real title is set by a React effect, a head-management library, or a router transition.

Effect: the first-pass index records the wrong title. Search snippets show the fallback for weeks. AI engines cite the page (if at all) under the wrong topic.
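This failure is cheap to detect: fetch the URL without JavaScript and compare the raw title against known fallbacks. A sketch, with an illustrative (not exhaustive) fallback list:

```python
import re

# Titles we commonly see as client-side fallbacks (illustrative list).
FALLBACKS = {"app", "react app", "vite app", "document", ""}

def raw_title(raw_html):
    """Extract the <title> text from the raw (pre-hydration) HTML, or None."""
    m = re.search(r"<title[^>]*>(.*?)</title>", raw_html, re.S | re.I)
    return m.group(1).strip() if m else None

def title_looks_like_fallback(raw_html: str, domain: str) -> bool:
    t = raw_title(raw_html)
    if t is None:
        return True  # no title in raw HTML at all
    return t.lower() in FALLBACKS or t.lower() == domain.lower()

print(title_looks_like_fallback("<title>App</title>", "acme.com"))             # True
print(title_looks_like_fallback("<title>Pricing - Acme</title>", "acme.com"))  # False
```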

4.2 Empty body before hydration

The raw HTML body contains only a root element. All visible content is rendered client-side.

Effect: first-pass indexing returns thin-content signals. The page is queued for rendering, which may or may not happen quickly, and almost certainly will not happen for pages with weak external signals.

4.3 Internal links injected by JavaScript

Navigation, footer links, and contextual links inside content are produced by client-side rendering. The raw HTML contains zero or very few internal links.

Effect: the search system cannot follow the link graph from the entry point. Crawl depth collapses. Interior pages are not discovered, or are discovered only via the sitemap, which is a much weaker signal than an editorial link.
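A quick way to measure this is to count same-host links in the raw response. A stdlib-only sketch; the attribute-matching regex is a heuristic and assumes quoted href values:

```python
import re
from urllib.parse import urlparse

def internal_links(raw_html: str, site_host: str) -> list:
    """Hrefs in the raw HTML that are relative or point at the same host."""
    hrefs = re.findall(r'<a\s[^>]*href=["\']([^"\']+)["\']', raw_html, re.I)
    internal = []
    for href in hrefs:
        parsed = urlparse(href)
        if parsed.scheme in ("", "http", "https") and parsed.netloc in ("", site_host):
            if not href.startswith(("#", "mailto:", "tel:")):
                internal.append(href)
    return internal

shell = '<body><div id="root"></div></body>'
ssr = ('<body><a href="/pricing">Pricing</a> '
       '<a href="https://acme.com/docs">Docs</a> '
       '<a href="https://other.com">X</a></body>')
print(len(internal_links(shell, "acme.com")))  # 0 -- the link graph is invisible
print(len(internal_links(ssr, "acme.com")))    # 2
```

Zero internal links in the raw HTML of a navigation-heavy page is the signature of this failure.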

4.4 Structured data set client-side

JSON-LD blocks are inserted by a client effect. The raw HTML has no schema.

Effect: rich result eligibility is delayed or lost. AI engines that extract entities from JSON-LD do not see the entities.
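Checking for this is a matter of looking for parseable JSON-LD in the raw response. A sketch:

```python
import json
import re

def raw_jsonld_types(raw_html: str) -> list:
    """@type values found in JSON-LD blocks present in the raw HTML."""
    blocks = re.findall(
        r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
        raw_html, re.S | re.I)
    types = []
    for block in blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is as invisible as missing JSON-LD
        items = data if isinstance(data, list) else [data]
        for item in items:
            t = item.get("@type")
            if t:
                types.append(t)
    return types

page = ('<script type="application/ld+json">'
        '{"@context":"https://schema.org","@type":"Article","headline":"X"}'
        '</script>')
print(raw_jsonld_types(page))                     # ['Article']
print(raw_jsonld_types('<div id="root"></div>'))  # []
```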

4.5 Canonical tag set client-side or pointing to a parameterized URL

The canonical link tag is missing in raw HTML and is set after hydration, or it points to a URL that includes session or tracking parameters.

Effect: duplicate content signals fire before the canonical resolves. Index consolidation goes to the wrong URL.
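Both variants of this failure are detectable from the raw response. A sketch of the check; the regex assumes rel appears before href within the tag, and the tracking-parameter list is illustrative:

```python
import re
from urllib.parse import urlparse, parse_qs

# Parameters that usually signal a tracking or session URL (illustrative list).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "gclid", "fbclid", "sessionid"}

def canonical_issues(raw_html: str) -> list:
    """Flag a missing canonical, or one carrying tracking/session params."""
    issues = []
    m = re.search(r'<link[^>]*rel=["\']canonical["\'][^>]*href=["\']([^"\']+)["\']',
                  raw_html, re.I)
    if not m:
        return ["canonical missing from raw HTML"]
    params = set(parse_qs(urlparse(m.group(1)).query))
    bad = params & TRACKING_PARAMS
    if bad:
        issues.append(f"canonical carries tracking params: {sorted(bad)}")
    return issues

print(canonical_issues('<div id="root"></div>'))
print(canonical_issues('<link rel="canonical" href="https://acme.com/p?utm_source=x">'))
print(canonical_issues('<link rel="canonical" href="https://acme.com/p">'))
```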

4.6 SSR flag accidentally disabled

A configuration flag controlling server rendering is flipped to client-only. This often happens during refactors that are framed as performance work.

Effect: every route on the site simultaneously moves from first-pass indexable to second-pass dependent. Rankings degrade across the entire site within days.

4.7 Mixed status codes on rendered pages

The page returns a 200 in raw HTML, but the rendered page is a "not found" or error state because the data fetch fails after hydration.

Effect: the search system classifies the page as a "soft 404": a 200 response with content that should have been an error. Soft 404s are typically excluded from the index, so the URL either drops out entirely or lingers associated with junk content.
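A rough heuristic for flagging candidates: a 200 status paired with rendered text that reads like an error page. The error-phrase list here is our own assumption; real soft-404 classifiers are considerably more careful:

```python
# Illustrative phrases that suggest an error state in rendered text.
ERROR_PHRASES = ("page not found", "404", "something went wrong",
                 "error loading", "no results found")

def looks_like_soft_404(status_code: int, rendered_text: str) -> bool:
    """Heuristic: a 200 whose rendered text reads like an error page."""
    if status_code != 200:
        return False  # a real error status is not a *soft* 404
    text = rendered_text.lower()
    return any(phrase in text for phrase in ERROR_PHRASES)

print(looks_like_soft_404(200, "Page not found. Try the homepage."))  # True
print(looks_like_soft_404(200, "Acme pricing starts at $9 a month.")) # False
print(looks_like_soft_404(404, "Page not found."))                    # False
```

The comparison that matters is raw versus rendered: the raw HTML looked fine, and only the post-hydration fetch reveals the error state.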

5. A concrete detection checklist

Run this against any URL you depend on for organic or AI traffic. Each item maps to a check Crawlfix performs on every scan, but the checks are also straightforward to script yourself.

5.1 Indexability

  • The raw HTTP response (no JavaScript) contains the page's actual title, not a fallback.
  • The raw HTTP response contains the meta description.
  • The raw HTTP response contains the H1.
  • The raw HTTP response contains a canonical tag pointing to the intended URL.
  • The raw HTTP response contains a meaningful body word count for the page's purpose. (For a marketing landing page, plausibly 200 words minimum. For a blog post, the actual article body.)
  • The HTTP status is 200 for indexable pages. 301/302 chains are short. Soft 404s do not exist.
  • robots.txt does not block the URL. noindex is not present unless intended.

5.2 Crawl reachability

  • The raw HTTP response contains internal links to important interior pages.
  • A sitemap exists and lists canonical URLs only.
  • Important pages are reachable within three clicks from the home page.
  • There are no orphan pages with zero incoming internal links.
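The three-click and orphan checks both fall out of a breadth-first search over the internal link graph. A sketch, using a toy graph keyed by path:

```python
from collections import deque

def click_depths(link_graph: dict, start: str) -> dict:
    """BFS from the entry page; returns minimum click depth per reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

graph = {
    "/": ["/pricing", "/blog"],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": ["/blog/post-2"],
    "/orphan": [],  # exists in the sitemap but has no incoming links
}
depths = click_depths(graph, "/")
print(depths)  # /blog/post-2 sits at depth 3; /orphan never appears
```

Pages deeper than three clicks are candidates for better internal linking; pages absent from the result entirely are orphans. If the graph is built from raw HTML only, this same pass also exposes the JavaScript-injected-links failure from section 4.3.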

5.3 AI citability

  • The first sentence after each H2 is a direct answer to a question a user might ask.
  • At least one heading per page is question-shaped where it makes sense.
  • A visible author byline links to a real bio.
  • Publish date and last update date are visible and machine-readable.
  • Acronyms are defined on first use. Brand and product names are used consistently.
  • At least one factual claim cites a primary source.

5.4 Performance and rendering health

  • Largest Contentful Paint is under 2.5 seconds on a representative device profile.
  • Interaction to Next Paint is under 200 milliseconds.
  • Cumulative Layout Shift is under 0.1.
  • The page renders successfully in a headless browser without console errors that block hydration.

6. Remediation order

When a site fails multiple categories at once, work in this order.

  1. Make the raw HTML correct. Server-render or pre-render the routes with organic traffic. This single move solves the largest cluster of issues.
  2. Move metadata to the server. Title, description, canonical, structured data, and Open Graph tags all belong in the server-rendered response.
  3. Make navigation real anchor tags. Internal navigation should be in the raw HTML, not injected by a router after mount.
  4. Audit the gap on remaining client-rendered routes. Decide which can stay client-rendered (logged-in dashboards, transient UI) and which need conversion (any route a user might land on from search or a link).
  5. Wire raw vs. rendered diff into CI. Prevent regressions during future refactors. The first-pass HTML is a contract; treat it that way.
  6. Then improve content quality. Answer-first prose, question-shaped headings, real authorship, citations. None of these matter if the page is not indexable.
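The diff-in-CI step can be as simple as comparing field values extracted from the raw and rendered fetches and failing the build on any mismatch. A sketch, with field extraction out of scope and the sample values illustrative:

```python
def contract_violations(raw: dict, rendered: dict) -> list:
    """Fields present after rendering that are missing or different in raw HTML."""
    problems = []
    for field, rendered_value in rendered.items():
        raw_value = raw.get(field)
        if raw_value is None:
            problems.append(f"{field}: missing from raw HTML")
        elif raw_value != rendered_value:
            problems.append(f"{field}: raw={raw_value!r} rendered={rendered_value!r}")
    return problems

# Values as extracted from the two fetches of the same URL (illustrative).
raw = {"title": "App", "h1": None}
rendered = {"title": "Pricing - Acme", "h1": "Pricing",
            "canonical": "https://acme.com/pricing"}

for problem in contract_violations(raw, rendered):
    print(problem)
# In CI: exit nonzero when the list is non-empty, failing the build on regressions.
```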

7. What changes when you cannot SSR

Some teams cannot ship server rendering on a short timeline. Common reasons include legacy infrastructure, build-time data dependencies, or organizational constraints. Three partial mitigations help.

  • Pre-render the highest-traffic routes at build time. Static export of marketing pages is usually achievable even when the rest of the app cannot be SSR'd.
  • Use a hybrid renderer for the rest: a serverless function that runs the page through a headless browser on cache miss and returns the rendered HTML to the search crawler. This is more complex than SSR and adds latency, but it works.
  • Make the loading state minimally informative. Render the H1, title, and at least one paragraph of static fallback content in the raw HTML. Replace it client-side after hydration. This is a poor substitute for SSR but better than an empty body.

8. How Crawlfix fits

Crawlfix is built around the diff. Every scan does two fetches of the same URL: a raw HTTP request, and a real Chromium render. The diff is the report. We surface the issues, the evidence (the literal raw and rendered values for the field that broke), a fix recipe, and an optional prompt that an AI coding agent can use to apply the fix.

We do not modify your code. The scanner is read-only. The output is a JSON report and a markdown plan. We integrate with Model Context Protocol so an AI coding agent in your editor can pull the report and act on it directly. Pricing is structured so the scanner is free for one URL, and full reports unlock with a one-time payment or subscription.

We are biased about the value of the diff. We are also right about it. The mechanism of two-pass indexing has not changed. The cost of getting the first pass wrong has only gone up as AI answer engines join the pool of systems crawling your site.

9. References

This paper draws on Google's published documentation on JavaScript indexing, mobile-first indexing, structured data, and AI Overviews; on industry research into answer-engine behavior from Senso, the HOTH, and others; and on patterns observed across scans run on Crawlfix during 2025 and 2026. Where the evidence is consensus rather than confirmed, we have flagged it. Where evidence conflicts, we have reported the conflict rather than pick a side.

If you want to go deeper, the Google documentation on rendering, the Search Central blog, and the Search Off the Record podcast cover most of the official guidance. For AI search, the Senso "How LLMs Choose Sources" series and Google's own AI Overviews developer docs are the most rigorous public sources we have found.
