March 28, 2026 · 8 min read

How our audit works: crawling and scoring your site for AI readiness

A technical walkthrough of how we fetch pages, compare HTML and Markdown responses, and turn seven weighted checks into a single readiness score.

Tags: engineering, agents, crawler, scoring

When you submit a URL to AgentReady.dev, a few things happen quickly. We fetch your pages, compare how they look to an AI agent versus a browser, and produce a score. The result feels instant. The engineering behind it is worth explaining.

This is a walkthrough of exactly how the crawler and scorer work.

The audit pipeline

Every audit runs through four stages: crawl, fetch both formats, score each page, and aggregate.

We start with your root URL and discover links. Then we fetch every page twice — once as a browser would, once as an AI agent would. We score each pair of responses against seven checks. Finally, we compute a single site score, weighted by page importance.

Here's the skeleton of that workflow:

async function runAudit(jobId: string) {
  // 1. Fetch root page, discover same-domain links
  // 2. For each page: fetch HTML + Markdown
  // 3. Score each page
  // 4. Aggregate: home page counts 2x
}

Simple on the surface. The interesting parts are in the details.

Step 1: Crawling

We fetch your root URL with a standard Accept: text/html header and parse the response with Cheerio to extract all same-domain links.

A few things we handle here that aren't obvious:

Locale-prefixed paths get deprioritized. Many sites serve the same content under /en-gb/, /de-de/, and so on. We detect locale patterns and push those links to the back of the queue, so the canonical English paths get audited first. If a site only has locale-prefixed URLs, we'll still crawl them — they're not excluded, just ranked lower.
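As a rough illustration of that ordering, here is a minimal sketch. It assumes links have already been reduced to same-domain paths and that a locale prefix looks like /en-gb/ (two letters, a dash, two letters). The names LOCALE_PREFIX, rankLinks, and buildQueue are hypothetical, and the real detection rules may differ.

```typescript
// Hypothetical pattern: the actual locale detection rules are not published.
const LOCALE_PREFIX = /^\/[a-z]{2}-[a-z]{2}(\/|$)/i;

function isLocalePrefixed(path: string): boolean {
  return LOCALE_PREFIX.test(path);
}

// Stable partition: non-locale paths keep their order and come first,
// locale-prefixed paths follow. Nothing is dropped.
function rankLinks(paths: string[]): string[] {
  const canonical = paths.filter((p) => !isLocalePrefixed(p));
  const localized = paths.filter((p) => isLocalePrefixed(p));
  return [...canonical, ...localized];
}

const MAX_PAGES = 10; // crawl cap described in this article

function buildQueue(paths: string[]): string[] {
  return rankLinks(paths).slice(0, MAX_PAGES);
}
```

Because the locale paths are only moved to the back, a site that has nothing but locale-prefixed URLs still fills the queue.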

We cap crawls at 10 pages. This isn't about being conservative — it's about signal. The first few pages of a site tell you nearly everything about how well it's prepared for agents. A 100-page crawl adds noise, not signal.

We enforce hard safety limits per request. Each fetch has a 15-second timeout and a 5MB response size cap. We also validate URLs against SSRF protection before fetching — no private IPs, no internal hostnames.
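A simplified sketch of what those guards might look like follows. The constant and function names are hypothetical, and the host check is deliberately incomplete: it only inspects the literal hostname for IPv4 private ranges and a few assumed internal suffixes. A production SSRF guard would also need to check resolved DNS addresses and IPv6.

```typescript
// Limits taken from the article; the object shape is an assumption.
const SAFETY_LIMITS = {
  timeoutMs: 15000,
  maxBytes: 5 * 1024 * 1024,
};

// Illustrative check on the literal hostname only (no DNS resolution, no IPv6).
function isBlockedHost(hostname: string): boolean {
  const host = hostname.toLowerCase();
  if (
    host === "localhost" ||
    host.endsWith(".localhost") ||
    host.endsWith(".local") ||
    host.endsWith(".internal")
  ) {
    return true;
  }
  const m = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254)
  );
}

// Called while streaming a response body to decide when to abort.
function exceedsSizeCap(bytesRead: number): boolean {
  return bytesRead > SAFETY_LIMITS.maxBytes;
}
```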

Once we have the link list, the crawler hands off to the fetch phase.

Step 2: Fetching both formats

This is the core premise of the whole audit. For every page, we make two requests:

GET /page
Accept: text/html

GET /page
Accept: text/markdown, text/html, */*

The first request is what a browser sends. The second is what Claude Code, Perplexity, and most modern agent frameworks actually send. If your server treats both requests identically and returns the same HTML either way, that's your first failing check.

We intentionally send the Accept header exactly as agents send it, with text/markdown first and HTML as a fallback, rather than requesting only Markdown. This catches sites that only partially support content negotiation: they serve Markdown when text/markdown is the sole type requested, but fall back to HTML when it appears alongside other types, as it does in the real-world agent header.
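To make the partial-support failure concrete, here is a hypothetical server-side sketch (not AgentReady.dev code). The first function is the kind of naive check that passes an explicit Markdown-only request but misses the agent header. The second handles the multi-type header by comparing positions, ignoring q-values for brevity.

```typescript
const BROWSER_ACCEPT = "text/html";
const AGENT_ACCEPT = "text/markdown, text/html, */*";

// Naive: only matches when Markdown is the sole type requested.
function naiveWantsMarkdown(accept: string): boolean {
  return accept.trim().toLowerCase() === "text/markdown";
}

// Order-based: true when text/markdown is listed and appears before
// text/html (or text/html is absent). A full implementation would
// also honor q-values instead of relying on order alone.
function wantsMarkdown(accept: string): boolean {
  const types = accept
    .split(",")
    .map((part) => part.replace(/;.*$/, "").trim().toLowerCase());
  const md = types.indexOf("text/markdown");
  const html = types.indexOf("text/html");
  return md !== -1 && (html === -1 || md < html);
}
```

With the agent header, naiveWantsMarkdown returns false while wantsMarkdown returns true, which is exactly the gap the audit is designed to surface.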

Step 3: Scoring

Every page runs through seven checks. Most checks return a status of pass, partial, or fail, which maps to a score of 1.0, 0.5, or 0.0; Link Quality instead produces a continuous score between 0.0 and 1.0. The checks are weighted, and the page's final score is a weighted average on a 0–100 scale.

Check	Weight	What it tests
Markdown Response	20	Did the server respond differently to the Markdown request?
Valid Markdown	20	Is the Markdown response actually Markdown, not HTML?
Navigation Stripped	15	Were nav bars, headers, and footers removed?
YAML Frontmatter	15	Does the response include structured metadata?
Sitemap / Index	10	(Home only) Does the root page act as a link directory?
Link Quality	10	Are the links in the Markdown well-formed?
Size Delta	10	Is the Markdown meaningfully smaller than the HTML?
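The weighted average can be sketched as follows. The weights come from the table above; the identifiers are hypothetical. One assumption to flag: for pages where a check does not apply (the home-only Sitemap / Index check on inner pages), this sketch drops that check from both the numerator and the denominator. The article does not say how the real scorer handles that case.

```typescript
type CheckId =
  | "markdownResponse"
  | "validMarkdown"
  | "navigationStripped"
  | "yamlFrontmatter"
  | "sitemapIndex"
  | "linkQuality"
  | "sizeDelta";

const CHECK_WEIGHTS: Record<CheckId, number> = {
  markdownResponse: 20,
  validMarkdown: 20,
  navigationStripped: 15,
  yamlFrontmatter: 15,
  sitemapIndex: 10,
  linkQuality: 10,
  sizeDelta: 10,
};

// Each value is a score in [0, 1]: pass = 1.0, partial = 0.5, fail = 0.0.
// Checks that do not apply to a page are simply omitted (assumption).
type CheckResults = Partial<Record<CheckId, number>>;

function computePageScore(results: CheckResults): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const id of Object.keys(results) as CheckId[]) {
    const score = results[id];
    if (score === undefined) continue;
    weightedSum += score * CHECK_WEIGHTS[id];
    totalWeight += CHECK_WEIGHTS[id];
  }
  return totalWeight === 0 ? 0 : Math.round((100 * weightedSum) / totalWeight);
}
```

For example, a home page that passes Markdown Response, Valid Markdown, and Link Quality, gets a partial on Size Delta, and fails the rest earns 20 + 20 + 10 + 5 = 55 of 100 weight points under these assumptions.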

Let's go through each one.

Markdown Response (weight: 20)

The most fundamental check. We compare the Content-Type header of the HTML response versus the Markdown response. If both return text/html, the site doesn't support content negotiation. No further Markdown checks can pass.

This check is binary: pass or fail.
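A minimal sketch of that comparison is below. It assumes the check passes whenever the media type of the Markdown-request response differs from the HTML-request response; the real check may be stricter (for example, requiring text/markdown specifically). Function names are hypothetical.

```typescript
// Strip parameters such as "; charset=utf-8" and normalize case.
function mediaType(contentType: string | null): string {
  return (contentType ?? "").replace(/;.*$/, "").trim().toLowerCase();
}

function checkMarkdownResponse(
  htmlContentType: string | null,
  markdownContentType: string | null
): "pass" | "fail" {
  const htmlType = mediaType(htmlContentType);
  const mdType = mediaType(markdownContentType);
  // Same media type for both requests means no content negotiation happened.
  return mdType !== "" && mdType !== htmlType ? "pass" : "fail";
}
```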

Valid Markdown (weight: 20)

A server can return Content-Type: text/markdown and still serve raw HTML. This happens more than you'd expect — misconfigured servers, middleware that rewrites headers without changing the body. We validate the response body directly.

We scan for positive Markdown signals: headings (#), links ([text](url)), lists, code blocks, blockquotes, bold text. Then we check for disqualifying HTML signals: <!DOCTYPE>, <html>, <head>, <body>. If the response looks like HTML, it fails — regardless of what the header says.
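The body validation might look roughly like this. The regular expressions are illustrative approximations of the signals listed above, and the rule that a single positive signal is enough is an assumption; the actual thresholds are not stated here.

```typescript
// Positive signals (approximate patterns, not the production rules).
const MARKDOWN_SIGNALS: RegExp[] = [
  /^#{1,6}\s+\S/m, // heading
  /\[[^\]]+\]\([^)]+\)/, // inline link
  /^\s*(?:[-*+]|\d+\.)\s+\S/m, // list item
  /^`{3}/m, // fenced code block
  /^>\s+\S/m, // blockquote
  /\*\*[^*]+\*\*/, // bold text
];

// Disqualifying signals: the body is a full HTML document.
const HTML_SIGNALS: RegExp[] = [
  /<!DOCTYPE/i,
  /<html[\s>]/i,
  /<head[\s>]/i,
  /<body[\s>]/i,
];

function looksLikeMarkdown(body: string): boolean {
  // HTML markers win regardless of what the Content-Type header claimed.
  if (HTML_SIGNALS.some((re) => re.test(body))) return false;
  return MARKDOWN_SIGNALS.some((re) => re.test(body));
}
```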

Navigation Stripped (weight: 15)

An agent-friendly Markdown response should contain only the page's content. Navigation, header, footer, scripts, and styles are browser concerns. If they appear in the Markdown response, the server is likely just converting HTML to Markdown verbatim, which defeats most of the benefit.

We check for <nav>, <header>, <footer>, <script>, and <style> tags, plus class and id attributes that contain "nav". Any of these in the response body triggers a fail.
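A sketch of this scan, with assumed patterns: the tag checks follow the list above, and the attribute pattern reads "nav-class and nav-id" as any class or id value containing a word that starts with "nav". That interpretation is a guess, as is the function name.

```typescript
const CHROME_PATTERNS: RegExp[] = [
  /<nav[\s>]/i,
  /<header[\s>]/i,
  /<footer[\s>]/i,
  /<script[\s>]/i,
  /<style[\s>]/i,
  // class="main-nav", id='navbar', and similar (assumed interpretation).
  /\b(?:class|id)\s*=\s*["'][^"']*\bnav/i,
];

function checkNavigationStripped(body: string): "pass" | "fail" {
  return CHROME_PATTERNS.some((re) => re.test(body)) ? "fail" : "pass";
}
```

A scan this simple would also flag a documentation page that shows a <script> tag inside a code sample, so a real implementation probably needs to account for code blocks.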

Note

This is one of the most common partial implementations we see. A site will correctly set the Content-Type header and return something that looks like Markdown, but it's actually the entire HTML document run through a converter. The token count is almost the same. Agents get all the noise.

YAML Frontmatter (weight: 15)

Clean Markdown content is useful. Markdown with metadata is much more useful. Frontmatter lets a page communicate its title, description, date, author, and canonical URL in a structured format that agents can parse directly without inferring from the content.

We check that the response starts with ---, and that the frontmatter block contains at least two meaningful fields from: title, description, date, author, tags, url, canonical. A response that opens with --- but only has one field gets a partial.
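Here is one way the frontmatter rule could be expressed. It uses a line-based key scan instead of a real YAML parser, which is a simplification. Treating a block with zero recognized fields, or an unclosed block, as a fail is an assumption; the article only specifies the two-field pass and the one-field partial.

```typescript
const MEANINGFUL_FIELDS = [
  "title",
  "description",
  "date",
  "author",
  "tags",
  "url",
  "canonical",
];

function checkFrontmatter(body: string): "pass" | "partial" | "fail" {
  // The block must open on the first line and be closed by a second "---".
  const match = body.match(/^---\r?\n([\s\S]*?)\r?\n---(?:\r?\n|$)/);
  if (!match) return "fail";

  // Collect top-level keys of the form "key: value".
  const keys = new Set<string>();
  for (const line of (match[1] ?? "").split(/\r?\n/)) {
    const key = line.match(/^([A-Za-z_][\w-]*)\s*:/);
    if (key) keys.add((key[1] ?? "").toLowerCase());
  }

  const found = MEANINGFUL_FIELDS.filter((f) => keys.has(f)).length;
  if (found >= 2) return "pass";
  return found === 1 ? "partial" : "fail";
}
```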

Sitemap / Index (weight: 10, home page only)

This check only runs on the root URL. The question: does the home page's Markdown serve as a useful entry point for an agent that wants to understand what your site contains?

We're looking for a link directory: five or more organized Markdown links in list format, pointing to major sections of the site. When an agent starts exploring a new site, it often reads the root page to decide where to go next. A standard HTML homepage gives it cards and hero images it can't use. A structured index gives it a roadmap.
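As an approximation, the directory test can be reduced to counting list items that begin with a link. The five-link threshold comes from the description above; the pattern, the names, and the binary pass/fail outcome are assumptions.

```typescript
const MIN_INDEX_LINKS = 5;

// A bullet or numbered list item whose content starts with [text](target).
const LIST_LINK = /^\s*(?:[-*+]|\d+\.)\s+\[[^\]]+\]\([^)\s]+\)/;

function countListLinks(markdown: string): number {
  return markdown.split(/\r?\n/).filter((line) => LIST_LINK.test(line)).length;
}

function checkSitemapIndex(markdown: string): "pass" | "fail" {
  return countListLinks(markdown) >= MIN_INDEX_LINKS ? "pass" : "fail";
}
```

A home page whose Markdown includes a list such as "- [Docs](/docs)", "- [Pricing](/pricing)", and three more section links would pass under this sketch.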

Link Quality (weight: 10)

Markdown that's full of broken or useless links is worse than no links at all. We analyze every link in the response against three failure patterns:

Empty link text — [](url) tells an agent nothing about the destination
Generic text — "click here", "read more", "link" are meaningless out of context
Non-functional hrefs — #, javascript:, and empty hrefs can't be followed

This check produces a continuous score from 0.0 to 1.0 based on the proportion of problematic links. A page with no link issues scores 1.0. A page where half the links are generic anchor text scores 0.5.
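The three failure patterns and the proportional score could be sketched like this. The generic-text list uses only the examples named above, and returning 1.0 for a page with no links at all is an assumption. For simplicity, image syntax such as ![alt](src) is also counted as a link here.

```typescript
const GENERIC_TEXT = new Set(["click here", "read more", "link"]);

function isProblematicLink(text: string, href: string): boolean {
  const t = text.trim().toLowerCase();
  const h = href.trim().toLowerCase();
  if (t === "") return true; // empty link text
  if (GENERIC_TEXT.has(t)) return true; // generic text
  return h === "" || h === "#" || h.startsWith("javascript:"); // non-functional href
}

// Returns the share of links with no issues, from 0.0 to 1.0.
function scoreLinkQuality(markdown: string): number {
  const linkPattern = /\[([^\]]*)\]\(([^)]*)\)/g;
  let total = 0;
  let bad = 0;
  let m: RegExpExecArray | null;
  while ((m = linkPattern.exec(markdown)) !== null) {
    total += 1;
    if (isProblematicLink(m[1] ?? "", m[2] ?? "")) bad += 1;
  }
  return total === 0 ? 1 : (total - bad) / total;
}
```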

Size Delta (weight: 10)

If your Markdown response is almost the same size as your HTML response, something's wrong. The whole point of content negotiation is token efficiency.

Pass    → Markdown is less than 50% of HTML size
Partial → Markdown is 50–80% of HTML size
Fail    → Markdown is more than 80% of HTML size

This check often catches the verbatim-HTML-to-Markdown conversion pattern: the response looks like Markdown at a glance, but the file size tells the real story.
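Expressed as code, the thresholds above amount to a ratio comparison. This sketch assumes sizes are measured in bytes and that an empty HTML response counts as a fail; neither detail is specified in the article.

```typescript
function checkSizeDelta(
  htmlBytes: number,
  markdownBytes: number
): "pass" | "partial" | "fail" {
  if (htmlBytes <= 0) return "fail"; // assumption: nothing to compare against
  const ratio = markdownBytes / htmlBytes;
  if (ratio < 0.5) return "pass"; // less than 50% of HTML size
  if (ratio <= 0.8) return "partial"; // 50% to 80% of HTML size
  return "fail"; // more than 80% of HTML size
}
```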

Step 4: Aggregating the site score

Once every page has a 0–100 score, we compute a single site score. The home page is weighted 2x. Everything else is weighted 1x.

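const HOME_PAGE_WEIGHT = 2;

interface PageScore {
  score: number; // page score on the 0-100 scale
  isHomePage: boolean;
}

function computeSiteScore(pageScores: PageScore[]): number {
  // Guard against an empty crawl, which would otherwise divide by zero (NaN).
  if (pageScores.length === 0) return 0;

  let totalWeight = 0;
  let weightedSum = 0;

  for (const page of pageScores) {
    // The home page counts twice; every other page counts once.
    const weight = page.isHomePage ? HOME_PAGE_WEIGHT : 1;
    weightedSum += page.score * weight;
    totalWeight += weight;
  }

  return Math.round(weightedSum / totalWeight);
}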

The home page weighting isn't arbitrary. The root URL is the most important signal for agent discoverability — it's where most agent frameworks start exploration, and it's the page that earns you (or costs you) the Sitemap/Index check.

The final score maps to one of three readiness states:

70+: Agent-ready
40–69: Partial support
Below 40: Not ready
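The mapping from score to state is a pair of threshold checks. The type and function names below are hypothetical.

```typescript
type Readiness = "Agent-ready" | "Partial support" | "Not ready";

function readinessState(siteScore: number): Readiness {
  if (siteScore >= 70) return "Agent-ready";
  if (siteScore >= 40) return "Partial support";
  return "Not ready";
}
```

As a worked example, a home page scoring 40 alongside two inner pages scoring 80 and 90 gives (40 × 2 + 80 + 90) / 4 = 62.5, which rounds to 63 and lands in "Partial support" despite the strong inner pages.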

What the score tells you

A score of 100 means your site responds to the agent Accept header with clean Markdown, strips browser chrome, includes structured frontmatter, has a useful index at the root, and produces a response that's dramatically smaller than the HTML equivalent.

A score of 0 means every request returns the same thing regardless of Accept, the response body is raw HTML, and agents are getting the worst possible experience.

Most sites land somewhere in the middle. The partial scores are often the most useful to look at — they indicate servers that have started implementing content negotiation but haven't completed it. A site that passes Markdown Response but fails Navigation Stripped is a few lines of middleware away from a much better score.

Run your own audit

The best way to understand how your site scores is to run it through the auditor. Enter a URL and you'll get a full breakdown across all seven checks, with specific remediation guidance for anything that fails.

It's free. No signup. The results page links to documentation for each check if you want to dig into the implementation details.