llms.txt vs content negotiation: what you actually need
Everyone's talking about llms.txt. But there's a whole landscape of signals that tell AI agents how to access your site. Here's what each one does, what it doesn't, and what to implement first.
There's a lot of noise right now about making your site "AI-ready." You may have heard about llms.txt. Maybe you've seen posts about robots.txt for bots, or structured data for richer search results. And if you've gone deeper, you've run into content negotiation, which is about serving different formats based on what the client requests.
These are all real things. They're also frequently confused with each other.
This post maps the full landscape, explains what each signal actually solves, and gives you a concrete prioritization so you're not spending time on the wrong thing.
The four signals
There are four distinct mechanisms your site can use to communicate with AI agents. Each solves a different problem.
| Signal | Solves | Scope |
|---|---|---|
| robots.txt | Access control | Which pages can be fetched at all |
| llms.txt | Site overview | What your product is and where to find things |
| Structured data | Semantic metadata | What individual pages mean |
| Content negotiation | Format efficiency | What agents receive when they fetch a page |
None of these replaces the others. The confusion comes from people treating them as competing approaches to the same problem. They're not.
robots.txt: access control
robots.txt has been around since 1994. It tells crawlers which parts of your site they're allowed to fetch.
User-agent: *
Disallow: /admin/
Disallow: /api/internal/

User-agent: GPTBot
Disallow: /
This is a blunt instrument. It's about access, not format. You can block a bot entirely, or block specific paths. What you can't do is tell a bot "fetch this page, but please read it efficiently." That's not what robots.txt is for.
If you want to keep AI crawlers from collecting your content for training, robots.txt is the right tool, provided the crawler honors it (compliance is voluntary). If you want AI agents to read your site more efficiently, you need something else.
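To see how the rules above resolve for a given bot, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The agent name `ExampleAgent` and the test paths are made up for illustration; the rules are the ones from the example.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/internal/

User-agent: GPTBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleAgent" is a hypothetical user agent that only matches the * group.
for agent, path in [
    ("GPTBot", "/docs/api"),
    ("ExampleAgent", "/docs/api"),
    ("ExampleAgent", "/admin/users"),
]:
    verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
    print(f"{agent:13} {path:14} {verdict}")
```

The answer is always a yes or a no per path. Nothing in the file can express "allowed, but in a lighter format".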
llms.txt: the site overview
llms.txt is a proposed convention, not a web standard, but it is gaining real adoption. The idea: a Markdown-formatted plain-text file at /llms.txt that describes your site for AI systems.
# Acme Docs
> Developer documentation for Acme's API platform.
## Docs
- [Getting Started](/docs/getting-started): Authentication and first API call
- [API Reference](/docs/api): Full endpoint reference
- [SDKs](/docs/sdks): Client libraries for Python, TypeScript, Go
## Support
- [Status](https://status.acme.com): Current API status
- [GitHub](https://github.com/acme/acme): Open source SDKs
Think of it as a structured README for agents. It answers: "What is this site? Where does the important content live?"
This is genuinely useful. When an agent starts a research session on a topic, it often reads a root page to orient itself before deciding where to go. A structured llms.txt gives it that orientation much faster than parsing a marketing homepage.
Note
llms.txt is a static file. It tells agents what exists on your site, but it says nothing about what they'll receive when they actually fetch those pages. A site can have a perfect llms.txt and still dump 500KB of HTML on every agent request.
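For a sense of what an agent can do with this file, here is a rough sketch of a parser for the layout shown in the Acme example. It is not a reference implementation of the llms.txt proposal: the function name `parse_llms_txt` and the returned dictionary shape are assumptions, and it only handles the title, the blockquote summary, `##` sections, and `- [title](url): note` links.

```python
import re

# Matches "- [Title](url): optional note" list items.
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<note>.*))?$"
)


def parse_llms_txt(text: str) -> dict:
    """Split an llms.txt file into a title, a summary, and sections of links."""
    site = {"title": None, "summary": None, "sections": {}}
    current = None  # name of the section we are currently inside
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            site["sections"][current] = []
        elif line.startswith("# "):
            site["title"] = line[2:].strip()
        elif line.startswith(">"):
            site["summary"] = line[1:].strip()
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                site["sections"][current].append(match.groupdict())
    return site


SAMPLE = """\
# Acme Docs
> Developer documentation for Acme's API platform.
## Docs
- [Getting Started](/docs/getting-started): Authentication and first API call
- [API Reference](/docs/api): Full endpoint reference
## Support
- [Status](https://status.acme.com): Current API status
"""

site = parse_llms_txt(SAMPLE)
print(site["title"], "->", list(site["sections"]))
```

The output is a map of where things live. It carries no information about how large or how clean each linked page will be once fetched.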
Structured data: semantic metadata
Structured data (JSON-LD, typically using Schema.org) embeds machine-readable metadata directly in your HTML. It's the reason Google shows star ratings in search results.
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How to deploy with zero downtime",
  "datePublished": "2026-03-15",
  "author": { "@type": "Person", "name": "Jane Smith" }
}
</script>
This helps agents understand what a page is about without parsing all the prose. It's been useful for search engines for years and it's becoming more relevant for AI agents that process web content at scale.
Structured data is worth having. But like llms.txt, it's metadata layered on top of HTML. It doesn't change what agents receive. They still get the whole document.
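As an illustration of how a consumer might read this metadata, here is a small sketch that pulls JSON-LD blocks out of a page using only Python's standard library. The class and function names (`JsonLdExtractor`, `extract_jsonld`) are hypothetical, and a real crawler would likely use a more forgiving HTML parser.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than failing the whole page


def extract_jsonld(html: str) -> list:
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


PAGE = """<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article",
 "headline": "How to deploy with zero downtime"}
</script></head><body><nav>Site nav</nav><p>Lots of markup...</p></body></html>"""

print(extract_jsonld(PAGE)[0]["headline"])
```

Note that the extractor still has to download and scan the full HTML document to reach the metadata, which is the limitation described above.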
Content negotiation: format efficiency
This is where most sites fall short, and it's the highest-leverage thing you can implement.
HTTP has supported content negotiation since the early days. The Accept header lets a client declare which formats it prefers. Browsers send an Accept header that leads with text/html. Claude Code, Perplexity, and many modern agent frameworks send:
Accept: text/markdown, text/html, */*
If your server respects this, it returns clean Markdown instead of a full HTML document. No navigation. No scripts. No cookie banners. Just the content, with YAML frontmatter for metadata.
HTTP/1.1 200 OK
Content-Type: text/markdown; charset=utf-8
Vary: Accept

---
title: How to deploy with zero downtime
date: 2026-03-15
author: Jane Smith
---

# How to deploy with zero downtime
Your actual content here...
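The server-side decision can be sketched roughly as follows. This is one possible policy, not the only valid one: it serves Markdown only when the client lists text/markdown explicitly with a weight at least as high as the weight that applies to HTML. The function names `parse_accept` and `wants_markdown` are made up for this example, and the parser ignores media-type parameters other than `q`.

```python
def parse_accept(header: str) -> dict:
    """Map each media range in an Accept header to its q-value (default 1.0)."""
    weights = {}
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparseable weight as "not acceptable"
        weights[media_type] = q
    return weights


def wants_markdown(header: str) -> bool:
    """Assumed policy: Markdown must be named explicitly and not rank below HTML."""
    weights = parse_accept(header)
    markdown_q = weights.get("text/markdown", 0.0)
    # Fall back to wildcard ranges when text/html is not listed by name.
    html_q = weights.get(
        "text/html", weights.get("text/*", weights.get("*/*", 0.0))
    )
    return markdown_q > 0 and markdown_q >= html_q


AGENT = "text/markdown, text/html, */*"
BROWSER = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"

print(wants_markdown(AGENT))    # True
print(wants_markdown(BROWSER))  # False
```

Requiring an explicit text/markdown entry keeps browsers, which typically end their Accept header with a `*/*` wildcard, on the HTML path.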
The Vercel team measured a typical blog post at ~500KB as HTML. The same content as Markdown: ~2KB. That's a 99.6% reduction. At scale, for an agent researching a topic across 20 pages, the difference is 10MB of context versus 40KB.
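The arithmetic behind those figures is easy to reproduce. The sketch below uses the approximate sizes quoted above and assumes 1MB = 1000KB; plug in your own page sizes to see what the difference looks like for your site.

```python
# Approximate per-page sizes quoted above, in KB.
html_kb = 500
markdown_kb = 2
pages = 20

reduction = (html_kb - markdown_kb) / html_kb
html_total_mb = html_kb * pages / 1000  # assumes 1 MB = 1000 KB
markdown_total_kb = markdown_kb * pages

print(f"reduction per page: {reduction:.1%}")          # 99.6%
print(f"HTML across {pages} pages: {html_total_mb} MB")       # 10.0 MB
print(f"Markdown across {pages} pages: {markdown_total_kb} KB")  # 40 KB
```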
Warning
The most common failure mode isn't servers that ignore the Accept header. It's servers that set Content-Type: text/markdown but return the same HTML. Or servers that return Markdown that's just the HTML document run through a converter, preserving all the nav and footer markup. The token savings disappear; the content type header lies.
What to implement first
Here's the prioritization that makes sense for most sites, roughly in order of impact:
1. Content negotiation: This is the highest-leverage change. It directly reduces the token cost of every agent interaction with your site. A site that supports content negotiation properly makes every subsequent agent visit cheaper and more reliable. The implementation is a few lines of middleware on most frameworks: check the Accept header, strip browser chrome, add frontmatter, return with the right Content-Type.
2. llms.txt: Once your pages are efficient, give agents a good entry point. A well-structured llms.txt at your root helps agents navigate your site without fetching every page to understand its structure. Especially useful for documentation sites and multi-section products.
3. Structured data: Worth having, especially for content types that map to Schema.org types (articles, products, FAQs, how-tos). It's additive and relatively low-effort if you're already generating pages from structured CMS content.
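4. robots.txt: You probably already have this. Review it to make sure you're not accidentally blocking legitimate agent traffic while trying to block training crawlers. Training crawlers and user-facing agents typically identify themselves with different user agent strings, so a rule aimed at one does not have to block the other.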
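The middleware described in item 1 might look roughly like this as a plain WSGI app. It is a sketch under several assumptions: page content is already stored as Markdown (so "stripping browser chrome" just means not wrapping it in the site layout), the `PAGES` store and both render helpers are hypothetical, and the Accept check is a simple substring match rather than full q-value parsing.

```python
import html

# Hypothetical content store: each page keeps its metadata and Markdown source.
PAGES = {
    "/blog/zero-downtime": {
        "title": "How to deploy with zero downtime",
        "date": "2026-03-15",
        "author": "Jane Smith",
        "body": "# How to deploy with zero downtime\n\nYour actual content here...\n",
    }
}


def render_markdown(page: dict) -> str:
    """Frontmatter plus content. Values containing ':' would need YAML quoting."""
    frontmatter = "\n".join(f"{key}: {page[key]}" for key in ("title", "date", "author"))
    return f"---\n{frontmatter}\n---\n\n{page['body']}"


def render_html(page: dict) -> str:
    """Stand-in for the real site template, with nav and footer chrome."""
    title = html.escape(page["title"])
    body = html.escape(page["body"])
    return (
        f"<!doctype html><html><head><title>{title}</title></head><body>"
        f"<nav>Site nav</nav><main><pre>{body}</pre></main>"
        f"<footer>Footer</footer></body></html>"
    )


def app(environ, start_response):
    page = PAGES.get(environ.get("PATH_INFO", "/"))
    if page is None:
        start_response("404 Not Found", [("Content-Type", "text/plain; charset=utf-8")])
        return [b"Not found"]

    # Simplification: a substring check ignores q-values such as q=0.
    accept = environ.get("HTTP_ACCEPT", "").lower()
    if "text/markdown" in accept:
        content_type, payload = "text/markdown; charset=utf-8", render_markdown(page)
    else:
        content_type, payload = "text/html; charset=utf-8", render_html(page)

    # Vary: Accept tells caches the response depends on the Accept header.
    start_response("200 OK", [("Content-Type", content_type), ("Vary", "Accept")])
    return [payload.encode("utf-8")]
```

To try it locally, the app can be served with `wsgiref.simple_server.make_server("", 8000, app)` and requested once with a browser-style Accept header and once with `Accept: text/markdown`. The same idea carries over to middleware in other frameworks.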
Where sites actually fail
We've audited a lot of sites with AgentReady.dev. The patterns are consistent:
No content negotiation at all: The most common case. Every request returns the same HTML regardless of Accept. Agents are forced to parse DOM soup.
Content type set, body unchanged: A server returns Content-Type: text/markdown but the body is still raw HTML. Often a middleware misconfiguration.
Navigation not stripped: The server returns Markdown, but it's the entire HTML document converted verbatim. Nav bars, headers, footers, scripts. The size delta is minimal; agents get all the noise.
llms.txt exists, content negotiation doesn't: A site has a polished llms.txt and zero support for the Accept header. The overview is good; every page visit is still inefficient.
The last one is the most common misunderstanding. llms.txt is visible and easy to implement, so it gets done first. Content negotiation requires server-side changes, so it gets deferred. But the payoff ratio is reversed.
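The first three failure patterns can be detected from a single response. The sketch below is a rough heuristic written for this post, not AgentReady.dev's actual checks: the function name, the marker lists, and the message wording are all assumptions, and a Markdown page that legitimately contains raw HTML tags would trip it.

```python
import re

# Signs that the body is a whole HTML document.
HTML_DOC = re.compile(r"<!doctype html|<html[\s>]|<head[\s>]|<body[\s>]", re.IGNORECASE)
# Signs that site chrome survived conversion. Heuristic: inline HTML or code
# samples in genuine Markdown can also match.
CHROME = re.compile(r"<(nav|header|footer|script)[\s>]", re.IGNORECASE)


def audit_markdown_response(headers: dict, body: str) -> list:
    """Classify a response to a request that sent Accept: text/markdown."""
    normalized = {name.lower(): value for name, value in headers.items()}
    content_type = normalized.get("content-type", "").lower()

    if not content_type.startswith("text/markdown"):
        return ["no content negotiation: Content-Type is not text/markdown"]
    if HTML_DOC.search(body):
        return ["content type set, body unchanged: body is still an HTML document"]
    if CHROME.search(body):
        return ["navigation not stripped: nav/header/footer/script markup remains"]
    return []


clean = "---\ntitle: Example\n---\n\n# Example\n\nBody text.\n"
print(audit_markdown_response({"Content-Type": "text/markdown; charset=utf-8"}, clean))
```

An empty list means none of the three patterns was detected. It does not prove the Markdown is good, only that the obvious failures are absent.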
The complete picture
If you want to think about agent readiness as a stack:
robots.txt → Can agents access your site?
llms.txt → Can agents understand what your site contains?
Structured data → Can agents understand what individual pages mean?
Content negotiation → Can agents read your pages efficiently?
All four layers matter. But the bottom of the stack (per-page format efficiency) is where most sites have the most room to improve, and where improvement has the most direct effect on how agents experience your content.
See where you stand
AgentReady.dev audits your site against the content negotiation layer specifically: the checks that tell you whether agents are getting clean Markdown or HTML bloat when they fetch your pages. Enter a URL and you'll get a breakdown across seven checks, with specific remediation guidance for anything that fails.
It's free. No signup. Takes about 30 seconds.