Useful SEO Agent Skills Need Clear Operating Boundaries

Introduction

I've built 10+ SEO agent skills in 34 days. Six worked on the first try. The other four taught me everything I'm about to show you about the folder structure most LinkedIn posts about AI SEO skills gloss over. What makes these agents reliable isn't better prompts. It's the architecture behind them. Here's how to build an agent from scratch, test it, fix it, and ship it with confidence.

Why most AI SEO skills fail

Here's what a typical "AI SEO prompt" looks like on LinkedIn: That's it. One prompt. Maybe some formatting instructions. The person posts a screenshot of the output, gets 500 likes, and moves on. The output looks professional. It reads well. It's also 40% wrong. I know because I tried this exact approach. Early in the build, I pointed an agent at a website and said, "find SEO issues." It came back with 20 findings. Eight.

Build SEO agent skills as workspaces

Every agent in our system has a workspace. Think of it like a new hire's desk, stocked with everything they need. Here's what the workspace looks like for the agent that crawls websites and maps their architecture: Six components. One prompt file would cover maybe 20% of this.

AGENTS.md is the instruction manual

I wrote thousands of words of methodology into AGENTS.md. Instead of "crawl the site," I laid out the steps: "Start with the sitemap. If no sitemap exists, check /sitemap.xml, /sitemap_index.xml, and robots.txt for sitemap references. Respect crawl delay. Use a browser user agent string, never a bare request. If you get 403s, note the pattern and try with different headers before reporting it as a block."

Scripts are the agent's tools

The agent calls node crawl_site.js -url to analyze website data. It doesn't write curl commands from scratch every time. That's the difference between giving someone a toolbox and telling them to forge their own wrench.

References are the judgment calls

This contains criteria for what counts as an issue. Known false positives to watch for. Edge cases that took me 20 years to learn. The agent reads these when it encounters something ambiguous.

Memory is institutional knowledge

The next execution benefits from the last.

Templates enforce consistency

This is where I get specific about the output I want: "Use this exact structure. These exact fields. This severity scale." Output templates are the difference between getting the same quality in run 14 as you did in run 1.

Walkthrough: Building the crawler from scratch

Let me show you exactly how I built the crawler. It maps a site's architecture, discovers every page, and reports what it finds.

Version 1: The naive approach

I provided the instruction: "Crawl this website and list all pages." The agent wrote its own HTTP requests, used bare curl, and got blocked by the first site it touched. Every modern CDN blocks requests without a browser user agent string, so it was dead on arrival.

Version 2: Added a script

I built crawl_site.js using Playwright. This version used a headless browser and a real user agent. The agent calls the script instead of writing its own requests. This worked on small sites, but it crashed on anything over 200 pages. Because there was no rate limiting and no resume capability, it hammered servers until they blocked us. A useful companion note is Technical SEO Problems That Quietly Limit Growth, because it looks at a nearby part of the same system.

Version 3: Introducing rate limiting and resume

I added throttling with a two requests per second default and never every two seconds for CDN protected sites. The agent reads robots.txt and adjusts its speed without asking permission. I also added checkpoint files so a crashed crawl can resume from where it stopped. This worked on most sites, but it failed on sites that require JavaScript rendering.

Version 4: JavaSript rendering

This time, I added a browser rendering mode. The agent detects whether a site is a single page app (React, Next.js, Angular) and automatically switches to full browser rendering. It also compares rendered HTML against source HTML, and I found real issues this way: Sites where the source HTML was an empty shell but the rendered page was full of content. Google might or might not render it properly. Now we check both. This.

Version 5: Time for templates and memory

For this version, I added templates/output.md with exact fields: URL count, sitemap coverage, blocked paths, response code distribution, render mode used, and issues found. This way every run produces the same structure. I also added memory/runs.log. The agent appends a summary after every execution. Next time it runs, it reads the log and can compare results, like "Last crawl found 485 pages. This crawl found 487. Two new.

Equip agents with the right tools

This is the most important architectural decision I made. When you write "use curl to fetch the sitemap" in your instructions, the agent generates a curl command from scratch every time. Sometimes it adds the right headers. Sometimes it doesn't. Sometimes it follows redirects. Sometimes it forgets. When you give the agent a script called parse_sitemap.sh, it calls the script. The script always has the right headers, always.

Progressive disclosure: Don't dump everything at once

Here's a mistake I made early: I put everything in AGENTS.md. Every rule. Every edge case. Every gotcha. Thousands of words. The agent got confused. It had too much context and it started prioritizing obscure edge cases over common tasks. It would spend time checking for hash routing issues on a WordPress blog. Core rules that affect the 80% case go in AGENTS.md. This is what the agent needs to know for every single run. Edge.

The 10 gotchas: Failure modes that will burn you

Every one of these lessons cost me hours. They're now encoded in our agents' references/gotchas.md files so they can't happen again.

Agents hallucinate data they can't verify

I asked the research agent to find law firms and count their attorneys. It made every number up. It had never visited any of their websites. Only ask agents to produce data they can actually fetch and verify. Separate what they know (training data) from what they can prove (fetched data). The same pattern also shows up in Working Framework, where the practical question is how the signal becomes visible.

Knowledge doesn't transfer between agents

This fix I figured out on day one (use a browser user agent string to avoid CDN blocks) had to be re taught to every new agent. Day 34, a brand new agent hit the exact same problem. Agents don't share memories. Encode shared lessons in a common gotchas file that multiple agents can reference.

Output format drifts between runs

The same prompt can result in different field names: "note" vs. "assessment." "lead_score" vs. "qualification_rating." If you run it twice, get two different schemas. The fix: Create strict output templates with exact field names. Not "write a report." "Use this exact template with these exact fields."

Agents confidently report issues that don't exist

The first three audits delivered false positives with total confidence. The fix wasn't a better prompt. It was a better boss. A dedicated reviewer agent whose only job is to verify everyone else's work. The same reason code review exists for human developers.

Bare HTTP requests get blocked everywhere

Every modern CDN blocks requests without a browser user agent string. The crawler learned this on audit number two when an entire site returned 403s. All it required was a one line fix, and now it's in the gotchas file. Every new agent reads it on day one.

Don't guess URL paths

Agents love to construct URLs they think should exist: /about us, /blog, /contact. Half the time, those URLs 404. My rule is: Fetch the homepage first, read the navigation, follow real links. Never guess.

'Done' vs. 'in review' matters

Agents marked tasks as "done" when posting their findings. Wrong. "Done" means approved. "In review" means waiting for human verification. This small distinction has a huge impact on workflow clarity when you have 10 agents posting work simultaneously.

Categories must be hyper specific

"Fintech" is useless for prospecting because it's too broad. "PI law firms in Houston" works. Every company in a category should directly compete with every other company. My first attempt at sales categories was "Personal finance & fintech." A crypto exchange doesn't compete with a budgeting app. Lesson learned in 20 minutes.

Never ask an LLM to compile data

Unless you want fabricated results. I asked an agent to summarize findings from five separate reports into one document. It invented findings that weren't in any of the source reports. Always build data compilations programmatically. Script it. Never prompt it.

Agents will try things you never planned

The research agent tried to call an API we never set up. It assumed we had access because it knew the API existed. The fix: Be explicit about what tools are available. If a script doesn't exist in the scripts folder, the agent can't use it. Boundaries prevent creative failures.

Build the reviewer first

This is counterintuitive. When you're excited about building, you want to build the workers. The crawler. The analyzers. The fun parts. Build the reviewer first. Without a review layer, you have no way to measure quality. You ship the first audit and it looks great. But 40% of the findings are wrong. You don't know that until a client or a colleague spots it. Our review agent reads every finding from every specialist agent.

The validation standard (Our unfair advantage)

The reviewer catches technical errors. But there's a higher bar than "technically correct." We have a real SEO agency with real clients and a team with 50 years of combined experience. Every agent finding gets validated against one question: "Would we stake our reputation on this?" Would we actually send this to a client, put our name on the report, and tell the developer to build it? Below are four tests we use for every.

Sandbox testing: Train on planted bugs

You don't train an agent on real client sites. You build a test environment where you KNOW the answers. We built two sandbox websites with SEO issues we planted on purpose: A WordPress style site with 27+ planted issues: missing canonicals, redirect chains, orphan pages, duplicate content, broken schema markup. A Node.js site simulating React/Next.js/Angular patterns with ~90 planted issues: empty SPA shells, hash routing,.

Consistency: The unsexy secret

Nobody writes about this because it's boring. But consistency is what separates a demo from a product. Three things that make output consistent: Templates: Every agent has an output template in templates/output.md: Exact fields, structure, and severity scale. If the output looks different every run, you don't need a better prompt. You need a template file. Run logs: After every execution, the agent appends a summary to.

The stack that makes it work

A quick note on infrastructure, because the tools matter. Our agents run on OpenClaw. It's the runtime that handles wake ups, sessions, memory, and tool routing. Think of it as the operating system the agents run on. When an agent finishes one task and needs to pick up the next, OpenClaw handles that transition. When an agent needs to remember what it did last session, OpenClaw provides that memory. Paperclip is the company.

The result

This process resulted in 14+ audits completed with 12 to 20 developer ready tickets per audit, including exact URLs and fix instructions. All produced in hours, not weeks. We have a 99.6% approval rate on internal linking recommendations on 270 links across two sites, verified by a dedicated review process. We completed more than 80 SEO checks mapped across seven specialist agents. Each check has expected outcomes, evidence.