How to Check If AI Bots Can Crawl Your Site: Step-by-Step Guide
TL;DR: Most AI visibility problems start with a simple technical issue: AI bots can’t access your content. Check robots.txt for AI user agent blocks, test with curl commands, verify your CDN isn’t blocking bots, and ensure content renders as server-side HTML. This 30-minute diagnostic process can transform your AI search visibility.
Why Is Checking AI Bot Access the First Thing to Do?
The most common reason websites are invisible to AI search engines is that AI crawlers are blocked from accessing the content. This single technical issue overrides everything else: no amount of content optimization matters if bots can't see your pages.
A surprisingly large share of websites have some form of AI crawler blocking, either intentional or accidental. Many businesses added blocks during the 2023-2024 AI training controversies, then forgot to remove them once they wanted AI search visibility.
Checking AI bot access takes 30 minutes and can be the highest-ROI activity in your entire GEO strategy. If you find and fix a blocking issue, you immediately unlock AI visibility for your entire site.
How Do You Check Your robots.txt for AI Blocks?
Step one is always robots.txt. Navigate to yourdomain.com/robots.txt in your browser and look for rules targeting AI crawlers.
AI crawler user agents to look for:
- GPTBot — OpenAI (ChatGPT)
- OAI-SearchBot — OpenAI search
- ChatGPT-User — ChatGPT browsing
- PerplexityBot — Perplexity AI
- ClaudeBot — Anthropic (Claude)
- anthropic-ai — Anthropic training
- Google-Extended — Google AI features
- CCBot — Common Crawl
- Bytespider — ByteDance
If you see any of these with a `Disallow: /` rule, that bot is blocked from your entire site. A `Disallow` rule with a specific path (e.g. `Disallow: /private/`) blocks only that path.
Example of a blocked robots.txt:

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```
Example of a properly configured robots.txt:

```
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /
```
If your robots.txt has no mention of AI crawlers at all, the default is to allow access (assuming no wildcard disallow rules block all bots).
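A quick way to run this check from the command line is to grep a saved copy of your robots.txt for every AI user agent at once. A minimal sketch; the sample file below stands in for your real robots.txt, which you would fetch with `curl -s https://yoursite.com/robots.txt -o robots.txt`:

```shell
# Create a sample robots.txt so the snippet is self-contained;
# replace this with your real downloaded file.
cat > robots.txt <<'EOF'
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
EOF

# Pattern covering the AI crawler user agents listed above.
AI_BOTS='GPTBot|OAI-SearchBot|ChatGPT-User|PerplexityBot|ClaudeBot|anthropic-ai|Google-Extended|CCBot|Bytespider'

# Show each matching User-agent line plus the directive that follows it.
grep -i -E -A 1 "$AI_BOTS" robots.txt
# Prints:
# User-agent: GPTBot
# Disallow: /
```

An empty result means your robots.txt does not mention any AI crawler explicitly, so the wildcard rules (if any) apply.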
Watch for wildcard rules:

```
User-agent: *
Disallow: /
```

This blocks ALL crawlers, including AI bots. Only use a wildcard disallow if you specifically want to prevent all crawling.
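If you need a wildcard block but still want AI visibility, note that under the Robots Exclusion Protocol (RFC 9309) a compliant crawler obeys only the most specific user-agent group that matches it. That means you can pair a wildcard disallow with explicit allow groups, as in this sketch:

```
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
```

Here GPTBot follows its own group and ignores the wildcard rules, while all other compliant crawlers stay blocked.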
How Do You Test AI Crawler Access with curl?
The curl command lets you simulate an AI crawler visit and see what response your server returns.
Basic test for GPTBot:

```shell
curl -A "GPTBot" -I https://yoursite.com/your-important-page/
```
The `-A` flag sets the user agent string, and `-I` sends a HEAD request and returns only the response headers. (A few servers answer HEAD differently from GET; if the code looks wrong, retry without `-I`.) Look at the HTTP status code:
- 200 — Access granted. The bot can reach your page.
- 403 — Forbidden. Your server is actively blocking this user agent.
- 429 — Rate limited. Your server is throttling the bot.
- 301/302 — Redirect. Follow the redirect to see if the final destination returns 200.
- 503 — Service unavailable. Server issue or intentional bot blocking.
Test for content delivery:

```shell
curl -s -A "GPTBot" https://yoursite.com/your-important-page/ | head -n 100
```
This shows the actual HTML content the bot receives. Check whether your main content appears in the HTML. If you see an empty `<div id="app"></div>` or only minimal markup, your content is JavaScript-rendered and invisible to the bot.
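You can automate this check by grepping the fetched HTML for a distinctive phrase from your page. A minimal sketch; the sample `page.html` is a stand-in for output you would save with `curl -s -A "GPTBot" https://yoursite.com/your-important-page/ -o page.html`, and the phrase is a placeholder:

```shell
# Sample HTML simulating a JavaScript-rendered page (empty app shell);
# in practice this file comes from the curl command above.
cat > page.html <<'EOF'
<html><body><div id="app"></div><script src="/bundle.js"></script></body></html>
EOF

PHRASE='a distinctive sentence from your page'
if grep -q -F "$PHRASE" page.html; then
  echo "OK: content visible in raw HTML"
else
  echo "WARNING: phrase not found - content may be JavaScript-rendered"
fi
# Prints:
# WARNING: phrase not found - content may be JavaScript-rendered
```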
Test multiple AI crawlers:

```shell
for agent in "GPTBot" "PerplexityBot" "ClaudeBot" "CCBot"; do
  echo "=== $agent ==="
  curl -s -A "$agent" -o /dev/null -w '%{http_code}\n' https://yoursite.com/
done
```
This loop tests the major AI crawlers and prints the status code for each. If any return a non-200 code, investigate why that specific bot is blocked.
How Do You Check CDN and Firewall Settings?
CDNs and firewalls are a hidden cause of AI crawler blocking. Your robots.txt might allow bots, but your CDN might block them at the network level.
Cloudflare: Navigate to Security > WAF (Web Application Firewall) and Security > Bots. Check for rules that block or challenge automated traffic. Cloudflare's "Bot Fight Mode" can block AI crawlers and does not support per-agent exceptions, so you may need to disable it or use custom WAF rules that skip bot protection for known AI crawler user agents. Check Security > Events for any blocked requests from AI user agents.
Akamai: Check your Bot Manager configuration for rules that might block AI crawlers. Aggressive bot protection settings can inadvertently block legitimate AI bots.
AWS CloudFront / WAF: Check your WAF rules for user agent or rate-based rules that might affect AI crawlers. AWS WAF’s bot control managed rule group can block AI crawlers.
General CDN checklist:
- Check bot management/protection settings
- Look for rate limiting rules that might affect bots
- Check for IP-based blocking rules
- Review CAPTCHA/challenge settings for bots
- Look for custom WAF rules targeting specific user agents
- Check event logs for blocked AI crawler requests
How Do You Verify Content Renders for AI Bots?
Even if AI bots can access your server, they may not see your content if it’s rendered via client-side JavaScript.
Test 1: View Page Source. In your browser, right-click on your page and select "View Page Source." Search for a distinctive phrase from your content. If the phrase is not in the source HTML, bots can't see it.
Test 2: Disable JavaScript. Open Chrome DevTools (F12) > Settings > Debugger > check “Disable JavaScript.” Reload your page. If the content disappears, it requires JavaScript to render and is invisible to most AI bots.
Test 3: Google's Rich Results Test. Enter your URL; the tool fetches the page and shows a rendered screenshot plus the HTML it received. (Google's older Mobile-Friendly Test has been retired.) If the rendered page is empty or missing content, there's a rendering issue.
Test 4: Google Search Console URL Inspection. Inspect your URL and use the "View Crawled Page" option, which shows exactly what Google's crawler sees. While this tests Googlebot specifically, the HTML visible here is likely what AI crawlers also receive.
If content is JavaScript-rendered, your options are:
- Implement Server-Side Rendering (SSR) — best long-term solution
- Implement Static Site Generation (SSG) — best for content that doesn’t change frequently
- Use a pre-rendering service (Prerender.io) — quick fix
- Switch to a framework that supports SSR out of the box (Next.js, Nuxt, SvelteKit)
How Do You Monitor AI Crawler Visits Over Time?
Setting up ongoing monitoring catches blocking issues before they impact AI visibility.
Server log monitoring. Search your access logs for AI crawler user agents weekly. Track crawl frequency, pages visited, and response codes.
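The weekly log check can be scripted. A sketch assuming the common combined log format, where the status code is field 9 and the user agent is the final quoted string; the sample `access.log` stands in for your real access log:

```shell
# Sample combined-format log lines; replace with your real access log.
cat > access.log <<'EOF'
203.0.113.7 - - [01/May/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0; compatible; GPTBot/1.0; +https://openai.com/gptbot"
203.0.113.8 - - [01/May/2025:10:05:00 +0000] "GET /docs HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"
EOF

# Count requests per (crawler, status code) pair for the major AI bots.
for bot in GPTBot PerplexityBot ClaudeBot CCBot; do
  grep "$bot" access.log | awk -v b="$bot" '{print b, $9}'
done | sort | uniq -c
```

A sudden shift from 200s to 403s for one bot is exactly the kind of configuration drift this monitoring is meant to catch.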
Google Search Console. Monitor the Crawl Stats report for changes in crawl rate and response codes. While this shows Googlebot, similar patterns likely affect AI crawlers.
Uptime monitoring. Set up a monitor that periodically requests your pages with AI user agent strings and alerts you if the response code changes. Tools like UptimeRobot or Pingdom can do this with custom headers.
Periodic manual testing. Monthly, run your curl tests against 10-20 important pages. This catches configuration drift (someone changed a firewall rule, a plugin update added blocks, etc.).
What to set up now:
- Weekly server log check for AI crawler activity
- Monthly curl testing for all major AI crawlers
- Quarterly CDN/firewall configuration review
- Alert on any robots.txt changes (use a change monitoring tool)
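The monthly curl testing can be scheduled rather than remembered. A sketch of a crontab entry; `/usr/local/bin/check-ai-crawlers.sh` is a hypothetical wrapper around the multi-crawler curl loop shown earlier, and the mail address is a placeholder:

```
# Run at 06:00 on the 1st of each month; script path and address are placeholders.
0 6 1 * * /usr/local/bin/check-ai-crawlers.sh https://yoursite.com/ | mail -s "AI crawler access report" you@example.com
```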
What’s the Complete Diagnostic Checklist?
Run through this checklist to comprehensively verify AI crawler access.
- robots.txt checked — no AI crawler blocks
- curl test returns 200 for GPTBot
- curl test returns 200 for PerplexityBot
- curl test returns 200 for ClaudeBot
- Content visible in raw HTML (not JavaScript-only)
- CDN bot protection doesn’t block AI crawlers
- Firewall rules don’t block AI user agents
- No rate limiting affecting AI crawler requests
- No CAPTCHA/challenge walls for bots
- Sitemap referenced in robots.txt
- Sitemap submitted to Google Search Console
- Sitemap submitted to Bing Webmaster Tools
- Server response time under 2 seconds for AI user agents
- No meta robots noindex on important pages
- HTTPS implemented (no mixed content)
Complete this checklist once, then re-verify monthly. Any single failed item can block AI search visibility.
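Several checklist items are easy to automate. For example, a sketch of the meta robots noindex check; the sample `page.html` stands in for a page you would fetch with `curl -s -A "GPTBot" https://yoursite.com/page/ -o page.html`:

```shell
# Sample fetched page; in practice this file comes from curl.
cat > page.html <<'EOF'
<html><head><meta name="robots" content="noindex, nofollow"></head><body>Hello</body></html>
EOF

# Flag pages that carry a meta robots noindex directive.
if grep -i -E '<meta[^>]+name="robots"[^>]+noindex' page.html >/dev/null; then
  echo "FAIL: noindex found"
else
  echo "PASS: no noindex directive"
fi
# Prints:
# FAIL: noindex found
```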
Key Takeaways
- AI crawler blocking is the #1 cause of AI search invisibility — check it first
- Test robots.txt, curl responses, CDN settings, and JavaScript rendering
- Many websites have some form of AI crawler blocking, often unintentional and left over from 2023-2024
- Fix blocking issues and your entire site immediately becomes eligible for AI citations
- Set up monthly monitoring to catch configuration drift
- The 30-minute diagnostic checklist can be your highest-ROI GEO activity