Python is the most practical programming language for SEO professionals. It automates hours of manual work, enables analyses that no off-the-shelf tool provides, and costs nothing to use. This collection covers 40+ scripts and libraries organized by SEO function, with copy-paste code you can use immediately.
Key takeaway: You don’t need to be a developer to use Python for SEO. Each script in this guide is self-contained and documented. Start with the simple ones (broken link checking, title tag extraction), then build up to advanced scripts (keyword clustering, AI citation monitoring) as your skills grow.
What Python Libraries Do You Need for SEO?
Install these core libraries to cover 90% of SEO automation tasks:
```bash
pip install requests beautifulsoup4 pandas lxml scrapy sentence-transformers scikit-learn google-auth google-api-python-client advertools
```
Core libraries explained:
| Library | Purpose | SEO Use Case |
|---|---|---|
| requests | HTTP requests | Fetch pages, check status codes, test redirects |
| BeautifulSoup | HTML parsing | Extract titles, headings, meta tags, links |
| pandas | Data analysis | Process CSV exports, analyze keyword data |
| lxml | Fast XML/HTML parsing | Parse sitemaps, large HTML files |
| Scrapy | Web crawling framework | Full site crawls, structured data extraction |
| sentence-transformers | NLP embeddings | Keyword clustering, content similarity |
| scikit-learn | Machine learning | Clustering algorithms, classification |
| advertools | SEO-specific utilities | Sitemap parsing, robots.txt analysis, SERP analysis |
| google-api-python-client | Google APIs | Search Console data, Analytics data |
What Are the Best Python Scripts for Technical SEO?
Script 1: Bulk status code checker
Check HTTP status codes for a list of URLs. Essential for finding broken links, redirect chains, and server errors.
```python
import requests
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def check_url(url):
    try:
        r = requests.head(url, timeout=10, allow_redirects=True)
        return {
            'url': url,
            'status': r.status_code,
            'final_url': r.url,
            'redirected': url != r.url,
            'redirect_count': len(r.history)
        }
    except requests.RequestException as e:
        return {'url': url, 'status': 'Error', 'final_url': str(e),
                'redirected': False, 'redirect_count': 0}

urls = open('urls.txt').read().strip().split('\n')
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(check_url, urls))

df = pd.DataFrame(results)
df.to_csv('status_report.csv', index=False)
print(f"Checked {len(urls)} URLs")
print(df['status'].value_counts())
```
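Some servers reject HEAD requests outright or return 403 to the default `python-requests` User-Agent, which shows up as false errors in the report. A hedged variant using a shared `Session` with a custom User-Agent and a GET fallback (the header string and function name are illustrative, not part of the original script):

```python
import requests

session = requests.Session()
# Illustrative UA string; some servers 403 the default python-requests agent
session.headers.update({'User-Agent': 'Mozilla/5.0 (compatible; SEOAudit/1.0)'})

def check_url_resilient(url):
    try:
        r = session.head(url, timeout=10, allow_redirects=True)
        if r.status_code in (405, 501):  # server rejects HEAD; retry with GET
            r = session.get(url, timeout=10, allow_redirects=True, stream=True)
        return {'url': url, 'status': r.status_code, 'final_url': r.url}
    except requests.RequestException as e:
        return {'url': url, 'status': 'Error', 'final_url': str(e)}
```

Reusing one `Session` also keeps TCP connections alive between requests, which speeds up large URL lists.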
Script 2: Title tag and meta description extractor
Pull title tags and meta descriptions from a list of URLs. Useful for auditing on-page elements at scale.
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract_meta(url):
    try:
        r = requests.get(url, timeout=10)
        soup = BeautifulSoup(r.text, 'lxml')
        # Guard against empty <title> tags, where .string is None
        title = soup.title.string.strip() if soup.title and soup.title.string else 'Missing'
        desc_tag = soup.find('meta', attrs={'name': 'description'})
        desc = desc_tag.get('content', '').strip() if desc_tag else 'Missing'
        h1_tags = [h.get_text().strip() for h in soup.find_all('h1')]
        return {
            'url': url, 'title': title, 'title_len': len(title),
            'description': desc, 'desc_len': len(desc),
            'h1': h1_tags[0] if h1_tags else 'Missing',
            'h1_count': len(h1_tags)
        }
    except Exception as e:
        return {'url': url, 'title': f'Error: {e}', 'title_len': 0,
                'description': '', 'desc_len': 0, 'h1': '', 'h1_count': 0}

urls = open('urls.txt').read().strip().split('\n')
results = [extract_meta(url) for url in urls]
df = pd.DataFrame(results)
df.to_csv('meta_audit.csv', index=False)

# Flag issues
print("Missing titles:", len(df[df['title'] == 'Missing']))
print("Missing descriptions:", len(df[df['description'] == 'Missing']))
print("Multiple H1s:", len(df[df['h1_count'] > 1]))
print("Titles > 60 chars:", len(df[df['title_len'] > 60]))
```
Script 3: Redirect chain detector
Identify redirect chains that waste crawl budget and dilute link equity.
```python
import requests

def trace_redirects(url):
    try:
        r = requests.get(url, timeout=10, allow_redirects=True)
        chain = [resp.url for resp in r.history] + [r.url]
        return {'url': url, 'chain_length': len(chain) - 1,
                'chain': ' → '.join(chain), 'final_status': r.status_code}
    except Exception as e:
        return {'url': url, 'chain_length': -1, 'chain': str(e), 'final_status': 'Error'}

urls = open('urls.txt').read().strip().split('\n')
for url in urls:
    result = trace_redirects(url)
    if result['chain_length'] > 1:
        print(f"⚠️ Chain ({result['chain_length']} hops): {result['chain']}")
```
Script 4: XML sitemap parser and validator
Parse sitemaps, check all URLs for status codes, and identify issues.
```python
import advertools as adv
import pandas as pd
import requests

# Parse sitemap (handles sitemap index files too)
sitemap_df = adv.sitemap_to_df('https://example.com/sitemap.xml')
print(f"Total URLs in sitemap: {len(sitemap_df)}")

# Check a sample for status codes
sample = sitemap_df['loc'].head(100).tolist()
statuses = []
for url in sample:
    try:
        r = requests.head(url, timeout=10)
        statuses.append({'url': url, 'status': r.status_code})
    except requests.RequestException:
        statuses.append({'url': url, 'status': 'Error'})

status_df = pd.DataFrame(statuses)
print("\nStatus distribution:")
print(status_df['status'].value_counts())

# Find non-200 URLs
issues = status_df[status_df['status'] != 200]
if len(issues) > 0:
    print(f"\n⚠️ {len(issues)} URLs with issues:")
    print(issues)
```
Script 5: Robots.txt AI crawler checker
Check robots.txt for AI crawler access across multiple domains.
```python
import requests

AI_BOTS = ['GPTBot', 'ChatGPT-User', 'PerplexityBot',
           'Google-Extended', 'anthropic-ai', 'Bytespider']

def blocks_everything(section):
    # Match only a literal "Disallow: /" rule; "Disallow: /path"
    # blocks one directory, not the whole site
    return any(line.strip() == 'disallow: /' for line in section.splitlines())

def check_ai_access(domain):
    try:
        r = requests.get(f'https://{domain}/robots.txt', timeout=10)
        robots = r.text.lower()
        results = {}
        for bot in AI_BOTS:
            bot_lower = bot.lower()
            if f'user-agent: {bot_lower}' in robots:
                # Find the bot's section and check for a full disallow
                section_start = robots.index(f'user-agent: {bot_lower}')
                section = robots[section_start:section_start + 200]
                if blocks_everything(section):
                    results[bot] = '❌ Blocked'
                else:
                    results[bot] = '✅ Allowed'
            elif 'user-agent: *' in robots and blocks_everything(robots):
                results[bot] = '⚠️ Possibly blocked by wildcard'
            else:
                results[bot] = '✅ Not mentioned (allowed)'
        return results
    except requests.RequestException as e:
        return {bot: f'Error: {e}' for bot in AI_BOTS}

domains = ['example.com', 'competitor1.com', 'competitor2.com']
for domain in domains:
    print(f"\n{'='*50}")
    print(f"AI Crawler Access: {domain}")
    for bot, status in check_ai_access(domain).items():
        print(f"  {bot}: {status}")
```
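Substring matching is a quick heuristic; for a stricter check, Python’s standard library can parse robots.txt rules properly. A minimal sketch using `urllib.robotparser` (the inline robots.txt content here is invented for illustration; in practice you would fetch it with `rp.set_url(...)` and `rp.read()`):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: GPTBot fully blocked, everyone else loses only /private/
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch('GPTBot', 'https://example.com/blog/post'))         # False
print(rp.can_fetch('PerplexityBot', 'https://example.com/blog/post'))  # True
print(rp.can_fetch('PerplexityBot', 'https://example.com/private/x'))  # False
```

This handles per-path rules and user-agent grouping correctly, which the substring approach can only approximate.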
What Python Scripts Help with Keyword Research and Clustering?
Script 6: Semantic keyword clustering
Group keywords by meaning using sentence embeddings, a free alternative to paid clustering tools.
```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
import pandas as pd

# Load keywords from CSV (column: 'keyword')
df = pd.read_csv('keywords.csv')
keywords = df['keyword'].tolist()

# Generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(keywords, show_progress_bar=True)

# Cluster (the metric= parameter requires scikit-learn >= 1.2;
# older versions call it affinity=)
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1.2,
    metric='cosine',
    linkage='average'
)
labels = clustering.fit_predict(embeddings)
df['cluster'] = labels

# Summarize clusters
for cluster_id in sorted(df['cluster'].unique()):
    cluster_kws = df[df['cluster'] == cluster_id]['keyword'].tolist()
    if len(cluster_kws) > 1:
        print(f"\nCluster {cluster_id} ({len(cluster_kws)} keywords):")
        for kw in cluster_kws[:10]:
            print(f"  - {kw}")

df.to_csv('clustered_keywords.csv', index=False)
print(f"\n{len(df['cluster'].unique())} clusters from {len(keywords)} keywords")
```
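The clustering above groups keywords whose embedding vectors point in similar directions. As a minimal illustration of the underlying cosine-similarity math (pure Python, no libraries; the vectors are toy values):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1.0 means identical direction, 0.0 means orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (unrelated)
```

The `distance_threshold=1.2` in the script operates on cosine distance (1 minus similarity), so lowering it produces tighter, more numerous clusters.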
Script 7: Search Console data puller
Extract performance data from the Google Search Console API for custom analysis.
```python
from google.oauth2 import service_account
from googleapiclient.discovery import build
import pandas as pd

SCOPES = ['https://www.googleapis.com/auth/webmasters.readonly']
credentials = service_account.Credentials.from_service_account_file(
    'service-account.json', scopes=SCOPES)
service = build('searchconsole', 'v1', credentials=credentials)
SITE_URL = 'https://yourdomain.com'

def get_gsc_data(start_date, end_date, dimensions=['query', 'page']):
    all_rows = []
    start_row = 0
    while True:
        request = {
            'startDate': start_date, 'endDate': end_date,
            'dimensions': dimensions,
            'rowLimit': 25000, 'startRow': start_row
        }
        response = service.searchanalytics().query(
            siteUrl=SITE_URL, body=request).execute()
        rows = response.get('rows', [])
        if not rows:
            break
        for row in rows:
            data = dict(zip(dimensions, row['keys']))
            data.update({
                'clicks': row['clicks'], 'impressions': row['impressions'],
                'ctr': row['ctr'], 'position': row['position']
            })
            all_rows.append(data)
        start_row += len(rows)
        if len(rows) < 25000:
            break
    return pd.DataFrame(all_rows)

df = get_gsc_data('2026-01-01', '2026-01-31')
df.to_csv('gsc_data.csv', index=False)
print(f"Exported {len(df)} rows")

# Quick wins: high impressions, low position
quick_wins = df[(df['position'] > 5) & (df['position'] < 20) &
                (df['impressions'] > 100)].sort_values('impressions', ascending=False)
print("\nQuick win opportunities:")
print(quick_wins.head(20)[['query', 'impressions', 'position', 'ctr']])
```
What Python Scripts Work for GEO and AI Search?
Script 8: AI citation monitor
Track your brand’s appearance in AI search responses using the OpenAI API.
```python
import openai
import json
from datetime import date

client = openai.OpenAI(api_key='your-key')
BRAND = "YourBrand"
QUERIES = [
    "What is the best CRM software?",
    "Top CRM tools for startups",
    "How to choose a CRM platform",
    "CRM software comparison 2026",
]

results = []
for query in QUERIES:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    )
    answer = response.choices[0].message.content
    mentioned = BRAND.lower() in answer.lower()
    results.append({
        'date': str(date.today()),
        'query': query,
        'cited': mentioned,
        'response_preview': answer[:200]
    })
    print(f"{'✅' if mentioned else '❌'} {query}")

# Save results
with open(f'citations_{date.today()}.json', 'w') as f:
    json.dump(results, f, indent=2)

cited = sum(1 for r in results if r['cited'])
print(f"\nCitation rate: {cited}/{len(results)} ({cited/len(results)*100:.0f}%)")
```
Script 9: Schema markup extractor and validator
Extract and validate structured data from any URL. Useful for auditing your own pages and analyzing competitors.
```python
import requests
from bs4 import BeautifulSoup
import json

def extract_schema(url):
    r = requests.get(url, timeout=10)
    soup = BeautifulSoup(r.text, 'lxml')
    schemas = []
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            # script.string can be None for empty tags; coerce to '' so
            # json.loads raises JSONDecodeError instead of TypeError
            data = json.loads(script.string or '')
            schemas.append(data)
        except json.JSONDecodeError:
            schemas.append({'error': 'Invalid JSON', 'raw': (script.string or '')[:200]})
    return schemas

def audit_schema(url):
    schemas = extract_schema(url)
    if not schemas:
        print(f"⚠️ No structured data found on {url}")
        return
    for i, schema in enumerate(schemas):
        # JSON-LD blocks can also be top-level lists; only dicts have @type
        schema_type = schema.get('@type', 'Unknown') if isinstance(schema, dict) else 'List'
        print(f"\nSchema {i+1}: {schema_type}")
        if schema_type in ['Article', 'BlogPosting']:
            checks = {
                'headline': 'headline' in schema,
                'author': 'author' in schema,
                'datePublished': 'datePublished' in schema,
                'dateModified': 'dateModified' in schema,
                'image': 'image' in schema,
            }
            for field, present in checks.items():
                print(f"  {'✅' if present else '❌'} {field}")

# Audit your pages
urls = ['https://yourdomain.com/blog/post-1', 'https://yourdomain.com/about']
for url in urls:
    print(f"\n{'='*50}")
    audit_schema(url)
```
Script 10: Content similarity analyzer
Find duplicate or near-duplicate content on your site that could cause cannibalization in both traditional and AI search.
```python
from sentence_transformers import SentenceTransformer, util
import requests
from bs4 import BeautifulSoup
import pandas as pd

model = SentenceTransformer('all-MiniLM-L6-v2')

def get_content(url):
    r = requests.get(url, timeout=10)
    soup = BeautifulSoup(r.text, 'lxml')
    # Remove nav, header, footer
    for tag in soup.find_all(['nav', 'header', 'footer', 'aside']):
        tag.decompose()
    return soup.get_text(separator=' ', strip=True)[:2000]

urls = open('urls.txt').read().strip().split('\n')
contents = {url: get_content(url) for url in urls}
embeddings = model.encode(list(contents.values()))

# Find similar pairs
pairs = []
for i in range(len(urls)):
    for j in range(i + 1, len(urls)):
        sim = util.cos_sim(embeddings[i], embeddings[j]).item()
        if sim > 0.8:
            pairs.append({'url_1': urls[i], 'url_2': urls[j],
                          'similarity': round(sim, 3)})

if pairs:
    df = pd.DataFrame(pairs).sort_values('similarity', ascending=False)
    print(f"⚠️ Found {len(pairs)} similar page pairs:")
    print(df.to_string(index=False))
else:
    print("✅ No highly similar pages found")
```
What Scripts Help with Link Analysis and Internal Linking?
Script 11: Internal link graph analyzer
Map your site’s internal linking structure and find orphan pages, over-linked pages, and link equity distribution issues.
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from collections import defaultdict

def crawl_internal_links(start_url, max_pages=500):
    domain = urlparse(start_url).netloc
    visited = set()
    to_visit = {start_url}
    link_graph = defaultdict(set)  # page -> set of pages it links to
    inlinks = defaultdict(set)     # page -> set of pages linking to it
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop()
        if url in visited:
            continue
        visited.add(url)
        try:
            r = requests.get(url, timeout=10)
            soup = BeautifulSoup(r.text, 'lxml')
            for a in soup.find_all('a', href=True):
                link = urljoin(url, a['href']).split('#')[0].split('?')[0]
                if urlparse(link).netloc == domain and link != url:
                    link_graph[url].add(link)
                    inlinks[link].add(url)
                    if link not in visited:
                        to_visit.add(link)
        except requests.RequestException:
            continue

    # Analysis
    print(f"Crawled {len(visited)} pages")

    # Orphan pages (no inlinks from crawled pages)
    orphans = [url for url in visited if len(inlinks.get(url, set())) == 0
               and url != start_url]
    print(f"\nOrphan pages (0 internal links): {len(orphans)}")
    for url in orphans[:10]:
        print(f"  {url}")

    # Top linked pages
    top_linked = sorted(inlinks.items(), key=lambda x: len(x[1]), reverse=True)[:10]
    print("\nMost internally linked pages:")
    for url, links in top_linked:
        print(f"  {len(links)} links → {url}")
    return visited, link_graph, inlinks

crawl_internal_links('https://yourdomain.com')
```
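The crawler above deduplicates URLs by chopping off fragments and query strings with string splits. A slightly more robust stdlib normalization, which also lowercases the host and strips a trailing slash so `https://Example.com/page/` and `https://example.com/page` count as one node (the function name is illustrative and could replace the split calls above):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    # Drop query and fragment, lowercase scheme/host, strip a trailing slash
    parts = urlsplit(url)
    path = parts.path.rstrip('/') or '/'
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, '', ''))

print(normalize_url('HTTPS://Example.com/Blog/?utm_source=x#top'))
# https://example.com/Blog
```

Note this keeps the path case-sensitive, since many servers treat `/Blog` and `/blog` as different resources.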
Script 12: Broken link finder
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor

def find_broken_links(page_url):
    broken = []
    try:
        r = requests.get(page_url, timeout=10)
        soup = BeautifulSoup(r.text, 'lxml')
        links = [urljoin(page_url, a['href']) for a in soup.find_all('a', href=True)
                 if a['href'].startswith(('http', '/'))]

        def check(link):
            try:
                resp = requests.head(link, timeout=10, allow_redirects=True)
                if resp.status_code >= 400:
                    return {'source': page_url, 'broken_link': link,
                            'status': resp.status_code}
            except requests.RequestException:
                return {'source': page_url, 'broken_link': link, 'status': 'Error'}
            return None

        with ThreadPoolExecutor(max_workers=5) as executor:
            results = executor.map(check, links)
            broken = [r for r in results if r]
    except requests.RequestException as e:
        print(f"Error crawling {page_url}: {e}")
    return broken

# Check multiple pages
pages = open('urls.txt').read().strip().split('\n')
all_broken = []
for page in pages:
    broken = find_broken_links(page)
    all_broken.extend(broken)
    if broken:
        for b in broken:
            print(f"❌ {b['status']} | {b['broken_link']} (from {b['source']})")

print(f"\nTotal broken links found: {len(all_broken)}")
```
What Are the Best Python Libraries Specifically for SEO?
advertools — Built specifically for SEO. Includes sitemap parsing, robots.txt analysis, URL structure analysis, and SERP data processing. The crawl() function is a full-featured web crawler built on Scrapy.
Polars — A faster alternative to pandas for processing large keyword datasets (100K+ rows). The syntax differs from pandas, but operations that take tens of seconds in pandas often finish in a fraction of the time thanks to its multithreaded, lazily optimized engine.
trafilatura — Extracts main content from web pages, removing boilerplate. Better than BeautifulSoup for content extraction because it uses heuristics to identify the main article text.
```python
import trafilatura

downloaded = trafilatura.fetch_url('https://example.com/article')
text = trafilatura.extract(downloaded)  # returns None if extraction fails
if text:
    print(text[:500])
```
google-api-python-client — Official Google API client. Essential for pulling Search Console data, GA4 data, and using other Google APIs programmatically.
pytrends — Unofficial Google Trends API. Useful for keyword research, trend analysis, and identifying seasonal patterns.
```python
from pytrends.request import TrendReq

pytrends = TrendReq()
pytrends.build_payload(['GEO optimization', 'AI SEO'], timeframe='today 12-m')
trends = pytrends.interest_over_time()
print(trends)
```
These scripts represent the building blocks. Combine them to create custom workflows — pull Search Console data, cluster the keywords, identify content gaps, check for broken links, and validate schema — all in one automated pipeline. That’s the real power of Python for SEO: not any single script, but the ability to chain analyses together in ways no single tool can match.
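As a sketch of that chaining idea, here is a minimal pipeline runner; the step functions are placeholders standing in for the real scripts above, not part of any library:

```python
def run_pipeline(data, steps):
    # Feed the output of each step into the next, logging progress
    for step in steps:
        print(f"Running {step.__name__}...")
        data = step(data)
    return data

# Placeholder steps (in practice: pull GSC data, cluster, check links, ...)
def dedupe_urls(urls):
    return sorted(set(urls))

def drop_parameters(urls):
    return [u.split('?')[0] for u in urls]

urls = ['https://a.com/x?b=1', 'https://a.com/x?b=1', 'https://a.com/y']
result = run_pipeline(urls, [dedupe_urls, drop_parameters])
print(result)  # ['https://a.com/x', 'https://a.com/y']
```

Because every step takes and returns plain data, you can reorder steps, drop one, or schedule the whole chain with cron without touching the others.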