GEOClarity

Python SEO Tools: 40+ Scripts & Libraries

A comprehensive collection of Python scripts for SEO automation — crawling, keyword clustering, schema validation, and AI citation monitoring.

GEOClarity · Updated February 25, 2026 · 12 min read

Python is the most practical programming language for SEO professionals. It automates hours of manual work, enables analyses that no off-the-shelf tool provides, and costs nothing to use. This collection covers 40+ scripts and libraries organized by SEO function, with copy-paste code you can use immediately.

Key takeaway: You don’t need to be a developer to use Python for SEO. Each script in this guide is self-contained and documented. Start with the simple ones (broken link checking, title tag extraction), then build up to advanced scripts (keyword clustering, AI citation monitoring) as your skills grow.

What Python Libraries Do You Need for SEO?

Install these core libraries to cover 90% of SEO automation tasks:

pip install requests beautifulsoup4 pandas lxml scrapy sentence-transformers scikit-learn google-auth google-api-python-client advertools

Core libraries explained:

Library | Purpose | SEO Use Case
requests | HTTP requests | Fetch pages, check status codes, test redirects
BeautifulSoup | HTML parsing | Extract titles, headings, meta tags, links
pandas | Data analysis | Process CSV exports, analyze keyword data
lxml | Fast XML/HTML parsing | Parse sitemaps, large HTML files
Scrapy | Web crawling framework | Full site crawls, structured data extraction
sentence-transformers | NLP embeddings | Keyword clustering, content similarity
scikit-learn | Machine learning | Clustering algorithms, classification
advertools | SEO-specific utilities | Sitemap parsing, robots.txt analysis, SERP analysis
google-api-python-client | Google APIs | Search Console data, Analytics data

What Are the Best Python Scripts for Technical SEO?

Script 1: Bulk status code checker

Check HTTP status codes for a list of URLs. Essential for finding broken links, redirect chains, and server errors.

import requests
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def check_url(url):
    try:
        r = requests.head(url, timeout=10, allow_redirects=True)
        if r.status_code in (405, 501):
            # Some servers reject HEAD requests; fall back to GET
            r = requests.get(url, timeout=10, allow_redirects=True, stream=True)
        return {
            'url': url,
            'status': r.status_code,
            'final_url': r.url,
            'redirected': url != r.url,
            'redirect_count': len(r.history)
        }
    except requests.RequestException as e:
        return {'url': url, 'status': 'Error', 'final_url': str(e),
                'redirected': False, 'redirect_count': 0}

urls = open('urls.txt').read().strip().split('\n')

with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(check_url, urls))

df = pd.DataFrame(results)
df.to_csv('status_report.csv', index=False)
print(f"Checked {len(urls)} URLs")
print(df['status'].value_counts())

Script 2: Title tag and meta description extractor

Pull title tags and meta descriptions from a list of URLs. Useful for auditing on-page elements at scale.

import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract_meta(url):
    try:
        r = requests.get(url, timeout=10)
        soup = BeautifulSoup(r.text, 'lxml')
        title = soup.title.string.strip() if soup.title and soup.title.string else 'Missing'
        desc_tag = soup.find('meta', attrs={'name': 'description'})
        desc = desc_tag['content'].strip() if desc_tag and desc_tag.get('content') else 'Missing'
        h1_tags = [h.get_text().strip() for h in soup.find_all('h1')]
        return {
            'url': url, 'title': title, 'title_len': len(title),
            'description': desc, 'desc_len': len(desc),
            'h1': h1_tags[0] if h1_tags else 'Missing',
            'h1_count': len(h1_tags)
        }
    except Exception as e:
        return {'url': url, 'title': f'Error: {e}', 'title_len': 0,
                'description': '', 'desc_len': 0, 'h1': '', 'h1_count': 0}

urls = open('urls.txt').read().strip().split('\n')
results = [extract_meta(url) for url in urls]
df = pd.DataFrame(results)
df.to_csv('meta_audit.csv', index=False)

# Flag issues
print("Missing titles:", len(df[df['title'] == 'Missing']))
print("Missing descriptions:", len(df[df['description'] == 'Missing']))
print("Multiple H1s:", len(df[df['h1_count'] > 1]))
print("Titles > 60 chars:", len(df[df['title_len'] > 60]))

Script 3: Redirect chain detector

Identify redirect chains that waste crawl budget and dilute link equity.

import requests

def trace_redirects(url):
    try:
        r = requests.get(url, timeout=10, allow_redirects=True)
        chain = [resp.url for resp in r.history] + [r.url]
        return {'url': url, 'chain_length': len(chain) - 1,
                'chain': ' → '.join(chain), 'final_status': r.status_code}
    except Exception as e:
        return {'url': url, 'chain_length': -1, 'chain': str(e), 'final_status': 'Error'}

urls = open('urls.txt').read().strip().split('\n')
for url in urls:
    result = trace_redirects(url)
    if result['chain_length'] > 1:
        print(f"⚠️  Chain ({result['chain_length']} hops): {result['chain']}")

Script 4: XML sitemap parser and validator

Parse sitemaps, check all URLs for status codes, and identify issues.

import advertools as adv
import pandas as pd
import requests

# Parse sitemap (handles sitemap index files too)
sitemap_df = adv.sitemap_to_df('https://example.com/sitemap.xml')
print(f"Total URLs in sitemap: {len(sitemap_df)}")

# Check a sample for status codes
sample = sitemap_df['loc'].head(100).tolist()
statuses = []
for url in sample:
    try:
        r = requests.head(url, timeout=10)
        statuses.append({'url': url, 'status': r.status_code})
    except requests.RequestException:
        statuses.append({'url': url, 'status': 'Error'})

status_df = pd.DataFrame(statuses)
print("\nStatus distribution:")
print(status_df['status'].value_counts())

# Find non-200 URLs
issues = status_df[status_df['status'] != 200]
if len(issues) > 0:
    print(f"\n⚠️  {len(issues)} URLs with issues:")
    print(issues)
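If you'd rather avoid the advertools dependency, a plain-stdlib fallback can pull URLs out of a fetched sitemap with xml.etree.ElementTree. A minimal sketch (the inline sample document is illustrative; the namespace URI is the standard sitemap.org one):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def parse_sitemap(xml_text):
    # Return the <loc> values from a urlset sitemap document
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall('.//sm:loc', SITEMAP_NS)]

sample = '''<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>'''

print(parse_sitemap(sample))
```

This only handles a urlset file; a sitemap index would need one more pass to fetch each child sitemap, which is exactly the case advertools handles for you.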

Script 5: Robots.txt AI crawler checker

Check robots.txt for AI crawler access across multiple domains.

import requests

AI_BOTS = ['GPTBot', 'ChatGPT-User', 'PerplexityBot',
           'Google-Extended', 'anthropic-ai', 'Bytespider']

def blocks_all(rules):
    # Match 'disallow: /' as a whole line; a plain substring check would
    # also match partial rules like 'disallow: /private'
    return any(line.strip() == 'disallow: /' for line in rules.splitlines())

def check_ai_access(domain):
    try:
        r = requests.get(f'https://{domain}/robots.txt', timeout=10)
        robots = r.text.lower()
        results = {}
        for bot in AI_BOTS:
            bot_lower = bot.lower()
            if f'user-agent: {bot_lower}' in robots:
                # Inspect the rules immediately after this user-agent line
                section_start = robots.index(f'user-agent: {bot_lower}')
                section = robots[section_start:section_start + 200]
                if blocks_all(section):
                    results[bot] = '❌ Blocked'
                else:
                    results[bot] = '✅ Allowed'
            elif 'user-agent: *' in robots and blocks_all(robots):
                results[bot] = '⚠️  Possibly blocked by wildcard'
            else:
                results[bot] = '✅ Not mentioned (allowed)'
        return results
    except requests.RequestException as e:
        return {bot: f'Error: {e}' for bot in AI_BOTS}

domains = ['example.com', 'competitor1.com', 'competitor2.com']
for domain in domains:
    print(f"\n{'='*50}")
    print(f"AI Crawler Access: {domain}")
    for bot, status in check_ai_access(domain).items():
        print(f"  {bot}: {status}")
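String matching on robots.txt is easy to get subtly wrong, since rules interact by group and by path prefix. The standard library ships a proper parser, urllib.robotparser, which you can feed a fetched file directly. A short sketch on an inline sample file:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# GPTBot has its own group blocking the whole site
print(rp.can_fetch('GPTBot', 'https://example.com/'))
# PerplexityBot falls through to the wildcard group
print(rp.can_fetch('PerplexityBot', 'https://example.com/'))
print(rp.can_fetch('PerplexityBot', 'https://example.com/private/page'))
```

In practice you would fetch `https://{domain}/robots.txt` with requests and pass `r.text.splitlines()` to `rp.parse()`, then loop over the AI_BOTS list with `can_fetch`.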

What Python Scripts Help with Keyword Research and Clustering?

Script 6: Semantic keyword clustering

Group keywords by meaning using sentence embeddings. This is the free alternative to paid clustering tools.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
import pandas as pd
import numpy as np

# Load keywords from CSV (column: 'keyword')
df = pd.read_csv('keywords.csv')
keywords = df['keyword'].tolist()

# Generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(keywords, show_progress_bar=True)

# Cluster (lower distance_threshold = tighter, more numerous clusters)
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1.2,
    metric='cosine',
    linkage='average'
)
labels = clustering.fit_predict(embeddings)
df['cluster'] = labels

# Summarize clusters
for cluster_id in sorted(df['cluster'].unique()):
    cluster_kws = df[df['cluster'] == cluster_id]['keyword'].tolist()
    if len(cluster_kws) > 1:
        print(f"\nCluster {cluster_id} ({len(cluster_kws)} keywords):")
        for kw in cluster_kws[:10]:
            print(f"  - {kw}")

df.to_csv('clustered_keywords.csv', index=False)
print(f"\n{len(df['cluster'].unique())} clusters from {len(keywords)} keywords")

Script 7: Search Console data puller

Extract performance data from the Google Search Console API for custom analysis.

from google.oauth2 import service_account
from googleapiclient.discovery import build
import pandas as pd

SCOPES = ['https://www.googleapis.com/auth/webmasters.readonly']
credentials = service_account.Credentials.from_service_account_file(
    'service-account.json', scopes=SCOPES)
service = build('searchconsole', 'v1', credentials=credentials)

SITE_URL = 'https://yourdomain.com'

def get_gsc_data(start_date, end_date, dimensions=['query', 'page']):
    all_rows = []
    start_row = 0
    while True:
        request = {
            'startDate': start_date, 'endDate': end_date,
            'dimensions': dimensions,
            'rowLimit': 25000, 'startRow': start_row
        }
        response = service.searchanalytics().query(
            siteUrl=SITE_URL, body=request).execute()
        rows = response.get('rows', [])
        if not rows:
            break
        for row in rows:
            data = dict(zip(dimensions, row['keys']))
            data.update({
                'clicks': row['clicks'], 'impressions': row['impressions'],
                'ctr': row['ctr'], 'position': row['position']
            })
            all_rows.append(data)
        start_row += len(rows)
        if len(rows) < 25000:
            break
    return pd.DataFrame(all_rows)

df = get_gsc_data('2026-01-01', '2026-01-31')
df.to_csv('gsc_data.csv', index=False)
print(f"Exported {len(df)} rows")

# Quick wins: high impressions, ranking just outside the top results
quick_wins = df[(df['position'] > 5) & (df['position'] < 20) &
                (df['impressions'] > 100)].sort_values('impressions', ascending=False)
print("\nQuick win opportunities:")
print(quick_wins.head(20)[['query', 'impressions', 'position', 'ctr']])

Script 8: AI citation monitor

Track your brand’s appearance in AI search responses using the OpenAI API.

import openai
import json
from datetime import date

client = openai.OpenAI(api_key='your-key')

BRAND = "YourBrand"
QUERIES = [
    "What is the best CRM software?",
    "Top CRM tools for startups",
    "How to choose a CRM platform",
    "CRM software comparison 2026",
]

results = []
for query in QUERIES:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    )
    answer = response.choices[0].message.content
    mentioned = BRAND.lower() in answer.lower()
    results.append({
        'date': str(date.today()),
        'query': query,
        'cited': mentioned,
        'response_preview': answer[:200]
    })
    print(f"{'✅' if mentioned else '❌'} {query}")

# Save results
with open(f'citations_{date.today()}.json', 'w') as f:
    json.dump(results, f, indent=2)

cited = sum(1 for r in results if r['cited'])
print(f"\nCitation rate: {cited}/{len(results)} ({cited/len(results)*100:.0f}%)")
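One caveat: the plain substring check above can over-count, because a short brand name matches inside longer words (a brand called "Act" would match "action"). A word-boundary regex is safer; this small helper is a suggested refinement, not part of the script above:

```python
import re

def brand_mentioned(brand, text):
    # \b anchors the match at word boundaries, case-insensitively
    return re.search(rf'\b{re.escape(brand)}\b', text, re.IGNORECASE) is not None

print(brand_mentioned('Act', 'We recommend Act for small teams.'))  # True
print(brand_mentioned('Act', 'This is an action plan.'))            # False
```

Swap it in for the `BRAND.lower() in answer.lower()` line when tracking short or ambiguous brand names.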

Script 9: Schema markup extractor and validator

Extract and validate structured data from any URL. Useful for auditing your own pages and analyzing competitors.

import requests
from bs4 import BeautifulSoup
import json

def extract_schema(url):
    r = requests.get(url, timeout=10)
    soup = BeautifulSoup(r.text, 'lxml')
    schemas = []
    for script in soup.find_all('script', type='application/ld+json'):
        raw = script.string or ''
        try:
            data = json.loads(raw)
            # A single tag can hold a list of schema objects
            schemas.extend(data if isinstance(data, list) else [data])
        except json.JSONDecodeError:
            schemas.append({'error': 'Invalid JSON', 'raw': raw[:200]})
    return schemas

def audit_schema(url):
    schemas = extract_schema(url)
    if not schemas:
        print(f"⚠️  No structured data found on {url}")
        return

    for i, schema in enumerate(schemas):
        schema_type = schema.get('@type', 'Unknown')
        print(f"\nSchema {i+1}: {schema_type}")

        if schema_type in ['Article', 'BlogPosting']:
            checks = {
                'headline': 'headline' in schema,
                'author': 'author' in schema,
                'datePublished': 'datePublished' in schema,
                'dateModified': 'dateModified' in schema,
                'image': 'image' in schema,
            }
            for field, present in checks.items():
                print(f"  {'✅' if present else '❌'} {field}")

# Audit your pages
urls = ['https://yourdomain.com/blog/post-1', 'https://yourdomain.com/about']
for url in urls:
    print(f"\n{'='*50}")
    audit_schema(url)

Script 10: Content similarity analyzer

Find duplicate or near-duplicate content on your site that could cause cannibalization in both traditional and AI search.

from sentence_transformers import SentenceTransformer, util
import requests
from bs4 import BeautifulSoup
import pandas as pd

model = SentenceTransformer('all-MiniLM-L6-v2')

def get_content(url):
    r = requests.get(url, timeout=10)
    soup = BeautifulSoup(r.text, 'lxml')
    # Remove nav, header, footer, and aside boilerplate
    for tag in soup.find_all(['nav', 'header', 'footer', 'aside']):
        tag.decompose()
    return soup.get_text(separator=' ', strip=True)[:2000]

urls = open('urls.txt').read().strip().split('\n')
contents = {url: get_content(url) for url in urls}
embeddings = model.encode(list(contents.values()))

# Find similar pairs
pairs = []
for i in range(len(urls)):
    for j in range(i+1, len(urls)):
        sim = util.cos_sim(embeddings[i], embeddings[j]).item()
        if sim > 0.8:
            pairs.append({'url_1': urls[i], 'url_2': urls[j],
                         'similarity': round(sim, 3)})

if pairs:
    df = pd.DataFrame(pairs).sort_values('similarity', ascending=False)
    print(f"⚠️  Found {len(pairs)} similar page pairs:")
    print(df.to_string(index=False))
else:
    print("✅ No highly similar pages found")
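sentence-transformers downloads a model on first run. If you only need to catch near-verbatim duplicates, a dependency-free approximation is cosine similarity over word-count vectors — a rough sketch that won't catch paraphrases the way embeddings do:

```python
import math
import re
from collections import Counter

def token_cosine(text_a, text_b):
    # Cosine similarity between bag-of-words count vectors
    a = Counter(re.findall(r'\w+', text_a.lower()))
    b = Counter(re.findall(r'\w+', text_b.lower()))
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(token_cosine('python seo scripts', 'python scripts for seo'))
```

Identical texts score (approximately) 1.0 and texts with no shared words score 0.0; thresholds need to be higher than with embeddings, since shared boilerplate words inflate the score.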

Script 11: Internal link graph analyzer

Map your site’s internal linking structure and find orphan pages, over-linked pages, and link equity distribution issues.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from collections import defaultdict

def crawl_internal_links(start_url, max_pages=500):
    domain = urlparse(start_url).netloc
    visited = set()
    to_visit = {start_url}
    link_graph = defaultdict(set)  # page -> set of pages it links to
    inlinks = defaultdict(set)     # page -> set of pages linking to it

    while to_visit and len(visited) < max_pages:
        url = to_visit.pop()
        if url in visited:
            continue
        visited.add(url)
        try:
            r = requests.get(url, timeout=10)
            soup = BeautifulSoup(r.text, 'lxml')
            for a in soup.find_all('a', href=True):
                link = urljoin(url, a['href']).split('#')[0].split('?')[0]
                if urlparse(link).netloc == domain and link != url:
                    link_graph[url].add(link)
                    inlinks[link].add(url)
                    if link not in visited:
                        to_visit.add(link)
        except requests.RequestException:
            continue

    # Analysis
    print(f"Crawled {len(visited)} pages")

    # Orphans here = crawled pages with no recorded inlinks. Note that a
    # link-following crawl can only reach linked pages, so true orphans are
    # found by comparing a sitemap URL list against these inlinks.
    orphans = [url for url in visited if len(inlinks.get(url, set())) == 0
               and url != start_url]
    print(f"\nOrphan pages (0 internal links): {len(orphans)}")
    for url in orphans[:10]:
        print(f"  {url}")

    # Top linked pages
    top_linked = sorted(inlinks.items(), key=lambda x: len(x[1]), reverse=True)[:10]
    print("\nMost internally linked pages:")
    for url, links in top_linked:
        print(f"  {len(links)} links → {url}")

    return visited, link_graph, inlinks

crawl_internal_links('https://yourdomain.com')

Script 12: Broken link finder

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor

def find_broken_links(page_url):
    broken = []
    try:
        r = requests.get(page_url, timeout=10)
        soup = BeautifulSoup(r.text, 'lxml')
        links = [urljoin(page_url, a['href']) for a in soup.find_all('a', href=True)
                 if a['href'].startswith(('http', '/'))]

        def check(link):
            try:
                resp = requests.head(link, timeout=10, allow_redirects=True)
                if resp.status_code >= 400:
                    return {'source': page_url, 'broken_link': link,
                            'status': resp.status_code}
            except requests.RequestException:
                return {'source': page_url, 'broken_link': link, 'status': 'Error'}
            return None

        with ThreadPoolExecutor(max_workers=5) as executor:
            results = executor.map(check, links)
            broken = [r for r in results if r]
    except Exception as e:
        print(f"Error crawling {page_url}: {e}")
    return broken

# Check multiple pages
pages = open('urls.txt').read().strip().split('\n')
all_broken = []
for page in pages:
    broken = find_broken_links(page)
    all_broken.extend(broken)
    if broken:
        for b in broken:
            print(f"❌ {b['status']} | {b['broken_link']} (from {b['source']})")

print(f"\nTotal broken links found: {len(all_broken)}")

What Are the Best Python Libraries Specifically for SEO?

advertools — Built specifically for SEO. Includes sitemap parsing, robots.txt analysis, URL structure analysis, and SERP data processing. The crawl() function is a full-featured web crawler built on Scrapy.

Polars — A faster alternative to pandas for processing large keyword datasets (100K+ rows). The syntax differs, but operations that take tens of seconds in pandas often finish in a couple of seconds with Polars.

trafilatura — Extracts main content from web pages, removing boilerplate. Better than BeautifulSoup for content extraction because it uses heuristics to identify the main article text.

import trafilatura

downloaded = trafilatura.fetch_url('https://example.com/article')
text = trafilatura.extract(downloaded)  # returns None when extraction fails
if text:
    print(text[:500])

google-api-python-client — Official Google API client. Essential for pulling Search Console data, GA4 data, and using other Google APIs programmatically.

pytrends — Unofficial Google Trends API. Useful for keyword research, trend analysis, and identifying seasonal patterns.

from pytrends.request import TrendReq
pytrends = TrendReq()
pytrends.build_payload(['GEO optimization', 'AI SEO'], timeframe='today 12-m')
trends = pytrends.interest_over_time()
print(trends)

These scripts represent the building blocks. Combine them to create custom workflows — pull Search Console data, cluster the keywords, identify content gaps, check for broken links, and validate schema — all in one automated pipeline. That’s the real power of Python for SEO: not any single script, but the ability to chain analyses together in ways no single tool can match.
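As a sketch of what that chaining looks like, here is a toy pipeline with placeholder stages standing in for Scripts 6 and 7 — the function names and the tiny inline dataset are illustrative, not part of any script above:

```python
import pandas as pd

def pull_gsc_data():
    # Stand-in for Script 7; in practice this queries the Search Console API
    return pd.DataFrame({
        'query': ['best crm', 'crm pricing', 'top crm tools'],
        'impressions': [1200, 400, 90],
        'position': [8.2, 14.5, 3.1],
    })

def cluster_queries(df):
    # Stand-in for Script 6; in practice this uses sentence embeddings
    df = df.copy()
    df['cluster'] = df['query'].apply(lambda q: 'crm' if 'crm' in q else 'other')
    return df

def flag_quick_wins(df):
    # Positions 5-20 with real impressions = page-one upside
    return df[df['position'].between(5, 20) & (df['impressions'] > 100)]

report = flag_quick_wins(cluster_queries(pull_gsc_data()))
print(report[['query', 'cluster', 'position']])
```

Each stage takes and returns a DataFrame, so swapping a placeholder for the real script is just a matter of matching column names.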


Frequently Asked Questions

Do you need to know Python for SEO?
No, but it gives you a significant advantage. Python automates repetitive tasks (crawling, data analysis, reporting), enables custom analyses impossible with off-the-shelf tools, and costs nothing compared to enterprise SEO tool subscriptions. Basic Python is learnable in 2-4 weeks.

What Python libraries are essential for SEO?
The core stack is: requests (HTTP), BeautifulSoup (HTML parsing), pandas (data analysis), Scrapy (crawling), and sentence-transformers (NLP/clustering). For SEO APIs: google-auth and google-api-python-client for Search Console, ahrefs-api or SEMrush API wrappers for keyword data.

Can Python replace SEO tools like Ahrefs or SEMrush?
Not entirely. Python excels at custom analysis, automation, and connecting data from multiple sources. But Ahrefs and SEMrush have proprietary data (backlink indices, keyword databases) that can't be replicated with Python alone. The best approach is using Python to extend and automate your work with these tools.

What's the easiest Python SEO script to start with?
A broken link checker. It only requires the requests library, takes about 15 lines of code, and provides immediately useful results. It teaches you HTTP requests and response handling — fundamentals for any SEO script.

Can Python help with GEO?
Absolutely. Python scripts can monitor AI citations using the ChatGPT and Perplexity APIs, analyze content for AI-readiness, validate schema markup at scale, cluster keywords for comprehensive content coverage, and automate competitive AI search analysis.