Internal linking is one of the most impactful and most neglected parts of SEO. Most sites link haphazardly — whatever the author remembers to link to when writing. BERT and modern NLP models solve this by automatically identifying which pages are semantically related and should link to each other. (We explore this further in Python SEO Tools: 40+ Scripts & Libraries.)
Key takeaway: Use BERT sentence embeddings to compute content similarity between all pages on your site. Pages with high similarity scores should link to each other. This approach finds connections humans miss and scales to sites with thousands of pages.
Why Is Internal Linking So Difficult to Do Well Manually?
On a site with 500 pages, there are 124,750 possible page pairs. No human can evaluate all those connections. Most sites rely on authors linking to pages they know about — which is a fraction of the relevant connections.
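That pair count is just "n choose 2"; a quick check of the arithmetic:

```python
import math

def page_pairs(n: int) -> int:
    # Unordered page pairs on an n-page site: n * (n - 1) / 2
    return math.comb(n, 2)

print(page_pairs(500))   # 124750
print(page_pairs(5000))  # 12497500
```

The count grows quadratically, which is why manual evaluation stops scaling almost immediately.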
The typical problems:
| Problem | Impact | How Common |
|---|---|---|
| Orphan pages | No internal links, rarely crawled | 10-30% of pages |
| Obvious links only | Authors link to well-known pages, miss relevant deep content | Very common |
| No linking updates | Old posts never get links to new related content | Almost universal |
| Inconsistent anchor text | Random anchor text instead of keyword-relevant text | Very common |
| Siloed structure | Content in one section never links to related content in another | Common on large sites |
The cost of poor internal linking:
Google uses internal links to discover pages, understand topic relationships, and distribute PageRank. Pages with zero or few internal links rank significantly worse than well-linked pages — even if the content is identical.
For GEO, internal linking builds topical authority. When AI engines see a cluster of interlinked pages on a topic, they recognize that site as an authority on that topic. Isolated pages, no matter how good, don’t signal the same depth of expertise.
How Do BERT Embeddings Work for Content Matching?
BERT embeddings convert text into high-dimensional vectors (arrays of numbers) that capture semantic meaning. Two pieces of text about similar topics produce vectors that are close together in this embedding space.
The core concept:
"How to optimize title tags for SEO" → [0.23, -0.45, 0.78, ...]
"Title tag best practices for search engines" → [0.21, -0.42, 0.81, ...]
"Best chocolate cake recipe" → [-0.67, 0.12, -0.34, ...]
The first two vectors are close (high cosine similarity) because they discuss similar topics. The third is far away because it’s semantically different.
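Cosine similarity itself is simple to compute. A minimal numpy sketch, treating the truncated example vectors above as if they were complete 3-dimensional vectors (real sentence embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

title_tags     = np.array([0.23, -0.45, 0.78])   # "How to optimize title tags for SEO"
best_practices = np.array([0.21, -0.42, 0.81])   # "Title tag best practices..."
cake_recipe    = np.array([-0.67, 0.12, -0.34])  # "Best chocolate cake recipe"

print(cosine_similarity(title_tags, best_practices))  # close to 1.0
print(cosine_similarity(title_tags, cake_recipe))     # negative
```

Values range from -1 (opposite) through 0 (unrelated) to 1 (identical direction); the sentence-transformers `util.cos_sim` helper used below computes the same quantity for whole matrices of embeddings at once.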
Using sentence-transformers:
The sentence-transformers library makes this practical. Raw BERT outputs one embedding per token; sentence-transformers pools those token embeddings into a single fixed-size vector for an entire sentence or paragraph.
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

texts = [
    "How to optimize internal linking for SEO",
    "Best practices for building internal link structure",
    "Python script for web crawling and scraping"
]

embeddings = model.encode(texts)
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```
The output shows that texts 1 and 2 have high similarity (roughly 0.75+), while text 3 has low similarity (roughly 0.15) to both.
Model selection:
| Model | Size | Speed | Quality | Best For |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 80MB | Fast | Good | Most sites, quick analysis |
| all-mpnet-base-v2 | 420MB | Medium | Better | Higher accuracy needs |
| paraphrase-multilingual-MiniLM-L12-v2 | 470MB | Medium | Good | Multilingual sites |
For most internal linking tasks, all-MiniLM-L6-v2 provides the best speed/quality tradeoff. It processes hundreds of pages in seconds on a standard laptop.
How Do You Build a BERT-Powered Internal Linking System?
Here’s the complete workflow from crawling your site to generating link recommendations.
Step 1: Extract content from all pages.
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import pandas as pd

def extract_page_content(url):
    try:
        r = requests.get(url, timeout=15)
        soup = BeautifulSoup(r.text, 'lxml')
        # Remove non-content elements
        for tag in soup.find_all(['nav', 'header', 'footer', 'aside', 'script', 'style']):
            tag.decompose()
        title = soup.title.string.strip() if soup.title and soup.title.string else ''
        h1 = soup.find('h1').get_text().strip() if soup.find('h1') else ''
        # Get main content text
        content = soup.get_text(separator=' ', strip=True)[:3000]
        return {'url': url, 'title': title, 'h1': h1, 'content': content}
    except Exception:
        return None

# Crawl from a sitemap or URL list
urls = open('my_urls.txt').read().strip().split('\n')
pages = [extract_page_content(url) for url in urls]
pages = [p for p in pages if p]

df = pd.DataFrame(pages)
print(f"Extracted content from {len(df)} pages")
```
Step 2: Generate embeddings for all pages.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Combine title + content for richer embeddings
texts = (df['title'] + ' ' + df['content']).tolist()
embeddings = model.encode(texts, show_progress_bar=True, batch_size=32)
print(f"Generated {len(embeddings)} embeddings of dimension {embeddings.shape[1]}")
```
Step 3: Compute similarity matrix.
```python
from sentence_transformers import util

similarity_matrix = util.cos_sim(embeddings, embeddings).numpy()

# Zero out self-similarity so a page never recommends itself
np.fill_diagonal(similarity_matrix, 0)
```
Step 4: Generate link recommendations.
```python
# For each page, find the top N most similar pages
N = 5
recommendations = []

for i in range(len(df)):
    top_indices = similarity_matrix[i].argsort()[-N:][::-1]
    for j in top_indices:
        sim = similarity_matrix[i][j]
        if sim > 0.3:  # Minimum similarity threshold
            recommendations.append({
                'source_url': df.iloc[i]['url'],
                'source_title': df.iloc[i]['title'],
                'target_url': df.iloc[j]['url'],
                'target_title': df.iloc[j]['title'],
                'similarity': round(float(sim), 3)
            })

rec_df = pd.DataFrame(recommendations)
rec_df.to_csv('link_recommendations.csv', index=False)
print(f"Generated {len(rec_df)} link recommendations")
```
Step 5: Check existing links to find gaps.
```python
# Crawl existing internal links
existing_links = set()

for url in df['url'].tolist():
    try:
        r = requests.get(url, timeout=10)
        soup = BeautifulSoup(r.text, 'lxml')
        domain = urlparse(url).netloc
        for a in soup.find_all('a', href=True):
            # Normalize: resolve relative URLs, drop fragments and query strings
            link = urljoin(url, a['href']).split('#')[0].split('?')[0]
            if urlparse(link).netloc == domain:
                existing_links.add((url, link))
    except Exception:
        continue

# Filter recommendations to only the missing links, best opportunities first
rec_df['link_exists'] = rec_df.apply(
    lambda r: (r['source_url'], r['target_url']) in existing_links, axis=1)
missing = rec_df[~rec_df['link_exists']].sort_values('similarity', ascending=False)

print(f"\nMissing high-relevance links: {len(missing)}")
missing.to_csv('missing_links.csv', index=False)
```
This gives you a prioritized list of missing internal links, sorted by semantic relevance. The highest-similarity pairs that don’t already have links are your top opportunities.
How Do You Choose Good Anchor Text Using NLP?
BERT can also help generate appropriate anchor text for internal links.
The approach: For each recommended link, identify the most relevant phrase in the source page that relates to the target page's topic.
```python
from sentence_transformers import SentenceTransformer, util
import re

model = SentenceTransformer('all-MiniLM-L6-v2')

def suggest_anchor_text(source_content, target_title, target_content):
    # Extract candidate phrases from the source (3-7 word chunks)
    sentences = re.split(r'[.!?]', source_content)
    candidates = []
    for sentence in sentences:
        words = sentence.strip().split()
        for length in range(3, 8):
            for i in range(len(words) - length + 1):
                phrase = ' '.join(words[i:i + length])
                if len(phrase) > 15:
                    candidates.append(phrase)
    if not candidates:
        return target_title

    # Find the phrase most similar to the target page's content
    target_embedding = model.encode(target_title + ' ' + target_content[:500])
    candidate_embeddings = model.encode(candidates[:200])  # Limit for speed
    similarities = util.cos_sim(target_embedding, candidate_embeddings)[0]
    best_idx = similarities.argmax().item()
    return candidates[best_idx]
```
This finds the phrase in your source page that’s most semantically relevant to the target page — natural anchor text that includes related keywords without being forced.
What Common Mistakes Should You Avoid with Automated Linking?
Mistake 1: Setting the similarity threshold too low.
A threshold below 0.25 will generate links between loosely related pages. These “related” links don’t help users or search engines. Keep the threshold at 0.3-0.5 for high-quality recommendations.
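To see how the cutoff changes recommendation volume before committing to one, you can count the pairs above each threshold; a sketch with a hypothetical 4-page similarity matrix:

```python
import numpy as np

# Hypothetical symmetric similarity matrix for 4 pages
sim = np.array([
    [0.0,  0.62, 0.28, 0.45],
    [0.62, 0.0,  0.18, 0.51],
    [0.28, 0.18, 0.0,  0.22],
    [0.45, 0.51, 0.22, 0.0],
])

for threshold in (0.2, 0.3, 0.4, 0.5):
    # Count unique pairs above the cutoff (upper triangle only, diagonal excluded)
    pairs = int(np.sum(np.triu(sim, k=1) > threshold))
    print(f"threshold {threshold}: {pairs} candidate links")
```

Running this on your real matrix shows exactly how many extra, weaker recommendations each lower cutoff lets through.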
Mistake 2: Linking everything to everything.
Even with high similarity scores, a page shouldn’t have 50 internal links. Prioritize the top 5-10 most relevant connections per page. More links dilute the signal.
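One way to enforce that cap in pandas, assuming a DataFrame shaped like the `link_recommendations.csv` generated in Step 4 (the rows below are hypothetical):

```python
import pandas as pd

recs = pd.DataFrame({
    'source_url': ['/a', '/a', '/a', '/b', '/b'],
    'target_url': ['/x', '/y', '/z', '/x', '/y'],
    'similarity': [0.72, 0.55, 0.41, 0.68, 0.33],
})

MAX_LINKS_PER_PAGE = 2

# Keep only the top-N most similar targets for each source page
capped = (recs.sort_values('similarity', ascending=False)
              .groupby('source_url')
              .head(MAX_LINKS_PER_PAGE))
print(capped)
```

Sorting before `groupby(...).head(N)` guarantees the survivors are each page's strongest connections.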
Mistake 3: Ignoring link context.
A link recommendation says Page A should link to Page B. But where in Page A should the link appear? The link should be in a paragraph that's contextually relevant to Page B's topic. Don't just append a list of related links at the bottom.
Mistake 4: Set-and-forget.
Re-run the analysis monthly or whenever you publish significant new content. New pages create new linking opportunities that won’t be captured by old recommendations.
Mistake 5: Not validating recommendations.
BERT-generated recommendations are suggestions, not commands. Review the top recommendations manually to ensure they make sense. Semantic similarity isn’t perfect — occasionally unrelated pages get high scores due to shared vocabulary.
How Does NLP-Powered Internal Linking Help with GEO?
AI search engines evaluate topical authority partly through content interconnection. When AI systems see that your site has 30 pages about "CRM software" — all linking to each other with relevant anchor text — they recognize deeper expertise than a site with 30 isolated CRM articles.
Topical cluster strength:
BERT-based linking naturally creates topical clusters. Pages about related subtopics link to each other, forming dense subgraphs within your site structure. These clusters signal to AI engines: “This site covers this topic comprehensively.”
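You can make those clusters visible directly from the similarity matrix. A sketch using plain breadth-first search over the thresholded graph, with a hypothetical 6-page matrix containing two obvious clusters:

```python
import numpy as np
from collections import deque

# Hypothetical similarity matrix for 6 pages forming two topical clusters
sim = np.array([
    [0.0,  0.8,  0.7,  0.1,  0.1,  0.1],
    [0.8,  0.0,  0.75, 0.1,  0.1,  0.1],
    [0.7,  0.75, 0.0,  0.1,  0.1,  0.1],
    [0.1,  0.1,  0.1,  0.0,  0.8,  0.7],
    [0.1,  0.1,  0.1,  0.8,  0.0,  0.75],
    [0.1,  0.1,  0.1,  0.7,  0.75, 0.0],
])

def find_clusters(sim, threshold=0.3):
    # BFS over the graph whose edges are page pairs above the threshold
    n = len(sim)
    unvisited = set(range(n))
    clusters = []
    while unvisited:
        start = unvisited.pop()
        cluster, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nbr in [m for m in unvisited if sim[node, m] > threshold]:
                unvisited.remove(nbr)
                cluster.add(nbr)
                queue.append(nbr)
        clusters.append(cluster)
    return clusters

print(find_clusters(sim))  # two clusters: {0, 1, 2} and {3, 4, 5}
```

On a real site, each cluster is a topic hub; sparse or singleton clusters flag the isolated pages that need links most.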
Entity relationship mapping:
Internal links communicate entity relationships. When your "CRM pricing" page links to your "CRM comparison" page with anchor text "compare CRM platforms," you're telling AI engines that these entities are related. This relationship mapping helps AI engines build a more complete picture of your content ecosystem.
Citation chain building:
When AI engines cite one of your pages, they often explore linked pages for additional context. Strong internal linking means an AI citation on one page can lead to the AI system discovering and potentially citing other pages on your site. It’s a compounding effect — one citation leads to exploration that leads to more citations.
BERT-powered internal linking transforms this from a manual, inconsistent effort into a systematic, data-driven practice. The result is a site structure that both search engines and AI systems can navigate efficiently, building compounding topical authority over time.