TL;DR
Million-page sites operate under fundamentally different SEO constraints than smaller sites. Googlebot will never crawl all your pages frequently, so strategic crawl prioritization becomes critical. Success requires log file analysis to understand actual Googlebot behavior, index management that decides which pages deserve indexing, internal linking architecture that directs authority to priority pages, automated quality monitoring at scale, and acceptance that most pages will be crawled infrequently. The goal shifts from “get everything indexed” to “get the right things indexed and updated.”
Do This Today (3 Quick Checks)
- Calculate your indexing ratio: GSC indexed count ÷ your actual page count. If <50%, you have index management work to do.
- Check crawl frequency distribution: In server logs, what percentage of pages were crawled in the last 30 days? This reveals Googlebot’s actual priorities.
- Identify crawl waste: What percentage of Googlebot requests go to non-indexable pages (404s, redirects, noindex)? Every wasted request is a missed opportunity.
The Scale Problem
Why 1M+ pages changes everything:
| Factor | Small Site (1K pages) | Large Site (1M+ pages) |
|---|---|---|
| **Crawl coverage** | Google crawls everything frequently | Many pages crawled rarely or never |
| **Indexing** | Most pages indexed | Must choose what deserves indexing |
| **Quality control** | Manual review possible | Requires automated monitoring |
| **Internal linking** | Manageable manually | Needs programmatic architecture |
| **Changes** | Site-wide updates feasible | Must prioritize and phase |
| **Debugging** | Check individual pages | Statistical/pattern analysis |
Crawl Budget Calculation Methodology
Step 1: Establish baseline from server logs
# Monthly Googlebot requests
grep -i "googlebot" access.log | wc -l
# Requests per day (combined log format: field 4 is the timestamp)
grep -i "googlebot" access.log | awk '{print $4}' | cut -d: -f1 | sort | uniq -c
Step 2: Calculate effective crawl budget
| Metric | Calculation | Example |
|---|---|---|
| **Total monthly crawl** | Count Googlebot requests | 500,000 |
| **Wasted crawl** | Non-200 responses + blocked + trap URLs | 150,000 (30%) |
| **Effective crawl** | Total – Wasted | 350,000 |
| **Crawl per page** | Effective ÷ Page count | 350K ÷ 2.5M = 0.14/month |
Step 3: Assess crawl health
| Crawl per Page/Month | Assessment | Action |
|---|---|---|
| >1.0 | Healthy | Maintain |
| 0.5-1.0 | Adequate | Optimize |
| 0.1-0.5 | Insufficient | Prioritize and prune |
| <0.1 | Critical | Major intervention needed |
Step 4: Calculate required improvements
If you have 2.5M pages but only 350K effective monthly crawl:
- Current coverage: 14% of pages/month
- To reach 50% coverage: Need 1.25M effective crawl
- Gap: 900K additional crawl needed
- Options: Reduce pages, reduce waste, improve crawl efficiency
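The same arithmetic, folded into a small helper you can rerun monthly; the inputs mirror the worked example above and are placeholders for your own log-derived figures.

```python
def crawl_budget_report(total_crawl: int, wasted_crawl: int, page_count: int,
                        target_coverage: float = 0.5) -> dict:
    """Steps 2-4 in one place: effective crawl, per-page rate, health band, and gap to a coverage target."""
    effective = total_crawl - wasted_crawl
    # Crawls per page per month; treats every effective request as a distinct URL, so read it as an upper bound
    per_page = effective / page_count
    if per_page > 1.0:
        health = "healthy"
    elif per_page >= 0.5:
        health = "adequate"
    elif per_page >= 0.1:
        health = "insufficient"
    else:
        health = "critical"
    required = int(page_count * target_coverage)
    return {
        "effective_crawl": effective,
        "crawl_per_page_per_month": round(per_page, 2),
        "health": health,
        "gap_to_target_coverage": max(0, required - effective),
    }

# Worked example from the tables above: 500K requests, 30% waste, 2.5M pages, 50% coverage target
print(crawl_budget_report(500_000, 150_000, 2_500_000))
# -> 350,000 effective crawl, ~0.14 crawls/page/month, "insufficient", 900,000 gap
```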
Sitemap Segmentation Strategy
Segmentation approach:
<!-- sitemap-index.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemap-priority-high.xml</loc>
<lastmod>2025-01-15</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-priority-medium.xml</loc>
<lastmod>2025-01-14</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-priority-low.xml</loc>
<lastmod>2025-01-10</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-categories.xml</loc>
<lastmod>2025-01-15</lastmod>
</sitemap>
</sitemapindex>
Segmentation criteria:
| Segment | Criteria | Size Target | Update Frequency |
|---|---|---|---|
| **High priority** | Top sellers, high traffic, recent | 50-100K URLs | Daily |
| **Medium priority** | Active products, moderate traffic | 200-500K URLs | Weekly |
| **Low priority** | Long tail, low traffic | Remaining | Monthly |
| **Categories** | All category/collection pages | <10K URLs | Daily |
| **New products** | Added in last 30 days | Variable | Daily |
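Generating the segment files takes only a few lines once you can pull (URL, lastmod) pairs per priority tier from the catalog; a minimal sketch, with hypothetical file names, that respects the 50,000-URL-per-file protocol limit:

```python
from datetime import date
from xml.sax.saxutils import escape

MAX_URLS_PER_FILE = 50_000   # sitemaps.org protocol limit per file

def write_sitemaps(urls, segment_name, out_dir="."):
    """Split a list of (loc, lastmod) tuples into <=50K-URL sitemap files for one priority segment."""
    paths = []
    for i in range(0, len(urls), MAX_URLS_PER_FILE):
        chunk = urls[i:i + MAX_URLS_PER_FILE]
        path = f"{out_dir}/sitemap-{segment_name}-{i // MAX_URLS_PER_FILE + 1}.xml"
        with open(path, "w", encoding="utf-8") as fh:
            fh.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            fh.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for loc, lastmod in chunk:
                fh.write(f"  <url><loc>{escape(loc)}</loc><lastmod>{lastmod}</lastmod></url>\n")
            fh.write("</urlset>\n")
        paths.append(path)
    return paths

# Hypothetical usage: products already bucketed by your own priority rules (sales, stock, recency)
high_priority = [("https://example.com/p/12345", date.today().isoformat())]
print(write_sitemaps(high_priority, "priority-high"))
```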
Monitoring sitemap effectiveness:
| Metric | Where to Find | Target |
|---|---|---|
| Sitemap read date | GSC Sitemaps | Recent |
| URLs discovered | GSC Sitemaps | Matches submitted |
| URLs indexed | GSC Sitemaps | >50% of submitted |
| Index rate by segment | Compare segments | Higher priority = higher rate |
The Conversation
Priya is Head of SEO at a major e-commerce retailer with 2.5 million product pages plus category and informational pages. She is struggling with inconsistent indexing and slow crawl of new products, watching competitors with smaller catalogs rank better, and has advanced SEO knowledge but is overwhelmed by scale.
Priya: We have 2.5 million product pages. New products take weeks to get indexed. Some products that have been live for months still aren’t indexed. Meanwhile, smaller competitors with 50,000 products seem to rank better. How do we manage SEO at this scale?
SDC SEO Brain: At 2.5 million pages, you’ve hit a scale where Googlebot’s crawl capacity becomes your primary constraint. Google allocates crawl budget based on perceived site importance and server capacity. Even for major sites, crawling 2.5 million pages frequently is impossible. Let’s understand your current state. Do you have access to server logs?
Priya: Yes, we have full logging.
SDC SEO Brain: That’s essential. Log file analysis is the only way to truly understand Googlebot’s behavior at scale. First analysis to run: what percentage of your 2.5 million URLs received at least one Googlebot visit in the last 30 days?
Priya: I ran that last month. Only about 400,000 unique URLs were crawled. That’s 16%.
SDC SEO Brain: So 84% of your pages weren’t crawled at all last month. That’s your fundamental problem. Even if those pages are indexable, Google isn’t getting to them frequently enough to keep them fresh or prioritize them. Let’s dig deeper. Of those 400,000 crawled URLs, what’s the breakdown by page type?
Priya: Roughly: 150,000 product pages, 50,000 category pages, 100,000 faceted/filter pages, 80,000 various utility pages, and about 20,000 I’m not sure about.
SDC SEO Brain: There’s your second problem. 100,000 crawl requests going to faceted/filter pages is massive crawl waste. Those pages are likely low-value variations of category pages. And 80,000 utility pages is suspicious. What falls into that category?
Priya: Pagination pages, old sale pages, customer service pages, some legacy URL structures we haven’t cleaned up.
SDC SEO Brain: So of 400,000 monthly Googlebot requests, potentially 200,000+ are going to low-value pages. Meanwhile, your 2.5 million product pages are competing for the remaining 200,000 crawl slots, and roughly 2.35 million of them got no Googlebot visit at all last month. That math doesn’t work.
Priya: How do we fix the allocation?
SDC SEO Brain: Crawl optimization at scale requires three parallel efforts:
1. Eliminate waste (immediate; quantified in the log-classification sketch after this list):
- Block faceted navigation from crawling
- Robots.txt block or remove utility pages
- Clean up legacy URL structures
- Fix redirect chains consuming crawl
2. Prioritize valuable pages (architectural):
- Internal linking weighted toward priority products
- XML sitemaps segmented by priority
- Homepage and category link equity flowing to best products
3. Signal freshness where needed (ongoing):
- Last-modified headers accurate
- Sitemap lastmod dates meaningful
- Content changes surfaced to Google faster
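To see where the waste actually goes before blocking anything, a rough classification pass over the Googlebot URLs from your logs is enough; a Python sketch, assuming one crawled path per line in a pre-filtered file, with hypothetical patterns you would replace with your real facet parameters and legacy paths:

```python
import re
from collections import Counter

# Hypothetical patterns; replace with your real facet parameters and legacy structures
WASTE_RULES = {
    "faceted_navigation": re.compile(r"[?&](color|size|sort|price|filter)="),
    "legacy_urls": re.compile(r"^/(old-shop|sale-archive)/"),
    "pagination": re.compile(r"[?&]page=\d+"),
}

def classify(path: str) -> str:
    for label, pattern in WASTE_RULES.items():
        if pattern.search(path):
            return label
    return "content"

counts = Counter()
# Assumed input: one Googlebot-crawled path per line, extracted from your access logs beforehand
with open("googlebot_urls.txt", encoding="utf-8") as fh:
    for line in fh:
        counts[classify(line.strip())] += 1

total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / total:.1%})")
```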
Priya: We already have XML sitemaps for all products.
SDC SEO Brain: Sitemaps don’t guarantee crawling; they’re suggestions. At your scale, Google won’t crawl your entire sitemap frequently. But you can use sitemaps strategically:
Segmented sitemaps by priority:
sitemap-products-high-priority.xml (top 100K products)
sitemap-products-medium-priority.xml (next 500K)
sitemap-products-standard.xml (remaining 1.9M)
Prioritize by:
- Sales velocity (bestsellers crawled more)
- Inventory status (in-stock vs out-of-stock)
- Margin or strategic importance
- Recent changes or new products
Google doesn’t blindly follow your priority hints, but segmentation helps you track crawl rates by tier and understand what Google actually prioritizes.
Priya: What about indexing? Even pages that get crawled don’t always get indexed.
SDC SEO Brain: At scale, “Crawled – not indexed” becomes a significant bucket. Google is saying: “I saw this page, but I chose not to include it in my index.” Reasons at scale:
- Quality threshold: Page content too thin or similar to other pages
- Duplicate signals: Near-duplicate of another product page
- Authority deficit: Not enough internal/external signals to justify indexing
- Quota limits: Google’s index has practical limits per site
How many pages are in “Crawled – not indexed” in GSC?
Priya: About 800,000.
SDC SEO Brain: 800,000 pages Google explicitly decided aren’t worth indexing. That’s a massive quality signal problem. Let’s diagnose. Sample 100 random URLs from that bucket. What do they look like?
Priya: Doing that now… A lot of them are products with very short descriptions, products similar to other products, out-of-stock items, and products with no reviews.
SDC SEO Brain: That’s your indexability problem. Google sees thin product pages with minimal unique content and decides they don’t add value to the index. At scale, you need indexability criteria:
Minimum requirements for indexing:
- Description length >150 words
- At least 1 unique product image
- Price and availability present
- Preferably: reviews, Q&A, or other unique content
Products below threshold:
- Noindex until content requirements met
- Or consolidate into parent category (don’t have individual pages)
- Or programmatically enhance with template content
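Those thresholds translate into a simple gate you can run over the catalog feed; a minimal Python sketch, with hypothetical field names and cut-offs to tune:

```python
from dataclasses import dataclass

@dataclass
class Product:
    description: str
    unique_images: int
    has_price: bool
    in_stock: bool
    review_count: int

def index_decision(p: Product, min_words: int = 150) -> str:
    """Return 'index', 'noindex', or 'consolidate' based on the minimum-content thresholds above."""
    word_count = len(p.description.split())
    meets_core = word_count >= min_words and p.unique_images >= 1 and p.has_price
    if meets_core:
        return "index"
    # Near-empty pages are better folded into the parent category than left as thin standalone URLs
    if word_count < 30 and p.unique_images == 0:
        return "consolidate"
    return "noindex"

# Hypothetical example: a thin, imageless product gets held back from the index
print(index_decision(Product("Short blurb only.", 0, True, True, 0)))   # -> "consolidate"
```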
Priya: We can’t write unique descriptions for 2.5 million products.
SDC SEO Brain: You don’t need to. Options:
Tiered content investment:
- Top 50K products: Human-written, rich content
- Next 200K: Enhanced templates with unique data points
- Remaining: Basic templates but meeting minimum thresholds, or noindex
Automated enhancement (one approach sketched after this list):
- Pull unique specs into readable sentences
- Aggregate review snippets
- Add structured product comparisons
- Generate FAQ from customer questions
User-generated content at scale:
- Reviews (incentivize submission)
- Q&A sections
- Customer photos
- Community ratings
The goal isn’t making every page exceptional; it’s ensuring every indexed page meets minimum quality thresholds.
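For the spec-to-sentence idea, a toy sketch shows the shape of it; the spec keys are hypothetical and would map from your PIM or catalog fields:

```python
def specs_to_sentences(name: str, specs: dict) -> str:
    """Turn structured spec data into readable template copy, one option for programmatic enhancement."""
    # Hypothetical spec keys; map them from your own catalog fields
    fragments = []
    if "material" in specs:
        fragments.append(f"made from {specs['material']}")
    if "weight_kg" in specs:
        fragments.append(f"weighing {specs['weight_kg']} kg")
    if "dimensions_cm" in specs:
        fragments.append(f"measuring {specs['dimensions_cm']} cm")
    if not fragments:
        return ""   # nothing unique to say; better to leave the page below threshold than pad it
    return f"The {name} is {', '.join(fragments)}."

print(specs_to_sentences("Alto Oak Desk",
                         {"material": "solid oak", "weight_kg": 32, "dimensions_cm": "140 x 70 x 75"}))
# -> "The Alto Oak Desk is made from solid oak, weighing 32 kg, measuring 140 x 70 x 75 cm."
```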
Priya: How do we monitor quality across 2.5 million pages?
SDC SEO Brain: Automated quality monitoring:
Crawl-based audits:
- Weekly Screaming Frog crawl of sample URLs
- Monitor: title length, description length, H1 presence, canonical correctness
- Flag pages falling below thresholds
Log-based monitoring:
- Googlebot crawl patterns by section
- Response code distribution
- Crawl frequency by page type
- New vs returning crawl ratio
GSC monitoring:
- Indexed count trend (should be stable or growing intentionally)
- “Crawled – not indexed” trend
- Coverage issues by type
- Core Web Vitals by page type
Custom dashboards:
- Content quality scores aggregated
- Pages below threshold counts
- New page indexing velocity
- Section-by-section health metrics
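For the dashboard layer, even a small script over a crawl export keeps below-threshold counts visible by section; this sketch assumes Screaming Frog-style column names and a hypothetical CSV path, both of which will differ in your setup:

```python
import csv
from collections import defaultdict

# Assumed crawl-export columns and minimums; adjust to whatever your crawler actually emits
THRESHOLDS = {"Title 1 Length": 30, "Meta Description 1 Length": 70, "Word Count": 150}

section_totals = defaultdict(int)
section_flagged = defaultdict(int)

with open("crawl_export.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        # First path segment as the "section" (e.g., /category/... -> "category")
        section = row["Address"].split("/")[3] if row["Address"].count("/") >= 3 else "root"
        section_totals[section] += 1
        if any(int(row.get(col, 0) or 0) < minimum for col, minimum in THRESHOLDS.items()):
            section_flagged[section] += 1

for section, total in sorted(section_totals.items()):
    flagged = section_flagged[section]
    print(f"{section}: {flagged}/{total} pages below threshold ({flagged / total:.0%})")
```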
Priya: Our new products are the most urgent. They take weeks to get indexed.
SDC SEO Brain: New product indexing velocity at scale requires:
Immediate signals:
- Internal links from high-traffic pages (homepage featured, category prominent placement)
- Sitemap inclusion in high-priority segment
- GSC URL Inspection for critical products (doesn’t scale, use selectively)
Structural advantages:
- New products get more internal links than old products initially
- “New arrivals” sections on categories with fresh links
- Homepage rotation of new products
Freshness signals:
- Accurate last-modified headers
- Sitemap lastmod reflecting true publication date
- Content that indicates newness
At your scale, you can’t manually push every product. Build systems that automatically give new products more crawl signals, then taper as they age.
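One way to systematize that taper is to fold page age into the sitemap segment assignment; a rough sketch with made-up weights and thresholds:

```python
from datetime import date

def sitemap_segment(published: date, sales_rank: int, today: date | None = None) -> str:
    """Assign a product to a sitemap segment, boosting new items and tapering the boost with age."""
    today = today or date.today()
    age_days = (today - published).days
    if age_days <= 30:
        return "new-products"          # fresh items ride the daily, high-priority segment
    if sales_rank <= 100_000:
        return "priority-high"
    if sales_rank <= 600_000:
        return "priority-medium"
    return "priority-low"              # long tail: monthly sitemap, fewer internal link slots

# Hypothetical example: a three-week-old, mid-ranked product still gets the new-arrivals boost
print(sitemap_segment(date(2025, 1, 1), sales_rank=250_000, today=date(2025, 1, 20)))   # -> "new-products"
```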
FAQ
Q: How do I calculate optimal indexing ratio?
A: There’s no universal optimal ratio. The goal is: all quality pages indexed, low-quality pages noindexed. If you have 2.5M pages and 1M are thin, ideal indexed count is 1.5M, not 2.5M.
Q: Should I reduce my page count to improve crawl coverage?
A: If pages aren’t adding value, yes. Consolidating 10 thin product variants into 1 comprehensive page is often better than 10 pages competing for crawl budget.
Q: How do server resources affect crawl budget?
A: Faster servers allow Google to crawl more without overloading you. Slow response times cause Google to throttle crawl rate. At scale, server performance directly impacts crawl capacity.
Q: How often should Googlebot crawl my most important pages?
A: Depends on change frequency. High-inventory e-commerce might want daily crawls of key categories. Content that rarely changes doesn’t need frequent crawling. Log analysis reveals actual crawl frequency.
Q: Can I request more crawl budget from Google?
A: Not directly. You influence it through: site quality (Google crawls quality sites more), server speed (allows faster crawling), fresh content (signals value in crawling), and eliminating waste (more budget for good pages).
Summary
At 1M+ pages, crawl budget is your primary constraint. Googlebot cannot and will not crawl all your pages frequently. Strategic prioritization replaces “index everything.”
Log file analysis is mandatory:
- Understand actual Googlebot behavior
- Identify crawl waste (low-value pages consuming budget)
- Measure crawl distribution across page types
- Track changes over time
Crawl optimization priorities:
- Eliminate waste (block low-value pages)
- Direct crawl to priority pages (internal linking, sitemaps)
- Signal freshness appropriately (lastmod, headers)
Index management at scale:
- Not all pages deserve indexing
- Set minimum quality thresholds
- Noindex pages that don’t meet thresholds
- Monitor “Crawled – not indexed” as quality signal
Content quality at scale:
- Tiered investment (high/medium/low)
- Programmatic enhancement for long tail
- User-generated content for uniqueness
- Automated monitoring for quality regression
New content velocity:
- Structural advantages for new pages
- Fresh internal linking signals
- Sitemap prioritization
- Gradual taper as content ages
Sources
- Google Search Central: Large site crawl management – https://developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget
- Google: Crawl budget – https://developers.google.com/search/blog/2017/01/what-crawl-budget-means-for-googlebot
- Google: Sitemaps best practices – https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap