TL;DR
Million-page sites operate under fundamentally different SEO constraints than smaller sites. Googlebot will never crawl all your pages frequently, so strategic crawl prioritization becomes critical. Success requires log file analysis to understand actual Googlebot behavior, index management that decides which pages deserve indexing, internal linking architecture that directs authority to priority pages, automated quality monitoring at scale, and acceptance that most pages will be crawled infrequently. The goal shifts from “get everything indexed” to “get the right things indexed and updated.”
Do This Today (3 Quick Checks)
- Calculate your indexing ratio: GSC indexed count ÷ your actual page count. If <50%, you have index management work to do.
- Check crawl frequency distribution: In server logs, what percentage of pages were crawled in the last 30 days? This reveals Googlebot’s actual priorities.
- Identify crawl waste: What percentage of Googlebot requests go to non-indexable pages (404s, redirects, noindex)? Every wasted request is a missed opportunity.
The Scale Problem
Why 1M+ pages changes everything:
| Factor | Small Site (1K pages) | Large Site (1M+ pages) |
|---|---|---|
| **Crawl coverage** | Google crawls everything frequently | Many pages crawled rarely or never |
| **Indexing** | Most pages indexed | Must choose what deserves indexing |
| **Quality control** | Manual review possible | Requires automated monitoring |
| **Internal linking** | Manageable manually | Needs programmatic architecture |
| **Changes** | Site-wide updates feasible | Must prioritize and phase |
| **Debugging** | Check individual pages | Statistical/pattern analysis |
Crawl Budget Calculation Methodology
Step 1: Establish baseline from server logs
# Monthly Googlebot requests
grep -i "googlebot" access.log | wc -l
# Requests per day (combined log format: field 4 is the timestamp)
grep -i "googlebot" access.log | awk '{print $4}' | cut -d: -f1 | sort | uniq -c
Step 2: Calculate effective crawl budget
| Metric | Calculation | Example |
|---|---|---|
| **Total monthly crawl** | Count Googlebot requests | 500,000 |
| **Wasted crawl** | Non-200 responses + blocked + trap URLs | 150,000 (30%) |
| **Effective crawl** | Total – Wasted | 350,000 |
| **Crawl per page** | Effective ÷ Page count | 350K ÷ 2.5M = 0.14/month |
Step 3: Assess crawl health
| Crawl per Page/Month | Assessment | Action |
|---|---|---|
| >1.0 | Healthy | Maintain |
| 0.5-1.0 | Adequate | Optimize |
| 0.1-0.5 | Insufficient | Prioritize and prune |
| <0.1 | Critical | Major intervention needed |
Step 4: Calculate required improvements
If you have 2.5M pages but only 350K effective monthly crawl:
- Current coverage: 14% of pages/month
- To reach 50% coverage: Need 1.25M effective crawl
- Gap: 900K additional crawl needed
- Options: Reduce pages, reduce waste, improve crawl efficiency
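The same arithmetic, folded into a small helper you can rerun monthly; the inputs mirror the worked example above and are placeholders for your own log-derived figures.

```python
def crawl_budget_report(total_crawl: int, wasted_crawl: int, page_count: int,
                        target_coverage: float = 0.5) -> dict:
    """Steps 2-4 in one place: effective crawl, per-page rate, health band, and gap to a coverage target."""
    effective = total_crawl - wasted_crawl
    # Crawls per page per month; treats every effective request as a distinct URL, so read it as an upper bound
    per_page = effective / page_count
    if per_page > 1.0:
        health = "healthy"
    elif per_page >= 0.5:
        health = "adequate"
    elif per_page >= 0.1:
        health = "insufficient"
    else:
        health = "critical"
    required = int(page_count * target_coverage)
    return {
        "effective_crawl": effective,
        "crawl_per_page_per_month": round(per_page, 2),
        "health": health,
        "gap_to_target_coverage": max(0, required - effective),
    }

# Worked example from the tables above: 500K requests, 30% waste, 2.5M pages, 50% coverage target
print(crawl_budget_report(500_000, 150_000, 2_500_000))
# -> 350,000 effective crawl, ~0.14 crawls/page/month, "insufficient", 900,000 gap
```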
Sitemap Segmentation Strategy
Segmentation approach:
<!-- sitemap-index.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemap-priority-high.xml</loc>
<lastmod>2025-01-15</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-priority-medium.xml</loc>
<lastmod>2025-01-14</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-priority-low.xml</loc>
<lastmod>2025-01-10</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-categories.xml</loc>
<lastmod>2025-01-15</lastmod>
</sitemap>
</sitemapindex>
Segmentation criteria:
| Segment | Criteria | Size Target | Update Frequency |
|---|---|---|---|
| **High priority** | Top sellers, high traffic, recent | 50-100K URLs | Daily |
| **Medium priority** | Active products, moderate traffic | 200-500K URLs | Weekly |
| **Low priority** | Long tail, low traffic | Remaining | Monthly |
| **Categories** | All category/collection pages | <10K URLs | Daily |
| **New products** | Added in last 30 days | Variable | Daily |
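Generating the segment files takes only a few lines once you can pull (URL, lastmod) pairs per priority tier from the catalog; a minimal sketch, with hypothetical file names, that respects the 50,000-URL-per-file protocol limit:

```python
from datetime import date
from xml.sax.saxutils import escape

MAX_URLS_PER_FILE = 50_000   # sitemaps.org protocol limit per file

def write_sitemaps(urls, segment_name, out_dir="."):
    """Split a list of (loc, lastmod) tuples into <=50K-URL sitemap files for one priority segment."""
    paths = []
    for i in range(0, len(urls), MAX_URLS_PER_FILE):
        chunk = urls[i:i + MAX_URLS_PER_FILE]
        path = f"{out_dir}/sitemap-{segment_name}-{i // MAX_URLS_PER_FILE + 1}.xml"
        with open(path, "w", encoding="utf-8") as fh:
            fh.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            fh.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for loc, lastmod in chunk:
                fh.write(f"  <url><loc>{escape(loc)}</loc><lastmod>{lastmod}</lastmod></url>\n")
            fh.write("</urlset>\n")
        paths.append(path)
    return paths

# Hypothetical usage: products already bucketed by your own priority rules (sales, stock, recency)
high_priority = [("https://example.com/p/12345", date.today().isoformat())]
print(write_sitemaps(high_priority, "priority-high"))
```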
Monitoring sitemap effectiveness:
| Metric | Where to Find | Target |
|---|---|---|
| Sitemap read date | GSC Sitemaps | Recent |
| URLs discovered | GSC Sitemaps | Matches submitted |
| URLs indexed | GSC Sitemaps | >50% of submitted |
| Index rate by segment | Compare segments | Higher priority = higher rate |
The Conversation
Priya is Head of SEO at a major e-commerce retailer with 2.5 million product pages plus category and informational pages. She is struggling with inconsistent indexing and slow crawl of new products, watching competitors with smaller catalogs rank better, and has advanced SEO knowledge but is overwhelmed by scale.
Priya: We have 2.5 million product pages. New products take weeks to get indexed. Some products that have been live for months still aren’t indexed. Meanwhile, smaller competitors with 50,000 products seem to rank better. How do we manage SEO at this scale?
SDC SEO Brain: At 2.5 million pages, you’ve hit a scale where Googlebot’s crawl capacity becomes your primary constraint. Google allocates crawl budget based on perceived site importance and server capacity. Even for major sites, crawling 2.5 million pages frequently is impossible. Let’s understand your current state. Do you have access to server logs?
Priya: Yes, we have full logging.
SDC SEO Brain: That’s essential. Log file analysis is the only way to truly understand Googlebot’s behavior at scale. First analysis to run: what percentage of your 2.5 million URLs received at least one Googlebot visit in the last 30 days?
Priya: I ran that last month. Only about 400,000 unique URLs were crawled. That’s 16%.
SDC SEO Brain: So 84% of your pages weren’t crawled at all last month. That’s your fundamental problem. Even if those pages are indexable, Google isn’t getting to them frequently enough to keep them fresh or prioritize them. Let’s dig deeper. Of those 400,000 crawled URLs, what’s the breakdown by page type?
Priya: Roughly: 150,000 product pages, 50,000 category pages, 100,000 faceted/filter pages, 80,000 various utility pages, and about 20,000 I’m not sure about.
SDC SEO Brain: There’s your second problem. 100,000 crawl requests going to faceted/filter pages is massive crawl waste. Those pages are likely low-value variations of category pages. And 80,000 utility pages is suspicious. What falls into that category?
Priya: Pagination pages, old sale pages, customer service pages, some legacy URL structures we haven’t cleaned up.
SDC SEO Brain: So of 400,000 monthly Googlebot requests, potentially 200,000+ are going to low-value pages. Meanwhile, your 2.5 million product pages are competing for the remaining 200,000 crawl slots, and roughly 2.35 million of them got no Googlebot visit at all last month. That math doesn’t work.
Priya: How do we fix the allocation?
SDC SEO Brain: Crawl optimization at scale requires three parallel efforts:
1. Eliminate waste (immediate; quantified in the log-classification sketch after this list):
- Block faceted navigation from crawling
- Robots.txt block or remove utility pages
- Clean up legacy URL structures
- Fix redirect chains consuming crawl
2. Prioritize valuable pages (architectural):
- Internal linking weighted toward priority products
- XML sitemaps segmented by priority
- Homepage and category link equity flowing to best products
3. Signal freshness where needed (ongoing):
- Last-modified headers accurate
- Sitemap lastmod dates meaningful
- Content changes surfaced to Google faster
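To see where the waste actually goes before blocking anything, a rough classification pass over the Googlebot URLs from your logs is enough; a Python sketch, assuming one crawled path per line in a pre-filtered file, with hypothetical patterns you would replace with your real facet parameters and legacy paths:

```python
import re
from collections import Counter

# Hypothetical patterns; replace with your real facet parameters and legacy structures
WASTE_RULES = {
    "faceted_navigation": re.compile(r"[?&](color|size|sort|price|filter)="),
    "legacy_urls": re.compile(r"^/(old-shop|sale-archive)/"),
    "pagination": re.compile(r"[?&]page=\d+"),
}

def classify(path: str) -> str:
    for label, pattern in WASTE_RULES.items():
        if pattern.search(path):
            return label
    return "content"

counts = Counter()
# Assumed input: one Googlebot-crawled path per line, extracted from your access logs beforehand
with open("googlebot_urls.txt", encoding="utf-8") as fh:
    for line in fh:
        counts[classify(line.strip())] += 1

total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / total:.1%})")
```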
Priya: We already have XML sitemaps for all products.
SDC SEO Brain: Sitemaps don’t guarantee crawling; they’re suggestions. At your scale, Google won’t crawl your entire sitemap frequently. But you can use sitemaps strategically:
Segmented sitemaps by priority:
sitemap-products-high-priority.xml (top 100K products)
sitemap-products-medium-priority.xml (next 500K)
sitemap-products-standard.xml (remaining 1.9M)
Prioritize by:
- Sales velocity (bestsellers crawled more)
- Inventory status (in-stock vs out-of-stock)
- Margin or strategic importance
- Recent changes or new products
Google doesn’t blindly follow your priority hints, but segmentation helps you track crawl rates by tier and understand what Google actually prioritizes.
Priya: What about indexing? Even pages that get crawled don’t always get indexed.
SDC SEO Brain: At scale, “Crawled – not indexed” becomes a significant bucket. Google is saying: “I saw this page, but I chose not to include it in my index.” Reasons at scale:
- Quality threshold: Page content too thin or similar to other pages
- Duplicate signals: Near-duplicate of another product page
- Authority deficit: Not enough internal/external signals to justify indexing
- Quota limits: Google’s index has practical limits per site
How many pages are in “Crawled – not indexed” in GSC?
Priya: About 800,000.
SDC SEO Brain: 800,000 pages Google explicitly decided aren’t worth indexing. That’s a massive quality signal problem. Let’s diagnose. Sample 100 random URLs from that bucket. What do they look like?
Priya: Doing that now… A lot of them are products with very short descriptions, products similar to other products, out-of-stock items, and products with no reviews.
SDC SEO Brain: That’s your indexability problem. Google sees thin product pages with minimal unique content and decides they don’t add value to the index. At scale, you need indexability criteria:
Minimum requirements for indexing:
- Description length >150 words
- At least 1 unique product image
- Price and availability present
- Preferably: reviews, Q&A, or other unique content
Products below threshold:
- Noindex until content requirements met
- Or consolidate into parent category (don’t have individual pages)
- Or programmatically enhance with template content
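Those thresholds translate into a simple gate you can run over the catalog feed; a minimal Python sketch, with hypothetical field names and cut-offs to tune:

```python
from dataclasses import dataclass

@dataclass
class Product:
    description: str
    unique_images: int
    has_price: bool
    in_stock: bool
    review_count: int

def index_decision(p: Product, min_words: int = 150) -> str:
    """Return 'index', 'noindex', or 'consolidate' based on the minimum-content thresholds above."""
    word_count = len(p.description.split())
    meets_core = word_count >= min_words and p.unique_images >= 1 and p.has_price
    if meets_core:
        return "index"
    # Near-empty pages are better folded into the parent category than left as thin standalone URLs
    if word_count < 30 and p.unique_images == 0:
        return "consolidate"
    return "noindex"

# Hypothetical example: a thin, imageless product gets held back from the index
print(index_decision(Product("Short blurb only.", 0, True, True, 0)))   # -> "consolidate"
```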
Priya: We can’t write unique descriptions for 2.5 million products.
SDC SEO Brain: You don’t need to. Options:
Tiered content investment:
- Top 50K products: Human-written, rich content
- Next 200K: Enhanced templates with unique data points
- Remaining: Basic templates but meeting minimum thresholds, or noindex
Automated enhancement (one approach sketched after this list):
- Pull unique specs into readable sentences
- Aggregate review snippets
- Add structured product comparisons
- Generate FAQ from customer questions
User-generated content at scale:
- Reviews (incentivize submission)
- Q&A sections
- Customer photos
- Community ratings
The goal isn’t making every page exceptional; it’s ensuring every indexed page meets minimum quality thresholds.
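For the spec-to-sentence idea, a toy sketch shows the shape of it; the spec keys are hypothetical and would map from your PIM or catalog fields:

```python
def specs_to_sentences(name: str, specs: dict) -> str:
    """Turn structured spec data into readable template copy, one option for programmatic enhancement."""
    # Hypothetical spec keys; map them from your own catalog fields
    fragments = []
    if "material" in specs:
        fragments.append(f"made from {specs['material']}")
    if "weight_kg" in specs:
        fragments.append(f"weighing {specs['weight_kg']} kg")
    if "dimensions_cm" in specs:
        fragments.append(f"measuring {specs['dimensions_cm']} cm")
    if not fragments:
        return ""   # nothing unique to say; better to leave the page below threshold than pad it
    return f"The {name} is {', '.join(fragments)}."

print(specs_to_sentences("Alto Oak Desk",
                         {"material": "solid oak", "weight_kg": 32, "dimensions_cm": "140 x 70 x 75"}))
# -> "The Alto Oak Desk is made from solid oak, weighing 32 kg, measuring 140 x 70 x 75 cm."
```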
Priya: How do we monitor quality across 2.5 million pages?
SDC SEO Brain: Automated quality monitoring:
Crawl-based audits:
- Weekly Screaming Frog crawl of sample URLs
- Monitor: title length, description length, H1 presence, canonical correctness
- Flag pages falling below thresholds
Log-based monitoring:
- Googlebot crawl patterns by section
- Response code distribution
- Crawl frequency by page type
- New vs returning crawl ratio
GSC monitoring:
- Indexed count trend (should be stable or growing intentionally)
- “Crawled – not indexed” trend
- Coverage issues by type
- Core Web Vitals by page type
Custom dashboards:
- Content quality scores aggregated
- Pages below threshold counts
- New page indexing velocity
- Section-by-section health metrics
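For the dashboard layer, even a small script over a crawl export keeps below-threshold counts visible by section; this sketch assumes Screaming Frog-style column names and a hypothetical CSV path, both of which will differ in your setup:

```python
import csv
from collections import defaultdict

# Assumed crawl-export columns and minimums; adjust to whatever your crawler actually emits
THRESHOLDS = {"Title 1 Length": 30, "Meta Description 1 Length": 70, "Word Count": 150}

section_totals = defaultdict(int)
section_flagged = defaultdict(int)

with open("crawl_export.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        # First path segment as the "section" (e.g., /category/... -> "category")
        section = row["Address"].split("/")[3] if row["Address"].count("/") >= 3 else "root"
        section_totals[section] += 1
        if any(int(row.get(col, 0) or 0) < minimum for col, minimum in THRESHOLDS.items()):
            section_flagged[section] += 1

for section, total in sorted(section_totals.items()):
    flagged = section_flagged[section]
    print(f"{section}: {flagged}/{total} pages below threshold ({flagged / total:.0%})")
```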
Priya: Our new products are the most urgent. They take weeks to get indexed.
SDC SEO Brain: New product indexing velocity at scale requires:
Immediate signals:
- Internal links from high-traffic pages (homepage featured, category prominent placement)
- Sitemap inclusion in high-priority segment
- GSC URL Inspection for critical products (doesn’t scale, use selectively)
Structural advantages:
- New products get more internal links than old products initially
- “New arrivals” sections on categories with fresh links
- Homepage rotation of new products
Freshness signals:
- Accurate last-modified headers
- Sitemap lastmod reflecting true publication date
- Content that indicates newness
At your scale, you can’t manually push every product. Build systems that automatically give new products more crawl signals, then taper as they age.
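One way to systematize that taper is to fold page age into the sitemap segment assignment; a rough sketch with made-up weights and thresholds:

```python
from datetime import date

def sitemap_segment(published: date, sales_rank: int, today: date | None = None) -> str:
    """Assign a product to a sitemap segment, boosting new items and tapering the boost with age."""
    today = today or date.today()
    age_days = (today - published).days
    if age_days <= 30:
        return "new-products"          # fresh items ride the daily, high-priority segment
    if sales_rank <= 100_000:
        return "priority-high"
    if sales_rank <= 600_000:
        return "priority-medium"
    return "priority-low"              # long tail: monthly sitemap, fewer internal link slots

# Hypothetical example: a three-week-old, mid-ranked product still gets the new-arrivals boost
print(sitemap_segment(date(2025, 1, 1), sales_rank=250_000, today=date(2025, 1, 20)))   # -> "new-products"
```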
FAQ
Q: How do I calculate optimal indexing ratio?
A: There’s no universal optimal ratio. The goal is: all quality pages indexed, low-quality pages noindexed. If you have 2.5M pages and 1M are thin, ideal indexed count is 1.5M, not 2.5M.
Q: Should I reduce my page count to improve crawl coverage?
A: If pages aren’t adding value, yes. Consolidating 10 thin product variants into 1 comprehensive page is often better than 10 pages competing for crawl budget.
Q: How do server resources affect crawl budget?
A: Faster servers allow Google to crawl more without overloading you. Slow response times cause Google to throttle crawl rate. At scale, server performance directly impacts crawl capacity.
Q: How often should Googlebot crawl my most important pages?
A: Depends on change frequency. High-inventory e-commerce might want daily crawls of key categories. Content that rarely changes doesn’t need frequent crawling. Log analysis reveals actual crawl frequency.
Q: Can I request more crawl budget from Google?
A: Not directly. You influence it through: site quality (Google crawls quality sites more), server speed (allows faster crawling), fresh content (signals value in crawling), and eliminating waste (more budget for good pages).
Summary
At 1M+ pages, crawl budget is your primary constraint. Googlebot cannot and will not crawl all your pages frequently. Strategic prioritization replaces “index everything.”
Log file analysis is mandatory:
- Understand actual Googlebot behavior
- Identify crawl waste (low-value pages consuming budget)
- Measure crawl distribution across page types
- Track changes over time
Crawl optimization priorities:
- Eliminate waste (block low-value pages)
- Direct crawl to priority pages (internal linking, sitemaps)
- Signal freshness appropriately (lastmod, headers)
Index management at scale:
- Not all pages deserve indexing
- Set minimum quality thresholds
- Noindex pages that don’t meet thresholds
- Monitor “Crawled – not indexed” as quality signal
Content quality at scale:
- Tiered investment (high/medium/low)
- Programmatic enhancement for long tail
- User-generated content for uniqueness
- Automated monitoring for quality regression
New content velocity:
- Structural advantages for new pages
- Fresh internal linking signals
- Sitemap prioritization
- Gradual taper as content ages
Sources
- Google Search Central: Large site crawl management – https://developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget
- Google: Crawl budget – https://developers.google.com/search/blog/2017/01/what-crawl-budget-means-for-googlebot
- Google: Sitemaps best practices – https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap