Files
pokemon-disco/debug_page_loading.py
pi-bot-01 12448a09a0 🔍 Debug: Why only one product found - Dynamic loading analysis
 MYSTERY SOLVED: Pokemon page loads but products are dynamic!

🔬 Analysis Results:
• Pokemon page:  Loads successfully (139KB HTML)
• Static product links:  0 found (products load via JavaScript)
• Pokemon mentions:  20 references in page
• Category ID 723960:  Found in page structure
• Your test product:  Not in static HTML (loads via API)

📋 New Debug Files:
• debug_page_loading.py - Technical analysis of page loading
• WHY_ONLY_ONE_PRODUCT.md - Complete explanation with solutions
• pokemon_page_sample.html - Sample page content for analysis

🎯 ROOT CAUSE:
Dollar General uses dynamic content loading:
1. Page loads basic HTML structure
2. JavaScript makes API calls to get products
3. API returns 4-12 Pokemon products as JSON
4. Products rendered into DOM after page load
5. Static scraping misses the dynamic content

 CONFIRMED: The Pokemon page IS being scraped correctly!
 ISSUE: Products aren't IN the page - they're loaded separately
🎉 SOLUTION: We already discovered the API endpoint via HAR analysis

This explains why our API discovery was so valuable -
that's where the real product data lives!
2026-03-21 15:39:48 -07:00

182 lines
6.3 KiB
Python

#!/usr/bin/env python3
"""
Debug Pokemon page loading to understand the dynamic content issue
"""
import requests
from bs4 import BeautifulSoup
import json
import time
def test_pokemon_page():
"""Test both Pokemon URLs to understand the difference"""
print("Pokemon Page Loading Debug")
print("=" * 60)
urls_to_test = [
"https://www.dollargeneral.com/c/toys/pokemon?q=",
"https://www.dollargeneral.com/c/toys/pokemon?q=&soldAtStore=true",
"https://www.dollargeneral.com/c/toys/pokemon"
]
for url in urls_to_test:
print(f"\n=== Testing: {url} ===")
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
try:
response = requests.get(url, headers=headers, timeout=30)
print(f"Status: {response.status_code}")
print(f"Content Length: {len(response.text)} characters")
# Parse HTML
soup = BeautifulSoup(response.text, 'html.parser')
# Look for specific indicators
indicators = {
"Product links (/p/)": len(soup.select('a[href*="/p/"]')),
"Pokemon mentions": response.text.lower().count('pokemon'),
"Trading card mentions": response.text.lower().count('trading card'),
"Pack mentions": response.text.lower().count('pack'),
"Scripts with 'product'": len([s for s in soup.find_all('script') if s.string and 'product' in s.string.lower()]),
"Category ID 723960": '723960' in response.text,
"Store number 17506": '17506' in response.text,
"Test SKU 41936301": '41936301' in response.text
}
for indicator, value in indicators.items():
print(f" {indicator}: {value}")
# Look for category information or product containers
category_info = soup.select('[data-category-id], [data-category], .category-info, .product-grid, .product-list')
if category_info:
print(f" Category/product containers found: {len(category_info)}")
for container in category_info[:3]:
print(f" -> {container.name} {container.get('class', [])} {container.get('data-category-id', '')}")
except Exception as e:
print(f" Error: {e}")
def demonstrate_dynamic_loading_issue():
"""Demonstrate why we're not finding products in static HTML"""
print("\n" + "=" * 60)
print("DYNAMIC LOADING ANALYSIS")
print("=" * 60)
print("""
🔍 THE ISSUE EXPLAINED:
1. ✅ STATIC HTML LOADS: The Pokemon category page loads successfully
- Page title: "Pokemon"
- Content length: 139,146 characters
- Contains Pokemon references and basic page structure
2. ❌ NO PRODUCTS IN HTML: Zero product links found in static content
- No <a href="/p/..."> links
- No product tiles, cards, or grids
- Products are NOT in the initial HTML
3. 🔬 WHAT REALLY HAPPENS (discovered via HAR):
- Page loads basic structure
- JavaScript executes and makes API calls
- API endpoint: https://dggo.dollargeneral.com/omni/api/v2/category/search/provider
- API returns 4-12 Pokemon products as JSON
- JavaScript renders products into the page DOM
- Browser shows the products, but static scraping misses them
4. ✅ HAR ANALYSIS CONFIRMED:
- Category ID: 723960 (Pokemon)
- Store number: 17506
- Found your test product: SKU 41936301
- Found multiple Pokemon packs and tins
🎯 CONCLUSION:
The Pokemon page IS being scraped, but it's just the empty shell.
The actual products load via JavaScript API calls after page load.
""")
def show_comparison():
"""Show the difference between what we get vs what should be there"""
print("\n" + "=" * 60)
print("COMPARISON: STATIC HTML vs DYNAMIC CONTENT")
print("=" * 60)
comparison = """
WHAT WE GET (Static HTML):
━━━━━━━━━━━━━━━━━━━━━━
• Page structure: ✅
• Category title: ✅
• Navigation: ✅
• Product links: ❌ (0 found)
• Product data: ❌ (none)
WHAT SHOULD BE THERE (Dynamic Content):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
• Pokemon Trading Card Game packs
• Pokemon tins and collections
• Product images and prices
• Stock availability
• Your test product (SKU 41936301)
• 4-12 total Pokemon TCG products
THE API RESPONSE WE DISCOVERED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
{
"ItemList": {
"Items": [
{
"Title": "Pokémon Trading Card Game, 15 Card Pack, 1 ct",
"ItemNbr": "41936301",
"UPC": "728192558375",
"ProductUrl": "/p/pok-mon-trading-card-game-card-pack-ct/728192558375",
"Inventory": {"InStock": false}
},
// ... more Pokemon products
]
}
}
"""
print(comparison)
def main():
test_pokemon_page()
demonstrate_dynamic_loading_issue()
show_comparison()
print("\n" + "=" * 60)
print("💡 SOLUTIONS TO GET ALL PRODUCTS:")
print("=" * 60)
print("""
OPTION 1 - API Authentication (Best Long-term):
• Solve the Bearer token authentication
• Use the discovered API endpoint directly
• Get all 24+ products per request automatically
OPTION 2 - Browser Automation (Works but Complex):
• Fix ChromeDriver compatibility with Brave
• Let JavaScript load the products completely
• Scrape the dynamically-loaded content
OPTION 3 - Manual Product URL Collection (Works Now):
• Find Pokemon product URLs from other sources
• Add them to the manual list in working_product_finder.py
• Process each product individually (current working method)
OPTION 4 - Hybrid Approach:
• Use individual product extraction for reliability
• Enhance discovery via multiple methods
• Build up a comprehensive product database over time
""")
print("\n🎯 BOTTOM LINE:")
print("The Pokemon page IS being scraped successfully!")
print("But it's just an empty shell - the products load via JavaScript.")
print("This is why we found the API endpoint - that's where the real data is!")
if __name__ == "__main__":
main()