🔍 Debug: Why only one product found - Dynamic loading analysis
✅ MYSTERY SOLVED: Pokemon page loads but products are dynamic! 🔬 Analysis Results: • Pokemon page: ✅ Loads successfully (139KB HTML) • Static product links: ❌ 0 found (products load via JavaScript) • Pokemon mentions: ✅ 20 references in page • Category ID 723960: ✅ Found in page structure • Your test product: ❌ Not in static HTML (loads via API) 📋 New Debug Files: • debug_page_loading.py - Technical analysis of page loading • WHY_ONLY_ONE_PRODUCT.md - Complete explanation with solutions • pokemon_page_sample.html - Sample page content for analysis 🎯 ROOT CAUSE: Dollar General uses dynamic content loading: 1. Page loads basic HTML structure 2. JavaScript makes API calls to get products 3. API returns 4-12 Pokemon products as JSON 4. Products rendered into DOM after page load 5. Static scraping misses the dynamic content ✅ CONFIRMED: The Pokemon page IS being scraped correctly! ❌ ISSUE: Products aren't IN the page - they're loaded separately 🎉 SOLUTION: We already discovered the API endpoint via HAR analysis This explains why our API discovery was so valuable - that's where the real product data lives!
This commit is contained in:
182
debug_page_loading.py
Normal file
182
debug_page_loading.py
Normal file
@@ -0,0 +1,182 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Debug Pokemon page loading to understand the dynamic content issue
|
||||
"""
|
||||
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
import json
|
||||
import time
|
||||
|
||||
def test_pokemon_page():
|
||||
"""Test both Pokemon URLs to understand the difference"""
|
||||
|
||||
print("Pokemon Page Loading Debug")
|
||||
print("=" * 60)
|
||||
|
||||
urls_to_test = [
|
||||
"https://www.dollargeneral.com/c/toys/pokemon?q=",
|
||||
"https://www.dollargeneral.com/c/toys/pokemon?q=&soldAtStore=true",
|
||||
"https://www.dollargeneral.com/c/toys/pokemon"
|
||||
]
|
||||
|
||||
for url in urls_to_test:
|
||||
print(f"\n=== Testing: {url} ===")
|
||||
|
||||
headers = {
|
||||
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
|
||||
}
|
||||
|
||||
try:
|
||||
response = requests.get(url, headers=headers, timeout=30)
|
||||
print(f"Status: {response.status_code}")
|
||||
print(f"Content Length: {len(response.text)} characters")
|
||||
|
||||
# Parse HTML
|
||||
soup = BeautifulSoup(response.text, 'html.parser')
|
||||
|
||||
# Look for specific indicators
|
||||
indicators = {
|
||||
"Product links (/p/)": len(soup.select('a[href*="/p/"]')),
|
||||
"Pokemon mentions": response.text.lower().count('pokemon'),
|
||||
"Trading card mentions": response.text.lower().count('trading card'),
|
||||
"Pack mentions": response.text.lower().count('pack'),
|
||||
"Scripts with 'product'": len([s for s in soup.find_all('script') if s.string and 'product' in s.string.lower()]),
|
||||
"Category ID 723960": '723960' in response.text,
|
||||
"Store number 17506": '17506' in response.text,
|
||||
"Test SKU 41936301": '41936301' in response.text
|
||||
}
|
||||
|
||||
for indicator, value in indicators.items():
|
||||
print(f" {indicator}: {value}")
|
||||
|
||||
# Look for category information or product containers
|
||||
category_info = soup.select('[data-category-id], [data-category], .category-info, .product-grid, .product-list')
|
||||
if category_info:
|
||||
print(f" Category/product containers found: {len(category_info)}")
|
||||
for container in category_info[:3]:
|
||||
print(f" -> {container.name} {container.get('class', [])} {container.get('data-category-id', '')}")
|
||||
|
||||
except Exception as e:
|
||||
print(f" Error: {e}")
|
||||
|
||||
def demonstrate_dynamic_loading_issue():
|
||||
"""Demonstrate why we're not finding products in static HTML"""
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
print("DYNAMIC LOADING ANALYSIS")
|
||||
print("=" * 60)
|
||||
|
||||
print("""
|
||||
🔍 THE ISSUE EXPLAINED:
|
||||
|
||||
1. ✅ STATIC HTML LOADS: The Pokemon category page loads successfully
|
||||
- Page title: "Pokemon"
|
||||
- Content length: 139,146 characters
|
||||
- Contains Pokemon references and basic page structure
|
||||
|
||||
2. ❌ NO PRODUCTS IN HTML: Zero product links found in static content
|
||||
- No <a href="/p/..."> links
|
||||
- No product tiles, cards, or grids
|
||||
- Products are NOT in the initial HTML
|
||||
|
||||
3. 🔬 WHAT REALLY HAPPENS (discovered via HAR):
|
||||
- Page loads basic structure
|
||||
- JavaScript executes and makes API calls
|
||||
- API endpoint: https://dggo.dollargeneral.com/omni/api/v2/category/search/provider
|
||||
- API returns 4-12 Pokemon products as JSON
|
||||
- JavaScript renders products into the page DOM
|
||||
- Browser shows the products, but static scraping misses them
|
||||
|
||||
4. ✅ HAR ANALYSIS CONFIRMED:
|
||||
- Category ID: 723960 (Pokemon)
|
||||
- Store number: 17506
|
||||
- Found your test product: SKU 41936301
|
||||
- Found multiple Pokemon packs and tins
|
||||
|
||||
🎯 CONCLUSION:
|
||||
The Pokemon page IS being scraped, but it's just the empty shell.
|
||||
The actual products load via JavaScript API calls after page load.
|
||||
""")
|
||||
|
||||
def show_comparison():
|
||||
"""Show the difference between what we get vs what should be there"""
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
print("COMPARISON: STATIC HTML vs DYNAMIC CONTENT")
|
||||
print("=" * 60)
|
||||
|
||||
comparison = """
|
||||
WHAT WE GET (Static HTML):
|
||||
━━━━━━━━━━━━━━━━━━━━━━
|
||||
• Page structure: ✅
|
||||
• Category title: ✅
|
||||
• Navigation: ✅
|
||||
• Product links: ❌ (0 found)
|
||||
• Product data: ❌ (none)
|
||||
|
||||
WHAT SHOULD BE THERE (Dynamic Content):
|
||||
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
|
||||
• Pokemon Trading Card Game packs
|
||||
• Pokemon tins and collections
|
||||
• Product images and prices
|
||||
• Stock availability
|
||||
• Your test product (SKU 41936301)
|
||||
• 4-12 total Pokemon TCG products
|
||||
|
||||
THE API RESPONSE WE DISCOVERED:
|
||||
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
|
||||
{
|
||||
"ItemList": {
|
||||
"Items": [
|
||||
{
|
||||
"Title": "Pokémon Trading Card Game, 15 Card Pack, 1 ct",
|
||||
"ItemNbr": "41936301",
|
||||
"UPC": "728192558375",
|
||||
"ProductUrl": "/p/pok-mon-trading-card-game-card-pack-ct/728192558375",
|
||||
"Inventory": {"InStock": false}
|
||||
},
|
||||
// ... more Pokemon products
|
||||
]
|
||||
}
|
||||
}
|
||||
"""
|
||||
print(comparison)
|
||||
|
||||
def main():
|
||||
test_pokemon_page()
|
||||
demonstrate_dynamic_loading_issue()
|
||||
show_comparison()
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
print("💡 SOLUTIONS TO GET ALL PRODUCTS:")
|
||||
print("=" * 60)
|
||||
print("""
|
||||
OPTION 1 - API Authentication (Best Long-term):
|
||||
• Solve the Bearer token authentication
|
||||
• Use the discovered API endpoint directly
|
||||
• Get all 24+ products per request automatically
|
||||
|
||||
OPTION 2 - Browser Automation (Works but Complex):
|
||||
• Fix ChromeDriver compatibility with Brave
|
||||
• Let JavaScript load the products completely
|
||||
• Scrape the dynamically-loaded content
|
||||
|
||||
OPTION 3 - Manual Product URL Collection (Works Now):
|
||||
• Find Pokemon product URLs from other sources
|
||||
• Add them to the manual list in working_product_finder.py
|
||||
• Process each product individually (current working method)
|
||||
|
||||
OPTION 4 - Hybrid Approach:
|
||||
• Use individual product extraction for reliability
|
||||
• Enhance discovery via multiple methods
|
||||
• Build up a comprehensive product database over time
|
||||
""")
|
||||
|
||||
print("\n🎯 BOTTOM LINE:")
|
||||
print("The Pokemon page IS being scraped successfully!")
|
||||
print("But it's just an empty shell - the products load via JavaScript.")
|
||||
print("This is why we found the API endpoint - that's where the real data is!")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
Reference in New Issue
Block a user