# Pokemon Discovery (pokemon-disco)

A comprehensive tool for discovering Pokemon Trading Card Game products from Dollar General's website and generating a professional PDF catalog with product images, details, and UPC-A barcodes.

## Features

- **🔍 API Discovery**: Discovered Dollar General's internal product API via HAR analysis
- **📱 Product Extraction**: Successfully extracts Pokemon TCG product details (title, SKU, price, stock)
- **🏷️ Barcode Generation**: Creates scannable UPC-A barcodes for inventory management
- **📄 PDF Catalogs**: Professional PDF catalogs with images, details, and barcodes
- **🕰️ Unix-Friendly**: Timestamped filenames (`YYYYMMDD_HHMMSS`) for easy scripting
- **🌐 Brave Browser Support**: Configured for dynamic content scraping
- **🛡️ Anti-Bot Handling**: Multiple fallback strategies (requests → Selenium → individual products)

## Requirements

### System Requirements
- Python 3.7+
- pandoc (for PDF generation)
- Chrome/Chromium browser (for Selenium fallback)

### Python Dependencies
All dependencies are listed in `requirements.txt` and are installed automatically by `run_scraper.py`:
- requests
- beautifulsoup4
- selenium
- webdriver-manager
- python-barcode
- Pillow
- pandas
- lxml

## Installation

1. **Clone/Download** this directory to your system

2. **Install pandoc** (required for PDF generation):
   ```bash
   # Ubuntu/Debian
   sudo apt install pandoc

   # macOS
   brew install pandoc

   # Arch Linux
   sudo pacman -S pandoc
   ```

3. **Install Python dependencies** (automatically done by the script):
   ```bash
   cd pokemon-disco
   pip3 install -r requirements.txt
   ```

## Usage

### Quick Start (Recommended)

Run the complete pipeline with one command:

```bash
cd pokemon-disco
python3 run_scraper.py
```

This will:
1. Check and install Python requirements
2. Scrape Pokemon TCG products from Dollar General
3. Generate a PDF catalog with images and barcodes
4. Create timestamped files for easy organization

### Manual Usage

If you prefer to run components separately:

#### 1. Scrape Products
```bash
python3 scraper.py
```
This creates a JSON file like `pokemon_tcg_products_20241221_143025.json`

#### 2. Generate PDF Catalog
```bash
python3 pdf_generator.py pokemon_tcg_products_20241221_143025.json
```

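When the two manual steps are scripted, the newest scrape has to be located before it can be handed to `pdf_generator.py`. The following is a minimal sketch of one way to do that, assuming the JSON files sit in the working directory and follow the `pokemon_tcg_products_YYYYMMDD_HHMMSS.json` pattern. The helper name `latest_products_file` is hypothetical and is not part of the repository.

```python
from pathlib import Path
from typing import Optional


def latest_products_file(directory: str = ".") -> Optional[Path]:
    """Return the newest pokemon_tcg_products_*.json file, or None if there is none.

    YYYYMMDD_HHMMSS timestamps sort chronologically as plain strings,
    so the lexicographically largest filename is the most recent scrape.
    """
    candidates = sorted(
        Path(directory).glob("pokemon_tcg_products_*.json"),
        key=lambda path: path.name,
    )
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    newest = latest_products_file()
    if newest is None:
        print("No scrape output found; run scraper.py first.")
    else:
        # This path is what would be passed to pdf_generator.py.
        print(newest)
```

Sorting by name rather than by modification time keeps the result stable even if files are copied or touched after the scrape.
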
## Output Files

### Generated Files
- **JSON Data**: `pokemon_tcg_products_YYYYMMDD_HHMMSS.json`
  - Raw scraped data in JSON format
  - Contains all product information

- **PDF Catalog**: `catalog_output/pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf`
  - Professional PDF catalog
  - Includes product images, details, and UPC-A barcodes

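The `YYYYMMDD_HHMMSS` suffix corresponds to the `strftime` format `%Y%m%d_%H%M%S`. As an illustration only (the function name `timestamped_name` is hypothetical, and the actual scripts may build their names differently), such filenames could be produced like this:

```python
from datetime import datetime
from typing import Optional


def timestamped_name(prefix: str, extension: str, now: Optional[datetime] = None) -> str:
    """Build '<prefix>_YYYYMMDD_HHMMSS.<extension>' from the given (or current) time."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.{extension}"


# A fixed datetime is passed here so the output is reproducible.
print(timestamped_name("pokemon_tcg_products", "json", datetime(2024, 12, 21, 14, 30, 25)))
# → pokemon_tcg_products_20241221_143025.json
```
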
### Output Directory Structure
```
pokemon-disco/
├── pokemon_tcg_products_YYYYMMDD_HHMMSS.json
├── catalog_output/
│   ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf
│   ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.md
│   ├── images/
│   │   ├── product_1_SKU123.jpg
│   │   ├── product_2_SKU456.jpg
│   │   └── placeholder.png
│   └── barcodes/
│       ├── barcode_SKU123.png
│       ├── barcode_SKU456.png
│       └── ...
```

## PDF Catalog Features

Each product in the PDF includes:
- **Product Image**: Downloaded from Dollar General or placeholder
- **Product Details Table**:
  - Title
  - Price
  - Stock Status
  - SKU (formatted as code)
  - Product URL
- **UPC-A Barcode**: Generated from SKU for inventory management

## Data Fields Extracted

For each Pokemon TCG product:
- `title`: Product name
- `price`: Current price
- `stock`: Availability status
- `sku`: Product SKU/item number
- `image_url`: Direct link to product image
- `url`: Link to product page

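To make the record shape concrete, here is a small sketch that checks one scraped record against the six field names listed above. The helper `missing_fields` is hypothetical, the value types are assumed to be strings, and every value in the example record is a placeholder rather than real scraped data.

```python
EXPECTED_FIELDS = ("title", "price", "stock", "sku", "image_url", "url")


def missing_fields(record: dict) -> list:
    """Return the expected field names that are absent or empty in one record."""
    return [name for name in EXPECTED_FIELDS if not record.get(name)]


# Placeholder record: all values are illustrative, not real product data.
example = {
    "title": "Example Pokemon TCG product",
    "price": "$0.00",
    "stock": "unknown",
    "sku": "00000000",
    "image_url": "https://example.com/image.jpg",
    "url": "https://example.com/product",
}

print(missing_fields(example))         # → []
print(missing_fields({"title": "x"}))  # → ['price', 'stock', 'sku', 'image_url', 'url']
```
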
## Troubleshooting

### Common Issues

1. **No products found**
   - Dollar General may have anti-bot protection
   - The script will automatically retry with Selenium
   - Website structure may have changed

2. **PDF generation fails**
   - Ensure pandoc is installed: `pandoc --version`
   - Try alternative LaTeX engines if available
   - Markdown file is still generated for manual conversion

3. **Image download failures**
   - Network connectivity issues
   - Placeholder images will be used automatically

4. **Browser/Selenium issues**
   - **Brave browser supported**: Configured to use Brave at `/usr/bin/brave`
   - **ChromeDriver compatibility**: May require version matching (Brave 146 vs ChromeDriver 114)
   - **Alternative browsers**: Chrome, Chromium, or Firefox with geckodriver
   - Script falls back to requests-only mode if Selenium fails

   **For Brave users**: If you see a ChromeDriver version mismatch:
   ```bash
   # Test browser integration
   python3 test_brave.py

   # Solutions for version mismatch:
   pip3 install --upgrade webdriver-manager
   # or manually install a compatible ChromeDriver
   ```

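Several of the issues above come down to an external program not being installed or not being on `PATH`. A quick preflight check can be sketched with the standard library. The function name `find_tools` and the list of executable names are assumptions for illustration; in particular, a ChromeDriver managed by webdriver-manager may not appear on `PATH` at all.

```python
import shutil
from typing import Dict, Optional, Sequence


def find_tools(names: Sequence[str] = ("pandoc", "brave", "chromedriver")) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```
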
### Debug Mode

For more detail, watch the console output during scraping. The scripts log:
- Which products are found and filtered
- Network request status
- File generation progress

## API Discovery Success 🎉

**Pokemon Discovery has identified Dollar General's internal API endpoint!**

- **Endpoint Found**: `https://dggo.dollargeneral.com/omni/api/v2/category/search/provider`
- **Method**: POST with JSON payload
- **Category ID**: `723960` (Pokemon products)
- **Response Format**: Complete product details, including the test product (SKU: `41936301`)
- **Status**: Documented and integrated; requires a Bearer authentication token

**Current Status**: Individual product extraction is working. Bulk scraping through the API becomes available once authentication is implemented.

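The shape of such a request can be sketched with the standard library. Only the endpoint URL, the POST method, and the category ID come from the findings above. The payload key names (`categoryId`, `storeNumber`), the `Bearer` header scheme, and the function name `build_category_request` are assumptions, and the real key names should be copied from the captured request. The sketch only builds the request object and never sends it.

```python
import json
import urllib.request

API_URL = "https://dggo.dollargeneral.com/omni/api/v2/category/search/provider"


def build_category_request(
    token: str, store_number: str, category_id: str = "723960"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for one category listing."""
    # ASSUMPTION: these payload keys are placeholders, not the documented format.
    payload = {"categoryId": category_id, "storeNumber": store_number}
    headers = {
        "Content-Type": "application/json",
        # ASSUMPTION: the token is sent as a standard Bearer Authorization header.
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


if __name__ == "__main__":
    request = build_category_request(token="YOUR_TOKEN", store_number="YOUR_STORE")
    print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it possible to inspect or log the exact payload before any authenticated call is attempted.
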
## Technical Details

### Scraping Strategy
1. **Primary Method**: Uses requests with browser-like headers
2. **Fallback Method**: Selenium with headless Chrome for dynamic content
3. **Product Filtering**: Only includes products matching Pokemon TCG keywords
4. **Rate Limiting**: 1-second delay between requests to be respectful

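The primary/fallback ordering and the delay between requests can be illustrated with a small, library-agnostic sketch. The names `fetch_with_fallback` and `fetch_all` are hypothetical, and each fetcher is any callable that takes a URL and returns page text (for example a requests-based function first and a Selenium-based one second). This is not the code in `scraper.py`.

```python
import time
from typing import Callable, List, Optional, Sequence

Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(url: str, fetchers: Sequence[Fetcher]) -> Optional[str]:
    """Try each fetcher in order and return the first non-empty result."""
    for fetch in fetchers:
        try:
            html = fetch(url)
        except Exception as exc:  # a blocked or failed strategy moves on to the next one
            print(f"{getattr(fetch, '__name__', 'fetcher')} failed: {exc}")
            continue
        if html:
            return html
    return None


def fetch_all(
    urls: Sequence[str], fetchers: Sequence[Fetcher], delay_seconds: float = 1.0
) -> List[Optional[str]]:
    """Fetch every URL, pausing between consecutive URLs to limit request rate."""
    results: List[Optional[str]] = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        results.append(fetch_with_fallback(url, fetchers))
    return results
```

Passing the strategies as a list keeps the order explicit and makes it easy to add or drop a fallback without touching the loop.
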
### Barcode Generation
- Converts SKUs to 11-digit numeric format
- Generates UPC-A barcodes with check digits
- High-quality PNG images suitable for printing

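The check-digit arithmetic behind a UPC-A code is short enough to show directly. The sketch below assumes the SKU is left-padded with zeros to reach 11 digits, which may differ from the conversion `scraper.py` actually performs, and the function name `sku_to_upc_a` is hypothetical. The check digit itself follows the standard UPC-A rule: three times the sum of the digits in odd positions, plus the sum of the digits in even positions, rounded up to the next multiple of 10.

```python
def sku_to_upc_a(sku: str) -> str:
    """Return a 12-digit UPC-A string: 11 zero-padded SKU digits plus a check digit."""
    digits = "".join(ch for ch in sku if ch in "0123456789")
    if not digits or len(digits) > 11:
        raise ValueError(f"SKU must contain 1 to 11 digits, got {sku!r}")
    body = digits.zfill(11)  # ASSUMPTION: pad on the left with zeros
    odd_sum = sum(int(d) for d in body[0::2])   # positions 1, 3, 5, ... (1-based)
    even_sum = sum(int(d) for d in body[1::2])  # positions 2, 4, 6, ...
    check_digit = (10 - (3 * odd_sum + even_sum) % 10) % 10
    return body + str(check_digit)


print(sku_to_upc_a("41936301"))  # → 000419363017
```

For the test SKU, the padded body is `00041936301`: the odd-position digits sum to 8, the even-position digits sum to 19, so the weighted total is 3 × 8 + 19 = 43 and the check digit is 7.
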
### PDF Generation
- Uses pandoc with LaTeX for professional formatting
- Includes table of contents
- Optimized for printing and digital viewing
- Images scaled appropriately for page layout

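If the Markdown file was produced but the PDF was not, the conversion can be rerun by hand. The sketch below builds a pandoc command with a table of contents and an explicit LaTeX engine. The choice of `xelatex` and the helper names are assumptions, since the engine and options used by `pdf_generator.py` are not stated here.

```python
import subprocess
from typing import List


def build_pandoc_command(markdown_path: str, pdf_path: str, pdf_engine: str = "xelatex") -> List[str]:
    """Return the pandoc argument list for a Markdown-to-PDF conversion with a TOC."""
    return [
        "pandoc",
        markdown_path,
        "-o", pdf_path,
        "--toc",                        # include a table of contents
        f"--pdf-engine={pdf_engine}",   # ASSUMPTION: swap for pdflatex or lualatex if needed
    ]


def render_pdf(markdown_path: str, pdf_path: str, pdf_engine: str = "xelatex") -> None:
    """Run pandoc and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_pandoc_command(markdown_path, pdf_path, pdf_engine), check=True)
```

Trying a different `pdf_engine` value is one way to act on the "alternative LaTeX engines" suggestion in the troubleshooting notes.
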
## Customization

### Modifying Product Filters
Edit the `is_pokemon_tcg_product()` method in `scraper.py` to change which products are included.

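As a rough idea of what a keyword filter of this kind can look like, here is a standalone sketch. The keyword lists and the function signature are illustrative assumptions and do not reproduce the method in `scraper.py`.

```python
# ASSUMPTION: both keyword tuples are examples; adjust them to match the catalog.
BRAND_KEYWORDS = ("pokemon", "pokémon")
TCG_KEYWORDS = ("tcg", "trading card", "booster")


def is_pokemon_tcg_product(title: str) -> bool:
    """Return True when the title mentions the brand and at least one TCG keyword."""
    text = title.casefold()
    has_brand = any(keyword in text for keyword in BRAND_KEYWORDS)
    has_tcg = any(keyword in text for keyword in TCG_KEYWORDS)
    return has_brand and has_tcg


print(is_pokemon_tcg_product("Pokemon TCG Booster Pack"))  # → True
print(is_pokemon_tcg_product("Pokemon plush toy"))         # → False
```

Requiring both a brand keyword and a TCG keyword keeps unrelated Pokemon merchandise out of the catalog, at the cost of missing products whose titles omit the card-game wording.
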
### Changing PDF Layout
Modify the markdown generation in `pdf_generator.py` or add custom pandoc templates.

### Adding New Data Fields
Extend the `extract_product_info()` method in `scraper.py` to capture additional product information.

## License

This tool is for educational and personal use. Please respect Dollar General's terms of service and robots.txt when using this scraper.

## Support

If you encounter issues:
1. Check the console output for error messages
2. Ensure all system requirements are installed
3. Verify internet connectivity
4. Check if the Dollar General website structure has changed

Generated files include timestamps for easy organization and version tracking.