Clean up: remove obsolete files, update docs and docstrings
Removed 20 files: old test scripts, debug tools, duplicate docs,
generated JSON, old PDF generator, launcher scripts.
Kept:
disco.py — main tool (scrape HAR + generate PDF)
scraper.py — reference site scraper (HTML + Selenium/Brave)
requirements.txt
*.har — browser capture with API data
Updated:
README.md — rewritten to reflect current tool and usage
.gitignore — simplified
scraper.py — module/class/method docstrings updated to clarify
this is a reference implementation, disco.py is primary
273
README.md
@@ -1,232 +1,129 @@
# Pokemon Discovery (pokemon-disco)

A comprehensive tool for discovering Pokemon Trading Card Game products from Dollar General's website and generating a professional PDF catalog with product images, details, and UPC-A barcodes.
Scrapes Pokemon TCG card pack and tin products from Dollar General and generates a PDF product catalog with images and UPC-A barcodes.

## Features
## How It Works

- **🔍 API Discovery**: Discovered Dollar General's internal product API via HAR analysis
- **📱 Product Extraction**: Successfully extracts Pokemon TCG product details (title, SKU, price, stock)
- **🏷️ Barcode Generation**: Creates scannable UPC-A barcodes for inventory management
- **📄 PDF Catalogs**: Professional PDF catalogs with images, details, and barcodes
- **🕰️ Unix-Friendly**: Timestamped filenames (`YYYYMMDD_HHMMSS`) for easy scripting
- **🌐 Brave Browser Support**: Configured for dynamic content scraping
- **🛡️ Anti-Bot Handling**: Multiple fallback strategies (requests → Selenium → individual products)
Dollar General's Pokemon category page loads products dynamically via an internal API. A browser HAR capture contains the API responses with all product data. `disco.py` extracts products from the HAR file, filters for card packs and tins, downloads product images, generates UPC-A barcodes, and produces a LaTeX-based PDF catalog.

### Pipeline

```
HAR file → Extract API responses → Filter packs/tins → Download images
         → Generate UPC-A barcodes → Compile PDF catalog (pdflatex)
```
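
The first pipeline stage, extracting API responses from a HAR capture, can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `disco.py`: the function name `iter_api_payloads` is hypothetical, and the only things relied on are the standard HAR layout (`log.entries[].request.url`, `log.entries[].response.content.text`, optional `encoding: "base64"`) and the endpoint path quoted in the API details of this README.

```python
import base64
import json

# Endpoint path as quoted in the README's API details section.
API_PATH = "/omni/api/v2/category/search/provider"


def iter_api_payloads(har, url_fragment=API_PATH):
    """Yield the parsed JSON body of each HAR entry whose request URL
    contains url_fragment. Hypothetical helper, not taken from disco.py."""
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url", "")
        if url_fragment not in url:
            continue
        content = entry.get("response", {}).get("content", {})
        text = content.get("text")
        if not text:
            continue  # the browser did not store a body for this entry
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8")
        try:
            yield json.loads(text)
        except json.JSONDecodeError:
            continue  # not JSON (error page, truncated body, ...)
```

With a capture on disk, usage would look like `payloads = list(iter_api_payloads(json.load(open("capture.har", encoding="utf-8"))))`, where `capture.har` is a placeholder filename.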

## Requirements

### System Requirements
- Python 3.7+
- pandoc (for PDF generation)
- Chrome/Chromium browser (for Selenium fallback)
- Python 3.10+
- pdflatex (via `texlive-basic`, `texlive-latex`, `texlive-latexextra`, `texlive-fontsrecommended`)
- Python packages: `requests`, `beautifulsoup4`, `python-barcode`, `Pillow`

### Python Dependencies
All dependencies are automatically installed via `requirements.txt`:
- requests
- beautifulsoup4
- selenium
- webdriver-manager
- python-barcode
- Pillow
- pandas
- lxml
### Install (Arch / CachyOS)

## Installation

1. **Clone/Download** this directory to your system

2. **Install pandoc** (required for PDF generation):
```bash
# Ubuntu/Debian
sudo apt install pandoc

# macOS
brew install pandoc

# Arch Linux
sudo pacman -S pandoc
```

3. **Install Python dependencies** (automatically done by the script):
```bash
cd pokemon-disco
pip3 install -r requirements.txt
```
```bash
sudo pacman -S texlive-basic texlive-latex texlive-latexextra texlive-fontsrecommended
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

## Usage

### Quick Start (Recommended)

Run the complete pipeline with one command:
### Full run (scrape + PDF)

```bash
cd pokemon-disco
python3 run_scraper.py
source venv/bin/activate
python disco.py
```

This will:
1. Check and install Python requirements
2. Scrape Pokemon TCG products from Dollar General
3. Generate a PDF catalog with images and barcodes
4. Create timestamped files for easy organization
### Scrape only (output JSON)

### Manual Usage

If you prefer to run components separately:

#### 1. Scrape Products
```bash
python3 scraper.py
python disco.py --scrape-only
```
This creates a JSON file like `pokemon_tcg_products_20241221_143025.json`
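
A filename in this `YYYYMMDD_HHMMSS` style can be produced with `datetime.strftime`. The sketch below is only an illustration of the naming scheme; the helper name `timestamped_name` is hypothetical and not taken from the project.

```python
from datetime import datetime


def timestamped_name(prefix, ext, now=None):
    """Build '<prefix>_YYYYMMDD_HHMMSS.<ext>' (hypothetical helper)."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.{ext}"


# Fixed timestamp so the result is reproducible.
print(timestamped_name("pokemon_tcg_products", "json",
                       datetime(2024, 12, 21, 14, 30, 25)))
```

Because the date comes first and every field is zero-padded, a plain lexical sort of these filenames is also a chronological sort.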

#### 2. Generate PDF Catalog
### PDF only (from existing JSON)

```bash
python3 pdf_generator.py pokemon_tcg_products_20241221_143025.json
python disco.py --pdf-only pokemon_tcg_products_YYYYMMDD_HHMMSS.json
```

## Output Files
## Output

### Generated Files
- **JSON Data**: `pokemon_tcg_products_YYYYMMDD_HHMMSS.json`
  - Raw scraped data in JSON format
  - Contains all product information

- **PDF Catalog**: `catalog_output/pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf`
  - Professional PDF catalog
  - Includes product images, details, and UPC-A barcodes

### Output Directory Structure
```
pokemon-disco/
├── pokemon_tcg_products_YYYYMMDD_HHMMSS.json
├── catalog_output/
│   ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf
│   ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.md
│   ├── images/
│   │   ├── product_1_SKU123.jpg
│   │   ├── product_2_SKU456.jpg
│   │   └── placeholder.png
│   └── barcodes/
│       ├── barcode_SKU123.png
│       ├── barcode_SKU456.png
│       └── ...
pokemon_tcg_products_YYYYMMDD_HHMMSS.json    Product data
catalog_output/
├── pokemon_catalog_YYYYMMDD_HHMMSS.pdf      PDF catalog
├── pokemon_catalog_YYYYMMDD_HHMMSS.tex      LaTeX source
├── images/                                  Product images (PNG)
└── barcodes/                                UPC-A barcodes (PNG)
```

## PDF Catalog Features
### PDF Layout

Each product in the PDF includes:
- **Product Image**: Downloaded from Dollar General or placeholder
- **Product Details Table**:
  - Title
  - Price
  - Stock Status
  - SKU (formatted as code)
  - Product URL
- **UPC-A Barcode**: Generated from SKU for inventory management
**Page 1 — Manifest:** table of all products with SKU, price, and stock count.

## Data Fields Extracted
**Product pages:**

For each Pokemon TCG product:
- `title`: Product name
- `price`: Current price
- `stock`: Availability status
- `sku`: Product SKU/item number
- `image_url`: Direct link to product image
- `url`: Link to product page
```
Product Name
Stock status Price
SKU: XXXXXXXX UPC: XXXXXXXXXXXX

## Troubleshooting
┌─────────────────────────────┐
│                             │
│        Product Image        │
│                             │
└─────────────────────────────┘

### Common Issues
┌─────────────────────────────┐
│        UPC-A Barcode        │
└─────────────────────────────┘
```

1. **No products found**
   - Dollar General may have anti-bot protection
   - The script will automatically retry with Selenium
   - Website structure may have changed
## Capturing a HAR File

2. **PDF generation fails**
   - Ensure pandoc is installed: `pandoc --version`
   - Try alternative LaTeX engines if available
   - Markdown file is still generated for manual conversion
The HAR file provides product data from Dollar General's internal API. To capture one:

3. **Image download failures**
   - Network connectivity issues
   - Placeholder images will be used automatically
1. Open your browser (Brave, Chrome, Firefox)
2. Open DevTools → **Network** tab
3. Visit `https://www.dollargeneral.com/c/toys/pokemon?q=`
4. Wait for products to load, toggle any filters you want
5. Right-click in the Network tab → **Save all as HAR**
6. Place the `.har` file in the project root

4. **Browser/Selenium issues**
   - **Brave browser supported**: Configured to use Brave at `/usr/bin/brave`
   - **ChromeDriver compatibility**: May require version matching (Brave 146 vs ChromeDriver 114)
   - **Alternative browsers**: Chrome, Chromium, or Firefox with geckodriver
   - Script falls back to requests-only mode if Selenium fails

**For Brave users**: If you see ChromeDriver version mismatch:
```bash
# Test browser integration
python test_brave.py

# Solutions for version mismatch:
pip install --upgrade webdriver-manager
# or manually install compatible ChromeDriver
```
`disco.py` looks for any `.har` file matching the default name pattern. Edit the `HAR_FILE` constant at the top of `disco.py` if your filename differs.

### Debug Mode
## Files

To see more detailed output, check the console output during scraping. The scripts provide detailed logging of:
- Which products are found and filtered
- Network request status
- File generation progress
| File | Purpose |
|------|---------|
| `disco.py` | Main tool — scrape, filter, generate PDF |
| `scraper.py` | Reference site scraper (HTML + Selenium/Brave) |
| `requirements.txt` | Python dependencies |
| `*.har` | Browser HAR capture with API data |

## API Discovery Success 🎉
## API Details (Reference)

**Pokemon Discovery has successfully discovered Dollar General's internal API endpoint!**
The product data comes from this internal API:

- **Endpoint Found**: `https://dggo.dollargeneral.com/omni/api/v2/category/search/provider`
- **Method**: POST with JSON payload
- **Category ID**: `723960` (Pokemon products)
- **Response Format**: Complete product details including your test product (SKU: `41936301`)
- **Status**: Documented and integrated, requires authentication token
```
POST https://dggo.dollargeneral.com/omni/api/v2/category/search/provider
Content-Type: application/json
Authorization: Bearer <session-token>

**Current Status**: Individual product extraction works perfectly. API bulk scraping available once authentication is implemented.
{
  "StoreNbr": 17506,
  "Id": 723960, // Pokemon category
  "PageSize": 24,
  "Filters": {
    "soldAtStore": true,
    "inStock": false // false = include out of stock
  }
}
```

## Technical Details
Response contains `ItemList.Items[]` with fields: `Description`, `UPC`, `Price`, `Image`, `AvailableQty`, `rootSV` (internal ID → SKU).
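
As a rough illustration of that response shape, the sketch below flattens one parsed response into plain product records. Only the field names (`ItemList`, `Items`, `Description`, `UPC`, `Price`, `Image`, `AvailableQty`, `rootSV`) come from the line above. The function name, the output keys, and the sample values are hypothetical, and the conversion from `rootSV` to a SKU is not shown because the README does not describe it.

```python
def products_from_payload(payload):
    """Flatten ItemList.Items[] into a list of dicts (hypothetical helper).

    Missing fields become None rather than raising, since the exact
    schema and value types are not documented here.
    """
    items = payload.get("ItemList", {}).get("Items", [])
    products = []
    for item in items:
        products.append({
            "title": item.get("Description"),
            "upc": item.get("UPC"),
            "price": item.get("Price"),
            "image_url": item.get("Image"),
            "available_qty": item.get("AvailableQty"),
            # Internal ID; the README says it maps to the SKU, but the
            # mapping itself is not specified, so it is passed through as-is.
            "root_sv": item.get("rootSV"),
        })
    return products


# Made-up sample values, for illustration only.
sample = {"ItemList": {"Items": [
    {"Description": "Example Booster Pack", "UPC": "000000000000",
     "Price": 4.99, "Image": "https://example.com/pack.png",
     "AvailableQty": 3, "rootSV": "12345678"},
]}}
print(products_from_payload(sample)[0]["title"])
```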

### Scraping Strategy
1. **Primary Method**: Uses requests with browser-like headers
2. **Fallback Method**: Selenium with headless Chrome for dynamic content
3. **Product Filtering**: Only includes products matching Pokemon TCG keywords
4. **Rate Limiting**: 1-second delay between requests to be respectful

### Barcode Generation
- Converts SKUs to 11-digit numeric format
- Generates UPC-A barcodes with check digits
- High-quality PNG images suitable for printing
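
The check-digit step in the list above follows the standard UPC-A rule: sum the digits in odd positions (1st, 3rd, ..., 11th), multiply by 3, add the digits in even positions, and choose the digit that brings the total to a multiple of 10. The sketch below implements that rule. How the project turns a SKU into 11 digits is not stated, so the zero-padding in `sku_to_upc_a` is an assumption, and both function names are hypothetical.

```python
def upc_a_check_digit(first11):
    """Return the UPC-A check digit for an 11-digit string."""
    if len(first11) != 11 or not (first11.isascii() and first11.isdigit()):
        raise ValueError("expected exactly 11 ASCII digits")
    digits = [int(c) for c in first11]
    odd_sum = sum(digits[0::2])   # positions 1, 3, ..., 11
    even_sum = sum(digits[1::2])  # positions 2, 4, ..., 10
    return (10 - (odd_sum * 3 + even_sum) % 10) % 10


def sku_to_upc_a(sku):
    """Assumption: a numeric SKU is left-padded with zeros to 11 digits."""
    body = sku.zfill(11)
    return body + str(upc_a_check_digit(body))


print(upc_a_check_digit("03600029145"))  # -> 2
```

The printed example is the widely cited UPC `036000291452`: odd positions sum to 14, even positions to 16, so 14 × 3 + 16 = 58 and the check digit is 2.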

### PDF Generation
- Uses pandoc with LaTeX for professional formatting
- Includes table of contents
- Optimized for printing and digital viewing
- Images scaled appropriately for page layout

## Customization

### Modifying Product Filters
Edit the `is_pokemon_tcg_product()` method in `scraper.py` to change which products are included.

### Changing PDF Layout
Modify the markdown generation in `pdf_generator.py` or add custom pandoc templates.

### Adding New Data Fields
Extend the `extract_product_info()` method in `scraper.py` to capture additional product information.

## License

This tool is for educational and personal use. Please respect Dollar General's terms of service and robots.txt when using this scraper.

## Support

If you encounter issues:
1. Check the console output for error messages
2. Ensure all system requirements are installed
3. Verify internet connectivity
4. Check if the Dollar General website structure has changed

Generated files include timestamps for easy organization and version tracking.
The bearer token is session-scoped and short-lived. `disco.py` sidesteps this by reading the API responses directly from a HAR capture.
