Clean up: remove obsolete files, update docs and docstrings

Removed 20 files: old test scripts, debug tools, duplicate docs,
generated JSON, old PDF generator, launcher scripts.

Kept:
  disco.py        — main tool (scrape HAR + generate PDF)
  scraper.py      — reference site scraper (HTML + Selenium/Brave)
  requirements.txt
  *.har           — browser capture with API data

Updated:
  README.md       — rewritten to reflect current tool and usage
  .gitignore      — simplified
  scraper.py      — module/class/method docstrings updated to clarify
                    that this is a reference implementation and disco.py
                    is the primary tool
2026-03-21 23:28:52 -07:00
parent 90661e1957
commit 0c7e139245
24 changed files with 115 additions and 3380 deletions

273
README.md

@@ -1,232 +1,129 @@
# Pokemon Discovery (pokemon-disco)
A comprehensive tool for discovering Pokemon Trading Card Game products from Dollar General's website and generating a professional PDF catalog with product images, details, and UPC-A barcodes.
Scrapes Pokemon TCG card pack and tin products from Dollar General and generates a PDF product catalog with images and UPC-A barcodes.
## Features
## How It Works
- **🔍 API Discovery**: Discovered Dollar General's internal product API via HAR analysis
- **📱 Product Extraction**: Successfully extracts Pokemon TCG product details (title, SKU, price, stock)
- **🏷️ Barcode Generation**: Creates scannable UPC-A barcodes for inventory management
- **📄 PDF Catalogs**: Professional PDF catalogs with images, details, and barcodes
- **🕰️ Unix-Friendly**: Timestamped filenames (`YYYYMMDD_HHMMSS`) for easy scripting
- **🌐 Brave Browser Support**: Configured for dynamic content scraping
- **🛡️ Anti-Bot Handling**: Multiple fallback strategies (requests → Selenium → individual products)
Dollar General's Pokemon category page loads products dynamically via an internal API. A browser HAR capture contains the API responses with all product data. `disco.py` extracts products from the HAR file, filters for card packs and tins, downloads product images, generates UPC-A barcodes, and produces a LaTeX-based PDF catalog.
### Pipeline
```
HAR file → Extract API responses → Filter packs/tins → Download images
→ Generate UPC-A barcodes → Compile PDF catalog (pdflatex)
```
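The first pipeline step can be sketched as follows. This is a minimal illustration that assumes the standard HAR 1.2 layout (`log.entries[]`, each with `request.url` and `response.content.text`); it is not the code in `disco.py`, and the function name `extract_api_responses` is hypothetical.

```python
import base64
import json


def extract_api_responses(har: dict, url_fragment: str) -> list[dict]:
    """Return the decoded JSON bodies of HAR entries whose request URL
    contains url_fragment. Entries without a JSON body are skipped."""
    bodies = []
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url", "")
        if url_fragment not in url:
            continue
        content = entry.get("response", {}).get("content", {})
        text = content.get("text")
        if not text:
            continue  # body was not captured for this entry
        if content.get("encoding") == "base64":
            # HAR allows response bodies to be stored base64-encoded
            text = base64.b64decode(text).decode("utf-8")
        try:
            bodies.append(json.loads(text))
        except json.JSONDecodeError:
            continue  # not JSON (HTML, script, image, ...)
    return bodies
```

A caller would load the capture with `json.load()` and pass a fragment of the API path, for example `"/category/search/provider"`.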
## Requirements
### System Requirements
- Python 3.7+
- pandoc (for PDF generation)
- Chrome/Chromium browser (for Selenium fallback)
- Python 3.10+
- pdflatex (via `texlive-basic`, `texlive-latex`, `texlive-latexextra`, `texlive-fontsrecommended`)
- Python packages: `requests`, `beautifulsoup4`, `python-barcode`, `Pillow`
### Python Dependencies
All dependencies are automatically installed via `requirements.txt`:
- requests
- beautifulsoup4
- selenium
- webdriver-manager
- python-barcode
- Pillow
- pandas
- lxml
### Install (Arch / CachyOS)
## Installation
1. **Clone/Download** this directory to your system
2. **Install pandoc** (required for PDF generation):
```bash
# Ubuntu/Debian
sudo apt install pandoc
# macOS
brew install pandoc
# Arch Linux
sudo pacman -S pandoc
```
3. **Install Python dependencies** (automatically done by the script):
```bash
cd pokemon-disco
pip3 install -r requirements.txt
```
```bash
sudo pacman -S texlive-basic texlive-latex texlive-latexextra texlive-fontsrecommended
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
## Usage
### Quick Start (Recommended)
Run the complete pipeline with one command:
### Full run (scrape + PDF)
```bash
cd pokemon-disco
python3 run_scraper.py
source venv/bin/activate
python disco.py
```
This will:
1. Check and install Python requirements
2. Scrape Pokemon TCG products from Dollar General
3. Generate a PDF catalog with images and barcodes
4. Create timestamped files for easy organization
### Scrape only (output JSON)
### Manual Usage
If you prefer to run components separately:
#### 1. Scrape Products
```bash
python3 scraper.py
python disco.py --scrape-only
```
This creates a JSON file like `pokemon_tcg_products_20241221_143025.json`
#### 2. Generate PDF Catalog
### PDF only (from existing JSON)
```bash
python3 pdf_generator.py pokemon_tcg_products_20241221_143025.json
python disco.py --pdf-only pokemon_tcg_products_YYYYMMDD_HHMMSS.json
```
## Output Files
## Output
### Generated Files
- **JSON Data**: `pokemon_tcg_products_YYYYMMDD_HHMMSS.json`
- Raw scraped data in JSON format
- Contains all product information
- **PDF Catalog**: `catalog_output/pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf`
- Professional PDF catalog
- Includes product images, details, and UPC-A barcodes
### Output Directory Structure
```
pokemon-disco/
├── pokemon_tcg_products_YYYYMMDD_HHMMSS.json
├── catalog_output/
├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf
│ ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.md
│ ├── images/
│ │ ├── product_1_SKU123.jpg
│ │ ├── product_2_SKU456.jpg
│ │ └── placeholder.png
│ └── barcodes/
│ ├── barcode_SKU123.png
│ ├── barcode_SKU456.png
│ └── ...
pokemon_tcg_products_YYYYMMDD_HHMMSS.json Product data
catalog_output/
├── pokemon_catalog_YYYYMMDD_HHMMSS.pdf PDF catalog
├── pokemon_catalog_YYYYMMDD_HHMMSS.tex LaTeX source
├── images/ Product images (PNG)
└── barcodes/ UPC-A barcodes (PNG)
```
## PDF Catalog Features
### PDF Layout
Each product in the PDF includes:
- **Product Image**: Downloaded from Dollar General or placeholder
- **Product Details Table**:
- Title
- Price
- Stock Status
- SKU (formatted as code)
- Product URL
- **UPC-A Barcode**: Generated from SKU for inventory management
**Page 1 — Manifest:** table of all products with SKU, price, and stock count.
## Data Fields Extracted
**Product pages:**
For each Pokemon TCG product:
- `title`: Product name
- `price`: Current price
- `stock`: Availability status
- `sku`: Product SKU/item number
- `image_url`: Direct link to product image
- `url`: Link to product page
```
Product Name
Stock status Price
SKU: XXXXXXXX UPC: XXXXXXXXXXXX
## Troubleshooting
┌─────────────────────────────┐
│ │
│ Product Image │
│ │
└─────────────────────────────┘
### Common Issues
┌─────────────────────────────┐
│ UPC-A Barcode │
└─────────────────────────────┘
```
1. **No products found**
- Dollar General may have anti-bot protection
- The script will automatically retry with Selenium
- Website structure may have changed
## Capturing a HAR File
2. **PDF generation fails**
- Ensure pandoc is installed: `pandoc --version`
- Try alternative LaTeX engines if available
- Markdown file is still generated for manual conversion
The HAR file provides product data from Dollar General's internal API. To capture one:
3. **Image download failures**
- Network connectivity issues
- Placeholder images will be used automatically
1. Open your browser (Brave, Chrome, Firefox)
2. Open DevTools → **Network** tab
3. Visit `https://www.dollargeneral.com/c/toys/pokemon?q=`
4. Wait for products to load, then toggle any filters you want
5. Right-click in the Network tab → **Save all as HAR**
6. Place the `.har` file in the project root
4. **Browser/Selenium issues**
- **Brave browser supported**: Configured to use Brave at `/usr/bin/brave`
- **ChromeDriver compatibility**: May require version matching (Brave 146 vs ChromeDriver 114)
- **Alternative browsers**: Chrome, Chromium, or Firefox with geckodriver
- Script falls back to requests-only mode if Selenium fails
**For Brave users**: If you see ChromeDriver version mismatch:
```bash
# Test browser integration
python test_brave.py
# Solutions for version mismatch:
pip install --upgrade webdriver-manager
# or manually install compatible ChromeDriver
```
`disco.py` looks for a `.har` file matching its default name pattern. If your filename differs, edit the `HAR_FILE` constant at the top of `disco.py`.
### Debug Mode
## Files
To see more detailed output, check the console output during scraping. The scripts provide detailed logging of:
- Which products are found and filtered
- Network request status
- File generation progress
| File | Purpose |
|------|---------|
| `disco.py` | Main tool — scrape, filter, generate PDF |
| `scraper.py` | Reference site scraper (HTML + Selenium/Brave) |
| `requirements.txt` | Python dependencies |
| `*.har` | Browser HAR capture with API data |
## API Discovery Success 🎉
## API Details (Reference)
**Pokemon Discovery has successfully discovered Dollar General's internal API endpoint!**
The product data comes from this internal API:
- **Endpoint Found**: `https://dggo.dollargeneral.com/omni/api/v2/category/search/provider`
- **Method**: POST with JSON payload
- **Category ID**: `723960` (Pokemon products)
- **Response Format**: Complete product details including your test product (SKU: `41936301`)
- **Status**: Documented and integrated, requires authentication token
```
POST https://dggo.dollargeneral.com/omni/api/v2/category/search/provider
Content-Type: application/json
Authorization: Bearer <session-token>
**Current Status**: Individual product extraction works perfectly. API bulk scraping available once authentication is implemented.
{
"StoreNbr": 17506,
"Id": 723960, // Pokemon category
"PageSize": 24,
"Filters": {
"soldAtStore": true,
"inStock": false // false = include out of stock
}
}
```
## Technical Details
Response contains `ItemList.Items[]` with fields: `Description`, `UPC`, `Price`, `Image`, `AvailableQty`, `rootSV` (internal ID → SKU).
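As an illustration of how those fields could map to product records, here is a hedged sketch. The field names come from the line above; the function name `items_to_products`, the output keys, and the keyword set used to pick packs and tins are assumptions, and the real filter in `disco.py` may differ.

```python
import re

# Assumed keywords for "card packs and tins"; the actual filter may differ.
PACK_TIN_WORDS = {"pack", "packs", "tin", "tins"}


def items_to_products(response: dict, words=PACK_TIN_WORDS) -> list[dict]:
    """Flatten ItemList.Items[] into simple product dicts, keeping only
    items whose description contains one of the given whole words."""
    products = []
    for item in response.get("ItemList", {}).get("Items", []):
        title = str(item.get("Description", ""))
        # Match whole words so that e.g. "Backpack" does not count as "pack"
        title_words = set(re.findall(r"[a-z]+", title.lower()))
        if not title_words & set(words):
            continue
        products.append({
            "title": title,
            "upc": str(item.get("UPC", "")),
            "price": item.get("Price"),
            "image_url": item.get("Image"),
            "stock": item.get("AvailableQty"),
            "sku": str(item.get("rootSV", "")),
        })
    return products
```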
### Scraping Strategy
1. **Primary Method**: Uses requests with browser-like headers
2. **Fallback Method**: Selenium with headless Chrome for dynamic content
3. **Product Filtering**: Only includes products matching Pokemon TCG keywords
4. **Rate Limiting**: 1-second delay between requests to be respectful
### Barcode Generation
- Converts SKUs to 11-digit numeric format
- Generates UPC-A barcodes with check digits
- High-quality PNG images suitable for printing
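The check-digit step follows the standard UPC-A rule: triple the sum of the digits in odd positions, add the sum of the digits in even positions, and take the amount needed to reach the next multiple of 10. The sketch below shows that rule; the zero-padding of a SKU to 11 digits is an assumption about what "converts SKUs to 11-digit numeric format" means, and both function names are hypothetical.

```python
def upc_a_check_digit(body: str) -> int:
    """Compute the UPC-A check digit for an 11-digit body."""
    if len(body) != 11 or not (body.isascii() and body.isdigit()):
        raise ValueError("UPC-A body must be exactly 11 ASCII digits")
    odd = sum(int(d) for d in body[0::2])   # positions 1, 3, ..., 11
    even = sum(int(d) for d in body[1::2])  # positions 2, 4, ..., 10
    return (10 - (3 * odd + even) % 10) % 10


def sku_to_upc_a(sku: str) -> str:
    """Assumed conversion: keep the digits, left-pad with zeros to 11,
    then append the check digit to get a 12-digit UPC-A string."""
    digits = "".join(ch for ch in sku if ch in "0123456789")
    if not digits or len(digits) > 11:
        raise ValueError("SKU must contain between 1 and 11 digits")
    body = digits.zfill(11)
    return body + str(upc_a_check_digit(body))
```

For the body `03600029145`, the weighted sum is 3 × 14 + 16 = 58, so the check digit is 2 and the full code is `036000291452`.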
### PDF Generation
- Uses pandoc with LaTeX for professional formatting
- Includes table of contents
- Optimized for printing and digital viewing
- Images scaled appropriately for page layout
## Customization
### Modifying Product Filters
Edit the `is_pokemon_tcg_product()` method in `scraper.py` to change which products are included.
### Changing PDF Layout
Modify the markdown generation in `pdf_generator.py` or add custom pandoc templates.
### Adding New Data Fields
Extend the `extract_product_info()` method in `scraper.py` to capture additional product information.
## License
This tool is for educational and personal use. Please respect Dollar General's terms of service and robots.txt when using this scraper.
## Support
If you encounter issues:
1. Check the console output for error messages
2. Ensure all system requirements are installed
3. Verify internet connectivity
4. Check if the Dollar General website structure has changed
Generated files include timestamps for easy organization and version tracking.
The bearer token is session-scoped and short-lived. `disco.py` sidesteps this by reading the API responses directly from a HAR capture.