pokemon-disco/README.md
pi-bot-01 0c7e139245 Clean up: remove obsolete files, update docs and docstrings
Removed 20 files: old test scripts, debug tools, duplicate docs,
generated JSON, old PDF generator, launcher scripts.

Kept:
  disco.py        — main tool (scrape HAR + generate PDF)
  scraper.py      — reference site scraper (HTML + Selenium/Brave)
  requirements.txt
  *.har           — browser capture with API data

Updated:
  README.md       — rewritten to reflect current tool and usage
  .gitignore      — simplified
  scraper.py      — module/class/method docstrings updated to clarify
                    this is a reference implementation, disco.py is primary
2026-03-21 23:28:52 -07:00

# Pokemon Discovery (pokemon-disco)
Scrapes Pokemon TCG card pack and tin products from Dollar General and generates a PDF product catalog with images and UPC-A barcodes.
## How It Works
Dollar General's Pokemon category page loads products dynamically via an internal API. A browser HAR capture contains the API responses with all product data. `disco.py` extracts products from the HAR file, filters for card packs and tins, downloads product images, generates UPC-A barcodes, and produces a LaTeX-based PDF catalog.
### Pipeline
```
HAR file → Extract API responses → Filter packs/tins → Download images
→ Generate UPC-A barcodes → Compile PDF catalog (pdflatex)
```
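The filter step could look roughly like the sketch below. The keyword lists and the function names are hypothetical and the real criteria in `disco.py` may differ; the only thing taken from this README is that each product carries a `Description` field.

```python
# Hypothetical keyword lists; the real criteria in disco.py may differ.
INCLUDE_KEYWORDS = ("booster", "pack", "tin")
EXCLUDE_KEYWORDS = ("backpack", "plush", "figure")


def is_pack_or_tin(item: dict) -> bool:
    """Return True if the item's description looks like a card pack or tin."""
    description = (item.get("Description") or "").lower()
    # Check exclusions first so "backpack" is not caught by the "pack" keyword.
    if any(word in description for word in EXCLUDE_KEYWORDS):
        return False
    return any(word in description for word in INCLUDE_KEYWORDS)


def filter_products(items: list) -> list:
    """Keep only the items that look like card packs or tins."""
    return [item for item in items if is_pack_or_tin(item)]
```

Plain substring matching is crude (for example, "tin" also matches "painting"), so a real filter would probably need word boundaries or category data.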
## Requirements
- Python 3.10+
- pdflatex (on Arch: `texlive-basic`, `texlive-latex`, `texlive-latexextra`, `texlive-fontsrecommended`)
- Python packages: `requests`, `beautifulsoup4`, `python-barcode`, `Pillow`
### Install (Arch / CachyOS)
```bash
sudo pacman -S texlive-basic texlive-latex texlive-latexextra texlive-fontsrecommended
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
## Usage
### Full run (scrape + PDF)
```bash
source venv/bin/activate
python disco.py
```
### Scrape only (output JSON)
```bash
python disco.py --scrape-only
```
### PDF only (from existing JSON)
```bash
python disco.py --pdf-only pokemon_tcg_products_YYYYMMDD_HHMMSS.json
```
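For orientation, the two flags above could be wired up with `argparse` along these lines. This is a sketch, not the parser in `disco.py`: the function name `parse_args` and the choice to make the flags mutually exclusive are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; with no flags, both stages are meant to run."""
    parser = argparse.ArgumentParser(prog="disco.py")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument(
        "--scrape-only",
        action="store_true",
        help="extract products and write JSON, skip the PDF",
    )
    mode.add_argument(
        "--pdf-only",
        metavar="JSON_FILE",
        help="build the PDF from an existing product JSON file",
    )
    return parser.parse_args(argv)
```

With this layout, `args.scrape_only` is a boolean and `args.pdf_only` is either `None` or the JSON path.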
## Output
```
pokemon_tcg_products_YYYYMMDD_HHMMSS.json     Product data
catalog_output/
├── pokemon_catalog_YYYYMMDD_HHMMSS.pdf       PDF catalog
├── pokemon_catalog_YYYYMMDD_HHMMSS.tex       LaTeX source
├── images/                                   Product images (PNG)
└── barcodes/                                 UPC-A barcodes (PNG)
```
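The `YYYYMMDD_HHMMSS` stamp in these names corresponds to the `strftime` format `%Y%m%d_%H%M%S`. A minimal sketch of building the paths follows; the helper name `output_paths` is hypothetical and `disco.py` may assemble them differently.

```python
from datetime import datetime
from pathlib import Path


def output_paths(now: datetime, out_dir: str = "catalog_output") -> dict:
    """Build the output locations listed above for one run."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    base = Path(out_dir)
    return {
        "json": Path(f"pokemon_tcg_products_{stamp}.json"),
        "pdf": base / f"pokemon_catalog_{stamp}.pdf",
        "tex": base / f"pokemon_catalog_{stamp}.tex",
        "images": base / "images",
        "barcodes": base / "barcodes",
    }
```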
### PDF Layout
**Page 1 — Manifest:** table of all products with SKU, price, and stock count.
**Product pages:**
```
Product Name
Stock status                Price
SKU: XXXXXXXX    UPC: XXXXXXXXXXXX
┌─────────────────────────────┐
│                             │
│        Product Image        │
│                             │
└─────────────────────────────┘
┌─────────────────────────────┐
│        UPC-A Barcode        │
└─────────────────────────────┘
```
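A UPC-A code is 12 digits, and the last digit is a check digit computed from the first 11. The sketch below shows the standard calculation, which is useful for rejecting malformed UPCs before a barcode is rendered. The function names are illustrative and not taken from `disco.py`, which relies on `python-barcode` for the rendering itself.

```python
def upc_a_check_digit(first_eleven: str) -> int:
    """Compute the UPC-A check digit for the first 11 digits of a code."""
    if len(first_eleven) != 11 or not (first_eleven.isascii() and first_eleven.isdigit()):
        raise ValueError("expected exactly 11 ASCII digits")
    digits = [int(ch) for ch in first_eleven]
    odd_sum = sum(digits[0::2])   # positions 1, 3, 5, ... (1-based)
    even_sum = sum(digits[1::2])  # positions 2, 4, 6, ... (1-based)
    return (10 - (3 * odd_sum + even_sum) % 10) % 10


def is_valid_upc_a(code: str) -> bool:
    """Return True if code is 12 ASCII digits with a correct check digit."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        return False
    return upc_a_check_digit(code[:11]) == int(code[11])
```

For example, `is_valid_upc_a("036000291452")` returns `True`, while changing the final digit makes it return `False`.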
## Capturing a HAR File
The HAR file provides product data from Dollar General's internal API. To capture one:
1. Open your browser (Brave, Chrome, or Firefox)
2. Open DevTools → **Network** tab
3. Visit `https://www.dollargeneral.com/c/toys/pokemon?q=`
4. Wait for the products to load, then toggle any filters you want
5. Right-click in the Network tab → **Save all as HAR**
6. Place the `.har` file in the project root
`disco.py` looks for any `.har` file matching the default name pattern. Edit the `HAR_FILE` constant at the top of `disco.py` if your filename differs.
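If you would rather not edit a constant for every capture, a lookup like the one below could pick the newest capture automatically. This is only a sketch: `find_har_file` is a hypothetical helper, and the `*.har` glob is an assumption, since the actual default name pattern lives in `disco.py`.

```python
from pathlib import Path
from typing import Optional


def find_har_file(root: str = ".", pattern: str = "*.har") -> Optional[Path]:
    """Return the most recently modified file matching pattern, or None."""
    candidates = sorted(
        Path(root).glob(pattern),
        key=lambda path: path.stat().st_mtime,
        reverse=True,  # newest first
    )
    return candidates[0] if candidates else None
```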
## Files
| File | Purpose |
|------|---------|
| `disco.py` | Main tool — scrape, filter, generate PDF |
| `scraper.py` | Reference site scraper (HTML + Selenium/Brave) |
| `requirements.txt` | Python dependencies |
| `*.har` | Browser HAR capture with API data |
## API Details (Reference)
The product data comes from this internal API:
```
POST https://dggo.dollargeneral.com/omni/api/v2/category/search/provider
Content-Type: application/json
Authorization: Bearer <session-token>

{
  "StoreNbr": 17506,
  "Id": 723960,          // Pokemon category
  "PageSize": 24,
  "Filters": {
    "soldAtStore": true,
    "inStock": false     // false = include out of stock
  }
}
```
The response contains `ItemList.Items[]` with fields: `Description`, `UPC`, `Price`, `Image`, `AvailableQty`, `rootSV` (internal ID → SKU).
The bearer token is session-scoped and short-lived. `disco.py` sidesteps this by reading the API responses directly from a HAR capture.
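To make the HAR approach concrete, here is a minimal sketch of pulling `ItemList.Items` out of a parsed HAR file. It relies on the standard HAR layout (`log.entries[]`, each with `request.url` and `response.content.text`, optionally base64-encoded) and assumes `ItemList` sits at the top level of each response body. The function name `extract_items` is hypothetical and `disco.py` may do this differently.

```python
import base64
import json

# Path of the internal API endpoint described above.
API_PATH = "/omni/api/v2/category/search/provider"


def extract_items(har: dict) -> list:
    """Collect ItemList.Items from every matching API response in a HAR dict."""
    items = []
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url", "")
        if API_PATH not in url:
            continue
        content = entry.get("response", {}).get("content", {})
        text = content.get("text")
        if not text:
            continue  # the body was not saved in the capture
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8")
        try:
            payload = json.loads(text)
        except json.JSONDecodeError:
            continue  # skip error pages and other non-JSON bodies
        if not isinstance(payload, dict):
            continue
        items.extend((payload.get("ItemList") or {}).get("Items") or [])
    return items
```

A HAR file is itself JSON, so it can be read with `json.load` and passed straight in. The same product can show up in more than one captured response, so a caller may want to de-duplicate, for example by `UPC`.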