
pi-bot-01 0c7e139245 Clean up: remove obsolete files, update docs and docstrings

Removed 20 files: old test scripts, debug tools, duplicate docs,
generated JSON, old PDF generator, launcher scripts.

Kept:
  disco.py        — main tool (scrape HAR + generate PDF)
  scraper.py      — reference site scraper (HTML + Selenium/Brave)
  requirements.txt
  *.har           — browser capture with API data

Updated:
  README.md       — rewritten to reflect current tool and usage
  .gitignore      — simplified
  scraper.py      — module/class/method docstrings updated to clarify
                    this is a reference implementation, disco.py is primary

2026-03-21 23:28:52 -07:00

3.9 KiB


Pokemon Discovery (pokemon-disco)

Scrapes Pokemon TCG card pack and tin products from Dollar General and generates a PDF product catalog with images and UPC-A barcodes.

How It Works

Dollar General's Pokemon category page loads products dynamically via an internal API. A browser HAR capture contains the API responses with all product data. disco.py extracts products from the HAR file, filters for card packs and tins, downloads product images, generates UPC-A barcodes, and produces a LaTeX-based PDF catalog.

Pipeline

HAR file → Extract API responses → Filter packs/tins → Download images
         → Generate UPC-A barcodes → Compile PDF catalog (pdflatex)
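The first pipeline stage can be sketched as below. This is a minimal illustration, not the code in disco.py: the function name `extract_items` is hypothetical, the URL fragment is taken from the API Details section, and the `log.entries[].response.content` layout follows the standard HAR 1.2 structure. It assumes each matching response body is a JSON object with an `ItemList.Items` array.

```python
import base64
import json

# URL fragment of the internal API (see "API Details"); used to pick HAR entries.
API_PATH = "/omni/api/v2/category/search/provider"


def extract_items(har: dict) -> list[dict]:
    """Collect ItemList.Items from every matching API response in a HAR dict."""
    items: list[dict] = []
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url", "")
        if API_PATH not in url:
            continue  # not a product API call
        content = entry.get("response", {}).get("content", {})
        text = content.get("text")
        if not text:
            continue  # body was not captured
        if content.get("encoding") == "base64":
            # Browsers may store response bodies base64-encoded.
            text = base64.b64decode(text).decode("utf-8")
        try:
            body = json.loads(text)
        except json.JSONDecodeError:
            continue  # skip bodies that are not JSON
        if isinstance(body, dict):
            items.extend((body.get("ItemList") or {}).get("Items") or [])
    return items
```

A HAR file is itself JSON, so it can be loaded with `json.load` and passed straight to `extract_items`. If the capture contains several pages of results, the same product may appear more than once and would need de-duplicating (for example by UPC).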

Requirements

Python 3.10+
pdflatex (on Arch: texlive-basic, texlive-latex, texlive-latexextra, texlive-fontsrecommended)
Python packages: requests, beautifulsoup4, python-barcode, Pillow

Install (Arch / CachyOS)

sudo pacman -S texlive-basic texlive-latex texlive-latexextra texlive-fontsrecommended
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Usage

Full run (scrape + PDF)

source venv/bin/activate
python disco.py

Scrape only (output JSON)

python disco.py --scrape-only

PDF only (from existing JSON)

python disco.py --pdf-only pokemon_tcg_products_YYYYMMDD_HHMMSS.json
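The three modes above map naturally onto two optional flags. The following is a sketch of how such a command line could be parsed with `argparse`; it is not taken from disco.py. The helper name `build_parser` is hypothetical, and treating the two flags as mutually exclusive is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the --scrape-only / --pdf-only modes (illustrative)."""
    parser = argparse.ArgumentParser(
        prog="disco.py",
        description="Scrape a HAR capture and generate a PDF catalog.",
    )
    # Assumption: the two modes cannot be combined.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument(
        "--scrape-only",
        action="store_true",
        help="extract products to JSON and stop",
    )
    mode.add_argument(
        "--pdf-only",
        metavar="JSON_FILE",
        help="skip scraping and build the PDF from an existing JSON file",
    )
    return parser
```

With no flags, both `scrape_only` and `pdf_only` stay at their defaults (`False` and `None`), which corresponds to the full run.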

Output

pokemon_tcg_products_YYYYMMDD_HHMMSS.json    Product data
catalog_output/
├── pokemon_catalog_YYYYMMDD_HHMMSS.pdf      PDF catalog
├── pokemon_catalog_YYYYMMDD_HHMMSS.tex      LaTeX source
├── images/                                   Product images (PNG)
└── barcodes/                                 UPC-A barcodes (PNG)
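The `YYYYMMDD_HHMMSS` suffix corresponds to the `strftime` format `%Y%m%d_%H%M%S`. As a sketch of how the paths above could be derived from a single timestamp (the function name `output_paths` and its dictionary keys are hypothetical, not disco.py's actual API):

```python
from datetime import datetime
from pathlib import Path


def output_paths(now: datetime, out_dir: str = "catalog_output") -> dict[str, Path]:
    """Derive all output locations from one timestamp so the names stay in sync."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    base = Path(out_dir)
    return {
        "json": Path(f"pokemon_tcg_products_{stamp}.json"),
        "pdf": base / f"pokemon_catalog_{stamp}.pdf",
        "tex": base / f"pokemon_catalog_{stamp}.tex",
        "images": base / "images",
        "barcodes": base / "barcodes",
    }
```

Passing the timestamp in, rather than calling `datetime.now()` inside the function, keeps the JSON, PDF, and LaTeX names identical for one run and makes the function easy to test.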

PDF Layout

Page 1 — Manifest: table of all products with SKU, price, and stock count.

Product pages:

Product Name
Stock status                              Price
SKU: XXXXXXXX                   UPC: XXXXXXXXXXXX

┌─────────────────────────────┐
│                             │
│       Product Image         │
│                             │
└─────────────────────────────┘

┌─────────────────────────────┐
│      UPC-A Barcode          │
└─────────────────────────────┘
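A UPC-A code is 12 digits, and the last digit is a checksum over the first 11. Validating it before rendering catches truncated or mistyped codes in the source data. The sketch below implements the standard UPC-A check-digit rule in plain Python; the function names are hypothetical, and whether disco.py performs this check itself is not stated here.

```python
def upc_a_check_digit(first11: str) -> int:
    """Return the UPC-A check digit for the first 11 digits of a code."""
    if len(first11) != 11 or not (first11.isascii() and first11.isdigit()):
        raise ValueError("expected exactly 11 ASCII digits")
    # Positions are counted from 1: odd positions are weighted 3, even positions 1.
    odd_sum = sum(int(d) for d in first11[0::2])
    even_sum = sum(int(d) for d in first11[1::2])
    return (10 - (odd_sum * 3 + even_sum) % 10) % 10


def is_valid_upc_a(code: str) -> bool:
    """True if code is 12 ASCII digits and its last digit matches the checksum."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        return False
    return upc_a_check_digit(code[:11]) == int(code[11])
```

For example, `is_valid_upc_a("036000291452")` is true, while changing the final digit to `3` makes it false.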

Capturing a HAR File

The HAR file provides product data from Dollar General's internal API. To capture one:

1. Open your browser (Brave, Chrome, Firefox)
2. Open DevTools → Network tab
3. Visit https://www.dollargeneral.com/c/toys/pokemon?q=
4. Wait for products to load, toggle any filters you want
5. Right-click in the Network tab → Save all as HAR
6. Place the .har file in the project root

disco.py looks for a .har file matching its default name pattern. If your filename differs, edit the HAR_FILE constant at the top of disco.py.

Files

File	Purpose
`disco.py`	Main tool — scrape, filter, generate PDF
`scraper.py`	Reference site scraper (HTML + Selenium/Brave)
`requirements.txt`	Python dependencies
`*.har`	Browser HAR capture with API data

API Details (Reference)

The product data comes from this internal API:

POST https://dggo.dollargeneral.com/omni/api/v2/category/search/provider
Content-Type: application/json
Authorization: Bearer <session-token>

{
  "StoreNbr": 17506,
  "Id": 723960,          // Pokemon category
  "PageSize": 24,
  "Filters": {
    "soldAtStore": true,
    "inStock": false      // false = include out of stock
  }
}

The response contains ItemList.Items[], whose entries carry the fields Description, UPC, Price, Image, AvailableQty, and rootSV (an internal ID, used as the SKU).

The bearer token is session-scoped and short-lived. disco.py sidesteps this by reading the API responses directly from a HAR capture.
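To make the field mapping concrete, here is a sketch that converts one `Items[]` entry into a typed record and applies a simple pack/tin filter. The `Product` class, `to_product`, `is_pack_or_tin`, and the keyword rule are hypothetical and are not the logic in disco.py. The value types (numeric or string price, integer stock count) are assumptions about the API response.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    name: str
    upc: str
    price: float
    image_url: str
    stock: int
    sku: str


def to_product(item: dict) -> Product:
    """Map the API field names onto a Product (types are assumed, see above)."""
    return Product(
        name=str(item.get("Description", "")).strip(),
        upc=str(item.get("UPC", "")).strip(),
        price=float(item.get("Price") or 0),
        image_url=str(item.get("Image", "")),
        stock=int(item.get("AvailableQty") or 0),
        sku=str(item.get("rootSV", "")),  # rootSV is the internal ID used as SKU
    )


# Hypothetical filter: whole-word match so "Backpack" does not count as a pack.
_PACK_OR_TIN = re.compile(r"\b(packs?|tins?)\b", re.IGNORECASE)


def is_pack_or_tin(product: Product) -> bool:
    """True if the product name mentions a pack or a tin as a separate word."""
    return bool(_PACK_OR_TIN.search(product.name))
```

A list of raw items can then be reduced with `[p for p in map(to_product, items) if is_pack_or_tin(p)]`.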
