Removed 20 files: old test scripts, debug tools, duplicate docs,
generated JSON, old PDF generator, launcher scripts.
Kept:
disco.py — main tool (scrape HAR + generate PDF)
scraper.py — reference site scraper (HTML + Selenium/Brave)
requirements.txt
*.har — browser capture with API data
Updated:
README.md — rewritten to reflect current tool and usage
.gitignore — simplified
scraper.py — module/class/method docstrings updated to clarify
this is a reference implementation, disco.py is primary
130 lines
3.9 KiB
Markdown
# Pokemon Discovery (pokemon-disco)

Scrapes Pokemon TCG card pack and tin products from Dollar General and generates a PDF product catalog with images and UPC-A barcodes.

## How It Works

Dollar General's Pokemon category page loads products dynamically via an internal API. A browser HAR capture contains the API responses with all product data. `disco.py` extracts products from the HAR file, filters for card packs and tins, downloads product images, generates UPC-A barcodes, and produces a LaTeX-based PDF catalog.

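As a rough illustration of the extraction step, the sketch below reads a HAR file (standard HAR layout: `log.entries[].request.url` and `log.entries[].response.content`) and yields the JSON bodies of responses from the category search endpoint. The names `API_PATH`, `load_har`, and `iter_api_payloads` are hypothetical and are not taken from `disco.py`; the actual implementation may differ.

```python
import base64
import json

# Assumption: entries of interest are identified by the endpoint path
# documented under "API Details (Reference)".
API_PATH = "/omni/api/v2/category/search/provider"


def load_har(path):
    """Load a HAR capture from disk (utf-8-sig tolerates a leading BOM)."""
    with open(path, encoding="utf-8-sig") as fh:
        return json.load(fh)


def iter_api_payloads(har):
    """Yield parsed JSON bodies of responses whose request URL contains API_PATH."""
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url", "")
        if API_PATH not in url:
            continue
        content = entry.get("response", {}).get("content", {})
        text = content.get("text")
        if not text:
            continue  # the browser did not capture a body for this entry
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8")
        try:
            yield json.loads(text)
        except json.JSONDecodeError:
            continue  # skip bodies that are not valid JSON
```

Typical use would be `payloads = list(iter_api_payloads(load_har("capture.har")))`, where `capture.har` is a placeholder filename.
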
### Pipeline

```
HAR file → Extract API responses → Filter packs/tins → Download images
         → Generate UPC-A barcodes → Compile PDF catalog (pdflatex)
```

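The "Filter packs/tins" stage could look something like the following. This is only a sketch: the keyword set and the function names are hypothetical, and the actual rules in `disco.py` may be different. It matches whole words so that, for example, "backpack" does not count as "pack".

```python
import re

# Hypothetical keyword set; the real filter in disco.py may use other rules.
PACK_TIN_WORDS = {"booster", "pack", "packs", "tin", "tins"}


def is_pack_or_tin(description):
    """Return True if the description contains a pack/tin keyword as a whole word."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    return bool(words & PACK_TIN_WORDS)


def filter_products(items):
    """Keep only items whose `Description` field looks like a card pack or tin."""
    return [item for item in items if is_pack_or_tin(item.get("Description", ""))]
```
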
## Requirements

- Python 3.10+
- pdflatex (TeX Live; on Arch: `texlive-basic`, `texlive-latex`, `texlive-latexextra`, `texlive-fontsrecommended`)
- Python packages: `requests`, `beautifulsoup4`, `python-barcode`, `Pillow`

### Install (Arch / CachyOS)

```bash
sudo pacman -S texlive-basic texlive-latex texlive-latexextra texlive-fontsrecommended
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

## Usage

### Full run (scrape + PDF)

```bash
source venv/bin/activate
python disco.py
```

### Scrape only (output JSON)

```bash
python disco.py --scrape-only
```

### PDF only (from existing JSON)

```bash
python disco.py --pdf-only pokemon_tcg_products_YYYYMMDD_HHMMSS.json
```

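For readers curious how these three modes might be wired up, here is a minimal `argparse` sketch. Only the flag names `--scrape-only` and `--pdf-only` come from this README; `build_parser`, the help strings, and the choice to make the flags mutually exclusive are assumptions, not a description of the parser in `disco.py`.

```python
import argparse


def build_parser():
    """Build a CLI parser with the two optional modes described above."""
    parser = argparse.ArgumentParser(
        prog="disco.py",
        description="Extract products from a HAR capture and build a PDF catalog.",
    )
    # Assumption: the two modes cannot be combined.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument(
        "--scrape-only",
        action="store_true",
        help="write the product JSON and stop before PDF generation",
    )
    mode.add_argument(
        "--pdf-only",
        metavar="JSON_FILE",
        help="skip extraction and build the PDF from an existing JSON file",
    )
    return parser
```

With no flags, both options stay at their defaults (`scrape_only=False`, `pdf_only=None`), which would correspond to the full run.
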
## Output

```
pokemon_tcg_products_YYYYMMDD_HHMMSS.json      Product data
catalog_output/
├── pokemon_catalog_YYYYMMDD_HHMMSS.pdf        PDF catalog
├── pokemon_catalog_YYYYMMDD_HHMMSS.tex        LaTeX source
├── images/                                    Product images (PNG)
└── barcodes/                                  UPC-A barcodes (PNG)
```

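The `YYYYMMDD_HHMMSS` suffix in the names above corresponds to the `strftime` format `%Y%m%d_%H%M%S`. The sketch below shows one way such paths could be derived from a single timestamp; the helper `output_paths` and its dictionary keys are hypothetical, and only the file and directory names are taken from the listing above.

```python
from datetime import datetime
from pathlib import Path


def output_paths(now=None, root=Path(".")):
    """Return the output locations for one run, all sharing the same timestamp."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    out_dir = root / "catalog_output"
    return {
        "json": root / f"pokemon_tcg_products_{stamp}.json",
        "pdf": out_dir / f"pokemon_catalog_{stamp}.pdf",
        "tex": out_dir / f"pokemon_catalog_{stamp}.tex",
        "images": out_dir / "images",
        "barcodes": out_dir / "barcodes",
    }
```

Using one timestamp for the JSON, `.tex`, and `.pdf` files keeps the outputs of a run easy to match up.
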
### PDF Layout

**Page 1 (Manifest):** table of all products with SKU, price, and stock count.

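Product names often contain characters such as `&` or `%` that are special in LaTeX, so any text placed in the manifest table has to be escaped first. The following is a sketch of how a manifest row might be produced; `latex_escape`, `manifest_row`, and the column order (name, SKU, price, stock) are assumptions for illustration, not the code in `disco.py`.

```python
# Map each LaTeX special character to its escaped form.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def latex_escape(text):
    """Escape LaTeX special characters one at a time (avoids double escaping)."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def manifest_row(name, sku, price, stock):
    """Format one tabular row: name & SKU & price & stock, ending with a LaTeX row break."""
    return f"{latex_escape(name)} & {latex_escape(sku)} & \\${price:.2f} & {stock} \\\\"
```
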
**Product pages:**

```
Product Name
Stock status              Price
SKU: XXXXXXXX UPC: XXXXXXXXXXXX

┌─────────────────────────────┐
│                             │
│        Product Image        │
│                             │
└─────────────────────────────┘

┌─────────────────────────────┐
│        UPC-A Barcode        │
└─────────────────────────────┘
```

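A UPC-A code has 12 digits, the last of which is a check digit computed from the first 11. Checking it before rendering a barcode can catch truncated or malformed UPC values. The sketch below implements the standard check-digit rule in plain Python; the function names are hypothetical, and barcode rendering itself (for which the requirements list `python-barcode`) is not shown.

```python
def upc_a_check_digit(first11):
    """Compute the UPC-A check digit from the first 11 digits.

    Digits in odd positions (1st, 3rd, ...) are weighted 3, even positions 1.
    """
    if len(first11) != 11 or not (first11.isascii() and first11.isdigit()):
        raise ValueError("expected exactly 11 ASCII digits")
    digits = [int(ch) for ch in first11]
    total = 3 * sum(digits[0::2]) + sum(digits[1::2])
    return (10 - total % 10) % 10


def is_valid_upc_a(code):
    """Return True if `code` is 12 digits and its last digit is the correct check digit."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        return False
    return upc_a_check_digit(code[:11]) == int(code[-1])
```

For example, `upc_a_check_digit("03600029145")` returns `2`, so `036000291452` is a valid UPC-A code.
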
## Capturing a HAR File

The HAR file provides product data from Dollar General's internal API. To capture one:

1. Open your browser (Brave, Chrome, Firefox)
2. Open DevTools → **Network** tab
3. Visit `https://www.dollargeneral.com/c/toys/pokemon?q=`
4. Wait for products to load, then toggle any filters you want
5. Right-click in the Network tab → **Save all as HAR**
6. Place the `.har` file in the project root

`disco.py` looks for a `.har` file matching its default name pattern. If your filename differs, edit the `HAR_FILE` constant at the top of `disco.py`.

## Files

| File | Purpose |
|------|---------|
| `disco.py` | Main tool: scrape, filter, generate PDF |
| `scraper.py` | Reference site scraper (HTML + Selenium/Brave) |
| `requirements.txt` | Python dependencies |
| `*.har` | Browser HAR capture with API data |

## API Details (Reference)

The product data comes from this internal API:

```
POST https://dggo.dollargeneral.com/omni/api/v2/category/search/provider
Content-Type: application/json
Authorization: Bearer <session-token>

{
  "StoreNbr": 17506,
  "Id": 723960,              // Pokemon category
  "PageSize": 24,
  "Filters": {
    "soldAtStore": true,
    "inStock": false         // false = include out of stock
  }
}
```

Response contains `ItemList.Items[]` with fields: `Description`, `UPC`, `Price`, `Image`, `AvailableQty`, `rootSV` (internal ID → SKU).

The bearer token is session-scoped and short-lived. `disco.py` sidesteps this by reading the API responses directly from a HAR capture.

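To make the response shape concrete, the sketch below maps one parsed response body onto a small record type using the fields listed above. The `Product` dataclass, the `parse_products` helper, and the value types (for example, `Price` as a number and `UPC` as a string) are assumptions for illustration; the real payload and the handling in `disco.py` may differ.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str       # from `Description`
    upc: str        # from `UPC`
    price: float    # from `Price`
    image_url: str  # from `Image`
    stock: int      # from `AvailableQty`
    sku: str        # from `rootSV`


def parse_products(payload):
    """Convert one API response body into a list of Product records."""
    items = payload.get("ItemList", {}).get("Items", [])
    products = []
    for item in items:
        products.append(
            Product(
                name=item.get("Description", ""),
                upc=str(item.get("UPC", "")),
                price=float(item.get("Price") or 0),
                image_url=item.get("Image", ""),
                stock=int(item.get("AvailableQty") or 0),
                sku=str(item.get("rootSV", "")),
            )
        )
    return products
```

Missing or null fields fall back to empty strings or zero here, which keeps a single incomplete item from aborting the whole run.
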