# Quick Start Guide

## Simple Usage (Recommended)

1. **Make sure you're in the project directory:**
   ```bash
   cd pokemon-disco
   ```

2. **Run the complete scraper and PDF generator:**
   ```bash
   ./run.sh
   ```

This single command will:
- Set up the Python virtual environment
- Install all required packages
- Scrape Pokemon TCG products from Dollar General
- Generate a professional PDF catalog with barcodes
- Create timestamped files for easy organization

## What You'll Get

### Generated Files:
- **`pokemon_tcg_products_YYYYMMDD_HHMMSS.json`** - Raw data in JSON format
- **`catalog_output/pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf`** - Professional PDF catalog

### PDF Catalog Contents:
- Product images (downloaded automatically)
- Product details (title, price, stock, SKU)
- UPC-A barcodes for each product (generated from SKU)
- Table of contents for easy navigation
- Professional formatting suitable for printing

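The catalog's UPC-A barcodes are described as generated from the SKU, but the exact mapping is not shown in this guide. The sketch below assumes one plausible mapping (keep the SKU's digits, zero-pad to 11, append the standard UPC-A check digit). `upc_a_from_sku` is a hypothetical helper for illustration, not the project's function. Only the check-digit rule itself is standard.

```python
def upc_a_check_digit(first11: str) -> int:
    """Return the standard UPC-A check digit for the first 11 digits."""
    if len(first11) != 11 or not (first11.isascii() and first11.isdigit()):
        raise ValueError("expected exactly 11 ASCII digits")
    digits = [int(c) for c in first11]
    odd_sum = sum(digits[0::2])   # 1st, 3rd, ..., 11th digit
    even_sum = sum(digits[1::2])  # 2nd, 4th, ..., 10th digit
    return (10 - (3 * odd_sum + even_sum) % 10) % 10


def upc_a_from_sku(sku: str) -> str:
    """ASSUMED mapping: strip non-digits, zero-pad to 11, add check digit."""
    digits = "".join(c for c in sku if c.isascii() and c.isdigit())
    if not digits or len(digits) > 11:
        raise ValueError("SKU must contain between 1 and 11 digits")
    body = digits.zfill(11)
    return body + str(upc_a_check_digit(body))


if __name__ == "__main__":
    print(upc_a_from_sku("SKU123456"))
```

If the project derives its barcode digits differently, only `upc_a_from_sku` would need to change; the check-digit function applies to any 11-digit UPC-A body.
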
## Alternative Commands

If you prefer more control:

```bash
# Activate virtual environment first
source venv/bin/activate

# Run only the scraper
python scraper.py

# Run only the PDF generator (after scraping)
python pdf_generator.py pokemon_tcg_products_YYYYMMDD_HHMMSS.json

# Run everything (installs requirements automatically)
python run_scraper.py
```

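When running the PDF generator on its own, you need the name of the most recent scraped JSON file. Because the timestamp sorts lexicographically, the newest file is the last one in name order. A minimal sketch, assuming the JSON files sit in the current directory as described in this guide; `latest_products_json` is a hypothetical helper, not part of the project:

```python
from pathlib import Path
from typing import Optional


def latest_products_json(directory: str = ".") -> Optional[Path]:
    """Return the newest pokemon_tcg_products_*.json by name, or None."""
    candidates = sorted(
        Path(directory).glob("pokemon_tcg_products_*.json"),
        key=lambda p: p.name,  # YYYYMMDD_HHMMSS sorts chronologically
    )
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    newest = latest_products_json()
    if newest is None:
        print("No scraped JSON found; run scraper.py first.")
    else:
        print(f"python pdf_generator.py {newest.name}")
```
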
## Output Location

All generated files will be in:
- JSON data: Current directory
- PDF catalog: `catalog_output/` directory
- Product images: `catalog_output/images/`
- Barcode images: `catalog_output/barcodes/`

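To check quickly that a run produced something in each of these locations, a small inventory like the one below may help. It is only a sketch built on the directory layout listed above; `summarize_outputs` is a hypothetical name, and the glob patterns assume the file names shown in this guide.

```python
from pathlib import Path
from typing import Dict


def summarize_outputs(root: str = ".") -> Dict[str, int]:
    """Count generated files in each documented output location."""
    base = Path(root)
    out = base / "catalog_output"
    return {
        "json": len(list(base.glob("pokemon_tcg_products_*.json"))),
        "pdf": len(list(out.glob("pokemon_tcg_catalog_*.pdf"))),
        "images": len([p for p in (out / "images").glob("*") if p.is_file()]),
        "barcodes": len([p for p in (out / "barcodes").glob("*") if p.is_file()]),
    }


if __name__ == "__main__":
    for kind, count in summarize_outputs().items():
        print(f"{kind}: {count}")
```
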
## Requirements

- Python 3.7+
- pandoc (for PDF generation)
- Internet connection (for scraping)

The script automatically handles Python dependencies via a virtual environment.

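The two local requirements can be checked before a run. This is a minimal sketch using only the standard library; `missing_requirements` is a hypothetical helper, not something the project ships. Its parameters are injectable so the check can be exercised without changing the real environment.

```python
import shutil
import sys
from typing import Callable, List, Optional, Tuple

MIN_PYTHON = (3, 7)  # minimum version stated in this guide


def missing_requirements(
    python_version: Tuple[int, int] = sys.version_info[:2],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return human-readable problems; an empty list means all good."""
    problems = []
    if tuple(python_version) < MIN_PYTHON:
        found = ".".join(str(part) for part in python_version)
        problems.append(f"Python 3.7+ required, found {found}")
    if which("pandoc") is None:
        problems.append("pandoc not found on PATH")
    return problems


if __name__ == "__main__":
    issues = missing_requirements()
    print("\n".join(issues) if issues else "All local requirements satisfied.")
```
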
## Troubleshooting

If you encounter issues:

1. **Permission denied:** Make sure the script is executable:
   ```bash
   chmod +x run.sh
   ```

2. **Pandoc not found:** Install pandoc for your system:
   ```bash
   # Ubuntu/Debian
   sudo apt install pandoc

   # Arch Linux
   sudo pacman -S pandoc

   # macOS
   brew install pandoc
   ```

3. **No products found:** The website may have anti-bot protection, or its page structure may have changed. The script includes fallback mechanisms.

4. **PDF generation fails:** The Markdown file is still generated, so you can view it directly or convert it manually.

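If the PDF step fails but the Markdown file exists, a manual conversion could look like the sketch below. It assumes `pandoc` is on `PATH` and uses only pandoc's basic `input -o output` form. Any extra options the project itself passes to pandoc are not documented here and are not reproduced. `pandoc_pdf_command` and `convert_markdown` are hypothetical names.

```python
import subprocess
from pathlib import Path
from typing import List


def pandoc_pdf_command(markdown_path: str) -> List[str]:
    """Build a basic pandoc command that writes a PDF next to the source."""
    md = Path(markdown_path)
    return ["pandoc", str(md), "-o", str(md.with_suffix(".pdf"))]


def convert_markdown(markdown_path: str) -> bool:
    """Run pandoc and report failure instead of raising."""
    try:
        result = subprocess.run(
            pandoc_pdf_command(markdown_path),
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        print("pandoc is not installed or not on PATH")
        return False
    if result.returncode != 0:
        print(result.stderr.strip())
        return False
    return True
```

Note that pandoc produces PDFs through a separate PDF engine (LaTeX by default), so a failure here may point to a missing engine rather than to pandoc itself; the error text printed by `convert_markdown` should say which.
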
## File Naming Convention

All output files include Unix-friendly timestamps:
- Format: `YYYYMMDD_HHMMSS` (e.g., `20241221_143025`)
- Names sort chronologically in plain `ls` output
- No spaces or special characters, so the names are safe to use in scripts

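The timestamp format above corresponds to the `strftime` pattern `%Y%m%d_%H%M%S`. The sketch below shows how such names can be built and parsed back; `stamped_name` and `parse_stamp` are hypothetical helpers for illustration, and the project's own code may construct names differently.

```python
from datetime import datetime
from typing import Optional

STAMP_FORMAT = "%Y%m%d_%H%M%S"  # YYYYMMDD_HHMMSS


def stamped_name(prefix: str, ext: str, when: Optional[datetime] = None) -> str:
    """Build '<prefix>_<timestamp>.<ext>' using the documented format."""
    if when is None:
        when = datetime.now()
    return f"{prefix}_{when.strftime(STAMP_FORMAT)}.{ext}"


def parse_stamp(filename: str, prefix: str) -> datetime:
    """Recover the timestamp from a name produced by stamped_name."""
    stem = filename.rsplit(".", 1)[0]       # drop the extension
    stamp = stem[len(prefix) + 1:]          # drop '<prefix>_'
    return datetime.strptime(stamp, STAMP_FORMAT)


if __name__ == "__main__":
    name = stamped_name("pokemon_tcg_products", "json")
    print(name, "->", parse_stamp(name, "pokemon_tcg_products"))
```

Because every field is zero-padded and ordered from year down to second, sorting the names as plain strings gives the same order as sorting by time.
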
## Example Output

```
pokemon-disco/
├── pokemon_tcg_products_20241221_143025.json    # Scraped data
├── catalog_output/
│   ├── pokemon_tcg_catalog_20241221_143025.pdf  # Final catalog
│   ├── pokemon_tcg_catalog_20241221_143025.md   # Markdown source
│   ├── images/
│   │   ├── product_1_SKU123456.jpg              # Product images
│   │   └── product_2_SKU789012.jpg
│   └── barcodes/
│       ├── barcode_SKU123456.png                # UPC-A barcodes
│       └── barcode_SKU789012.png
```