# Pokemon Discovery (pokemon-disco)

A comprehensive tool for discovering Pokemon Trading Card Game products from Dollar General's website and generating a professional PDF catalog with product images, details, and UPC-A barcodes.

## Features

- **Web Scraping**: Automatically scrapes Pokemon TCG products from Dollar General
- **Robust Data Extraction**: Extracts product name, price, stock status, SKU, and images
- **Anti-Bot Handling**: Tries requests first and falls back to Selenium for dynamic content
- **Barcode Generation**: Creates UPC-A barcodes for each product SKU
- **PDF Catalog**: Professional PDF with images, details, and barcodes
- **Unix-Friendly Naming**: Timestamped filenames for easy sorting

## Requirements

### System Requirements

- Python 3.7+
- pandoc (for PDF generation; it relies on a LaTeX engine, see Technical Details)
- Chrome/Chromium browser (for Selenium fallback)

### Python Dependencies

All dependencies are listed in `requirements.txt`, which `run_scraper.py` installs automatically:

- requests
- beautifulsoup4
- selenium
- webdriver-manager
- python-barcode
- Pillow
- pandas
- lxml

## Installation

1. **Clone/Download** this directory to your system

2. **Install pandoc** (required for PDF generation):

   ```bash
   # Ubuntu/Debian
   sudo apt install pandoc

   # macOS
   brew install pandoc

   # Arch Linux
   sudo pacman -S pandoc
   ```

3. **Install Python dependencies** (done automatically by `run_scraper.py`; run this only for a manual setup):

   ```bash
   cd pokemon-disco
   pip3 install -r requirements.txt
   ```

## Usage

### Quick Start (Recommended)

Run the complete pipeline with one command:

```bash
cd pokemon-disco
python3 run_scraper.py
```

This will:

1. Check and install Python requirements
2. Scrape Pokemon TCG products from Dollar General
3. Generate a PDF catalog with images and barcodes
4. Create timestamped files for easy organization

### Manual Usage

If you prefer to run components separately:

#### 1. Scrape Products

```bash
python3 scraper.py
```

This creates a JSON file like `pokemon_tcg_products_20241221_143025.json`.

#### 2. Generate PDF Catalog

```bash
python3 pdf_generator.py pokemon_tcg_products_20241221_143025.json
```

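Because the timestamp in the filename is zero-padded and ordered from year down to second, sorting the names alphabetically also sorts them chronologically. The sketch below uses that to find the most recent scrape to pass to `pdf_generator.py`. The helper name `latest_scrape` is hypothetical and not part of the project.

```python
from pathlib import Path
from typing import Optional


def latest_scrape(directory: str = ".") -> Optional[Path]:
    """Return the newest pokemon_tcg_products_*.json file, or None if there is none.

    Hypothetical helper: relies on the YYYYMMDD_HHMMSS suffix sorting
    lexicographically in chronological order.
    """
    candidates = sorted(Path(directory).glob("pokemon_tcg_products_*.json"))
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    newest = latest_scrape()
    print(newest if newest else "No scrape output found in the current directory")
```
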
## Output Files

### Generated Files

- **JSON Data**: `pokemon_tcg_products_YYYYMMDD_HHMMSS.json`
  - Raw scraped data in JSON format
  - Contains all product information

- **PDF Catalog**: `catalog_output/pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf`
  - Professional PDF catalog
  - Includes product images, details, and UPC-A barcodes

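The `YYYYMMDD_HHMMSS` pattern maps directly onto a `strftime` format string. A minimal sketch follows; the helper name `timestamped_name` is an assumption for illustration and may not match what the scripts use internally.

```python
from datetime import datetime
from typing import Optional


def timestamped_name(prefix: str, ext: str, now: Optional[datetime] = None) -> str:
    """Build a name of the form prefix_YYYYMMDD_HHMMSS.ext (hypothetical helper)."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.{ext}"


example = timestamped_name(
    "pokemon_tcg_products", "json", datetime(2024, 12, 21, 14, 30, 25)
)
print(example)  # pokemon_tcg_products_20241221_143025.json
```
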
### Output Directory Structure

```
pokemon-disco/
├── pokemon_tcg_products_YYYYMMDD_HHMMSS.json
├── catalog_output/
│   ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf
│   ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.md
│   ├── images/
│   │   ├── product_1_SKU123.jpg
│   │   ├── product_2_SKU456.jpg
│   │   └── placeholder.png
│   └── barcodes/
│       ├── barcode_SKU123.png
│       ├── barcode_SKU456.png
│       └── ...
```

## PDF Catalog Features

Each product in the PDF includes:

- **Product Image**: Downloaded from Dollar General, or a placeholder if the download fails
- **Product Details Table**:
  - Title
  - Price
  - Stock Status
  - SKU (formatted as code)
  - Product URL
- **UPC-A Barcode**: Generated from SKU for inventory management

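To illustrate how one catalog entry could be laid out in the intermediate Markdown file, here is a rough sketch. The function name `render_product`, the table layout, and the image alt text are all assumptions; the real `pdf_generator.py` may structure its output differently.

```python
def render_product(product: dict, image_path: str, barcode_path: str) -> str:
    """Render one product as a Markdown section (illustrative layout only)."""
    rows = [
        ("Title", product["title"]),
        ("Price", product["price"]),
        ("Stock Status", product["stock"]),
        ("SKU", f"`{product['sku']}`"),  # SKU formatted as code
        ("Product URL", product["url"]),
    ]
    lines = [f"## {product['title']}", "", f"![Product image]({image_path})", ""]
    lines += ["| Field | Value |", "| --- | --- |"]
    lines += [f"| {name} | {value} |" for name, value in rows]
    lines += ["", f"![UPC-A barcode]({barcode_path})", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values, not real scraped data.
    demo = {
        "title": "Example Booster Pack",
        "price": "$4.99",
        "stock": "In Stock",
        "sku": "12345678",
        "url": "https://example.com/product",
    }
    print(render_product(demo, "images/product_1_12345678.jpg", "barcodes/barcode_12345678.png"))
```
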
## Data Fields Extracted

For each Pokemon TCG product:

- `title`: Product name
- `price`: Current price
- `stock`: Availability status
- `sku`: Product SKU/item number
- `image_url`: Direct link to product image
- `url`: Link to product page

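For readers who want to load the JSON output in their own scripts, the six fields can be modelled as a small dataclass. This sketch assumes the file holds a list of objects keyed by the field names above and that every value is a string; both are assumptions, and the sample values are placeholders rather than real scraped data.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Product:
    title: str
    price: str
    stock: str
    sku: str
    image_url: str
    url: str


# Placeholder values for illustration only.
sample = Product(
    title="Example Pokemon TCG Booster Pack",
    price="$4.99",
    stock="In Stock",
    sku="12345678",
    image_url="https://example.com/image.jpg",
    url="https://example.com/product",
)

# Round-trip through JSON, assuming a top-level list of product objects.
encoded = json.dumps([asdict(sample)], indent=2)
decoded = [Product(**item) for item in json.loads(encoded)]
print(decoded[0].title)
```
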
## Troubleshooting

### Common Issues

1. **No products found**
   - Dollar General may have anti-bot protection
   - The script will automatically retry with Selenium
   - Website structure may have changed

2. **PDF generation fails**
   - Ensure pandoc is installed: `pandoc --version`
   - If the default LaTeX engine fails, try another installed engine (pandoc selects one with `--pdf-engine`)
   - Markdown file is still generated for manual conversion

3. **Image download failures**
   - Network connectivity issues
   - Placeholder images will be used automatically

4. **Chrome/Selenium issues**
   - Ensure Chrome or Chromium is installed
   - webdriver-manager will automatically download ChromeDriver
   - Script falls back to requests-only mode if Selenium fails

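Several of the issues above come down to a missing system tool. A quick way to check is to look each one up on `PATH`, as in this sketch. The function `missing_tools` is hypothetical, and the Chrome executable names listed are common examples that vary by platform.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str] = ("pandoc",),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # Executable names differ between systems; these are only candidates.
    candidates = ["pandoc", "google-chrome", "chromium", "chromium-browser"]
    print("Not found on PATH:", missing_tools(candidates))
```

Only one of the Chrome candidates needs to be present, so a missing entry in that group is not necessarily a problem.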
### Debug Mode

The scripts log their progress to the console as they run. Check that output during scraping for details on:

- Which products are found and filtered
- Network request status
- File generation progress

## Technical Details

### Scraping Strategy

1. **Primary Method**: Uses requests with browser-like headers
2. **Fallback Method**: Selenium with headless Chrome for dynamic content
3. **Product Filtering**: Only includes products matching Pokemon TCG keywords
4. **Rate Limiting**: 1-second delay between requests to be respectful

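The primary/fallback pattern with rate limiting can be sketched independently of any particular HTTP library. In the code below, `primary` and `fallback` are caller-supplied functions that take a URL and return HTML or `None`; these names and the overall structure are assumptions for illustration, not the actual layout of `scraper.py`.

```python
import time
from typing import Callable, Iterable, List, Optional

Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(url: str, primary: Fetcher, fallback: Fetcher) -> Optional[str]:
    """Try `primary`; if it raises or returns nothing, try `fallback`."""
    try:
        html = primary(url)
    except Exception:
        html = None
    if html:
        return html
    try:
        return fallback(url)
    except Exception:
        return None


def fetch_all(
    urls: Iterable[str],
    primary: Fetcher,
    fallback: Fetcher,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index:
            sleep(delay)  # rate limiting: wait before every request after the first
        html = fetch_with_fallback(url, primary, fallback)
        if html:
            pages.append(html)
    return pages
```

In practice `primary` could wrap `requests.get(url, headers=..., timeout=...)` and return the response text, while `fallback` could drive headless Chrome through Selenium and return `driver.page_source`. Passing the fetchers in as arguments keeps the retry logic easy to test without network access.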
### Barcode Generation

- Converts SKUs to 11-digit numeric format
- Generates UPC-A barcodes, appending the check digit as the 12th digit
- High-quality PNG images suitable for printing

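The UPC-A check digit follows a fixed rule: triple the sum of the digits in odd positions, add the digits in even positions, and take the amount needed to reach the next multiple of 10. The sketch below shows that arithmetic. How the project reduces a SKU to 11 digits is not documented here, so `sku_to_upc_payload` uses an assumed rule (keep the digits, zero-pad on the left, keep the last 11 if there are more), and all three function names are hypothetical.

```python
def sku_to_upc_payload(sku: str) -> str:
    """Reduce a SKU to 11 digits (assumed rule, not confirmed by the project)."""
    digits = "".join(ch for ch in sku if ch.isascii() and ch.isdigit())
    return digits[-11:].zfill(11)


def upc_a_check_digit(payload: str) -> int:
    """Compute the UPC-A check digit for an 11-digit payload."""
    if len(payload) != 11 or not (payload.isascii() and payload.isdigit()):
        raise ValueError("UPC-A payload must be exactly 11 digits")
    odd = sum(int(d) for d in payload[0::2])   # positions 1, 3, ..., 11
    even = sum(int(d) for d in payload[1::2])  # positions 2, 4, ..., 10
    return (10 - (odd * 3 + even) % 10) % 10


def sku_to_upc_a(sku: str) -> str:
    """Return the full 12-digit UPC-A string for a SKU."""
    payload = sku_to_upc_payload(sku)
    return payload + str(upc_a_check_digit(payload))


print(upc_a_check_digit("03600029145"))  # 2
print(sku_to_upc_a("SKU-12345678"))      # 000123456784
```

Note that a barcode produced this way encodes the SKU, not the manufacturer's registered UPC, so it is only meaningful for internal use.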
### PDF Generation

- Uses pandoc with LaTeX for professional formatting
- Includes table of contents
- Optimized for printing and digital viewing
- Images scaled appropriately for page layout

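A pandoc invocation matching the points above might look like the following sketch. The `-o`, `--toc`, and `--pdf-engine` options are standard pandoc flags, but the function names, the list of engines to try, and the retry order are assumptions; `pdf_generator.py` may pass different options.

```python
import subprocess
from typing import List, Sequence


def build_pandoc_command(markdown_path: str, pdf_path: str, engine: str = "pdflatex") -> List[str]:
    """Build a pandoc command that writes a PDF with a table of contents."""
    return ["pandoc", markdown_path, "-o", pdf_path, "--toc", f"--pdf-engine={engine}"]


def convert(
    markdown_path: str,
    pdf_path: str,
    engines: Sequence[str] = ("pdflatex", "xelatex", "lualatex"),
) -> bool:
    """Try each LaTeX engine in turn; return True once one succeeds."""
    for engine in engines:
        command = build_pandoc_command(markdown_path, pdf_path, engine)
        try:
            result = subprocess.run(command, capture_output=True, text=True)
        except FileNotFoundError:
            return False  # pandoc itself is not installed
        if result.returncode == 0:
            return True
    return False
```

If every engine fails, the generated Markdown file can still be converted by hand.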
## Customization

### Modifying Product Filters

Edit the `is_pokemon_tcg_product()` method in `scraper.py` to change which products are included.

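As a starting point for a custom filter, a keyword check could look like the sketch below. It is written as a standalone function taking a title string, which is an assumption about the signature, and the keyword lists are examples rather than the ones the project ships with.

```python
# Example keyword lists; adjust to taste. These are not the project's actual lists.
BRAND_KEYWORDS = ("pokemon", "pokémon")
TCG_KEYWORDS = ("tcg", "trading card", "booster")
EXCLUDE_KEYWORDS = ("plush", "shirt")


def is_pokemon_tcg_product(title: str) -> bool:
    """Return True if the title looks like a Pokemon TCG product."""
    text = title.lower()
    if not any(keyword in text for keyword in BRAND_KEYWORDS):
        return False
    if any(keyword in text for keyword in EXCLUDE_KEYWORDS):
        return False
    return any(keyword in text for keyword in TCG_KEYWORDS)


for name in ("Pokemon TCG Booster Pack", "Pokemon Pikachu Plush"):
    print(name, "->", is_pokemon_tcg_product(name))
```

Plain substring matching is simple but can produce false positives on short keywords, so longer phrases are safer.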
### Changing PDF Layout

Modify the markdown generation in `pdf_generator.py` or add custom pandoc templates.

### Adding New Data Fields

Extend the `extract_product_info()` method in `scraper.py` to capture additional product information.

## License

This tool is for educational and personal use. Please respect Dollar General's terms of service and robots.txt when using this scraper.

## Support

If you encounter issues:

1. Check the console output for error messages
2. Ensure all system requirements are installed
3. Verify internet connectivity
4. Check if the Dollar General website structure has changed

Generated files include timestamps for easy organization and version tracking.