Initial commit: Pokemon Discovery - TCG product scraper and PDF catalog generator
- Comprehensive scraper for Dollar General Pokemon TCG products
- Professional PDF catalog generator with UPC-A barcodes
- Robust anti-bot handling with requests + Selenium fallback
- Automatic image downloading and barcode generation
- Unix-friendly timestamped filenames
- Virtual environment support and dependency management
- Complete documentation and usage guides
README.md (new file, 208 lines)
@@ -0,0 +1,208 @@
# Pokemon Discovery (pokemon-disco)

A comprehensive tool for discovering Pokemon Trading Card Game products from Dollar General's website and generating a professional PDF catalog with product images, details, and UPC-A barcodes.

## Features

- **Web Scraping**: Automatically scrapes Pokemon TCG products from Dollar General
- **Robust Data Extraction**: Extracts product name, price, stock status, SKU, and images
- **Anti-Bot Handling**: Tries `requests` first and falls back to Selenium for dynamic or bot-protected pages
- **Barcode Generation**: Creates a UPC-A barcode for each product SKU
- **PDF Catalog**: Professional PDF with images, details, and barcodes
- **Unix-Friendly Naming**: Timestamped filenames for easy sorting

## Requirements

### System Requirements

- Python 3.7+
- pandoc with a LaTeX engine (for PDF generation)
- Chrome/Chromium browser (for Selenium fallback)

### Python Dependencies

All Python dependencies are listed in `requirements.txt`. `run_scraper.py` installs them automatically, or you can install them yourself with `pip3`:

- requests
- beautifulsoup4
- selenium
- webdriver-manager
- python-barcode
- Pillow
- pandas
- lxml

## Installation

1. **Clone or download** this directory to your system.

2. **Install pandoc** (required for PDF generation):

   ```bash
   # Ubuntu/Debian
   sudo apt install pandoc

   # macOS
   brew install pandoc

   # Arch Linux
   sudo pacman -S pandoc
   ```

3. **Install Python dependencies** (`run_scraper.py` also does this automatically):

   ```bash
   cd pokemon-disco
   pip3 install -r requirements.txt
   ```

## Usage

### Quick Start (Recommended)

Run the complete pipeline with one command:

```bash
cd pokemon-disco
python3 run_scraper.py
```

This will:

1. Check and install Python requirements
2. Scrape Pokemon TCG products from Dollar General
3. Generate a PDF catalog with images and barcodes
4. Create timestamped files for easy organization

### Manual Usage

If you prefer to run components separately:

#### 1. Scrape Products

```bash
python3 scraper.py
```

This creates a JSON file like `pokemon_tcg_products_20241221_143025.json`.

#### 2. Generate PDF Catalog

```bash
python3 pdf_generator.py pokemon_tcg_products_20241221_143025.json
```

## Output Files

### Generated Files

- **JSON Data**: `pokemon_tcg_products_YYYYMMDD_HHMMSS.json`
  - Raw scraped data in JSON format
  - Contains all product information

- **PDF Catalog**: `catalog_output/pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf`
  - Professional PDF catalog
  - Includes product images, details, and UPC-A barcodes

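The `YYYYMMDD_HHMMSS` suffix can be produced with `datetime.strftime`. The following is a minimal sketch, not the project's actual code; the helper name `timestamped_name` is hypothetical.

```python
from datetime import datetime
from typing import Optional


def timestamped_name(prefix: str, ext: str, now: Optional[datetime] = None) -> str:
    """Build a name like prefix_YYYYMMDD_HHMMSS.ext (hypothetical helper)."""
    now = now or datetime.now()
    # Zero-padded fields ordered from year to second sort chronologically
    # when the filenames are sorted as plain text.
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.{ext}"


# Fixed timestamp so the result is reproducible.
print(timestamped_name("pokemon_tcg_products", "json", datetime(2024, 12, 21, 14, 30, 25)))
```

Because every field is zero-padded, a plain `ls` lists the files oldest first.
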
### Output Directory Structure

```
pokemon-disco/
├── pokemon_tcg_products_YYYYMMDD_HHMMSS.json
├── catalog_output/
│   ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf
│   ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.md
│   ├── images/
│   │   ├── product_1_SKU123.jpg
│   │   ├── product_2_SKU456.jpg
│   │   └── placeholder.png
│   └── barcodes/
│       ├── barcode_SKU123.png
│       ├── barcode_SKU456.png
│       └── ...
```

## PDF Catalog Features

Each product in the PDF includes:

- **Product Image**: Downloaded from Dollar General, or a placeholder if the download fails
- **Product Details Table**:
  - Title
  - Price
  - Stock Status
  - SKU (formatted as code)
  - Product URL
- **UPC-A Barcode**: Generated from the SKU for inventory management

## Data Fields Extracted

For each Pokemon TCG product:

- `title`: Product name
- `price`: Current price
- `stock`: Availability status
- `sku`: Product SKU/item number
- `image_url`: Direct link to product image
- `url`: Link to product page

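To make the shape of a record concrete, here is a sketch of one product with the six fields listed above. The `Product` dataclass, the string types, and all values are assumptions for illustration; the scraper itself may store plain dictionaries or use different value formats.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Product:
    """One scraped product; field names follow the README's field list."""
    title: str
    price: str       # assumed to be kept as displayed text
    stock: str
    sku: str
    image_url: str
    url: str


# Placeholder values for illustration only, not real scraped data.
example = Product(
    title="Example Pokemon TCG Booster Pack",
    price="$4.99",
    stock="In Stock",
    sku="SKU123",
    image_url="https://example.com/images/sku123.jpg",
    url="https://example.com/products/sku123",
)

# Serialize the record the way it might appear in the JSON output file.
print(json.dumps(asdict(example), indent=2))
```
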
## Troubleshooting

### Common Issues

1. **No products found**
   - Dollar General may have anti-bot protection
   - The script will automatically retry with Selenium
   - Website structure may have changed

2. **PDF generation fails**
   - Ensure pandoc is installed: `pandoc --version`
   - Try an alternative LaTeX engine with pandoc's `--pdf-engine` option, if one is installed
   - The Markdown file is still generated, so you can convert it manually

3. **Image download failures**
   - Usually caused by network connectivity issues
   - Placeholder images will be used automatically

4. **Chrome/Selenium issues**
   - Ensure Chrome or Chromium is installed
   - webdriver-manager will automatically download ChromeDriver
   - The script falls back to requests-only mode if Selenium fails

### Debug Mode

The scripts print detailed progress to the console while they run. Check that output for:

- Which products are found and filtered
- Network request status
- File generation progress

## Technical Details

### Scraping Strategy

1. **Primary Method**: Uses requests with browser-like headers
2. **Fallback Method**: Uses Selenium with headless Chrome for dynamic content
3. **Product Filtering**: Only includes products matching Pokemon TCG keywords
4. **Rate Limiting**: Waits 1 second between requests to limit load on the site

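The primary/fallback order and the delay between requests can be sketched as follows. This is an illustration under assumptions, not the code in `scraper.py`: the header values and the function names `fetch_with_requests`, `fetch_page`, and `fetch_all` are hypothetical, and the Selenium fallback is passed in as a callable rather than implemented here.

```python
import time
from typing import Callable, Iterable, List, Optional

Fetcher = Callable[[str], Optional[str]]

# Assumed header set; the real scraper's headers are not shown in this README.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch_with_requests(url: str) -> Optional[str]:
    """Primary method: plain HTTP GET with browser-like headers."""
    import requests  # imported here so the logic below runs without it

    try:
        resp = requests.get(url, headers=BROWSER_HEADERS, timeout=15)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return None  # tells the caller to try the fallback


def fetch_page(url: str, primary: Fetcher, fallback: Fetcher) -> Optional[str]:
    """Try the primary fetcher; use the fallback only if it returns None."""
    html = primary(url)
    if html is None:
        html = fallback(url)
    return html


def fetch_all(urls: Iterable[str], primary: Fetcher, fallback: Fetcher,
              delay: float = 1.0,
              sleep: Callable[[float], None] = time.sleep) -> List[Optional[str]]:
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    pages = []
    for i, url in enumerate(urls):
        if i:  # no pause before the first request
            sleep(delay)
        pages.append(fetch_page(url, primary, fallback))
    return pages
```

Passing the fetchers and `sleep` as parameters keeps the control flow easy to test with stubs, for example `fetch_all(urls, fetch_with_requests, my_selenium_fetch)` where `my_selenium_fetch` is whatever Selenium-based function you provide.
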
### Barcode Generation

- Converts each SKU to an 11-digit numeric string
- Appends the UPC-A check digit to form the 12-digit barcode value
- Saves high-quality PNG images suitable for printing

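The check digit follows the standard UPC-A rule: triple the sum of the digits in odd positions, add the digits in even positions, and take the amount needed to reach the next multiple of 10. The sketch below shows that calculation. How the project maps a SKU to 11 digits is not documented here, so `sku_to_upc_body` (keep the digits, left-pad with zeros, keep the last 11) is an assumption, and all function names are hypothetical.

```python
def sku_to_upc_body(sku: str) -> str:
    """Assumed mapping: keep ASCII digits, left-pad with zeros to 11 digits."""
    digits = "".join(ch for ch in sku if ch.isascii() and ch.isdigit())
    return digits[-11:].zfill(11)


def upc_a_check_digit(body: str) -> int:
    """Compute the UPC-A check digit for an 11-digit body."""
    if len(body) != 11 or not (body.isascii() and body.isdigit()):
        raise ValueError("UPC-A body must be exactly 11 digits")
    odd = sum(int(d) for d in body[0::2])   # positions 1, 3, ..., 11
    even = sum(int(d) for d in body[1::2])  # positions 2, 4, ..., 10
    return (10 - (3 * odd + even) % 10) % 10


def sku_to_upc_a(sku: str) -> str:
    """Return the full 12-digit UPC-A value for a SKU."""
    body = sku_to_upc_body(sku)
    return body + str(upc_a_check_digit(body))


print(sku_to_upc_a("03600029145"))  # 036000291452
print(sku_to_upc_a("SKU123"))       # 000000001236
```

A library such as python-barcode, which is listed in the dependencies, can then render the 12-digit value as an image.
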
### PDF Generation

- Uses pandoc with LaTeX for professional formatting
- Includes table of contents
- Optimized for printing and digital viewing
- Images scaled appropriately for page layout

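A pandoc call matching the points above might look like the sketch below. The options `-o`, `--toc`, and `--pdf-engine` are standard pandoc flags, but the exact command used by `pdf_generator.py` is not shown in this README, so the function names and the default engine `xelatex` are assumptions.

```python
import shutil
import subprocess
from typing import List


def build_pandoc_command(markdown_path: str, pdf_path: str,
                         engine: str = "xelatex") -> List[str]:
    """Assemble a pandoc command that renders Markdown to PDF with a TOC."""
    return [
        "pandoc", str(markdown_path),
        "-o", str(pdf_path),
        "--toc",                    # include a table of contents
        f"--pdf-engine={engine}",   # LaTeX engine used for the PDF
    ]


def render_pdf(markdown_path: str, pdf_path: str, engine: str = "xelatex") -> bool:
    """Run pandoc; return False if it is missing or exits with an error."""
    if shutil.which("pandoc") is None:
        return False
    result = subprocess.run(
        build_pandoc_command(markdown_path, pdf_path, engine),
        capture_output=True, text=True,
    )
    return result.returncode == 0
```

If `render_pdf` returns `False`, the generated Markdown file can still be converted by hand, for instance with a different `engine` value.
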
## Customization

### Modifying Product Filters

Edit the `is_pokemon_tcg_product()` method in `scraper.py` to change which products are included.

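As a starting point for such a change, a keyword filter could look like the sketch below. It is written as a standalone function rather than a method, and the keyword lists are assumptions; the actual lists and logic in `scraper.py` may differ.

```python
# Assumed keyword lists for illustration; adjust them to your needs.
BRAND_KEYWORDS = ("pokemon", "pokémon")
TCG_KEYWORDS = ("tcg", "trading card", "booster", "elite trainer box")
EXCLUDED_KEYWORDS = ("plush", "shirt")


def is_pokemon_tcg_product(title: str) -> bool:
    """Return True if the title looks like a Pokemon TCG product."""
    text = title.casefold()  # case-insensitive matching
    if not any(word in text for word in BRAND_KEYWORDS):
        return False
    if any(word in text for word in EXCLUDED_KEYWORDS):
        return False
    return any(word in text for word in TCG_KEYWORDS)


for title in ("Pokemon TCG Booster Pack", "Pokemon Plush Pikachu"):
    print(title, "->", is_pokemon_tcg_product(title))
```

Adding a term to `TCG_KEYWORDS` widens the filter, while adding one to `EXCLUDED_KEYWORDS` narrows it.
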
### Changing PDF Layout

Modify the markdown generation in `pdf_generator.py` or add custom pandoc templates.

### Adding New Data Fields

Extend the `extract_product_info()` method in `scraper.py` to capture additional product information.

## License

This tool is for educational and personal use. Please respect Dollar General's terms of service and robots.txt when using this scraper.

## Support

If you encounter issues:

1. Check the console output for error messages
2. Ensure all system requirements are installed
3. Verify internet connectivity
4. Check if the Dollar General website structure has changed

Generated files include timestamps for easy organization and version tracking.