# Pokemon Discovery (pokemon-disco)

A comprehensive tool for discovering Pokemon Trading Card Game products from Dollar General's website and generating a professional PDF catalog with product images, details, and UPC-A barcodes.

## Features

- **Web Scraping**: Automatically scrapes Pokemon TCG products from Dollar General
- **Robust Data Extraction**: Extracts product name, price, stock status, SKU, and images
- **Anti-Bot Handling**: Tries requests first and falls back to Selenium for dynamic content
- **Barcode Generation**: Creates UPC-A barcodes for each product SKU
- **PDF Catalog**: Professional PDF with images, details, and barcodes
- **Unix-Friendly Naming**: Timestamped filenames for easy sorting

## Requirements

### System Requirements
- Python 3.7+
- pandoc (for PDF generation)
- Chrome, Chromium, or Brave browser (for the Selenium fallback)

### Python Dependencies
All dependencies are listed in `requirements.txt` and installed automatically by `run_scraper.py`:
- requests
- beautifulsoup4
- selenium
- webdriver-manager
- python-barcode
- Pillow
- pandas
- lxml

## Installation

1. **Clone or download** this directory to your system.

2. **Install pandoc** (required for PDF generation):
   ```bash
   # Ubuntu/Debian
   sudo apt install pandoc

   # macOS
   brew install pandoc

   # Arch Linux
   sudo pacman -S pandoc
   ```

3. **Install Python dependencies** (the script also does this automatically):
   ```bash
   cd pokemon-disco
   pip3 install -r requirements.txt
   ```

## Usage

### Quick Start (Recommended)

Run the complete pipeline with one command:

```bash
cd pokemon-disco
python3 run_scraper.py
```

This will:
1. Check and install Python requirements
2. Scrape Pokemon TCG products from Dollar General
3. Generate a PDF catalog with images and barcodes
4. Create timestamped files for easy organization

### Manual Usage

If you prefer to run components separately:

#### 1. Scrape Products
```bash
python3 scraper.py
```
This creates a JSON file like `pokemon_tcg_products_20241221_143025.json`

#### 2. Generate PDF Catalog
```bash
python3 pdf_generator.py pokemon_tcg_products_20241221_143025.json
```

## Output Files

### Generated Files
- **JSON Data**: `pokemon_tcg_products_YYYYMMDD_HHMMSS.json`
  - Raw scraped data in JSON format
  - Contains all product information

- **PDF Catalog**: `catalog_output/pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf`
  - Professional PDF catalog
  - Includes product images, details, and UPC-A barcodes

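
The `YYYYMMDD_HHMMSS` pattern above sorts chronologically when filenames are sorted as plain text. A minimal sketch of how such names could be built; `timestamped_name` is a hypothetical helper, not necessarily what `scraper.py` or `pdf_generator.py` do internally:

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def timestamped_name(prefix: str, extension: str, when: Optional[datetime] = None) -> str:
    """Return '<prefix>_YYYYMMDD_HHMMSS.<extension>' (hypothetical helper)."""
    when = when or datetime.now()
    return f"{prefix}_{when.strftime('%Y%m%d_%H%M%S')}.{extension}"


# A fixed timestamp keeps the example reproducible.
stamp = datetime(2024, 12, 21, 14, 30, 25)
json_name = timestamped_name("pokemon_tcg_products", "json", stamp)
pdf_path = Path("catalog_output") / timestamped_name("pokemon_tcg_catalog", "pdf", stamp)

print(json_name)
print(pdf_path.as_posix())
```

Because the most significant date parts come first, `ls` and `sorted()` list older runs before newer ones without any date parsing.
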
### Output Directory Structure
```
pokemon-disco/
├── pokemon_tcg_products_YYYYMMDD_HHMMSS.json
├── catalog_output/
│   ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf
│   ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.md
│   ├── images/
│   │   ├── product_1_SKU123.jpg
│   │   ├── product_2_SKU456.jpg
│   │   └── placeholder.png
│   └── barcodes/
│       ├── barcode_SKU123.png
│       ├── barcode_SKU456.png
│       └── ...
```

## PDF Catalog Features

Each product in the PDF includes:
- **Product Image**: Downloaded from Dollar General or placeholder
- **Product Details Table**:
  - Title
  - Price
  - Stock Status
  - SKU (formatted as code)
  - Product URL
- **UPC-A Barcode**: Generated from SKU for inventory management

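
As an illustration of the entry layout described above, here is a rough sketch of rendering one product as Markdown before pandoc converts it. The function name, table headers, and sample values are assumptions for illustration; the real markup produced by `pdf_generator.py` may differ:

```python
def product_to_markdown(product: dict, image_path: str, barcode_path: str) -> str:
    """Render one catalog entry as Markdown (hypothetical layout)."""

    def cell(value: object) -> str:
        # Escape pipes so a value cannot break the table row.
        return str(value).replace("|", "\\|")

    rows = [
        ("Title", product["title"]),
        ("Price", product["price"]),
        ("Stock Status", product["stock"]),
        ("SKU", f"`{product['sku']}`"),  # SKU formatted as code
        ("Product URL", product["url"]),
    ]
    lines = [
        f"## {product['title']}",
        "",
        f"![{product['title']}]({image_path})",
        "",
        "| Field | Value |",
        "|-------|-------|",
    ]
    lines += [f"| {name} | {cell(value)} |" for name, value in rows]
    lines += ["", f"![UPC-A barcode]({barcode_path})", ""]
    return "\n".join(lines)


# Made-up sample record, not real scraped data.
sample = {
    "title": "Example Pokemon TCG Booster Pack",
    "price": "$4.99",
    "stock": "In Stock",
    "sku": "SKU123",
    "url": "https://example.com/product/SKU123",
}
entry = product_to_markdown(sample, "images/product_1_SKU123.jpg", "barcodes/barcode_SKU123.png")
print(entry)
```
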
## Data Fields Extracted

For each Pokemon TCG product:
- `title`: Product name
- `price`: Current price
- `stock`: Availability status
- `sku`: Product SKU/item number
- `image_url`: Direct link to product image
- `url`: Link to product page

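
For downstream processing, the six fields above map naturally onto a small record type. The sketch below assumes the JSON file is an array of objects whose values are strings; both the `Product` class and the sample values are illustrative, not taken from the scraper:

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Product:
    """One scraped product; field names follow the list above."""
    title: str
    price: str
    stock: str
    sku: str
    image_url: str
    url: str


def load_products(text: str) -> List[Product]:
    """Parse a JSON array of product objects (assumed file layout)."""
    return [Product(**item) for item in json.loads(text)]


# Made-up sample data standing in for a scraped JSON file.
sample_json = json.dumps([{
    "title": "Example Pokemon TCG Booster Pack",
    "price": "$4.99",
    "stock": "In Stock",
    "sku": "SKU123",
    "image_url": "https://example.com/images/SKU123.jpg",
    "url": "https://example.com/product/SKU123",
}])

products = load_products(sample_json)
print(products[0].title, products[0].sku)
```

Note that `Product(**item)` raises `TypeError` if a record carries a key outside these six, which is a quick way to notice when the scraped schema has changed.
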
## Troubleshooting

### Common Issues

1. **No products found**
   - Dollar General may have anti-bot protection
   - The script will automatically retry with Selenium
   - Website structure may have changed

2. **PDF generation fails**
   - Ensure pandoc is installed: `pandoc --version`
   - Try alternative LaTeX engines if available
   - Markdown file is still generated for manual conversion

3. **Image download failures**
   - Network connectivity issues
   - Placeholder images will be used automatically

4. **Browser/Selenium issues**
   - **Brave browser supported**: Configured to use Brave at `/usr/bin/brave`
   - **ChromeDriver compatibility**: The ChromeDriver major version must match the browser's (Brave 146 with ChromeDriver 114, for example, is a mismatch)
   - **Alternative browsers**: Chrome, Chromium, or Firefox with geckodriver
   - Script falls back to requests-only mode if Selenium fails

**For Brave users**: If you see a ChromeDriver version mismatch:

```bash
# Test browser integration
python3 test_brave.py

# Solutions for version mismatch:
pip3 install --upgrade webdriver-manager
# or manually install a compatible ChromeDriver
```

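
To make the browser fallback concrete, here is a minimal sketch of locating a Chromium-family binary and deciding between Selenium and requests-only mode. Only `/usr/bin/brave` comes from this README; the other candidate names and both function names are assumptions, and `make_driver` is shown for orientation rather than as the project's actual initialization code:

```python
import shutil
from typing import Iterable, Optional

# Only /usr/bin/brave is documented here; the other names are guesses.
CANDIDATES = ("/usr/bin/brave", "brave-browser", "chromium", "google-chrome")


def find_browser_binary(candidates: Iterable[str] = CANDIDATES) -> Optional[str]:
    """Return the first candidate that resolves to an executable, else None."""
    for candidate in candidates:
        resolved = shutil.which(candidate)
        if resolved:
            return resolved
    return None


def make_driver(binary: str):
    """Start a headless browser through Selenium 4 (imported lazily)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    options.binary_location = binary  # point ChromeDriver at Brave/Chromium
    options.add_argument("--headless=new")
    return webdriver.Chrome(service=Service(), options=options)


binary = find_browser_binary()
mode = "selenium" if binary else "requests-only"
print(f"browser: {binary!r}, mode: {mode}")
```

`make_driver` is deliberately not called here, since starting a browser still depends on a ChromeDriver whose version matches the installed browser.
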
### Debug Mode

For more detail, watch the console output during scraping. The scripts log:
- Which products are found and filtered
- Network request status
- File generation progress

## Technical Details

### Scraping Strategy
1. **Primary Method**: Uses requests with browser-like headers
2. **Fallback Method**: Selenium with headless Chrome for dynamic content
3. **Product Filtering**: Only includes products matching Pokemon TCG keywords
4. **Rate Limiting**: 1-second delay between requests to be respectful

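
The primary/fallback flow and the delay between requests can be sketched as follows. The header values, the `scrape` function, and the stand-in fetchers are assumptions made for illustration; in the real tool the fallback would be a Selenium-backed fetcher:

```python
import time
from typing import Callable, Dict, Iterable, Optional

# Browser-like headers; the exact values are illustrative assumptions.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

Fetcher = Callable[[str], Optional[str]]


def fetch_with_requests(url: str) -> Optional[str]:
    """Primary fetcher: plain HTTP GET; returns None on any request failure."""
    import requests  # imported lazily so the sketch loads without it

    try:
        response = requests.get(url, headers=BROWSER_HEADERS, timeout=15)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return None


def scrape(urls: Iterable[str], primary: Fetcher,
           fallback: Optional[Fetcher] = None,
           delay: float = 1.0) -> Dict[str, Optional[str]]:
    """Fetch each URL, trying `fallback` whenever `primary` returns None."""
    pages: Dict[str, Optional[str]] = {}
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay)  # pause between consecutive requests
        html = primary(url)
        if html is None and fallback is not None:
            html = fallback(url)
        pages[url] = html
    return pages


# Stand-in fetchers so the example runs offline.
demo = scrape(
    ["https://example.com/a", "https://example.com/b"],
    primary=lambda url: None if url.endswith("/b") else "<html>static</html>",
    fallback=lambda url: "<html>rendered</html>",
    delay=0,
)
print(demo)
```

Passing the fetchers in as callables keeps the retry logic testable without network access or a browser.
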
### Barcode Generation
- Converts SKUs to 11-digit numeric format
- Generates UPC-A barcodes with check digits
- High-quality PNG images suitable for printing

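
A UPC-A code is 12 digits: an 11-digit payload plus one check digit. The check-digit arithmetic below is the standard UPC-A rule; the SKU-to-payload mapping (keep the digits, left-pad with zeros, keep the last 11) is only an assumed convention, since this README does not spell out how SKUs are converted:

```python
import re


def sku_to_payload(sku: str) -> str:
    """Map a SKU to 11 digits (assumed rule: strip non-digits, zero-pad, keep last 11)."""
    digits = re.sub(r"[^0-9]", "", sku)
    return digits[-11:].zfill(11)


def upc_a_check_digit(payload: str) -> int:
    """Standard UPC-A check digit for an 11-digit payload."""
    if len(payload) != 11 or not payload.isdigit():
        raise ValueError("UPC-A payload must be exactly 11 digits")
    odd = sum(int(d) for d in payload[0::2])   # positions 1, 3, ..., 11
    even = sum(int(d) for d in payload[1::2])  # positions 2, 4, ..., 10
    return (10 - (3 * odd + even) % 10) % 10


def sku_to_upc_a(sku: str) -> str:
    """Return the full 12-digit UPC-A string for a SKU."""
    payload = sku_to_payload(sku)
    return payload + str(upc_a_check_digit(payload))


print(upc_a_check_digit("03600029145"))  # 2
print(sku_to_upc_a("SKU123"))            # 000000001236
```

For `03600029145`, the odd positions sum to 14 and the even positions to 16, so 3 × 14 + 16 = 58 and the check digit is 2. A barcode library such as python-barcode (listed in the requirements) can then render the digits as an image.
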
### PDF Generation
- Uses pandoc with LaTeX for professional formatting
- Includes table of contents
- Optimized for printing and digital viewing
- Images scaled appropriately for page layout

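
A rough sketch of the pandoc step, including the "try another LaTeX engine" idea from the troubleshooting notes. `--toc` and `--pdf-engine` are real pandoc options; the function names, the engine order, and the lack of any custom template are assumptions, so the command `pdf_generator.py` actually runs may differ:

```python
import shutil
import subprocess
from typing import List, Sequence


def pandoc_command(markdown: str, pdf: str, engine: str = "xelatex") -> List[str]:
    """Build a pandoc invocation with a table of contents and a chosen LaTeX engine."""
    return ["pandoc", markdown, "-o", pdf, "--toc", f"--pdf-engine={engine}"]


def build_pdf(markdown: str, pdf: str,
              engines: Sequence[str] = ("xelatex", "pdflatex", "lualatex")) -> bool:
    """Try each engine in turn; return True on the first successful conversion."""
    if shutil.which("pandoc") is None:
        return False  # pandoc is not installed or not on PATH
    for engine in engines:
        result = subprocess.run(pandoc_command(markdown, pdf, engine),
                                capture_output=True, text=True)
        if result.returncode == 0:
            return True
    return False


command = pandoc_command("catalog_output/catalog.md", "catalog_output/catalog.pdf")
print(" ".join(command))
```

If every engine fails, the Markdown file is still on disk and can be converted by hand.
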
## Customization

### Modifying Product Filters
Edit the `is_pokemon_tcg_product()` method in `scraper.py` to change which products are included.

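
As a starting point for a custom filter, here is a standalone keyword-based sketch. The keyword lists and the accent-stripping step are assumptions; the real `is_pokemon_tcg_product()` is a method in `scraper.py` and may use different rules:

```python
import unicodedata

# Hypothetical keyword lists; adjust to taste.
BRAND_TERMS = ("pokemon",)
TCG_TERMS = ("tcg", "trading card", "booster", "elite trainer box")


def normalize(text: str) -> str:
    """Lowercase and strip accents so 'Pokémon' matches 'pokemon'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).casefold()


def is_pokemon_tcg_product(title: str) -> bool:
    """True when the title names the brand and at least one TCG term."""
    text = normalize(title)
    return any(term in text for term in BRAND_TERMS) and any(term in text for term in TCG_TERMS)


for title in ("Pokemon TCG Booster Pack", "Pokemon Plush Toy", "Generic Trading Card Sleeves"):
    print(title, "->", is_pokemon_tcg_product(title))
```

Requiring both a brand term and a TCG term keeps out unrelated Pokemon merchandise as well as other card games.
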
### Changing PDF Layout
Modify the markdown generation in `pdf_generator.py` or add custom pandoc templates.

### Adding New Data Fields
Extend the `extract_product_info()` method in `scraper.py` to capture additional product information.

## License

This tool is for educational and personal use. Please respect Dollar General's terms of service and robots.txt when using this scraper.

## Support

If you encounter issues:
1. Check the console output for error messages
2. Ensure all system requirements are installed
3. Verify internet connectivity
4. Check if the Dollar General website structure has changed

Generated files include timestamps for easy organization and version tracking.