Initial commit: Pokemon Discovery - TCG product scraper and PDF catalog generator
- Comprehensive scraper for Dollar General Pokemon TCG products
- Professional PDF catalog generator with UPC-A barcodes
- Robust anti-bot handling with requests + Selenium fallback
- Automatic image downloading and barcode generation
- Unix-friendly timestamped filenames
- Virtual environment support and dependency management
- Complete documentation and usage guides
37
.gitignore
vendored
Normal file
@@ -0,0 +1,37 @@
# Virtual environment
venv/
env/
.env

# Python cache
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
*.so
.pytest_cache/

# Output files
*.json
catalog_output/
test_output/

# Logs
*.log

# OS files
.DS_Store
Thumbs.db
.directory

# IDE files
.vscode/
.idea/
*.swp
*.swo

# Temporary files
*.tmp
*.temp
.cache/
208
README.md
Normal file
@@ -0,0 +1,208 @@
# Pokemon Discovery (pokemon-disco)

A comprehensive tool for discovering Pokemon Trading Card Game products from Dollar General's website and generating a professional PDF catalog with product images, details, and UPC-A barcodes.

## Features

- **Web Scraping**: Automatically scrapes Pokemon TCG products from Dollar General
- **Robust Data Extraction**: Extracts product name, price, stock status, SKU, and images
- **Anti-Bot Handling**: Tries `requests` first and falls back to headless Selenium when the request fails or returns very little content
- **Barcode Generation**: Creates UPC-A barcodes for each product SKU
- **PDF Catalog**: Professional PDF with images, details, and barcodes
- **Unix-Friendly Naming**: Timestamped filenames for easy sorting

## Requirements

### System Requirements
- Python 3.7+
- pandoc (for PDF generation)
- A LaTeX engine (`xelatex` or `pdflatex`), which pandoc uses to produce the PDF
- Chrome/Chromium browser (for Selenium fallback)

### Python Dependencies
The Python dependencies are listed in `requirements.txt`; `run.sh` and `run_scraper.py` install them when they are missing:
- requests
- beautifulsoup4
- selenium
- webdriver-manager
- python-barcode
- Pillow
- pandas
- lxml

## Installation

1. **Clone/Download** this directory to your system

2. **Install pandoc** (required for PDF generation):
   ```bash
   # Ubuntu/Debian
   sudo apt install pandoc

   # macOS
   brew install pandoc

   # Arch Linux
   sudo pacman -S pandoc
   ```

3. **Install Python dependencies** (optional; the scripts install them when missing):
   ```bash
   cd pokemon-disco
   pip3 install -r requirements.txt
   ```

## Usage

### Quick Start (Recommended)

Run the complete pipeline with one command:

```bash
cd pokemon-disco
python3 run_scraper.py
```

This will:
1. Check and install Python requirements
2. Scrape Pokemon TCG products from Dollar General
3. Generate a PDF catalog with images and barcodes
4. Create timestamped files for easy organization

### Manual Usage

If you prefer to run components separately:

#### 1. Scrape Products
```bash
python3 scraper.py
```
This creates a JSON file like `pokemon_tcg_products_20241221_143025.json`

#### 2. Generate PDF Catalog
```bash
python3 pdf_generator.py pokemon_tcg_products_20241221_143025.json
```

## Output Files

### Generated Files
- **JSON Data**: `pokemon_tcg_products_YYYYMMDD_HHMMSS.json`
  - Raw scraped data in JSON format
  - Contains all product information

- **PDF Catalog**: `catalog_output/pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf`
  - Professional PDF catalog
  - Includes product images, details, and UPC-A barcodes

### Output Directory Structure
```
pokemon-disco/
├── pokemon_tcg_products_YYYYMMDD_HHMMSS.json
├── catalog_output/
│   ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf
│   ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.md
│   ├── images/
│   │   ├── product_1_SKU123.jpg
│   │   ├── product_2_SKU456.jpg
│   │   └── placeholder.png
│   └── barcodes/
│       ├── barcode_SKU123.png
│       ├── barcode_SKU456.png
│       └── ...
```

## PDF Catalog Features

Each product in the PDF includes:
- **Product Image**: Downloaded from Dollar General or placeholder
- **Product Details Table**:
  - Title
  - Price
  - Stock Status
  - SKU (formatted as code)
  - Product URL
- **UPC-A Barcode**: Generated from SKU for inventory management

## Data Fields Extracted

For each Pokemon TCG product:
- `title`: Product name
- `price`: Current price
- `stock`: Availability status
- `sku`: Product SKU/item number
- `image_url`: Direct link to product image
- `url`: Link to product page
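
As an illustration of how these fields might be consumed, the sketch below loads a list of product records and reads fields with a fallback, in the same spirit as the `product.get(..., 'N/A')` calls in `pdf_generator.py`. The helper name `load_products` and the sample values are hypothetical; only the field names come from the list above.

```python
import json
from typing import Dict, List

# Field names taken from the list above.
FIELDS = ("title", "price", "stock", "sku", "image_url", "url")


def load_products(json_text: str) -> List[Dict[str, str]]:
    """Parse scraper output, assumed to be a JSON list of product records."""
    data = json.loads(json_text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of product records")
    return data


# Hypothetical sample values, for illustration only.
sample = '[{"title": "Example Booster Pack", "price": "$4.99", "sku": "12345678"}]'
products = load_products(sample)

for field in FIELDS:
    # Any field can be absent when extraction fails, so read with a default.
    print(f"{field}: {products[0].get(field, 'N/A')}")
```

Reading every field through `.get()` keeps downstream code working when a selector misses and a key is absent from the record.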

## Troubleshooting

### Common Issues

1. **No products found**
   - Dollar General may have anti-bot protection
   - The script will automatically retry with Selenium
   - Website structure may have changed

2. **PDF generation fails**
   - Ensure pandoc is installed: `pandoc --version`
   - Ensure a LaTeX engine is installed; the generator tries `xelatex` first, then `pdflatex`
   - The Markdown file is still generated for manual conversion

3. **Image download failures**
   - Network connectivity issues
   - Placeholder images will be used automatically

4. **Chrome/Selenium issues**
   - Ensure Chrome or Chromium is installed
   - webdriver-manager will automatically download ChromeDriver
   - Script falls back to requests-only mode if Selenium is unavailable

### Debug Mode

To see more detailed output, check the console output during scraping. The scripts provide detailed logging of:
- Which products are found and filtered
- Network request status
- File generation progress

## Technical Details

### Scraping Strategy
1. **Primary Method**: Uses requests with browser-like headers
2. **Fallback Method**: Selenium with headless Chrome for dynamic content
3. **Product Filtering**: Only includes products matching Pokemon TCG keywords
4. **Rate Limiting**: 1-second delay between requests to be respectful
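
The primary/fallback flow above can be sketched independently of any HTTP library. This is a minimal illustration, not the scraper's actual code: `fetch_with_fallback`, `blocked`, and `rendered` are hypothetical names, and the stand-in fetchers replace real `requests` and Selenium calls so the example runs offline. The 1000-character threshold mirrors the basic content check in `scraper.py`.

```python
import time
from typing import Callable, Optional

Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(url: str, primary: Fetcher, fallback: Fetcher,
                        min_length: int = 1000, delay: float = 1.0) -> Optional[str]:
    """Return HTML from `primary`, or from `fallback` if the result is missing or too short."""
    html = primary(url)
    if html and len(html) > min_length:
        result = html
    else:
        result = fallback(url)
    time.sleep(delay)  # pause before the caller issues its next request
    return result


# Stand-in fetchers so the sketch runs without network access.
def blocked(url: str) -> Optional[str]:
    return "<html>Access denied</html>"  # too short to be a real product page


def rendered(url: str) -> Optional[str]:
    return "<html>" + "x" * 2000 + "</html>"


page = fetch_with_fallback("https://example.com/p/1", blocked, rendered, delay=0)
print(len(page))
```

Passing the fetchers in as callables keeps the retry policy separate from the transport, which also makes the policy easy to test.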

### Barcode Generation
- Strips non-digit characters from the SKU, then left-pads with zeros (or keeps the last 11 digits) to form an 11-digit base
- Generates UPC-A barcodes; the 12th digit is the computed check digit
- High-quality PNG images suitable for printing
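
To make the digit handling concrete, here is a small sketch of the padding rule and the standard UPC-A check-digit calculation. In the project itself the check digit is left to the `python-barcode` library; the helper names `sku_to_upc_base` and `upc_a_check_digit` are hypothetical and exist only for this illustration.

```python
DIGITS = "0123456789"


def sku_to_upc_base(sku: str) -> str:
    """Keep only ASCII digits, then left-pad with zeros or keep the last 11 digits."""
    digits = "".join(ch for ch in str(sku) if ch in DIGITS)
    return digits.zfill(11) if len(digits) < 11 else digits[-11:]


def upc_a_check_digit(base: str) -> int:
    """Compute the UPC-A check digit for an 11-digit base."""
    if len(base) != 11 or any(ch not in DIGITS for ch in base):
        raise ValueError("UPC-A base must be exactly 11 digits")
    odd_sum = sum(int(d) for d in base[0::2])   # positions 1, 3, ..., 11
    even_sum = sum(int(d) for d in base[1::2])  # positions 2, 4, ..., 10
    return (10 - (odd_sum * 3 + even_sum) % 10) % 10


base = sku_to_upc_base("SKU123")
print(base + str(upc_a_check_digit(base)))
```

Note that a code built this way encodes the padded SKU. It is a scannable UPC-A symbol, but it is not necessarily the UPC printed on the product's packaging.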

### PDF Generation
- Uses pandoc with LaTeX (`xelatex`, falling back to `pdflatex`) for professional formatting
- Includes table of contents
- Optimized for printing and digital viewing
- Images scaled appropriately for page layout
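
Because the PDF step relies on external programs, a quick preflight check can show what is missing before a long scrape. The sketch below is an optional helper, not part of the project; the function name `check_pdf_toolchain` is hypothetical, and it only reports whether each program is discoverable on `PATH`.

```python
import shutil
from typing import Dict, Tuple


def check_pdf_toolchain(
    tools: Tuple[str, ...] = ("pandoc", "xelatex", "pdflatex"),
) -> Dict[str, bool]:
    """Report which external programs are discoverable on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


status = check_pdf_toolchain()
for tool, found in status.items():
    print(f"{tool}: {'found' if found else 'missing'}")

if not status["pandoc"] or not (status["xelatex"] or status["pdflatex"]):
    print("PDF generation will likely fail; the Markdown file can still be produced.")
```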

## Customization

### Modifying Product Filters
Edit the `is_pokemon_tcg_product()` method in `scraper.py` to change which products are included.
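
As a rough idea of what a keyword filter can look like, consider the sketch below. It is a hypothetical stand-in and makes no claim about how `is_pokemon_tcg_product()` is actually written: the function name `looks_like_tcg_product` and both keyword lists are assumptions chosen for illustration.

```python
from typing import Iterable

# Hypothetical keyword lists, chosen only for this example.
INCLUDE_KEYWORDS = ("booster", "trading card", "tcg", "elite trainer")
EXCLUDE_KEYWORDS = ("plush", "backpack", "t-shirt")


def looks_like_tcg_product(title: str,
                           include: Iterable[str] = INCLUDE_KEYWORDS,
                           exclude: Iterable[str] = EXCLUDE_KEYWORDS) -> bool:
    """Return True when the title names Pokemon, a TCG keyword, and no excluded word."""
    text = title.lower()
    if "pokemon" not in text and "pokémon" not in text:
        return False
    if any(word in text for word in exclude):
        return False
    return any(word in text for word in include)


print(looks_like_tcg_product("Pokemon TCG Booster Pack"))
print(looks_like_tcg_product("Pokemon Pikachu Plush"))
```

Keeping the keyword lists as parameters means the filter can be tightened or loosened without touching the matching logic.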

### Changing PDF Layout
Modify the markdown generation in `pdf_generator.py` or add custom pandoc templates.

### Adding New Data Fields
Extend the `extract_product_info()` method in `scraper.py` to capture additional product information.

## License

This tool is for educational and personal use. Please respect Dollar General's terms of service and robots.txt when using this scraper.

## Support

If you encounter issues:
1. Check the console output for error messages
2. Ensure all system requirements are installed
3. Verify internet connectivity
4. Check if the Dollar General website structure has changed

Generated files include timestamps for easy organization and version tracking.
115
USAGE.md
Normal file
@@ -0,0 +1,115 @@
# Quick Start Guide

## Simple Usage (Recommended)

1. **Make sure you're in the project directory:**
   ```bash
   cd pokemon-disco
   ```

2. **Run the complete scraper and PDF generator:**
   ```bash
   ./run.sh
   ```

This single command will:
- Set up the Python virtual environment
- Install all required packages
- Scrape Pokemon TCG products from Dollar General
- Generate a professional PDF catalog with barcodes
- Create timestamped files for easy organization

## What You'll Get

### Generated Files:
- **`pokemon_tcg_products_YYYYMMDD_HHMMSS.json`** - Raw data in JSON format
- **`catalog_output/pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf`** - Professional PDF catalog

### PDF Catalog Contents:
- Product images (downloaded automatically)
- Product details (title, price, stock, SKU)
- UPC-A barcodes for each product (generated from SKU)
- Table of contents for easy navigation
- Professional formatting suitable for printing

## Alternative Commands

If you prefer more control:

```bash
# Activate virtual environment first
source venv/bin/activate

# Run only the scraper
python scraper.py

# Run only the PDF generator (after scraping)
python pdf_generator.py pokemon_tcg_products_YYYYMMDD_HHMMSS.json

# Run everything (installs requirements automatically)
python run_scraper.py
```

## Output Location

All generated files will be in:
- JSON data: Current directory
- PDF catalog: `catalog_output/` directory
- Product images: `catalog_output/images/`
- Barcode images: `catalog_output/barcodes/`

## Requirements

- Python 3.7+
- pandoc with a LaTeX engine such as `xelatex` or `pdflatex` (for PDF generation)
- Internet connection (for scraping)

The script will automatically handle Python dependencies via a virtual environment.

## Troubleshooting

If you encounter issues:

1. **Permission denied:** Make sure the script is executable:
   ```bash
   chmod +x run.sh
   ```

2. **Pandoc not found:** Install pandoc for your system:
   ```bash
   # Ubuntu/Debian
   sudo apt install pandoc

   # Arch Linux
   sudo pacman -S pandoc

   # macOS
   brew install pandoc
   ```

3. **No products found:** The website may have anti-bot protection or changed structure. The script includes fallback mechanisms.

4. **PDF generation fails:** The markdown file will still be generated, which you can manually convert or view.

## File Naming Convention

All output files include Unix-friendly timestamps:
- Format: `YYYYMMDD_HHMMSS` (e.g., `20241221_143025`)
- This ensures chronological sorting with the `ls` command
- No spaces or special characters for script-friendly handling
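
The naming scheme can be reproduced with `datetime.strftime`. The sketch below is illustrative only; the helper name `timestamped_name` is hypothetical, and the fixed dates are there so the example is repeatable.

```python
from datetime import datetime


def timestamped_name(prefix: str, ext: str, when: datetime) -> str:
    """Build a name such as pokemon_tcg_products_20241221_143025.json."""
    return f"{prefix}_{when.strftime('%Y%m%d_%H%M%S')}.{ext}"


names = [
    timestamped_name("pokemon_tcg_products", "json", datetime(2024, 12, 21, 14, 30, 25)),
    timestamped_name("pokemon_tcg_products", "json", datetime(2024, 3, 5, 9, 0, 0)),
]

# Zero-padded fields ordered from year down to second make string order chronological.
print(sorted(names)[-1])
```

Because the fields are zero-padded and ordered from most to least significant, sorting the names as plain strings also sorts them by time, which is why a bare `ls` lists them chronologically.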

## Example Output

```
pokemon-disco/
├── pokemon_tcg_products_20241221_143025.json     # Scraped data
├── catalog_output/
│   ├── pokemon_tcg_catalog_20241221_143025.pdf   # Final catalog
│   ├── pokemon_tcg_catalog_20241221_143025.md    # Markdown source
│   ├── images/
│   │   ├── product_1_SKU123456.jpg               # Product images
│   │   └── product_2_SKU789012.jpg
│   └── barcodes/
│       ├── barcode_SKU123456.png                 # UPC-A barcodes
│       └── barcode_SKU789012.png
```
278
pdf_generator.py
Executable file
@@ -0,0 +1,278 @@
#!/usr/bin/env python3
"""
Pokemon Discovery - TCG Product Catalog PDF Generator
Generates PDF catalog with product images, details, and UPC-A barcodes
"""

import json
import os
import sys
import requests
import subprocess
from datetime import datetime
from pathlib import Path
import barcode
from barcode.writer import ImageWriter
from PIL import Image, ImageDraw, ImageFont
import tempfile
import shutil

class PokemonTCGCatalogGenerator:
    def __init__(self, json_file):
        self.json_file = json_file
        self.output_dir = Path("catalog_output")
        self.images_dir = self.output_dir / "images"
        self.barcodes_dir = self.output_dir / "barcodes"

        # Create output directories
        self.output_dir.mkdir(exist_ok=True)
        self.images_dir.mkdir(exist_ok=True)
        self.barcodes_dir.mkdir(exist_ok=True)

        # Load product data
        with open(json_file, 'r') as f:
            self.products = json.load(f)

    def download_image(self, url, filename):
        """Download product image"""
        if not url:
            return None

        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()

            filepath = self.images_dir / filename
            with open(filepath, 'wb') as f:
                f.write(response.content)

            return filepath
        except Exception as e:
            print(f"Failed to download image {url}: {e}")
            return None

    def generate_upc_barcode(self, sku):
        """Generate UPC-A barcode from SKU"""
        try:
            # Convert SKU to 12-digit UPC-A format
            # Remove non-digits and pad/truncate to 11 digits (12th is check digit)
            digits_only = ''.join(filter(str.isdigit, str(sku)))

            if len(digits_only) < 11:
                # Pad with zeros at the start
                upc_base = digits_only.zfill(11)
            else:
                # Take the last 11 digits
                upc_base = digits_only[-11:]

            # Generate UPC-A barcode
            upc_generator = barcode.get_barcode_class('upca')
            upc = upc_generator(upc_base, writer=ImageWriter())

            # Save barcode image (str() guards against numeric SKUs from JSON)
            barcode_filename = f"barcode_{str(sku).replace('/', '_').replace(' ', '_')}.png"
            barcode_path = self.barcodes_dir / barcode_filename

            # Save with specific options for better appearance.
            # The writer appends ".png" itself, so the extension is stripped here.
            upc.save(str(barcode_path).replace('.png', ''), options={
                'module_width': 0.2,
                'module_height': 15.0,
                'quiet_zone': 6.5,
                'font_size': 10,
                'text_distance': 5.0,
                'background': 'white',
                'foreground': 'black'
            })

            # barcode_path already ends in ".png", so return it unchanged
            return str(barcode_path)

        except Exception as e:
            print(f"Failed to generate barcode for SKU {sku}: {e}")
            return None

    def create_placeholder_image(self, width=300, height=200):
        """Create a placeholder image when product image is not available"""
        img = Image.new('RGB', (width, height), color='lightgray')
        draw = ImageDraw.Draw(img)

        try:
            # Try to use a system font
            font = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf', 24)
        except OSError:
            try:
                font = ImageFont.truetype('arial.ttf', 24)
            except OSError:
                # Neither font file could be opened; use Pillow's built-in font
                font = ImageFont.load_default()

        text = "No Image\nAvailable"

        # Get text bounding box for centering
        lines = text.split('\n')
        y_offset = height // 2 - (len(lines) * 30) // 2

        for line in lines:
            bbox = draw.textbbox((0, 0), line, font=font)
            text_width = bbox[2] - bbox[0]
            x_offset = (width - text_width) // 2
            draw.text((x_offset, y_offset), line, fill='darkgray', font=font)
            y_offset += 30

        placeholder_path = self.images_dir / "placeholder.png"
        img.save(placeholder_path)
        return placeholder_path

    def generate_markdown(self):
        """Generate markdown content for the catalog"""
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        markdown = f"""---
title: "Pokemon TCG Product Catalog"
subtitle: "Dollar General - Generated {timestamp}"
author: "Automated Scraper"
date: "{timestamp}"
geometry: margin=1in
fontsize: 11pt
documentclass: article
---

# Pokemon TCG Product Catalog

Generated on: {timestamp}
Source: Dollar General
Total Products: {len(self.products)}

---

"""

        for i, product in enumerate(self.products, 1):
            print(f"Processing product {i}/{len(self.products)}: {product.get('title', 'Unknown')}")

            # Download product image
            image_path = None
            if product.get('image_url'):
                filename = f"product_{i}_{str(product.get('sku', 'unknown')).replace('/', '_').replace(' ', '_')}.jpg"
                image_path = self.download_image(product.get('image_url'), filename)

            if not image_path:
                # Use placeholder
                image_path = self.create_placeholder_image()

            # Generate barcode
            barcode_path = None
            if product.get('sku'):
                barcode_path = self.generate_upc_barcode(product.get('sku'))

            # Add product section to markdown
            markdown += f"## {i}. {product.get('title', 'Unknown Product')}\n\n"

            # Product image (pandoc syntax: ![alt](path){{attributes}}; alt text is assumed)
            if image_path:
                rel_image_path = os.path.relpath(image_path, self.output_dir)
                markdown += f"![Product Image]({rel_image_path}){{width=300px}}\n\n"

            # Product details in a table
            markdown += "| Field | Value |\n"
            markdown += "|-------|-------|\n"
            markdown += f"| **Title** | {product.get('title', 'N/A')} |\n"
            markdown += f"| **Price** | {product.get('price', 'N/A')} |\n"
            markdown += f"| **Stock** | {product.get('stock', 'N/A')} |\n"
            markdown += f"| **SKU** | `{product.get('sku', 'N/A')}` |\n"
            markdown += f"| **URL** | {product.get('url', 'N/A')} |\n"
            markdown += "\n"

            # Barcode (alt text is assumed)
            if barcode_path:
                rel_barcode_path = os.path.relpath(barcode_path, self.output_dir)
                markdown += f"**UPC-A Barcode:**\n\n"
                markdown += f"![UPC-A Barcode]({rel_barcode_path}){{width=200px}}\n\n"

            markdown += "---\n\n"

        return markdown

    def generate_pdf(self):
        """Generate PDF catalog using pandoc"""
        print("Generating markdown content...")
        markdown_content = self.generate_markdown()

        # Save markdown file
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        markdown_file = self.output_dir / f"pokemon_tcg_catalog_{timestamp}.md"

        with open(markdown_file, 'w', encoding='utf-8') as f:
            f.write(markdown_content)

        print(f"Markdown saved to: {markdown_file}")

        # Generate PDF using pandoc
        pdf_file = self.output_dir / f"pokemon_tcg_catalog_{timestamp}.pdf"

        # Image paths in the markdown are relative to output_dir, so tell
        # pandoc to resolve them there rather than in the working directory
        resource_path = f'--resource-path={self.output_dir}'

        print("Converting to PDF using pandoc...")

        try:
            subprocess.run([
                'pandoc',
                str(markdown_file),
                '-o', str(pdf_file),
                resource_path,
                '--pdf-engine=xelatex',
                '-V', 'colorlinks=true',
                '-V', 'linkcolor=blue',
                '-V', 'filecolor=magenta',
                '-V', 'urlcolor=cyan',
                '--toc',
                '--toc-depth=2'
            ], check=True)

            print(f"PDF generated successfully: {pdf_file}")
            return pdf_file

        except subprocess.CalledProcessError as e:
            print(f"Pandoc conversion failed: {e}")
            print("Trying with pdflatex instead...")

            try:
                subprocess.run([
                    'pandoc',
                    str(markdown_file),
                    '-o', str(pdf_file),
                    resource_path,
                    '--pdf-engine=pdflatex',
                    '--toc'
                ], check=True)

                print(f"PDF generated successfully: {pdf_file}")
                return pdf_file

            except subprocess.CalledProcessError as e2:
                print(f"PDF generation failed with both engines: {e2}")
                print(f"Markdown file available at: {markdown_file}")
                return None

        except FileNotFoundError:
            print("Error: pandoc not found. Please install pandoc to generate PDF.")
            print(f"Markdown file available at: {markdown_file}")
            return None

def main():
    if len(sys.argv) != 2:
        print("Usage: python3 pdf_generator.py <json_file>")
        print("Example: python3 pdf_generator.py pokemon_tcg_products_20241221_143025.json")
        sys.exit(1)

    json_file = sys.argv[1]

    if not os.path.exists(json_file):
        print(f"Error: JSON file '{json_file}' not found")
        sys.exit(1)

    generator = PokemonTCGCatalogGenerator(json_file)
    pdf_file = generator.generate_pdf()

    if pdf_file:
        print(f"\nCatalog generation completed!")
        print(f"PDF file: {pdf_file}")
        print(f"Output directory: {generator.output_dir}")
    else:
        print(f"\nPDF generation failed, but markdown file is available in: {generator.output_dir}")

if __name__ == "__main__":
    main()
8
requirements.txt
Normal file
@@ -0,0 +1,8 @@
requests
beautifulsoup4
selenium
webdriver-manager
python-barcode
Pillow
pandas
lxml
31
run.sh
Executable file
@@ -0,0 +1,31 @@
#!/bin/bash
# Pokemon Discovery - Scraper & Catalog Generator Launcher
# Automatically activates virtual environment and runs the scraper

set -e

cd "$(dirname "$0")"

echo "Pokemon Discovery - Product Scraper & Catalog Generator"
echo "================================================"

# Check if virtual environment exists
if [[ ! -d "venv" ]]; then
    echo "Creating virtual environment..."
    python3 -m venv venv
fi

# Activate virtual environment
source venv/bin/activate

# Check if requirements are installed
if ! python -c "import requests, bs4, barcode, selenium" 2>/dev/null; then
    echo "Installing Python requirements..."
    pip install -r requirements.txt
fi

# Run the main script
python run_scraper.py

echo ""
echo "Script completed. Check the output above for results."
139
run_scraper.py
Executable file
@@ -0,0 +1,139 @@
#!/usr/bin/env python3
"""
Pokemon Discovery - Scraper and Catalog Generator
Main script that runs both scraping and PDF generation
"""

import os
import sys
import subprocess
from datetime import datetime
from pathlib import Path

def install_requirements():
    """Install Python requirements"""
    print("Installing Python requirements...")
    try:
        subprocess.run([sys.executable, '-m', 'pip', 'install', '-r', 'requirements.txt'],
                       check=True)
        print("Requirements installed successfully!")
    except subprocess.CalledProcessError as e:
        print(f"Failed to install requirements: {e}")
        return False
    return True

def run_scraper():
    """Run the scraper to collect product data"""
    print("=" * 60)
    print("STEP 1: SCRAPING POKEMON TCG PRODUCTS")
    print("=" * 60)

    try:
        result = subprocess.run([sys.executable, 'scraper.py'],
                                capture_output=True, text=True)

        if result.returncode == 0:
            print("Scraping completed successfully!")
            print(result.stdout)

            # Find the generated JSON file
            json_files = list(Path('.').glob('pokemon_tcg_products_*.json'))
            if json_files:
                latest_file = max(json_files, key=os.path.getctime)
                return str(latest_file)
            else:
                print("No JSON file was generated")
                return None
        else:
            print("Scraping failed:")
            print(result.stderr)
            return None

    except Exception as e:
        print(f"Error running scraper: {e}")
        return None

def run_pdf_generator(json_file):
    """Run the PDF generator with the scraped data"""
    print("=" * 60)
    print("STEP 2: GENERATING PDF CATALOG")
    print("=" * 60)

    try:
        result = subprocess.run([sys.executable, 'pdf_generator.py', json_file],
                                capture_output=True, text=True)

        if result.returncode == 0:
            print("PDF generation completed successfully!")
            print(result.stdout)
            return True
        else:
            print("PDF generation failed:")
            print(result.stderr)
            return False

    except Exception as e:
        print(f"Error running PDF generator: {e}")
        return False

def main():
    print("Pokemon Discovery - Product Scraper & Catalog Generator")
    print("=" * 60)
    print(f"Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print()

    # Check if requirements are installed
    try:
        import requests, bs4, barcode, PIL
        print("✓ Required packages are available")
    except ImportError as e:
        print(f"✗ Missing required package: {e}")
        print("Installing requirements...")
        if not install_requirements():
            sys.exit(1)

    # Check if pandoc is available
    try:
        subprocess.run(['pandoc', '--version'],
                       capture_output=True, check=True)
        print("✓ Pandoc is available for PDF generation")
    except (subprocess.CalledProcessError, FileNotFoundError):
        print("⚠ Pandoc not found. PDF generation may fail.")
        print("  Install pandoc with: sudo apt install pandoc (Ubuntu/Debian)")
        print("                   or: brew install pandoc (macOS)")
        print("                   or: pacman -S pandoc (Arch Linux)")

    print()

    # Run scraper
    json_file = run_scraper()
    if not json_file:
        print("Scraping failed. Exiting.")
        sys.exit(1)

    # Run PDF generator
    if run_pdf_generator(json_file):
        print("=" * 60)
        print("SUCCESS! Both scraping and PDF generation completed.")
        print("=" * 60)
        print(f"JSON data: {json_file}")
        print("PDF catalog: Check the catalog_output/ directory")
        print()
        print("Files generated:")

        # List generated files
        for file_pattern in ['pokemon_tcg_products_*.json', 'catalog_output/pokemon_tcg_catalog_*.pdf']:
            files = list(Path('.').glob(file_pattern))
            if files:
                latest = max(files, key=os.path.getctime)
                print(f"  - {latest}")
    else:
        print("=" * 60)
        print("PARTIAL SUCCESS: Scraping completed, but PDF generation failed.")
        print("=" * 60)
        print(f"JSON data: {json_file}")
        print("You can manually run the PDF generator with:")
        print(f"  python3 pdf_generator.py {json_file}")

if __name__ == "__main__":
    main()
329
scraper.py
Executable file
@@ -0,0 +1,329 @@
#!/usr/bin/env python3
"""
Pokemon Discovery - TCG Product Scraper for Dollar General
Scrapes product information and saves to JSON for PDF generation
"""

import json
import os
import time
import requests
from datetime import datetime
from urllib.parse import urljoin, urlparse
import pandas as pd
from bs4 import BeautifulSoup

# Try selenium imports (fallback for dynamic content)
try:
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException
    from webdriver_manager.chrome import ChromeDriverManager
    SELENIUM_AVAILABLE = True
except ImportError:
    SELENIUM_AVAILABLE = False
    print("Selenium not available, using requests only")

class PokemonTCGScraper:
    def __init__(self):
        self.base_url = "https://www.dollargeneral.com"
        self.search_url = "https://www.dollargeneral.com/c/toys/pokemon?q=&soldAtStore=true"
        self.session = requests.Session()

        # Headers to appear more like a real browser
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'DNT': '1',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
        }
        self.session.headers.update(self.headers)

        self.products = []

    def get_page_with_requests(self, url):
        """Try to get page content using requests"""
        try:
            response = self.session.get(url, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            print(f"Requests failed for {url}: {e}")
            return None

    def get_page_with_selenium(self, url):
        """Fallback to selenium for dynamic content"""
        if not SELENIUM_AVAILABLE:
            return None

        options = Options()
        options.add_argument('--headless')
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        options.add_argument('--disable-gpu')
        options.add_argument(f'--user-agent={self.headers["User-Agent"]}')

        driver = None
        try:
            # Selenium 4 takes the driver path through a Service object
            service = Service(ChromeDriverManager().install())
            driver = webdriver.Chrome(service=service, options=options)
            driver.get(url)

            # Wait for content to load
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.TAG_NAME, "body"))
            )

            # Additional wait for dynamic content
            time.sleep(3)

            return driver.page_source

        except Exception as e:
            print(f"Selenium failed for {url}: {e}")
            return None

        finally:
            # Always release the browser, whether or not the fetch succeeded
            if driver is not None:
                driver.quit()

    def get_page_content(self, url):
        """Get page content, trying requests first, then selenium"""
        print(f"Fetching: {url}")

        # Try requests first
        content = self.get_page_with_requests(url)
        if content and len(content) > 1000:  # Basic content check
            return content

        # Fallback to selenium
        print("Falling back to Selenium...")
        return self.get_page_with_selenium(url)

    def extract_product_links(self, html):
        """Extract product page links from search results"""
        soup = BeautifulSoup(html, 'html.parser')
        links = []

        # Common selectors for product links
        selectors = [
            'a[href*="/p/"]',
            '.product-item a',
            '.product-card a',
            '.product-link',
            '[data-testid*="product"] a'
        ]

        for selector in selectors:
            elements = soup.select(selector)
            for element in elements:
                href = element.get('href')
                if href and '/p/' in href:
                    full_url = urljoin(self.base_url, href)
                    if full_url not in links:
                        links.append(full_url)

        return links
|
||||
|
||||
    def extract_product_info(self, url, html):
        """Extract product information from product page"""
        soup = BeautifulSoup(html, 'html.parser')
        product = {'url': url}

        # Extract title
        title_selectors = [
            'h1',
            '.product-title',
            '.product-name',
            '[data-testid="product-title"]',
            '.pdp-product-name'
        ]

        for selector in title_selectors:
            title_elem = soup.select_one(selector)
            if title_elem:
                product['title'] = title_elem.get_text().strip()
                break

        # Extract price
        price_selectors = [
            '.price',
            '.product-price',
            '[data-testid="price"]',
            '.price-current',
            '.current-price'
        ]

        for selector in price_selectors:
            price_elem = soup.select_one(selector)
            if price_elem:
                price_text = price_elem.get_text().strip()
                product['price'] = price_text
                break

        # Extract SKU
        # Only valid CSS selectors belong here: select_one() cannot parse
        # XPath-style text() tests such as '*[text()*="SKU"]'.
        sku_selectors = [
            '[data-sku]',
            '.sku',
            '.product-sku'
        ]

        # Try data attributes and SKU-specific classes first
        for selector in sku_selectors:
            elem = soup.select_one(selector)
            if elem:
                sku = elem.get('data-sku') or elem.get_text().strip()
                if sku and sku.lower() != 'sku':
                    product['sku'] = sku
                    break

        # Try JSON-LD structured data
        if 'sku' not in product:
            scripts = soup.find_all('script', type='application/ld+json')
            for script in scripts:
                try:
                    data = json.loads(script.string)
                except (TypeError, ValueError):
                    # Empty or malformed JSON-LD block
                    continue

                # Treat a single object and a list of objects the same way
                items = data if isinstance(data, list) else [data]
                for item in items:
                    if isinstance(item, dict) and 'sku' in item:
                        product['sku'] = item['sku']
                        break

                # Stop scanning scripts once a SKU has been found
                if 'sku' in product:
                    break

        # Extract stock information
        stock_selectors = [
            '.stock',
            '.inventory',
            '.availability',
            '[data-testid="stock"]',
            '.in-stock',
            '.out-of-stock'
        ]

        for selector in stock_selectors:
            stock_elem = soup.select_one(selector)
            if stock_elem:
                stock_text = stock_elem.get_text().strip().lower()
                if 'in stock' in stock_text:
                    product['stock'] = 'In Stock'
                elif 'out of stock' in stock_text:
                    product['stock'] = 'Out of Stock'
                else:
                    product['stock'] = stock_text
                break

        # Extract image URL
        img_selectors = [
            '.product-image img',
            '.product-photo img',
            '.pdp-image img',
            '[data-testid="product-image"] img',
            'img[alt*="Pokemon"]',
            'img[alt*="TCG"]'
        ]

        for selector in img_selectors:
            img_elem = soup.select_one(selector)
            if img_elem:
                src = img_elem.get('src') or img_elem.get('data-src')
                if src:
                    product['image_url'] = urljoin(self.base_url, src)
                    break

        return product

    def is_pokemon_tcg_product(self, product):
        """Check if product is a Pokemon TCG card pack or tin"""
        if not product.get('title'):
            return False

        title = product['title'].lower()
        pokemon_keywords = ['pokemon', 'tcg', 'trading card', 'cards']
        tcg_keywords = ['pack', 'tin', 'box', 'booster', 'collection']

        has_pokemon = any(keyword in title for keyword in pokemon_keywords)
        has_tcg = any(keyword in title for keyword in tcg_keywords)

        return has_pokemon and has_tcg

    def scrape_products(self):
        """Main scraping method"""
        print(f"Starting scrape of: {self.search_url}")

        # Get search results page
        html = self.get_page_content(self.search_url)
        if not html:
            print("Failed to get search results page")
            return []

        # Extract product links
        product_links = self.extract_product_links(html)
        print(f"Found {len(product_links)} potential product links")

        if not product_links:
            print("No product links found. The page structure may have changed.")
            print("First 1000 chars of page:")
            print(html[:1000])
            return []

        # Scrape each product page
        for i, link in enumerate(product_links):
            print(f"Scraping product {i+1}/{len(product_links)}: {link}")

            product_html = self.get_page_content(link)
            if not product_html:
                continue

            product = self.extract_product_info(link, product_html)

            # Filter for Pokemon TCG products
            if self.is_pokemon_tcg_product(product):
                print(f"Found Pokemon TCG product: {product.get('title', 'Unknown')}")
                self.products.append(product)
            else:
                print(f"Skipping non-TCG product: {product.get('title', 'Unknown')}")

            # Be respectful to the server
            time.sleep(1)

        return self.products

    def save_to_json(self, filename=None):
        """Save scraped products to JSON file"""
        if not filename:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"pokemon_tcg_products_{timestamp}.json"

        with open(filename, 'w') as f:
            json.dump(self.products, f, indent=2)

        print(f"Saved {len(self.products)} products to {filename}")
        return filename

def main():
    scraper = PokemonTCGScraper()
    products = scraper.scrape_products()

    if products:
        filename = scraper.save_to_json()
        print("\nScraping completed successfully!")
        print(f"Found {len(products)} Pokemon TCG products")
        print(f"Data saved to: {filename}")
    else:
        print("\nNo products found. This could be due to:")
        print("1. No Pokemon TCG products in stock")
        print("2. Website structure changes")
        print("3. Anti-bot protection")

if __name__ == "__main__":
    main()
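The JSON-LD fallback in `extract_product_info` only looks for a top-level `sku` key. Some pages nest the product object (for example under `@graph`), so a recursive walk may find SKUs that the flat check misses. The following is a minimal sketch of that idea, not part of the commit; `find_sku` and `sku_from_json_ld` are hypothetical helper names, and the input is assumed to be the raw text of each `application/ld+json` script tag.

```python
import json


def find_sku(node):
    """Return the first truthy 'sku' value found anywhere in a decoded JSON tree."""
    if isinstance(node, dict):
        if node.get("sku"):
            return str(node["sku"])
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        # Strings, numbers, booleans and None cannot contain a SKU key.
        return None
    for child in children:
        sku = find_sku(child)
        if sku:
            return sku
    return None


def sku_from_json_ld(script_texts):
    """Scan raw JSON-LD strings and return the first SKU, or None."""
    for text in script_texts:
        try:
            data = json.loads(text)
        except (TypeError, ValueError):
            # Skip empty (None) or malformed script bodies.
            continue
        sku = find_sku(data)
        if sku:
            return sku
    return None
```

With BeautifulSoup, the input could be built as `[s.string for s in soup.find_all('script', type='application/ld+json')]`, which is the same lookup the scraper already performs.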
55
test_barcode.py
Normal file
@@ -0,0 +1,55 @@
#!/usr/bin/env python3
"""
Test script to verify barcode generation functionality
"""

import sys
import os
from pathlib import Path

# Add current directory to path if running in venv
sys.path.insert(0, '.')

try:
    import barcode
    from barcode.writer import ImageWriter
    print("✓ Barcode generation libraries are available")

    # Test barcode generation
    # Full 12-digit UPC-A: 11 data digits plus the check digit
    test_sku = "123456789012"

    upc_generator = barcode.get_barcode_class('upca')
    # Pass the 11 data digits; the library computes the check digit itself
    test_barcode = upc_generator(test_sku[:11], writer=ImageWriter())

    # Create test output directory
    test_dir = Path("test_output")
    test_dir.mkdir(exist_ok=True)

    # Generate test barcode
    barcode_path = test_dir / "test_barcode"
    test_barcode.save(str(barcode_path), options={
        'module_width': 0.2,
        'module_height': 15.0,
        'quiet_zone': 6.5,
        'font_size': 10,
        'text_distance': 5.0,
        'background': 'white',
        'foreground': 'black'
    })

    # The image writer appends the .png extension to the given path
    final_path = f"{barcode_path}.png"
    if os.path.exists(final_path):
        print(f"✓ Test barcode generated successfully: {final_path}")
        print(f"  File size: {os.path.getsize(final_path)} bytes")
    else:
        print("✗ Failed to generate test barcode")
        sys.exit(1)

except ImportError as e:
    print(f"✗ Missing barcode library: {e}")
    sys.exit(1)
except Exception as e:
    print(f"✗ Barcode generation failed: {e}")
    sys.exit(1)

print("✓ All barcode generation tests passed!")
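The test above feeds 11 digits to the UPC-A generator and relies on the library to append the twelfth. For reference, the standard UPC-A check digit can be computed by hand as sketched below. This is an illustrative helper, not part of the commit; `upca_check_digit` is a hypothetical name.

```python
def upca_check_digit(first_eleven):
    """Compute the UPC-A check digit for an 11-digit data string."""
    if len(first_eleven) != 11 or not (first_eleven.isascii() and first_eleven.isdigit()):
        raise ValueError("UPC-A needs exactly 11 ASCII digits before the check digit")

    digits = [int(ch) for ch in first_eleven]
    odd_sum = sum(digits[0::2])   # positions 1, 3, ..., 11 (1-based)
    even_sum = sum(digits[1::2])  # positions 2, 4, ..., 10 (1-based)

    # Odd positions are weighted by 3; the check digit brings the total
    # up to the next multiple of 10.
    return (10 - (3 * odd_sum + even_sum) % 10) % 10
```

For the data digits `12345678901` this gives 2, so the complete code is `123456789012`. A scraped SKU whose twelfth digit disagrees with this calculation is probably not a real UPC-A value, which may be worth checking before a barcode is rendered from it.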