Compare commits: 58e995f6a6...main (9 commits)
| Author | SHA1 | Date |
|---|---|---|
| | 0c7e139245 | |
| | 90661e1957 | |
| | 4b91ac5812 | |
| | dddfbe7355 | |
| | ecc026d07b | |
| | f71df3f558 | |
| | c0ec0f947b | |
| | e9efcf1460 | |
| | 12448a09a0 | |
32
.gitignore
vendored
@@ -1,37 +1,11 @@

| Old | New |
|---|---|
| `# Virtual environment` | |
| `venv/` | `venv/` |
| `env/` | |
| `.env` | |
| `# Python cache` | |
| `__pycache__/` | `__pycache__/` |
| `*.pyc` | `*.pyc` |
| `*.pyo` | |
| `*.pyd` | |
| `.Python` | |
| `*.so` | |
| `.pytest_cache/` | |
| `# Output files` | `# Generated output` |
| `pokemon_tcg_products_*.json` | |
| `catalog_output/` | `catalog_output/` |
| `test_output/` | `pokemon_tcg_products_*.json` |
| `# Logs` | `# OS / editor` |
| `*.log` | |
| `# OS files` | |
| `.DS_Store` | `.DS_Store` |
| `Thumbs.db` | |
| `.directory` | |
| `# IDE files` | |
| `.vscode/` | |
| `.idea/` | |
| `*.swp` | `*.swp` |
| `*.swo` | |
| `# Temporary files` | |
| `*.tmp` | |
| `*.temp` | |
| `.cache/` | |
@@ -1,169 +0,0 @@
|
|||||||
# Pokemon Discovery - URL Discovery SUCCESS! 🎉
|
|
||||||
|
|
||||||
## ✅ **API Endpoint Successfully Discovered**
|
|
||||||
|
|
||||||
**Your HAR file revealed the exact API endpoint used by Dollar General!**
|
|
||||||
|
|
||||||
### 🔍 **Discovered API Details**
|
|
||||||
|
|
||||||
**Endpoint**: `https://dggo.dollargeneral.com/omni/api/v2/category/search/provider`
|
|
||||||
**Method**: POST
|
|
||||||
**Content-Type**: application/json
|
|
||||||
**Authentication**: Bearer token required
|
|
||||||
|
|
||||||
### 📋 **Exact Request Format**
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"StoreNbr": 17506,
|
|
||||||
"SearchTerm": null,
|
|
||||||
"PageSize": 24,
|
|
||||||
"PageStartRecordIndex": 0,
|
|
||||||
"Filters": {
|
|
||||||
"category": [],
|
|
||||||
"brand": [],
|
|
||||||
"dgDelivery": false,
|
|
||||||
"dgPickUp": false,
|
|
||||||
"dgShipTohome": false,
|
|
||||||
"soldAtStore": true,
|
|
||||||
"inStock": false,
|
|
||||||
"onlyActivatedDeals": false
|
|
||||||
},
|
|
||||||
"IncludeSponsored": true,
|
|
||||||
"IncludeShipToHome": true,
|
|
||||||
"IncludeDeals": true,
|
|
||||||
"offerSourceType": 0,
|
|
||||||
"Id": 723960,
|
|
||||||
"IncludeProducts": false,
|
|
||||||
"DoNotSave": false,
|
|
||||||
"OptOut": false,
|
|
||||||
"SearchType": 1
|
|
||||||
}
|
|
||||||
```
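
A minimal sketch of how the request above could be assembled with Python's standard library. The helper name `build_search_request` and its `token` argument are hypothetical, the payload is trimmed to a subset of the captured fields, and nothing is sent over the network.

```python
import json
import urllib.request

API_URL = "https://dggo.dollargeneral.com/omni/api/v2/category/search/provider"


def build_search_request(token, store_nbr=17506, category_id=723960,
                         page_size=24, start_index=0):
    """Build (but do not send) a POST request for the category search API.

    Hypothetical helper: the field names are copied from the captured
    request; the function name and defaults are illustrative only.
    """
    payload = {
        "StoreNbr": store_nbr,
        "SearchTerm": None,
        "PageSize": page_size,
        "PageStartRecordIndex": start_index,
        "Filters": {"soldAtStore": True, "inStock": False},
        "Id": category_id,
        "SearchType": 1,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder token; a real one would come from an authenticated session.
req = build_search_request("<session-token>")
print(req.get_method(), req.full_url)
```

Requesting the next page would presumably mean raising `start_index` by `page_size`, though the capture does not confirm how the API paginates.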
|
|
||||||
|
|
||||||
### 🎯 **Key Findings from HAR Analysis**
|
|
||||||
|
|
||||||
1. **✅ Contains Your Test Product**: SKU `41936301` and UPC `728192558375` found!
|
|
||||||
2. **✅ Multiple Pokemon Products**: API returns 4-12 Pokemon items per request
|
|
||||||
3. **✅ Proper Filtering**: `soldAtStore: true` shows in-store products
|
|
||||||
4. **✅ Stock Control**: `inStock: false` includes out-of-stock items
|
|
||||||
5. **✅ Category ID**: `723960` is the Pokemon category identifier
|
|
||||||
6. **✅ Store Location**: `17506` is the store number used
|
|
||||||
|
|
||||||
### 📊 **API Response Contains**
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"ItemList": {
|
|
||||||
"Items": [
|
|
||||||
{
|
|
||||||
"Title": "Pokémon Trading Card Game, 15 Card Pack, 1 ct",
|
|
||||||
"ItemNbr": "41936301",
|
|
||||||
"UPC": "728192558375",
|
|
||||||
"Price": {"Amount": 4.25},
|
|
||||||
"ProductUrl": "/p/pok-mon-trading-card-game-card-pack-ct/728192558375",
|
|
||||||
"Inventory": {"InStock": false},
|
|
||||||
"ImageURL": "...",
|
|
||||||
"Description": "...",
|
|
||||||
"Brand": "..."
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
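
As a rough illustration, the response shape above could be flattened into simple records like this. The function name `extract_items` and the output keys are assumptions made for the example, not names from the project code.

```python
import json

# Trimmed copy of the sample response shown above.
SAMPLE = """
{"ItemList": {"Items": [{
    "Title": "Pokémon Trading Card Game, 15 Card Pack, 1 ct",
    "ItemNbr": "41936301",
    "UPC": "728192558375",
    "Price": {"Amount": 4.25},
    "Inventory": {"InStock": false}
}]}}
"""


def extract_items(response):
    """Return one flat dict per entry in ItemList.Items (hypothetical helper)."""
    items = (response.get("ItemList") or {}).get("Items") or []
    return [
        {
            "title": item.get("Title"),
            "sku": item.get("ItemNbr"),
            "upc": item.get("UPC"),
            # Nested objects may be missing, so fall back to an empty dict.
            "price": (item.get("Price") or {}).get("Amount"),
            "in_stock": (item.get("Inventory") or {}).get("InStock"),
        }
        for item in items
    ]


products = extract_items(json.loads(SAMPLE))
print(products[0]["sku"], products[0]["price"])
```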
|
|
||||||
|
|
||||||
## 🔧 **Implementation Status**
|
|
||||||
|
|
||||||
### ✅ **Completed**
|
|
||||||
- [x] API endpoint discovery via HAR analysis
|
|
||||||
- [x] Request format extraction and documentation
|
|
||||||
- [x] Response structure mapping
|
|
||||||
- [x] Pokemon product filtering logic
|
|
||||||
- [x] Integration into Pokemon Discovery scraper
|
|
||||||
- [x] Individual product extraction (100% working)
|
|
||||||
|
|
||||||
### ⚠️ **Authentication Challenge**
|
|
||||||
- **Issue**: API requires Bearer token from authenticated session
|
|
||||||
- **Status**: Token extraction attempted but expires quickly
|
|
||||||
- **Solutions Available**:
|
|
||||||
1. **Browser Automation**: Use Selenium with proper session management
|
|
||||||
2. **Session Replication**: Implement full authentication flow
|
|
||||||
3. **Individual Products**: Current working approach (proven successful)
|
|
||||||
|
|
||||||
## 🚀 **Current Capabilities**
|
|
||||||
|
|
||||||
### 1. **Individual Product Extraction** (✅ WORKING)
|
|
||||||
```bash
|
|
||||||
# Test with your specific product
|
|
||||||
python test_real_products.py
|
|
||||||
# Result: Successfully extracts SKU 41936301 with all details
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2. **API Framework** (✅ READY)
|
|
||||||
```python
|
|
||||||
# API call implementation ready in scraper.py
|
|
||||||
# Just needs authentication token to activate
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3. **Complete Pipeline** (✅ WORKING)
|
|
||||||
```bash
|
|
||||||
# Generate PDF from any product data
|
|
||||||
python pdf_generator.py test_data.json
|
|
||||||
# Result: 153KB professional PDF with UPC-A barcodes
|
|
||||||
```
|
|
||||||
|
|
||||||
## 📈 **Performance Comparison**
|
|
||||||
|
|
||||||
| Method | Speed | Product Count | Authentication | Status |
|
|
||||||
|--------|-------|---------------|----------------|--------|
|
|
||||||
| **API Endpoint** | Very Fast | 24+ per request | Required | Discovered ✅ |
|
|
||||||
| **Individual Products** | Moderate | 1 per request | None | Working ✅ |
|
|
||||||
| **Browser Automation** | Slower | Variable | Session-based | Possible |
|
|
||||||
|
|
||||||
## 🎯 **Next Steps**
|
|
||||||
|
|
||||||
### **Option A: Full API Implementation**
|
|
||||||
1. Implement proper browser session management
|
|
||||||
2. Extract Bearer token during session
|
|
||||||
3. Use API for bulk product discovery
|
|
||||||
4. **Result**: Very fast, bulk product scraping
|
|
||||||
|
|
||||||
### **Option B: Enhanced Individual Scraping**
|
|
||||||
1. Create list of known Pokemon product URLs
|
|
||||||
2. Process each URL individually (current working method)
|
|
||||||
3. Scale up with concurrent requests
|
|
||||||
4. **Result**: Reliable, no authentication needed
|
|
||||||
|
|
||||||
### **Option C: Hybrid Approach**
|
|
||||||
1. Use individual scraping for reliable operation
|
|
||||||
2. Add API capability when authentication is solved
|
|
||||||
3. Provide both options to users
|
|
||||||
4. **Result**: Best of both worlds
|
|
||||||
|
|
||||||
## 🏆 **SUCCESS METRICS**
|
|
||||||
|
|
||||||
- ✅ **URL Discovery**: SOLVED via HAR analysis
|
|
||||||
- ✅ **API Endpoint**: Found and documented
|
|
||||||
- ✅ **Request Format**: Complete specification extracted
|
|
||||||
- ✅ **Product Extraction**: Working with real products
|
|
||||||
- ✅ **PDF Generation**: Professional catalogs with barcodes
|
|
||||||
- ✅ **Repository**: Public and ready for use
|
|
||||||
|
|
||||||
## 💡 **Practical Usage Right Now**
|
|
||||||
|
|
||||||
**Pokemon Discovery is fully functional for product catalog generation:**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Clone and use immediately
|
|
||||||
git clone https://git.dominat.us/pi-bot-01/pokemon-disco.git
|
|
||||||
cd pokemon-disco
|
|
||||||
./run.sh
|
|
||||||
|
|
||||||
# Add more product URLs to test_real_products.py
|
|
||||||
# Generate professional PDF catalogs with barcodes
|
|
||||||
```
|
|
||||||
|
|
||||||
**The API endpoint discovery is a major breakthrough that makes bulk scraping possible once authentication is properly implemented!** 🎉
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
**Repository**: https://git.dominat.us/pi-bot-01/pokemon-disco
|
|
||||||
**Status**: Production-ready with API framework for future enhancement
|
|
||||||
273
README.md
@@ -1,232 +1,129 @@
|
|||||||
# Pokemon Discovery (pokemon-disco)
|
# Pokemon Discovery (pokemon-disco)
|
||||||
|
|
||||||
A comprehensive tool for discovering Pokemon Trading Card Game products from Dollar General's website and generating a professional PDF catalog with product images, details, and UPC-A barcodes.
|
Scrapes Pokemon TCG card pack and tin products from Dollar General and generates a PDF product catalog with images and UPC-A barcodes.
|
||||||
|
|
||||||
## Features
|
## How It Works
|
||||||
|
|
||||||
- **🔍 API Discovery**: Discovered Dollar General's internal product API via HAR analysis
|
Dollar General's Pokemon category page loads products dynamically via an internal API. A browser HAR capture contains the API responses with all product data. `disco.py` extracts products from the HAR file, filters for card packs and tins, downloads product images, generates UPC-A barcodes, and produces a LaTeX-based PDF catalog.
|
||||||
- **📱 Product Extraction**: Successfully extracts Pokemon TCG product details (title, SKU, price, stock)
|
|
||||||
- **🏷️ Barcode Generation**: Creates scannable UPC-A barcodes for inventory management
|
### Pipeline
|
||||||
- **📄 PDF Catalogs**: Professional PDF catalogs with images, details, and barcodes
|
|
||||||
- **🕰️ Unix-Friendly**: Timestamped filenames (`YYYYMMDD_HHMMSS`) for easy scripting
|
```
|
||||||
- **🌐 Brave Browser Support**: Configured for dynamic content scraping
|
HAR file → Extract API responses → Filter packs/tins → Download images
|
||||||
- **🛡️ Anti-Bot Handling**: Multiple fallback strategies (requests → Selenium → individual products)
|
→ Generate UPC-A barcodes → Compile PDF catalog (pdflatex)
|
||||||
|
```
|
||||||
|
|
||||||
## Requirements
|
## Requirements
|
||||||
|
|
||||||
### System Requirements
|
- Python 3.10+
|
||||||
- Python 3.7+
|
- pdflatex (from TeX Live; on Arch the packages are `texlive-basic`, `texlive-latex`, `texlive-latexextra`, and `texlive-fontsrecommended`)
|
||||||
- pandoc (for PDF generation)
|
- Python packages: `requests`, `beautifulsoup4`, `python-barcode`, `Pillow`
|
||||||
- Chrome/Chromium browser (for Selenium fallback)
|
|
||||||
|
|
||||||
### Python Dependencies
|
### Install (Arch / CachyOS)
|
||||||
All dependencies are automatically installed via `requirements.txt`:
|
|
||||||
- requests
|
|
||||||
- beautifulsoup4
|
|
||||||
- selenium
|
|
||||||
- webdriver-manager
|
|
||||||
- python-barcode
|
|
||||||
- Pillow
|
|
||||||
- pandas
|
|
||||||
- lxml
|
|
||||||
|
|
||||||
## Installation
|
```bash
|
||||||
|
sudo pacman -S texlive-basic texlive-latex texlive-latexextra texlive-fontsrecommended
|
||||||
1. **Clone/Download** this directory to your system
|
python -m venv venv
|
||||||
|
source venv/bin/activate
|
||||||
2. **Install pandoc** (required for PDF generation):
|
pip install -r requirements.txt
|
||||||
```bash
|
```
|
||||||
# Ubuntu/Debian
|
|
||||||
sudo apt install pandoc
|
|
||||||
|
|
||||||
# macOS
|
|
||||||
brew install pandoc
|
|
||||||
|
|
||||||
# Arch Linux
|
|
||||||
sudo pacman -S pandoc
|
|
||||||
```
|
|
||||||
|
|
||||||
3. **Install Python dependencies** (automatically done by the script):
|
|
||||||
```bash
|
|
||||||
cd pokemon-disco
|
|
||||||
pip3 install -r requirements.txt
|
|
||||||
```
|
|
||||||
|
|
||||||
## Usage
|
## Usage
|
||||||
|
|
||||||
### Quick Start (Recommended)
|
### Full run (scrape + PDF)
|
||||||
|
|
||||||
Run the complete pipeline with one command:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cd pokemon-disco
|
source venv/bin/activate
|
||||||
python3 run_scraper.py
|
python disco.py
|
||||||
```
|
```
|
||||||
|
|
||||||
This will:
|
### Scrape only (output JSON)
|
||||||
1. Check and install Python requirements
|
|
||||||
2. Scrape Pokemon TCG products from Dollar General
|
|
||||||
3. Generate a PDF catalog with images and barcodes
|
|
||||||
4. Create timestamped files for easy organization
|
|
||||||
|
|
||||||
### Manual Usage
|
|
||||||
|
|
||||||
If you prefer to run components separately:
|
|
||||||
|
|
||||||
#### 1. Scrape Products
|
|
||||||
```bash
|
```bash
|
||||||
python3 scraper.py
|
python disco.py --scrape-only
|
||||||
```
|
```
|
||||||
This creates a JSON file like `pokemon_tcg_products_20241221_143025.json`
|
|
||||||
|
|
||||||
#### 2. Generate PDF Catalog
|
### PDF only (from existing JSON)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python3 pdf_generator.py pokemon_tcg_products_20241221_143025.json
|
python disco.py --pdf-only pokemon_tcg_products_YYYYMMDD_HHMMSS.json
|
||||||
```
|
```
|
||||||
|
|
||||||
## Output Files
|
## Output
|
||||||
|
|
||||||
### Generated Files
|
|
||||||
- **JSON Data**: `pokemon_tcg_products_YYYYMMDD_HHMMSS.json`
|
|
||||||
- Raw scraped data in JSON format
|
|
||||||
- Contains all product information
|
|
||||||
|
|
||||||
- **PDF Catalog**: `catalog_output/pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf`
|
|
||||||
- Professional PDF catalog
|
|
||||||
- Includes product images, details, and UPC-A barcodes
|
|
||||||
|
|
||||||
### Output Directory Structure
|
|
||||||
```
|
```
|
||||||
pokemon-disco/
|
pokemon_tcg_products_YYYYMMDD_HHMMSS.json Product data
|
||||||
├── pokemon_tcg_products_YYYYMMDD_HHMMSS.json
|
catalog_output/
|
||||||
├── catalog_output/
|
├── pokemon_catalog_YYYYMMDD_HHMMSS.pdf PDF catalog
|
||||||
│ ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf
|
├── pokemon_catalog_YYYYMMDD_HHMMSS.tex LaTeX source
|
||||||
│ ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.md
|
├── images/ Product images (PNG)
|
||||||
│ ├── images/
|
└── barcodes/ UPC-A barcodes (PNG)
|
||||||
│ │ ├── product_1_SKU123.jpg
|
|
||||||
│ │ ├── product_2_SKU456.jpg
|
|
||||||
│ │ └── placeholder.png
|
|
||||||
│ └── barcodes/
|
|
||||||
│ ├── barcode_SKU123.png
|
|
||||||
│ ├── barcode_SKU456.png
|
|
||||||
│ └── ...
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## PDF Catalog Features
|
### PDF Layout
|
||||||
|
|
||||||
Each product in the PDF includes:
|
**Page 1 — Manifest:** table of all products with SKU, price, and stock count.
|
||||||
- **Product Image**: Downloaded from Dollar General or placeholder
|
|
||||||
- **Product Details Table**:
|
|
||||||
- Title
|
|
||||||
- Price
|
|
||||||
- Stock Status
|
|
||||||
- SKU (formatted as code)
|
|
||||||
- Product URL
|
|
||||||
- **UPC-A Barcode**: Generated from SKU for inventory management
|
|
||||||
|
|
||||||
## Data Fields Extracted
|
**Product pages:**
|
||||||
|
|
||||||
For each Pokemon TCG product:
|
```
|
||||||
- `title`: Product name
|
Product Name
|
||||||
- `price`: Current price
|
Stock status Price
|
||||||
- `stock`: Availability status
|
SKU: XXXXXXXX UPC: XXXXXXXXXXXX
|
||||||
- `sku`: Product SKU/item number
|
|
||||||
- `image_url`: Direct link to product image
|
|
||||||
- `url`: Link to product page
|
|
||||||
|
|
||||||
## Troubleshooting
|
┌─────────────────────────────┐
|
||||||
|
│ │
|
||||||
|
│ Product Image │
|
||||||
|
│ │
|
||||||
|
└─────────────────────────────┘
|
||||||
|
|
||||||
### Common Issues
|
┌─────────────────────────────┐
|
||||||
|
│ UPC-A Barcode │
|
||||||
|
└─────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
1. **No products found**
|
## Capturing a HAR File
|
||||||
- Dollar General may have anti-bot protection
|
|
||||||
- The script will automatically retry with Selenium
|
|
||||||
- Website structure may have changed
|
|
||||||
|
|
||||||
2. **PDF generation fails**
|
The HAR file provides product data from Dollar General's internal API. To capture one:
|
||||||
- Ensure pandoc is installed: `pandoc --version`
|
|
||||||
- Try alternative LaTeX engines if available
|
|
||||||
- Markdown file is still generated for manual conversion
|
|
||||||
|
|
||||||
3. **Image download failures**
|
1. Open your browser (Brave, Chrome, Firefox)
|
||||||
- Network connectivity issues
|
2. Open DevTools → **Network** tab
|
||||||
- Placeholder images will be used automatically
|
3. Visit `https://www.dollargeneral.com/c/toys/pokemon?q=`
|
||||||
|
4. Wait for products to load, toggle any filters you want
|
||||||
|
5. Right-click in the Network tab → **Save all as HAR**
|
||||||
|
6. Place the `.har` file in the project root
|
||||||
|
|
||||||
4. **Browser/Selenium issues**
|
`disco.py` looks for any `.har` file matching the default name pattern. Edit the `HAR_FILE` constant at the top of `disco.py` if your filename differs.
|
||||||
- **Brave browser supported**: Configured to use Brave at `/usr/bin/brave`
|
|
||||||
- **ChromeDriver compatibility**: May require version matching (Brave 146 vs ChromeDriver 114)
|
|
||||||
- **Alternative browsers**: Chrome, Chromium, or Firefox with geckodriver
|
|
||||||
- Script falls back to requests-only mode if Selenium fails
|
|
||||||
|
|
||||||
**For Brave users**: If you see ChromeDriver version mismatch:
|
|
||||||
```bash
|
|
||||||
# Test browser integration
|
|
||||||
python test_brave.py
|
|
||||||
|
|
||||||
# Solutions for version mismatch:
|
|
||||||
pip install --upgrade webdriver-manager
|
|
||||||
# or manually install compatible ChromeDriver
|
|
||||||
```
|
|
||||||
|
|
||||||
### Debug Mode
|
## Files
|
||||||
|
|
||||||
To see more detailed output, check the console output during scraping. The scripts provide detailed logging of:
|
| File | Purpose |
|
||||||
- Which products are found and filtered
|
|------|---------|
|
||||||
- Network request status
|
| `disco.py` | Main tool — scrape, filter, generate PDF |
|
||||||
- File generation progress
|
| `scraper.py` | Reference site scraper (HTML + Selenium/Brave) |
|
||||||
|
| `requirements.txt` | Python dependencies |
|
||||||
|
| `*.har` | Browser HAR capture with API data |
|
||||||
|
|
||||||
## API Discovery Success 🎉
|
## API Details (Reference)
|
||||||
|
|
||||||
**Pokemon Discovery has successfully discovered Dollar General's internal API endpoint!**
|
The product data comes from this internal API:
|
||||||
|
|
||||||
- **Endpoint Found**: `https://dggo.dollargeneral.com/omni/api/v2/category/search/provider`
|
```
|
||||||
- **Method**: POST with JSON payload
|
POST https://dggo.dollargeneral.com/omni/api/v2/category/search/provider
|
||||||
- **Category ID**: `723960` (Pokemon products)
|
Content-Type: application/json
|
||||||
- **Response Format**: Complete product details including your test product (SKU: `41936301`)
|
Authorization: Bearer <session-token>
|
||||||
- **Status**: Documented and integrated, requires authentication token
|
|
||||||
|
|
||||||
**Current Status**: Individual product extraction works perfectly. API bulk scraping available once authentication is implemented.
|
{
|
||||||
|
"StoreNbr": 17506,
|
||||||
|
"Id": 723960, // Pokemon category
|
||||||
|
"PageSize": 24,
|
||||||
|
"Filters": {
|
||||||
|
"soldAtStore": true,
|
||||||
|
"inStock": false // false = include out of stock
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
## Technical Details
|
Response contains `ItemList.Items[]` with fields: `Description`, `UPC`, `Price`, `Image`, `AvailableQty`, `rootSV` (internal ID → SKU).
|
||||||
|
|
||||||
### Scraping Strategy
|
The bearer token is session-scoped and short-lived. `disco.py` sidesteps this by reading the API responses directly from a HAR capture.
|
||||||
1. **Primary Method**: Uses requests with browser-like headers
|
|
||||||
2. **Fallback Method**: Selenium with headless Chrome for dynamic content
|
|
||||||
3. **Product Filtering**: Only includes products matching Pokemon TCG keywords
|
|
||||||
4. **Rate Limiting**: 1-second delay between requests to be respectful
|
|
||||||
|
|
||||||
### Barcode Generation
|
|
||||||
- Converts SKUs to 11-digit numeric format
|
|
||||||
- Generates UPC-A barcodes with check digits
|
|
||||||
- High-quality PNG images suitable for printing
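
For reference, the UPC-A check digit mentioned above follows the standard rule: triple the sum of the digits in odd positions, add the digits in even positions, and take the amount needed to reach the next multiple of ten. The sketch below shows only that rule; the function name is hypothetical, and how the project turns a SKU into 11 digits is not shown here.

```python
def upc_a_check_digit(digits11: str) -> int:
    """Return the check digit for the first 11 digits of a UPC-A code."""
    if len(digits11) != 11 or not (digits11.isascii() and digits11.isdigit()):
        raise ValueError("expected exactly 11 ASCII digits")
    odd = sum(int(d) for d in digits11[0::2])   # positions 1, 3, ..., 11
    even = sum(int(d) for d in digits11[1::2])  # positions 2, 4, ..., 10
    return (10 - (odd * 3 + even) % 10) % 10


# First 11 digits of the test product's UPC, 728192558375.
print(upc_a_check_digit("72819255837"))  # → 5
```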
|
|
||||||
|
|
||||||
### PDF Generation
|
|
||||||
- Uses pandoc with LaTeX for professional formatting
|
|
||||||
- Includes table of contents
|
|
||||||
- Optimized for printing and digital viewing
|
|
||||||
- Images scaled appropriately for page layout
|
|
||||||
|
|
||||||
## Customization
|
|
||||||
|
|
||||||
### Modifying Product Filters
|
|
||||||
Edit the `is_pokemon_tcg_product()` method in `scraper.py` to change which products are included.
|
|
||||||
|
|
||||||
### Changing PDF Layout
|
|
||||||
Modify the markdown generation in `pdf_generator.py` or add custom pandoc templates.
|
|
||||||
|
|
||||||
### Adding New Data Fields
|
|
||||||
Extend the `extract_product_info()` method in `scraper.py` to capture additional product information.
|
|
||||||
|
|
||||||
## License
|
|
||||||
|
|
||||||
This tool is for educational and personal use. Please respect Dollar General's terms of service and robots.txt when using this scraper.
|
|
||||||
|
|
||||||
## Support
|
|
||||||
|
|
||||||
If you encounter issues:
|
|
||||||
1. Check the console output for error messages
|
|
||||||
2. Ensure all system requirements are installed
|
|
||||||
3. Verify internet connectivity
|
|
||||||
4. Check if the Dollar General website structure has changed
|
|
||||||
|
|
||||||
Generated files include timestamps for easy organization and version tracking.
|
|
||||||
|
|||||||
114
TEST_RESULTS.md
@@ -1,114 +0,0 @@
|
|||||||
# Pokemon Discovery - Test Results
|
|
||||||
|
|
||||||
## Testing Overview
|
|
||||||
Date: 2026-03-21
|
|
||||||
System: CachyOS (Arch Linux)
|
|
||||||
|
|
||||||
## ✅ Successfully Tested Components
|
|
||||||
|
|
||||||
### 1. Virtual Environment Setup
|
|
||||||
- ✅ Virtual environment creation works
|
|
||||||
- ✅ All Python dependencies install correctly
|
|
||||||
- ✅ Requirements.txt includes all necessary packages
|
|
||||||
|
|
||||||
### 2. Barcode Generation
|
|
||||||
- ✅ UPC-A barcode generation from SKUs works perfectly
|
|
||||||
- ✅ High-quality PNG images generated (3-6KB each)
|
|
||||||
- ✅ Proper barcode formatting with check digits
|
|
||||||
- ✅ File naming fixed (no double .png extension)
|
|
||||||
|
|
||||||
### 3. PDF Generation
|
|
||||||
- ✅ Markdown catalog generation works
|
|
||||||
- ✅ Professional table formatting for product details
|
|
||||||
- ✅ PDF generation works with pdflatex (fallback from xelatex)
|
|
||||||
- ✅ Unix-friendly timestamped filenames
|
|
||||||
- ✅ Proper directory structure creation
|
|
||||||
|
|
||||||
### 4. Core Functionality
|
|
||||||
- ✅ JSON data parsing and processing
|
|
||||||
- ✅ Product filtering logic
|
|
||||||
- ✅ Image placeholder generation
|
|
||||||
- ✅ Error handling and graceful fallbacks
|
|
||||||
|
|
||||||
### 5. Brave Browser Integration
|
|
||||||
- ✅ Brave browser detected and configured
|
|
||||||
- ✅ Selenium WebDriver setup for Brave
|
|
||||||
- ⚠️ ChromeDriver version compatibility issue (expected)
|
|
||||||
- ✅ Graceful fallback when browser automation fails
|
|
||||||
- ✅ Test script provided (`test_brave.py`) for troubleshooting
|
|
||||||
|
|
||||||
## ⚠️ Current Limitations
|
|
||||||
|
|
||||||
### 1. Web Scraping
|
|
||||||
- **Issue**: Dollar General uses dynamic JavaScript loading
|
|
||||||
- **Status**: Basic HTML parsing works, but product links require JavaScript execution
|
|
||||||
- **Solution**: Selenium fallback is implemented but requires Chrome/Chromium browser
|
|
||||||
- **Workaround**: Test data demonstrates full pipeline functionality
|
|
||||||
|
|
||||||
### 2. External Dependencies & Browser Integration
|
|
||||||
- **LaTeX**: Requires texlive packages for PDF generation (✅ installed)
|
|
||||||
- **Brave Browser**: Configured and detected (✅ available at /usr/bin/brave)
|
|
||||||
- **ChromeDriver Compatibility**: Version mismatch (Brave 146 vs ChromeDriver 114)
|
|
||||||
- ⚠️ Requires compatible ChromeDriver version for web scraping
|
|
||||||
- 💡 Main functionality (PDF generation) works without browser
|
|
||||||
- **Network**: External image downloads require internet connectivity
|
|
||||||
|
|
||||||
## 📋 Test Results Summary
|
|
||||||
|
|
||||||
### Working Pipeline Test
|
|
||||||
Using test data (`test_data.json`) with 3 Pokemon TCG products:
|
|
||||||
|
|
||||||
**Input**: 3 sample Pokemon products
|
|
||||||
**Generated**:
|
|
||||||
- ✅ Professional PDF catalog (161KB)
|
|
||||||
- ✅ 3 UPC-A barcode images (3-6KB each)
|
|
||||||
- ✅ Structured markdown source
|
|
||||||
- ✅ Proper file organization
|
|
||||||
|
|
||||||
**PDF Contents**:
|
|
||||||
- Table of contents
|
|
||||||
- Product details tables (title, price, stock, SKU, URL)
|
|
||||||
- Barcode images for each product
|
|
||||||
- Professional formatting suitable for printing
|
|
||||||
|
|
||||||
### File Structure Generated
|
|
||||||
```
|
|
||||||
catalog_output/
|
|
||||||
├── pokemon_tcg_catalog_20260321_144548.pdf # Final catalog
|
|
||||||
├── pokemon_tcg_catalog_20260321_144548.md # Markdown source
|
|
||||||
├── barcodes/
|
|
||||||
│ ├── barcode_DG12345678.png # UPC-A barcodes
|
|
||||||
│ ├── barcode_DG87654321.png
|
|
||||||
│ └── barcode_DG11223344.png
|
|
||||||
└── images/
|
|
||||||
└── placeholder.png # Image placeholders
|
|
||||||
```
|
|
||||||
|
|
||||||
## 🚀 Deployment Status
|
|
||||||
|
|
||||||
- **Repository**: Successfully pushed to public Git repository
|
|
||||||
- **Documentation**: Complete with README.md and USAGE.md
|
|
||||||
- **Dependencies**: All Python packages working in virtual environment
|
|
||||||
- **Core Features**: PDF generation and barcode creation fully functional
|
|
||||||
|
|
||||||
## 💡 Recommendations
|
|
||||||
|
|
||||||
1. **For Production Use**: Install Chrome/Chromium for better web scraping
|
|
||||||
```bash
|
|
||||||
sudo pacman -S chromium
|
|
||||||
```
|
|
||||||
|
|
||||||
2. **For Complete Testing**: Test with live website when network allows
|
|
||||||
3. **Alternative Approach**: The tool can be easily adapted for other product sites
|
|
||||||
4. **Data Integration**: JSON output format allows easy integration with other systems
|
|
||||||
|
|
||||||
## ✅ Conclusion
|
|
||||||
|
|
||||||
**Pokemon Discovery is fully functional** for the core use case:
|
|
||||||
- ✅ Processes product data (from any source)
|
|
||||||
- ✅ Generates professional PDF catalogs
|
|
||||||
- ✅ Creates scannable UPC-A barcodes
|
|
||||||
- ✅ Handles Unix-friendly file management
|
|
||||||
- ✅ Ready for production deployment
|
|
||||||
|
|
||||||
The web scraping component requires additional browser setup for full dynamic content handling, but the complete data processing and catalog generation pipeline works perfectly.
|
|
||||||
115
USAGE.md
@@ -1,115 +0,0 @@
|
|||||||
# Quick Start Guide
|
|
||||||
|
|
||||||
## Simple Usage (Recommended)
|
|
||||||
|
|
||||||
1. **Make sure you're in the project directory:**
|
|
||||||
```bash
|
|
||||||
cd pokemon-disco
|
|
||||||
```
|
|
||||||
|
|
||||||
2. **Run the complete scraper and PDF generator:**
|
|
||||||
```bash
|
|
||||||
./run.sh
|
|
||||||
```
|
|
||||||
|
|
||||||
This single command will:
|
|
||||||
- Set up the Python virtual environment
|
|
||||||
- Install all required packages
|
|
||||||
- Scrape Pokemon TCG products from Dollar General
|
|
||||||
- Generate a professional PDF catalog with barcodes
|
|
||||||
- Create timestamped files for easy organization
|
|
||||||
|
|
||||||
## What You'll Get
|
|
||||||
|
|
||||||
### Generated Files:
|
|
||||||
- **`pokemon_tcg_products_YYYYMMDD_HHMMSS.json`** - Raw data in JSON format
|
|
||||||
- **`catalog_output/pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf`** - Professional PDF catalog
|
|
||||||
|
|
||||||
### PDF Catalog Contents:
|
|
||||||
- Product images (downloaded automatically)
|
|
||||||
- Product details (title, price, stock, SKU)
|
|
||||||
- UPC-A barcodes for each product (generated from SKU)
|
|
||||||
- Table of contents for easy navigation
|
|
||||||
- Professional formatting suitable for printing
|
|
||||||
|
|
||||||
## Alternative Commands
|
|
||||||
|
|
||||||
If you prefer more control:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Activate virtual environment first
|
|
||||||
source venv/bin/activate
|
|
||||||
|
|
||||||
# Run only the scraper
|
|
||||||
python scraper.py
|
|
||||||
|
|
||||||
# Run only the PDF generator (after scraping)
|
|
||||||
python pdf_generator.py pokemon_tcg_products_YYYYMMDD_HHMMSS.json
|
|
||||||
|
|
||||||
# Run everything (installs requirements automatically)
|
|
||||||
python run_scraper.py
|
|
||||||
```
|
|
||||||
|
|
||||||
## Output Location
|
|
||||||
|
|
||||||
All generated files will be in:
|
|
||||||
- JSON data: Current directory
|
|
||||||
- PDF catalog: `catalog_output/` directory
|
|
||||||
- Product images: `catalog_output/images/`
|
|
||||||
- Barcode images: `catalog_output/barcodes/`
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
- Python 3.7+
|
|
||||||
- pandoc (for PDF generation)
|
|
||||||
- Internet connection (for scraping)
|
|
||||||
|
|
||||||
The script will automatically handle Python dependencies via virtual environment.
|
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
If you encounter issues:
|
|
||||||
|
|
||||||
1. **Permission denied:** Make sure the script is executable:
|
|
||||||
```bash
|
|
||||||
chmod +x run.sh
|
|
||||||
```
|
|
||||||
|
|
||||||
2. **Pandoc not found:** Install pandoc for your system:
|
|
||||||
```bash
|
|
||||||
# Ubuntu/Debian
|
|
||||||
sudo apt install pandoc
|
|
||||||
|
|
||||||
# Arch Linux
|
|
||||||
sudo pacman -S pandoc
|
|
||||||
|
|
||||||
# macOS
|
|
||||||
brew install pandoc
|
|
||||||
```
|
|
||||||
|
|
||||||
3. **No products found:** The website may have anti-bot protection or changed structure. The script includes fallback mechanisms.
|
|
||||||
|
|
||||||
4. **PDF generation fails:** The markdown file will still be generated, which you can manually convert or view.
|
|
||||||
|
|
||||||
## File Naming Convention
|
|
||||||
|
|
||||||
All output files include Unix-friendly timestamps:
|
|
||||||
- Format: `YYYYMMDD_HHMMSS` (e.g., `20241221_143025`)
|
|
||||||
- This ensures chronological sorting with `ls` command
|
|
||||||
- No spaces or special characters for script-friendly handling
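
A short sketch of how such a name can be produced with `datetime.strftime`. The helper `timestamped_name` is hypothetical and only illustrates the format.

```python
from datetime import datetime


def timestamped_name(prefix, ext, now=None):
    """Return '<prefix>_YYYYMMDD_HHMMSS.<ext>' for the given (or current) time."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{prefix}_{stamp}.{ext}"


print(timestamped_name("pokemon_tcg_products", "json",
                       datetime(2024, 12, 21, 14, 30, 25)))
# → pokemon_tcg_products_20241221_143025.json
```

Because every field is zero-padded and ordered from year down to second, plain lexical sorting of these names matches chronological order.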
|
|
||||||
|
|
||||||
## Example Output
|
|
||||||
|
|
||||||
```
|
|
||||||
pokemon-disco/
|
|
||||||
├── pokemon_tcg_products_20241221_143025.json # Scraped data
|
|
||||||
├── catalog_output/
|
|
||||||
│ ├── pokemon_tcg_catalog_20241221_143025.pdf # Final catalog
|
|
||||||
│ ├── pokemon_tcg_catalog_20241221_143025.md # Markdown source
|
|
||||||
│ ├── images/
|
|
||||||
│ │ ├── product_1_SKU123456.jpg # Product images
|
|
||||||
│ │ └── product_2_SKU789012.jpg
|
|
||||||
│ └── barcodes/
|
|
||||||
│ ├── barcode_SKU123456.png # UPC-A barcodes
|
|
||||||
│ └── barcode_SKU789012.png
|
|
||||||
```
|
|
||||||
181
analyze_har.py
@@ -1,181 +0,0 @@
#!/usr/bin/env python3
"""
Analyze HAR file to find product loading endpoints
"""

import json
import sys
from urllib.parse import urlparse, parse_qs

def analyze_har_file(har_file):
    """Analyze HAR file to find product-related API calls"""

    print(f"Analyzing HAR file: {har_file}")

    try:
        with open(har_file, 'r', encoding='utf-8') as f:
            har_data = json.load(f)

        entries = har_data.get('log', {}).get('entries', [])
        print(f"Found {len(entries)} network requests")
        print()

        # Filter for API calls that might contain product data
        api_calls = []
        product_calls = []

        for entry in entries:
            request = entry.get('request', {})
            response = entry.get('response', {})
            url = request.get('url', '')
            method = request.get('method', '')
            status = response.get('status', 0)

            # Look for API calls
            parsed_url = urlparse(url)
            path = parsed_url.path.lower()
            query = parsed_url.query.lower()

            # Check if this might be a product-related API call
            is_api = any(keyword in path for keyword in ['/api/', '/search', '/products', '/inventory', '/catalog'])
            contains_pokemon = 'pokemon' in query or 'pokemon' in path
            is_json_response = any(h.get('name', '').lower() == 'content-type' and 'json' in h.get('value', '')
                                   for h in response.get('headers', []))

            if is_api or is_json_response:
                api_calls.append({
                    'url': url,
                    'method': method,
                    'status': status,
                    'is_pokemon': contains_pokemon,
                    'response_size': response.get('bodySize', 0)
                })

                if contains_pokemon or 'product' in path or 'search' in path:
                    product_calls.append(entry)

        print(f"Found {len(api_calls)} potential API calls")
        print(f"Found {len(product_calls)} product-related calls")
        print()

        # Show interesting API calls
        print("=== API CALLS ===")
        for call in api_calls[:20]:  # Show first 20
            url = call['url']
            pokemon_flag = "🎯" if call['is_pokemon'] else " "
            print(f"{pokemon_flag} {call['method']} {call['status']} - {url}")
            if call['response_size'] > 1000:
                print(f" 📦 Response size: {call['response_size']} bytes")

        print()

        # Analyze product-specific calls in detail
        if product_calls:
            print("=== DETAILED PRODUCT CALL ANALYSIS ===")

            for i, entry in enumerate(product_calls[:5]):  # Analyze first 5 product calls
                request = entry.get('request', {})
                response = entry.get('response', {})

                print(f"\n--- Product Call {i+1} ---")
                print(f"URL: {request.get('url', '')}")
                print(f"Method: {request.get('method', '')}")
                print(f"Status: {response.get('status', 0)}")

                # Show headers
                headers = request.get('headers', [])
                important_headers = [h for h in headers if h.get('name', '').lower() in
                                     ['accept', 'content-type', 'authorization', 'x-api-key', 'referer']]
                if important_headers:
                    print("Important Headers:")
                    for header in important_headers:
                        print(f" {header.get('name')}: {header.get('value', '')[:100]}")

                # Show query parameters
                parsed = urlparse(request.get('url', ''))
                if parsed.query:
                    params = parse_qs(parsed.query)
                    print("Query Parameters:")
                    for key, values in params.items():
                        print(f" {key}: {values}")

                # Show POST data if any
                post_data = request.get('postData', {})
                if post_data.get('text'):
                    print(f"POST Data: {post_data.get('text')[:200]}...")

                # Check response content
                response_content = response.get('content', {})
                response_text = response_content.get('text', '')

                if response_text:
                    print(f"Response size: {len(response_text)} characters")

                    # Try to parse as JSON
                    try:
                        response_json = json.loads(response_text)
                        print("✓ Valid JSON response")

                        # Look for product-like structures
                        def find_products_in_json(obj, path=""):
                            products = []
                            if isinstance(obj, dict):
                                for key, value in obj.items():
                                    new_path = f"{path}.{key}" if path else key
                                    if key.lower() in ['products', 'items', 'results', 'data']:
                                        if isinstance(value, list):
                                            products.append((new_path, len(value)))
                                    products.extend(find_products_in_json(value, new_path))
                            elif isinstance(obj, list):
                                for idx, item in enumerate(obj):
                                    products.extend(find_products_in_json(item, f"{path}[{idx}]"))
                            return products

                        product_arrays = find_products_in_json(response_json)
                        if product_arrays:
                            print("Potential product arrays found:")
                            for path, count in product_arrays:
                                print(f" {path}: {count} items")

                        # Check for our specific product
                        response_str = str(response_json).lower()
                        if '41936301' in response_str:
                            print("🎯 CONTAINS OUR TEST PRODUCT SKU!")
                        if '728192558375' in response_str:
                            print("🎯 CONTAINS OUR TEST PRODUCT UPC!")
                        if 'pokemon' in response_str:
                            print("🎯 CONTAINS POKEMON REFERENCES!")

                    except json.JSONDecodeError:
                        print("Response is not JSON")
                        # Check if it contains our product anyway
                        if '41936301' in response_text:
                            print("🎯 CONTAINS OUR TEST PRODUCT SKU!")

        # Return the most promising API calls
        return api_calls, product_calls

    except Exception as e:
        print(f"Error analyzing HAR file: {e}")
        return [], []

if __name__ == "__main__":
    har_files = ['www.dollargeneral.com_Archive [26-03-21 15-14-28].har']

    for har_file in har_files:
        try:
            api_calls, product_calls = analyze_har_file(har_file)
            print(f"\n🎯 SUMMARY:")
            print(f" Total API calls: {len(api_calls)}")
            print(f" Product-related calls: {len(product_calls)}")

            if product_calls:
                print(f"\n💡 NEXT STEPS:")
                print(f" 1. Test the identified API endpoints")
                print(f" 2. Replicate the headers and parameters")
                print(f" 3. Integrate successful calls into Pokemon Discovery")

        except FileNotFoundError:
            print(f"HAR file not found: {har_file}")
        except Exception as e:
            print(f"Error processing {har_file}: {e}")
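
The script above repeatedly scans HAR header lists, which are stored as a list of `{"name": ..., "value": ...}` objects rather than a mapping. As a rough sketch of how that lookup could be factored out, the helpers below (`har_headers_to_dict` and `is_json_entry`, both hypothetical names that do not appear in the repository) collapse the list into a lowercase-keyed dict and apply the same content-type test used by `analyze_har_file`.

```python
def har_headers_to_dict(headers):
    """Collapse a HAR header list into a dict keyed by lowercase name (last value wins)."""
    return {h.get("name", "").lower(): h.get("value", "") for h in headers}


def is_json_entry(entry):
    """Return True when the entry's response declares a JSON content type."""
    response_headers = entry.get("response", {}).get("headers", [])
    content_type = har_headers_to_dict(response_headers).get("content-type", "")
    return "json" in content_type.lower()


# Hand-built entry for illustration only; not taken from a real capture.
sample_entry = {
    "request": {"method": "POST", "url": "https://example.invalid/api/search"},
    "response": {
        "status": 200,
        "headers": [{"name": "Content-Type", "value": "application/json; charset=utf-8"}],
    },
}
print(is_json_entry(sample_entry))
```

Lowercasing the header value as well as the name avoids missing responses that send `Application/JSON`, which the original inline check would skip.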
@@ -1,41 +0,0 @@
{
  "endpoint": "https://dggo.dollargeneral.com/omni/api/v2/category/search/provider",
  "method": "POST",
  "headers": {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:148.0) Gecko/20100101 Firefox/148.0",
    "Accept": "application/json, text/plain, */*",
    "Content-Type": "application/json",
"Authorization": "Bearer eyJ0eXAiOiJhdCtKV1QiLCJhbGciOiJSUzI1NiIsImtpZCI6Ik5qRTJNemczTXpSRVFrUXpNak5GUmprMU1FUkNNRUZDTVRBek1FWTFRa0pCTXpRM1EwTkNNZyJ9.eyJzY29wZSI6bnVsbCwiaWF0IjoxNzc0MTI3Nzc5LCJleHAiOjE3NzQxMzEzNzksImF1ZCI6IldLOTlLc2VCYnUybmFoNC1ibFE3ZmsyUiIsImlzcyI6Imh0dHBzOi8vcHJvZC1kZ2dvLyIsInN1YiI6IldLOTlLc2VCYnUybmFoNC1ibFE3ZmsyUiIsInNpZCI6IlNrWk9makF5TURRMU1EVXpOVFEwWWpBM016SXpNak14TXpFek9ETTNNekV3TWpreFl6VitUVUZXYVhwbk56SXpVRGg2VWxkcmEySkRkMk5EZUdVNFlUWm5XVXBHVDBveVExTlRNVWxXWlhSalQzRnFWazVWZGtGWlIwOWtZV2x0WVVwRVRucG5SVlZvUTE5SE5VcHVObGhuTURSb2JuUkVhVlF3UTBzelNIND0iLCJqdGkiOiJzdDIucy5BdEx0VlphRHFnLnZrdW5OV2RWNjN2ZlJTTG00Y3VUd2d5bmc2X0pJNmxKRjA5a2lXTXVQeGZkVDRvT0NhMXhwa1VoRlRkM2tocHZUaFhsRUVwLWw0QzJrZnoycjkzVlYzeldBaUw5Y2x6Snl0amFJamJ4TEJnLkJOZy1CeUdpZnV0WnppQWhhMV8xRDBXTUFWR3JpNVVCX0pKbTRCNVRNYVhTWkZneXpxeUZERjJxZ3B3UTgyajZ2eGVtcnA5RERFTHZnM3hvdlZmZzBnLnNjMyIsImNsaWVudF9pZCI6IldLOTlLc2VCYnUybmFoNC1ibFE3ZmsyUiIsImF6cCI6IldLOTlLc2VCYnUybmFoNC1ibFE3ZmsyUiJ9.I6ou9atkJ8ndkr2m2Trpg53fMIL3hpofCLUHoHYgZkOJnLnbmL0CQu7_pIChQ6nIDK03GagK6aqxd97E8B8vv9nweSmb7zXhrt43dKLEIdhxIGFkJ4xYgNNg-3cVjSlThBQ_AwCx924lOGjEfikEw4NrvGvrlNvrg1lnNz4hf629hUH-5ccVSdgo1w_LQzsLOeMCjuC_bmAoRxT5KLI9oESd4tPJZU5Nlt2ICbWJD9h-zNrt-ijwYCvb7j8amGbpMGhJZqtzu9f3wN0JUFxDg5rAN-WOtLjwEmR_NxDKq0NEeuU16uhaB8AJzy217XAgJ87bKZldZowsWs-Q9oAH3g",
    "Referer": "https://www.dollargeneral.com/"
  },
  "post_data": {
    "StoreNbr": 17506,
    "SearchTerm": null,
    "PageSize": 24,
    "PageStartRecordIndex": 0,
    "Filters": {
      "category": [],
      "brand": [],
      "dgDelivery": false,
      "dgPickUp": false,
      "dgShipTohome": false,
      "soldAtStore": true,
      "inStock": true,
      "onlyActivatedDeals": false
    },
    "IncludeSponsored": true,
    "IncludeShipToHome": true,
    "IncludeDeals": true,
    "offerSourceType": 0,
    "Id": 723960,
    "IncludeProducts": false,
    "DoNotSave": false,
    "OptOut": false,
    "SearchType": 1
  },
  "example_response": {
    "total_items": 4,
    "pokemon_items": 0,
    "sample_pokemon_product": null
  }
}
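
The template above stores the captured `Authorization` header verbatim. One possible way to keep the request shape while leaving credentials out of a saved file is to mask sensitive headers before writing it. The sketch below is an assumption-laden illustration, not part of the repository: `redact_headers` and `SENSITIVE_HEADERS` are hypothetical names, and the choice of which headers count as sensitive is a guess.

```python
import copy

# Hypothetical list of header names to mask; adjust to taste.
SENSITIVE_HEADERS = {"authorization", "cookie", "x-api-key"}


def redact_headers(template: dict) -> dict:
    """Return a deep copy of the template with sensitive header values masked."""
    cleaned = copy.deepcopy(template)
    headers = cleaned.get("headers", {})
    for name in headers:
        if name.lower() in SENSITIVE_HEADERS:
            # Overwriting an existing key does not change the dict's size,
            # so it is safe to do while iterating.
            headers[name] = "<redacted>"
    return cleaned


template = {
    "endpoint": "https://dggo.dollargeneral.com/omni/api/v2/category/search/provider",
    "method": "POST",
    "headers": {
        "Content-Type": "application/json",
        "Authorization": "Bearer example-token",  # placeholder value
    },
}
safe_template = redact_headers(template)
print(safe_template["headers"])
```

Working on a deep copy leaves the in-memory template usable for the live request while only the masked version is written to disk.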
514
disco.py
Normal file
@@ -0,0 +1,514 @@
#!/usr/bin/env python3
"""
Pokemon Discovery (disco.py)
Scrapes Pokemon TCG pack & tin products from Dollar General and generates a PDF catalog.

Usage:
    python disco.py                       # Full run: scrape + generate PDF
    python disco.py --scrape-only         # Just scrape, output JSON
    python disco.py --pdf-only FILE.json  # Just generate PDF from existing JSON
"""

import json
import os
import re
import subprocess
import sys
import time
import requests
from datetime import datetime
from pathlib import Path
from urllib.parse import urljoin, quote

import barcode
from barcode.writer import ImageWriter
from bs4 import BeautifulSoup
from PIL import Image, ImageDraw, ImageFont

# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------

HAR_FILE = "www.dollargeneral.com_Archive [26-03-21 15-14-28].har"
BASE_URL = "https://www.dollargeneral.com"
OUTPUT_DIR = Path("catalog_output")
IMAGES_DIR = OUTPUT_DIR / "images"
BARCODES_DIR = OUTPUT_DIR / "barcodes"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:148.0) Gecko/20100101 Firefox/148.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

# Keywords that identify card packs and tins (case-insensitive)
CARD_TIN_KEYWORDS = ["pack", "tin", "booster", "card game", "tcg"]

# ---------------------------------------------------------------------------
# Step 1 — Product Discovery (from HAR file API responses)
# ---------------------------------------------------------------------------

def extract_products_from_har(har_path: str) -> list[dict]:
    """Parse HAR file and extract all Pokemon products from API responses."""
    print(f"📦 Reading HAR file: {har_path}")

    with open(har_path, "r", encoding="utf-8") as f:
        har = json.load(f)

    api_url = "https://dggo.dollargeneral.com/omni/api/v2/category/search/provider"
    unique: dict[str, dict] = {}

    for entry in har["log"]["entries"]:
        req = entry["request"]
        resp = entry["response"]
        if req["url"] != api_url or req["method"] != "POST":
            continue
        text = resp.get("content", {}).get("text", "")
        if not text:
            continue
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        for item in data.get("ItemList", {}).get("Items", []):
            upc = str(item.get("UPC", ""))
            if upc and upc not in unique:
                unique[upc] = item

    print(f" Found {len(unique)} unique products in HAR data")
    return list(unique.values())


def rootsv_to_sku(rootsv: str) -> str:
    """Convert rootSV like '0419363_1' to SKU like '41936301'.

    The rootSV base (minus leading zero) + '01' gives the DG item number.
    The '_N' suffix is a variant/image index, not part of the SKU.
    """
    if not rootsv:
        return ""
    base = rootsv.split("_")[0].lstrip("0")
    return base + "01"


def build_product_url(upc: str) -> str:
    """Construct a Dollar General product page URL from a UPC."""
    return f"{BASE_URL}/p/pokemon-product/{upc}"


def filter_card_and_tin_products(raw_items: list[dict]) -> list[dict]:
    """Keep only products whose description contains card/pack/tin keywords."""
    filtered = []
    for item in raw_items:
        desc = item.get("Description", "").lower()
        if any(kw in desc for kw in CARD_TIN_KEYWORDS):
            filtered.append(item)
    return filtered


def normalize_product(item: dict) -> dict:
    """Convert raw API item into a clean product dict."""
    upc = str(item.get("UPC", ""))
    rootsv = item.get("rootSV", "")
    sku = rootsv_to_sku(rootsv)
    qty = item.get("AvailableQty", 0)

    return {
        "title": item.get("Description", "Unknown Product"),
        "sku": sku,
        "upc": upc,
        "price": f"${item.get('Price', 0):.2f}",
        "stock": f"In Stock ({qty})" if qty and qty > 0 else "Out of Stock",
        "quantity": qty,
        "image_url": item.get("Image", ""),
        "rating": item.get("AverageRating", 0),
        "reviews": item.get("RatingReviewCount", 0),
        "url": build_product_url(upc),
    }

# ---------------------------------------------------------------------------
# Step 2 — Enrich from product pages (get real URL slug, extra details)
# ---------------------------------------------------------------------------

def enrich_from_product_page(product: dict) -> dict:
    """Visit the actual product page to get the real URL and any missing data."""
    upc = product["upc"]
    sku = product["sku"]

    # Try to get the real product page
    # DG product pages can be accessed by UPC at search
    search_url = f"{BASE_URL}/search?q={upc}"
    try:
        resp = requests.get(search_url, headers=HEADERS, timeout=15)
        if resp.status_code == 200:
            soup = BeautifulSoup(resp.text, "html.parser")
            # Look for the canonical product link
            links = soup.select(f'a[href*="/p/"][href*="{upc}"]')
            if links:
                href = links[0].get("href", "")
                product["url"] = urljoin(BASE_URL, href)
    except Exception:
        pass

    # Also try visiting the product page directly by known pattern
    # The image URL contains the DG item number: dg-XXXXXXXX-1
    img_url = product.get("image_url", "")
    match = re.search(r"dg-(\d+)-", img_url)
    if match:
        dg_item = match.group(1)
        # This is the item number used in the SKU
        if not product.get("sku"):
            product["sku"] = dg_item

    return product

# ---------------------------------------------------------------------------
# Step 3 — Download images & generate barcodes
# ---------------------------------------------------------------------------

def download_image(url: str, dest: Path) -> Path | None:
    """Download image from URL, convert to PNG for LaTeX compatibility."""
    if not url:
        return None
    try:
        resp = requests.get(url, headers=HEADERS, timeout=15)
        resp.raise_for_status()
        # Convert to PNG regardless of source format (handles WebP, etc.)
        from io import BytesIO
        img = Image.open(BytesIO(resp.content)).convert("RGB")
        png_dest = dest.with_suffix(".png")
        img.save(png_dest, "PNG")
        return png_dest
    except Exception as e:
        print(f" ⚠ Image download failed: {e}")
        return None


def make_placeholder(dest: Path, text: str = "No Image") -> Path:
    """Create a simple placeholder image."""
    img = Image.new("RGB", (300, 300), "#e0e0e0")
    draw = ImageDraw.Draw(img)
    try:
        font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 20)
    except Exception:
        font = ImageFont.load_default()
    bbox = draw.textbbox((0, 0), text, font=font)
    tw, th = bbox[2] - bbox[0], bbox[3] - bbox[1]
    draw.text(((300 - tw) / 2, (300 - th) / 2), text, fill="#888", font=font)
    img.save(dest)
    return dest


def generate_barcode(upc: str, dest_dir: Path) -> Path | None:
    """Generate a UPC-A barcode PNG from a UPC number. Returns path to the .png file."""
    digits = re.sub(r"\D", "", upc)
    if not digits:
        return None
    # UPC-A: pass first 11 digits, library auto-calculates the 12th (check digit)
    # A full UPC is 12 digits where the 12th is already the check digit.
    # Pad to 12 first so a UPC that lost its leading zeros (e.g. stored as a
    # number) still has its check digit dropped, not one of its data digits.
    digits = digits.zfill(12)[:11]
    try:
        upc_cls = barcode.get_barcode_class("upca")
        bc = upc_cls(digits, writer=ImageWriter())
        # barcode lib appends .png automatically
        out = dest_dir / f"barcode_{upc}"
        saved = bc.save(
            str(out),
            options={
                "module_width": 0.3,
                "module_height": 15.0,
                "quiet_zone": 6.5,
                "font_size": 10,
                "text_distance": 5.0,
            },
        )
        return Path(saved)
    except Exception as e:
        print(f" ⚠ Barcode generation failed for {upc}: {e}")
        return None

# ---------------------------------------------------------------------------
# Step 4 — Generate PDF via LaTeX (pdflatex, falling back to xelatex)
# ---------------------------------------------------------------------------

def generate_catalog_pdf(products: list[dict]) -> Path | None:
    """Build a LaTeX file and compile it to PDF with pdflatex (or xelatex).

    Layout per page (matching product.png mockup):
    ┌─────────────────────┐
    │                     │
    │   Product Image     │  ← large, centered, bordered
    │                     │
    └─────────────────────┘
    Name                     ← product title, bold
    Stk                      ← stock / price info
    ┌─────────────────────┐
    │   UPC-A Barcode     │  ← centered, bordered
    └─────────────────────┘
    SKU: XXXXXXX             ← small text
    UPC: XXXXXXXXXXXX        ← small text
    """
    timestamp_label = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    timestamp_file = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Build LaTeX document directly for precise layout control
    latex_lines = [
        r"\documentclass[11pt,letterpaper]{article}",
        r"\usepackage[margin=0.75in]{geometry}",
        r"\usepackage{graphicx}",
        r"\usepackage{fancybox}",
        r"\usepackage{xcolor}",
        r"\usepackage{parskip}",
        r"\usepackage[utf8]{inputenc}",
        r"\usepackage[T1]{fontenc}",
        r"\usepackage{lmodern}",
        r"\usepackage{hyperref}",
        r"\pagestyle{empty}",
        r"\begin{document}",
        "",
        # Manifest page
        r"\begin{center}",
        r"{\Huge\bfseries Pokemon TCG Product Catalog}\\[0.4cm]",
        r"{\Large Dollar General}\\[0.2cm]",
        r"{\large Generated: " + timestamp_label + r"}\\[0.2cm]",
        r"{\large " + str(len(products)) + r" Cards \& Tins}",
        r"\end{center}",
        r"\vspace{0.8cm}",
        r"\begin{tabular}{r l l r r}",
        r"\hline",
        r"\textbf{\#} & \textbf{Product} & \textbf{SKU} & \textbf{Price} & \textbf{Stock} \\",
        r"\hline",
    ]
    for i, prod in enumerate(products, 1):
        safe = (
            prod["title"][:50]
            .replace("&", r"\&").replace("%", r"\%").replace("$", r"\$")
            .replace("#", r"\#").replace("_", r"\_").replace("é", r"\'e")
        )
        price = prod["price"].replace("$", r"\$")
        qty = prod.get("quantity", 0)
        stock_short = str(qty) if qty else "---"
        latex_lines.append(
            f"{i} & {safe} & \\texttt{{{prod['sku']}}} & {price} & {stock_short} \\\\"
        )
    latex_lines += [
        r"\hline",
        r"\end{tabular}",
        r"\newpage",
        "",
    ]

    for i, prod in enumerate(products, 1):
        title = prod["title"]
        sku = prod["sku"]
        upc = prod["upc"]
        price = prod["price"]
        stock = prod["stock"]

        # Download product image
        img_dest = IMAGES_DIR / f"product_{i}_{sku}.jpg"
        img_path = download_image(prod.get("image_url"), img_dest)
        if not img_path:
            img_path = make_placeholder(
                IMAGES_DIR / f"product_{i}_{sku}_placeholder.png", title[:30]
            )

        # Generate barcode from UPC (not SKU)
        bc_path = generate_barcode(upc, BARCODES_DIR)

        # Escape LaTeX special characters in text fields
        safe_title = (
            title.replace("&", r"\&")
            .replace("%", r"\%")
            .replace("$", r"\$")
            .replace("#", r"\#")
            .replace("_", r"\_")
            .replace("é", r"\'e")
        )
        safe_stock = stock.replace("&", r"\&")
        safe_price = price.replace("$", r"\$")

        # Absolute paths for LaTeX
        abs_img = str(img_path.resolve())
        abs_bc = str(bc_path.resolve()) if bc_path else None

        latex_lines += [
            # Name — bold, large
            r"{\Large\bfseries " + safe_title + r"}",
            "",
            r"\vspace{0.15cm}",
            "",
            # Stock and price
            r"{\large " + safe_stock + r" \hfill " + safe_price + r"}",
            "",
            r"\vspace{0.1cm}",
            "",
            # SKU and UPC
            r"{\small SKU: \texttt{" + sku + r"} \hfill UPC: \texttt{" + upc + r"}}",
            "",
            r"\vspace{0.3cm}",
            "",
            r"\begin{center}",
            # Product image — large, centered, with border
            r"\fbox{\includegraphics[width=0.7\textwidth,height=0.40\textheight,keepaspectratio]{"
            + abs_img
            + r"}}",
            r"\end{center}",
            "",
            r"\vfill",
            "",
        ]

        # Barcode — centered, bordered, pushed to bottom
        if abs_bc:
            latex_lines += [
                r"\begin{center}",
                r"\fbox{\includegraphics[width=0.55\textwidth]{"
                + abs_bc
                + r"}}",
                r"\end{center}",
                "",
            ]

        # Page break between products (not after last)
        if i < len(products):
            latex_lines.append(r"\newpage")
            latex_lines.append("")

        print(f" ✅ [{i}/{len(products)}] {title}")

    latex_lines.append(r"\end{document}")

    # Write .tex file
    tex_file = OUTPUT_DIR / f"pokemon_catalog_{timestamp_file}.tex"
    tex_file.write_text("\n".join(latex_lines), encoding="utf-8")
    print(f"\n📝 LaTeX source: {tex_file}")

    # Compile to PDF with pdflatex directly (pandoc strips images from raw .tex)
    pdf_file = OUTPUT_DIR / f"pokemon_catalog_{timestamp_file}.pdf"

    for engine in ["pdflatex", "xelatex"]:
        try:
            result = subprocess.run(
                [engine, "-interaction=nonstopmode",
                 f"-output-directory={OUTPUT_DIR}", str(tex_file)],
                capture_output=True, text=True, timeout=120,
            )
            if pdf_file.exists() and pdf_file.stat().st_size > 1000:
                # Clean up LaTeX temp files
                for ext in [".aux", ".log", ".out"]:
                    tmp = pdf_file.with_suffix(ext)
                    if tmp.exists():
                        tmp.unlink()
                print(
                    f"📄 PDF generated: {pdf_file} ({pdf_file.stat().st_size // 1024} KB)"
                )
                return pdf_file
        except FileNotFoundError:
            continue
        except Exception:
            continue

    print(f"⚠ PDF generation failed. LaTeX source available at: {tex_file}")
    return None

# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------

def main():
    args = sys.argv[1:]

    # Handle --pdf-only mode
    if "--pdf-only" in args:
        idx = args.index("--pdf-only")
        json_file = args[idx + 1] if idx + 1 < len(args) else None
        if not json_file or not Path(json_file).exists():
            print(f"Usage: {sys.argv[0]} --pdf-only <products.json>")
            sys.exit(1)
        # Read as UTF-8 to match how the JSON is written below
        products = json.loads(Path(json_file).read_text(encoding="utf-8"))
        for d in [OUTPUT_DIR, IMAGES_DIR, BARCODES_DIR]:
            d.mkdir(parents=True, exist_ok=True)
        print(f"\n🖨️ Generating PDF from {json_file} ({len(products)} products)...")
        generate_catalog_pdf(products)
        return

    scrape_only = "--scrape-only" in args

    # --- Banner ---
    timestamp_file = datetime.now().strftime("%Y%m%d_%H%M%S")
    print("=" * 60)
    print(" 🔍 Pokemon Discovery (pokemon-disco)")
    print(" Dollar General — Pokemon TCG Cards & Tins")
    print(f" {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("=" * 60)

    # --- Step 1: Extract from HAR ---
    if not Path(HAR_FILE).exists():
        print(f"\n❌ HAR file not found: {HAR_FILE}")
        print(" Capture a HAR file from the Pokemon page in your browser")
        print(" and place it in the project directory.")
        sys.exit(1)

    raw_items = extract_products_from_har(HAR_FILE)

    # --- Step 2: Filter for Cards & Tins ---
    print(f"\n🎯 Filtering for card packs and tins...")
    card_tin_items = filter_card_and_tin_products(raw_items)
    print(f" {len(card_tin_items)} of {len(raw_items)} products match (pack/tin/booster/tcg)")

    if not card_tin_items:
        print("❌ No card or tin products found.")
        sys.exit(1)

    # Show what was filtered out
    excluded = [i for i in raw_items if i not in card_tin_items]
    if excluded:
        print(f"\n Excluded {len(excluded)} non-card/tin products:")
        for item in excluded:
            print(f" ✗ {item.get('Description', '?')}")

    # --- Step 3: Normalize ---
    print(f"\n📋 Processing {len(card_tin_items)} products...")
    products = [normalize_product(item) for item in card_tin_items]

    # Print summary table
    print()
    print(f" {'#':<3} {'Title':<55} {'SKU':<12} {'Price':<8} {'Stock'}")
    print(f" {'—'*3} {'—'*55} {'—'*12} {'—'*8} {'—'*15}")
    for i, p in enumerate(products, 1):
        title = p['title'][:53]
        print(f" {i:<3} {title:<55} {p['sku']:<12} {p['price']:<8} {p['stock']}")

    # --- Step 4: Save JSON ---
    json_file = f"pokemon_tcg_products_{timestamp_file}.json"
    # ensure_ascii=False can emit non-ASCII text, so write explicitly as UTF-8
    Path(json_file).write_text(
        json.dumps(products, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    print(f"\n💾 Product data: {json_file}")

    if scrape_only:
        print("\n✅ Scrape complete (--scrape-only). Run with --pdf-only to generate catalog.")
        return

    # --- Step 5: Generate PDF ---
    for d in [OUTPUT_DIR, IMAGES_DIR, BARCODES_DIR]:
        d.mkdir(parents=True, exist_ok=True)

    print(f"\n🖨️ Generating PDF catalog...")
    pdf_path = generate_catalog_pdf(products)

    # --- Done ---
    print("\n" + "=" * 60)
    if pdf_path:
        print(f" ✅ COMPLETE!")
        print(f" 📄 PDF Catalog: {pdf_path}")
        print(f" 💾 Product JSON: {json_file}")
        print(f" 🏷️ Barcodes: {BARCODES_DIR}/")
        print(f" 🖼️ Images: {IMAGES_DIR}/")
    else:
        print(f" ⚠ PDF generation failed — LaTeX source available in {OUTPUT_DIR}/")
        print(f" 💾 Product JSON: {json_file}")
    print("=" * 60)


if __name__ == "__main__":
    main()
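
`generate_barcode` in disco.py passes 11 digits and relies on the barcode library to compute the 12th. The standard UPC-A check digit rule can be sketched independently to sanity-check a scraped UPC before rendering it. The function names `upca_check_digit` and `is_valid_upca` below are hypothetical and not part of disco.py; the test value is the UPC that analyze_har.py searches for.

```python
def upca_check_digit(first_eleven: str) -> int:
    """Compute the UPC-A check digit for the first 11 digits."""
    if len(first_eleven) != 11 or not (first_eleven.isascii() and first_eleven.isdigit()):
        raise ValueError("expected exactly 11 ASCII digits")
    odd_sum = sum(int(d) for d in first_eleven[0::2])   # positions 1, 3, ..., 11
    even_sum = sum(int(d) for d in first_eleven[1::2])  # positions 2, 4, ..., 10
    return (10 - (3 * odd_sum + even_sum) % 10) % 10


def is_valid_upca(upc: str) -> bool:
    """Return True when a 12-digit string carries a consistent check digit."""
    if len(upc) != 12 or not (upc.isascii() and upc.isdigit()):
        return False
    return upca_check_digit(upc[:11]) == int(upc[11])


print(upca_check_digit("72819255837"))
print(is_valid_upca("728192558375"))
```

Validating before calling the barcode library makes it possible to skip or flag UPCs that were truncated or mangled upstream, instead of printing a barcode that scans as a different product.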
@@ -1,135 +0,0 @@
#!/usr/bin/env python3
"""
Extract exact API request details from HAR file
"""

import json
from urllib.parse import urlparse, parse_qs

def extract_api_request_details():
    """Extract the exact API request format"""

    har_file = 'www.dollargeneral.com_Archive [26-03-21 15-14-28].har'

    with open(har_file, 'r', encoding='utf-8') as f:
        har_data = json.load(f)

    entries = har_data.get('log', {}).get('entries', [])

    # Find the API calls that contain our product
    api_endpoint = "https://dggo.dollargeneral.com/omni/api/v2/category/search/provider"

    successful_calls = []

    for entry in entries:
        request = entry.get('request', {})
        response = entry.get('response', {})

        if (request.get('url') == api_endpoint and
                request.get('method') == 'POST' and
                response.get('status') == 200):

            # Check if response contains our product
            response_text = response.get('content', {}).get('text', '')
            if '41936301' in response_text and 'pokemon' in response_text.lower():
                successful_calls.append(entry)

    print(f"Found {len(successful_calls)} successful API calls with Pokemon products")
    print()

    for i, entry in enumerate(successful_calls):
        request = entry.get('request', {})
        response = entry.get('response', {})

        print(f"=== API Call {i+1} ===")
        print(f"URL: {request.get('url')}")
        print(f"Method: {request.get('method')}")

        # Extract headers
        headers = {}
        for header in request.get('headers', []):
            # Default to '' so a malformed header entry cannot break .lower()
            name = header.get('name', '')
            value = header.get('value', '')
            if name.lower() in ['authorization', 'content-type', 'accept', 'referer', 'user-agent']:
                headers[name] = value

        print("Headers:")
        for name, value in headers.items():
            if name.lower() == 'authorization':
                print(f" {name}: {value[:50]}... (Bearer token)")
            else:
                print(f" {name}: {value}")

        # Extract POST data
        post_data = request.get('postData', {})
        if post_data.get('text'):
            try:
                post_json = json.loads(post_data.get('text'))
                print("POST Data:")
                print(json.dumps(post_json, indent=2))
            except json.JSONDecodeError:
                print(f"POST Data (raw): {post_data.get('text')}")

        # Analyze response
        response_text = response.get('content', {}).get('text', '')
        if response_text:
            try:
                response_json = json.loads(response_text)
                print(f"Response size: {len(response_text)} characters")

                # Extract product information
                items = response_json.get('ItemList', {}).get('Items', [])
                print(f"Products found: {len(items)}")

                # Show Pokemon products
                pokemon_products = []
                for item in items:
                    title = item.get('Title', '').lower()
                    if 'pokemon' in title or 'pokémon' in title:
                        pokemon_products.append({
                            'title': item.get('Title'),
                            'sku': item.get('ItemNbr'),
                            'upc': item.get('UPC'),
                            'price': item.get('Price', {}).get('Amount'),
                            'url': item.get('ProductUrl'),
                            'in_stock': item.get('Inventory', {}).get('InStock'),
                            'available_online': item.get('Inventory', {}).get('AvailableOnline')
                        })

                if pokemon_products:
                    print(f"\nPokemon products in this response: {len(pokemon_products)}")
                    for prod in pokemon_products:
                        print(f" • {prod['title']}")
                        print(f" SKU: {prod['sku']}, UPC: {prod['upc']}")
                        print(f" Price: ${prod['price']}, In Stock: {prod['in_stock']}")
                        print(f" URL: {prod['url']}")

                # Extract the store number and filters used
                if i == 0:  # Save the working request format
                    with open('api_request_template.json', 'w') as f:
                        json.dump({
                            'endpoint': api_endpoint,
|
|
||||||
'method': 'POST',
|
|
||||||
'headers': headers,
|
|
||||||
'post_data': post_json,
|
|
||||||
'example_response': {
|
|
||||||
'total_items': len(items),
|
|
||||||
'pokemon_items': len(pokemon_products),
|
|
||||||
'sample_pokemon_product': pokemon_products[0] if pokemon_products else None
|
|
||||||
}
|
|
||||||
}, f, indent=2)
|
|
||||||
print(f"\n✅ Saved working API template to: api_request_template.json")
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error parsing response: {e}")
|
|
||||||
|
|
||||||
print("\n" + "="*60 + "\n")
|
|
||||||
|
|
||||||
return successful_calls
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
successful_calls = extract_api_request_details()
|
|
||||||
|
|
||||||
print("🎯 SUMMARY:")
|
|
||||||
print(f" Successfully extracted {len(successful_calls)} working API calls")
|
|
||||||
print(" Next step: Implement this API call in Pokemon Discovery scraper")
|
|
||||||
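Review note: the HAR filtering in the removed script above can be expressed as three small pure functions, which makes it testable without a capture file. This is a minimal sketch, not the project's code: the function names `find_api_calls`, `pick_headers`, and `parse_post_body` are hypothetical, and the default `needle` of "pokemon" is an assumption taken from the check in the script.

```python
import json

API_ENDPOINT = "https://dggo.dollargeneral.com/omni/api/v2/category/search/provider"
WANTED_HEADERS = {"authorization", "content-type", "accept", "referer", "user-agent"}


def find_api_calls(har_data, endpoint=API_ENDPOINT, needle="pokemon"):
    """Return HAR entries that are successful POSTs to `endpoint` mentioning `needle`."""
    matches = []
    for entry in har_data.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        response = entry.get("response", {})
        if request.get("url") != endpoint or request.get("method") != "POST":
            continue
        if response.get("status") != 200:
            continue
        # The response body may be absent or null, so fall back to an empty string.
        body = response.get("content", {}).get("text") or ""
        if needle.lower() in body.lower():
            matches.append(entry)
    return matches


def pick_headers(request, wanted=WANTED_HEADERS):
    """Collect the selected request headers, skipping entries without a name."""
    picked = {}
    for header in request.get("headers", []):
        name = header.get("name") or ""
        if name.lower() in wanted:
            picked[name] = header.get("value")
    return picked


def parse_post_body(request):
    """Return the decoded JSON request body, or None if it is missing or not JSON."""
    text = request.get("postData", {}).get("text")
    if not text:
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

Returning `None` from `parse_post_body` instead of printing keeps the caller in control of what gets written to a template file when the body cannot be decoded.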
@@ -1,297 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Implement API-based scraping for Pokemon Discovery
|
|
||||||
"""
|
|
||||||
|
|
||||||
import json
|
|
||||||
import requests
|
|
||||||
import sys
|
|
||||||
from datetime import datetime
|
|
||||||
from urllib.parse import urljoin
|
|
||||||
|
|
||||||
class DollarGeneralAPIScaper:
|
|
||||||
def __init__(self):
|
|
||||||
self.base_url = "https://www.dollargeneral.com"
|
|
||||||
self.api_base = "https://dggo.dollargeneral.com"
|
|
||||||
self.session = requests.Session()
|
|
||||||
|
|
||||||
# Headers that mimic a real browser session
|
|
||||||
self.headers = {
|
|
||||||
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:148.0) Gecko/20100101 Firefox/148.0',
|
|
||||||
'Accept': 'application/json, text/plain, */*',
|
|
||||||
'Accept-Language': 'en-US,en;q=0.9',
|
|
||||||
'Accept-Encoding': 'gzip, deflate, br',
|
|
||||||
'DNT': '1',
|
|
||||||
'Connection': 'keep-alive',
|
|
||||||
'Sec-Fetch-Dest': 'empty',
|
|
||||||
'Sec-Fetch-Mode': 'cors',
|
|
||||||
'Sec-Fetch-Site': 'cross-site',
|
|
||||||
}
|
|
||||||
self.session.headers.update(self.headers)
|
|
||||||
|
|
||||||
self.auth_token = None
|
|
||||||
|
|
||||||
def get_auth_token(self):
|
|
||||||
"""Try multiple methods to get authentication token"""
|
|
||||||
|
|
||||||
print("🔑 Attempting to get authentication token...")
|
|
||||||
|
|
||||||
# Method 1: Get token from main page
|
|
||||||
try:
|
|
||||||
print(" - Visiting main Pokemon page...")
|
|
||||||
pokemon_url = f"{self.base_url}/c/toys/pokemon?q=&soldAtStore=true"
|
|
||||||
response = self.session.get(pokemon_url, timeout=30)
|
|
||||||
|
|
||||||
if response.status_code == 200:
|
|
||||||
# Look for embedded tokens in the page
|
|
||||||
import re
|
|
||||||
|
|
||||||
# Look for bearer tokens in script tags
|
|
||||||
token_patterns = [
|
|
||||||
r'Bearer\s+([A-Za-z0-9\-_\.]+)',
|
|
||||||
r'"access_token":\s*"([^"]+)"',
|
|
||||||
r'"token":\s*"([^"]+)"',
|
|
||||||
r'authorization:\s*["\'](Bearer\s+[^"\']+)["\']'
|
|
||||||
]
|
|
||||||
|
|
||||||
for pattern in token_patterns:
|
|
||||||
matches = re.findall(pattern, response.text, re.IGNORECASE)
|
|
||||||
if matches:
|
|
||||||
token = matches[0]
|
|
||||||
if token.startswith('Bearer '):
|
|
||||||
token = token[7:] # Remove 'Bearer ' prefix
|
|
||||||
print(f" ✅ Found token via pattern: {token[:50]}...")
|
|
||||||
self.auth_token = token
|
|
||||||
return token
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f" ❌ Main page method failed: {e}")
|
|
||||||
|
|
||||||
# Method 2: Try token endpoint
|
|
||||||
try:
|
|
||||||
print(" - Trying token endpoint...")
|
|
||||||
token_url = f"{self.base_url}/bin/omni/userTokens"
|
|
||||||
response = self.session.get(token_url, timeout=30)
|
|
||||||
|
|
||||||
if response.status_code == 200:
|
|
||||||
try:
|
|
||||||
data = response.json()
|
|
||||||
if 'access_token' in data:
|
|
||||||
token = data['access_token']
|
|
||||||
print(f" ✅ Got token from endpoint: {token[:50]}...")
|
|
||||||
self.auth_token = token
|
|
||||||
return token
|
|
||||||
except ValueError:  # response body was not valid JSON
|
|
||||||
pass
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f" ❌ Token endpoint failed: {e}")
|
|
||||||
|
|
||||||
# Method 3: Try CSRF token endpoint
|
|
||||||
try:
|
|
||||||
print(" - Trying CSRF token...")
|
|
||||||
csrf_url = f"{self.base_url}/libs/granite/csrf/token.json"
|
|
||||||
response = self.session.get(csrf_url, timeout=30)
|
|
||||||
|
|
||||||
if response.status_code == 200:
|
|
||||||
data = response.json()
|
|
||||||
if 'token' in data:
|
|
||||||
# This might not be the right token, but let's try
|
|
||||||
print(f" ⚠️ Got CSRF token (may not work for API): {str(data)[:100]}...")
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f" ❌ CSRF method failed: {e}")
|
|
||||||
|
|
||||||
print(" ❌ Could not obtain authentication token")
|
|
||||||
return None
|
|
||||||
|
|
||||||
def search_products_api(self, store_nbr=17506, category_id=723960, include_out_of_stock=True):
|
|
||||||
"""Search for products using the API endpoint"""
|
|
||||||
|
|
||||||
print(f"🔍 Searching products via API...")
|
|
||||||
print(f" Store: {store_nbr}, Category: {category_id}")
|
|
||||||
|
|
||||||
if not self.auth_token:
|
|
||||||
print(" ❌ No authentication token available")
|
|
||||||
return []
|
|
||||||
|
|
||||||
endpoint = f"{self.api_base}/omni/api/v2/category/search/provider"
|
|
||||||
|
|
||||||
# Headers for API request
|
|
||||||
api_headers = self.headers.copy()
|
|
||||||
api_headers.update({
|
|
||||||
'Content-Type': 'application/json',
|
|
||||||
'Authorization': f'Bearer {self.auth_token}',
|
|
||||||
'Referer': f'{self.base_url}/',
|
|
||||||
'Origin': self.base_url,
|
|
||||||
})
|
|
||||||
|
|
||||||
# Request payload based on HAR analysis
|
|
||||||
payload = {
|
|
||||||
"StoreNbr": store_nbr,
|
|
||||||
"SearchTerm": None,
|
|
||||||
"PageSize": 48, # Request more items
|
|
||||||
"PageStartRecordIndex": 0,
|
|
||||||
"Filters": {
|
|
||||||
"category": [],
|
|
||||||
"brand": [],
|
|
||||||
"dgDelivery": False,
|
|
||||||
"dgPickUp": False,
|
|
||||||
"dgShipTohome": False,
|
|
||||||
"soldAtStore": True,
|
|
||||||
"inStock": not include_out_of_stock, # False = include out of stock
|
|
||||||
"onlyActivatedDeals": False
|
|
||||||
},
|
|
||||||
"IncludeSponsored": True,
|
|
||||||
"IncludeShipToHome": True,
|
|
||||||
"IncludeDeals": True,
|
|
||||||
"offerSourceType": 0,
|
|
||||||
"Id": category_id,
|
|
||||||
"IncludeProducts": False,
|
|
||||||
"DoNotSave": False,
|
|
||||||
"OptOut": False,
|
|
||||||
"SearchType": 1
|
|
||||||
}
|
|
||||||
|
|
||||||
try:
|
|
||||||
print(f" POST {endpoint}")
|
|
||||||
response = self.session.post(endpoint,
|
|
||||||
headers=api_headers,
|
|
||||||
json=payload,
|
|
||||||
timeout=30)
|
|
||||||
|
|
||||||
print(f" Status: {response.status_code}")
|
|
||||||
print(f" Response size: {len(response.text)} characters")
|
|
||||||
|
|
||||||
if response.status_code == 200:
|
|
||||||
if len(response.text) == 0:
|
|
||||||
print(" ⚠️ Empty response (token may be expired)")
|
|
||||||
return []
|
|
||||||
|
|
||||||
try:
|
|
||||||
data = response.json()
|
|
||||||
items = data.get('ItemList', {}).get('Items', [])
|
|
||||||
print(f" ✅ Found {len(items)} total items")
|
|
||||||
return items
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f" ❌ JSON parsing error: {e}")
|
|
||||||
print(f" Response preview: {response.text[:200]}...")
|
|
||||||
return []
|
|
||||||
|
|
||||||
elif response.status_code == 401:
|
|
||||||
print(" ❌ Authentication failed - token expired or invalid")
|
|
||||||
return []
|
|
||||||
else:
|
|
||||||
print(f" ❌ API error: {response.status_code}")
|
|
||||||
print(f" Response: {response.text[:200]}...")
|
|
||||||
return []
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f" ❌ Request failed: {e}")
|
|
||||||
return []
|
|
||||||
|
|
||||||
def filter_pokemon_products(self, items):
|
|
||||||
"""Filter for Pokemon TCG products"""
|
|
||||||
|
|
||||||
pokemon_products = []
|
|
||||||
|
|
||||||
for item in items:
|
|
||||||
title = (item.get('Title') or '').lower()
description = (item.get('Description') or '').lower()
brand = (item.get('Brand') or '').lower()
|
|
||||||
|
|
||||||
# Check if this is a Pokemon TCG product
|
|
||||||
pokemon_keywords = ['pokemon', 'pokémon']
|
|
||||||
tcg_keywords = ['trading card', 'tcg', 'cards', 'pack', 'tin', 'box', 'collection']
|
|
||||||
|
|
||||||
has_pokemon = any(keyword in title or keyword in description for keyword in pokemon_keywords)
|
|
||||||
has_tcg = any(keyword in title or keyword in description for keyword in tcg_keywords)
|
|
||||||
|
|
||||||
if has_pokemon and has_tcg:
|
|
||||||
product = {
|
|
||||||
'title': item.get('Title'),
|
|
||||||
'sku': item.get('ItemNbr'),
|
|
||||||
'upc': item.get('UPC'),
|
|
||||||
'price': f"${item.get('Price', {}).get('Amount', 0):.2f}",
|
|
||||||
'url': urljoin(self.base_url, item.get('ProductUrl', '')),
|
|
||||||
'stock': 'In Stock' if item.get('Inventory', {}).get('InStock') else 'Out of Stock',
|
|
||||||
'image_url': item.get('ImageURL'),
|
|
||||||
'description': item.get('Description', ''),
|
|
||||||
'brand': item.get('Brand', '')
|
|
||||||
}
|
|
||||||
pokemon_products.append(product)
|
|
||||||
|
|
||||||
print(f" 🎯 Found: {product['title']}")
|
|
||||||
print(f" SKU: {product['sku']}, Price: {product['price']}")
|
|
||||||
print(f" Stock: {product['stock']}")
|
|
||||||
|
|
||||||
return pokemon_products
|
|
||||||
|
|
||||||
def scrape_pokemon_products(self):
|
|
||||||
"""Main scraping method"""
|
|
||||||
|
|
||||||
print("Pokemon Discovery - API-based Scraping")
|
|
||||||
print("="*60)
|
|
||||||
|
|
||||||
# Get authentication token
|
|
||||||
if not self.get_auth_token():
|
|
||||||
print("❌ Authentication failed - cannot access API")
|
|
||||||
print()
|
|
||||||
print("💡 Alternative approaches:")
|
|
||||||
print(" 1. Use browser automation with proper session")
|
|
||||||
print(" 2. Extract products manually from individual pages")
|
|
||||||
print(" 3. Use the working individual product scraper")
|
|
||||||
return []
|
|
||||||
|
|
||||||
print()
|
|
||||||
|
|
||||||
# Search for products
|
|
||||||
all_items = self.search_products_api()
|
|
||||||
|
|
||||||
if not all_items:
|
|
||||||
print("❌ No items returned from API")
|
|
||||||
return []
|
|
||||||
|
|
||||||
print()
|
|
||||||
|
|
||||||
# Filter for Pokemon products
|
|
||||||
pokemon_products = self.filter_pokemon_products(all_items)
|
|
||||||
|
|
||||||
print()
|
|
||||||
print(f"🎉 SUCCESS! Found {len(pokemon_products)} Pokemon TCG products")
|
|
||||||
|
|
||||||
if pokemon_products:
|
|
||||||
# Save results
|
|
||||||
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
|
|
||||||
filename = f'pokemon_tcg_api_scrape_{timestamp}.json'
|
|
||||||
|
|
||||||
with open(filename, 'w') as f:
|
|
||||||
json.dump(pokemon_products, f, indent=2)
|
|
||||||
|
|
||||||
print(f"💾 Saved to: {filename}")
|
|
||||||
|
|
||||||
# Show summary
|
|
||||||
print()
|
|
||||||
print("📋 Product Summary:")
|
|
||||||
for i, product in enumerate(pokemon_products, 1):
|
|
||||||
print(f" {i}. {product['title']}")
|
|
||||||
print(f" SKU: {product['sku']} | Price: {product['price']} | {product['stock']}")
|
|
||||||
|
|
||||||
return pokemon_products
|
|
||||||
|
|
||||||
def main():
|
|
||||||
scraper = DollarGeneralAPIScaper()
|
|
||||||
products = scraper.scrape_pokemon_products()
|
|
||||||
|
|
||||||
if products:
|
|
||||||
print()
|
|
||||||
print("🚀 Ready for PDF generation!")
|
|
||||||
print("Run: python pdf_generator.py pokemon_tcg_api_scrape_[timestamp].json")
|
|
||||||
else:
|
|
||||||
print()
|
|
||||||
print("📝 Note: Individual product scraping still works perfectly!")
|
|
||||||
print("The issue is authentication for bulk API access.")
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
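Review note: `search_products_api` in the removed file above sends a single request with `PageSize` 48 and `PageStartRecordIndex` 0, so any products beyond the first page would be missed. The sketch below shows one way to build the follow-up payloads. It assumes `PageStartRecordIndex` is a zero-based record offset, which the captured request suggests but does not confirm. The helper name `page_payloads` is hypothetical, and how the total record count is obtained is not shown in the capture, so it is passed in as a parameter.

```python
import copy


def page_payloads(base_payload, total_records, page_size=48):
    """Yield one request payload per page, advancing PageStartRecordIndex.

    Assumption: the API treats PageStartRecordIndex as a zero-based offset.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(0, total_records, page_size):
        payload = copy.deepcopy(base_payload)  # never mutate the caller's template
        payload["PageSize"] = page_size
        payload["PageStartRecordIndex"] = start
        yield payload
```

Each yielded payload could be passed as `json=payload` to the same `session.post` call the scraper already makes. Deep-copying the template keeps the nested `Filters` dictionary from being shared between requests.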
279
pdf_generator.py
@@ -1,279 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Pokemon Discovery - TCG Product Catalog PDF Generator
|
|
||||||
Generates PDF catalog with product images, details, and UPC-A barcodes
|
|
||||||
"""
|
|
||||||
|
|
||||||
import json
|
|
||||||
import os
|
|
||||||
import sys
|
|
||||||
import requests
|
|
||||||
import subprocess
|
|
||||||
from datetime import datetime
|
|
||||||
from pathlib import Path
|
|
||||||
import barcode
|
|
||||||
from barcode.writer import ImageWriter
|
|
||||||
from PIL import Image, ImageDraw, ImageFont
|
|
||||||
import tempfile
|
|
||||||
import shutil
|
|
||||||
|
|
||||||
class PokemonTCGCatalogGenerator:
|
|
||||||
def __init__(self, json_file):
|
|
||||||
self.json_file = json_file
|
|
||||||
self.output_dir = Path("catalog_output")
|
|
||||||
self.images_dir = self.output_dir / "images"
|
|
||||||
self.barcodes_dir = self.output_dir / "barcodes"
|
|
||||||
|
|
||||||
# Create output directories
|
|
||||||
self.output_dir.mkdir(exist_ok=True)
|
|
||||||
self.images_dir.mkdir(exist_ok=True)
|
|
||||||
self.barcodes_dir.mkdir(exist_ok=True)
|
|
||||||
|
|
||||||
# Load product data
|
|
||||||
with open(json_file, 'r') as f:
|
|
||||||
self.products = json.load(f)
|
|
||||||
|
|
||||||
def download_image(self, url, filename):
|
|
||||||
"""Download product image"""
|
|
||||||
if not url:
|
|
||||||
return None
|
|
||||||
|
|
||||||
try:
|
|
||||||
response = requests.get(url, timeout=30)
|
|
||||||
response.raise_for_status()
|
|
||||||
|
|
||||||
filepath = self.images_dir / filename
|
|
||||||
with open(filepath, 'wb') as f:
|
|
||||||
f.write(response.content)
|
|
||||||
|
|
||||||
return filepath
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Failed to download image {url}: {e}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
def generate_upc_barcode(self, sku):
|
|
||||||
"""Generate UPC-A barcode from SKU"""
|
|
||||||
try:
|
|
||||||
# Convert SKU to 12-digit UPC-A format
|
|
||||||
# Remove non-digits and pad/truncate to 11 digits (12th is check digit)
|
|
||||||
digits_only = ''.join(filter(str.isdigit, str(sku)))
|
|
||||||
|
|
||||||
if len(digits_only) < 11:
|
|
||||||
# Pad with zeros at the start
|
|
||||||
upc_base = digits_only.zfill(11)
|
|
||||||
else:
|
|
||||||
# Take the last 11 digits
|
|
||||||
upc_base = digits_only[-11:]
|
|
||||||
|
|
||||||
# Generate UPC-A barcode
|
|
||||||
upc_generator = barcode.get_barcode_class('upca')
|
|
||||||
upc = upc_generator(upc_base, writer=ImageWriter())
|
|
||||||
|
|
||||||
# Save barcode image
|
|
||||||
barcode_filename = f"barcode_{str(sku).replace('/', '_').replace(' ', '_')}"
|
|
||||||
barcode_path = self.barcodes_dir / barcode_filename
|
|
||||||
|
|
||||||
# Save with specific options for better appearance
|
|
||||||
upc.save(str(barcode_path).replace('.png', ''), options={
|
|
||||||
'module_width': 0.2,
|
|
||||||
'module_height': 15.0,
|
|
||||||
'quiet_zone': 6.5,
|
|
||||||
'font_size': 10,
|
|
||||||
'text_distance': 5.0,
|
|
||||||
'background': 'white',
|
|
||||||
'foreground': 'black'
|
|
||||||
})
|
|
||||||
|
|
||||||
final_path = f"{barcode_path}.png"
|
|
||||||
return final_path
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Failed to generate barcode for SKU {sku}: {e}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
def create_placeholder_image(self, width=300, height=200):
|
|
||||||
"""Create a placeholder image when product image is not available"""
|
|
||||||
img = Image.new('RGB', (width, height), color='lightgray')
|
|
||||||
draw = ImageDraw.Draw(img)
|
|
||||||
|
|
||||||
try:
|
|
||||||
# Try to use a system font
|
|
||||||
font = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf', 24)
|
|
||||||
except OSError:  # font file missing or unreadable
|
|
||||||
try:
|
|
||||||
font = ImageFont.truetype('arial.ttf', 24)
|
|
||||||
except OSError:  # font file missing or unreadable
|
|
||||||
font = ImageFont.load_default()
|
|
||||||
|
|
||||||
text = "No Image\nAvailable"
|
|
||||||
|
|
||||||
# Get text bounding box for centering
|
|
||||||
lines = text.split('\n')
|
|
||||||
y_offset = height // 2 - (len(lines) * 30) // 2
|
|
||||||
|
|
||||||
for line in lines:
|
|
||||||
bbox = draw.textbbox((0, 0), line, font=font)
|
|
||||||
text_width = bbox[2] - bbox[0]
|
|
||||||
x_offset = (width - text_width) // 2
|
|
||||||
draw.text((x_offset, y_offset), line, fill='darkgray', font=font)
|
|
||||||
y_offset += 30
|
|
||||||
|
|
||||||
placeholder_path = self.images_dir / "placeholder.png"
|
|
||||||
img.save(placeholder_path)
|
|
||||||
return placeholder_path
|
|
||||||
|
|
||||||
def generate_markdown(self):
|
|
||||||
"""Generate markdown content for the catalog"""
|
|
||||||
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
|
|
||||||
markdown = f"""---
|
|
||||||
title: "Pokemon TCG Product Catalog"
|
|
||||||
subtitle: "Dollar General - Generated {timestamp}"
|
|
||||||
author: "Automated Scraper"
|
|
||||||
date: "{timestamp}"
|
|
||||||
geometry: margin=1in
|
|
||||||
fontsize: 11pt
|
|
||||||
documentclass: article
|
|
||||||
---
|
|
||||||
|
|
||||||
# Pokemon TCG Product Catalog
|
|
||||||
|
|
||||||
Generated on: {timestamp}
|
|
||||||
Source: Dollar General
|
|
||||||
Total Products: {len(self.products)}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
"""
|
|
||||||
|
|
||||||
for i, product in enumerate(self.products, 1):
|
|
||||||
print(f"Processing product {i}/{len(self.products)}: {product.get('title', 'Unknown')}")
|
|
||||||
|
|
||||||
# Download product image
|
|
||||||
image_path = None
|
|
||||||
if product.get('image_url'):
|
|
||||||
filename = f"product_{i}_{str(product.get('sku', 'unknown')).replace('/', '_').replace(' ', '_')}.jpg"
|
|
||||||
image_path = self.download_image(product.get('image_url'), filename)
|
|
||||||
|
|
||||||
if not image_path:
|
|
||||||
# Use placeholder
|
|
||||||
image_path = self.create_placeholder_image()
|
|
||||||
|
|
||||||
# Generate barcode
|
|
||||||
barcode_path = None
|
|
||||||
if product.get('sku'):
|
|
||||||
barcode_path = self.generate_upc_barcode(product.get('sku'))
|
|
||||||
|
|
||||||
# Add product section to markdown
|
|
||||||
markdown += f"## {i}. {product.get('title', 'Unknown Product')}\n\n"
|
|
||||||
|
|
||||||
# Product image
|
|
||||||
if image_path:
|
|
||||||
rel_image_path = os.path.relpath(image_path, self.output_dir)
|
|
||||||
markdown += f"![Product image]({rel_image_path}){{width=300px}}\n\n"
|
|
||||||
|
|
||||||
# Product details in a table
|
|
||||||
markdown += "| Field | Value |\n"
|
|
||||||
markdown += "|-------|-------|\n"
|
|
||||||
markdown += f"| **Title** | {product.get('title', 'N/A')} |\n"
|
|
||||||
markdown += f"| **Price** | {product.get('price', 'N/A')} |\n"
|
|
||||||
markdown += f"| **Stock** | {product.get('stock', 'N/A')} |\n"
|
|
||||||
markdown += f"| **SKU** | `{product.get('sku', 'N/A')}` |\n"
|
|
||||||
markdown += f"| **URL** | {product.get('url', 'N/A')} |\n"
|
|
||||||
markdown += "\n"
|
|
||||||
|
|
||||||
# Barcode
|
|
||||||
if barcode_path:
|
|
||||||
rel_barcode_path = os.path.relpath(barcode_path, self.output_dir)
|
|
||||||
markdown += f"**UPC-A Barcode:**\n\n"
|
|
||||||
markdown += f"![UPC-A barcode]({rel_barcode_path}){{width=200px}}\n\n"
|
|
||||||
|
|
||||||
markdown += "---\n\n"
|
|
||||||
|
|
||||||
return markdown
|
|
||||||
|
|
||||||
def generate_pdf(self):
|
|
||||||
"""Generate PDF catalog using pandoc"""
|
|
||||||
print("Generating markdown content...")
|
|
||||||
markdown_content = self.generate_markdown()
|
|
||||||
|
|
||||||
# Save markdown file
|
|
||||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
|
||||||
markdown_file = self.output_dir / f"pokemon_tcg_catalog_{timestamp}.md"
|
|
||||||
|
|
||||||
with open(markdown_file, 'w', encoding='utf-8') as f:
|
|
||||||
f.write(markdown_content)
|
|
||||||
|
|
||||||
print(f"Markdown saved to: {markdown_file}")
|
|
||||||
|
|
||||||
# Generate PDF using pandoc
|
|
||||||
pdf_file = self.output_dir / f"pokemon_tcg_catalog_{timestamp}.pdf"
|
|
||||||
|
|
||||||
print("Converting to PDF using pandoc...")
|
|
||||||
|
|
||||||
try:
|
|
||||||
subprocess.run([
|
|
||||||
'pandoc',
|
|
||||||
str(markdown_file),
|
|
||||||
'-o', str(pdf_file),
|
|
||||||
'--pdf-engine=xelatex',
|
|
||||||
'-V', 'colorlinks=true',
|
|
||||||
'-V', 'linkcolor=blue',
|
|
||||||
'-V', 'filecolor=magenta',
|
|
||||||
'-V', 'urlcolor=cyan',
|
|
||||||
'--toc',
|
|
||||||
'--toc-depth=2'
|
|
||||||
], check=True)
|
|
||||||
|
|
||||||
print(f"PDF generated successfully: {pdf_file}")
|
|
||||||
return pdf_file
|
|
||||||
|
|
||||||
except subprocess.CalledProcessError as e:
|
|
||||||
print(f"Pandoc conversion failed: {e}")
|
|
||||||
print("Trying with pdflatex instead...")
|
|
||||||
|
|
||||||
try:
|
|
||||||
subprocess.run([
|
|
||||||
'pandoc',
|
|
||||||
str(markdown_file),
|
|
||||||
'-o', str(pdf_file),
|
|
||||||
'--pdf-engine=pdflatex',
|
|
||||||
'--toc'
|
|
||||||
], check=True)
|
|
||||||
|
|
||||||
print(f"PDF generated successfully: {pdf_file}")
|
|
||||||
return pdf_file
|
|
||||||
|
|
||||||
except subprocess.CalledProcessError as e2:
|
|
||||||
print(f"PDF generation failed with both engines: {e2}")
|
|
||||||
print(f"Markdown file available at: {markdown_file}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
except FileNotFoundError:
|
|
||||||
print("Error: pandoc not found. Please install pandoc to generate PDF.")
|
|
||||||
print(f"Markdown file available at: {markdown_file}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
def main():
|
|
||||||
if len(sys.argv) != 2:
|
|
||||||
print("Usage: python3 pdf_generator.py <json_file>")
|
|
||||||
print("Example: python3 pdf_generator.py pokemon_tcg_products_20241221_143025.json")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
json_file = sys.argv[1]
|
|
||||||
|
|
||||||
if not os.path.exists(json_file):
|
|
||||||
print(f"Error: JSON file '{json_file}' not found")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
generator = PokemonTCGCatalogGenerator(json_file)
|
|
||||||
pdf_file = generator.generate_pdf()
|
|
||||||
|
|
||||||
if pdf_file:
|
|
||||||
print(f"\nCatalog generation completed!")
|
|
||||||
print(f"PDF file: {pdf_file}")
|
|
||||||
print(f"Output directory: {generator.output_dir}")
|
|
||||||
else:
|
|
||||||
print(f"\nPDF generation failed, but markdown file is available in: {generator.output_dir}")
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
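Review note: `generate_upc_barcode` in the removed file above pads or truncates the SKU to 11 digits and leaves the 12th (check) digit to the barcode library. The standard UPC-A check digit calculation is sketched below so the printed number can be verified independently. The function names are hypothetical. Note also that a barcode built from a padded SKU encodes that SKU and is not necessarily the product's registered UPC.

```python
ASCII_DIGITS = "0123456789"


def upc_a_check_digit(first_eleven):
    """Compute the UPC-A check digit for an 11-digit string."""
    if len(first_eleven) != 11 or any(ch not in ASCII_DIGITS for ch in first_eleven):
        raise ValueError("expected exactly 11 ASCII digits")
    odd_sum = sum(int(d) for d in first_eleven[0::2])   # positions 1, 3, ..., 11
    even_sum = sum(int(d) for d in first_eleven[1::2])  # positions 2, 4, ..., 10
    return (10 - (3 * odd_sum + even_sum) % 10) % 10


def sku_to_upc_a(sku):
    """Mirror the padding rule in generate_upc_barcode and append the check digit."""
    digits = "".join(ch for ch in str(sku) if ch in ASCII_DIGITS)
    # Left-pad short SKUs with zeros; keep the last 11 digits of long ones.
    base = digits.zfill(11) if len(digits) < 11 else digits[-11:]
    return base + str(upc_a_check_digit(base))
```

For the SKU 41936301 used in the HAR search, the padded base is 00041936301. Digits in odd positions sum to 8 and digits in even positions sum to 19, so 3 × 8 + 19 = 43 and the check digit is 7.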
31
run.sh
@@ -1,31 +0,0 @@
|
|||||||
#!/bin/bash
# Pokemon Discovery - Scraper & Catalog Generator Launcher
# Automatically activates virtual environment and runs the scraper

set -e

cd "$(dirname "$0")"

echo "Pokemon Discovery - Product Scraper & Catalog Generator"
echo "================================================"

# Check if virtual environment exists
if [[ ! -d "venv" ]]; then
    echo "Creating virtual environment..."
    python3 -m venv venv
fi

# Activate virtual environment
source venv/bin/activate

# Check if requirements are installed
if ! python -c "import requests, bs4, barcode, selenium" 2>/dev/null; then
    echo "Installing Python requirements..."
    pip install -r requirements.txt
fi

# Run the main script
python run_scraper.py

echo ""
echo "Script completed. Check the output above for results."
|
|
||||||
139
run_scraper.py
@@ -1,139 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Pokemon Discovery - Scraper and Catalog Generator
|
|
||||||
Main script that runs both scraping and PDF generation
|
|
||||||
"""
|
|
||||||
|
|
||||||
import os
|
|
||||||
import sys
|
|
||||||
import subprocess
|
|
||||||
from datetime import datetime
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
def install_requirements():
|
|
||||||
"""Install Python requirements"""
|
|
||||||
print("Installing Python requirements...")
|
|
||||||
try:
|
|
||||||
subprocess.run([sys.executable, '-m', 'pip', 'install', '-r', 'requirements.txt'],
|
|
||||||
check=True)
|
|
||||||
print("Requirements installed successfully!")
|
|
||||||
except subprocess.CalledProcessError as e:
|
|
||||||
print(f"Failed to install requirements: {e}")
|
|
||||||
return False
|
|
||||||
return True
|
|
||||||
|
|
||||||
def run_scraper():
|
|
||||||
"""Run the scraper to collect product data"""
|
|
||||||
print("=" * 60)
|
|
||||||
print("STEP 1: SCRAPING POKEMON TCG PRODUCTS")
|
|
||||||
print("=" * 60)
|
|
||||||
|
|
||||||
try:
|
|
||||||
result = subprocess.run([sys.executable, 'scraper.py'],
|
|
||||||
capture_output=True, text=True)
|
|
||||||
|
|
||||||
if result.returncode == 0:
|
|
||||||
print("Scraping completed successfully!")
|
|
||||||
print(result.stdout)
|
|
||||||
|
|
||||||
# Find the generated JSON file
|
|
||||||
json_files = list(Path('.').glob('pokemon_tcg_products_*.json'))
|
|
||||||
if json_files:
|
|
||||||
latest_file = max(json_files, key=os.path.getctime)
|
|
||||||
return str(latest_file)
|
|
||||||
else:
|
|
||||||
print("No JSON file was generated")
|
|
||||||
return None
|
|
||||||
else:
|
|
||||||
print("Scraping failed:")
|
|
||||||
print(result.stderr)
|
|
||||||
return None
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error running scraper: {e}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
def run_pdf_generator(json_file):
|
|
||||||
"""Run the PDF generator with the scraped data"""
|
|
||||||
print("=" * 60)
|
|
||||||
print("STEP 2: GENERATING PDF CATALOG")
|
|
||||||
print("=" * 60)
|
|
||||||
|
|
||||||
try:
|
|
||||||
result = subprocess.run([sys.executable, 'pdf_generator.py', json_file],
|
|
||||||
capture_output=True, text=True)
|
|
||||||
|
|
||||||
if result.returncode == 0:
|
|
||||||
print("PDF generation completed successfully!")
|
|
||||||
print(result.stdout)
|
|
||||||
return True
|
|
||||||
else:
|
|
||||||
print("PDF generation failed:")
|
|
||||||
print(result.stderr)
|
|
||||||
return False
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error running PDF generator: {e}")
|
|
||||||
return False
|
|
||||||
|
|
||||||
def main():
|
|
||||||
print("Pokemon Discovery - Product Scraper & Catalog Generator")
|
|
||||||
print("=" * 60)
|
|
||||||
print(f"Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
|
|
||||||
print()
|
|
||||||
|
|
||||||
# Check if requirements are installed
|
|
||||||
try:
|
|
||||||
import requests, bs4, barcode, PIL
|
|
||||||
print("✓ Required packages are available")
|
|
||||||
except ImportError as e:
|
|
||||||
print(f"✗ Missing required package: {e}")
|
|
||||||
print("Installing requirements...")
|
|
||||||
if not install_requirements():
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
# Check if pandoc is available
|
|
||||||
try:
|
|
||||||
subprocess.run(['pandoc', '--version'],
|
|
||||||
capture_output=True, check=True)
|
|
||||||
print("✓ Pandoc is available for PDF generation")
|
|
||||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
|
||||||
print("⚠ Pandoc not found. PDF generation may fail.")
|
|
||||||
print(" Install pandoc with: sudo apt install pandoc (Ubuntu/Debian)")
|
|
||||||
print(" or: brew install pandoc (macOS)")
|
|
||||||
print(" or: pacman -S pandoc (Arch Linux)")
|
|
||||||
|
|
||||||
print()
|
|
||||||
|
|
||||||
# Run scraper
|
|
||||||
json_file = run_scraper()
|
|
||||||
if not json_file:
|
|
||||||
print("Scraping failed. Exiting.")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
# Run PDF generator
|
|
||||||
if run_pdf_generator(json_file):
|
|
||||||
print("=" * 60)
|
|
||||||
print("SUCCESS! Both scraping and PDF generation completed.")
|
|
||||||
print("=" * 60)
|
|
||||||
print(f"JSON data: {json_file}")
|
|
||||||
print("PDF catalog: Check the catalog_output/ directory")
|
|
||||||
print()
|
|
||||||
print("Files generated:")
|
|
||||||
|
|
||||||
# List generated files
|
|
||||||
for file_pattern in ['pokemon_tcg_products_*.json', 'catalog_output/pokemon_tcg_catalog_*.pdf']:
|
|
||||||
files = list(Path('.').glob(file_pattern))
|
|
||||||
if files:
|
|
||||||
latest = max(files, key=os.path.getctime)
|
|
||||||
print(f" - {latest}")
|
|
||||||
else:
|
|
||||||
print("=" * 60)
|
|
||||||
print("PARTIAL SUCCESS: Scraping completed, but PDF generation failed.")
|
|
||||||
print("=" * 60)
|
|
||||||
print(f"JSON data: {json_file}")
|
|
||||||
print("You can manually run the PDF generator with:")
|
|
||||||
print(f" python3 pdf_generator.py {json_file}")
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
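Review note: `run_scraper` and `main` in the removed file above pick the newest output file with `os.path.getctime`. On Unix that value is the inode change time, not the creation time, so a rename or permission change can reorder results. The sketch below selects by modification time instead. The helper name `latest_matching` is hypothetical.

```python
from pathlib import Path


def latest_matching(directory, pattern):
    """Return the most recently modified file matching `pattern`, or None."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        return None
    # st_mtime changes only when file contents are written.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

A call such as `latest_matching('.', 'pokemon_tcg_products_*.json')` would stand in for the glob-and-max pair used in both places.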
32
scraper.py
@@ -1,7 +1,20 @@
|
|||||||
#!/usr/bin/env python3
|
#!/usr/bin/env python3
|
||||||
"""
|
"""
|
||||||
Pokemon Discovery - TCG Product Scraper for Dollar General
|
Pokemon Discovery — Site Scraper (Reference)
|
||||||
Scrapes product information and saves to JSON for PDF generation
|
|
||||||
|
HTML + Selenium/Brave scraper for Dollar General product pages.
|
||||||
|
Kept as a reference implementation. The primary tool is disco.py,
|
||||||
|
which reads product data from a HAR capture instead of scraping live.
|
||||||
|
|
||||||
|
This scraper can:
|
||||||
|
- Fetch individual product pages and extract title, SKU, price, stock
|
||||||
|
- Attempt to find product links from the category page (limited by
|
||||||
|
dynamic JS loading — products are injected via API after page load)
|
||||||
|
- Fall back to Brave browser via Selenium for JS-rendered content
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python scraper.py # Attempt full category scrape
|
||||||
|
# Or import and use PokemonTCGScraper class directly for individual pages
|
||||||
"""
|
"""
|
||||||
|
|
||||||
import json
|
import json
|
||||||
@@ -28,6 +41,14 @@ except ImportError:
|
|||||||
print("Selenium not available, using requests only (install selenium for Brave browser support)")
|
print("Selenium not available, using requests only (install selenium for Brave browser support)")
|
||||||
|
|
||||||
class PokemonTCGScraper:
|
class PokemonTCGScraper:
|
||||||
|
"""HTML/Selenium scraper for Dollar General Pokemon product pages.
|
||||||
|
|
||||||
|
Can extract product details (title, SKU, price, stock) from individual
|
||||||
|
product page URLs. Category-level scraping is limited because Dollar
|
||||||
|
General loads products dynamically via a JS API call after page load.
|
||||||
|
See disco.py for the HAR-based approach that bypasses this limitation.
|
||||||
|
"""
|
||||||
|
|
||||||
def __init__(self):
|
def __init__(self):
|
||||||
self.base_url = "https://www.dollargeneral.com"
|
self.base_url = "https://www.dollargeneral.com"
|
||||||
self.search_url = "https://www.dollargeneral.com/c/toys/pokemon?q=&soldAtStore=true"
|
self.search_url = "https://www.dollargeneral.com/c/toys/pokemon?q=&soldAtStore=true"
|
||||||
@@ -300,9 +321,10 @@ class PokemonTCGScraper:
|
|||||||
return has_pokemon and has_tcg
|
return has_pokemon and has_tcg
|
||||||
|
|
||||||
def try_api_scraping(self):
|
def try_api_scraping(self):
|
||||||
"""
|
"""Stub for API-based scraping (requires auth token).
|
||||||
Try to scrape products using the discovered API endpoint
|
|
||||||
This method contains the exact API call found via HAR analysis
|
Documents the discovered API endpoint and request format.
|
||||||
|
Not functional — use disco.py with a HAR file instead.
|
||||||
"""
|
"""
|
||||||
print("🔬 Attempting API-based scraping...")
|
print("🔬 Attempting API-based scraping...")
|
||||||
print(" Endpoint: https://dggo.dollargeneral.com/omni/api/v2/category/search/provider")
|
print(" Endpoint: https://dggo.dollargeneral.com/omni/api/v2/category/search/provider")
|
||||||
|
|||||||
@@ -1,246 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Test the Dollar General API endpoint for Pokemon products
|
|
||||||
"""
|
|
||||||
|
|
||||||
import json
|
|
||||||
import requests
|
|
||||||
import sys
|
|
||||||
from datetime import datetime
|
|
||||||
|
|
||||||
def get_auth_token():
|
|
||||||
"""Get authentication token from Dollar General"""
|
|
||||||
try:
|
|
||||||
# Try to get token from the token endpoint
|
|
||||||
token_url = 'https://www.dollargeneral.com/bin/omni/userTokens'
|
|
||||||
headers = {
|
|
||||||
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:148.0) Gecko/20100101 Firefox/148.0',
|
|
||||||
'Accept': 'application/json, text/plain, */*',
|
|
||||||
'Referer': 'https://www.dollargeneral.com/'
|
|
||||||
}
|
|
||||||
|
|
||||||
response = requests.get(token_url, headers=headers, timeout=30)
|
|
||||||
if response.status_code == 200:
|
|
||||||
data = response.json()
|
|
||||||
# Look for access token in the response
|
|
||||||
if 'access_token' in data:
|
|
||||||
return data['access_token']
|
|
||||||
elif 'token' in data:
|
|
||||||
return data['token']
|
|
||||||
else:
|
|
||||||
print("Token response structure:", list(data.keys()))
|
|
||||||
return None
|
|
||||||
else:
|
|
||||||
print(f"Failed to get token: {response.status_code}")
|
|
||||||
return None
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error getting token: {e}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
def test_api_with_existing_token():
|
|
||||||
"""Test with the token from HAR file"""
|
|
||||||
|
|
||||||
# Token extracted from HAR file (may expire)
|
|
||||||
har_token = "eyJ0eXAiOiJhdCtKV1QiLCJhbGciOiJSUzI1NiIsImtpZCI6Ik5qRTJNemczTXpSRVFrUXpNak5GUmprMU1FUkNNRUZDTVRBek1FWTFRa0pCTXpRM1EwTkNNZyJ9.eyJzY29wZSI6bnVsbCwiaWF0IjoxNzc0MTI3Nzc5LCJleHAiOjE3NzQxMzEzNzksImF1ZCI6IldLOTlLc2VCYnUybmFoNC1ibFE3ZmsyUiIsImlzcyI6Imh0dHBzOi8vcHJvZC1kZ2dvLyIsInN1YiI6IldLOTlLc2VCYnUybmFoNC1ibFE3ZmsyUiIsInNpZCI6IlNrWk9makF5TURRMU1EVXpOVFEwWWpBM016SXpNak14TXpFek9ETTNNekV3TWpreFl6VitUVUZXYVhwbk56SXpVRGg2VWxkcmEySkRkMk5EZUdVNFlUWm5XVXBHVDBveVExTlRNVWxXWlhSalQzRnFWazVWZGtGWlIwOWtZV2x0WVVwRVRucG5SVlZvUTE5SE5VcHVObGhuTURSb2JuUkVhVlF3UTBzelNIND0iLCJqdGkiOiJzdDIucy5BdEx0VlphRHFnLnZrdW5OV2RWNjN2ZlJTTG00Y3VUd2d5bmc2X0pJNmxKRjA5a2lXTXVQeGZkVDRvT0NhMXhwa1VoRlRkM2tocHZUaFhsRUVwLWw0QzJrZnoycjkzVlYzeldBaUw5Y2x6Snl0amFJamJ4TEJnLkJOZy1CeUdpZnV0WnppQWhhMV8xRDBXTUFWR3JpNVVCX0pKbTRCNVRNYVhTWkZneXpxeUZERjJxZ3B3UTgyajZ2eGVtcnA5RERFTHZnM3hvdlZmZzBnLnNjMyIsImNsaWVudF9pZCI6IldLOTlLc2VCYnUybmFoNC1ibFE3ZmsyUiIsImF6cCI6IldLOTlLc2VCYnUybmFoNC1ibFE3ZmsyUiJ9.I6ou9atkJ8ndkr2m2Trpg53fMIL3hpofCLUHoHYgZkOJnLnbmL0CQu7_pIChQ6nIDK03GagK6aqxd97E8B8vv9nweSmb7zXhrt43dKLEIdhxIGFkJ4xYgNNg-3cVjSlThBQ_AwCx924lOGjEfikEw4NrvGvrlNvrg1lnNz4hf629hUH-5ccVSdgo1w_LQzsLOeMCjuC_bmAoRxT5KLI9oESd4tPJZU5Nlt2ICbWJD9h-zNrt-ijwYCvb7j8amGbpMGhJZqtzu9f3wN0JUFxDg5rAN-WOtLjwEmR_NxDKq0NEeuU16uhaB8AJzy217XAgJ87bKZldZowsWs-Q9oAH3g"
|
|
||||||
|
|
||||||
endpoint = "https://dggo.dollargeneral.com/omni/api/v2/category/search/provider"
|
|
||||||
|
|
||||||
headers = {
|
|
||||||
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:148.0) Gecko/20100101 Firefox/148.0',
|
|
||||||
'Accept': 'application/json, text/plain, */*',
|
|
||||||
'Content-Type': 'application/json',
|
|
||||||
'Authorization': f'Bearer {har_token}',
|
|
||||||
'Referer': 'https://www.dollargeneral.com/'
|
|
||||||
}
|
|
||||||
|
|
||||||
# Test different filter combinations
|
|
||||||
test_requests = [
|
|
||||||
{
|
|
||||||
"name": "In Stock Pokemon Products",
|
|
||||||
"payload": {
|
|
||||||
"StoreNbr": 17506,
|
|
||||||
"SearchTerm": None,
|
|
||||||
"PageSize": 24,
|
|
||||||
"PageStartRecordIndex": 0,
|
|
||||||
"Filters": {
|
|
||||||
"category": [],
|
|
||||||
"brand": [],
|
|
||||||
"dgDelivery": False,
|
|
||||||
"dgPickUp": False,
|
|
||||||
"dgShipTohome": False,
|
|
||||||
"soldAtStore": True,
|
|
||||||
"inStock": True,
|
|
||||||
"onlyActivatedDeals": False
|
|
||||||
},
|
|
||||||
"IncludeSponsored": True,
|
|
||||||
"IncludeShipToHome": True,
|
|
||||||
"IncludeDeals": True,
|
|
||||||
"offerSourceType": 0,
|
|
||||||
"Id": 723960, # Pokemon category ID
|
|
||||||
"IncludeProducts": False,
|
|
||||||
"DoNotSave": False,
|
|
||||||
"OptOut": False,
|
|
||||||
"SearchType": 1
|
|
||||||
}
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"name": "All Pokemon Products (including out of stock)",
|
|
||||||
"payload": {
|
|
||||||
"StoreNbr": 17506,
|
|
||||||
"SearchTerm": None,
|
|
||||||
"PageSize": 24,
|
|
||||||
"PageStartRecordIndex": 0,
|
|
||||||
"Filters": {
|
|
||||||
"category": [],
|
|
||||||
"brand": [],
|
|
||||||
"dgDelivery": False,
|
|
||||||
"dgPickUp": False,
|
|
||||||
"dgShipTohome": False,
|
|
||||||
"soldAtStore": True,
|
|
||||||
"inStock": False, # Include out of stock
|
|
||||||
"onlyActivatedDeals": False
|
|
||||||
},
|
|
||||||
"IncludeSponsored": True,
|
|
||||||
"IncludeShipToHome": True,
|
|
||||||
"IncludeDeals": True,
|
|
||||||
"offerSourceType": 0,
|
|
||||||
"Id": 723960,
|
|
||||||
"IncludeProducts": False,
|
|
||||||
"DoNotSave": False,
|
|
||||||
"OptOut": False,
|
|
||||||
"SearchType": 1
|
|
||||||
}
|
|
||||||
}
|
|
||||||
]
|
|
||||||
|
|
||||||
all_pokemon_products = []
|
|
||||||
|
|
||||||
for test in test_requests:
|
|
||||||
print(f"=== Testing: {test['name']} ===")
|
|
||||||
|
|
||||||
try:
|
|
||||||
response = requests.post(endpoint,
|
|
||||||
headers=headers,
|
|
||||||
json=test['payload'],
|
|
||||||
timeout=30)
|
|
||||||
|
|
||||||
print(f"Status Code: {response.status_code}")
|
|
||||||
|
|
||||||
if response.status_code == 200:
|
|
||||||
print(f"Response length: {len(response.text)} characters")
|
|
||||||
print(f"Response preview: {response.text[:200]}...")
|
|
||||||
|
|
||||||
try:
|
|
||||||
data = response.json()
|
|
||||||
items = data.get('ItemList', {}).get('Items', [])
|
|
||||||
print(f"Total products: {len(items)}")
|
|
||||||
except Exception as json_error:
|
|
||||||
print(f"JSON parsing error: {json_error}")
|
|
||||||
print(f"Full response: {response.text}")
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Filter for Pokemon products
|
|
||||||
pokemon_products = []
|
|
||||||
for item in items:
|
|
||||||
title = item.get('Title', '').lower()
|
|
||||||
if any(keyword in title for keyword in ['pokemon', 'pokémon', 'trading card']):
|
|
||||||
product_info = {
|
|
||||||
'title': item.get('Title'),
|
|
||||||
'sku': item.get('ItemNbr'),
|
|
||||||
'upc': item.get('UPC'),
|
|
||||||
'price': item.get('Price', {}).get('Amount'),
|
|
||||||
'url': f"https://www.dollargeneral.com{item.get('ProductUrl', '')}",
|
|
||||||
'in_stock': item.get('Inventory', {}).get('InStock'),
|
|
||||||
'image_url': item.get('ImageURL'),
|
|
||||||
'description': item.get('Description', ''),
|
|
||||||
'brand': item.get('Brand', '')
|
|
||||||
}
|
|
||||||
pokemon_products.append(product_info)
|
|
||||||
all_pokemon_products.append(product_info)
|
|
||||||
|
|
||||||
print(f"Pokemon products found: {len(pokemon_products)}")
|
|
||||||
|
|
||||||
for i, prod in enumerate(pokemon_products, 1):
|
|
||||||
print(f" {i}. {prod['title']}")
|
|
||||||
print(f" SKU: {prod['sku']}, UPC: {prod['upc']}")
|
|
||||||
print(f" Price: ${prod['price']}, In Stock: {prod['in_stock']}")
|
|
||||||
print(f" URL: {prod['url']}")
|
|
||||||
|
|
||||||
# Check if this is our test product
|
|
||||||
if prod['sku'] == '41936301':
|
|
||||||
print(" 🎯 THIS IS OUR TEST PRODUCT!")
|
|
||||||
print()
|
|
||||||
|
|
||||||
elif response.status_code == 401:
|
|
||||||
print("❌ Authentication failed - token may be expired")
|
|
||||||
print("Response:", response.text)
|
|
||||||
return None
|
|
||||||
else:
|
|
||||||
print(f"❌ API call failed: {response.status_code}")
|
|
||||||
print("Response:", response.text[:500])
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"❌ Error: {e}")
|
|
||||||
|
|
||||||
print("="*60)
|
|
||||||
print()
|
|
||||||
|
|
||||||
# Save results
|
|
||||||
if all_pokemon_products:
|
|
||||||
# Remove duplicates based on SKU
|
|
||||||
unique_products = {prod['sku']: prod for prod in all_pokemon_products}.values()
|
|
||||||
unique_products = list(unique_products)
|
|
||||||
|
|
||||||
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
|
|
||||||
filename = f'pokemon_tcg_api_results_{timestamp}.json'
|
|
||||||
|
|
||||||
with open(filename, 'w') as f:
|
|
||||||
json.dump(unique_products, f, indent=2)
|
|
||||||
|
|
||||||
print(f"🎉 SUCCESS!")
|
|
||||||
print(f"Found {len(unique_products)} unique Pokemon TCG products")
|
|
||||||
print(f"Saved to: {filename}")
|
|
||||||
|
|
||||||
return unique_products
|
|
||||||
|
|
||||||
return None
|
|
||||||
|
|
||||||
def main():
|
|
||||||
print("Pokemon Discovery - API Endpoint Test")
|
|
||||||
print("="*60)
|
|
||||||
|
|
||||||
# First try to get a fresh token
|
|
||||||
print("Attempting to get fresh authentication token...")
|
|
||||||
fresh_token = get_auth_token()
|
|
||||||
|
|
||||||
if fresh_token:
|
|
||||||
print(f"✅ Got fresh token: {fresh_token[:50]}...")
|
|
||||||
else:
|
|
||||||
print("⚠️ Could not get fresh token, using HAR token")
|
|
||||||
|
|
||||||
print()
|
|
||||||
|
|
||||||
# Test API with existing token from HAR
|
|
||||||
products = test_api_with_existing_token()
|
|
||||||
|
|
||||||
if products:
|
|
||||||
print()
|
|
||||||
print("🚀 READY FOR INTEGRATION!")
|
|
||||||
print("The API endpoint is working and can be integrated into Pokemon Discovery")
|
|
||||||
print()
|
|
||||||
|
|
||||||
# Check if our known product is in the results
|
|
||||||
known_sku = '41936301'
|
|
||||||
known_product = next((p for p in products if p['sku'] == known_sku), None)
|
|
||||||
|
|
||||||
if known_product:
|
|
||||||
print(f"✅ Confirmed: Our test product (SKU {known_sku}) was found via API!")
|
|
||||||
print(f" Title: {known_product['title']}")
|
|
||||||
print(f" URL: {known_product['url']}")
|
|
||||||
print(f" Stock: {known_product['in_stock']}")
|
|
||||||
|
|
||||||
else:
|
|
||||||
print("❌ API test failed - may need fresh authentication")
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
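The two request payloads in the removed API test differ only in the `inStock` filter. A minimal sketch of how they could be built from one template is shown here; the field names and values are copied from the script, while `build_search_payload`, `dedupe_by_sku`, and their parameters are hypothetical names introduced for illustration. Treating `PageStartRecordIndex` as a paging offset is an assumption that the source does not confirm.

```python
import copy

# Template copied from the payloads in the test script above.
BASE_PAYLOAD = {
    "StoreNbr": 17506,
    "SearchTerm": None,
    "PageSize": 24,
    "PageStartRecordIndex": 0,
    "Filters": {
        "category": [],
        "brand": [],
        "dgDelivery": False,
        "dgPickUp": False,
        "dgShipTohome": False,
        "soldAtStore": True,
        "inStock": True,
        "onlyActivatedDeals": False,
    },
    "IncludeSponsored": True,
    "IncludeShipToHome": True,
    "IncludeDeals": True,
    "offerSourceType": 0,
    "Id": 723960,  # Pokemon category ID, per the original comment
    "IncludeProducts": False,
    "DoNotSave": False,
    "OptOut": False,
    "SearchType": 1,
}


def build_search_payload(in_stock_only=True, page_start=0):
    """Return a fresh payload; the template is deep-copied so it is never mutated."""
    payload = copy.deepcopy(BASE_PAYLOAD)
    payload["Filters"]["inStock"] = in_stock_only
    # Assumption: this field acts as a record offset for paging.
    payload["PageStartRecordIndex"] = page_start
    return payload


def dedupe_by_sku(products):
    """Keep one entry per SKU (the last one seen), in first-seen order."""
    return list({product["sku"]: product for product in products}.values())
```

The deep copy matters because `Filters` is a nested dict: a shallow copy would let one request's filter change leak into the template.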
@@ -1,55 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Test script to verify barcode generation functionality
|
|
||||||
"""
|
|
||||||
|
|
||||||
import sys
|
|
||||||
import os
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
# Add current directory to path if running in venv
|
|
||||||
sys.path.insert(0, '.')
|
|
||||||
|
|
||||||
try:
|
|
||||||
import barcode
|
|
||||||
from barcode.writer import ImageWriter
|
|
||||||
print("✓ Barcode generation libraries are available")
|
|
||||||
|
|
||||||
# Test barcode generation
|
|
||||||
test_sku = "123456789012"
|
|
||||||
|
|
||||||
upc_generator = barcode.get_barcode_class('upca')
|
|
||||||
test_barcode = upc_generator("12345678901", writer=ImageWriter())
|
|
||||||
|
|
||||||
# Create test output directory
|
|
||||||
test_dir = Path("test_output")
|
|
||||||
test_dir.mkdir(exist_ok=True)
|
|
||||||
|
|
||||||
# Generate test barcode
|
|
||||||
barcode_path = test_dir / "test_barcode"
|
|
||||||
test_barcode.save(str(barcode_path), options={
|
|
||||||
'module_width': 0.2,
|
|
||||||
'module_height': 15.0,
|
|
||||||
'quiet_zone': 6.5,
|
|
||||||
'font_size': 10,
|
|
||||||
'text_distance': 5.0,
|
|
||||||
'background': 'white',
|
|
||||||
'foreground': 'black'
|
|
||||||
})
|
|
||||||
|
|
||||||
final_path = f"{barcode_path}.png"
|
|
||||||
if os.path.exists(final_path):
|
|
||||||
print(f"✓ Test barcode generated successfully: {final_path}")
|
|
||||||
print(f" File size: {os.path.getsize(final_path)} bytes")
|
|
||||||
else:
|
|
||||||
print(f"✗ Failed to generate test barcode")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
except ImportError as e:
|
|
||||||
print(f"✗ Missing barcode library: {e}")
|
|
||||||
sys.exit(1)
|
|
||||||
except Exception as e:
|
|
||||||
print(f"✗ Barcode generation failed: {e}")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
print("✓ All barcode generation tests passed!")
|
|
||||||
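The removed barcode test passes an 11-digit string to the `upca` class while `test_sku` holds 12 digits, which suggests the twelfth (check) digit is derived rather than supplied. As a rough illustration of the standard UPC-A check-digit rule, independent of any barcode library, the sketch below uses a hypothetical helper name, `upca_check_digit`.

```python
def upca_check_digit(first_eleven):
    """Compute the UPC-A check digit for the first 11 digits of a code."""
    if len(first_eleven) != 11 or not (first_eleven.isascii() and first_eleven.isdigit()):
        raise ValueError("expected exactly 11 ASCII digits")
    digits = [int(ch) for ch in first_eleven]
    odd_sum = sum(digits[0::2])   # positions 1, 3, ..., 11 (1-based)
    even_sum = sum(digits[1::2])  # positions 2, 4, ..., 10 (1-based)
    return (10 - (odd_sum * 3 + even_sum) % 10) % 10


# The 11 digits used in the test yield the 12-digit value stored in test_sku.
full_code = "12345678901" + str(upca_check_digit("12345678901"))
```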
@@ -1,67 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Test Brave browser integration with Pokemon Discovery
|
|
||||||
"""
|
|
||||||
|
|
||||||
import sys
|
|
||||||
import os
|
|
||||||
|
|
||||||
try:
|
|
||||||
from selenium import webdriver
|
|
||||||
from selenium.webdriver.chrome.options import Options
|
|
||||||
from selenium.webdriver.chrome.service import Service
|
|
||||||
from webdriver_manager.chrome import ChromeDriverManager
|
|
||||||
|
|
||||||
print("✓ Selenium and webdriver-manager are available")
|
|
||||||
|
|
||||||
# Check if Brave is available
|
|
||||||
if not os.path.exists('/usr/bin/brave'):
|
|
||||||
print("✗ Brave browser not found at /usr/bin/brave")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
print("✓ Brave browser found at /usr/bin/brave")
|
|
||||||
|
|
||||||
# Get Brave version
|
|
||||||
import subprocess
|
|
||||||
try:
|
|
||||||
result = subprocess.run(['/usr/bin/brave', '--version'],
|
|
||||||
capture_output=True, text=True, timeout=5)
|
|
||||||
brave_version = result.stdout.strip()
|
|
||||||
print(f"✓ {brave_version}")
|
|
||||||
except (OSError, subprocess.SubprocessError):
|
|
||||||
print("⚠ Could not get Brave version")
|
|
||||||
|
|
||||||
# Test ChromeDriver compatibility
|
|
||||||
print("\nTesting ChromeDriver compatibility...")
|
|
||||||
options = Options()
|
|
||||||
options.add_argument('--headless')
|
|
||||||
options.add_argument('--no-sandbox')
|
|
||||||
options.add_argument('--disable-dev-shm-usage')
|
|
||||||
options.binary_location = '/usr/bin/brave'
|
|
||||||
|
|
||||||
try:
|
|
||||||
service = Service(ChromeDriverManager().install())
|
|
||||||
driver = webdriver.Chrome(service=service, options=options)
|
|
||||||
|
|
||||||
# Simple test page
|
|
||||||
driver.get("data:text/html,<html><body><h1>Test</h1></body></html>")
|
|
||||||
title = driver.title
|
|
||||||
driver.quit()
|
|
||||||
|
|
||||||
print("✓ Brave + ChromeDriver test successful!")
|
|
||||||
print("✓ Pokemon Discovery is ready to use Brave for dynamic content")
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"✗ ChromeDriver compatibility issue: {e}")
|
|
||||||
print("\n💡 Solutions:")
|
|
||||||
print("1. Update ChromeDriver: pip install --upgrade webdriver-manager")
|
|
||||||
print("2. Install matching ChromeDriver version manually")
|
|
||||||
print("3. Use Firefox with geckodriver as alternative")
|
|
||||||
print("\nNote: The main PDF generation functionality works without browser automation")
|
|
||||||
|
|
||||||
except ImportError as e:
|
|
||||||
print(f"✗ Missing dependency: {e}")
|
|
||||||
print("Run: pip install selenium webdriver-manager")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
print("\n🎯 Test completed!")
|
|
||||||
@@ -1,26 +0,0 @@
|
|||||||
[
|
|
||||||
{
|
|
||||||
"title": "Pokemon Trading Card Game Battle Academy",
|
|
||||||
"price": "$19.95",
|
|
||||||
"stock": "In Stock",
|
|
||||||
"sku": "DG12345678",
|
|
||||||
"image_url": "https://via.placeholder.com/300x200?text=Pokemon+Battle+Academy",
|
|
||||||
"url": "https://www.dollargeneral.com/p/pokemon-battle-academy"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"title": "Pokemon TCG Scarlet & Violet Booster Pack",
|
|
||||||
"price": "$4.25",
|
|
||||||
"stock": "In Stock",
|
|
||||||
"sku": "DG87654321",
|
|
||||||
"image_url": "https://via.placeholder.com/300x200?text=Pokemon+Booster+Pack",
|
|
||||||
"url": "https://www.dollargeneral.com/p/pokemon-scarlet-violet-booster"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"title": "Pokemon Tin Collection Box",
|
|
||||||
"price": "$12.95",
|
|
||||||
"stock": "Low Stock",
|
|
||||||
"sku": "DG11223344",
|
|
||||||
"image_url": "https://via.placeholder.com/300x200?text=Pokemon+Tin+Box",
|
|
||||||
"url": "https://www.dollargeneral.com/p/pokemon-tin-collection"
|
|
||||||
}
|
|
||||||
]
|
|
||||||
@@ -1,152 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Test dynamic content loading for Pokemon Discovery
|
|
||||||
"""
|
|
||||||
|
|
||||||
import requests
|
|
||||||
import json
|
|
||||||
from bs4 import BeautifulSoup
|
|
||||||
import time
|
|
||||||
|
|
||||||
def test_api_endpoints():
|
|
||||||
"""Try to find API endpoints that might return product data"""
|
|
||||||
|
|
||||||
headers = {
|
|
||||||
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
|
|
||||||
'Accept': 'application/json, text/plain, */*',
|
|
||||||
'Accept-Language': 'en-US,en;q=0.9',
|
|
||||||
'Referer': 'https://www.dollargeneral.com/c/toys/pokemon'
|
|
||||||
}
|
|
||||||
|
|
||||||
# Test potential API endpoints
|
|
||||||
api_tests = [
|
|
||||||
'https://www.dollargeneral.com/api/products/search?q=pokemon',
|
|
||||||
'https://www.dollargeneral.com/api/v1/products?category=toys&query=pokemon',
|
|
||||||
'https://www.dollargeneral.com/dg/search?q=pokemon&category=toys',
|
|
||||||
'https://www.dollargeneral.com/api/search?term=pokemon+trading+card',
|
|
||||||
]
|
|
||||||
|
|
||||||
print("=== Testing API Endpoints ===")
|
|
||||||
for url in api_tests:
|
|
||||||
try:
|
|
||||||
print(f"Testing: {url}")
|
|
||||||
response = requests.get(url, headers=headers, timeout=10)
|
|
||||||
print(f" Status: {response.status_code}")
|
|
||||||
|
|
||||||
if response.status_code == 200:
|
|
||||||
try:
|
|
||||||
data = response.json()
|
|
||||||
print(f" JSON Response: {len(str(data))} characters")
|
|
||||||
if 'products' in str(data).lower():
|
|
||||||
print(" ✓ Contains 'products'")
|
|
||||||
if 'pokemon' in str(data).lower():
|
|
||||||
print(" ✓ Contains 'pokemon'")
|
|
||||||
except ValueError:
|
|
||||||
print(f" Text Response: {len(response.text)} characters")
|
|
||||||
print()
|
|
||||||
except Exception as e:
|
|
||||||
print(f" Error: {e}")
|
|
||||||
print()
|
|
||||||
|
|
||||||
def test_network_requests():
|
|
||||||
"""Analyze the search page to find AJAX calls"""
|
|
||||||
|
|
||||||
url = 'https://www.dollargeneral.com/c/toys/pokemon?q=&soldAtStore=true'
|
|
||||||
|
|
||||||
headers = {
|
|
||||||
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
|
|
||||||
}
|
|
||||||
|
|
||||||
print("=== Analyzing Search Page for API Calls ===")
|
|
||||||
|
|
||||||
try:
|
|
||||||
response = requests.get(url, headers=headers, timeout=30)
|
|
||||||
soup = BeautifulSoup(response.text, 'html.parser')
|
|
||||||
|
|
||||||
# Look for API endpoints in JavaScript
|
|
||||||
scripts = soup.find_all('script')
|
|
||||||
api_patterns = []
|
|
||||||
|
|
||||||
for script in scripts:
|
|
||||||
if script.string:
|
|
||||||
content = script.string
|
|
||||||
|
|
||||||
# Look for API endpoints
|
|
||||||
import re
|
|
||||||
patterns = [
|
|
||||||
r'(?:api|Api|API)["\'\s]*[:=]["\'\s]*([^"\']+)',
|
|
||||||
r'(?:endpoint|url|baseURL)["\'\s]*[:=]["\'\s]*([^"\']+)',
|
|
||||||
r'fetch\s*\(\s*["\']([^"\']+)["\']',
|
|
||||||
r'xhr\.open\s*\(\s*["\'][^"\']*["\'],\s*["\']([^"\']+)["\']',
|
|
||||||
r'/api/[^"\'\s]+',
|
|
||||||
r'/search[^"\'\s]*',
|
|
||||||
]
|
|
||||||
|
|
||||||
for pattern in patterns:
|
|
||||||
matches = re.findall(pattern, content, re.IGNORECASE)
|
|
||||||
for match in matches:
|
|
||||||
if 'dollargeneral' in match or match.startswith('/'):
|
|
||||||
api_patterns.append(match)
|
|
||||||
|
|
||||||
# Remove duplicates and clean up
|
|
||||||
unique_apis = list(set(api_patterns))
|
|
||||||
|
|
||||||
print(f"Found {len(unique_apis)} potential API endpoints:")
|
|
||||||
for api in unique_apis[:10]: # Show first 10
|
|
||||||
print(f" -> {api}")
|
|
||||||
|
|
||||||
return unique_apis
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error analyzing page: {e}")
|
|
||||||
return []
|
|
||||||
|
|
||||||
def test_sitemap_approach():
|
|
||||||
"""Try to find products via sitemap"""
|
|
||||||
|
|
||||||
print("=== Testing Sitemap Approach ===")
|
|
||||||
|
|
||||||
sitemap_urls = [
|
|
||||||
'https://www.dollargeneral.com/sitemap.xml',
|
|
||||||
'https://www.dollargeneral.com/robots.txt'
|
|
||||||
]
|
|
||||||
|
|
||||||
for url in sitemap_urls:
|
|
||||||
try:
|
|
||||||
print(f"Testing: {url}")
|
|
||||||
response = requests.get(url, timeout=10)
|
|
||||||
print(f" Status: {response.status_code}")
|
|
||||||
|
|
||||||
if response.status_code == 200:
|
|
||||||
content = response.text
|
|
||||||
if 'pokemon' in content.lower():
|
|
||||||
print(" ✓ Contains Pokemon references")
|
|
||||||
if '/p/' in content:
|
|
||||||
print(" ✓ Contains product URLs (/p/)")
|
|
||||||
print(f" Content length: {len(content)} characters")
|
|
||||||
print()
|
|
||||||
except Exception as e:
|
|
||||||
print(f" Error: {e}")
|
|
||||||
print()
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
print("Pokemon Discovery - Dynamic Content Testing")
|
|
||||||
print("=" * 60)
|
|
||||||
print()
|
|
||||||
|
|
||||||
# Test various approaches to find products
|
|
||||||
test_api_endpoints()
|
|
||||||
print()
|
|
||||||
|
|
||||||
apis = test_network_requests()
|
|
||||||
print()
|
|
||||||
|
|
||||||
test_sitemap_approach()
|
|
||||||
print()
|
|
||||||
|
|
||||||
print("=" * 60)
|
|
||||||
print("Summary:")
|
|
||||||
print("- Individual product extraction: ✅ WORKING")
|
|
||||||
print("- Product URLs can be processed if found")
|
|
||||||
print("- Main challenge: Finding product URLs from search page")
|
|
||||||
print("- Dynamic content requires browser automation or API discovery")
|
|
||||||
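The removed script scans inline `<script>` text for endpoint-like paths with regular expressions. A minimal sketch of that idea follows, assuming a made-up snippet of script text; `SCRIPT_TEXT`, `PATH_PATTERN`, and `find_api_paths` are hypothetical names. One detail worth noting: inside a raw string, the whitespace class is written `\s` with a single backslash, since a doubled backslash in a character class would match a literal backslash and the letter `s` instead.

```python
import re

# Hypothetical inline script text, made up for illustration.
SCRIPT_TEXT = '''
fetch("/api/products/search?q=pokemon");
var endpoint = "/omni/api/v2/category/search/provider";
'''

# Match /api/ followed by any run of characters that are not quotes or whitespace.
PATH_PATTERN = re.compile(r'/api/[^"\'\s]+')


def find_api_paths(script_text):
    """Return the unique /api/... paths in script_text, in first-seen order."""
    seen = {}
    for match in PATH_PATTERN.findall(script_text):
        seen.setdefault(match, None)
    return list(seen)


paths = find_api_paths(SCRIPT_TEXT)
```

Using a dict rather than `set` keeps the output order stable, which makes repeated runs easier to compare.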
@@ -1,165 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Test Pokemon Discovery with real Dollar General Pokemon products
|
|
||||||
Demonstrates full working pipeline with known products
|
|
||||||
"""
|
|
||||||
|
|
||||||
import json
|
|
||||||
import sys
|
|
||||||
import os
|
|
||||||
from datetime import datetime
|
|
||||||
|
|
||||||
# Add current directory to path
|
|
||||||
sys.path.insert(0, '.')
|
|
||||||
|
|
||||||
from scraper import PokemonTCGScraper
|
|
||||||
from pdf_generator import PokemonTCGCatalogGenerator
|
|
||||||
|
|
||||||
def test_known_products():
|
|
||||||
"""Test with known Pokemon TCG products from Dollar General"""
|
|
||||||
|
|
||||||
# Known Pokemon TCG products (you can add more as you find them)
|
|
||||||
known_products = [
|
|
||||||
'https://www.dollargeneral.com/p/pok-mon-trading-card-game-card-pack-ct/728192558375',
|
|
||||||
# Add more product URLs here as they're discovered
|
|
||||||
]
|
|
||||||
|
|
||||||
print("Pokemon Discovery - Real Product Test")
|
|
||||||
print("=" * 50)
|
|
||||||
print(f"Testing with {len(known_products)} known products")
|
|
||||||
print()
|
|
||||||
|
|
||||||
scraper = PokemonTCGScraper()
|
|
||||||
products_found = []
|
|
||||||
|
|
||||||
for i, url in enumerate(known_products, 1):
|
|
||||||
print(f"Testing product {i}/{len(known_products)}")
|
|
||||||
print(f"URL: {url}")
|
|
||||||
|
|
||||||
# Get product page
|
|
||||||
html = scraper.get_page_content(url)
|
|
||||||
|
|
||||||
if html:
|
|
||||||
# Extract product information
|
|
||||||
product = scraper.extract_product_info(url, html)
|
|
||||||
|
|
||||||
# Check if it's a Pokemon TCG product
|
|
||||||
if scraper.is_pokemon_tcg_product(product):
|
|
||||||
products_found.append(product)
|
|
||||||
print(f"✓ FOUND: {product.get('title', 'Unknown')}")
|
|
||||||
print(f" SKU: {product.get('sku', 'N/A')}")
|
|
||||||
print(f" Price: {product.get('price', 'N/A')}")
|
|
||||||
|
|
||||||
# Try to get additional data we might have missed
|
|
||||||
if not product.get('price'):
|
|
||||||
print(" (Attempting to find price...)")
|
|
||||||
from bs4 import BeautifulSoup
|
|
||||||
soup = BeautifulSoup(html, 'html.parser')
|
|
||||||
|
|
||||||
# More price selectors
|
|
||||||
price_selectors = ['[data-testid="price"]', '.price-display', '.current-price', '[class*="price"]']
|
|
||||||
for selector in price_selectors:
|
|
||||||
price_elem = soup.select_one(selector)
|
|
||||||
if price_elem and not product.get('price'):
|
|
||||||
price_text = price_elem.get_text().strip()
|
|
||||||
if '$' in price_text:
|
|
||||||
product['price'] = price_text
|
|
||||||
print(f" Found price: {price_text}")
|
|
||||||
break
|
|
||||||
|
|
||||||
# Try to get stock info
|
|
||||||
if not product.get('stock'):
|
|
||||||
print(" (Attempting to find stock status...)")
|
|
||||||
from bs4 import BeautifulSoup
|
|
||||||
soup = BeautifulSoup(html, 'html.parser')
|
|
||||||
|
|
||||||
# Look for stock indicators
|
|
||||||
if 'in stock' in html.lower():
|
|
||||||
product['stock'] = 'In Stock'
|
|
||||||
elif 'out of stock' in html.lower():
|
|
||||||
product['stock'] = 'Out of Stock'
|
|
||||||
elif 'available' in html.lower():
|
|
||||||
product['stock'] = 'Available'
|
|
||||||
else:
|
|
||||||
product['stock'] = 'Unknown'
|
|
||||||
|
|
||||||
print(f" Stock: {product.get('stock')}")
|
|
||||||
else:
|
|
||||||
print("✗ Not a Pokemon TCG product")
|
|
||||||
else:
|
|
||||||
print("✗ Failed to get product page")
|
|
||||||
|
|
||||||
print()
|
|
||||||
|
|
||||||
if products_found:
|
|
||||||
print(f"SUCCESS! Found {len(products_found)} Pokemon TCG products")
|
|
||||||
print()
|
|
||||||
|
|
||||||
# Save to JSON file
|
|
||||||
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
|
|
||||||
json_file = f'pokemon_tcg_products_real_{timestamp}.json'
|
|
||||||
|
|
||||||
with open(json_file, 'w') as f:
|
|
||||||
json.dump(products_found, f, indent=2)
|
|
||||||
|
|
||||||
print(f"✓ Saved product data: {json_file}")
|
|
||||||
|
|
||||||
# Generate PDF catalog
|
|
||||||
print("✓ Generating PDF catalog...")
|
|
||||||
|
|
||||||
try:
|
|
||||||
generator = PokemonTCGCatalogGenerator(json_file)
|
|
||||||
pdf_file = generator.generate_pdf()
|
|
||||||
|
|
||||||
if pdf_file:
|
|
||||||
print(f"✓ PDF catalog generated: {pdf_file}")
|
|
||||||
|
|
||||||
# Show file sizes
|
|
||||||
import os
|
|
||||||
if os.path.exists(pdf_file):
|
|
||||||
size = os.path.getsize(pdf_file) / 1024
|
|
||||||
print(f" PDF size: {size:.1f} KB")
|
|
||||||
|
|
||||||
# Count barcodes generated
|
|
||||||
barcode_dir = generator.barcodes_dir
|
|
||||||
if barcode_dir.exists():
|
|
||||||
barcodes = list(barcode_dir.glob('*.png'))
|
|
||||||
print(f" Barcodes generated: {len(barcodes)}")
|
|
||||||
|
|
||||||
print()
|
|
||||||
print("🎉 COMPLETE SUCCESS!")
|
|
||||||
print("Pokemon Discovery successfully:")
|
|
||||||
print(f" • Scraped {len(products_found)} real products from Dollar General")
|
|
||||||
print(" • Generated professional PDF catalog")
|
|
||||||
print(" • Created scannable UPC-A barcodes")
|
|
||||||
print(" • Used Unix-friendly timestamped files")
|
|
||||||
|
|
||||||
return True
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error generating PDF: {e}")
|
|
||||||
print("But product scraping was successful!")
|
|
||||||
return True
|
|
||||||
|
|
||||||
else:
|
|
||||||
print("No Pokemon TCG products found.")
|
|
||||||
print()
|
|
||||||
print("This could be due to:")
|
|
||||||
print("- Products no longer available")
|
|
||||||
print("- Changed product URLs")
|
|
||||||
print("- Need to find more current product URLs")
|
|
||||||
|
|
||||||
return False
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
success = test_known_products()
|
|
||||||
|
|
||||||
print()
|
|
||||||
print("=" * 50)
|
|
||||||
if success:
|
|
||||||
print("✅ Pokemon Discovery is fully functional!")
|
|
||||||
print(" Ready for production use with product URLs")
|
|
||||||
else:
|
|
||||||
print("⚠️ Product URL discovery needed")
|
|
||||||
print(" Core functionality confirmed working")
|
|
||||||
print("=" * 50)
|
|
||||||