Clean up: remove obsolete files, update docs and docstrings
Removed 20 files: old test scripts, debug tools, duplicate docs,
generated JSON, old PDF generator, launcher scripts.
Kept:
disco.py — main tool (scrape HAR + generate PDF)
scraper.py — reference site scraper (HTML + Selenium/Brave)
requirements.txt
*.har — browser capture with API data
Updated:
README.md — rewritten to reflect current tool and usage
.gitignore — simplified
scraper.py — module/class/method docstrings updated to clarify
this is a reference implementation, disco.py is primary
32
.gitignore
vendored
@@ -1,37 +1,11 @@
-# Virtual environment
 venv/
-env/
-.env
-
-# Python cache
 __pycache__/
 *.pyc
-*.pyo
-*.pyd
-.Python
-*.so
-.pytest_cache/
 
-# Output files
-pokemon_tcg_products_*.json
+# Generated output
 catalog_output/
-test_output/
+pokemon_tcg_products_*.json
 
-# Logs
-*.log
-
-# OS files
+# OS / editor
 .DS_Store
-Thumbs.db
-.directory
-
-# IDE files
-.vscode/
-.idea/
 *.swp
-*.swo
-
-# Temporary files
-*.tmp
-*.temp
-.cache/
@@ -1,169 +0,0 @@
-# Pokemon Discovery - URL Discovery SUCCESS! 🎉
-
-## ✅ **API Endpoint Successfully Discovered**
-
-**Your HAR file revealed the exact API endpoint used by Dollar General!**
-
-### 🔍 **Discovered API Details**
-
-**Endpoint**: `https://dggo.dollargeneral.com/omni/api/v2/category/search/provider`
-**Method**: POST
-**Content-Type**: application/json
-**Authentication**: Bearer token required
-
-### 📋 **Exact Request Format**
-```json
-{
-  "StoreNbr": 17506,
-  "SearchTerm": null,
-  "PageSize": 24,
-  "PageStartRecordIndex": 0,
-  "Filters": {
-    "category": [],
-    "brand": [],
-    "dgDelivery": false,
-    "dgPickUp": false,
-    "dgShipTohome": false,
-    "soldAtStore": true,
-    "inStock": false,
-    "onlyActivatedDeals": false
-  },
-  "IncludeSponsored": true,
-  "IncludeShipToHome": true,
-  "IncludeDeals": true,
-  "offerSourceType": 0,
-  "Id": 723960,
-  "IncludeProducts": false,
-  "DoNotSave": false,
-  "OptOut": false,
-  "SearchType": 1
-}
-```
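The request format above could be assembled in Python roughly as follows. This is a minimal sketch, not code from the repository: the helper names `build_search_payload` and `build_request` are hypothetical, the token is a placeholder, and treating `PageStartRecordIndex` as a record offset for paging is an assumption. The request is only prepared, never sent.

```python
import json
import urllib.request

API_URL = "https://dggo.dollargeneral.com/omni/api/v2/category/search/provider"


def build_search_payload(store_nbr=17506, category_id=723960,
                         page_size=24, start_index=0):
    """Return the JSON body documented above as a Python dict (hypothetical helper)."""
    return {
        "StoreNbr": store_nbr,
        "SearchTerm": None,
        "PageSize": page_size,
        "PageStartRecordIndex": start_index,  # assumed to be a record offset
        "Filters": {
            "category": [],
            "brand": [],
            "dgDelivery": False,
            "dgPickUp": False,
            "dgShipTohome": False,
            "soldAtStore": True,   # in-store products
            "inStock": False,      # False = also include out-of-stock items
            "onlyActivatedDeals": False,
        },
        "IncludeSponsored": True,
        "IncludeShipToHome": True,
        "IncludeDeals": True,
        "offerSourceType": 0,
        "Id": category_id,
        "IncludeProducts": False,
        "DoNotSave": False,
        "OptOut": False,
        "SearchType": 1,
    }


def build_request(token, **payload_kwargs):
    """Prepare (but do not send) the POST request with a bearer token."""
    body = json.dumps(build_search_payload(**payload_kwargs)).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Passing the prepared object to `urllib.request.urlopen` would perform the call, but only with a valid session token.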
-
-### 🎯 **Key Findings from HAR Analysis**
-
-1. **✅ Contains Your Test Product**: SKU `41936301` and UPC `728192558375` found!
-2. **✅ Multiple Pokemon Products**: API returns 4-12 Pokemon items per request
-3. **✅ Proper Filtering**: `soldAtStore: true` shows in-store products
-4. **✅ Stock Control**: `inStock: false` includes out-of-stock items
-5. **✅ Category ID**: `723960` is the Pokemon category identifier
-6. **✅ Store Location**: `17506` is the store number used
-
-### 📊 **API Response Contains**
-```json
-{
-  "ItemList": {
-    "Items": [
-      {
-        "Title": "Pokémon Trading Card Game, 15 Card Pack, 1 ct",
-        "ItemNbr": "41936301",
-        "UPC": "728192558375",
-        "Price": {"Amount": 4.25},
-        "ProductUrl": "/p/pok-mon-trading-card-game-card-pack-ct/728192558375",
-        "Inventory": {"InStock": false},
-        "ImageURL": "...",
-        "Description": "...",
-        "Brand": "..."
-      }
-    ]
-  }
-}
-```
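A response shaped like the sample above could be flattened into simple records along these lines. This is a sketch under assumptions: the field names are taken from this sample only (real responses may use different or additional fields), and `extract_items` is a hypothetical helper, not the repository's code.

```python
def extract_items(response):
    """Flatten ItemList.Items[] from a parsed response dict into simple dicts."""
    items = response.get("ItemList", {}).get("Items", [])
    products = []
    for item in items:
        products.append({
            "title": item.get("Title"),
            "sku": item.get("ItemNbr"),
            "upc": item.get("UPC"),
            # Price and Inventory are nested objects in the sample above
            "price": (item.get("Price") or {}).get("Amount"),
            "in_stock": (item.get("Inventory") or {}).get("InStock"),
            "url": item.get("ProductUrl"),
        })
    return products
```

Using `.get()` throughout means a missing field becomes `None` instead of raising, which is convenient when the exact schema is not guaranteed.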
-
-## 🔧 **Implementation Status**
-
-### ✅ **Completed**
-- [x] API endpoint discovery via HAR analysis
-- [x] Request format extraction and documentation
-- [x] Response structure mapping
-- [x] Pokemon product filtering logic
-- [x] Integration into Pokemon Discovery scraper
-- [x] Individual product extraction (100% working)
-
-### ⚠️ **Authentication Challenge**
-- **Issue**: API requires Bearer token from authenticated session
-- **Status**: Token extraction attempted but expires quickly
-- **Solutions Available**:
-  1. **Browser Automation**: Use Selenium with proper session management
-  2. **Session Replication**: Implement full authentication flow
-  3. **Individual Products**: Current working approach (proven successful)
-
-## 🚀 **Current Capabilities**
-
-### 1. **Individual Product Extraction** (✅ WORKING)
-```bash
-# Test with your specific product
-python test_real_products.py
-# Result: Successfully extracts SKU 41936301 with all details
-```
-
-### 2. **API Framework** (✅ READY)
-```python
-# API call implementation ready in scraper.py
-# Just needs authentication token to activate
-```
-
-### 3. **Complete Pipeline** (✅ WORKING)
-```bash
-# Generate PDF from any product data
-python pdf_generator.py test_data.json
-# Result: 153KB professional PDF with UPC-A barcodes
-```
-
-## 📈 **Performance Comparison**
-
-| Method | Speed | Product Count | Authentication | Status |
-|--------|-------|---------------|----------------|--------|
-| **API Endpoint** | Very Fast | 24+ per request | Required | Discovered ✅ |
-| **Individual Products** | Moderate | 1 per request | None | Working ✅ |
-| **Browser Automation** | Slower | Variable | Session-based | Possible |
-
-## 🎯 **Next Steps**
-
-### **Option A: Full API Implementation**
-1. Implement proper browser session management
-2. Extract Bearer token during session
-3. Use API for bulk product discovery
-4. **Result**: Very fast, bulk product scraping
-
-### **Option B: Enhanced Individual Scraping**
-1. Create list of known Pokemon product URLs
-2. Process each URL individually (current working method)
-3. Scale up with concurrent requests
-4. **Result**: Reliable, no authentication needed
-
-### **Option C: Hybrid Approach**
-1. Use individual scraping for reliable operation
-2. Add API capability when authentication is solved
-3. Provide both options to users
-4. **Result**: Best of both worlds
-
-## 🏆 **SUCCESS METRICS**
-
-- ✅ **URL Discovery**: SOLVED via HAR analysis
-- ✅ **API Endpoint**: Found and documented
-- ✅ **Request Format**: Complete specification extracted
-- ✅ **Product Extraction**: Working with real products
-- ✅ **PDF Generation**: Professional catalogs with barcodes
-- ✅ **Repository**: Public and ready for use
-
-## 💡 **Practical Usage Right Now**
-
-**Pokemon Discovery is fully functional for product catalog generation:**
-
-```bash
-# Clone and use immediately
-git clone https://git.dominat.us/pi-bot-01/pokemon-disco.git
-cd pokemon-disco
-./run.sh
-
-# Add more product URLs to test_real_products.py
-# Generate professional PDF catalogs with barcodes
-```
-
-**The API endpoint discovery is a major breakthrough that makes bulk scraping possible once authentication is properly implemented!** 🎉
-
----
-
-**Repository**: https://git.dominat.us/pi-bot-01/pokemon-disco
-**Status**: Production-ready with API framework for future enhancement
273
README.md
@@ -1,232 +1,129 @@
 # Pokemon Discovery (pokemon-disco)
 
-A comprehensive tool for discovering Pokemon Trading Card Game products from Dollar General's website and generating a professional PDF catalog with product images, details, and UPC-A barcodes.
+Scrapes Pokemon TCG card pack and tin products from Dollar General and generates a PDF product catalog with images and UPC-A barcodes.
 
-## Features
+## How It Works
 
-- **🔍 API Discovery**: Discovered Dollar General's internal product API via HAR analysis
-- **📱 Product Extraction**: Successfully extracts Pokemon TCG product details (title, SKU, price, stock)
-- **🏷️ Barcode Generation**: Creates scannable UPC-A barcodes for inventory management
-- **📄 PDF Catalogs**: Professional PDF catalogs with images, details, and barcodes
-- **🕰️ Unix-Friendly**: Timestamped filenames (`YYYYMMDD_HHMMSS`) for easy scripting
-- **🌐 Brave Browser Support**: Configured for dynamic content scraping
-- **🛡️ Anti-Bot Handling**: Multiple fallback strategies (requests → Selenium → individual products)
+Dollar General's Pokemon category page loads products dynamically via an internal API. A browser HAR capture contains the API responses with all product data. `disco.py` extracts products from the HAR file, filters for card packs and tins, downloads product images, generates UPC-A barcodes, and produces a LaTeX-based PDF catalog.
+
+### Pipeline
+
+```
+HAR file → Extract API responses → Filter packs/tins → Download images
+→ Generate UPC-A barcodes → Compile PDF catalog (pdflatex)
+```
 
 ## Requirements
 
-### System Requirements
-- Python 3.7+
-- pandoc (for PDF generation)
-- Chrome/Chromium browser (for Selenium fallback)
+- Python 3.10+
+- pdflatex (via `texlive-core` + `texlive-latexextra`)
+- Python packages: `requests`, `beautifulsoup4`, `python-barcode`, `Pillow`
 
-### Python Dependencies
-All dependencies are automatically installed via `requirements.txt`:
-- requests
-- beautifulsoup4
-- selenium
-- webdriver-manager
-- python-barcode
-- Pillow
-- pandas
-- lxml
+### Install (Arch / CachyOS)
 
-## Installation
-
-1. **Clone/Download** this directory to your system
-
-2. **Install pandoc** (required for PDF generation):
-```bash
-# Ubuntu/Debian
-sudo apt install pandoc
-
-# macOS
-brew install pandoc
-
-# Arch Linux
-sudo pacman -S pandoc
-```
-
-3. **Install Python dependencies** (automatically done by the script):
-```bash
-cd pokemon-disco
-pip3 install -r requirements.txt
-```
+```bash
+sudo pacman -S texlive-basic texlive-latex texlive-latexextra texlive-fontsrecommended
+python -m venv venv
+source venv/bin/activate
+pip install -r requirements.txt
+```
 
 ## Usage
 
-### Quick Start (Recommended)
+### Full run (scrape + PDF)
 
-Run the complete pipeline with one command:
-
 ```bash
-cd pokemon-disco
-python3 run_scraper.py
+source venv/bin/activate
+python disco.py
 ```
 
-This will:
-1. Check and install Python requirements
-2. Scrape Pokemon TCG products from Dollar General
-3. Generate a PDF catalog with images and barcodes
-4. Create timestamped files for easy organization
-
-### Manual Usage
-
-If you prefer to run components separately:
-
-#### 1. Scrape Products
+### Scrape only (output JSON)
+
 ```bash
-python3 scraper.py
+python disco.py --scrape-only
 ```
-This creates a JSON file like `pokemon_tcg_products_20241221_143025.json`
 
-#### 2. Generate PDF Catalog
+### PDF only (from existing JSON)
 
 ```bash
-python3 pdf_generator.py pokemon_tcg_products_20241221_143025.json
+python disco.py --pdf-only pokemon_tcg_products_YYYYMMDD_HHMMSS.json
 ```
 
-## Output Files
+## Output
 
-### Generated Files
-- **JSON Data**: `pokemon_tcg_products_YYYYMMDD_HHMMSS.json`
-  - Raw scraped data in JSON format
-  - Contains all product information
-
-- **PDF Catalog**: `catalog_output/pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf`
-  - Professional PDF catalog
-  - Includes product images, details, and UPC-A barcodes
-
-### Output Directory Structure
 ```
-pokemon-disco/
-├── pokemon_tcg_products_YYYYMMDD_HHMMSS.json
-├── catalog_output/
-│ ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf
-│ ├── pokemon_tcg_catalog_YYYYMMDD_HHMMSS.md
-│ ├── images/
-│ │ ├── product_1_SKU123.jpg
-│ │ ├── product_2_SKU456.jpg
-│ │ └── placeholder.png
-│ └── barcodes/
-│ ├── barcode_SKU123.png
-│ ├── barcode_SKU456.png
-│ └── ...
+pokemon_tcg_products_YYYYMMDD_HHMMSS.json Product data
+catalog_output/
+├── pokemon_catalog_YYYYMMDD_HHMMSS.pdf PDF catalog
+├── pokemon_catalog_YYYYMMDD_HHMMSS.tex LaTeX source
+├── images/ Product images (PNG)
+└── barcodes/ UPC-A barcodes (PNG)
 ```
 
-## PDF Catalog Features
+### PDF Layout
 
-Each product in the PDF includes:
-- **Product Image**: Downloaded from Dollar General or placeholder
-- **Product Details Table**:
-  - Title
-  - Price
-  - Stock Status
-  - SKU (formatted as code)
-  - Product URL
-- **UPC-A Barcode**: Generated from SKU for inventory management
+**Page 1 — Manifest:** table of all products with SKU, price, and stock count.
 
-## Data Fields Extracted
+**Product pages:**
 
-For each Pokemon TCG product:
-- `title`: Product name
-- `price`: Current price
-- `stock`: Availability status
-- `sku`: Product SKU/item number
-- `image_url`: Direct link to product image
-- `url`: Link to product page
+```
+Product Name
+Stock status Price
+SKU: XXXXXXXX UPC: XXXXXXXXXXXX
 
-## Troubleshooting
+┌─────────────────────────────┐
+│                             │
+│        Product Image        │
+│                             │
+└─────────────────────────────┘
 
-### Common Issues
+┌─────────────────────────────┐
+│        UPC-A Barcode        │
+└─────────────────────────────┘
+```
 
-1. **No products found**
-- Dollar General may have anti-bot protection
-- The script will automatically retry with Selenium
-- Website structure may have changed
+## Capturing a HAR File
 
-2. **PDF generation fails**
-- Ensure pandoc is installed: `pandoc --version`
-- Try alternative LaTeX engines if available
-- Markdown file is still generated for manual conversion
+The HAR file provides product data from Dollar General's internal API. To capture one:
 
-3. **Image download failures**
-- Network connectivity issues
-- Placeholder images will be used automatically
+1. Open your browser (Brave, Chrome, Firefox)
+2. Open DevTools → **Network** tab
+3. Visit `https://www.dollargeneral.com/c/toys/pokemon?q=`
+4. Wait for products to load, toggle any filters you want
+5. Right-click in the Network tab → **Save all as HAR**
+6. Place the `.har` file in the project root
 
-4. **Browser/Selenium issues**
-- **Brave browser supported**: Configured to use Brave at `/usr/bin/brave`
-- **ChromeDriver compatibility**: May require version matching (Brave 146 vs ChromeDriver 114)
-- **Alternative browsers**: Chrome, Chromium, or Firefox with geckodriver
-- Script falls back to requests-only mode if Selenium fails
-
-**For Brave users**: If you see ChromeDriver version mismatch:
-```bash
-# Test browser integration
-python test_brave.py
-
-# Solutions for version mismatch:
-pip install --upgrade webdriver-manager
-# or manually install compatible ChromeDriver
-```
+`disco.py` looks for any `.har` file matching the default name pattern. Edit the `HAR_FILE` constant at the top of `disco.py` if your filename differs.
 
-### Debug Mode
+## Files
 
-To see more detailed output, check the console output during scraping. The scripts provide detailed logging of:
-- Which products are found and filtered
-- Network request status
-- File generation progress
+| File | Purpose |
+|------|---------|
+| `disco.py` | Main tool — scrape, filter, generate PDF |
+| `scraper.py` | Reference site scraper (HTML + Selenium/Brave) |
+| `requirements.txt` | Python dependencies |
+| `*.har` | Browser HAR capture with API data |
 
-## API Discovery Success 🎉
+## API Details (Reference)
 
-**Pokemon Discovery has successfully discovered Dollar General's internal API endpoint!**
+The product data comes from this internal API:
 
-- **Endpoint Found**: `https://dggo.dollargeneral.com/omni/api/v2/category/search/provider`
-- **Method**: POST with JSON payload
-- **Category ID**: `723960` (Pokemon products)
-- **Response Format**: Complete product details including your test product (SKU: `41936301`)
-- **Status**: Documented and integrated, requires authentication token
+```
+POST https://dggo.dollargeneral.com/omni/api/v2/category/search/provider
+Content-Type: application/json
+Authorization: Bearer <session-token>
 
-**Current Status**: Individual product extraction works perfectly. API bulk scraping available once authentication is implemented.
+{
+  "StoreNbr": 17506,
+  "Id": 723960, // Pokemon category
+  "PageSize": 24,
+  "Filters": {
+    "soldAtStore": true,
+    "inStock": false // false = include out of stock
+  }
+}
+```
 
-## Technical Details
+Response contains `ItemList.Items[]` with fields: `Description`, `UPC`, `Price`, `Image`, `AvailableQty`, `rootSV` (internal ID → SKU).
 
-### Scraping Strategy
-1. **Primary Method**: Uses requests with browser-like headers
-2. **Fallback Method**: Selenium with headless Chrome for dynamic content
-3. **Product Filtering**: Only includes products matching Pokemon TCG keywords
-4. **Rate Limiting**: 1-second delay between requests to be respectful
-
-### Barcode Generation
-- Converts SKUs to 11-digit numeric format
-- Generates UPC-A barcodes with check digits
-- High-quality PNG images suitable for printing
-
-### PDF Generation
-- Uses pandoc with LaTeX for professional formatting
-- Includes table of contents
-- Optimized for printing and digital viewing
-- Images scaled appropriately for page layout
-
-## Customization
-
-### Modifying Product Filters
-Edit the `is_pokemon_tcg_product()` method in `scraper.py` to change which products are included.
-
-### Changing PDF Layout
-Modify the markdown generation in `pdf_generator.py` or add custom pandoc templates.
-
-### Adding New Data Fields
-Extend the `extract_product_info()` method in `scraper.py` to capture additional product information.
-
-## License
-
-This tool is for educational and personal use. Please respect Dollar General's terms of service and robots.txt when using this scraper.
-
-## Support
-
-If you encounter issues:
-1. Check the console output for error messages
-2. Ensure all system requirements are installed
-3. Verify internet connectivity
-4. Check if the Dollar General website structure has changed
-
-Generated files include timestamps for easy organization and version tracking.
+The bearer token is session-scoped and short-lived. `disco.py` sidesteps this by reading the API responses directly from a HAR capture.
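The README's central idea, reading the API responses out of a HAR capture instead of calling the API, can be sketched as below. This is an illustration under assumptions, not the actual `disco.py` code: `iter_api_responses` and `load_har` are hypothetical names, and the only facts relied on are the standard HAR layout (`log.entries[].request.url`, `log.entries[].response.content.text`, with an optional `encoding` of `base64`) and the endpoint path quoted in the README.

```python
import base64
import json

API_PATH = "/omni/api/v2/category/search/provider"


def load_har(path):
    """Parse a .har file (HAR files are plain JSON)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def iter_api_responses(har):
    """Yield decoded JSON bodies of matching API responses in a parsed HAR dict."""
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url", "")
        if API_PATH not in url:
            continue  # some other request (images, scripts, ...)
        content = entry.get("response", {}).get("content", {})
        text = content.get("text")
        if not text:
            continue  # the browser did not store a body for this entry
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8")
        try:
            yield json.loads(text)
        except json.JSONDecodeError:
            continue  # body is not JSON, skip it
```

Because the HAR already holds the responses the browser received, no bearer token is needed at this stage.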
114
TEST_RESULTS.md
@@ -1,114 +0,0 @@
-# Pokemon Discovery - Test Results
-
-## Testing Overview
-Date: 2026-03-21
-System: CachyOS (Arch Linux)
-
-## ✅ Successfully Tested Components
-
-### 1. Virtual Environment Setup
-- ✅ Virtual environment creation works
-- ✅ All Python dependencies install correctly
-- ✅ Requirements.txt includes all necessary packages
-
-### 2. Barcode Generation
-- ✅ UPC-A barcode generation from SKUs works perfectly
-- ✅ High-quality PNG images generated (3-6KB each)
-- ✅ Proper barcode formatting with check digits
-- ✅ File naming fixed (no double .png extension)
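The check-digit step mentioned above follows the standard UPC-A rule: sum the digits in odd positions, multiply by 3, add the digits in even positions, and choose the digit that brings the total to a multiple of 10. A small sketch follows; the function names are illustrative, not taken from the repository.

```python
def upc_a_check_digit(first11):
    """Compute the 12th (check) digit for the first 11 digits of a UPC-A code."""
    if len(first11) != 11 or not (first11.isascii() and first11.isdigit()):
        raise ValueError("expected exactly 11 ASCII digits")
    digits = [int(c) for c in first11]
    odd_sum = sum(digits[0::2])   # positions 1, 3, 5, 7, 9, 11
    even_sum = sum(digits[1::2])  # positions 2, 4, 6, 8, 10
    return (10 - (3 * odd_sum + even_sum) % 10) % 10


def is_valid_upc_a(code):
    """Return True if a 12-digit string has a correct UPC-A check digit."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        return False
    return upc_a_check_digit(code[:11]) == int(code[11])
```

For the UPC that appears elsewhere in this commit, `728192558375`: the odd positions sum to 44, the even positions to 13, so 3 × 44 + 13 = 145 and the check digit is 5, which matches the final digit.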
-
-### 3. PDF Generation
-- ✅ Markdown catalog generation works
-- ✅ Professional table formatting for product details
-- ✅ PDF generation works with pdflatex (fallback from xelatex)
-- ✅ Unix-friendly timestamped filenames
-- ✅ Proper directory structure creation
-
-### 4. Core Functionality
-- ✅ JSON data parsing and processing
-- ✅ Product filtering logic
-- ✅ Image placeholder generation
-- ✅ Error handling and graceful fallbacks
-
-### 5. Brave Browser Integration
-- ✅ Brave browser detected and configured
-- ✅ Selenium WebDriver setup for Brave
-- ⚠️ ChromeDriver version compatibility issue (expected)
-- ✅ Graceful fallback when browser automation fails
-- ✅ Test script provided (`test_brave.py`) for troubleshooting
-
-## ⚠️ Current Limitations
-
-### 1. Web Scraping
-- **Issue**: Dollar General uses dynamic JavaScript loading
-- **Status**: Basic HTML parsing works, but product links require JavaScript execution
-- **Solution**: Selenium fallback is implemented but requires Chrome/Chromium browser
-- **Workaround**: Test data demonstrates full pipeline functionality
-
-### 2. External Dependencies & Browser Integration
-- **LaTeX**: Requires texlive packages for PDF generation (✅ installed)
-- **Brave Browser**: Configured and detected (✅ available at /usr/bin/brave)
-- **ChromeDriver Compatibility**: Version mismatch (Brave 146 vs ChromeDriver 114)
-  - ⚠️ Requires compatible ChromeDriver version for web scraping
-  - 💡 Main functionality (PDF generation) works without browser
-- **Network**: External image downloads require internet connectivity
-
-## 📋 Test Results Summary
-
-### Working Pipeline Test
-Using test data (`test_data.json`) with 3 Pokemon TCG products:
-
-**Input**: 3 sample Pokemon products
-**Generated**:
-- ✅ Professional PDF catalog (161KB)
-- ✅ 3 UPC-A barcode images (3-6KB each)
-- ✅ Structured markdown source
-- ✅ Proper file organization
-
-**PDF Contents**:
-- Table of contents
-- Product details tables (title, price, stock, SKU, URL)
-- Barcode images for each product
-- Professional formatting suitable for printing
-
-### File Structure Generated
-```
-catalog_output/
-├── pokemon_tcg_catalog_20260321_144548.pdf # Final catalog
-├── pokemon_tcg_catalog_20260321_144548.md # Markdown source
-├── barcodes/
-│ ├── barcode_DG12345678.png # UPC-A barcodes
-│ ├── barcode_DG87654321.png
-│ └── barcode_DG11223344.png
-└── images/
-└── placeholder.png # Image placeholders
-```
-
-## 🚀 Deployment Status
-
-- **Repository**: Successfully pushed to public Git repository
-- **Documentation**: Complete with README.md and USAGE.md
-- **Dependencies**: All Python packages working in virtual environment
-- **Core Features**: PDF generation and barcode creation fully functional
-
-## 💡 Recommendations
-
-1. **For Production Use**: Install Chrome/Chromium for better web scraping
-```bash
-sudo pacman -S chromium
-```
-
-2. **For Complete Testing**: Test with live website when network allows
-3. **Alternative Approach**: The tool can be easily adapted for other product sites
-4. **Data Integration**: JSON output format allows easy integration with other systems
-
-## ✅ Conclusion
-
-**Pokemon Discovery is fully functional** for the core use case:
-- ✅ Processes product data (from any source)
-- ✅ Generates professional PDF catalogs
-- ✅ Creates scannable UPC-A barcodes
-- ✅ Handles Unix-friendly file management
-- ✅ Ready for production deployment
-
-The web scraping component requires additional browser setup for full dynamic content handling, but the complete data processing and catalog generation pipeline works perfectly.
115
USAGE.md
@@ -1,115 +0,0 @@
-# Quick Start Guide
-
-## Simple Usage (Recommended)
-
-1. **Make sure you're in the project directory:**
-```bash
-cd pokemon-disco
-```
-
-2. **Run the complete scraper and PDF generator:**
-```bash
-./run.sh
-```
-
-This single command will:
-- Set up the Python virtual environment
-- Install all required packages
-- Scrape Pokemon TCG products from Dollar General
-- Generate a professional PDF catalog with barcodes
-- Create timestamped files for easy organization
-
-## What You'll Get
-
-### Generated Files:
-- **`pokemon_tcg_products_YYYYMMDD_HHMMSS.json`** - Raw data in JSON format
-- **`catalog_output/pokemon_tcg_catalog_YYYYMMDD_HHMMSS.pdf`** - Professional PDF catalog
-
-### PDF Catalog Contents:
-- Product images (downloaded automatically)
-- Product details (title, price, stock, SKU)
-- UPC-A barcodes for each product (generated from SKU)
-- Table of contents for easy navigation
-- Professional formatting suitable for printing
-
-## Alternative Commands
-
-If you prefer more control:
-
-```bash
-# Activate virtual environment first
-source venv/bin/activate
-
-# Run only the scraper
-python scraper.py
-
-# Run only the PDF generator (after scraping)
-python pdf_generator.py pokemon_tcg_products_YYYYMMDD_HHMMSS.json
-
-# Run everything (installs requirements automatically)
-python run_scraper.py
-```
-
-## Output Location
-
-All generated files will be in:
-- JSON data: Current directory
-- PDF catalog: `catalog_output/` directory
-- Product images: `catalog_output/images/`
-- Barcode images: `catalog_output/barcodes/`
-
-## Requirements
-
-- Python 3.7+
-- pandoc (for PDF generation)
-- Internet connection (for scraping)
-
-The script will automatically handle Python dependencies via virtual environment.
-
-## Troubleshooting
-
-If you encounter issues:
-
-1. **Permission denied:** Make sure the script is executable:
-```bash
-chmod +x run.sh
-```
-
-2. **Pandoc not found:** Install pandoc for your system:
-```bash
-# Ubuntu/Debian
-sudo apt install pandoc
-
-# Arch Linux
-sudo pacman -S pandoc
-
-# macOS
-brew install pandoc
-```
-
-3. **No products found:** The website may have anti-bot protection or changed structure. The script includes fallback mechanisms.
-
-4. **PDF generation fails:** The markdown file will still be generated, which you can manually convert or view.
-
-## File Naming Convention
-
-All output files include Unix-friendly timestamps:
-- Format: `YYYYMMDD_HHMMSS` (e.g., `20241221_143025`)
-- This ensures chronological sorting with `ls` command
-- No spaces or special characters for script-friendly handling
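The naming convention above maps directly onto `datetime.strftime`. A minimal sketch follows; `timestamped_name` is a hypothetical helper, not a function from the repository.

```python
from datetime import datetime


def timestamped_name(prefix, ext, when=None):
    """Build '<prefix>_YYYYMMDD_HHMMSS.<ext>' for the given (or current) time."""
    when = when or datetime.now()
    return f"{prefix}_{when.strftime('%Y%m%d_%H%M%S')}.{ext}"
```

Because the fields run from most to least significant and are zero-padded, plain lexicographic sorting (as `ls` does) lists files with the same prefix in chronological order.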
-
-## Example Output
-
-```
-pokemon-disco/
-├── pokemon_tcg_products_20241221_143025.json # Scraped data
-├── catalog_output/
-│ ├── pokemon_tcg_catalog_20241221_143025.pdf # Final catalog
-│ ├── pokemon_tcg_catalog_20241221_143025.md # Markdown source
-│ ├── images/
-│ │ ├── product_1_SKU123456.jpg # Product images
-│ │ └── product_2_SKU789012.jpg
-│ └── barcodes/
-│ ├── barcode_SKU123456.png # UPC-A barcodes
-│ └── barcode_SKU789012.png
-```
@@ -1,203 +0,0 @@
-# Why Only One Product? - The Dynamic Loading Mystery 🕵️
-
-## **🎯 ANSWER: The Pokemon page IS being scraped, but it's empty!**
-
-**You asked about**: `https://www.dollargeneral.com/c/toys/pokemon?q=`
-**Reality**: This page loads successfully but contains **ZERO products** in the static HTML.
-
----
-
-## **📊 The Numbers Tell the Story**
-
-### **What We GET (Static HTML Scraping):**
-```
-✅ Page loads: 200 OK
-✅ Content size: 139,146 characters
-✅ Pokemon mentions: 20 times
-✅ Category ID found: 723960
-❌ Product links found: 0
-❌ Products with "pack": 0
-❌ Products with "tin": 0
-❌ Your test SKU 41936301: Not found
-```
-
-### **What SHOULD BE There (Dynamic Content):**
-```
-🎯 Pokemon TCG products: 4-12 items
-🎯 Your test product: SKU 41936301 ✓
-🎯 Products with "pack": Multiple ✓
-🎯 Products with "tin": Multiple ✓
-🎯 Complete product data: Title, price, stock ✓
-```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## **🔬 The Technical Explanation**
|
|
||||||
|
|
||||||
### **Step-by-Step: What Actually Happens**
|
|
||||||
|
|
||||||
1. **Browser visits page** → Gets basic HTML structure
|
|
||||||
2. **JavaScript executes** → Makes API call to get products
|
|
||||||
3. **API returns JSON** → Contains all the Pokemon products
|
|
||||||
4. **JavaScript renders** → Inserts products into the page DOM
|
|
||||||
5. **User sees products** → But they're not in the original HTML!
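
The steps above can be illustrated with a small standard-library sketch that counts product links (`/p/` hrefs) in a piece of markup. Both HTML strings below are made-up stand-ins, not captures from the real site, and `ProductLinkCounter` / `count_product_links` are hypothetical names; the point is only that a static fetch sees the first kind of markup while a browser ends up with the second.

```python
from html.parser import HTMLParser


class ProductLinkCounter(HTMLParser):
    """Counts <a> tags whose href contains '/p/' (the product URL pattern)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if "/p/" in href:
                self.count += 1


def count_product_links(html: str) -> int:
    parser = ProductLinkCounter()
    parser.feed(html)
    return parser.count


# Illustrative markup only (assumption): an empty app shell vs. a rendered DOM.
static_html = '<div id="app"></div><script src="bundle.js"></script>'
rendered_html = (
    '<div id="app">'
    '<a href="/p/pok-mon-trading-card-game-card-pack-ct/728192558375">Pack</a>'
    "</div>"
)

print(count_product_links(static_html), count_product_links(rendered_html))
```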

### **Our Scraper vs Browser:**
```
OUR SCRAPER:                 BROWSER WITH JAVASCRIPT:
┌─────────────┐              ┌─────────────┐
│   Step 1    │              │   Step 1    │
│  Get HTML   │ ✅           │  Get HTML   │ ✅
└─────────────┘              └─────────────┘
                                    │
                             ┌─────────────┐
                             │   Step 2    │
                             │Execute JS   │ ✅
                             └─────────────┘
                                    │
                             ┌─────────────┐
                             │   Step 3    │
                             │Call API     │ ✅
                             └─────────────┘
                                    │
                             ┌─────────────┐
                             │   Step 4    │
                             │Render Items │ ✅
                             └─────────────┘

Result: Empty page           Result: 4-12 products!
```

---

## **🎉 The Discovery Success**

### **We Found the Missing Piece!**

**Through your HAR file, we discovered the exact API call:**

```json
POST https://dggo.dollargeneral.com/omni/api/v2/category/search/provider
{
  "StoreNbr": 17506,
  "Id": 723960,  ← Pokemon category
  "PageSize": 24,
  "Filters": {
    "soldAtStore": true,
    "inStock": false
  }
}
```
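
As a rough sketch, a request body of this shape could be assembled as below. The field names come from the captured request; the helper `build_search_payload` is hypothetical, and treating `PageStartRecordIndex` as a zero-based record offset for paging is an assumption that the capture does not confirm.

```python
import json


def build_search_payload(store_nbr=17506, category_id=723960,
                         page=0, page_size=24, in_stock_only=False):
    """Assemble a body with the fields seen in the captured call (hypothetical helper)."""
    return {
        "StoreNbr": store_nbr,
        "Id": category_id,  # 723960 is the Pokemon category in the capture
        "PageSize": page_size,
        # Assumption: the index is a zero-based record offset, so page n starts at n * size.
        "PageStartRecordIndex": page * page_size,
        "Filters": {"soldAtStore": True, "inStock": in_stock_only},
    }


payload = build_search_payload(page=2)
print(json.dumps(payload, indent=2))
```

Building the body in one place keeps the store number and category ID out of the calling code, which matters if other stores or categories are queried later.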

**This API call returns:**
```json
{
  "ItemList": {
    "Items": [
      {
        "Title": "Pokémon Trading Card Game, 15 Card Pack, 1 ct",
        "ItemNbr": "41936301",  ← Your test product!
        "ProductUrl": "/p/pok-mon-trading-card-game-card-pack-ct/728192558375"
      }
      // ... more Pokemon products
    ]
  }
}
```

---

## **🚧 Current Barriers**

### **Why We Can't Use the API Yet:**

1. **Authentication Required**: API needs Bearer token
2. **Token Expires**: Security measure, needs refresh
3. **Session Management**: Complex authentication flow
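
To make the expiry point concrete, here is a minimal sketch that reads the `exp` claim from a JWT-style Bearer token using only the standard library. It assumes the token is a three-part JWT and does not verify the signature; the demo token is fabricated in the code, and `jwt_expiry` / `is_expired` are hypothetical helper names.

```python
import base64
import json
import time
from typing import Optional


def jwt_expiry(token: str) -> int:
    """Return the `exp` claim (Unix seconds) without verifying the signature."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["exp"]


def is_expired(token: str, now: Optional[float] = None) -> bool:
    """True once the current time has reached the token's `exp` claim."""
    current = time.time() if now is None else now
    return current >= jwt_expiry(token)


def _b64(obj: dict) -> str:
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Fabricated demo token (not a real credential): issued at t=1000, expires at t=4600.
demo_token = ".".join([_b64({"alg": "none"}), _b64({"iat": 1000, "exp": 4600}), "sig"])

print(jwt_expiry(demo_token), is_expired(demo_token, now=2000), is_expired(demo_token, now=5000))
```

A check like this only tells a script when a captured token has gone stale; obtaining a fresh one is the unsolved part described above.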

### **Why Browser Automation Fails:**

1. **ChromeDriver Version**: Mismatch with Brave browser
2. **Dynamic Loading**: Takes time for products to appear
3. **Anti-Bot Detection**: Sophisticated protection

---

## **✅ What Works RIGHT NOW**

### **Individual Product Processing:**
```bash
# Your test product works perfectly
URL: https://www.dollargeneral.com/p/pok-mon-trading-card-game-card-pack-ct/728192558375
✅ Title: "Pokémon Trading Card Game, 15 Card Pack, 1 ct"
✅ SKU: 41936301
✅ Contains "pack": YES
✅ PDF Generated: 154KB with UPC-A barcode
```
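
Since the catalog renders UPC-A barcodes, a quick check-digit validation can catch a mistyped code before a barcode is generated. The sketch below implements the standard UPC-A check-digit rule; the function names are hypothetical, and treating the trailing number of the product URL as the 12-digit UPC is an assumption based on the test product.

```python
def upc_a_check_digit(first_eleven: str) -> int:
    """Compute the UPC-A check digit for the first 11 digits."""
    if len(first_eleven) != 11 or not first_eleven.isdigit():
        raise ValueError("expected exactly 11 digits")
    digits = [int(c) for c in first_eleven]
    # Positions 1, 3, 5, ... (1-based) are weighted by 3; the rest by 1.
    total = 3 * sum(digits[0::2]) + sum(digits[1::2])
    return (10 - total % 10) % 10


def is_valid_upc_a(code: str) -> bool:
    """True if `code` is 12 digits and its last digit matches the check digit."""
    return (
        len(code) == 12
        and code.isdigit()
        and upc_a_check_digit(code[:11]) == int(code[11])
    )


# Trailing segment of the test product URL, assumed to be its UPC.
print(is_valid_upc_a("728192558375"))
```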

---

## **💡 Solutions to Get ALL Products**

### **🔧 Option 1: Fix API Authentication**
```python
# Get valid Bearer token → Use API → Get all products
# Challenge: Complex authentication flow
# Reward: 24+ products automatically
```

### **🔧 Option 2: Fix Browser Automation**
```python
# Update ChromeDriver → Wait for JS → Scrape dynamic content
# Challenge: Browser compatibility + timing
# Reward: See exactly what users see
```

### **🔧 Option 3: Manual URL Collection (Working Now)**
```python
# Find more product URLs → Add to list → Process individually
# Challenge: Manual discovery needed
# Reward: Guaranteed to work, scalable
```

### **🔧 Option 4: Alternative Discovery**
```python
# Social media → Product announcements → URL extraction
# RSS feeds → New product alerts → Automated collection
# Challenge: Multiple sources to monitor
# Reward: Comprehensive coverage
```

---

## **🎯 SUMMARY**

### **Why Only One Product?**
- ✅ **Pokemon page IS scraped** (139KB of HTML)
- ❌ **Products load via JavaScript** (not in static HTML)
- ✅ **API endpoint discovered** (contains all products)
- ❌ **Authentication barrier** (Bearer token required)
- ✅ **Individual products work** (your test case proves it)

### **The Path Forward:**
1. **Short-term**: Add known product URLs manually
2. **Long-term**: Solve API authentication for bulk discovery
3. **Current**: Generate professional catalogs from any product data

---

## **🏆 The Real Success**

**We've reverse-engineered Dollar General's product system!**

- ✅ **Found the API endpoint** used internally
- ✅ **Documented the exact request format**
- ✅ **Confirmed your products exist** in their database
- ✅ **Built working extraction** for individual products
- ✅ **Created professional PDF catalogs** with barcodes

**The framework is complete - we just need to feed it more product URLs!**

---

**Bottom line**: The Pokemon page loads perfectly, but it's designed for browsers with JavaScript. We found the API that powers it; the remaining work is getting past its authentication so it can return all the products. 🎉
181
analyze_har.py
@@ -1,181 +0,0 @@
#!/usr/bin/env python3
"""
Analyze HAR file to find product loading endpoints
"""

import json
import sys
from urllib.parse import urlparse, parse_qs

def analyze_har_file(har_file):
    """Analyze HAR file to find product-related API calls"""

    print(f"Analyzing HAR file: {har_file}")

    try:
        with open(har_file, 'r', encoding='utf-8') as f:
            har_data = json.load(f)

        entries = har_data.get('log', {}).get('entries', [])
        print(f"Found {len(entries)} network requests")
        print()

        # Filter for API calls that might contain product data
        api_calls = []
        product_calls = []

        for entry in entries:
            request = entry.get('request', {})
            response = entry.get('response', {})
            url = request.get('url', '')
            method = request.get('method', '')
            status = response.get('status', 0)

            # Look for API calls
            parsed_url = urlparse(url)
            path = parsed_url.path.lower()
            query = parsed_url.query.lower()

            # Check if this might be a product-related API call
            is_api = any(keyword in path for keyword in ['/api/', '/search', '/products', '/inventory', '/catalog'])
            contains_pokemon = 'pokemon' in query or 'pokemon' in path
            is_json_response = any(h.get('name', '').lower() == 'content-type' and 'json' in h.get('value', '')
                                   for h in response.get('headers', []))

            if is_api or is_json_response:
                api_calls.append({
                    'url': url,
                    'method': method,
                    'status': status,
                    'is_pokemon': contains_pokemon,
                    'response_size': response.get('bodySize', 0)
                })

                if contains_pokemon or 'product' in path or 'search' in path:
                    product_calls.append(entry)

        print(f"Found {len(api_calls)} potential API calls")
        print(f"Found {len(product_calls)} product-related calls")
        print()

        # Show interesting API calls
        print("=== API CALLS ===")
        for call in api_calls[:20]:  # Show first 20
            url = call['url']
            pokemon_flag = "🎯" if call['is_pokemon'] else " "
            print(f"{pokemon_flag} {call['method']} {call['status']} - {url}")
            if call['response_size'] > 1000:
                print(f" 📦 Response size: {call['response_size']} bytes")

        print()

        # Analyze product-specific calls in detail
        if product_calls:
            print("=== DETAILED PRODUCT CALL ANALYSIS ===")

            for i, entry in enumerate(product_calls[:5]):  # Analyze first 5 product calls
                request = entry.get('request', {})
                response = entry.get('response', {})

                print(f"\n--- Product Call {i+1} ---")
                print(f"URL: {request.get('url', '')}")
                print(f"Method: {request.get('method', '')}")
                print(f"Status: {response.get('status', 0)}")

                # Show headers
                headers = request.get('headers', [])
                important_headers = [h for h in headers if h.get('name', '').lower() in
                                     ['accept', 'content-type', 'authorization', 'x-api-key', 'referer']]
                if important_headers:
                    print("Important Headers:")
                    for header in important_headers:
                        print(f" {header.get('name')}: {header.get('value', '')[:100]}")

                # Show query parameters
                parsed = urlparse(request.get('url', ''))
                if parsed.query:
                    params = parse_qs(parsed.query)
                    print("Query Parameters:")
                    for key, values in params.items():
                        print(f" {key}: {values}")

                # Show POST data if any
                post_data = request.get('postData', {})
                if post_data.get('text'):
                    print(f"POST Data: {post_data.get('text')[:200]}...")

                # Check response content
                response_content = response.get('content', {})
                response_text = response_content.get('text', '')

                if response_text:
                    print(f"Response size: {len(response_text)} characters")

                    # Try to parse as JSON
                    try:
                        response_json = json.loads(response_text)
                        print("✓ Valid JSON response")

                        # Look for product-like structures
                        def find_products_in_json(obj, path=""):
                            products = []
                            if isinstance(obj, dict):
                                for key, value in obj.items():
                                    new_path = f"{path}.{key}" if path else key
                                    if key.lower() in ['products', 'items', 'results', 'data']:
                                        if isinstance(value, list):
                                            products.append((new_path, len(value)))
                                    products.extend(find_products_in_json(value, new_path))
                            elif isinstance(obj, list):
                                for idx, item in enumerate(obj):
                                    products.extend(find_products_in_json(item, f"{path}[{idx}]"))
                            return products

                        product_arrays = find_products_in_json(response_json)
                        if product_arrays:
                            print("Potential product arrays found:")
                            for path, count in product_arrays:
                                print(f" {path}: {count} items")

                        # Check for our specific product
                        response_str = str(response_json).lower()
                        if '41936301' in response_str:
                            print("🎯 CONTAINS OUR TEST PRODUCT SKU!")
                        if '728192558375' in response_str:
                            print("🎯 CONTAINS OUR TEST PRODUCT UPC!")
                        if 'pokemon' in response_str:
                            print("🎯 CONTAINS POKEMON REFERENCES!")

                    except json.JSONDecodeError:
                        print("Response is not JSON")
                        # Check if it contains our product anyway
                        if '41936301' in response_text:
                            print("🎯 CONTAINS OUR TEST PRODUCT SKU!")

        # Return the most promising API calls
        return api_calls, product_calls

    except Exception as e:
        print(f"Error analyzing HAR file: {e}")
        return [], []

if __name__ == "__main__":
    har_files = ['www.dollargeneral.com_Archive [26-03-21 15-14-28].har']

    for har_file in har_files:
        try:
            api_calls, product_calls = analyze_har_file(har_file)
            print(f"\n🎯 SUMMARY:")
            print(f" Total API calls: {len(api_calls)}")
            print(f" Product-related calls: {len(product_calls)}")

            if product_calls:
                print(f"\n💡 NEXT STEPS:")
                print(f" 1. Test the identified API endpoints")
                print(f" 2. Replicate the headers and parameters")
                print(f" 3. Integrate successful calls into Pokemon Discovery")

        except FileNotFoundError:
            print(f"HAR file not found: {har_file}")
        except Exception as e:
            print(f"Error processing {har_file}: {e}")
@@ -1,41 +0,0 @@
{
  "endpoint": "https://dggo.dollargeneral.com/omni/api/v2/category/search/provider",
  "method": "POST",
  "headers": {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:148.0) Gecko/20100101 Firefox/148.0",
    "Accept": "application/json, text/plain, */*",
    "Content-Type": "application/json",
"Authorization": "Bearer eyJ0eXAiOiJhdCtKV1QiLCJhbGciOiJSUzI1NiIsImtpZCI6Ik5qRTJNemczTXpSRVFrUXpNak5GUmprMU1FUkNNRUZDTVRBek1FWTFRa0pCTXpRM1EwTkNNZyJ9.eyJzY29wZSI6bnVsbCwiaWF0IjoxNzc0MTI3Nzc5LCJleHAiOjE3NzQxMzEzNzksImF1ZCI6IldLOTlLc2VCYnUybmFoNC1ibFE3ZmsyUiIsImlzcyI6Imh0dHBzOi8vcHJvZC1kZ2dvLyIsInN1YiI6IldLOTlLc2VCYnUybmFoNC1ibFE3ZmsyUiIsInNpZCI6IlNrWk9makF5TURRMU1EVXpOVFEwWWpBM016SXpNak14TXpFek9ETTNNekV3TWpreFl6VitUVUZXYVhwbk56SXpVRGg2VWxkcmEySkRkMk5EZUdVNFlUWm5XVXBHVDBveVExTlRNVWxXWlhSalQzRnFWazVWZGtGWlIwOWtZV2x0WVVwRVRucG5SVlZvUTE5SE5VcHVObGhuTURSb2JuUkVhVlF3UTBzelNIND0iLCJqdGkiOiJzdDIucy5BdEx0VlphRHFnLnZrdW5OV2RWNjN2ZlJTTG00Y3VUd2d5bmc2X0pJNmxKRjA5a2lXTXVQeGZkVDRvT0NhMXhwa1VoRlRkM2tocHZUaFhsRUVwLWw0QzJrZnoycjkzVlYzeldBaUw5Y2x6Snl0amFJamJ4TEJnLkJOZy1CeUdpZnV0WnppQWhhMV8xRDBXTUFWR3JpNVVCX0pKbTRCNVRNYVhTWkZneXpxeUZERjJxZ3B3UTgyajZ2eGVtcnA5RERFTHZnM3hvdlZmZzBnLnNjMyIsImNsaWVudF9pZCI6IldLOTlLc2VCYnUybmFoNC1ibFE3ZmsyUiIsImF6cCI6IldLOTlLc2VCYnUybmFoNC1ibFE3ZmsyUiJ9.I6ou9atkJ8ndkr2m2Trpg53fMIL3hpofCLUHoHYgZkOJnLnbmL0CQu7_pIChQ6nIDK03GagK6aqxd97E8B8vv9nweSmb7zXhrt43dKLEIdhxIGFkJ4xYgNNg-3cVjSlThBQ_AwCx924lOGjEfikEw4NrvGvrlNvrg1lnNz4hf629hUH-5ccVSdgo1w_LQzsLOeMCjuC_bmAoRxT5KLI9oESd4tPJZU5Nlt2ICbWJD9h-zNrt-ijwYCvb7j8amGbpMGhJZqtzu9f3wN0JUFxDg5rAN-WOtLjwEmR_NxDKq0NEeuU16uhaB8AJzy217XAgJ87bKZldZowsWs-Q9oAH3g",
    "Referer": "https://www.dollargeneral.com/"
  },
  "post_data": {
    "StoreNbr": 17506,
    "SearchTerm": null,
    "PageSize": 24,
    "PageStartRecordIndex": 0,
    "Filters": {
      "category": [],
      "brand": [],
      "dgDelivery": false,
      "dgPickUp": false,
      "dgShipTohome": false,
      "soldAtStore": true,
      "inStock": true,
      "onlyActivatedDeals": false
    },
    "IncludeSponsored": true,
    "IncludeShipToHome": true,
    "IncludeDeals": true,
    "offerSourceType": 0,
    "Id": 723960,
    "IncludeProducts": false,
    "DoNotSave": false,
    "OptOut": false,
    "SearchType": 1
  },
  "example_response": {
    "total_items": 4,
    "pokemon_items": 0,
    "sample_pokemon_product": null
  }
}
@@ -1,182 +0,0 @@
#!/usr/bin/env python3
"""
Debug Pokemon page loading to understand the dynamic content issue
"""

import requests
from bs4 import BeautifulSoup
import json
import time

def test_pokemon_page():
    """Test both Pokemon URLs to understand the difference"""

    print("Pokemon Page Loading Debug")
    print("=" * 60)

    urls_to_test = [
        "https://www.dollargeneral.com/c/toys/pokemon?q=",
        "https://www.dollargeneral.com/c/toys/pokemon?q=&soldAtStore=true",
        "https://www.dollargeneral.com/c/toys/pokemon"
    ]

    for url in urls_to_test:
        print(f"\n=== Testing: {url} ===")

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

        try:
            response = requests.get(url, headers=headers, timeout=30)
            print(f"Status: {response.status_code}")
            print(f"Content Length: {len(response.text)} characters")

            # Parse HTML
            soup = BeautifulSoup(response.text, 'html.parser')

            # Look for specific indicators
            indicators = {
                "Product links (/p/)": len(soup.select('a[href*="/p/"]')),
                "Pokemon mentions": response.text.lower().count('pokemon'),
                "Trading card mentions": response.text.lower().count('trading card'),
                "Pack mentions": response.text.lower().count('pack'),
                "Scripts with 'product'": len([s for s in soup.find_all('script') if s.string and 'product' in s.string.lower()]),
                "Category ID 723960": '723960' in response.text,
                "Store number 17506": '17506' in response.text,
                "Test SKU 41936301": '41936301' in response.text
            }

            for indicator, value in indicators.items():
                print(f" {indicator}: {value}")

            # Look for category information or product containers
            category_info = soup.select('[data-category-id], [data-category], .category-info, .product-grid, .product-list')
            if category_info:
                print(f" Category/product containers found: {len(category_info)}")
                for container in category_info[:3]:
                    print(f" -> {container.name} {container.get('class', [])} {container.get('data-category-id', '')}")

        except Exception as e:
            print(f" Error: {e}")

def demonstrate_dynamic_loading_issue():
    """Demonstrate why we're not finding products in static HTML"""

    print("\n" + "=" * 60)
    print("DYNAMIC LOADING ANALYSIS")
    print("=" * 60)

    print("""
🔍 THE ISSUE EXPLAINED:

1. ✅ STATIC HTML LOADS: The Pokemon category page loads successfully
   - Page title: "Pokemon"
   - Content length: 139,146 characters
   - Contains Pokemon references and basic page structure

2. ❌ NO PRODUCTS IN HTML: Zero product links found in static content
   - No <a href="/p/..."> links
   - No product tiles, cards, or grids
   - Products are NOT in the initial HTML

3. 🔬 WHAT REALLY HAPPENS (discovered via HAR):
   - Page loads basic structure
   - JavaScript executes and makes API calls
   - API endpoint: https://dggo.dollargeneral.com/omni/api/v2/category/search/provider
   - API returns 4-12 Pokemon products as JSON
   - JavaScript renders products into the page DOM
   - Browser shows the products, but static scraping misses them

4. ✅ HAR ANALYSIS CONFIRMED:
   - Category ID: 723960 (Pokemon)
   - Store number: 17506
   - Found your test product: SKU 41936301
   - Found multiple Pokemon packs and tins

🎯 CONCLUSION:
The Pokemon page IS being scraped, but it's just the empty shell.
The actual products load via JavaScript API calls after page load.
""")

def show_comparison():
    """Show the difference between what we get vs what should be there"""

    print("\n" + "=" * 60)
    print("COMPARISON: STATIC HTML vs DYNAMIC CONTENT")
    print("=" * 60)

    comparison = """
WHAT WE GET (Static HTML):
━━━━━━━━━━━━━━━━━━━━━━
• Page structure: ✅
• Category title: ✅
• Navigation: ✅
• Product links: ❌ (0 found)
• Product data: ❌ (none)

WHAT SHOULD BE THERE (Dynamic Content):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
• Pokemon Trading Card Game packs
• Pokemon tins and collections
• Product images and prices
• Stock availability
• Your test product (SKU 41936301)
• 4-12 total Pokemon TCG products

THE API RESPONSE WE DISCOVERED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
{
  "ItemList": {
    "Items": [
      {
        "Title": "Pokémon Trading Card Game, 15 Card Pack, 1 ct",
        "ItemNbr": "41936301",
        "UPC": "728192558375",
        "ProductUrl": "/p/pok-mon-trading-card-game-card-pack-ct/728192558375",
        "Inventory": {"InStock": false}
      },
      // ... more Pokemon products
    ]
  }
}
"""
    print(comparison)

def main():
    test_pokemon_page()
    demonstrate_dynamic_loading_issue()
    show_comparison()

    print("\n" + "=" * 60)
    print("💡 SOLUTIONS TO GET ALL PRODUCTS:")
    print("=" * 60)
    print("""
OPTION 1 - API Authentication (Best Long-term):
• Solve the Bearer token authentication
• Use the discovered API endpoint directly
• Get all 24+ products per request automatically

OPTION 2 - Browser Automation (Works but Complex):
• Fix ChromeDriver compatibility with Brave
• Let JavaScript load the products completely
• Scrape the dynamically-loaded content

OPTION 3 - Manual Product URL Collection (Works Now):
• Find Pokemon product URLs from other sources
• Add them to the manual list in working_product_finder.py
• Process each product individually (current working method)

OPTION 4 - Hybrid Approach:
• Use individual product extraction for reliability
• Enhance discovery via multiple methods
• Build up a comprehensive product database over time
""")

    print("\n🎯 BOTTOM LINE:")
    print("The Pokemon page IS being scraped successfully!")
    print("But it's just an empty shell - the products load via JavaScript.")
    print("This is why we found the API endpoint - that's where the real data is!")

if __name__ == "__main__":
    main()
@@ -1,135 +0,0 @@
#!/usr/bin/env python3
"""
Extract exact API request details from HAR file
"""

import json
from urllib.parse import urlparse, parse_qs

def extract_api_request_details():
    """Extract the exact API request format"""

    har_file = 'www.dollargeneral.com_Archive [26-03-21 15-14-28].har'

    with open(har_file, 'r', encoding='utf-8') as f:
        har_data = json.load(f)

    entries = har_data.get('log', {}).get('entries', [])

    # Find the API calls that contain our product
    api_endpoint = "https://dggo.dollargeneral.com/omni/api/v2/category/search/provider"

    successful_calls = []

    for entry in entries:
        request = entry.get('request', {})
        response = entry.get('response', {})

        if (request.get('url') == api_endpoint and
                request.get('method') == 'POST' and
                response.get('status') == 200):

            # Check if response contains our product
            response_text = response.get('content', {}).get('text', '')
            if '41936301' in response_text and 'pokemon' in response_text.lower():
                successful_calls.append(entry)

    print(f"Found {len(successful_calls)} successful API calls with Pokemon products")
    print()

    for i, entry in enumerate(successful_calls):
        request = entry.get('request', {})
        response = entry.get('response', {})

        print(f"=== API Call {i+1} ===")
        print(f"URL: {request.get('url')}")
        print(f"Method: {request.get('method')}")

        # Extract headers
        headers = {}
        for header in request.get('headers', []):
            name = header.get('name')
            value = header.get('value')
            if name.lower() in ['authorization', 'content-type', 'accept', 'referer', 'user-agent']:
                headers[name] = value

        print("Headers:")
        for name, value in headers.items():
            if name.lower() == 'authorization':
                print(f" {name}: {value[:50]}... (Bearer token)")
            else:
                print(f" {name}: {value}")

        # Extract POST data
        post_json = None  # stays None if the body is missing or is not JSON
        post_data = request.get('postData', {})
        if post_data.get('text'):
            try:
                post_json = json.loads(post_data.get('text'))
                print("POST Data:")
                print(json.dumps(post_json, indent=2))
            except json.JSONDecodeError:
                print(f"POST Data (raw): {post_data.get('text')}")

        # Analyze response
        response_text = response.get('content', {}).get('text', '')
        if response_text:
            try:
                response_json = json.loads(response_text)
                print(f"Response size: {len(response_text)} characters")

                # Extract product information
                items = response_json.get('ItemList', {}).get('Items', [])
                print(f"Products found: {len(items)}")

                # Show Pokemon products
                pokemon_products = []
                for item in items:
                    title = item.get('Title', '').lower()
                    if 'pokemon' in title or 'pokémon' in title:
                        pokemon_products.append({
                            'title': item.get('Title'),
                            'sku': item.get('ItemNbr'),
                            'upc': item.get('UPC'),
                            'price': item.get('Price', {}).get('Amount'),
                            'url': item.get('ProductUrl'),
                            'in_stock': item.get('Inventory', {}).get('InStock'),
                            'available_online': item.get('Inventory', {}).get('AvailableOnline')
                        })

                if pokemon_products:
                    print(f"\nPokemon products in this response: {len(pokemon_products)}")
                    for prod in pokemon_products:
                        print(f" • {prod['title']}")
                        print(f" SKU: {prod['sku']}, UPC: {prod['upc']}")
                        print(f" Price: ${prod['price']}, In Stock: {prod['in_stock']}")
                        print(f" URL: {prod['url']}")

                # Extract the store number and filters used
                if i == 0:  # Save the working request format
                    with open('api_request_template.json', 'w') as f:
                        json.dump({
                            'endpoint': api_endpoint,
                            'method': 'POST',
                            'headers': headers,
                            'post_data': post_json,
                            'example_response': {
                                'total_items': len(items),
                                'pokemon_items': len(pokemon_products),
                                'sample_pokemon_product': pokemon_products[0] if pokemon_products else None
                            }
                        }, f, indent=2)
                    print(f"\n✅ Saved working API template to: api_request_template.json")

            except Exception as e:
                print(f"Error parsing response: {e}")

        print("\n" + "="*60 + "\n")

    return successful_calls

if __name__ == "__main__":
    successful_calls = extract_api_request_details()

    print("🎯 SUMMARY:")
    print(f" Successfully extracted {len(successful_calls)} working API calls")
    print(" Next step: Implement this API call in Pokemon Discovery scraper")
@@ -1,297 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Implement API-based scraping for Pokemon Discovery
|
|
||||||
"""
|
|
||||||
|
|
||||||
import json
|
|
||||||
import requests
|
|
||||||
import sys
|
|
||||||
from datetime import datetime
|
|
||||||
from urllib.parse import urljoin
|
|
||||||
|
|
||||||
class DollarGeneralAPIScaper:
|
|
||||||
def __init__(self):
|
|
||||||
self.base_url = "https://www.dollargeneral.com"
|
|
||||||
self.api_base = "https://dggo.dollargeneral.com"
|
|
||||||
self.session = requests.Session()
|
|
||||||
|
|
||||||
# Headers that mimic a real browser session
|
|
||||||
self.headers = {
|
|
||||||
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:148.0) Gecko/20100101 Firefox/148.0',
|
|
||||||
'Accept': 'application/json, text/plain, */*',
|
|
||||||
'Accept-Language': 'en-US,en;q=0.9',
|
|
||||||
'Accept-Encoding': 'gzip, deflate, br',
|
|
||||||
'DNT': '1',
|
|
||||||
'Connection': 'keep-alive',
|
|
||||||
'Sec-Fetch-Dest': 'empty',
|
|
||||||
'Sec-Fetch-Mode': 'cors',
|
|
||||||
'Sec-Fetch-Site': 'cross-site',
|
|
||||||
}
|
|
||||||
self.session.headers.update(self.headers)
|
|
||||||
|
|
||||||
self.auth_token = None
|
|
||||||
|
|
||||||
    def get_auth_token(self):
        """Try multiple methods to get authentication token"""

        print("🔑 Attempting to get authentication token...")

        # Method 1: Get token from main page
        try:
            print(" - Visiting main Pokemon page...")
            pokemon_url = f"{self.base_url}/c/toys/pokemon?q=&soldAtStore=true"
            response = self.session.get(pokemon_url, timeout=30)

            if response.status_code == 200:
                # Look for embedded tokens in the page
                import re

                # Look for bearer tokens in script tags
                token_patterns = [
                    r'Bearer\s+([A-Za-z0-9\-_\.]+)',
                    r'"access_token":\s*"([^"]+)"',
                    r'"token":\s*"([^"]+)"',
                    r'authorization:\s*["\'](Bearer\s+[^"\']+)["\']'
                ]

                for pattern in token_patterns:
                    matches = re.findall(pattern, response.text, re.IGNORECASE)
                    if matches:
                        token = matches[0]
                        if token.startswith('Bearer '):
                            token = token[7:]  # Remove 'Bearer ' prefix
                        print(f" ✅ Found token via pattern: {token[:50]}...")
                        self.auth_token = token
                        return token

        except Exception as e:
            print(f" ❌ Main page method failed: {e}")

        # Method 2: Try token endpoint
        try:
            print(" - Trying token endpoint...")
            token_url = f"{self.base_url}/bin/omni/userTokens"
            response = self.session.get(token_url, timeout=30)

            if response.status_code == 200:
                try:
                    data = response.json()
                    if 'access_token' in data:
                        token = data['access_token']
                        print(f" ✅ Got token from endpoint: {token[:50]}...")
                        self.auth_token = token
                        return token
                except ValueError:
                    # Response body was not valid JSON; fall through to the next method
                    pass

        except Exception as e:
            print(f" ❌ Token endpoint failed: {e}")

        # Method 3: Try CSRF token endpoint
        try:
            print(" - Trying CSRF token...")
            csrf_url = f"{self.base_url}/libs/granite/csrf/token.json"
            response = self.session.get(csrf_url, timeout=30)

            if response.status_code == 200:
                data = response.json()
                if 'token' in data:
                    # This might not be the right token, but let's try
                    print(f" ⚠️ Got CSRF token (may not work for API): {str(data)[:100]}...")

        except Exception as e:
            print(f" ❌ CSRF method failed: {e}")

        print(" ❌ Could not obtain authentication token")
        return None

    def search_products_api(self, store_nbr=17506, category_id=723960, include_out_of_stock=True):
        """Search for products using the API endpoint"""

        print(f"🔍 Searching products via API...")
        print(f" Store: {store_nbr}, Category: {category_id}")

        if not self.auth_token:
            print(" ❌ No authentication token available")
            return []

        endpoint = f"{self.api_base}/omni/api/v2/category/search/provider"

        # Headers for API request
        api_headers = self.headers.copy()
        api_headers.update({
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {self.auth_token}',
            'Referer': f'{self.base_url}/',
            'Origin': self.base_url,
        })

        # Request payload based on HAR analysis
        payload = {
            "StoreNbr": store_nbr,
            "SearchTerm": None,
            "PageSize": 48,  # Request more items
            "PageStartRecordIndex": 0,
            "Filters": {
                "category": [],
                "brand": [],
                "dgDelivery": False,
                "dgPickUp": False,
                "dgShipTohome": False,
                "soldAtStore": True,
                "inStock": not include_out_of_stock,  # False = include out of stock
                "onlyActivatedDeals": False
            },
            "IncludeSponsored": True,
            "IncludeShipToHome": True,
            "IncludeDeals": True,
            "offerSourceType": 0,
            "Id": category_id,
            "IncludeProducts": False,
            "DoNotSave": False,
            "OptOut": False,
            "SearchType": 1
        }

        try:
            print(f" POST {endpoint}")
            response = self.session.post(endpoint,
                                         headers=api_headers,
                                         json=payload,
                                         timeout=30)

            print(f" Status: {response.status_code}")
            print(f" Response size: {len(response.text)} characters")

            if response.status_code == 200:
                if len(response.text) == 0:
                    print(" ⚠️ Empty response (token may be expired)")
                    return []

                try:
                    data = response.json()
                    items = data.get('ItemList', {}).get('Items', [])
                    print(f" ✅ Found {len(items)} total items")
                    return items

                except Exception as e:
                    print(f" ❌ JSON parsing error: {e}")
                    print(f" Response preview: {response.text[:200]}...")
                    return []

            elif response.status_code == 401:
                print(" ❌ Authentication failed - token expired or invalid")
                return []
            else:
                print(f" ❌ API error: {response.status_code}")
                print(f" Response: {response.text[:200]}...")
                return []

        except Exception as e:
            print(f" ❌ Request failed: {e}")
            return []

    def filter_pokemon_products(self, items):
        """Filter for Pokemon TCG products"""

        pokemon_products = []

        for item in items:
            # `or ''` guards against keys that are present but hold a null value
            title = (item.get('Title') or '').lower()
            description = (item.get('Description') or '').lower()
            brand = (item.get('Brand') or '').lower()

            # Check if this is a Pokemon TCG product
            pokemon_keywords = ['pokemon', 'pokémon']
            tcg_keywords = ['trading card', 'tcg', 'cards', 'pack', 'tin', 'box', 'collection']

            has_pokemon = any(keyword in title or keyword in description for keyword in pokemon_keywords)
            has_tcg = any(keyword in title or keyword in description for keyword in tcg_keywords)

            if has_pokemon and has_tcg:
                product = {
                    'title': item.get('Title'),
                    'sku': item.get('ItemNbr'),
                    'upc': item.get('UPC'),
                    'price': f"${item.get('Price', {}).get('Amount', 0):.2f}",
                    'url': urljoin(self.base_url, item.get('ProductUrl', '')),
                    'stock': 'In Stock' if item.get('Inventory', {}).get('InStock') else 'Out of Stock',
                    'image_url': item.get('ImageURL'),
                    'description': item.get('Description', ''),
                    'brand': item.get('Brand', '')
                }
                pokemon_products.append(product)

                print(f" 🎯 Found: {product['title']}")
                print(f" SKU: {product['sku']}, Price: {product['price']}")
                print(f" Stock: {product['stock']}")

        return pokemon_products

    def scrape_pokemon_products(self):
        """Main scraping method"""

        print("Pokemon Discovery - API-based Scraping")
        print("="*60)

        # Get authentication token
        if not self.get_auth_token():
            print("❌ Authentication failed - cannot access API")
            print()
            print("💡 Alternative approaches:")
            print(" 1. Use browser automation with proper session")
            print(" 2. Extract products manually from individual pages")
            print(" 3. Use the working individual product scraper")
            return []

        print()

        # Search for products
        all_items = self.search_products_api()

        if not all_items:
            print("❌ No items returned from API")
            return []

        print()

        # Filter for Pokemon products
        pokemon_products = self.filter_pokemon_products(all_items)

        print()
        print(f"🎉 SUCCESS! Found {len(pokemon_products)} Pokemon TCG products")

        if pokemon_products:
            # Save results
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            filename = f'pokemon_tcg_api_scrape_{timestamp}.json'

            with open(filename, 'w') as f:
                json.dump(pokemon_products, f, indent=2)

            print(f"💾 Saved to: {filename}")

            # Show summary
            print()
            print("📋 Product Summary:")
            for i, product in enumerate(pokemon_products, 1):
                print(f" {i}. {product['title']}")
                print(f" SKU: {product['sku']} | Price: {product['price']} | {product['stock']}")

        return pokemon_products

def main():
    scraper = DollarGeneralAPIScaper()
    products = scraper.scrape_pokemon_products()

    if products:
        print()
        print("🚀 Ready for PDF generation!")
        print("Run: python pdf_generator.py pokemon_tcg_api_scrape_[timestamp].json")
    else:
        print()
        print("📝 Note: Individual product scraping still works perfectly!")
        print("The issue is authentication for bulk API access.")

if __name__ == "__main__":
    main()
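The removed `search_products_api` method requests a single page (`PageSize` 48, `PageStartRecordIndex` 0). A minimal sketch of how successive page payloads could be built, assuming the endpoint treats `PageStartRecordIndex` as a zero-based record offset (the capture does not confirm this); `page_payloads` and `total_records` are hypothetical names, not part of the removed file:

```python
from typing import Dict, Iterator


def page_payloads(base_payload: Dict, total_records: int, page_size: int = 48) -> Iterator[Dict]:
    """Yield one copy of base_payload per page, advancing the start index.

    Assumption: PageStartRecordIndex is a zero-based record offset.
    """
    for start in range(0, total_records, page_size):
        payload = dict(base_payload)  # shallow copy; nested dicts are shared
        payload["PageSize"] = page_size
        payload["PageStartRecordIndex"] = start
        yield payload


# Example: 100 records at 48 per page gives offsets 0, 48, 96
base = {"StoreNbr": 17506, "Id": 723960, "PageSize": 48, "PageStartRecordIndex": 0}
offsets = [p["PageStartRecordIndex"] for p in page_payloads(base, total_records=100)]
print(offsets)
```

Each yielded payload could then be posted in turn; the original payload is left unmodified.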
279
pdf_generator.py
@@ -1,279 +0,0 @@
#!/usr/bin/env python3
"""
Pokemon Discovery - TCG Product Catalog PDF Generator
Generates PDF catalog with product images, details, and UPC-A barcodes
"""

import json
import os
import sys
import requests
import subprocess
from datetime import datetime
from pathlib import Path
import barcode
from barcode.writer import ImageWriter
from PIL import Image, ImageDraw, ImageFont
import tempfile
import shutil

class PokemonTCGCatalogGenerator:
    def __init__(self, json_file):
        self.json_file = json_file
        self.output_dir = Path("catalog_output")
        self.images_dir = self.output_dir / "images"
        self.barcodes_dir = self.output_dir / "barcodes"

        # Create output directories
        self.output_dir.mkdir(exist_ok=True)
        self.images_dir.mkdir(exist_ok=True)
        self.barcodes_dir.mkdir(exist_ok=True)

        # Load product data
        with open(json_file, 'r') as f:
            self.products = json.load(f)

    def download_image(self, url, filename):
        """Download product image"""
        if not url:
            return None

        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()

            filepath = self.images_dir / filename
            with open(filepath, 'wb') as f:
                f.write(response.content)

            return filepath
        except Exception as e:
            print(f"Failed to download image {url}: {e}")
            return None

    def generate_upc_barcode(self, sku):
        """Generate UPC-A barcode from SKU"""
        try:
            # Convert SKU to 12-digit UPC-A format
            # Remove non-digits and pad/truncate to 11 digits (12th is check digit)
            digits_only = ''.join(filter(str.isdigit, str(sku)))

            if len(digits_only) < 11:
                # Pad with zeros at the start
                upc_base = digits_only.zfill(11)
            else:
                # Take the last 11 digits
                upc_base = digits_only[-11:]

            # Generate UPC-A barcode
            upc_generator = barcode.get_barcode_class('upca')
            upc = upc_generator(upc_base, writer=ImageWriter())

            # Save barcode image (str() so a numeric SKU does not break .replace)
            barcode_filename = f"barcode_{str(sku).replace('/', '_').replace(' ', '_')}"
            barcode_path = self.barcodes_dir / barcode_filename

            # Save with specific options for better appearance
            upc.save(str(barcode_path).replace('.png', ''), options={
                'module_width': 0.2,
                'module_height': 15.0,
                'quiet_zone': 6.5,
                'font_size': 10,
                'text_distance': 5.0,
                'background': 'white',
                'foreground': 'black'
            })

            final_path = f"{barcode_path}.png"
            return final_path

        except Exception as e:
            print(f"Failed to generate barcode for SKU {sku}: {e}")
            return None

    def create_placeholder_image(self, width=300, height=200):
        """Create a placeholder image when product image is not available"""
        img = Image.new('RGB', (width, height), color='lightgray')
        draw = ImageDraw.Draw(img)

        try:
            # Try to use a system font
            font = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf', 24)
        except OSError:
            # Font file missing or unreadable; try a common fallback
            try:
                font = ImageFont.truetype('arial.ttf', 24)
            except OSError:
                font = ImageFont.load_default()

        text = "No Image\nAvailable"

        # Get text bounding box for centering
        lines = text.split('\n')
        y_offset = height // 2 - (len(lines) * 30) // 2

        for line in lines:
            bbox = draw.textbbox((0, 0), line, font=font)
            text_width = bbox[2] - bbox[0]
            x_offset = (width - text_width) // 2
            draw.text((x_offset, y_offset), line, fill='darkgray', font=font)
            y_offset += 30

        placeholder_path = self.images_dir / "placeholder.png"
        img.save(placeholder_path)
        return placeholder_path

    def generate_markdown(self):
        """Generate markdown content for the catalog"""
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        markdown = f"""---
title: "Pokemon TCG Product Catalog"
subtitle: "Dollar General - Generated {timestamp}"
author: "Automated Scraper"
date: "{timestamp}"
geometry: margin=1in
fontsize: 11pt
documentclass: article
---

# Pokemon TCG Product Catalog

Generated on: {timestamp}
Source: Dollar General
Total Products: {len(self.products)}

---

"""

        for i, product in enumerate(self.products, 1):
            print(f"Processing product {i}/{len(self.products)}: {product.get('title', 'Unknown')}")

            # Download product image
            image_path = None
            if product.get('image_url'):
                # str() so a numeric SKU does not break .replace
                filename = f"product_{i}_{str(product.get('sku', 'unknown')).replace('/', '_').replace(' ', '_')}.jpg"
                image_path = self.download_image(product.get('image_url'), filename)

            if not image_path:
                # Use placeholder
                image_path = self.create_placeholder_image()

            # Generate barcode
            barcode_path = None
            if product.get('sku'):
                barcode_path = self.generate_upc_barcode(product.get('sku'))

            # Add product section to markdown
            markdown += f"## {i}. {product.get('title', 'Unknown Product')}\n\n"

            # Product image (the image link was lost in this view; alt text is assumed)
            if image_path:
                rel_image_path = os.path.relpath(image_path, self.output_dir)
                markdown += f"![Product Image]({rel_image_path}){{width=300px}}\n\n"

            # Product details in a table
            markdown += "| Field | Value |\n"
            markdown += "|-------|-------|\n"
            markdown += f"| **Title** | {product.get('title', 'N/A')} |\n"
            markdown += f"| **Price** | {product.get('price', 'N/A')} |\n"
            markdown += f"| **Stock** | {product.get('stock', 'N/A')} |\n"
            markdown += f"| **SKU** | `{product.get('sku', 'N/A')}` |\n"
            markdown += f"| **URL** | {product.get('url', 'N/A')} |\n"
            markdown += "\n"

            # Barcode (the image link was lost in this view; alt text is assumed)
            if barcode_path:
                rel_barcode_path = os.path.relpath(barcode_path, self.output_dir)
                markdown += f"**UPC-A Barcode:**\n\n"
                markdown += f"![UPC-A Barcode]({rel_barcode_path}){{width=200px}}\n\n"

            markdown += "---\n\n"

        return markdown

    def generate_pdf(self):
        """Generate PDF catalog using pandoc"""
        print("Generating markdown content...")
        markdown_content = self.generate_markdown()

        # Save markdown file
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        markdown_file = self.output_dir / f"pokemon_tcg_catalog_{timestamp}.md"

        with open(markdown_file, 'w', encoding='utf-8') as f:
            f.write(markdown_content)

        print(f"Markdown saved to: {markdown_file}")

        # Generate PDF using pandoc
        pdf_file = self.output_dir / f"pokemon_tcg_catalog_{timestamp}.pdf"

        print("Converting to PDF using pandoc...")

        try:
            subprocess.run([
                'pandoc',
                str(markdown_file),
                '-o', str(pdf_file),
                '--pdf-engine=xelatex',
                '-V', 'colorlinks=true',
                '-V', 'linkcolor=blue',
                '-V', 'filecolor=magenta',
                '-V', 'urlcolor=cyan',
                '--toc',
                '--toc-depth=2'
            ], check=True)

            print(f"PDF generated successfully: {pdf_file}")
            return pdf_file

        except subprocess.CalledProcessError as e:
            print(f"Pandoc conversion failed: {e}")
            print("Trying with pdflatex instead...")

            try:
                subprocess.run([
                    'pandoc',
                    str(markdown_file),
                    '-o', str(pdf_file),
                    '--pdf-engine=pdflatex',
                    '--toc'
                ], check=True)

                print(f"PDF generated successfully: {pdf_file}")
                return pdf_file

            except subprocess.CalledProcessError as e2:
                print(f"PDF generation failed with both engines: {e2}")
                print(f"Markdown file available at: {markdown_file}")
                return None

        except FileNotFoundError:
            print("Error: pandoc not found. Please install pandoc to generate PDF.")
            print(f"Markdown file available at: {markdown_file}")
            return None

def main():
    if len(sys.argv) != 2:
        print("Usage: python3 pdf_generator.py <json_file>")
        print("Example: python3 pdf_generator.py pokemon_tcg_products_20241221_143025.json")
        sys.exit(1)

    json_file = sys.argv[1]

    if not os.path.exists(json_file):
        print(f"Error: JSON file '{json_file}' not found")
        sys.exit(1)

    generator = PokemonTCGCatalogGenerator(json_file)
    pdf_file = generator.generate_pdf()

    if pdf_file:
        print(f"\nCatalog generation completed!")
        print(f"PDF file: {pdf_file}")
        print(f"Output directory: {generator.output_dir}")
    else:
        print(f"\nPDF generation failed, but markdown file is available in: {generator.output_dir}")

if __name__ == "__main__":
    main()
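The removed `generate_upc_barcode` pads or truncates the SKU to 11 digits and, per its own comment, leaves the 12th (check) digit to the barcode library. For reference, a minimal sketch of the standard UPC-A check-digit calculation; `upca_check_digit` is a hypothetical helper, not part of the removed file:

```python
def upca_check_digit(first11: str) -> int:
    """Return the UPC-A check digit for an 11-digit string."""
    if len(first11) != 11 or not first11.isdigit():
        raise ValueError("expected exactly 11 digits")
    digits = [int(c) for c in first11]
    odd_sum = sum(digits[0::2])   # positions 1, 3, ..., 11
    even_sum = sum(digits[1::2])  # positions 2, 4, ..., 10
    return (10 - (3 * odd_sum + even_sum) % 10) % 10


# The product URL in this commit ends in 728192558375; its last digit is the check digit.
print(upca_check_digit("72819255837"))  # 5
```

A barcode built from a zero-padded SKU is well-formed under this rule, but it will only match the printed retail barcode if the SKU happens to equal the first 11 digits of the product's UPC.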
@@ -1,294 +0,0 @@

<!DOCTYPE HTML>
<html lang="en">
<head>

    <meta charset="UTF-8"/>
    <title>
        Pokemon
    </title>
    <!-- Iterate over preloadUrls -->

    <meta name="robots" content="index, follow"/>

    <meta name="description" content="Shop for Pokemon at Dollar General."/>
    <meta name="template" content="category-page-template"/>
    <meta name="viewport" content="width=device-width, initial-scale=1"/>

    <meta name="content-page-ref" content="eyzXWsPCDMW1KkhXo6-vK-QOHHXaDV4DTx4MUHOL5zAPRiNcJ9pD0H1_MbrY0VDfnmuWWl_PiDqTS8zA-qwgPQ"/>
    <script defer="defer" type="text/javascript" src="/.rum/@adobe/helix-rum-js@%5E2/dist/micro.js"></script>
    <script>
        window.pageConfig = Object.assign(window.pageConfig || {}, {
            googleApiKey: "AIzaSyDi0nb6nKeHaDJWFtAvbAIPKBrUuAc_mTY",
            isEditMode: "false"
        });

        // Expose WCM mode information to frontend

        window.DG = window.DG || {};
        window.DG.wcmMode = {
            isEdit: false,
            isPreview: false,
            isDisabled: true,
            isDesign: false
        };

    </script>

    <script>
        window.DG = window.DG || {};
        window.DG.aemData = window.DG.aemData || {};
        window.DG.aemData.config = Object.assign(window.DG.aemData.config || {}, {
            shoppingListPageUrl: "https:\/\/www.dollargeneral.com\/shopping\u002Dlist",
            cartPageUrl: "https:\/\/www.dollargeneral.com\/cart",
            checkOutPageUrl: "https:\/\/www.dollargeneral.com\/cart\/checkout",
            orderPlacedPageUrl: "https:\/\/www.dollargeneral.com\/cart\/order\u002Dplaced?orderguid",
            orderDetailsPageUrl: "https:\/\/www.dollargeneral.com\/order\u002Ddetails?orderguid",
            orderHelpPageUrl: "https:\/\/www.dollargeneral.com\/order\u002Ddetails\/order\u002Dhelp",
            substitutionsPageUrl: "https:\/\/www.dollargeneral.com\/cart\/substitutions",
            dealsPageUrl: "https:\/\/www.dollargeneral.com\/deals",
            offersPageUrl: "https:\/\/www.dollargeneral.com\/deals\/offers\/{offer\u002Dcode}",
            pdpPageUrl: "https:\/\/www.dollargeneral.com\/p\/{hyphenated\u002Dproduct\u002Dname}\/{upc}",
            weeklyAdsPageUrl: "https:\/\/www.dollargeneral.com\/deals\/weekly\u002Dads\/weekly\u002Dad\/{weekly\u002Dad\u002Did}?flyer_run_id={*}{weekly\u002Dad\u002Did}\x22{}{*}",
            signInPageUrl: "https:\/\/www.dollargeneral.com\/sign\u002Din",
            signUpPageUrl: "https:\/\/www.dollargeneral.com\/sign\u002Dup",
            omniServerUrl: "https:\/\/dggo.dollargeneral.com",
            deviceIdCookieMaxAge : "31536000",
            cookiesMaxAge : "31536000",
            useAkamaiLatLng : true,
            paymentMethodsUrl : "https:\/\/www.dollargeneral.com\/my\u002Dinformation?startpage=paymentmethods",
            orderHistoryUrl : "https:\/\/www.dollargeneral.com\/my\u002Dinformation?startpage=orders",
            walletPageUrl : "https:\/\/www.dollargeneral.com\/mydg\/wallet",
            couponsPageUrl : "https:\/\/www.dollargeneral.com\/deals\/coupons",
            couponDetailsUrl : "https:\/\/www.dollargeneral.com\/deals\/coupons\/{coupon\u002Dtype}\/{coupon\u002Dcode}",
            trackMyOrderPage : "https:\/\/www.dollargeneral.com\/orders",
            storeDirectoryUrl : "https:\/\/www.dollargeneral.com\/store\u002Ddirectory",
            myDgPageUrl : "https:\/\/www.dollargeneral.com\/mydg",
            inventoryCallSearchRadius : "15",
            orderSubstitutionsPageUrl : "https:\/\/www.dollargeneral.com\/order\u002Ddetails\/substitutions"
        });
        window.DG.aemData.sparkCodeErrorMsgs = Object.assign();
    </script>

    <!-- Facebook Meta Tags -->
    <meta property="og:type" content="website"/>
    <meta property="og:title" content="Pokemon"/>

    <meta property="og:url" content="https://www.dollargeneral.com/c/toys/pokemon"/>

    <!-- Twitter Meta Tags -->
    <meta name="twitter:card" content="summary_large_image"/>
    <meta name="twitter:title" content="Pokemon"/>

    <meta property="twitter:url" content="https://www.dollargeneral.com/c/toys/pokemon"/>

    <script type="application/ld+json">
        {
            "@context": "https://schema.org",
            "@type": "BreadcrumbList",
            "itemListElement": [
                {
                    "@type": "ListItem",
                    "position": 1,
                    "item": {
                        "@type": "Thing",
                        "@id": "https://www.dollargeneral.com/",
                        "name": "Dollar General"
                    }
                },
                {
                    "@type": "ListItem",
                    "position": 2,
                    "item": {
                        "@type": "Thing",
                        "@id": "https://www.dollargeneral.com/tps://www.dollargeneral.com/content/dollargeneral/us/en/c/toys/pokemon",
                        "name": "tps:"
                    }
                }
            ]
        }
    </script>

    <script type="text/javascript">

        /**
         * Store service enum in binary for "sezzle" is {@code 1000 0000 0000 0000 0000}.
         */
        const SEZZLE_BIT_MASK_VALUE = 524288;

        /**
         * Store service enum in binary for "bopis" is {@code 0000 0000 0000 0000 1000}.
         */
        const BOPIS_BIT_MASK_VALUE = 8;

        /**
         * Store service enum in binary for "delivery" is {@code 0000 0000 0001 0000 0000}.
         */
        const DELIVERY_BIT_MASK_VALUE = 256;

        /**
         * The key name for the object stored in {@link localStorage} for user store and guest store data.
         */
        const PREFERRED_STORE_DATA_KEY = "preferredStoreData";

        /**
         * The default store to set if user is either not signed in or we are not able to
         * determine a preferred store from the signed-in users data.
         */
        const DEFAULT_STORE_NUMBER = 1014;

        const DEFAULT_STORE_SEARCH_RADIUS = 10;

        const DEFAULT_LATITUDE = 0;

        const DEFAULT_LONGITUDE = 0;

        const cookiesMaxAgeInSeconds = parseInt(
            window?.DG?.aemData?.config?.cookiesMaxAge || "31536000"
        );

        const useCloudService = window.__FEATURE_FLAGS__?.useCloudServicesHeader;
        const enableStoreSelectionFromURL = window.__FEATURE_FLAGS__?.enableStoreSelectionFromURL;

        const isSezzle = (storeService) =>
            (storeService & SEZZLE_BIT_MASK_VALUE) === SEZZLE_BIT_MASK_VALUE;
        const isBopis = (storeService) =>
            (storeService & BOPIS_BIT_MASK_VALUE) === BOPIS_BIT_MASK_VALUE;
        const isDelivery = (storeService) =>
            (storeService & DELIVERY_BIT_MASK_VALUE) === DELIVERY_BIT_MASK_VALUE;

        const getQueryParam = (paramName) => {
            return new URLSearchParams(window.location.search).get(paramName);
        };

        function getPreferredStoreDetails() {
            return window.localStorage.getItem("preferredStoreData");
        };

        function getCookie(cname) {
            let name = cname + "=";
            let decodedCookie = decodeURIComponent(document.cookie);
            let ca = decodedCookie.split(';');
            for (let i = 0; i < ca.length; i++) {
                let c = ca[i];
                while (c.charAt(0) == ' ') {
                    c = c.substring(1);
                }
                if (c.indexOf(name) == 0) {
                    return c.substring(name.length, c.length);
                }
            }
            return "";
        }

        async function setStoreDetails(storeObj, isUser) {

            let _preferredStore = getPreferredStoreDetails() ? JSON.parse(getPreferredStoreDetails()) : {};

            if (!storeObj?.sn || !storeObj?.ad || !storeObj?.ct || !storeObj?.st || !storeObj?.zp) {
                return;
            }

            const formattedZip = storeObj?.zp ? storeObj.zp.split("-")[0] : "";

            const formatedAddress = function (storeObj) {
                return storeObj?.ct + ", " + storeObj?.st + " " + formattedZip;
            }

            const updatedStoreDetails = {
                address: storeObj?.ad,
                city: storeObj?.ct,
                latitude: storeObj?.la,
                longitude: storeObj?.lo,
                state: storeObj?.st,
                storeService: storeObj?.ss,
                storeNumber: parseInt(storeObj?.sn),
                // TODO: remove 'number' after full roll out to cloud
                number: parseInt(storeObj?.sn),
                zip: storeObj?.zp,
                isSezzle: isSezzle(storeObj?.ss),
                isBopis: isBopis(storeObj?.ss),
                isDelivery: isDelivery(storeObj?.ss),
                lastUpdated: Date.now(),
                fullAddress: formatedAddress(storeObj),
            };

            _preferredStore[isUser ? "userStore" : "guestStore"] = updatedStoreDetails;

            localStorage.setItem(
                PREFERRED_STORE_DATA_KEY,
                JSON.stringify(_preferredStore)
            );

            const setStorage = new CustomEvent("updateStoreEvent");
            window.dispatchEvent(setStorage);
            console.log('Store data updated, event dispatched');
        }

        // gets default store details
        async function getGuestStoreDetails(storeNumber, fallbackFlow = false) {

            let storeDetailsUrl = 'https://dggo.dollargeneral.com/omni/api/store/info/';
            storeDetailsUrl = storeDetailsUrl + storeNumber;

            const guestStoreDetails = async () => {
                try {
                    var xhr = new XMLHttpRequest();
                    xhr.open("GET", storeDetailsUrl, true);

                    xhr.setRequestHeader("Content-Type", "application/json");
                    xhr.setRequestHeader("X-DG-appToken", getCookie("appToken"));
                    xhr.setRequestHeader("X-DG-appSessionToken", getCookie('appSessionToken'));
                    xhr.setRequestHeader("X-DG-customerGuid", getCookie('customerGuid'));
                    xhr.setRequestHeader("X-DG-deviceUniqueId", getCookie('uniqueDeviceId'));
                    xhr.setRequestHeader("X-DG-partnerApiToken", getCookie('partnerApiToken'));
                    let bearerToken = "Bearer " + getCookie('idToken');
                    xhr.setRequestHeader("Authorization", bearerToken);

                    if (useCloudService) {
                        xhr.setRequestHeader("X-DG-CLOUD-SERVICE", useCloudService);
                    }

                    xhr.onreadystatechange = function () {
                        if (this.readyState === XMLHttpRequest.DONE && this.status === 200) {
                            const sparkCode = this.getResponseHeader("x-spark");
                            if (sparkCode && SPARK_CODES.tokenExpired.includes(sparkCode)) {
                                refreshTokens()
                                    .then(() => guestStoreDetails())
                                    .catch(() => {
                                        console.error("Failed to refresh tokens.");
                                    });

@@ -1,7 +0,0 @@
[
  {
    "url": "https://www.dollargeneral.com/p/pok-mon-trading-card-game-card-pack-ct/728192558375",
    "title": "Pok\u00e9mon Trading Card Game, 15 Card Pack, 1 ct",
    "sku": "41936301"
  }
]
31
run.sh
@@ -1,31 +0,0 @@
#!/bin/bash
# Pokemon Discovery - Scraper & Catalog Generator Launcher
# Automatically activates virtual environment and runs the scraper

set -e

cd "$(dirname "$0")"

echo "Pokemon Discovery - Product Scraper & Catalog Generator"
echo "================================================"

# Check if virtual environment exists
if [[ ! -d "venv" ]]; then
    echo "Creating virtual environment..."
    python3 -m venv venv
fi

# Activate virtual environment
source venv/bin/activate

# Check if requirements are installed
if ! python -c "import requests, bs4, barcode, selenium" 2>/dev/null; then
    echo "Installing Python requirements..."
    pip install -r requirements.txt
fi

# Run the main script
python run_scraper.py

echo ""
echo "Script completed. Check the output above for results."
139
run_scraper.py
@@ -1,139 +0,0 @@
#!/usr/bin/env python3
"""
Pokemon Discovery - Scraper and Catalog Generator
Main script that runs both scraping and PDF generation
"""

import os
import sys
import subprocess
from datetime import datetime
from pathlib import Path

def install_requirements():
    """Install Python requirements"""
    print("Installing Python requirements...")
    try:
        subprocess.run([sys.executable, '-m', 'pip', 'install', '-r', 'requirements.txt'],
                       check=True)
        print("Requirements installed successfully!")
    except subprocess.CalledProcessError as e:
        print(f"Failed to install requirements: {e}")
        return False
    return True

def run_scraper():
|
|
||||||
"""Run the scraper to collect product data"""
|
|
||||||
print("=" * 60)
|
|
||||||
print("STEP 1: SCRAPING POKEMON TCG PRODUCTS")
|
|
||||||
print("=" * 60)
|
|
||||||
|
|
||||||
try:
|
|
||||||
result = subprocess.run([sys.executable, 'scraper.py'],
|
|
||||||
capture_output=True, text=True)
|
|
||||||
|
|
||||||
if result.returncode == 0:
|
|
||||||
print("Scraping completed successfully!")
|
|
||||||
print(result.stdout)
|
|
||||||
|
|
||||||
# Find the generated JSON file
|
|
||||||
json_files = list(Path('.').glob('pokemon_tcg_products_*.json'))
|
|
||||||
if json_files:
|
|
||||||
latest_file = max(json_files, key=os.path.getctime)
|
|
||||||
return str(latest_file)
|
|
||||||
else:
|
|
||||||
print("No JSON file was generated")
|
|
||||||
return None
|
|
||||||
else:
|
|
||||||
print("Scraping failed:")
|
|
||||||
print(result.stderr)
|
|
||||||
return None
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error running scraper: {e}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
def run_pdf_generator(json_file):
|
|
||||||
"""Run the PDF generator with the scraped data"""
|
|
||||||
print("=" * 60)
|
|
||||||
print("STEP 2: GENERATING PDF CATALOG")
|
|
||||||
print("=" * 60)
|
|
||||||
|
|
||||||
try:
|
|
||||||
result = subprocess.run([sys.executable, 'pdf_generator.py', json_file],
|
|
||||||
capture_output=True, text=True)
|
|
||||||
|
|
||||||
if result.returncode == 0:
|
|
||||||
print("PDF generation completed successfully!")
|
|
||||||
print(result.stdout)
|
|
||||||
return True
|
|
||||||
else:
|
|
||||||
print("PDF generation failed:")
|
|
||||||
print(result.stderr)
|
|
||||||
return False
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error running PDF generator: {e}")
|
|
||||||
return False
|
|
||||||
|
|
||||||
def main():
|
|
||||||
print("Pokemon Discovery - Product Scraper & Catalog Generator")
|
|
||||||
print("=" * 60)
|
|
||||||
print(f"Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
|
|
||||||
print()
|
|
||||||
|
|
||||||
# Check if requirements are installed
|
|
||||||
try:
|
|
||||||
import requests, bs4, barcode, PIL
|
|
||||||
print("✓ Required packages are available")
|
|
||||||
except ImportError as e:
|
|
||||||
print(f"✗ Missing required package: {e}")
|
|
||||||
print("Installing requirements...")
|
|
||||||
if not install_requirements():
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
# Check if pandoc is available
|
|
||||||
try:
|
|
||||||
subprocess.run(['pandoc', '--version'],
|
|
||||||
capture_output=True, check=True)
|
|
||||||
print("✓ Pandoc is available for PDF generation")
|
|
||||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
|
||||||
print("⚠ Pandoc not found. PDF generation may fail.")
|
|
||||||
print(" Install pandoc with: sudo apt install pandoc (Ubuntu/Debian)")
|
|
||||||
print(" or: brew install pandoc (macOS)")
|
|
||||||
print(" or: pacman -S pandoc (Arch Linux)")
|
|
||||||
|
|
||||||
print()
|
|
||||||
|
|
||||||
# Run scraper
|
|
||||||
json_file = run_scraper()
|
|
||||||
if not json_file:
|
|
||||||
print("Scraping failed. Exiting.")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
# Run PDF generator
|
|
||||||
if run_pdf_generator(json_file):
|
|
||||||
print("=" * 60)
|
|
||||||
print("SUCCESS! Both scraping and PDF generation completed.")
|
|
||||||
print("=" * 60)
|
|
||||||
print(f"JSON data: {json_file}")
|
|
||||||
print("PDF catalog: Check the catalog_output/ directory")
|
|
||||||
print()
|
|
||||||
print("Files generated:")
|
|
||||||
|
|
||||||
# List generated files
|
|
||||||
for file_pattern in ['pokemon_tcg_products_*.json', 'catalog_output/pokemon_tcg_catalog_*.pdf']:
|
|
||||||
files = list(Path('.').glob(file_pattern))
|
|
||||||
if files:
|
|
||||||
latest = max(files, key=os.path.getctime)
|
|
||||||
print(f" - {latest}")
|
|
||||||
else:
|
|
||||||
print("=" * 60)
|
|
||||||
print("PARTIAL SUCCESS: Scraping completed, but PDF generation failed.")
|
|
||||||
print("=" * 60)
|
|
||||||
print(f"JSON data: {json_file}")
|
|
||||||
print("You can manually run the PDF generator with:")
|
|
||||||
print(f" python3 pdf_generator.py {json_file}")
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
32
scraper.py
@@ -1,7 +1,20 @@
|
|||||||
#!/usr/bin/env python3
|
#!/usr/bin/env python3
|
||||||
"""
|
"""
|
||||||
Pokemon Discovery - TCG Product Scraper for Dollar General
|
Pokemon Discovery — Site Scraper (Reference)
|
||||||
Scrapes product information and saves to JSON for PDF generation
|
|
||||||
|
HTML + Selenium/Brave scraper for Dollar General product pages.
|
||||||
|
Kept as a reference implementation. The primary tool is disco.py,
|
||||||
|
which reads product data from a HAR capture instead of scraping live.
|
||||||
|
|
||||||
|
This scraper can:
|
||||||
|
- Fetch individual product pages and extract title, SKU, price, stock
|
||||||
|
- Attempt to find product links from the category page (limited by
|
||||||
|
dynamic JS loading — products are injected via API after page load)
|
||||||
|
- Fall back to Brave browser via Selenium for JS-rendered content
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python scraper.py # Attempt full category scrape
|
||||||
|
# Or import and use PokemonTCGScraper class directly for individual pages
|
||||||
"""
|
"""
|
||||||
|
|
||||||
import json
|
import json
|
||||||
@@ -28,6 +41,14 @@ except ImportError:
|
|||||||
print("Selenium not available, using requests only (install selenium for Brave browser support)")
|
print("Selenium not available, using requests only (install selenium for Brave browser support)")
|
||||||
|
|
||||||
class PokemonTCGScraper:
|
class PokemonTCGScraper:
|
||||||
|
"""HTML/Selenium scraper for Dollar General Pokemon product pages.
|
||||||
|
|
||||||
|
Can extract product details (title, SKU, price, stock) from individual
|
||||||
|
product page URLs. Category-level scraping is limited because Dollar
|
||||||
|
General loads products dynamically via a JS API call after page load.
|
||||||
|
See disco.py for the HAR-based approach that bypasses this limitation.
|
||||||
|
"""
|
||||||
|
|
||||||
def __init__(self):
|
def __init__(self):
|
||||||
self.base_url = "https://www.dollargeneral.com"
|
self.base_url = "https://www.dollargeneral.com"
|
||||||
self.search_url = "https://www.dollargeneral.com/c/toys/pokemon?q=&soldAtStore=true"
|
self.search_url = "https://www.dollargeneral.com/c/toys/pokemon?q=&soldAtStore=true"
|
||||||
@@ -300,9 +321,10 @@ class PokemonTCGScraper:
|
|||||||
return has_pokemon and has_tcg
|
return has_pokemon and has_tcg
|
||||||
|
|
||||||
def try_api_scraping(self):
|
def try_api_scraping(self):
|
||||||
"""
|
"""Stub for API-based scraping (requires auth token).
|
||||||
Try to scrape products using the discovered API endpoint
|
|
||||||
This method contains the exact API call found via HAR analysis
|
Documents the discovered API endpoint and request format.
|
||||||
|
Not functional — use disco.py with a HAR file instead.
|
||||||
"""
|
"""
|
||||||
print("🔬 Attempting API-based scraping...")
|
print("🔬 Attempting API-based scraping...")
|
||||||
print(" Endpoint: https://dggo.dollargeneral.com/omni/api/v2/category/search/provider")
|
print(" Endpoint: https://dggo.dollargeneral.com/omni/api/v2/category/search/provider")
|
||||||
|
|||||||
@@ -1,246 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Test the Dollar General API endpoint for Pokemon products
|
|
||||||
"""
|
|
||||||
|
|
||||||
import json
|
|
||||||
import requests
|
|
||||||
import sys
|
|
||||||
from datetime import datetime
|
|
||||||
|
|
||||||
def get_auth_token():
|
|
||||||
"""Get authentication token from Dollar General"""
|
|
||||||
try:
|
|
||||||
# Try to get token from the token endpoint
|
|
||||||
token_url = 'https://www.dollargeneral.com/bin/omni/userTokens'
|
|
||||||
headers = {
|
|
||||||
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:148.0) Gecko/20100101 Firefox/148.0',
|
|
||||||
'Accept': 'application/json, text/plain, */*',
|
|
||||||
'Referer': 'https://www.dollargeneral.com/'
|
|
||||||
}
|
|
||||||
|
|
||||||
response = requests.get(token_url, headers=headers, timeout=30)
|
|
||||||
if response.status_code == 200:
|
|
||||||
data = response.json()
|
|
||||||
# Look for access token in the response
|
|
||||||
if 'access_token' in data:
|
|
||||||
return data['access_token']
|
|
||||||
elif 'token' in data:
|
|
||||||
return data['token']
|
|
||||||
else:
|
|
||||||
print("Token response structure:", list(data.keys()))
|
|
||||||
return None
|
|
||||||
else:
|
|
||||||
print(f"Failed to get token: {response.status_code}")
|
|
||||||
return None
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error getting token: {e}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
def test_api_with_existing_token():
|
|
||||||
"""Test with the token from HAR file"""
|
|
||||||
|
|
||||||
# Token extracted from HAR file (may expire)
|
|
||||||
har_token = "eyJ0eXAiOiJhdCtKV1QiLCJhbGciOiJSUzI1NiIsImtpZCI6Ik5qRTJNemczTXpSRVFrUXpNak5GUmprMU1FUkNNRUZDTVRBek1FWTFRa0pCTXpRM1EwTkNNZyJ9.eyJzY29wZSI6bnVsbCwiaWF0IjoxNzc0MTI3Nzc5LCJleHAiOjE3NzQxMzEzNzksImF1ZCI6IldLOTlLc2VCYnUybmFoNC1ibFE3ZmsyUiIsImlzcyI6Imh0dHBzOi8vcHJvZC1kZ2dvLyIsInN1YiI6IldLOTlLc2VCYnUybmFoNC1ibFE3ZmsyUiIsInNpZCI6IlNrWk9makF5TURRMU1EVXpOVFEwWWpBM016SXpNak14TXpFek9ETTNNekV3TWpreFl6VitUVUZXYVhwbk56SXpVRGg2VWxkcmEySkRkMk5EZUdVNFlUWm5XVXBHVDBveVExTlRNVWxXWlhSalQzRnFWazVWZGtGWlIwOWtZV2x0WVVwRVRucG5SVlZvUTE5SE5VcHVObGhuTURSb2JuUkVhVlF3UTBzelNIND0iLCJqdGkiOiJzdDIucy5BdEx0VlphRHFnLnZrdW5OV2RWNjN2ZlJTTG00Y3VUd2d5bmc2X0pJNmxKRjA5a2lXTXVQeGZkVDRvT0NhMXhwa1VoRlRkM2tocHZUaFhsRUVwLWw0QzJrZnoycjkzVlYzeldBaUw5Y2x6Snl0amFJamJ4TEJnLkJOZy1CeUdpZnV0WnppQWhhMV8xRDBXTUFWR3JpNVVCX0pKbTRCNVRNYVhTWkZneXpxeUZERjJxZ3B3UTgyajZ2eGVtcnA5RERFTHZnM3hvdlZmZzBnLnNjMyIsImNsaWVudF9pZCI6IldLOTlLc2VCYnUybmFoNC1ibFE3ZmsyUiIsImF6cCI6IldLOTlLc2VCYnUybmFoNC1ibFE3ZmsyUiJ9.I6ou9atkJ8ndkr2m2Trpg53fMIL3hpofCLUHoHYgZkOJnLnbmL0CQu7_pIChQ6nIDK03GagK6aqxd97E8B8vv9nweSmb7zXhrt43dKLEIdhxIGFkJ4xYgNNg-3cVjSlThBQ_AwCx924lOGjEfikEw4NrvGvrlNvrg1lnNz4hf629hUH-5ccVSdgo1w_LQzsLOeMCjuC_bmAoRxT5KLI9oESd4tPJZU5Nlt2ICbWJD9h-zNrt-ijwYCvb7j8amGbpMGhJZqtzu9f3wN0JUFxDg5rAN-WOtLjwEmR_NxDKq0NEeuU16uhaB8AJzy217XAgJ87bKZldZowsWs-Q9oAH3g"
|
|
||||||
|
|
||||||
endpoint = "https://dggo.dollargeneral.com/omni/api/v2/category/search/provider"
|
|
||||||
|
|
||||||
headers = {
|
|
||||||
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:148.0) Gecko/20100101 Firefox/148.0',
|
|
||||||
'Accept': 'application/json, text/plain, */*',
|
|
||||||
'Content-Type': 'application/json',
|
|
||||||
'Authorization': f'Bearer {har_token}',
|
|
||||||
'Referer': 'https://www.dollargeneral.com/'
|
|
||||||
}
|
|
||||||
|
|
||||||
# Test different filter combinations
|
|
||||||
test_requests = [
|
|
||||||
{
|
|
||||||
"name": "In Stock Pokemon Products",
|
|
||||||
"payload": {
|
|
||||||
"StoreNbr": 17506,
|
|
||||||
"SearchTerm": None,
|
|
||||||
"PageSize": 24,
|
|
||||||
"PageStartRecordIndex": 0,
|
|
||||||
"Filters": {
|
|
||||||
"category": [],
|
|
||||||
"brand": [],
|
|
||||||
"dgDelivery": False,
|
|
||||||
"dgPickUp": False,
|
|
||||||
"dgShipTohome": False,
|
|
||||||
"soldAtStore": True,
|
|
||||||
"inStock": True,
|
|
||||||
"onlyActivatedDeals": False
|
|
||||||
},
|
|
||||||
"IncludeSponsored": True,
|
|
||||||
"IncludeShipToHome": True,
|
|
||||||
"IncludeDeals": True,
|
|
||||||
"offerSourceType": 0,
|
|
||||||
"Id": 723960, # Pokemon category ID
|
|
||||||
"IncludeProducts": False,
|
|
||||||
"DoNotSave": False,
|
|
||||||
"OptOut": False,
|
|
||||||
"SearchType": 1
|
|
||||||
}
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"name": "All Pokemon Products (including out of stock)",
|
|
||||||
"payload": {
|
|
||||||
"StoreNbr": 17506,
|
|
||||||
"SearchTerm": None,
|
|
||||||
"PageSize": 24,
|
|
||||||
"PageStartRecordIndex": 0,
|
|
||||||
"Filters": {
|
|
||||||
"category": [],
|
|
||||||
"brand": [],
|
|
||||||
"dgDelivery": False,
|
|
||||||
"dgPickUp": False,
|
|
||||||
"dgShipTohome": False,
|
|
||||||
"soldAtStore": True,
|
|
||||||
"inStock": False, # Include out of stock
|
|
||||||
"onlyActivatedDeals": False
|
|
||||||
},
|
|
||||||
"IncludeSponsored": True,
|
|
||||||
"IncludeShipToHome": True,
|
|
||||||
"IncludeDeals": True,
|
|
||||||
"offerSourceType": 0,
|
|
||||||
"Id": 723960,
|
|
||||||
"IncludeProducts": False,
|
|
||||||
"DoNotSave": False,
|
|
||||||
"OptOut": False,
|
|
||||||
"SearchType": 1
|
|
||||||
}
|
|
||||||
}
|
|
||||||
]
|
|
||||||
|
|
||||||
all_pokemon_products = []
|
|
||||||
|
|
||||||
for test in test_requests:
|
|
||||||
print(f"=== Testing: {test['name']} ===")
|
|
||||||
|
|
||||||
try:
|
|
||||||
response = requests.post(endpoint,
|
|
||||||
headers=headers,
|
|
||||||
json=test['payload'],
|
|
||||||
timeout=30)
|
|
||||||
|
|
||||||
print(f"Status Code: {response.status_code}")
|
|
||||||
|
|
||||||
if response.status_code == 200:
|
|
||||||
print(f"Response length: {len(response.text)} characters")
|
|
||||||
print(f"Response preview: {response.text[:200]}...")
|
|
||||||
|
|
||||||
try:
|
|
||||||
data = response.json()
|
|
||||||
items = data.get('ItemList', {}).get('Items', [])
|
|
||||||
print(f"Total products: {len(items)}")
|
|
||||||
except Exception as json_error:
|
|
||||||
print(f"JSON parsing error: {json_error}")
|
|
||||||
print(f"Full response: {response.text}")
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Filter for Pokemon products
|
|
||||||
pokemon_products = []
|
|
||||||
for item in items:
|
|
||||||
title = item.get('Title', '').lower()
|
|
||||||
if any(keyword in title for keyword in ['pokemon', 'pokémon', 'trading card']):
|
|
||||||
product_info = {
|
|
||||||
'title': item.get('Title'),
|
|
||||||
'sku': item.get('ItemNbr'),
|
|
||||||
'upc': item.get('UPC'),
|
|
||||||
'price': item.get('Price', {}).get('Amount'),
|
|
||||||
'url': f"https://www.dollargeneral.com{item.get('ProductUrl', '')}",
|
|
||||||
'in_stock': item.get('Inventory', {}).get('InStock'),
|
|
||||||
'image_url': item.get('ImageURL'),
|
|
||||||
'description': item.get('Description', ''),
|
|
||||||
'brand': item.get('Brand', '')
|
|
||||||
}
|
|
||||||
pokemon_products.append(product_info)
|
|
||||||
all_pokemon_products.append(product_info)
|
|
||||||
|
|
||||||
print(f"Pokemon products found: {len(pokemon_products)}")
|
|
||||||
|
|
||||||
for i, prod in enumerate(pokemon_products, 1):
|
|
||||||
print(f" {i}. {prod['title']}")
|
|
||||||
print(f" SKU: {prod['sku']}, UPC: {prod['upc']}")
|
|
||||||
print(f" Price: ${prod['price']}, In Stock: {prod['in_stock']}")
|
|
||||||
print(f" URL: {prod['url']}")
|
|
||||||
|
|
||||||
# Check if this is our test product
|
|
||||||
if prod['sku'] == '41936301':
|
|
||||||
print(f" 🎯 THIS IS OUR TEST PRODUCT!")
|
|
||||||
print()
|
|
||||||
|
|
||||||
elif response.status_code == 401:
|
|
||||||
print("❌ Authentication failed - token may be expired")
|
|
||||||
print("Response:", response.text)
|
|
||||||
return None
|
|
||||||
else:
|
|
||||||
print(f"❌ API call failed: {response.status_code}")
|
|
||||||
print("Response:", response.text[:500])
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"❌ Error: {e}")
|
|
||||||
|
|
||||||
print("="*60)
|
|
||||||
print()
|
|
||||||
|
|
||||||
# Save results
|
|
||||||
if all_pokemon_products:
|
|
||||||
# Remove duplicates based on SKU
|
|
||||||
unique_products = {prod['sku']: prod for prod in all_pokemon_products}.values()
|
|
||||||
unique_products = list(unique_products)
|
|
||||||
|
|
||||||
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
|
|
||||||
filename = f'pokemon_tcg_api_results_{timestamp}.json'
|
|
||||||
|
|
||||||
with open(filename, 'w') as f:
|
|
||||||
json.dump(unique_products, f, indent=2)
|
|
||||||
|
|
||||||
print(f"🎉 SUCCESS!")
|
|
||||||
print(f"Found {len(unique_products)} unique Pokemon TCG products")
|
|
||||||
print(f"Saved to: {filename}")
|
|
||||||
|
|
||||||
return unique_products
|
|
||||||
|
|
||||||
return None
|
|
||||||
|
|
||||||
def main():
|
|
||||||
print("Pokemon Discovery - API Endpoint Test")
|
|
||||||
print("="*60)
|
|
||||||
|
|
||||||
# First try to get a fresh token
|
|
||||||
print("Attempting to get fresh authentication token...")
|
|
||||||
fresh_token = get_auth_token()
|
|
||||||
|
|
||||||
if fresh_token:
|
|
||||||
print(f"✅ Got fresh token: {fresh_token[:50]}...")
|
|
||||||
else:
|
|
||||||
print("⚠️ Could not get fresh token, using HAR token")
|
|
||||||
|
|
||||||
print()
|
|
||||||
|
|
||||||
# Test API with existing token from HAR
|
|
||||||
products = test_api_with_existing_token()
|
|
||||||
|
|
||||||
if products:
|
|
||||||
print()
|
|
||||||
print("🚀 READY FOR INTEGRATION!")
|
|
||||||
print("The API endpoint is working and can be integrated into Pokemon Discovery")
|
|
||||||
print()
|
|
||||||
|
|
||||||
# Check if our known product is in the results
|
|
||||||
known_sku = '41936301'
|
|
||||||
known_product = next((p for p in products if p['sku'] == known_sku), None)
|
|
||||||
|
|
||||||
if known_product:
|
|
||||||
print(f"✅ Confirmed: Our test product (SKU {known_sku}) was found via API!")
|
|
||||||
print(f" Title: {known_product['title']}")
|
|
||||||
print(f" URL: {known_product['url']}")
|
|
||||||
print(f" Stock: {known_product['in_stock']}")
|
|
||||||
|
|
||||||
else:
|
|
||||||
print("❌ API test failed - may need fresh authentication")
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
@@ -1,55 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Test script to verify barcode generation functionality
|
|
||||||
"""
|
|
||||||
|
|
||||||
import sys
|
|
||||||
import os
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
# Add current directory to path if running in venv
|
|
||||||
sys.path.insert(0, '.')
|
|
||||||
|
|
||||||
try:
|
|
||||||
import barcode
|
|
||||||
from barcode.writer import ImageWriter
|
|
||||||
print("✓ Barcode generation libraries are available")
|
|
||||||
|
|
||||||
# Test barcode generation
|
|
||||||
test_sku = "123456789012"
|
|
||||||
|
|
||||||
upc_generator = barcode.get_barcode_class('upca')
|
|
||||||
test_barcode = upc_generator("12345678901", writer=ImageWriter())
|
|
||||||
|
|
||||||
# Create test output directory
|
|
||||||
test_dir = Path("test_output")
|
|
||||||
test_dir.mkdir(exist_ok=True)
|
|
||||||
|
|
||||||
# Generate test barcode
|
|
||||||
barcode_path = test_dir / "test_barcode"
|
|
||||||
test_barcode.save(str(barcode_path), options={
|
|
||||||
'module_width': 0.2,
|
|
||||||
'module_height': 15.0,
|
|
||||||
'quiet_zone': 6.5,
|
|
||||||
'font_size': 10,
|
|
||||||
'text_distance': 5.0,
|
|
||||||
'background': 'white',
|
|
||||||
'foreground': 'black'
|
|
||||||
})
|
|
||||||
|
|
||||||
final_path = f"{barcode_path}.png"
|
|
||||||
if os.path.exists(final_path):
|
|
||||||
print(f"✓ Test barcode generated successfully: {final_path}")
|
|
||||||
print(f" File size: {os.path.getsize(final_path)} bytes")
|
|
||||||
else:
|
|
||||||
print(f"✗ Failed to generate test barcode")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
except ImportError as e:
|
|
||||||
print(f"✗ Missing barcode library: {e}")
|
|
||||||
sys.exit(1)
|
|
||||||
except Exception as e:
|
|
||||||
print(f"✗ Barcode generation failed: {e}")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
print("✓ All barcode generation tests passed!")
|
|
||||||
@@ -1,67 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Test Brave browser integration with Pokemon Discovery
|
|
||||||
"""
|
|
||||||
|
|
||||||
import sys
|
|
||||||
import os
|
|
||||||
|
|
||||||
try:
|
|
||||||
from selenium import webdriver
|
|
||||||
from selenium.webdriver.chrome.options import Options
|
|
||||||
from selenium.webdriver.chrome.service import Service
|
|
||||||
from webdriver_manager.chrome import ChromeDriverManager
|
|
||||||
|
|
||||||
print("✓ Selenium and webdriver-manager are available")
|
|
||||||
|
|
||||||
# Check if Brave is available
|
|
||||||
if not os.path.exists('/usr/bin/brave'):
|
|
||||||
print("✗ Brave browser not found at /usr/bin/brave")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
print("✓ Brave browser found at /usr/bin/brave")
|
|
||||||
|
|
||||||
# Get Brave version
|
|
||||||
import subprocess
|
|
||||||
try:
|
|
||||||
result = subprocess.run(['/usr/bin/brave', '--version'],
|
|
||||||
capture_output=True, text=True, timeout=5)
|
|
||||||
brave_version = result.stdout.strip()
|
|
||||||
print(f"✓ {brave_version}")
|
|
||||||
except:
|
|
||||||
print("⚠ Could not get Brave version")
|
|
||||||
|
|
||||||
# Test ChromeDriver compatibility
|
|
||||||
print("\nTesting ChromeDriver compatibility...")
|
|
||||||
options = Options()
|
|
||||||
options.add_argument('--headless')
|
|
||||||
options.add_argument('--no-sandbox')
|
|
||||||
options.add_argument('--disable-dev-shm-usage')
|
|
||||||
options.binary_location = '/usr/bin/brave'
|
|
||||||
|
|
||||||
try:
|
|
||||||
service = Service(ChromeDriverManager().install())
|
|
||||||
driver = webdriver.Chrome(service=service, options=options)
|
|
||||||
|
|
||||||
# Simple test page
|
|
||||||
driver.get("data:text/html,<html><body><h1>Test</h1></body></html>")
|
|
||||||
title = driver.title
|
|
||||||
driver.quit()
|
|
||||||
|
|
||||||
print("✓ Brave + ChromeDriver test successful!")
|
|
||||||
print("✓ Pokemon Discovery is ready to use Brave for dynamic content")
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"✗ ChromeDriver compatibility issue: {e}")
|
|
||||||
print("\n💡 Solutions:")
|
|
||||||
print("1. Update ChromeDriver: pip install --upgrade webdriver-manager")
|
|
||||||
print("2. Install matching ChromeDriver version manually")
|
|
||||||
print("3. Use Firefox with geckodriver as alternative")
|
|
||||||
print("\nNote: The main PDF generation functionality works without browser automation")
|
|
||||||
|
|
||||||
except ImportError as e:
|
|
||||||
print(f"✗ Missing dependency: {e}")
|
|
||||||
print("Run: pip install selenium webdriver-manager")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
print("\n🎯 Test completed!")
|
|
||||||
@@ -1,26 +0,0 @@
|
|||||||
[
|
|
||||||
{
|
|
||||||
"title": "Pokemon Trading Card Game Battle Academy",
|
|
||||||
"price": "$19.95",
|
|
||||||
"stock": "In Stock",
|
|
||||||
"sku": "DG12345678",
|
|
||||||
"image_url": "https://via.placeholder.com/300x200?text=Pokemon+Battle+Academy",
|
|
||||||
"url": "https://www.dollargeneral.com/p/pokemon-battle-academy"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"title": "Pokemon TCG Scarlet & Violet Booster Pack",
|
|
||||||
"price": "$4.25",
|
|
||||||
"stock": "In Stock",
|
|
||||||
"sku": "DG87654321",
|
|
||||||
"image_url": "https://via.placeholder.com/300x200?text=Pokemon+Booster+Pack",
|
|
||||||
"url": "https://www.dollargeneral.com/p/pokemon-scarlet-violet-booster"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"title": "Pokemon Tin Collection Box",
|
|
||||||
"price": "$12.95",
|
|
||||||
"stock": "Low Stock",
|
|
||||||
"sku": "DG11223344",
|
|
||||||
"image_url": "https://via.placeholder.com/300x200?text=Pokemon+Tin+Box",
|
|
||||||
"url": "https://www.dollargeneral.com/p/pokemon-tin-collection"
|
|
||||||
}
|
|
||||||
]
|
|
||||||
@@ -1,152 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Test dynamic content loading for Pokemon Discovery
|
|
||||||
"""
|
|
||||||
|
|
||||||
import requests
|
|
||||||
import json
|
|
||||||
from bs4 import BeautifulSoup
|
|
||||||
import time
|
|
||||||
|
|
||||||
def test_api_endpoints():
|
|
||||||
"""Try to find API endpoints that might return product data"""
|
|
||||||
|
|
||||||
headers = {
|
|
||||||
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
|
|
||||||
'Accept': 'application/json, text/plain, */*',
|
|
||||||
'Accept-Language': 'en-US,en;q=0.9',
|
|
||||||
'Referer': 'https://www.dollargeneral.com/c/toys/pokemon'
|
|
||||||
}
|
|
||||||
|
|
||||||
# Test potential API endpoints
|
|
||||||
api_tests = [
|
|
||||||
'https://www.dollargeneral.com/api/products/search?q=pokemon',
|
|
||||||
'https://www.dollargeneral.com/api/v1/products?category=toys&query=pokemon',
|
|
||||||
'https://www.dollargeneral.com/dg/search?q=pokemon&category=toys',
|
|
||||||
'https://www.dollargeneral.com/api/search?term=pokemon+trading+card',
|
|
||||||
]
|
|
||||||
|
|
||||||
print("=== Testing API Endpoints ===")
|
|
||||||
for url in api_tests:
|
|
||||||
try:
|
|
||||||
print(f"Testing: {url}")
|
|
||||||
response = requests.get(url, headers=headers, timeout=10)
|
|
||||||
print(f" Status: {response.status_code}")
|
|
||||||
|
|
||||||
if response.status_code == 200:
|
|
||||||
try:
|
|
||||||
data = response.json()
|
|
||||||
print(f" JSON Response: {len(str(data))} characters")
|
|
||||||
if 'products' in str(data).lower():
|
|
||||||
print(" ✓ Contains 'products'")
|
|
||||||
if 'pokemon' in str(data).lower():
|
|
||||||
print(" ✓ Contains 'pokemon'")
|
|
||||||
except:
|
|
||||||
print(f" Text Response: {len(response.text)} characters")
|
|
||||||
print()
|
|
||||||
except Exception as e:
|
|
||||||
print(f" Error: {e}")
|
|
||||||
print()
|
|
||||||
|
|
||||||
def test_network_requests():
|
|
||||||
"""Analyze the search page to find AJAX calls"""
|
|
||||||
|
|
||||||
url = 'https://www.dollargeneral.com/c/toys/pokemon?q=&soldAtStore=true'
|
|
||||||
|
|
||||||
headers = {
|
|
||||||
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
|
|
||||||
}
|
|
||||||
|
|
||||||
print("=== Analyzing Search Page for API Calls ===")
|
|
||||||
|
|
||||||
try:
|
|
||||||
response = requests.get(url, headers=headers, timeout=30)
|
|
||||||
soup = BeautifulSoup(response.text, 'html.parser')
|
|
||||||
|
|
||||||
# Look for API endpoints in JavaScript
|
|
||||||
scripts = soup.find_all('script')
|
|
||||||
api_patterns = []
|
|
||||||
|
|
||||||
for script in scripts:
|
|
||||||
if script.string:
|
|
||||||
content = script.string
|
|
||||||
|
|
||||||
# Look for API endpoints
|
|
||||||
import re
|
|
||||||
patterns = [
|
|
||||||
r'(?:api|Api|API)["\'\s]*[:=]["\'\s]*([^"\']+)',
|
|
||||||
r'(?:endpoint|url|baseURL)["\'\s]*[:=]["\'\s]*([^"\']+)',
|
|
||||||
r'fetch\s*\(\s*["\']([^"\']+)["\']',
|
|
||||||
r'xhr\.open\s*\(\s*["\'][^"\']*["\'],\s*["\']([^"\']+)["\']',
|
|
||||||
r'/api/[^"\'\\s]+',
|
|
||||||
r'/search[^"\'\\s]*',
|
|
||||||
]
|
|
||||||
|
|
||||||
for pattern in patterns:
|
|
||||||
matches = re.findall(pattern, content, re.IGNORECASE)
|
|
||||||
for match in matches:
|
|
||||||
if 'dollargeneral' in match or match.startswith('/'):
|
|
||||||
api_patterns.append(match)
|
|
||||||
|
|
||||||
# Remove duplicates and clean up
|
|
||||||
unique_apis = list(set(api_patterns))
|
|
||||||
|
|
||||||
print(f"Found {len(unique_apis)} potential API endpoints:")
|
|
||||||
for api in unique_apis[:10]: # Show first 10
|
|
||||||
print(f" -> {api}")
|
|
||||||
|
|
||||||
return unique_apis
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error analyzing page: {e}")
|
|
||||||
return []
|
|
||||||
|
|
||||||
def test_sitemap_approach():
|
|
||||||
"""Try to find products via sitemap"""
|
|
||||||
|
|
||||||
print("=== Testing Sitemap Approach ===")
|
|
||||||
|
|
||||||
sitemap_urls = [
|
|
||||||
'https://www.dollargeneral.com/sitemap.xml',
|
|
||||||
'https://www.dollargeneral.com/robots.txt'
|
|
||||||
]
|
|
||||||
|
|
||||||
for url in sitemap_urls:
|
|
||||||
try:
|
|
||||||
print(f"Testing: {url}")
|
|
||||||
response = requests.get(url, timeout=10)
|
|
||||||
print(f" Status: {response.status_code}")
|
|
||||||
|
|
||||||
if response.status_code == 200:
|
|
||||||
content = response.text
|
|
||||||
if 'pokemon' in content.lower():
|
|
||||||
print(" ✓ Contains Pokemon references")
|
|
||||||
if '/p/' in content:
|
|
||||||
print(" ✓ Contains product URLs (/p/)")
|
|
||||||
print(f" Content length: {len(content)} characters")
|
|
||||||
print()
|
|
||||||
except Exception as e:
|
|
||||||
print(f" Error: {e}")
|
|
||||||
print()
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
print("Pokemon Discovery - Dynamic Content Testing")
|
|
||||||
print("=" * 60)
|
|
||||||
print()
|
|
||||||
|
|
||||||
# Test various approaches to find products
|
|
||||||
test_api_endpoints()
|
|
||||||
print()
|
|
||||||
|
|
||||||
apis = test_network_requests()
|
|
||||||
print()
|
|
||||||
|
|
||||||
test_sitemap_approach()
|
|
||||||
print()
|
|
||||||
|
|
||||||
print("=" * 60)
|
|
||||||
print("Summary:")
|
|
||||||
print("- Individual product extraction: ✅ WORKING")
|
|
||||||
print("- Product URLs can be processed if found")
|
|
||||||
print("- Main challenge: Finding product URLs from search page")
|
|
||||||
print("- Dynamic content requires browser automation or API discovery")
|
|
||||||
@@ -1,165 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Test Pokemon Discovery with real Dollar General Pokemon products
|
|
||||||
Demonstrates full working pipeline with known products
|
|
||||||
"""
|
|
||||||
|
|
||||||
import json
|
|
||||||
import sys
|
|
||||||
import os
|
|
||||||
from datetime import datetime
|
|
||||||
|
|
||||||
# Add current directory to path
|
|
||||||
sys.path.insert(0, '.')
|
|
||||||
|
|
||||||
from scraper import PokemonTCGScraper
|
|
||||||
from pdf_generator import PokemonTCGCatalogGenerator
|
|
||||||
|
|
||||||
def test_known_products():
|
|
||||||
"""Test with known Pokemon TCG products from Dollar General"""
|
|
||||||
|
|
||||||
# Known Pokemon TCG products (you can add more as you find them)
|
|
||||||
known_products = [
|
|
||||||
'https://www.dollargeneral.com/p/pok-mon-trading-card-game-card-pack-ct/728192558375',
|
|
||||||
# Add more product URLs here as they're discovered
|
|
||||||
]
|
|
||||||
|
|
||||||
print("Pokemon Discovery - Real Product Test")
|
|
||||||
print("=" * 50)
|
|
||||||
print(f"Testing with {len(known_products)} known products")
|
|
||||||
print()
|
|
||||||
|
|
||||||
scraper = PokemonTCGScraper()
|
|
||||||
products_found = []
|
|
||||||
|
|
||||||
    for i, url in enumerate(known_products, 1):
        print(f"Testing product {i}/{len(known_products)}")
        print(f"URL: {url}")

        # Get product page
        html = scraper.get_page_content(url)

        if html:
            # Extract product information
            product = scraper.extract_product_info(url, html)

            # Check if it's a Pokemon TCG product
            if scraper.is_pokemon_tcg_product(product):
                products_found.append(product)
                print(f"✓ FOUND: {product.get('title', 'Unknown')}")
                print(f" SKU: {product.get('sku', 'N/A')}")
                print(f" Price: {product.get('price', 'N/A')}")

                # Try to get additional data we might have missed
                if not product.get('price'):
                    print(" (Attempting to find price...)")
                    from bs4 import BeautifulSoup
                    soup = BeautifulSoup(html, 'html.parser')

                    # More price selectors
                    price_selectors = ['[data-testid="price"]', '.price-display', '.current-price', '[class*="price"]']
                    for selector in price_selectors:
                        price_elem = soup.select_one(selector)
                        if price_elem and not product.get('price'):
                            price_text = price_elem.get_text().strip()
                            if '$' in price_text:
                                product['price'] = price_text
                                print(f" Found price: {price_text}")
                                break

                # Try to get stock info
                if not product.get('stock'):
                    print(" (Attempting to find stock status...)")
                    from bs4 import BeautifulSoup
                    soup = BeautifulSoup(html, 'html.parser')

                    # Look for stock indicators
                    if 'in stock' in html.lower():
                        product['stock'] = 'In Stock'
                    elif 'out of stock' in html.lower():
                        product['stock'] = 'Out of Stock'
                    elif 'available' in html.lower():
                        product['stock'] = 'Available'
                    else:
                        product['stock'] = 'Unknown'

                print(f" Stock: {product.get('stock')}")
            else:
                print("✗ Not a Pokemon TCG product")
        else:
            print("✗ Failed to get product page")

        print()

    if products_found:
        print(f"SUCCESS! Found {len(products_found)} Pokemon TCG products")
        print()

        # Save to JSON file
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        json_file = f'pokemon_tcg_products_real_{timestamp}.json'

        with open(json_file, 'w') as f:
            json.dump(products_found, f, indent=2)

        print(f"✓ Saved product data: {json_file}")

        # Generate PDF catalog
        print("✓ Generating PDF catalog...")

        try:
            generator = PokemonTCGCatalogGenerator(json_file)
            pdf_file = generator.generate_pdf()

            if pdf_file:
                print(f"✓ PDF catalog generated: {pdf_file}")

                # Show file sizes
                import os
                if os.path.exists(pdf_file):
                    size = os.path.getsize(pdf_file) / 1024
                    print(f" PDF size: {size:.1f} KB")

                # Count barcodes generated
                barcode_dir = generator.barcodes_dir
                if barcode_dir.exists():
                    barcodes = list(barcode_dir.glob('*.png'))
                    print(f" Barcodes generated: {len(barcodes)}")

                print()
                print("🎉 COMPLETE SUCCESS!")
                print("Pokemon Discovery successfully:")
                print(f" • Scraped {len(products_found)} real products from Dollar General")
                print(" • Generated professional PDF catalog")
                print(" • Created scannable UPC-A barcodes")
                print(" • Used Unix-friendly timestamped files")

                return True

        except Exception as e:
            print(f"Error generating PDF: {e}")
            print("But product scraping was successful!")
            return True

    else:
        print("No Pokemon TCG products found.")
        print()
        print("This could be due to:")
        print("- Products no longer available")
        print("- Changed product URLs")
        print("- Need to find more current product URLs")

        return False

if __name__ == "__main__":
    success = test_known_products()

    print()
    print("=" * 50)
    if success:
        print("✅ Pokemon Discovery is fully functional!")
        print(" Ready for production use with product URLs")
    else:
        print("⚠️ Product URL discovery needed")
        print(" Core functionality confirmed working")
    print("=" * 50)
@@ -1,260 +0,0 @@
#!/usr/bin/env python3
"""
Working Pokemon Product Finder
Implements a practical approach to find Pokemon TCG products
"""

import json
import requests
from datetime import datetime
from scraper import PokemonTCGScraper

class WorkingProductFinder:
    """
    A practical implementation that combines known techniques
    to find Pokemon TCG products automatically
    """

    def __init__(self):
        self.scraper = PokemonTCGScraper()
        self.known_products = []

    def discover_products_via_sitemap(self):
        """Try to find product URLs via sitemap or other discovery methods"""

        print("🔍 Attempting product discovery via multiple methods...")

        # Method 1: Try sitemap approach
        urls_to_check = [
            'https://www.dollargeneral.com/sitemap.xml',
            'https://www.dollargeneral.com/sitemap-products.xml',
            'https://www.dollargeneral.com/sitemap-pokemon.xml'
        ]

        found_urls = []

        for url in urls_to_check:
            try:
                print(f" Checking: {url}")
                response = requests.get(url, timeout=30)
                if response.status_code == 200:
                    content = response.text.lower()
                    if 'pokemon' in content:
                        print(f" ✓ Contains Pokemon references")
                        # Extract URLs here if needed

                    if '/p/' in content:
                        print(f" ✓ Contains product URLs")
                        # Could parse sitemap XML here

            except Exception as e:
                print(f" ✗ Failed: {e}")

        return found_urls

    def search_via_known_patterns(self):
        """Try common Pokemon TCG product URL patterns"""

        print("🎯 Trying known product URL patterns...")

        # Common Pokemon TCG product patterns at Dollar General
        search_patterns = [
            # Known working product
            'https://www.dollargeneral.com/p/pok-mon-trading-card-game-card-pack-ct/728192558375',

            # Try variations and similar UPCs
            'https://www.dollargeneral.com/search?q=pokemon+trading+card',
            'https://www.dollargeneral.com/search?q=pokemon+pack',
            'https://www.dollargeneral.com/search?q=pokemon+tin',
        ]

        working_products = []

        for pattern in search_patterns:
            print(f" Testing: {pattern}")

            if '/p/' in pattern:
                # This is a direct product URL
                html = self.scraper.get_page_content(pattern)
                if html:
                    product = self.scraper.extract_product_info(pattern, html)
                    if self.scraper.is_pokemon_tcg_product(product):
                        working_products.append(product)
                        print(f" ✓ Valid: {product.get('title', 'Unknown')}")
            else:
                # This is a search URL - check if it has useful content
                try:
                    response = requests.get(pattern, timeout=30)
                    if response.status_code == 200 and len(response.text) > 5000:
                        print(f" ✓ Search page accessible")
                        # Could parse for product links here
                except:
                    print(f" ✗ Search failed")

        return working_products

    def expand_known_products(self):
        """Try to find more products based on known ones"""

        print("🔄 Attempting to find related products...")

        # If we have a working product URL, try variations
        known_url = 'https://www.dollargeneral.com/p/pok-mon-trading-card-game-card-pack-ct/728192558375'

        # Extract the UPC from known URL
        upc = '728192558375'
        base_upc = upc[:-1] # Remove last digit

        print(f" Base UPC pattern: {base_upc}X")

        # Try variations in UPC (last digit changes for different products)
        variations_to_try = []
        for i in range(10):
            test_upc = base_upc + str(i)
            test_url = f'https://www.dollargeneral.com/p/pok-mon-trading-card-game-card-pack-ct/{test_upc}'
            variations_to_try.append(test_url)

        found_products = []

        for url in variations_to_try[:5]: # Try first 5 to be respectful
            print(f" Testing UPC variation: {url.split('/')[-1]}")

            try:
                html = self.scraper.get_page_content(url)
                if html and 'pokemon' in html.lower():
                    product = self.scraper.extract_product_info(url, html)
                    if product.get('title'):
                        found_products.append(product)
                        print(f" ✓ Found: {product['title']}")
                else:
                    print(f" ✗ No product found")

            except Exception as e:
                print(f" ✗ Error: {e}")

            # Be respectful - small delay
            import time
            time.sleep(1)

        return found_products

    def manual_product_list(self):
        """Return manually curated list of Pokemon TCG products"""

        print("📋 Using manually curated product list...")

        # These would be products we've confirmed exist
        # Users can add more as they discover them
        manual_list = [
            {
                'title': 'Pokémon Trading Card Game, 15 Card Pack, 1 ct',
                'url': 'https://www.dollargeneral.com/p/pok-mon-trading-card-game-card-pack-ct/728192558375',
                'sku': '41936301',
                'upc': '728192558375',
                'note': 'Confirmed working product'
            }
        ]

        verified_products = []

        for item in manual_list:
            print(f" Verifying: {item['title']}")

            html = self.scraper.get_page_content(item['url'])
            if html:
                product = self.scraper.extract_product_info(item['url'], html)
                if product.get('title'):
                    verified_products.append(product)
                    print(f" ✓ Verified: {product['title']}")

        return verified_products

    def find_all_pokemon_products(self):
        """Try all available methods to find Pokemon TCG products"""

        print("Pokemon Product Finder - Multiple Discovery Methods")
        print("=" * 60)

        all_products = []

        # Method 1: Sitemap discovery
        sitemap_products = self.discover_products_via_sitemap()
        all_products.extend(sitemap_products)
        print()

        # Method 2: Known patterns
        pattern_products = self.search_via_known_patterns()
        all_products.extend(pattern_products)
        print()

        # Method 3: Expand from known products
        expanded_products = self.expand_known_products()
        all_products.extend(expanded_products)
        print()

        # Method 4: Manual list (always works)
        manual_products = self.manual_product_list()
        all_products.extend(manual_products)
        print()

        # Remove duplicates based on SKU
        unique_products = {}
        for product in all_products:
            sku = product.get('sku')
            if sku and sku not in unique_products:
                unique_products[sku] = product

        final_products = list(unique_products.values())

        print("=" * 60)
        print(f"🎉 DISCOVERY COMPLETE!")
        print(f"Found {len(final_products)} unique Pokemon TCG products")
        print()

        if final_products:
            # Filter for products with 'pack' or 'tin' in the name
            pack_tin_products = []
            for product in final_products:
                title = product.get('title', '').lower()
                if any(keyword in title for keyword in ['pack', 'tin', 'box', 'collection']):
                    pack_tin_products.append(product)
                    print(f"✓ Pack/Tin: {product['title']}")

            print()
            print(f"📦 Found {len(pack_tin_products)} products with 'pack', 'tin', 'box', or 'collection'")

            # Save results
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            filename = f'pokemon_tcg_discovered_{timestamp}.json'

            with open(filename, 'w') as f:
                json.dump(final_products, f, indent=2)

            print(f"💾 Saved all products to: {filename}")

            return final_products
        else:
            print("❌ No products discovered through any method")
            return []

def main():
    finder = WorkingProductFinder()
    products = finder.find_all_pokemon_products()

    if products:
        print()
        print("🚀 SUCCESS! Products ready for PDF generation:")
        print(f" python pdf_generator.py pokemon_tcg_discovered_[timestamp].json")
        print()
        print("📈 Next steps:")
        print("1. Add more product URLs to manual_product_list() as you discover them")
        print("2. Run the PDF generator to create your catalog")
        print("3. The API authentication can be solved later for bulk discovery")
    else:
        print()
        print("📝 Current limitation: Product discovery needs enhancement")
        print("💡 Suggestion: Add known product URLs to manual_product_list()")
        print("✅ Individual product extraction still works perfectly!")

if __name__ == "__main__":
    main()