🕸️ Scrape Data Tool 🕸️

A simple web-based tool that scrapes HTML, CSS, JS, and images from a website and packages them into a ZIP file for download.
🚀 Features
- Uses Playwright (Chromium) to fully render dynamic JavaScript content.
- Downloads all linked CSS, JS, and image assets.
- Rewrites asset links in the HTML to local files.
- Packages the entire website into a downloadable ZIP file.
- Simple REST API with `/scrape` and `/download` endpoints.
- CORS enabled for cross-origin requests.
- Auto-cleans ZIP files after download to save disk space.
- Logging for easier debugging.
📋 Requirements
- Python 3.7+
- Playwright
- Flask & Flask-CORS
- Requests
- BeautifulSoup4
🚀 Installation & Setup
- Clone the repo with `git clone <repository-url>` and enter the folder.
- Install Python globally (instructions differ for Linux/macOS/Windows).
- Create and activate a virtual environment.
- Install dependencies with `pip install flask flask-cors requests beautifulsoup4 playwright`.
- Install Playwright browsers: `playwright install`.
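To confirm the virtual environment has everything from the Requirements list, a small check like the following may help. It is a sketch, not part of the project: the script and the `missing_modules` helper are hypothetical, and only the import names (`bs4` for beautifulsoup4, `flask_cors` for flask-cors) come from the packages themselves.

```python
import importlib.util

# Import names for the packages listed under Requirements.
# beautifulsoup4 is imported as "bs4" and flask-cors as "flask_cors".
REQUIRED_MODULES = ["flask", "flask_cors", "requests", "bs4", "playwright"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

This only checks that the Python packages are importable; it does not verify that the Playwright browsers were downloaded.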
⚙️ Running the Server
Start the Flask app using:
python app.py
Default URL: http://127.0.0.1:5000
🔗 API Usage
1. Scrape a website
Endpoint: /scrape
Method: GET
Query Params: url (required): website URL; type (optional): currently supports "zip".
GET http://127.0.0.1:5000/scrape?url=https://example.com&type=zip
Returns JSON containing the ZIP file path, or an error.
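A client call to this endpoint could look roughly like the sketch below. It uses only the standard library so it runs without extra installs; `build_scrape_url` and `scrape` are hypothetical helper names, and the shape of the returned JSON (its key names) is not documented here, so the code returns the decoded body as-is rather than assuming a field.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:5000"  # default address from "Running the Server"


def build_scrape_url(target, base_url=BASE_URL, archive_type="zip"):
    """Build the /scrape request URL with the target percent-encoded."""
    query = urlencode({"url": target, "type": archive_type})
    return f"{base_url}/scrape?{query}"


def scrape(target, base_url=BASE_URL, timeout=120):
    """Call /scrape and return the decoded JSON body.

    urlopen raises urllib.error.HTTPError if the server answers with an
    error status, so callers may want to catch that.
    """
    with urlopen(build_scrape_url(target, base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `scrape("https://example.com")` would return whatever JSON the endpoint produces. Encoding the target URL matters when it carries its own query string, since an unencoded `&` would otherwise be read as a separate parameter of `/scrape`.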
2. Download ZIP archive
Endpoint: /download/<filename>
Method: GET
GET http://127.0.0.1:5000/download/example_com_1a2b3c4d.zip
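Fetching the archive from a script might be sketched as follows. The helper names are hypothetical, and the code assumes the path returned by `/scrape` ends in the ZIP filename (for example `downloads/example_com_1a2b3c4d.zip`), which is an assumption about the response format.

```python
import os
import shutil
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:5000"  # default address from "Running the Server"


def zip_filename(zip_path):
    """Extract the filename from a ZIP path using either / or \\ separators."""
    return zip_path.replace("\\", "/").rsplit("/", 1)[-1]


def build_download_url(zip_path, base_url=BASE_URL):
    """Turn a ZIP path into the matching /download/<filename> URL."""
    return f"{base_url}/download/{quote(zip_filename(zip_path))}"


def download_zip(zip_path, dest_dir=".", base_url=BASE_URL, timeout=120):
    """Stream the archive to dest_dir and return the local file path."""
    dest = os.path.join(dest_dir, zip_filename(zip_path))
    url = build_download_url(zip_path, base_url)
    with urlopen(url, timeout=timeout) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)  # copy in chunks, not all in memory
    return dest
```

Because ZIP files are removed after download, a second request for the same filename should be expected to fail.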
⚙️ How It Works
- Uses Playwright to load full pages including JS content.
- Parses HTML with BeautifulSoup.
- Downloads external assets locally and rewrites HTML links.
- Bundles everything into a ZIP file in `downloads/`.
- The ZIP path is returned, and the file can be downloaded via the API.
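The link-rewriting and bundling steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in `app.py`: the function names, the `assets/` folder, the hashed filenames, and the `index.html` entry name are all hypothetical, and the archive is built in memory rather than written to `downloads/`.

```python
import hashlib
import io
import posixpath
import zipfile
from urllib.parse import urljoin, urlparse


def local_asset_name(asset_url):
    """Map an asset URL to a stable path under assets/ (hypothetical layout)."""
    base = posixpath.basename(urlparse(asset_url).path) or "asset"
    digest = hashlib.sha1(asset_url.encode("utf-8")).hexdigest()[:8]
    return f"assets/{digest}_{base}"


def rewrite_asset_links(html, page_url):
    """Point stylesheet, script, and image links at local files.

    Returns (rewritten_html, {absolute_asset_url: local_path}).
    """
    # Imported here so the stdlib-only helpers work without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    mapping = {}
    for tag_name, attr in (("link", "href"), ("script", "src"), ("img", "src")):
        for tag in soup.find_all(tag_name):
            value = tag.get(attr)
            if not value or value.startswith("data:"):
                continue
            # Skip <link> tags that are not stylesheets (canonical, icons, ...).
            if tag_name == "link" and "stylesheet" not in (tag.get("rel") or []):
                continue
            absolute = urljoin(page_url, value)
            mapping[absolute] = local_asset_name(absolute)
            tag[attr] = mapping[absolute]
    return str(soup), mapping


def bundle_zip(html, assets):
    """Pack index.html plus {local_path: bytes} assets into ZIP bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("index.html", html)
        for local_path, content in assets.items():
            archive.writestr(local_path, content)
    return buffer.getvalue()
```

A caller would render the page with Playwright, pass the HTML to `rewrite_asset_links`, download each URL in the returned mapping, and hand the results to `bundle_zip`. Hashing the full URL into the filename is one way to avoid collisions when two assets share a basename such as `style.css`.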
💡 Notes & Tips
- Currently supports basic CSS, JS, and image assets linked via `href` and `src`.
- Check that scraping is allowed on target sites.
- Large sites may take longer to scrape.
- ZIP files are auto-deleted after download.
- Consider rate limiting or proxies to avoid IP blocking.
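As one way to approach the rate-limiting tip, the sketch below enforces a minimum delay between requests to the same host. The `HostRateLimiter` class is hypothetical and not part of this tool; the clock and sleep functions are injectable only to make the behaviour easy to test.

```python
import time
from urllib.parse import urlparse


class HostRateLimiter:
    """Keep requests to the same host at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> scheduled time of the latest request

    def wait(self, url):
        """Sleep if needed before requesting url; return the delay applied."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = now + delay
        return delay
```

Calling `limiter.wait(asset_url)` before each asset download would space out requests per host while leaving requests to different hosts unaffected.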
📄 License
Licensed under the MIT License.
👤 Author
Made with ❤️ by Santosh Gautam
Website: https://www.hisantosh.com
Contact: santoshgautam317@gmail.com
