Santosh Gautam

Full Stack Developer

🕸️ Scrape Data Tool 🕸️

A simple web-based tool that lets users scrape and download HTML, CSS, JS, and images from websites, packaged neatly into a ZIP file for easy downloading.

🚀 Features

  • Uses Playwright (Chromium) to fully render dynamic JavaScript content.
  • Downloads all linked CSS, JS, and image assets.
  • Rewrites asset links in the HTML to local files.
  • Packages the entire website into a downloadable ZIP file.
  • Simple REST API with /scrape and /download endpoints.
  • CORS enabled for cross-origin requests.
  • Auto-cleans ZIP files after download to save disk space.
  • Logging for easier debugging.

📋 Requirements

  • Python 3.7+
  • Playwright
  • Flask & Flask-CORS
  • Requests
  • BeautifulSoup4

🚀 Installation & Setup

  1. Clone the repo with git clone <repository-url> and enter the project folder.
  2. Install Python 3.7+ globally (instructions differ for Linux, macOS, and Windows).
  3. Create and activate a virtual environment.
  4. Install dependencies with pip install flask flask-cors requests beautifulsoup4 playwright.
  5. Install Playwright browsers: playwright install.
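
After the install steps, a quick check that the packages are importable can save a confusing startup error. The snippet below is a minimal sketch, not part of the project: the helper name missing_modules is hypothetical, and the list uses the import names of the packages above (beautifulsoup4 is imported as bs4, flask-cors as flask_cors).

```python
import importlib.util

# Import names for the packages listed in the install step.
# beautifulsoup4 is imported as "bs4" and flask-cors as "flask_cors".
REQUIRED_MODULES = ["flask", "flask_cors", "requests", "bs4", "playwright"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages (import names):", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Run it inside the activated virtual environment. It only checks the Python packages, so the Playwright browser download in step 5 is still needed.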

⚙️ Running the Server

Start the Flask app using:

python app.py

Default URL: http://127.0.0.1:5000

🔗 API Usage

1. Scrape a website

Endpoint: /scrape
Method: GET
Query Params:

  • url (required): the website URL to scrape.
  • type (optional): output format; currently supports "zip".

GET http://127.0.0.1:5000/scrape?url=https://example.com&type=zip

Returns a JSON response containing the ZIP file path, or an error message if the scrape fails.
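
The same request can be made from a script. This is a minimal client sketch using only the standard library, assuming the server is running at the default local address. The function names are hypothetical, and because the exact JSON field names are not documented here, the response is printed as-is without assuming any keys.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # default local address from this README


def build_scrape_url(target_url, base_url=BASE_URL, output_type="zip"):
    """Build the /scrape request URL with properly encoded query parameters."""
    query = urllib.parse.urlencode({"url": target_url, "type": output_type})
    return f"{base_url}/scrape?{query}"


def scrape(target_url, timeout=120):
    """Call /scrape and return the decoded JSON body (no field names assumed)."""
    with urllib.request.urlopen(build_scrape_url(target_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(scrape("https://example.com"))
```

Encoding the url parameter matters when the target address has its own query string, since an unencoded & would otherwise be read as a separator between this API's parameters. Note that urlopen raises urllib.error.HTTPError for 4xx and 5xx responses, so a real client would likely wrap the call in a try block.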

2. Download ZIP archive

Endpoint: /download/<filename>
Method: GET

GET http://127.0.0.1:5000/download/example_com_1a2b3c4d.zip
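
To save the archive from a script, the response can be streamed straight to disk. The sketch below assumes the default local address. build_download_url and download_zip are illustrative names, not part of the project.

```python
import shutil
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # default local address from this README


def build_download_url(filename, base_url=BASE_URL):
    """Build the /download/<filename> URL, percent-encoding the filename."""
    return f"{base_url}/download/{urllib.parse.quote(filename, safe='')}"


def download_zip(filename, dest_path, timeout=120):
    """Stream the ZIP archive to dest_path without loading it all into memory."""
    with urllib.request.urlopen(build_download_url(filename), timeout=timeout) as resp:
        with open(dest_path, "wb") as out:
            shutil.copyfileobj(resp, out)
    return dest_path


if __name__ == "__main__":
    download_zip("example_com_1a2b3c4d.zip", "example_com.zip")
```

Because ZIP files are removed from the server after download, a second request for the same filename will probably fail, so keep the first copy.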

⚙️ How It Works

  • Uses Playwright to load full pages including JS content.
  • Parses HTML with BeautifulSoup.
  • Downloads external assets locally and rewrites HTML links.
  • Bundles everything into a ZIP file in downloads/.
  • The ZIP path is returned in the /scrape response, and the file can then be fetched through the /download endpoint.
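
Two of the steps above, mapping an asset reference to a local file and bundling the result, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's actual code: the assets/ folder, the hash-prefixed naming scheme, and both function names are hypothetical.

```python
import hashlib
import os
import posixpath
import zipfile
from urllib.parse import urljoin, urlparse


def local_asset_path(page_url, asset_ref):
    """Map an href/src value to a relative path inside the archive."""
    absolute = urljoin(page_url, asset_ref)  # resolve relative references
    name = posixpath.basename(urlparse(absolute).path) or "index"
    digest = hashlib.sha1(absolute.encode("utf-8")).hexdigest()[:8]
    return f"assets/{digest}_{name}"  # hash prefix avoids name collisions


def bundle_directory(src_dir, zip_path):
    """Write every file under src_dir into a ZIP archive, keeping relative paths."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(src_dir):
            for filename in files:
                full_path = os.path.join(root, filename)
                archive.write(full_path, os.path.relpath(full_path, src_dir))
    return zip_path
```

Resolving each reference against the page URL first means that ../css/site.css and /css/site.css map to the same local file when they point at the same asset. In this sketch zip_path should sit outside src_dir, otherwise the archive would try to include itself.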

💡 Notes & Tips

  • Currently supports only basic CSS, JS, and image assets referenced through href and src attributes.
  • Check that scraping is allowed on target sites.
  • Large sites may take longer to scrape.
  • ZIP files are auto-deleted after download.
  • Consider rate limiting or proxies to avoid IP blocking.
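
For the permission and rate-limiting tips, a caller could check robots.txt rules and space out requests before hitting the /scrape endpoint. The following is a minimal sketch using the standard library. is_allowed and throttled are hypothetical helpers, and the one-second default delay is an arbitrary assumption rather than a recommendation from the tool.

```python
import time
import urllib.robotparser


def is_allowed(robots_lines, url, user_agent="*"):
    """Check a URL against robots.txt rules supplied as a list of lines."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def throttled(items, delay_seconds=1.0):
    """Yield items one at a time, sleeping between them to space out requests."""
    for index, item in enumerate(items):
        if index:  # no delay before the first item
            time.sleep(delay_seconds)
        yield item


if __name__ == "__main__":
    rules = ["User-agent: *", "Disallow: /private/"]
    for url in throttled(["https://example.com/", "https://example.com/private/x"]):
        print(url, "allowed" if is_allowed(rules, url) else "disallowed")
```

In practice the rules would come from the site itself: RobotFileParser also offers set_url() and read() to fetch a live robots.txt. A robots.txt check covers only part of what "allowed" means, so a site's terms of service still apply.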

📄 License

Licensed under the MIT License.

👤 Author

Made with ❤️ by Santosh Gautam
Website: https://www.hisantosh.com
Contact: santoshgautam317@gmail.com