Santosh Gautam - Full Stack Developer Gurugram India
Scrape Data Tool — Python Playwright Web Scraper by Santosh Gautam India
Open Source · Python · Playwright
Automation Tool · Open Source · Personal Project

Scrape Data Tool — Website Scraper

A Python & Playwright-based web scraping tool that fully renders dynamic JavaScript content and downloads complete websites — HTML, CSS, JS and images — packaged into a downloadable ZIP archive.

Santosh Gautam

Full Stack Developer · India

Python · Playwright · Flask · BeautifulSoup4 · REST API

Features

Full JS Rendering

Playwright (Chromium) renders dynamic JS content.

Complete Asset Download

Downloads CSS, JS, images — all linked assets.

Local Link Rewriting

HTML links rewritten to point to local files.

ZIP Download

Full website bundled into a downloadable ZIP.

REST API

/scrape and /download endpoints included.

CORS Enabled

Cross-origin requests supported out of the box.

Auto-Cleanup

ZIP files auto-deleted after download.

Built-in Logging

Detailed logs for easier debugging.
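The Full JS Rendering feature above can be illustrated with Playwright's synchronous API. This is a minimal sketch, not the project's actual code: the function name render_page, the 30-second default timeout, and the choice of the "networkidle" wait condition are all assumptions made for illustration.

```python
def render_page(url, timeout_ms=30000):
    """Return the HTML of `url` after Chromium has executed its JavaScript.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    # Imported lazily so this module still loads where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits until network activity has settled, which
            # gives most client-side rendering a chance to finish.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            # page.content() returns the serialized DOM, including nodes
            # that scripts added after the initial HTML response.
            return page.content()
        finally:
            browser.close()
```

Calling render_page("https://example.com") would return the rendered markup as a string, which could then be handed to BeautifulSoup for parsing.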

Requirements

  • Python 3.7+
  • Playwright
  • Flask
  • Flask-CORS
  • Requests
  • BeautifulSoup4

Installation & Setup

1. Clone the repository and navigate into the project folder.

git clone <repository-url> && cd scrape-data-tool

2. Create and activate a Python virtual environment.

python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate

3. Install all required dependencies.

pip install flask flask-cors requests beautifulsoup4 playwright

4. Install the Chromium browser that Playwright drives.

playwright install chromium

5. Setup is complete. Start the server as described in the next section.

Running the Server

Start the Flask development server:

python app.py

# Server starts at:
# http://127.0.0.1:5000

API Usage

GET /scrape : Scrape a website
  • url: required, target website URL
  • type: optional, supports zip
Example: GET http://127.0.0.1:5000/scrape?url=https://example.com&type=zip

GET /download/<filename> : Download ZIP
Example: GET http://127.0.0.1:5000/download/example_com_1a2b3c4d.zip
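When the target URL carries its own query string, it should be percent-encoded so that its & and = characters are not read as parameters of /scrape. The sketch below only builds the two request URLs with the standard library. The helper names build_scrape_url and build_download_url are hypothetical, and the response format of /scrape is not documented here, so no parsing is shown.

```python
from urllib.parse import quote, urlencode

# Default address of the Flask development server described above.
BASE_URL = "http://127.0.0.1:5000"


def build_scrape_url(target, as_zip=True, base_url=BASE_URL):
    """Build a /scrape request URL with the target safely percent-encoded."""
    params = {"url": target}
    if as_zip:
        params["type"] = "zip"
    # urlencode escapes ':', '/', '?', '&' and '=' inside the target URL.
    return f"{base_url}/scrape?{urlencode(params)}"


def build_download_url(filename, base_url=BASE_URL):
    """Build a /download/<filename> request URL."""
    return f"{base_url}/download/{quote(filename)}"


print(build_scrape_url("https://example.com/page?a=1&b=2"))
print(build_download_url("example_com_1a2b3c4d.zip"))
```

Either URL can then be fetched with any HTTP client, for example urllib.request.urlopen or the Requests library already listed in the requirements.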

How It Works

1. Playwright loads full page (including JS-rendered content)
2. BeautifulSoup parses HTML structure
3. External assets (CSS, JS, images) downloaded locally
4. HTML links rewritten to point to local assets
5. All files bundled into ZIP archive → downloads/ folder
6. ZIP path returned via API → user downloads file
7. ZIP auto-deleted after download to save disk space
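Steps 3 to 5 can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the tool's real implementation: the helper names, the assets/<host>/<path> layout, and the in-memory archive are all hypothetical (the tool itself writes its ZIP into the downloads/ folder), and query strings on asset URLs are simply dropped.

```python
import io
import posixpath
import zipfile
from urllib.parse import urljoin, urlparse


def local_path_for(asset_url):
    """Map an absolute asset URL to a relative path inside the archive."""
    parsed = urlparse(asset_url)
    path = parsed.path.lstrip("/")
    # Directory-style URLs have no file name, so give them a fixed one.
    if not path or path.endswith("/"):
        path += "index"
    # Keeping the host and full path avoids clashes between assets
    # that share a file name (assumed layout, not the tool's own).
    return posixpath.join("assets", parsed.netloc, path)


def build_asset_map(page_url, references):
    """Resolve href/src values against the page URL and assign local paths."""
    return {ref: local_path_for(urljoin(page_url, ref)) for ref in references}


def bundle_zip(files):
    """Pack a {archive_path: bytes} mapping into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, data in files.items():
            archive.writestr(path, data)
    return buffer.getvalue()


asset_map = build_asset_map(
    "https://example.com/blog/post.html",
    ["../css/site.css", "/js/app.js"],
)
print(asset_map)
```

The resulting map pairs each original href or src value with its local path, so the rewriting step only has to replace one attribute value with the other before the files are bundled.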

Notes & Tips

  • Supports CSS, JS and image assets linked via href and src.
  • Always verify that scraping is permitted on target websites.
  • Large sites take longer to render and download; if a scrape is cut off, consider raising the timeouts used for page loads and asset requests.
  • ZIP files are auto-deleted after download to save storage.
  • Use rate limiting or proxies to avoid IP blocking on large-scale scraping.
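The rate-limiting tip can be as simple as enforcing a minimum pause between consecutive requests. The class below is a minimal sketch of that idea and is not part of the tool: the name PoliteDelay and its parameters are hypothetical, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time


class PoliteDelay:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until `min_interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling limiter.wait() before every page or asset request, with for example limiter = PoliteDelay(1.0), would cap the scraper at roughly one request per second.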

License

MIT License — Free to use & modify

Author

Santosh Gautam

hisantosh.com