
🕷️ Web Scraping AI Agent

🎓 FREE Step-by-Step Tutorial

👉 Click here to follow our complete step-by-step tutorial and learn how to build this from scratch with detailed code walkthroughs, explanations, and best practices.

AI-powered web scraping with ScrapeGraph AI: extract structured data from websites using natural language prompts. This folder contains two implementations:

  1. 🏠 Local Library - Using the scrapegraphai library (runs locally)
  2. ☁️ Cloud SDK - Using ScrapeGraph AI API (managed service)

📁 What's Inside

🏠 Local Library Version

Files: ai_scrapper.py, local_ai_scrapper.py

Use the open-source ScrapeGraph AI library that runs on your local machine.

Pros:

  • Free to use (no API costs)
  • Full control over execution
  • Privacy-friendly (all data stays local)

Cons:

  • Requires local installation and dependencies
  • Limited by your hardware
  • Need to manage updates

Quick Start:

pip install -r requirements.txt
streamlit run ai_scrapper.py

☁️ Cloud SDK Version

Folder: scrapegraph_ai_sdk/

Use the managed ScrapeGraph AI API with advanced features and no setup required.

Pros:

  • No setup required (just API key)
  • Scalable and fast
  • Advanced features (SmartCrawler, SearchScraper, Markdownify)
  • Always up-to-date

Cons:

  • Pay-per-use (credit-based)
  • Requires internet connection

Quick Start:

cd scrapegraph_ai_sdk/
pip install -r requirements.txt
export SGAI_API_KEY='your-api-key'
python quickstart.py

📖 Full Documentation: See scrapegraph_ai_sdk/README.md


🚀 Getting Started

Local Library Version

  1. Clone the repository
git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
cd awesome-llm-apps/starter_ai_agents/web_scrapping_ai_agent
  2. Install dependencies
pip install -r requirements.txt
  3. Get your OpenAI API Key
  4. Run the Streamlit App
streamlit run ai_scrapper.py
# Or for local models:
streamlit run local_ai_scrapper.py
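
For the local-models variant, scrapegraphai can point at an Ollama server instead of OpenAI. A minimal config sketch (the model name and config keys are assumptions; check local_ai_scrapper.py for the exact settings the app uses):

# Hypothetical scrapegraphai config for a local Ollama model
graph_config = {
    "llm": {
        "model": "ollama/llama3.2",            # any locally pulled Ollama model (assumed name)
        "base_url": "http://localhost:11434",  # default Ollama endpoint
    },
}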

Cloud SDK Version

  1. Navigate to SDK folder
cd scrapegraph_ai_sdk/
  2. Install dependencies
pip install -r requirements.txt
  3. Get your ScrapeGraph AI API Key
  4. Set API key
export SGAI_API_KEY='your-api-key-here'
  5. Run demos
# Quick test
python quickstart.py

# SmartScraper demo
python smart_scraper_demo.py

# Interactive app
streamlit run scrapegraph_app.py

📊 Feature Comparison

Feature        | Local Library         | Cloud SDK
Setup          | Install dependencies  | API key only
Cost           | Free (+ LLM costs)    | Pay-per-use
Processing     | Your hardware         | Cloud-based
Speed          | Depends on hardware   | Fast & optimized
SmartScraper   | ✅                    | ✅
SearchScraper  | ❌                    | ✅
SmartCrawler   | ❌                    | ✅
Markdownify    | ❌                    | ✅
Scheduled Jobs | ❌                    | ✅
Scalability    | Limited               | Unlimited
Maintenance    | Self-managed          | Fully managed

💡 Use Cases

E-commerce Scraping

# Extract product information
prompt = "Extract product names, prices, and availability"

Content Aggregation

# Convert articles to structured data
prompt = "Extract article title, author, date, and main content"

Competitive Intelligence

# Monitor competitor websites
prompt = "Extract pricing, features, and updates"

Lead Generation

# Extract contact information
prompt = "Find company names, emails, and phone numbers"

🔧 How It Works

Local Library

  1. You provide your OpenAI API key
  2. Select the model (GPT-4o, GPT-5, or local models)
  3. Enter the URL and extraction prompt
  4. The app uses ScrapeGraphAI to scrape and extract data locally
  5. Results are displayed in the app

Cloud SDK

  1. You provide your ScrapeGraph AI API key
  2. Choose the scraping method (SmartScraper, SearchScraper, etc.)
  3. Define extraction prompt and optional output schema
  4. API processes the request in the cloud
  5. Structured results are returned
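
Step 3's optional output schema can be expressed as a Pydantic model, which scrapegraph-py accepts via output_schema. A sketch (field names and URL are placeholders):

from pydantic import BaseModel
from scrapegraph_py import Client

# Hypothetical schema describing the shape of the extracted data
class Product(BaseModel):
    name: str
    price: str

class ProductList(BaseModel):
    products: list[Product]

client = Client(api_key="your-api-key")
response = client.smartscraper(
    website_url="https://example.com/shop",  # placeholder
    user_prompt="Extract all products with prices",
    output_schema=ProductList,
)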

🌟 Cloud SDK Features
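
All the snippets below assume a scrapegraph-py client has been created first. A minimal setup sketch (Client.from_env() reading SGAI_API_KEY matches current scrapegraph-py behavior, but verify against your installed version):

from scrapegraph_py import Client

# Reads the SGAI_API_KEY environment variable set earlier
client = Client.from_env()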

🤖 SmartScraper

Extract structured data using natural language:

response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract all products with prices"
)
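
The call returns a plain dict; in current scrapegraph-py releases the extracted data sits under a "result" key (an assumption worth verifying against your SDK version):

# Inspect the extracted data ("result" key assumed)
print(response["result"])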

🔍 SearchScraper

AI-powered web search with structured results:

response = client.searchscraper(
    user_prompt="Find top 5 AI news websites",
    num_results=5
)

📝 Markdownify

Convert webpages to clean markdown:

response = client.markdownify(
    website_url="https://example.com/article"
)
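
The markdown comes back in the same response shape, so it can be written straight to disk (again assuming the "result" key):

# Save the converted markdown ("result" key assumed)
with open("article.md", "w") as f:
    f.write(response["result"])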

🕷️ SmartCrawler

Crawl multiple pages intelligently:

request_id = client.smartcrawler(
    url="https://docs.example.com",
    user_prompt="Extract all API endpoints",
    max_pages=50
)

📖 Documentation

See scrapegraph_ai_sdk/README.md for the full Cloud SDK documentation.

🤝 Which Version Should I Use?

Use Local Library if:

  • You want a free, open-source solution
  • You have good hardware
  • You need full control
  • Privacy is critical

Use Cloud SDK if:

  • You want quick setup
  • You need advanced features
  • You want scalability
  • You prefer a managed service

💡 Pro Tip: Start with the local version to learn, then switch to the Cloud SDK for production!