
🕷️ Web Scraping AI Agent

🎓 FREE Step-by-Step Tutorial

👉 Click here to follow our complete step-by-step tutorial and learn how to build this from scratch with detailed code walkthroughs, explanations, and best practices.

AI-powered web scraping with ScrapeGraph AI: extract structured data from websites using natural language prompts. This folder contains two implementations:

  1. 🏠 Local Library - Using the scrapegraphai library (runs locally)
  2. ☁️ Cloud SDK - Using ScrapeGraph AI API (managed service)

📁 What's Inside

🏠 Local Library Version

Files: ai_scrapper.py, local_ai_scrapper.py

Use the open-source ScrapeGraph AI library that runs on your local machine.

Pros:

  • Free to use (no API costs)
  • Full control over execution
  • Privacy-friendly (all data stays local)

Cons:

  • Requires local installation and dependencies
  • Limited by your hardware
  • Need to manage updates

Quick Start:

pip install -r requirements.txt
streamlit run ai_scrapper.py

☁️ Cloud SDK Version

Folder: scrapegraph_ai_sdk/

Use the managed ScrapeGraph AI API with advanced features and no setup required.

Pros:

  • No setup required (just API key)
  • Scalable and fast
  • Advanced features (SmartCrawler, SearchScraper, Markdownify)
  • Always up-to-date

Cons:

  • Pay-per-use (credit-based)
  • Requires internet connection

Quick Start:

cd scrapegraph_ai_sdk/
pip install -r requirements.txt
export SGAI_API_KEY='your-api-key'
python quickstart.py

📖 Full Documentation: See scrapegraph_ai_sdk/README.md


🚀 Getting Started

Local Library Version

  1. Clone the repository
git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
cd awesome-llm-apps/starter_ai_agents/web_scrapping_ai_agent
  2. Install dependencies
pip install -r requirements.txt
  3. Get your OpenAI API Key
  4. Run the Streamlit App
streamlit run ai_scrapper.py
# Or for local models:
streamlit run local_ai_scrapper.py
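
For the local-models variant, scrapegraphai can point at an Ollama server instead of OpenAI. A minimal config sketch (the model name and config keys are assumptions; check local_ai_scrapper.py for the exact settings the app uses):

# Hypothetical scrapegraphai config for a local Ollama model
graph_config = {
    "llm": {
        "model": "ollama/llama3.2",            # any locally pulled Ollama model (assumed name)
        "base_url": "http://localhost:11434",  # default Ollama endpoint
    },
}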

Cloud SDK Version

  1. Navigate to SDK folder
cd scrapegraph_ai_sdk/
  2. Install dependencies
pip install -r requirements.txt
  3. Get your ScrapeGraph AI API Key
  4. Set API key
export SGAI_API_KEY='your-api-key-here'
  5. Run demos
# Quick test
python quickstart.py

# SmartScraper demo
python smart_scraper_demo.py

# Interactive app
streamlit run scrapegraph_app.py

📊 Feature Comparison

Feature        | Local Library         | Cloud SDK
Setup          | Install dependencies  | API key only
Cost           | Free (+ LLM costs)    | Pay-per-use
Processing     | Your hardware         | Cloud-based
Speed          | Depends on hardware   | Fast & optimized
SmartScraper   | ✅                    | ✅
SearchScraper  | ❌                    | ✅
SmartCrawler   | ❌                    | ✅
Markdownify    | ❌                    | ✅
Scheduled Jobs | ❌                    | ✅
Scalability    | Limited               | Unlimited
Maintenance    | Self-managed          | Fully managed

💡 Use Cases

E-commerce Scraping

# Extract product information
prompt = "Extract product names, prices, and availability"

Content Aggregation

# Convert articles to structured data
prompt = "Extract article title, author, date, and main content"

Competitive Intelligence

# Monitor competitor websites
prompt = "Extract pricing, features, and updates"

Lead Generation

# Extract contact information
prompt = "Find company names, emails, and phone numbers"

🔧 How It Works

Local Library

  1. You provide your OpenAI API key
  2. Select the model (GPT-4o, GPT-5, or local models)
  3. Enter the URL and extraction prompt
  4. The app uses ScrapeGraphAI to scrape and extract data locally
  5. Results are displayed in the app

Cloud SDK

  1. You provide your ScrapeGraph AI API key
  2. Choose the scraping method (SmartScraper, SearchScraper, etc.)
  3. Define extraction prompt and optional output schema
  4. API processes the request in the cloud
  5. Structured results are returned
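
Step 3's optional output schema can be expressed as a Pydantic model, which scrapegraph-py accepts via output_schema. A sketch (field names and URL are placeholders):

from pydantic import BaseModel
from scrapegraph_py import Client

# Hypothetical schema describing the shape of the extracted data
class Product(BaseModel):
    name: str
    price: str

class ProductList(BaseModel):
    products: list[Product]

client = Client(api_key="your-api-key")
response = client.smartscraper(
    website_url="https://example.com/shop",  # placeholder
    user_prompt="Extract all products with prices",
    output_schema=ProductList,
)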

🌟 Cloud SDK Features
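
All the snippets below assume a scrapegraph-py client has been created first. A minimal setup sketch (Client.from_env() reading SGAI_API_KEY matches current scrapegraph-py behavior, but verify against your installed version):

from scrapegraph_py import Client

# Reads the SGAI_API_KEY environment variable set earlier
client = Client.from_env()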

🤖 SmartScraper

Extract structured data using natural language:

response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract all products with prices"
)
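
The call returns a plain dict; in current scrapegraph-py releases the extracted data sits under a "result" key (an assumption worth verifying against your SDK version):

# Inspect the extracted data ("result" key assumed)
print(response["result"])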

🔍 SearchScraper

AI-powered web search with structured results:

response = client.searchscraper(
    user_prompt="Find top 5 AI news websites",
    num_results=5
)

📝 Markdownify

Convert webpages to clean markdown:

response = client.markdownify(
    website_url="https://example.com/article"
)
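
The markdown comes back in the same response shape, so it can be written straight to disk (again assuming the "result" key):

# Save the converted markdown ("result" key assumed)
with open("article.md", "w") as f:
    f.write(response["result"])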

🕷️ SmartCrawler

Crawl multiple pages intelligently:

request_id = client.smartcrawler(
    url="https://docs.example.com",
    user_prompt="Extract all API endpoints",
    max_pages=50
)

📖 Documentation

See scrapegraph_ai_sdk/README.md for the full Cloud SDK documentation.

🤝 Which Version Should I Use?

Use Local Library if:

  • You want a free, open-source solution
  • You have good hardware
  • You need full control
  • Privacy is critical

Use Cloud SDK if:

  • You want quick setup
  • You need advanced features
  • You want scalability
  • You prefer a managed service

💡 Pro Tip: Start with the local version to learn, then switch to the Cloud SDK for production!