# 🕷️ Web Scraping AI Agent

## 🎓 FREE Step-by-Step Tutorial

👉 Follow our complete step-by-step tutorial to learn how to build this from scratch, with detailed code walkthroughs, explanations, and best practices.
AI-powered web scraping using ScrapeGraph AI - extract structured data from websites using natural language prompts. This folder contains two implementations:

- 🏠 **Local Library** - Using the open-source `scrapegraphai` library (runs locally)
- ☁️ **Cloud SDK** - Using the ScrapeGraph AI API (managed service)
## 📁 What's Inside

### 🏠 Local Library Version

**Files:** `ai_scrapper.py`, `local_ai_scrapper.py`
Use the open-source ScrapeGraph AI library that runs on your local machine.
✅ Pros:
- Free to use (no API costs)
- Full control over execution
- Privacy-friendly (all data stays local)
❌ Cons:
- Requires local installation and dependencies
- Limited by your hardware
- Need to manage updates
**Quick Start:**

```bash
pip install -r requirements.txt
streamlit run ai_scrapper.py
```
### ☁️ Cloud SDK Version

**Folder:** `scrapegraph_ai_sdk/`
Use the managed ScrapeGraph AI API with advanced features and no setup required.
✅ Pros:
- No setup required (just API key)
- Scalable and fast
- Advanced features (SmartCrawler, SearchScraper, Markdownify)
- Always up-to-date
❌ Cons:
- Pay-per-use (credit-based)
- Requires internet connection
**Quick Start:**

```bash
cd scrapegraph_ai_sdk/
pip install -r requirements.txt
export SGAI_API_KEY='your-api-key'
python quickstart.py
```
📖 **Full Documentation:** See `scrapegraph_ai_sdk/README.md`
## 🚀 Getting Started

### Local Library Version

1. **Clone the repository**

   ```bash
   git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
   cd awesome-llm-apps/starter_ai_agents/web_scrapping_ai_agent
   ```

2. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

3. **Get your OpenAI API Key**
   - Sign up for an OpenAI account
   - Obtain your API key

4. **Run the Streamlit App**

   ```bash
   streamlit run ai_scrapper.py
   # Or for local models:
   streamlit run local_ai_scrapper.py
   ```
### Cloud SDK Version

1. **Navigate to the SDK folder**

   ```bash
   cd scrapegraph_ai_sdk/
   ```

2. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

3. **Get your ScrapeGraph AI API Key**
   - Sign up at scrapegraphai.com
   - Get your API key

4. **Set the API key**

   ```bash
   export SGAI_API_KEY='your-api-key-here'
   ```

5. **Run the demos**

   ```bash
   # Quick test
   python quickstart.py

   # SmartScraper demo
   python smart_scraper_demo.py

   # Interactive app
   streamlit run scrapegraph_app.py
   ```
## 📊 Feature Comparison
| Feature | Local Library | Cloud SDK |
|---|---|---|
| Setup | Install dependencies | API key only |
| Cost | Free (+ LLM costs) | Pay-per-use |
| Processing | Your hardware | Cloud-based |
| Speed | Depends on hardware | Fast & optimized |
| SmartScraper | ✅ | ✅ |
| SearchScraper | ❌ | ✅ |
| SmartCrawler | ❌ | ✅ |
| Markdownify | ❌ | ✅ |
| Scheduled Jobs | ❌ | ✅ |
| Scalability | Limited | Unlimited |
| Maintenance | Self-managed | Fully managed |
## 💡 Use Cases

### E-commerce Scraping

```python
# Extract product information
prompt = "Extract product names, prices, and availability"
```

### Content Aggregation

```python
# Convert articles to structured data
prompt = "Extract article title, author, date, and main content"
```

### Competitive Intelligence

```python
# Monitor competitor websites
prompt = "Extract pricing, features, and updates"
```

### Lead Generation

```python
# Extract contact information
prompt = "Find company names, emails, and phone numbers"
```
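Each of these prompts is passed to the scraper together with a target URL. As a minimal, hypothetical sketch (the `build_jobs` helper and the example URLs are illustrations, not part of this repo), you can pair several targets with their prompts up front:

```python
# Illustrative helper (not part of this repo): pair target URLs with
# extraction prompts so they can be fed to either implementation.
def build_jobs(targets):
    """Turn a {url: prompt} mapping into a list of scraping-job dicts."""
    return [
        {"website_url": url, "user_prompt": prompt}
        for url, prompt in targets.items()
    ]

jobs = build_jobs({
    "https://example.com/products": "Extract product names, prices, and availability",
    "https://example.com/blog": "Extract article title, author, date, and main content",
})
# Each job dict uses the same keyword names as the Cloud SDK's smartscraper call.
```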
## 🔧 How It Works

### Local Library

1. You provide your OpenAI API key
2. Select the model (GPT-4o, GPT-5, or local models)
3. Enter the URL and extraction prompt
4. The app uses ScrapeGraphAI to scrape and extract data locally
5. Results are displayed in the app
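Under the hood, this local flow boils down to a few lines of `scrapegraphai`. A minimal sketch (the model name and URL are placeholders; actually running the scrape requires `pip install scrapegraphai` and a valid key):

```python
import os

# Build the LLM configuration the app assembles from your inputs.
graph_config = {
    "llm": {
        "api_key": os.environ.get("OPENAI_API_KEY", "your-openai-key"),
        "model": "openai/gpt-4o",  # or a local model for local_ai_scrapper.py
    },
}

# The scrape itself (requires the scrapegraphai package and a real key):
# from scrapegraphai.graphs import SmartScraperGraph
# graph = SmartScraperGraph(
#     prompt="Extract product names and prices",
#     source="https://example.com",
#     config=graph_config,
# )
# result = graph.run()  # structured dict displayed in the Streamlit app
```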
### Cloud SDK

1. You provide your ScrapeGraph AI API key
2. Choose the scraping method (SmartScraper, SearchScraper, etc.)
3. Define the extraction prompt and an optional output schema
4. The API processes the request in the cloud
5. Structured results are returned
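The Cloud SDK snippets below assume a `client` object has already been created. A small sketch of that setup (the environment-variable check is an illustrative helper, not part of the SDK):

```python
import os

def get_sgai_key():
    """Read the ScrapeGraph AI API key from the environment."""
    key = os.environ.get("SGAI_API_KEY")
    if not key:
        raise RuntimeError("Set SGAI_API_KEY before using the Cloud SDK")
    return key

# With the key in place, the client is one line
# (requires `pip install scrapegraph-py`):
# from scrapegraph_py import Client
# client = Client(api_key=get_sgai_key())
```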
## 🌟 Cloud SDK Features

### 🤖 SmartScraper

Extract structured data using natural language:

```python
response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract all products with prices"
)
```
### 🔍 SearchScraper

AI-powered web search with structured results:

```python
response = client.searchscraper(
    user_prompt="Find top 5 AI news websites",
    num_results=5
)
```
### 📝 Markdownify

Convert webpages to clean markdown:

```python
response = client.markdownify(
    website_url="https://example.com/article"
)
```
### 🕷️ SmartCrawler

Crawl multiple pages intelligently:

```python
request_id = client.smartcrawler(
    url="https://docs.example.com",
    user_prompt="Extract all API endpoints",
    max_pages=50
)
```
## 📖 Documentation

- **Local Library:** ScrapeGraphAI GitHub
- **Cloud SDK:** See `scrapegraph_ai_sdk/README.md`
- **API Docs:** https://docs.scrapegraphai.com
## 🤝 Which Version Should I Use?

**Use the Local Library if:**
- ✅ You want a free, open-source solution
- ✅ You have capable hardware
- ✅ You need full control
- ✅ Privacy is critical

**Use the Cloud SDK if:**
- ✅ You want quick setup
- ✅ You need advanced features
- ✅ You want scalability
- ✅ You prefer a managed service

💡 **Pro Tip:** Start with the local version to learn, then switch to the Cloud SDK for production!