refactor: Replace Firecrawl function tool with MCPToolset for enhanced web scraping capabilities
advanced_ai_agents/multi_agent_apps/agent_teams/ai_seo_audit_team/README.md
@@ -1,11 +1,11 @@
 # 🔍 AI SEO Audit Team

-The **AI SEO Audit Team** is an autonomous, multi-agent workflow built with Google ADK. It takes a webpage URL, crawls the live page, researches real-time SERP competition, and produces a polished, prioritized SEO optimization report. The app uses Firecrawl for accurate page scraping and Google’s Gemini 2.5 Flash for analysis and reporting.
+The **AI SEO Audit Team** is an autonomous, multi-agent workflow built with Google ADK. It takes a webpage URL, crawls the live page, researches real-time SERP competition, and produces a polished, prioritized SEO optimization report. The app uses **Firecrawl via MCP (Model Context Protocol)** for accurate page scraping and Google's Gemini 2.5 Flash for analysis and reporting.

 ## Features

 - **End-to-End On-Page SEO Evaluation**
-  - Automated crawl of any public URL (Firecrawl)
+  - Automated crawl of any public URL (Firecrawl MCP)
   - Structured audit of titles, headings, content depth, internal/external links, and technical signals
 - **Competitive SERP Intelligence**
   - Google Search research for the inferred primary keyword
@@ -30,23 +30,33 @@ All agents run sequentially using ADK’s `SequentialAgent`, passing state betwe
 ## Requirements

+### System Requirements
+
+- **Python 3.10+** for Google ADK
+- **Node.js** (for Firecrawl MCP server via npx)
+
+### Python Dependencies
+
 Install the Python dependencies:

 ```bash
-pip install -r advanced_ai_agents/multi_agent_apps/agent_teams/ai_seo_audit_team/requirements.txt
+pip install -r requirements.txt
 ```

-You will also need valid API keys:
+### API Keys
+
+You will need valid API keys:

 - `GOOGLE_API_KEY` – Gemini (Google AI Studio) for LLM + Google Search
-- `FIRECRAWL_API_KEY` – Firecrawl scrape endpoint
+- `FIRECRAWL_API_KEY` – Firecrawl MCP server ([get one here](https://firecrawl.dev/app/api-keys))

-Create a local `.env` (same directory as `agent.py`) and populate:
-
-```
-GOOGLE_API_KEY=your_gemini_key
-FIRECRAWL_API_KEY=your_firecrawl_key
-```
+Set your environment variables (e.g., add to your shell profile or `export` in your terminal):
+
+```bash
+export GOOGLE_API_KEY=your_gemini_key
+export FIRECRAWL_API_KEY=your_firecrawl_key
+```
+
+Alternatively, you can put these in a `.env` file if you prefer.

 ## Running the App with ADK Dev UI
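A note on the `.env` option the new README text mentions: the ADK CLI generally picks up a `.env` sitting next to `agent.py`, but a plain `python` invocation of the module does not. A minimal sketch, assuming the `python-dotenv` package (which is not in `requirements.txt`), for loading and validating both keys outside the CLI:

```python
# Minimal sketch (assumes python-dotenv, not listed in requirements.txt):
# load a local .env and fail fast if either key is missing.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

for key in ("GOOGLE_API_KEY", "FIRECRAWL_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"{key} is not set; the SEO audit agents cannot run without it.")
```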
@@ -62,7 +72,7 @@ FIRECRAWL_API_KEY=your_firecrawl_key

 3. **Launch the ADK web UI** from the project root:
    ```bash
-   adk web advanced_ai_agents/multi_agent_apps/agent_teams
+   adk web
    ```

 4. In the UI:
@@ -81,7 +91,8 @@ FIRECRAWL_API_KEY=your_firecrawl_key

 ```
 ai_seo_audit_team/
 ├── agent.py           # Multi-agent workflow definitions
-├── requirements.txt   # Minimal dependencies (google-adk, firecrawl-py, pydantic)
+├── requirements.txt   # Minimal dependencies
+├── __init__.py        # Module initialization
 └── README.md          # You are here
 ```
advanced_ai_agents/multi_agent_apps/agent_teams/ai_seo_audit_team/agent.py
@@ -9,12 +9,12 @@ The workflow runs three specialized agents in sequence:

 from __future__ import annotations

 import os
-from typing import Dict, List, Optional
+from typing import List, Optional

 from pydantic import BaseModel, Field
 from google.adk.agents import LlmAgent, SequentialAgent
-from google.adk.tools import FunctionTool, google_search
+from google.adk.tools import google_search
 from google.adk.tools.agent_tool import AgentTool
-from firecrawl import FirecrawlApp
+from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset, StdioServerParameters


 # =============================================================================
@@ -135,47 +135,21 @@ class OptimizationRecommendation(BaseModel):
 # Tools
 # =============================================================================


-def firecrawl_scrape(url: str) -> Dict[str, object]:
-    """
-    Scrape a target URL with Firecrawl and return structured data for auditing.
-
-    Args:
-        url: Fully-qualified URL to crawl.
-
-    Returns:
-        Dictionary payload from Firecrawl that includes markdown, html, and link metadata.
-    """
-    api_key = os.getenv("FIRECRAWL_API_KEY")
-    if not api_key:
-        raise RuntimeError(
-            "FIRECRAWL_API_KEY environment variable is not set. "
-            "Provide a valid Firecrawl API key to enable crawling."
-        )
-
-    app = FirecrawlApp(api_key=api_key)
-    try:
-        document = app.scrape(
-            url=url,
-            formats=[
-                "markdown",
-                "html",
-                "links",
-            ],
-            only_main_content=True,
-            timeout=90000,
-            block_ads=True,
-        )
-    except Exception as exc:  # pragma: no cover - tool errors pass to the agent
-        raise RuntimeError(f"Firecrawl scrape failed: {exc}") from exc
-
-    payload = document.model_dump(exclude_none=True)
-    if not payload.get("markdown") and not payload.get("html"):
-        raise RuntimeError("Firecrawl scrape completed but returned no page content.")
-    return payload
-
-
-firecrawl_tool = FunctionTool(firecrawl_scrape)
+# Firecrawl MCP Toolset - connects to Firecrawl's MCP server for advanced web scraping
+firecrawl_toolset = MCPToolset(
+    connection_params=StdioServerParameters(
+        command='npx',
+        args=[
+            "-y",  # Auto-confirm npm package installation
+            "firecrawl-mcp",  # The Firecrawl MCP server package
+        ],
+        env={
+            "FIRECRAWL_API_KEY": os.getenv("FIRECRAWL_API_KEY", "")
+        }
+    ),
+    # Filter to use only the scrape tool for this agent
+    tool_filter=['firecrawl_scrape']
+)


 # =============================================================================
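The `MCPToolset` above spawns `npx -y firecrawl-mcp` as a stdio subprocess, so Node.js must be on the PATH and the API key must reach the child process. A quick standalone sanity check (not part of the commit), assuming the `mcp` Python SDK that `google-adk` itself builds on, would list the tools the server exposes and confirm `firecrawl_scrape` is among them:

```python
# Standalone sanity check for the Firecrawl MCP server (illustrative, not repo code).
# Assumes the `mcp` Python SDK; `firecrawl_scrape` should appear in the tool list.
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

params = StdioServerParameters(
    command="npx",
    args=["-y", "firecrawl-mcp"],
    # Pass the full environment so npx can resolve PATH, plus the API key.
    env={**os.environ, "FIRECRAWL_API_KEY": os.getenv("FIRECRAWL_API_KEY", "")},
)

async def main() -> None:
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```

If `firecrawl_scrape` is missing from the printed list, the `tool_filter` above would leave the auditor agent with no tools at all.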
@@ -208,14 +182,17 @@ page_auditor_agent = LlmAgent(
     description=(
         "Scrapes the target URL, performs a structural on-page SEO audit, and extracts keyword signals."
     ),
-    instruction="""You are Agent 1 in a sequential SEO workflow.
-- Extract the URL from the latest user message. If no valid URL is provided, ask the user for one and stop.
+    instruction="""You are Agent 1 in a sequential SEO workflow. Your role is to gather data silently for the next agents.
+- Extract the URL from the latest user message. The user MUST provide a valid URL.
 - Call the `firecrawl_scrape` tool exactly once to gather page content, metadata, and links.
+  * Use these parameters: {"url": "target_url", "formats": ["markdown", "html", "links"], "onlyMainContent": true}
 - Audit the page structure: title tag, meta description, headings hierarchy, word count, link health, and technical flags.
 - Infer the dominant search intent and identify the primary and secondary keyword targets based on page content.
 - Populate every field in the PageAuditOutput schema and store the result in `state['page_audit']`.
-- Output must be valid JSON only, with no extra commentary. Every string field needs meaningful text (use clear fallbacks like "Not available" if necessary). Keep numeric fields as integers and lists as arrays (use [] when empty).""",
-    tools=[firecrawl_tool],
+- Output must be valid JSON only, with no extra commentary. Every string field needs meaningful text (use clear fallbacks like "Not available" if necessary). Keep numeric fields as integers and lists as arrays (use [] when empty).
+- If the scrape fails or returns no content, still return valid JSON with fallback values like "Error: Unable to scrape page" for string fields.
+- IMPORTANT: Do not include any text before or after the JSON. Just output the raw JSON structure.""",
+    tools=[firecrawl_toolset],
     output_schema=PageAuditOutput,
     output_key="page_audit",
 )
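`PageAuditOutput` is defined earlier in `agent.py` and is untouched by this diff. The instruction's JSON rules (meaningful strings, integer counts, `[]` for empty lists) imply a Pydantic model roughly of this shape; the field names below are hypothetical, not the repo's actual schema:

```python
# Hypothetical sketch of a PageAuditOutput-style schema; the real model is
# defined earlier in agent.py and is not shown in this diff.
from typing import List

from pydantic import BaseModel, Field

class PageAuditOutputSketch(BaseModel):
    title_tag: str = Field(description="Title text, or a fallback like 'Not available'")
    meta_description: str = Field(description="Meta description, or a clear fallback")
    word_count: int = Field(description="Approximate main-content word count")
    target_keywords: List[str] = Field(default_factory=list, description="Primary keyword first")
    technical_flags: List[str] = Field(default_factory=list, description="e.g. missing H1, thin content")
```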
@@ -227,12 +204,13 @@ serp_analyst_agent = LlmAgent(
     description=(
         "Researches the live SERP for the discovered primary keyword and summarizes the competitive landscape."
     ),
-    instruction="""You are Agent 2 in the workflow.
+    instruction="""You are Agent 2 in the workflow. Your role is to silently gather SERP data for the final report agent.
 - Read the keyword data from `state['page_audit']['target_keywords']`.
 - For the primary keyword, call the `perform_google_search` tool with arguments `{"request": "<primary keyword>"}` to fetch the top organic results (request 10 results).
 - Summarize each result with rank, title, URL, snippet, and content type.
 - Highlight common title patterns, dominant content formats, People Also Ask questions, recurring themes, and opportunities to differentiate the page.
-- Populate the SerpAnalysis schema, store it in `state['serp_analysis']`, and return strict JSON only. Ensure `primary_keyword` is a non-empty string (use a clear fallback if the search fails) and keep every list field as an array (return [] when empty).""",
+- Populate the SerpAnalysis schema, store it in `state['serp_analysis']`, and return strict JSON only. Ensure `primary_keyword` is a non-empty string (use a clear fallback if the search fails) and keep every list field as an array (return [] when empty).
+- IMPORTANT: Do not include any text before or after the JSON. Just output the raw JSON structure.""",
     tools=[google_search_tool],
     output_schema=SerpAnalysis,
     output_key="serp_analysis",
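`google_search_tool` is referenced here but defined outside the hunk. Given the imports of `google_search` and `AgentTool`, it is plausibly the common ADK pattern of wrapping the built-in search in a dedicated agent exposed as a tool, since Gemini's built-in tools cannot be freely mixed with structured-output agents. A sketch under that assumption, not the repo's exact code:

```python
# Plausible definition of google_search_tool (not shown in this diff): a
# search-only agent wrapped as an AgentTool, named to match the
# `perform_google_search` tool the instruction above refers to.
from google.adk.agents import LlmAgent
from google.adk.tools import google_search
from google.adk.tools.agent_tool import AgentTool

search_agent = LlmAgent(
    name="perform_google_search",
    model="gemini-2.5-flash",
    instruction="Run a Google search for the requested query and summarize the top organic results.",
    tools=[google_search],
)

google_search_tool = AgentTool(agent=search_agent)
```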
@@ -243,16 +221,17 @@ optimization_advisor_agent = LlmAgent(
     name="OptimizationAdvisorAgent",
     model="gemini-2.5-flash",
     description="Synthesizes the audit and SERP findings into a prioritized optimization roadmap.",
-    instruction="""You are Agent 3 and the final expert in the workflow.
+    instruction="""You are Agent 3 and the final expert in the workflow. You create the user-facing report.
 - Review `state['page_audit']` and `state['serp_analysis']` to understand the current page and competitive landscape.
-- Produce a polished Markdown report for the user that includes:
+- Produce a polished, well-formatted Markdown report that includes:
   * Executive summary
   * Key audit findings (technical + content + keyword highlights)
   * Prioritized action list grouped by priority level (P0/P1/P2) with rationale and expected impact
   * Keyword strategy and SERP insights
   * Measurement / next-step suggestions
 - Reference concrete data points from the earlier agents. If some data is missing, acknowledge it directly rather than fabricating.
-- Return Markdown only—no JSON.""",
+- Return ONLY the Markdown report—no JSON, no preamble, no explanatory text. Start directly with "# SEO Audit Report" as the first line.
+- Make the report professional, actionable, and ready to share with stakeholders.""",
 )
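For context, the sequential wiring itself is unchanged by this commit. Based on the README's description ("All agents run sequentially using ADK's `SequentialAgent`") and the `output_key` values above, the root agent is plausibly composed like this sketch; the exact wiring sits below these hunks in `agent.py`:

```python
# Sketch of the sequential composition implied by the README and the
# output_key values above; `root_agent` is the conventional name the
# ADK Dev UI discovers, but this is not shown in the diff.
from google.adk.agents import SequentialAgent

root_agent = SequentialAgent(
    name="seo_audit_team",
    sub_agents=[
        page_auditor_agent,          # writes state['page_audit']
        serp_analyst_agent,          # writes state['serp_analysis']
        optimization_advisor_agent,  # emits the final Markdown report
    ],
)
```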
advanced_ai_agents/multi_agent_apps/agent_teams/ai_seo_audit_team/requirements.txt
@@ -1,3 +1,2 @@
 google-adk
-firecrawl-py
-pydantic>=2.7.0
+pydantic>=2.7.0