Merge pull request #407 from bibinprathap/knowledge-graph-rag

Add Knowledge Graph RAG with Verifiable Citations example
Shubham Saboo
2026-02-16 22:48:55 -08:00
committed by GitHub
6 changed files with 766 additions and 0 deletions


@@ -181,6 +181,7 @@ A curated collection of **Awesome LLM apps built with RAG, AI Agents, Multi-agen
* [⛓️ Basic RAG Chain](rag_tutorials/rag_chain/)
* [📠 RAG with Database Routing](rag_tutorials/rag_database_routing/)
* [🖼️ Vision RAG](rag_tutorials/vision_rag/)
* [🕸️ Knowledge Graph RAG with Citations](rag_tutorials/knowledge_graph_rag_citations/)
### 💾 LLM Apps with Memory Tutorials


@@ -0,0 +1,12 @@
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY knowledge_graph_rag.py .
EXPOSE 8501
CMD ["streamlit", "run", "knowledge_graph_rag.py", "--server.address", "0.0.0.0"]


@@ -0,0 +1,171 @@
# 🔍 Knowledge Graph RAG with Verifiable Citations
A Streamlit application demonstrating how **Knowledge Graph-based Retrieval-Augmented Generation (RAG)** provides multi-hop reasoning with fully verifiable source attribution.
## 🎯 What Makes This Different?
Traditional vector-based RAG finds similar text chunks, but struggles with:
- Questions requiring information from multiple documents
- Complex reasoning chains
- Providing verifiable sources for each claim
**Knowledge Graph RAG** solves these by:
1. **Building a structured graph** of entities and relationships from documents
2. **Traversing connections** to find related information (multi-hop reasoning)
3. **Tracking provenance** so every claim links back to its source
## ✨ Features
| Feature | Description |
|---------|-------------|
| 🔗 **Multi-hop Reasoning** | Traverse entity relationships to answer complex questions |
| 📚 **Verifiable Citations** | Every claim includes source document and text |
| 🧠 **Reasoning Trace** | See exactly how the answer was derived |
| 🏠 **Fully Local** | Uses Ollama for LLM, Neo4j for graph storage |
## 🚀 Quick Start
### Prerequisites
1. **Ollama** - Local LLM inference
```bash
# Install from https://ollama.ai
ollama pull llama3.2
```
2. **Neo4j** - Knowledge graph database
```bash
# Using Docker
docker run -d \
--name neo4j \
-p 7474:7474 -p 7687:7687 \
-e NEO4J_AUTH=neo4j/password \
neo4j:latest
```
### Installation
```bash
# From the repository root, navigate to the example
cd rag_tutorials/knowledge_graph_rag_citations
# Install dependencies
pip install -r requirements.txt
# Run the app
streamlit run knowledge_graph_rag.py
```
## 📖 How It Works
### Step 1: Document → Knowledge Graph
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│    Document     │ ──► │  LLM Extraction  │ ──► │ Knowledge Graph │
│   (Text/PDF)    │     │ (Entities+Rels)  │     │    (Neo4j)      │
└─────────────────┘     └──────────────────┘     └─────────────────┘
```
The LLM extracts:
- **Entities**: People, organizations, concepts, technologies
- **Relationships**: How entities connect (e.g., "works_for", "created", "uses")
- **Provenance**: Source document and chunk for each extraction
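As a rough sketch of the provenance idea, the snippet below parses a JSON payload shaped like the one the extraction prompt requests and stamps every entity with its source document and chunk. `ExtractedEntity` and `parse_extraction` are illustrative names chosen for this sketch, not part of the app.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ExtractedEntity:
    name: str
    entity_type: str
    source_doc: str    # provenance: which document the entity came from
    source_chunk: str  # provenance: the text span it was extracted from


def parse_extraction(raw_json: str, source_doc: str, chunk: str) -> List[ExtractedEntity]:
    """Attach the source document and chunk to every entity in the LLM output."""
    data = json.loads(raw_json)
    return [
        ExtractedEntity(e["name"], e.get("type", "CONCEPT"), source_doc, chunk)
        for e in data.get("entities", [])
        if e.get("name")  # skip malformed entries that lack a name
    ]


# Hypothetical LLM output: one valid entity and one malformed entry.
raw = '{"entities": [{"name": "GraphRAG", "type": "TECHNOLOGY"}, {"type": "PERSON"}]}'
entities = parse_extraction(raw, "AI Research Paper", "GraphRAG is a technique...")
```

Because the provenance fields travel with each record, any later answer that cites the entity can point back to the exact document and chunk.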
### Step 2: Query → Multi-hop Traversal
```
┌─────────┐     ┌─────────────┐     ┌─────────────┐     ┌───────────┐
│  Query  │ ──► │ Find Start  │ ──► │  Traverse   │ ──► │  Context  │
│         │     │  Entities   │     │  Relations  │     │ + Sources │
└─────────┘     └─────────────┘     └─────────────┘     └───────────┘
```
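The traversal step can be pictured with a small in-memory analogue. This is only a sketch: the app delegates traversal to a Cypher variable-length path query in Neo4j, whereas the hypothetical `neighbors_within` helper below runs a breadth-first search over a plain adjacency dictionary built from the "Company Report" sample text.

```python
from collections import deque
from typing import Dict, List


def neighbors_within(graph: Dict[str, List[str]], start: str, max_hops: int) -> Dict[str, int]:
    """Breadth-first expansion: map each reachable entity to its hop distance."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, []):
            if nxt not in distances:
                distances[nxt] = distances[node] + 1
                queue.append(nxt)
    del distances[start]  # report only the related entities
    return distances


# Toy undirected graph (each edge listed in both directions).
graph = {
    "Acme Corp": ["Jane Smith", "John Doe"],
    "Jane Smith": ["Acme Corp", "Google"],
    "John Doe": ["Acme Corp", "StartupX"],
    "Google": ["Jane Smith"],
    "StartupX": ["John Doe", "Microsoft"],
    "Microsoft": ["StartupX"],
}
related = neighbors_within(graph, "Acme Corp", max_hops=2)
```

With a limit of two hops, "Microsoft" (three hops from "Acme Corp") stays out of the context, which is the trade-off the hop count controls: more hops reach more facts but add more noise.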
### Step 3: Answer → Verified Citations
```
┌─────────────┐     ┌─────────────┐     ┌──────────────────┐
│   Context   │ ──► │  Generate   │ ──► │ Answer with      │
│  + Sources  │     │   Answer    │     │ [1][2] Citations │
└─────────────┘     └─────────────┘     └──────────────────┘
                                        ┌──────────────────┐
                                        │ Citation Details │
                                        │ • Source Doc     │
                                        │ • Source Text    │
                                        │ • Reasoning Path │
                                        └──────────────────┘
```
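A minimal sketch of the citation check, assuming the numbered context entries are kept in a dictionary keyed by citation number. The `resolve_citations` helper is a hypothetical name; it only confirms that each `[N]` marker maps to a known source, not that the source actually supports the claim.

```python
import re
from typing import Dict, List, Tuple


def resolve_citations(answer: str, sources: Dict[int, str]) -> Tuple[List[int], List[int]]:
    """Split the [N] markers in an answer into resolvable and dangling ones."""
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    verified = [n for n in cited if n in sources]
    dangling = [n for n in cited if n not in sources]
    return verified, dangling


sources = {1: "AI Research Paper", 2: "AI Research Paper"}
answer = "GraphRAG was developed at Microsoft Research [1], led by Darren Edge [2][7]."
verified, dangling = resolve_citations(answer, sources)
```

Surfacing the dangling markers separately makes it visible when the model cites a source number that was never in its context.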
## 🖥️ Usage Example
### 1. Add a Document
Paste or select a sample document. The system extracts entities and relationships:
```
Document: "GraphRAG was developed by Microsoft Research.
Darren Edge led the project..."
Extracted:
├── Entity: GraphRAG (TECHNOLOGY)
├── Entity: Microsoft Research (ORGANIZATION)
├── Entity: Darren Edge (PERSON)
└── Relationship: Darren Edge --[WORKS_FOR]--> Microsoft Research
```
### 2. Ask a Question
```
Question: "Who developed GraphRAG and what organization are they from?"
```
### 3. Get Verified Answer
```
Answer: GraphRAG was developed by researchers at Microsoft Research [1],
with Darren Edge leading the project [2].
Citations:
[1] Source: AI Research Paper
Text: "GraphRAG is a technique developed by Microsoft Research..."
[2] Source: AI Research Paper
Text: "...introduced by researchers including Darren Edge..."
```
## 🔧 Configuration
| Setting | Default | Description |
|---------|---------|-------------|
| Neo4j URI | `bolt://localhost:7687` | Neo4j connection string |
| Neo4j User | `neo4j` | Database username |
| Neo4j Password | `password` | Database password (matches the `NEO4J_AUTH` value in the Docker example) |
| LLM Model | `llama3.2` | Ollama model for extraction/generation |
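One way these settings could be resolved outside the sidebar is from environment variables with the defaults above as fallbacks. This is a sketch under assumptions: the variable names follow the Docker Compose file in this example, while `Settings` and `load_settings` are hypothetical helpers.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str
    ollama_host: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read connection settings from the environment, falling back to local defaults."""
    return Settings(
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
        neo4j_user=env.get("NEO4J_USER", "neo4j"),
        neo4j_password=env.get("NEO4J_PASSWORD", "password"),
        ollama_host=env.get("OLLAMA_HOST", "http://localhost:11434"),
    )


local = load_settings({})  # nothing set: local defaults
in_compose = load_settings({"NEO4J_URI": "bolt://neo4j:7687"})  # container hostname
```

Passing the mapping explicitly keeps the function easy to exercise without touching the real process environment.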
## 🏗️ Architecture
```
knowledge_graph_rag_citations/
├── knowledge_graph_rag.py # Main Streamlit application
├── requirements.txt # Python dependencies
└── README.md # This file
```
### Key Components
- **`KnowledgeGraphManager`**: Neo4j interface for graph operations
- **`extract_entities_with_llm()`**: LLM-based entity/relationship extraction
- **`generate_answer_with_citations()`**: Multi-hop RAG with provenance tracking
## 🎓 Learn More
This example is inspired by [VeritasGraph](https://github.com/bibinprathap/VeritasGraph), an enterprise-grade framework for:
- On-premise knowledge graph RAG
- Visual reasoning traces (Veritas-Scope)
- LoRA-tuned LLM integration
## 📝 License
MIT License


@@ -0,0 +1,58 @@
version: '3.8'

services:
  neo4j:
    image: neo4j:latest
    container_name: kg-rag-neo4j
    ports:
      - "7474:7474"  # Browser
      - "7687:7687"  # Bolt
    environment:
      - NEO4J_AUTH=neo4j/password
      # No shell-style quotes around the value: in list syntax they would
      # be passed to the container as part of the value.
      - NEO4J_PLUGINS=["apoc"]
    volumes:
      - neo4j_data:/data
    healthcheck:
      test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:7474 || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 5

  # Ollama container (optional - can use local installation instead)
  ollama:
    image: ollama/ollama:latest
    container_name: kg-rag-ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # Pull the model on startup
    entrypoint: ["/bin/sh", "-c"]
    command:
      - |
        /bin/ollama serve &
        sleep 5
        /bin/ollama pull llama3.2
        wait

  streamlit:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: kg-rag-app
    ports:
      - "8501:8501"
    environment:
      - NEO4J_URI=bolt://neo4j:7687
      - NEO4J_USER=neo4j
      - NEO4J_PASSWORD=password
      - OLLAMA_HOST=http://ollama:11434
    depends_on:
      neo4j:
        condition: service_healthy
      ollama:
        condition: service_started

volumes:
  neo4j_data:
  ollama_data:


@@ -0,0 +1,521 @@
"""
Knowledge Graph RAG with Verifiable Citations
A Streamlit app demonstrating how Knowledge Graph-based RAG provides:
1. Multi-hop reasoning across documents
2. Verifiable source attribution for every claim
3. Transparent reasoning traces
This example uses Ollama for local LLM inference and Neo4j for the knowledge graph.
"""
import streamlit as st
import ollama
from ollama import Client as OllamaClient
from neo4j import GraphDatabase
from typing import List, Dict, Tuple
import re
import os
from dataclasses import dataclass
import json
import hashlib
# Configure Ollama host from environment (for Docker)
OLLAMA_HOST = os.environ.get('OLLAMA_HOST', 'http://localhost:11434')
ollama_client = OllamaClient(host=OLLAMA_HOST)
# ============================================================================
# Data Models
# ============================================================================
@dataclass
class Entity:
    """Represents an entity extracted from documents."""
    id: str
    name: str
    entity_type: str
    description: str
    source_doc: str
    source_chunk: str


@dataclass
class Relationship:
    """Represents a relationship between entities."""
    source: str
    target: str
    relation_type: str
    description: str
    source_doc: str


@dataclass
class Citation:
    """Represents a verifiable citation for a claim."""
    claim: str
    source_document: str
    source_text: str
    confidence: float
    reasoning_path: List[str]


@dataclass
class AnswerWithCitations:
    """Final answer with full attribution."""
    answer: str
    citations: List[Citation]
    reasoning_trace: List[str]
# ============================================================================
# Knowledge Graph Manager
# ============================================================================
class KnowledgeGraphManager:
    """Manages the Neo4j knowledge graph for RAG."""

    def __init__(self, uri: str, user: str, password: str):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def close(self):
        self.driver.close()

    def clear_graph(self):
        """Clear all nodes and relationships."""
        with self.driver.session() as session:
            session.run("MATCH (n) DETACH DELETE n")

    def add_entity(self, entity: Entity):
        """Add an entity to the knowledge graph."""
        with self.driver.session() as session:
            session.run(
                """
                MERGE (e:Entity {id: $id})
                SET e.name = $name,
                    e.type = $entity_type,
                    e.description = $description,
                    e.source_doc = $source_doc,
                    e.source_chunk = $source_chunk
                """,
                id=entity.id,
                name=entity.name,
                entity_type=entity.entity_type,
                description=entity.description,
                source_doc=entity.source_doc,
                source_chunk=entity.source_chunk
            )

    def add_relationship(self, rel: Relationship):
        """Add a relationship between entities."""
        with self.driver.session() as session:
            session.run(
                """
                MATCH (a:Entity {name: $source})
                MATCH (b:Entity {name: $target})
                MERGE (a)-[r:RELATES_TO {type: $rel_type}]->(b)
                SET r.description = $description,
                    r.source_doc = $source_doc
                """,
                source=rel.source,
                target=rel.target,
                rel_type=rel.relation_type,
                description=rel.description,
                source_doc=rel.source_doc
            )

    def find_related_entities(self, entity_name: str, hops: int = 2) -> List[Dict]:
        """Find entities related within N hops, with full provenance."""
        # The hop bound is interpolated into the query text, so coerce it to
        # an int first to keep arbitrary strings out of the Cypher.
        hops = int(hops)
        with self.driver.session() as session:
            result = session.run(
                f"""
                MATCH path = (start:Entity)-[*1..{hops}]-(related:Entity)
                WHERE toLower(start.name) CONTAINS toLower($name)
                   OR toLower(start.description) CONTAINS toLower($name)
                RETURN related.name as name,
                       related.description as description,
                       related.source_doc as source,
                       related.source_chunk as chunk,
                       [r in relationships(path) | r.description] as path_descriptions
                LIMIT 20
                """,
                name=entity_name
            )
            return [dict(record) for record in result]

    def semantic_search(self, query: str) -> List[Dict]:
        """Search for relevant entities based on query."""
        # Simple keyword matching (in production, use vector embeddings).
        # A full question rarely appears verbatim in an entity name, so match
        # on its individual words, case-insensitively. Dropping words of three
        # letters or fewer is a crude stand-in for stop-word removal.
        terms = [t.lower() for t in re.findall(r"\w+", query) if len(t) > 3]
        if not terms:
            return []
        with self.driver.session() as session:
            # Parameters go in a dict: a keyword argument named `query` would
            # collide with the first parameter of session.run() itself.
            result = session.run(
                """
                MATCH (e:Entity)
                WHERE any(term IN $terms WHERE
                          toLower(e.name) CONTAINS term
                          OR toLower(e.description) CONTAINS term)
                RETURN e.name as name,
                       e.description as description,
                       e.source_doc as source,
                       e.source_chunk as chunk,
                       e.type as type
                LIMIT 10
                """,
                {"terms": terms}
            )
            return [dict(record) for record in result]
# ============================================================================
# LLM-based Entity Extraction
# ============================================================================
def extract_entities_with_llm(text: str, source_doc: str, model: str = "llama3.2") -> Tuple[List[Entity], List[Relationship]]:
    """Use LLM to extract entities and relationships from text."""
    extraction_prompt = f"""Analyze the following text and extract:
1. KEY ENTITIES (people, organizations, concepts, technologies, events)
2. RELATIONSHIPS between these entities
For each entity, provide:
- name: The entity name
- type: Category (PERSON, ORGANIZATION, CONCEPT, TECHNOLOGY, EVENT, LOCATION)
- description: Brief description based on the text
For each relationship, provide:
- source: Source entity name
- target: Target entity name
- type: Relationship type (e.g., WORKS_FOR, CREATED, USES, LOCATED_IN)
- description: Description of how they relate
TEXT:
{text}
Respond in JSON format:
{{
"entities": [
{{"name": "...", "type": "...", "description": "..."}}
],
"relationships": [
{{"source": "...", "target": "...", "type": "...", "description": "..."}}
]
}}
"""
    try:
        response = ollama_client.chat(
            model=model,
            messages=[{"role": "user", "content": extraction_prompt}],
            format="json"
        )
        data = json.loads(response['message']['content'])

        entities = []
        for e in data.get('entities', []):
            entity_id = hashlib.md5(f"{e['name']}_{source_doc}".encode()).hexdigest()[:12]
            entities.append(Entity(
                id=entity_id,
                name=e['name'],
                entity_type=e['type'],
                description=e['description'],
                source_doc=source_doc,
                source_chunk=text[:200] + "..."
            ))

        relationships = []
        for r in data.get('relationships', []):
            relationships.append(Relationship(
                source=r['source'],
                target=r['target'],
                relation_type=r['type'],
                description=r['description'],
                source_doc=source_doc
            ))

        return entities, relationships
    except Exception as e:
        st.warning(f"Entity extraction error: {e}")
        return [], []
# ============================================================================
# Multi-hop RAG with Citations
# ============================================================================
def generate_answer_with_citations(
    query: str,
    graph: KnowledgeGraphManager,
    model: str = "llama3.2"
) -> AnswerWithCitations:
    """
    Generate an answer using multi-hop graph traversal with full citations.
    This is the core differentiator: every claim is traced back to source documents.
    """
    reasoning_trace = []
    citations = []

    # Step 1: Initial semantic search
    reasoning_trace.append(f"🔍 Searching knowledge graph for: '{query}'")
    initial_results = graph.semantic_search(query)
    if not initial_results:
        return AnswerWithCitations(
            answer="I couldn't find relevant information in the knowledge graph.",
            citations=[],
            reasoning_trace=reasoning_trace
        )
    reasoning_trace.append(f"📊 Found {len(initial_results)} initial entities")

    # Step 2: Multi-hop expansion
    all_context = []
    for entity in initial_results[:3]:
        reasoning_trace.append(f"🔗 Expanding from entity: {entity['name']}")
        related = graph.find_related_entities(entity['name'], hops=2)
        for rel in related:
            all_context.append({
                "entity": rel['name'],
                "description": rel['description'],
                "source": rel['source'],
                "chunk": rel['chunk'],
                "path": rel.get('path_descriptions', [])
            })
            reasoning_trace.append(f" → Found related: {rel['name']}")

    # Step 3: Build context with source tracking
    context_parts = []
    source_map = {}
    for i, ctx in enumerate(all_context):
        source_key = f"[{i+1}]"
        context_parts.append(f"{source_key} {ctx['entity']}: {ctx['description']}")
        source_map[source_key] = {
            "document": ctx['source'],
            "text": ctx['chunk'],
            "entity": ctx['entity']
        }
    context_text = "\n".join(context_parts)
    reasoning_trace.append(f"📝 Built context from {len(context_parts)} sources")

    # Step 4: Generate answer with citation requirements
    answer_prompt = f"""Based on the following knowledge graph context, answer the question.
IMPORTANT: For each claim you make, cite the source using [N] notation.
CONTEXT:
{context_text}
QUESTION: {query}
Provide a comprehensive answer with inline citations [1], [2], etc. for each claim.
"""
    try:
        response = ollama_client.chat(
            model=model,
            messages=[{"role": "user", "content": answer_prompt}]
        )
        answer = response['message']['content']
        reasoning_trace.append("✅ Generated answer with citations")

        # Step 5: Extract and verify citations
        citation_refs = re.findall(r'\[(\d+)\]', answer)
        # Sort numerically so citations are listed in a stable order that
        # follows the [N] markers in the answer.
        for ref in sorted(set(citation_refs), key=int):
            key = f"[{ref}]"
            if key in source_map:
                src = source_map[key]
                citations.append(Citation(
                    claim=f"Reference {key}",
                    source_document=src['document'],
                    source_text=src['text'],
                    confidence=0.85,  # fixed placeholder, not a computed score
                    reasoning_path=[f"Entity: {src['entity']}"]
                ))
        reasoning_trace.append(f"🔒 Verified {len(citations)} citations")

        return AnswerWithCitations(
            answer=answer,
            citations=citations,
            reasoning_trace=reasoning_trace
        )
    except Exception as e:
        return AnswerWithCitations(
            answer=f"Error generating answer: {e}",
            citations=[],
            reasoning_trace=reasoning_trace
        )
# ============================================================================
# Streamlit UI
# ============================================================================
def main():
    st.set_page_config(
        page_title="Knowledge Graph RAG with Citations",
        page_icon="🔍",
        layout="wide"
    )
    st.title("🔍 Knowledge Graph RAG with Verifiable Citations")
    st.markdown("""
This demo shows how **Knowledge Graph-based RAG** provides:

- **Multi-hop reasoning** across connected information
- **Verifiable source attribution** for every claim
- **Transparent reasoning traces** you can audit

Unlike traditional vector RAG, every answer is traceable to its source documents.
""")

    # Sidebar configuration. Defaults come from the environment so the values
    # set in docker-compose (NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD) take effect.
    st.sidebar.header("⚙️ Configuration")
    neo4j_uri = st.sidebar.text_input("Neo4j URI", os.environ.get("NEO4J_URI", "bolt://localhost:7687"))
    neo4j_user = st.sidebar.text_input("Neo4j User", os.environ.get("NEO4J_USER", "neo4j"))
    neo4j_password = st.sidebar.text_input(
        "Neo4j Password", type="password", value=os.environ.get("NEO4J_PASSWORD", "password")
    )
    llm_model = st.sidebar.selectbox("LLM Model", ["llama3.2", "mistral", "phi3"])

    # Initialize session state
    if 'graph_initialized' not in st.session_state:
        st.session_state.graph_initialized = False
        st.session_state.documents = []

    # Main content
    tab1, tab2, tab3 = st.tabs(["📄 Add Documents", "❓ Ask Questions", "🔬 View Graph"])

    with tab1:
        st.header("Step 1: Build Knowledge Graph from Documents")
        sample_docs = {
            "AI Research Paper": """
GraphRAG is a technique developed by Microsoft Research that combines knowledge graphs
with retrieval-augmented generation. Unlike traditional RAG which uses vector similarity,
GraphRAG builds a structured knowledge graph from documents, enabling multi-hop reasoning.
The technique was introduced by researchers including Darren Edge and Ha Trinh.
GraphRAG excels at answering complex questions that require connecting information
from multiple sources, such as "What are the relationships between different research projects?"
""",
            "Company Report": """
Acme Corp was founded in 2020 by Jane Smith and John Doe in San Francisco.
The company develops AI-powered analytics tools for enterprise customers.
Their flagship product, DataSense, uses machine learning to analyze business data.
Jane Smith previously worked at Google as a senior engineer on the TensorFlow team.
John Doe was a co-founder of StartupX, which was acquired by Microsoft in 2019.
Acme Corp raised $50 million in Series B funding led by Sequoia Capital.
"""
        }
        doc_choice = st.selectbox("Choose sample document:", list(sample_docs.keys()))
        doc_text = st.text_area("Or paste your own document:", sample_docs[doc_choice], height=200)
        doc_name = st.text_input("Document name:", doc_choice)

        if st.button("🔨 Extract & Add to Knowledge Graph"):
            with st.spinner("Extracting entities and relationships..."):
                try:
                    graph = KnowledgeGraphManager(neo4j_uri, neo4j_user, neo4j_password)
                    entities, relationships = extract_entities_with_llm(doc_text, doc_name, llm_model)
                    for entity in entities:
                        graph.add_entity(entity)
                    for rel in relationships:
                        graph.add_relationship(rel)
                    graph.close()

                    st.success(f"✅ Extracted {len(entities)} entities and {len(relationships)} relationships")
                    with st.expander("View Extracted Entities"):
                        for e in entities:
                            st.write(f"**{e.name}** ({e.entity_type}): {e.description}")
                    with st.expander("View Extracted Relationships"):
                        for r in relationships:
                            st.write(f"{r.source} --[{r.relation_type}]--> {r.target}: {r.description}")

                    st.session_state.graph_initialized = True
                    st.session_state.documents.append(doc_name)
                except Exception as e:
                    st.error(f"Error: {e}")
                    st.info("Make sure Neo4j is running and Ollama has the model pulled.")

    with tab2:
        st.header("Step 2: Ask Questions with Verifiable Answers")
        if not st.session_state.graph_initialized:
            st.warning("⚠️ Please add documents to the knowledge graph first.")
        else:
            st.info(f"📚 Knowledge graph contains documents: {', '.join(st.session_state.documents)}")
            query = st.text_input("Enter your question:", "What are the key concepts in GraphRAG and who developed it?")
            if st.button("🔍 Ask with Citations"):
                with st.spinner("Traversing knowledge graph and generating answer..."):
                    try:
                        graph = KnowledgeGraphManager(neo4j_uri, neo4j_user, neo4j_password)
                        result = generate_answer_with_citations(query, graph, llm_model)
                        graph.close()

                        # Display reasoning trace
                        st.subheader("🧠 Reasoning Trace")
                        for step in result.reasoning_trace:
                            st.write(step)

                        # Display answer
                        st.subheader("💬 Answer")
                        st.markdown(result.answer)

                        # Display citations, labelled with the same [N] marker
                        # that appears inline in the answer
                        st.subheader("📚 Source Citations")
                        if result.citations:
                            for citation in result.citations:
                                with st.expander(f"{citation.claim}: {citation.source_document}"):
                                    st.write(f"**Source Document:** {citation.source_document}")
                                    st.write(f"**Source Text:** {citation.source_text}")
                                    st.write(f"**Confidence:** {citation.confidence:.0%}")
                                    st.write(f"**Reasoning Path:** {' → '.join(citation.reasoning_path)}")
                        else:
                            st.info("No specific citations extracted for this answer.")
                    except Exception as e:
                        st.error(f"Error: {e}")

    with tab3:
        st.header("🔬 Knowledge Graph Visualization")
        st.info("This tab shows the structure of your knowledge graph.")

        if st.button("📊 Show Graph Statistics"):
            try:
                graph = KnowledgeGraphManager(neo4j_uri, neo4j_user, neo4j_password)
                with graph.driver.session() as session:
                    node_count = session.run("MATCH (n) RETURN count(n) as count").single()['count']
                    rel_count = session.run("MATCH ()-[r]->() RETURN count(r) as count").single()['count']
                col1, col2 = st.columns(2)
                col1.metric("Total Entities", node_count)
                col2.metric("Total Relationships", rel_count)
                graph.close()
            except Exception as e:
                st.error(f"Error connecting to Neo4j: {e}")

        if st.button("🗑️ Clear Graph"):
            try:
                graph = KnowledgeGraphManager(neo4j_uri, neo4j_user, neo4j_password)
                graph.clear_graph()
                graph.close()
                st.session_state.graph_initialized = False
                st.session_state.documents = []
                st.success("Graph cleared!")
            except Exception as e:
                st.error(f"Error: {e}")


if __name__ == "__main__":
    main()


@@ -0,0 +1,3 @@
streamlit>=1.28.0
ollama>=0.1.0
neo4j>=5.0.0