What This Repo Claims
Extract structured data from any website using natural language prompts. No CSS selectors. No XPath. No brittle parsing code. Just describe what you want and the library does it.
23k stars. MIT license. Active development as of May 2026.
The core promise:
pip install scrapegraphai → configure with any LLM (including local ollama models) → run SmartScraperGraph with a prompt and a URL → get back clean structured JSON.
What I Tested
Environment:
- macOS, Apple Silicon
- repoverifiertest isolated user — clean venv
- Python 3.13
- ollama/llama3.2:latest and ollama/llama3.1:8b
- ollama/nomic-embed-text for embeddings
- No API key used at any point
Test 1: Install
pip install scrapegraphai
python3 -m playwright install chromium
Both work cleanly. Playwright must be installed separately — documented in the README.
Test 2: SmartScraperGraph with llama3.2 (2GB) — example.com
from scrapegraphai.graphs import SmartScraperGraph

# Both models are served by a local ollama instance; no API key is needed.
graph_config = {
    "llm": {"model": "ollama/llama3.2", "format": "json", "base_url": "http://localhost:11434"},
    "embeddings": {"model": "ollama/nomic-embed-text", "base_url": "http://localhost:11434"},
    "headless": True,
}

result = SmartScraperGraph(
    prompt="Extract the domain name, main heading, and description",
    source="https://example.com",
    config=graph_config,
).run()
print(result)
Output:
{
  "content": {
    "domain_name": "example.com",
    "main_heading": "Example Domain",
    "description": "This domain is used in documentation examples without needing permission."
  }
}
Correct. Clean structured JSON from a simple static page.
Test 3: SmartScraperGraph with llama3.2 — JS-rendered pages
Tested on Railway docs and scrapegraphai.com (both JS-rendered):
Railway docs result:
{"url": "https://station.railway.com/questions", "text": "If you're stuck don't hesitate to open a Help Thread."}
scrapegraphai.com result:
{"api": {"status": null, "compare": null}, "resources": {"blog": null}, "social": {"github": null}}
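The scrapegraphai.com result is the navigation structure with all null values, and the Railway result is a single help-link snippet. Neither contains the actual page content. JS-rendered pages are not handled reliably with the 2GB model.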
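A result like the scrapegraphai.com one is easy to mistake for success because it is valid JSON. The following is a minimal sketch of a sanity check, not part of ScrapeGraphAI: the helper names `leaf_values` and `null_ratio` are hypothetical, and the check only measures how many leaf values in a result are null.

```python
import json


def leaf_values(node):
    """Yield every non-container value in a nested dict/list structure."""
    if isinstance(node, dict):
        for value in node.values():
            yield from leaf_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from leaf_values(item)
    else:
        yield node


def null_ratio(result):
    """Return the fraction of leaf values that are None (0.0 if there are no leaves)."""
    leaves = list(leaf_values(result))
    if not leaves:
        return 0.0
    return sum(value is None for value in leaves) / len(leaves)


# The scrapegraphai.com output from Test 3, copied verbatim.
js_result = json.loads(
    '{"api": {"status": null, "compare": null}, '
    '"resources": {"blog": null}, "social": {"github": null}}'
)
print(null_ratio(js_result))  # 1.0: every leaf is null
```

A ratio near 1.0 suggests the model saw the page skeleton but not its content. Where to set the cutoff is a judgment call that this review did not calibrate.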
Test 4: Wikipedia with llama3.2 vs llama3.1:8b
Same prompt on the Python Wikipedia page:
| Model | Section headings extracted |
|-------|---------------------------|
| llama3.2 (2GB) | 2 headings |
| llama3.1:8b (4.9GB) | 7 headings |
llama3.1:8b result:
{
  "main_title": "Python (programming language)",
  "summary": "Multi-paradigm programming language with a design philosophy emphasizing code readability and simplicity.",
  "section_headings": ["History", "Design philosophy and features", "Syntax and semantics", "Code examples", "Libraries", "Development environments", "Implementations"]
}
Significantly better extraction with the larger model.
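The comparison above can be repeated with a small harness. This is a sketch under assumptions rather than the script used for this review: `build_graph_config`, `count_headings`, and `compare_models` are hypothetical helpers, the config mirrors the one from Test 2, and the `section_headings` key is taken from the llama3.1:8b output (a model may name the field differently).

```python
def build_graph_config(model, base_url="http://localhost:11434"):
    """Build a graph config like the one in Test 2, swapping only the LLM."""
    return {
        "llm": {"model": f"ollama/{model}", "format": "json", "base_url": base_url},
        "embeddings": {"model": "ollama/nomic-embed-text", "base_url": base_url},
        "headless": True,
    }


def count_headings(result, key="section_headings"):
    """Count extracted headings; returns 0 if the key is missing or not a list."""
    headings = result.get(key) if isinstance(result, dict) else None
    return len(headings) if isinstance(headings, list) else 0


def compare_models(models, prompt, source):
    """Run the same prompt against each model and return {model: heading count}."""
    # Imported lazily so the helpers above work without the library installed.
    from scrapegraphai.graphs import SmartScraperGraph

    counts = {}
    for model in models:
        graph = SmartScraperGraph(
            prompt=prompt, source=source, config=build_graph_config(model)
        )
        counts[model] = count_headings(graph.run())
    return counts
```

Calling `compare_models(["llama3.2", "llama3.1:8b"], prompt, source)` with the same prompt and page URL for both models keeps the LLM as the only variable. Output from a local model can vary between runs, so a single count per model is indicative rather than conclusive.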
Test 5: Railway docs with llama3.1:8b
{
  "title": "Quick Start",
  "description": "Railway's Quick Start guide to deploying a project with GitHub, CLI, or Docker image."
}
The title and description are correct, but the steps and sections are still not extracted. The page is JS-rendered, and the HTML that Playwright fetches appears to be incomplete.
Findings
Finding 1: Core claim holds on static pages
SmartScraperGraph correctly fetches, parses, and extracts structured JSON from static HTML pages. The pipeline (fetch → parse → generate answer) ran cleanly in every static-page test. No API key required.
Finding 2: llama3.1:8b is the minimum recommended model
llama3.2 (2GB) misses content on complex pages. llama3.1:8b (4.9GB) extracts meaningfully more: 7 section headings versus 2 on the same Wikipedia page. The README shows examples using llama3.2 without mentioning this limitation. First-time users who copy those examples are likely to get poor results on complex pages.
Finding 3: JS-rendered pages are unreliable
Playwright fetches the page, but LLM extraction failed on heavily JS-rendered content with both models tested. Navigation structure or partial metadata is returned instead of the page content. This is likely to affect many modern web apps and documentation sites.
Finding 4: nomic-embed-text required for ollama config
The README ollama example requires both an LLM model and an embeddings model (nomic-embed-text). New users need to pull both separately. This is not clearly highlighted in the quickstart.
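A quick pre-flight check can catch a missing model before the first scrape. The sketch below is an illustration, not something ScrapeGraphAI provides: it assumes ollama is listening on the default `http://localhost:11434` and uses ollama's `/api/tags` route, which lists locally available models. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def normalize(name):
    """Strip the 'ollama/' prefix and add ':latest' when no tag is given."""
    name = name.removeprefix("ollama/")
    return name if ":" in name else f"{name}:latest"


def missing_models(required, installed):
    """Return the required models that are absent from the installed list."""
    have = {normalize(name) for name in installed}
    return [name for name in required if normalize(name) not in have]


def installed_models(base_url="http://localhost:11434"):
    """List local model names reported by ollama's /api/tags endpoint."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as response:
        payload = json.load(response)
    return [entry["name"] for entry in payload.get("models", [])]


def pull_hints(required, installed):
    """Build one 'ollama pull' hint per missing model."""
    return [f"ollama pull {normalize(name)}" for name in missing_models(required, installed)]
```

For the config used in this review, `pull_hints(["ollama/llama3.1:8b", "ollama/nomic-embed-text"], installed_models())` returns an empty list once both models have been pulled.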
Finding 5: Telemetry enabled by default
Anonymous usage metrics are collected by default. Opt-out requires setting SCRAPEGRAPHAI_TELEMETRY_ENABLED=false. Documented but easy to miss.
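For scripts that cannot rely on the shell environment, the opt-out can also be set from Python. This is a minimal sketch that assumes the library reads the variable when it is imported or later, so the assignment is placed before any scrapegraphai import. That ordering is a precaution, not something verified in this review.

```python
import os

# Opt out of telemetry using the variable named in the README.
# Assumption: this must happen before scrapegraphai is imported.
os.environ["SCRAPEGRAPHAI_TELEMETRY_ENABLED"] = "false"

# Import the library only after this point, for example:
# from scrapegraphai.graphs import SmartScraperGraph
```

Exporting the variable in the shell or in a `.env` file loaded at startup achieves the same thing without touching the script.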
What I Did Not Test
- SearchGraph (multi-page scraping)
- PDF and local file extraction
- GPT-4o or cloud LLM providers
- Concurrent scraping performance
Verdict: Solid
ScrapeGraphAI delivers on its core claim for static HTML pages. Install it, point it at a local ollama model, describe what you want in plain English, and you get back structured JSON — with no API key and no cloud dependency. The pipeline is clean, fast, and correctly implemented.
Two caveats to know before using it. First, llama3.1:8b (4.9GB) is the minimum recommended model for reliable results; the README examples use llama3.2, which missed content on the complex pages tested here. Second, JS-rendered pages remained unreliable with both models tested.
For the Web Intelligence Agent Stack, use it on documentation sites, Wikipedia, and other static content sources. Avoid JS-heavy web apps without additional preprocessing.
Included in Solution #3: Web Intelligence Agent Stack.
This review follows RepoVerifier Standard v1.0. [Read the standard →](https://repoverifier.dev/about)