SOLID

ScrapeGraphAI — Does "Extract Any Web Data With a Natural Language Prompt" Actually Work?

Name: ScrapeGraphAI — Does "Extract Any Web Data With a Natural Language Prompt" Actually Work?
Item: Scrapegraph-ai
Rating: 5
Author: Balaji Loganathan

github.com/ScrapeGraphAI/Scrapegraph-ai ★ 23000 stars

Claim tested

A Python web scraping library that uses LLMs and graph logic to extract structured data from web pages using natural language prompts. Tested on macOS Apple Silicon, Python 3.13, using ollama/llama3.1:8b and nomic-embed-text locally. Core claim holds — SmartScraperGraph correctly extracts structured JSON from static pages with no API key. JS-rendered pages are unreliable. llama3.1:8b (4.9GB) is the minimum recommended model — smaller models miss content.

Criteria Scorecard

Criterion	Score
install_works	true
claim_testable	true
readme_accurate	false
creator_notified	false
errors_documented	true
claim_tested_clean_env	true
verdict_matches_evidence	true

Environment

os: macOS (Apple Silicon)

test_user: repoverifiertest (isolated)

test_method: isolated user, clean venv, no prior scrapegraphai install

api_key_used: false

python_version: 3.13

embeddings_model: ollama/nomic-embed-text

llm_models_tested: ollama/llama3.2, ollama/llama3.1:8b

scrapegraphai_version: 1.76.0

Full Review

What This Repo Claims

Extract structured data from any website using natural language prompts. No CSS selectors. No XPath. No brittle parsing code. Just describe what you want and the library does it.

23k stars. MIT license. Active development as of May 2026.

The core promise: pip install scrapegraphai → configure with any LLM (including local ollama models) → run SmartScraperGraph with a prompt and a URL → get back clean structured JSON.

What I Tested

Environment:

macOS, Apple Silicon

repoverifiertest isolated user — clean venv

Python 3.13

ollama/llama3.2:latest and ollama/llama3.1:8b

ollama/nomic-embed-text for embeddings

No API key used at any point

Test 1: Install

pip install scrapegraphai
python3 -m playwright install chromium

Both commands complete without errors. The Playwright browser must be installed as a separate step, which the README documents.
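To confirm that the clean venv really picked up the packages, a small check along these lines can help. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and it only verifies the Python packages, not the Chromium binary that Playwright downloads.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string for a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Report what the active environment actually contains.
    for name in ("scrapegraphai", "playwright"):
        print(name, installed_version(name) or "not installed")
```

Running it inside the venv should print a version for both packages if the install step succeeded.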

Test 2: SmartScraperGraph with llama3.2 (2GB) — example.com

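from scrapegraphai.graphs import SmartScraperGraph

# Local ollama models for generation and embeddings; no API key is needed.
graph_config = {
    "llm": {"model": "ollama/llama3.2", "format": "json", "base_url": "http://localhost:11434"},
    "embeddings": {"model": "ollama/nomic-embed-text", "base_url": "http://localhost:11434"},
    "headless": True,
}
# Describe the fields to extract in plain English and point the graph at a URL.
result = SmartScraperGraph(
    prompt="Extract the domain name, main heading, and description",
    source="https://example.com",
    config=graph_config
).run()
print(result)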

Output:

{
    "content": {
        "domain_name": "example.com",
        "main_heading": "Example Domain",
        "description": "This domain is used in documentation examples without needing permission."
    }
}

Correct. Clean structured JSON from a simple static page.
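Because the output is LLM-generated, it can be worth checking that the fields named in the prompt actually came back. The following is a minimal sketch, assuming the result is a dict shaped like the output above (fields nested under "content", or at the top level as in later tests); `missing_fields` is a hypothetical helper, not part of the library.

```python
def missing_fields(result, required):
    """Return the required keys that are absent or empty in an extraction result.

    Looks inside result["content"] when that key exists, otherwise at the
    top level of the result.
    """
    payload = result.get("content", result) if isinstance(result, dict) else {}
    if not isinstance(payload, dict):
        payload = {}
    # Treat None and empty containers or strings as "not extracted".
    return [key for key in required if payload.get(key) in (None, "", [], {})]
```

For the example.com output above, asking for `domain_name`, `main_heading`, and `description` returns an empty list, so nothing is missing.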

Test 3: SmartScraperGraph with llama3.2 — JS-rendered pages

Tested on Railway docs and scrapegraphai.com, both of which are JS-rendered.

Railway docs result:

{"url": "https://station.railway.com/questions", "text": "If you're stuck don't hesitate to open a Help Thread."}

scrapegraphai.com result:

{"api": {"status": null, "compare": null}, "resources": {"blog": null}, "social": {"github": null}}

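The scrapegraphai.com result is the site's navigation structure with every value null, and the Railway result is a single help-link snippet rather than page content. No actual content was extracted in either case. JS-rendered pages are not handled reliably with the 2GB model.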
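The all-null failure mode is easy to detect automatically. A minimal sketch, assuming the result is plain nested JSON (dicts, lists, scalars); `null_fraction` is a hypothetical helper name.

```python
def leaf_values(node):
    """Yield every scalar value found in a nested dict/list structure."""
    if isinstance(node, dict):
        for value in node.values():
            yield from leaf_values(value)
    elif isinstance(node, list):
        for value in node:
            yield from leaf_values(value)
    else:
        yield node


def null_fraction(result):
    """Return the share of leaf values that are None (1.0 for an empty result)."""
    leaves = list(leaf_values(result))
    if not leaves:
        return 1.0
    return sum(value is None for value in leaves) / len(leaves)
```

Applied to the two results above, the scrapegraphai.com output scores 1.0 and the Railway output scores 0.0. The Railway case shows the limit of this check: it catches empty skeletons but not results that are populated with the wrong content.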

Test 4: Wikipedia with llama3.2 vs llama3.1:8b

Same prompt on the Python Wikipedia page:

| Model | Section headings extracted |
|-------|---------------------------|
| llama3.2 (2GB) | 2 headings |
| llama3.1:8b (4.9GB) | 7 headings |

llama3.1:8b result:

{
    "main_title": "Python (programming language)",
    "summary": "Multi-paradigm programming language with a design philosophy emphasizing code readability and simplicity.",
    "section_headings": ["History", "Design philosophy and features", "Syntax and semantics", "Code examples", "Libraries", "Development environments", "Implementations"]
}

With the same page and prompt, the larger model extracted 7 section headings against 2 for llama3.2.
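A side-by-side comparison like this can be scripted by building one config per model. This is a sketch under assumptions: `make_graph_config` and `compare_models` are hypothetical helpers, the config mirrors the one used in Test 2, and the scrapegraphai import is deferred so the config builder works without the library installed.

```python
OLLAMA_URL = "http://localhost:11434"


def make_graph_config(llm_model, embeddings_model="ollama/nomic-embed-text", base_url=OLLAMA_URL):
    """Build a SmartScraperGraph config for a local ollama model."""
    return {
        "llm": {"model": llm_model, "format": "json", "base_url": base_url},
        "embeddings": {"model": embeddings_model, "base_url": base_url},
        "headless": True,
    }


def compare_models(models, prompt, source):
    """Run the same prompt and source once per model and collect the results."""
    # Imported here so the config builder stays usable without scrapegraphai.
    from scrapegraphai.graphs import SmartScraperGraph

    results = {}
    for model in models:
        graph = SmartScraperGraph(prompt=prompt, source=source, config=make_graph_config(model))
        results[model] = graph.run()
    return results
```

Calling `compare_models(["ollama/llama3.2", "ollama/llama3.1:8b"], prompt, url)` with your own prompt and URL returns a dict keyed by model name, which makes differences in extracted fields easy to diff.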

Test 5: Railway docs with llama3.1:8b

{
    "title": "Quick Start",
    "description": "Railway's Quick Start guide to deploying a project with GitHub, CLI, or Docker image."
}

The title and description are correct. Steps and sections are still not extracted; the page is JS-rendered, and the HTML that Playwright fetched appears to be incomplete.

Findings

Finding 1: Core claim holds on static pages

SmartScraperGraph correctly fetches, parses, and extracts structured JSON from static HTML pages. The pipeline (fetch → parse → generate answer) ran cleanly in every static-page test. No API key is required.

Finding 2: llama3.1:8b is the minimum recommended model

llama3.2 (2GB) misses content on complex pages. llama3.1:8b (4.9GB) extracts meaningfully more: 7 section headings against 2 on the Wikipedia test. The README shows examples using llama3.2 without mentioning this limitation, so first-time users who copy those examples are likely to get poor results.

Finding 3: JS-rendered pages are unreliable

Playwright fetches the page, but LLM extraction fails on heavily JS-rendered content with both models tested. Navigation structure or partial metadata is returned instead of the page content. This affects many modern web apps and documentation sites.
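One way to route pages before scraping is a rough pre-check on the raw HTML. The sketch below is a heuristic only, not something the library provides: it flags a page as likely JS-rendered when it contains script tags but very little visible text. The name `looks_js_rendered` and the 200-character threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Count script tags and visible text characters in an HTML document."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self.script_tags = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside script/style-like elements counts as visible.
        if not self._skip_depth:
            self.visible_chars += len(data.strip())


def looks_js_rendered(html, min_visible_chars=200):
    """Heuristic: scripts present but almost no visible text in the raw HTML."""
    counter = _TextCounter()
    counter.feed(html)
    return counter.script_tags > 0 and counter.visible_chars < min_visible_chars
```

A page that trips this check is a candidate for skipping or for separate preprocessing. A page that passes can still be partly JS-rendered, so the check reduces wasted runs rather than guaranteeing good extraction.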

Finding 4: nomic-embed-text required for ollama config

The README ollama example requires both an LLM model and an embeddings model (nomic-embed-text). New users need to pull both separately, and the quickstart does not clearly highlight this.
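A quick preflight can catch a missing model before the first run. This is a minimal sketch, assuming the local ollama server exposes its model list at `/api/tags` with a `models` array of objects that carry a `name` field; `missing_models` and `fetch_tags` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def normalize(name):
    """Strip the 'ollama/' prefix and add ':latest' when no tag is given."""
    name = name.removeprefix("ollama/")
    return name if ":" in name else name + ":latest"


def missing_models(tags_payload, required):
    """Return the required model names that do not appear in the tags payload."""
    available = {normalize(model["name"]) for model in tags_payload.get("models", [])}
    return [name for name in required if normalize(name) not in available]


def fetch_tags(base_url="http://localhost:11434"):
    """Fetch the list of locally available models from a running ollama server."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)
```

With the server running, `missing_models(fetch_tags(), ["ollama/llama3.1:8b", "ollama/nomic-embed-text"])` returns whichever of the two still needs to be pulled.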

Finding 5: Telemetry enabled by default

Anonymous usage metrics are collected by default. Opting out requires setting SCRAPEGRAPHAI_TELEMETRY_ENABLED=false. This is documented but easy to miss.
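The opt-out can also be set from inside a script. A minimal sketch, assuming the library reads the variable from the process environment; setting it before scrapegraphai is imported is the cautious ordering, since the exact point at which the library reads it was not verified here.

```python
import os

# Opt out of anonymous usage metrics for this process.
# Set this before importing scrapegraphai so the library sees it from the start.
os.environ["SCRAPEGRAPHAI_TELEMETRY_ENABLED"] = "false"
```

Exporting the same variable in the shell profile of the test user achieves the same effect for every run.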

What I Did Not Test

SearchGraph (multi-page scraping)

PDF and local file extraction

GPT-4o or cloud LLM providers

Concurrent scraping performance

Verdict: Solid

ScrapeGraphAI delivers on its core claim for static HTML pages. Install it, point it at a local ollama model, describe what you want in plain English, and you get back structured JSON — with no API key and no cloud dependency. The pipeline is clean, fast, and correctly implemented.

Two caveats matter before using it. First, llama3.1:8b (4.9GB) is the minimum recommended model for reliable results; the README examples use llama3.2, which was too small for the more complex pages tested. Second, JS-rendered pages remained unreliable with both models tested.

For the Web Intelligence Agent Stack, use it on documentation sites, Wikipedia, and other static content sources. Avoid JS-heavy web apps without additional preprocessing.

Included in Solution #3: Web Intelligence Agent Stack.

This review follows RepoVerifier Standard v1.0. [Read the standard →](https://repoverifier.dev/about)
