Advanced Web Scraper API

Turn this natural language request...

Prompt

"Extract all trending stories from Hacker News including their titles, URLs, and point counts."

...into this scraped structured data automatically

JSON Result

{
  "success": true,
  "stories": [
    {
      "title": "The surprising size of Tesla's data empire",
      "url": "https://foundation.mozilla.org/en/privacynotincluded/articles/tesla-tracking/",
      "points": 314
    },
    {
      "title": "Understanding how HashMap works in Rust",
      "url": "https://dev.to/seanchen1991/understanding-how-hashmap-works-in-rust-2jj2",
      "points": 239
    }
    /* More stories... */
  ],
  "timestamp": "2025-04-27T12:34:56.789Z"
}

100+

Supported Step Types

LLM Integration Options

99.7%

Scraping Success Rate

Key Features

Advanced capabilities that make complex web scraping simple

AI-Powered Configuration NEW

Generate complex scraping configurations from just a URL and natural language prompt. Supports multiple LLM providers through OpenRouter, including GPT-4, Claude, and Gemini.

Learn more →

Advanced Navigation

Execute multi-step browser interactions, handle logins, pagination, conditional logic, and dynamic content with a powerful navigation engine. Includes optional human-like timing randomization for more realistic interactions.

See navigation types →

Robust Proxy Management

Automatic proxy rotation, health checks, and support for multiple sources (file, API) to ensure reliable and undetected scraping.

Proxy API details →

Persistent Sessions

Maintain browser sessions across jobs, preserving cookies and localStorage to bypass logins and reduce CAPTCHAs for more efficient scraping.

Session docs →

Flexible Extraction

Extract data precisely using CSS selectors, XPath, Regular Expressions, or even custom JavaScript functions. Handle complex and nested data structures.

Extraction options →

Detailed Job Progress

Monitor asynchronous jobs with granular progress updates, including percentage completion and detailed status messages throughout the scraping process.

Job Status API →

Human-like Behaviour

Simulate more realistic user behavior by automatically randomizing delays, pauses, and interaction timing within navigation steps.

Behaviour options →

Getting Started

It only takes a few minutes to start extracting web data

Explore the Documentation

Familiarize yourself with the concepts and available API endpoints. The documentation provides comprehensive details on how to use the various features.

Browse Documentation

Generate a Scraping Workflow Config with AI

Use the AI endpoint to create a scraping configuration by simply describing what data you want to extract.


curl -X POST http://localhost:3000/api/v1/ai/generate-config \
-H "Content-Type: application/json" \
-d '{
  "url": "https://news.ycombinator.com/",
  "prompt": "Extract the title and URL for each story listed on the front page.",
  "previousJobId": "",
  "fetchHtmlForRefinement": false,
  "options": {
    "maxIterations": 3,
    "testConfig": true,
    "model": "anthropic/claude-3.7-sonnet:thinking",
    "maxTokens": 8192,
    "temperature": 0.7,
    "browserOptions": {
      "headless": true,
      "proxy": false
    },
    "interactionHints": []
  }
}'

This will return a Job ID that you can use to check the status of your configuration generation.

Response Example

200: Queued Job Status


{
    "success":true,
    "message":"AI configuration generation job queued successfully",
    "data":{
        "jobId":"6675cd68-4d40-4e2a-bf23-0d9fe3d5c157",
        "statusUrl":"/api/v1/jobs/6675cd68-4d40-4e2a-bf23-0d9fe3d5c157"
    },
    "timestamp":"2025-04-27T11:37:21.336Z"
}
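
The same request can be issued from a script. The following is a minimal Python sketch using only the standard library, not an official client. The helper names build_generate_config_request and read_job_handle are hypothetical, the payload mirrors the curl example above, and the response fields (success, message, data.jobId, data.statusUrl) are taken from the queued response shown above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # same host as the curl example


def build_generate_config_request(url, prompt, model="anthropic/claude-3.7-sonnet:thinking"):
    """Build (but do not send) the POST request for /api/v1/ai/generate-config."""
    # Payload fields copied from the curl example; only url, prompt and model vary.
    payload = {
        "url": url,
        "prompt": prompt,
        "previousJobId": "",
        "fetchHtmlForRefinement": False,
        "options": {
            "maxIterations": 3,
            "testConfig": True,
            "model": model,
            "maxTokens": 8192,
            "temperature": 0.7,
            "browserOptions": {"headless": True, "proxy": False},
            "interactionHints": [],
        },
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/ai/generate-config",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def read_job_handle(response_body):
    """Return (jobId, statusUrl) from the queued response body (str or bytes)."""
    body = json.loads(response_body)
    if not body.get("success"):
        raise RuntimeError(body.get("message", "request was not accepted"))
    return body["data"]["jobId"], body["data"]["statusUrl"]
```

To send the request, pass it to urllib.request.urlopen and hand the raw body to read_job_handle, for example: with urllib.request.urlopen(req) as resp: job_id, status_url = read_job_handle(resp.read()).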

Check Job Status

Poll the job status endpoint until the status is "completed". The completed response includes the generated configuration.

curl http://localhost:3000/api/v1/jobs/6675cd68-4d40-4e2a-bf23-0d9fe3d5c157

The response will tell you the status of the job and the generated configuration:

Response Example (pending)


{
    "success": true,
    "message": "Job status retrieved",
    "data": {
        "id": "6675cd68-4d40-4e2a-bf23-0d9fe3d5c157",
        "name": "generate-config",
        "queueName": "config-generation-jobs",
        "status": "active",
        "progress": {
            "percentage": 75,
            "status": "Testing Config (Iteration 1)"
        },
        "result": null,
        "estimatedCost": null,
        "createdAt": 1745736033517,
        "numberInQueue": 0
    },
    "timestamp": "2025-04-27T06:40:51.035Z"
}

Response Example (completed): generated configuration


{
  "startUrl": "https://news.ycombinator.com/",
  "steps": [
      {
          "type": "extract",
          "name": "hackerNewsStories",
          "selector": "table#hnmain",
          "fields": {
              "stories": {
                  "selector": "tr.athing.submission",
                  "type": "css",
                  "multiple": true,
                  "fields": {
                      "title": {
                          "selector": ".titleline a:first-child",
                          "type": "css"
                      },
                      "url": {
                          "selector": ".titleline a:first-child",
                          "type": "css",
                          "attribute": "href"
                      }
                  }
              }
          }
      }
  ],
  "options": {
      "timeout": 30000,
      "javascript": false
  }
}
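
The polling step described above can be sketched in Python as follows. This is an illustration under assumptions, not the project's own client: the function names, the polling interval, and the attempt limit are hypothetical, and the "failed" status value is a guess (the page only shows "active" and "completed"). The fields data.status and data.progress come from the status response shown above.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:3000"  # same host as the curl examples


def fetch_job(job_id):
    """GET /api/v1/jobs/:id and return the decoded JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/api/v1/jobs/{job_id}") as resp:
        return json.loads(resp.read())


def wait_for_job(job_id, fetch=fetch_job, interval=2.0, max_attempts=60, sleep=time.sleep):
    """Poll until data.status is "completed" and return the final data object."""
    for _ in range(max_attempts):
        data = fetch(job_id)["data"]
        if data["status"] == "completed":
            return data
        if data["status"] == "failed":  # assumed failure value, not shown on this page
            raise RuntimeError(f"job {job_id} failed")
        # Report progress using the fields from the status response.
        progress = data.get("progress") or {}
        print(f"{progress.get('percentage', '?')}% {progress.get('status', '')}")
        sleep(interval)
    raise TimeoutError(f"job {job_id} not completed after {max_attempts} polls")
```

The fetch and sleep parameters exist so the loop can be exercised without a running server; in normal use, wait_for_job("6675cd68-4d40-4e2a-bf23-0d9fe3d5c157") is enough.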

Use Your Generated Config to Scrape Data

After the job completes, send the generated configuration to the navigation endpoint. The navigation engine drives a browser as a human would and can handle dynamic content, pagination, and conditional logic.


curl -X POST http://localhost:3000/api/v1/navigate \
-H "Content-Type: application/json" \
-d '{
  "startUrl": "https://news.ycombinator.com/",
  "steps": [
      {
          "type": "extract",
          "name": "hackerNewsStories",
          "selector": "table#hnmain",
          "fields": {
              "stories": {
                  "selector": "tr.athing.submission",
                  "type": "css",
                  "multiple": true,
                  "fields": {
                      "title": {
                          "selector": ".titleline a:first-child",
                          "type": "css"
                      },
                      "url": {
                          "selector": ".titleline a:first-child",
                          "type": "css",
                          "attribute": "href"
                      }
                  }
              }
          }
      }
  ],
  "options": {
      "timeout": 30000,
      "javascript": false
  }
}'

This will return another Job ID. Once that job completes, you'll have your scraped data!

Response Example (queued)

200: Queued Job Status


{
  "success":true,
  "message":"AI configuration generation job queued successfully",
  "data":{
      "jobId":"6675cd68-4d40-4e2a-bf23-0d9fe3d5c157",
      "statusUrl":"/api/v1/jobs/6675cd68-4d40-4e2a-bf23-0d9fe3d5c157"
  },
  "timestamp":"2025-04-27T11:37:21.336Z"
}
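
Submitting a configuration to the navigation endpoint from a script might look like the sketch below. The helper name build_navigate_request is hypothetical, and treating startUrl and steps as required keys is an assumption based on the example configuration above rather than a documented rule.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # same host as the curl examples


def build_navigate_request(config):
    """Build (but do not send) the POST /api/v1/navigate request for a config."""
    # Assumption: a usable config has at least these two top-level keys.
    missing = [key for key in ("startUrl", "steps") if key not in config]
    if missing:
        raise ValueError(f"config is missing expected keys: {missing}")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/navigate",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Checking the configuration locally before sending it catches an empty or truncated config early, which is cheaper than waiting for a queued job to fail.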

Get the Scraped Data

After the navigation job completes, you can retrieve the scraped data using the job ID.

curl http://localhost:3000/api/v1/jobs/891bb3c2-3376-4957-94b8-ce70ae8fc31d

This will return the scraped data.

Response Example (processing)

If the job is still processing, you'll get a 202 status code with a message indicating that the job is still processing and a status URL for checking its progress.


{
    "success": true,
    "message": "Job status retrieved",
    "data": {
        "id": "6675cd68-4d40-4e2a-bf23-0d9fe3d5c157",
        "name": "generate-config",
        "queueName": "config-generation-jobs",
        "status": "active",
        "progress": {
            "percentage": 75,
            "status": "Testing Config (Iteration 1)"
        },
        "result": null,
        "estimatedCost": null,
        "createdAt": 1745736033517,
        "numberInQueue": 0
    },
    "timestamp": "2025-04-27T06:40:51.035Z"
}

200: Data Retrieved


{
  "hackerNewsStories": {
      "stories": [
          {
              "title": "Reverse Geocoding Is Hard",
              "url": "https://shkspr.mobi/blog/2025/04/reverse-geocoding-is-hard/"
          },
          {
              "title": "Shardines: SQLite3 Database-per-Tenant with ActiveRecord",
              "url": "https://blog.julik.nl/2025/04/a-can-of-shardines"
          }
          /* ... More Listings ... */
      ]
  }
}

Full Response Example (completed)


{
  "success": true,
  "message": "Job status retrieved",
  "data": {
      "id": "e420b897-336a-4691-a249-0226c6c17380",
      "name": "execute-flow",
      "queueName": "navigation-jobs",
      "status": "completed",
      "progress": {
          "percentage": 100,
          "status": "Completed"
      },
      "result": {
          "id": "e420b897-336a-4691-a249-0226c6c17380",
          "startUrl": "https://news.ycombinator.com/",
          "status": "completed",
          "stepsExecuted": 1,
          "result": {
              "hackerNewsStories": {
                  "stories": [
                      {
                          "title": "Reverse Geocoding Is Hard",
                          "url": "https://shkspr.mobi/blog/2025/04/reverse-geocoding-is-hard/"
                      },
                      {
                          "title": "Shardines: SQLite3 Database-per-Tenant with ActiveRecord",
                          "url": "https://blog.julik.nl/2025/04/a-can-of-shardines"
                      },
                      {
                          "title": "We're building a dystopia just to make people click on ads [video]",
                          "url": "https://www.ted.com/talks/zeynep_tufekci_we_re_building_a_dystopia_just_to_make_people_click_on_ads"
                      },
                      {
                          "title": "Mesmerizing Interlocking Geometric Patterns Produced with Japanese Woodworking",
                          "url": "https://www.smithsonianmag.com/smithsonian-institution/see-the-mesmerizing-interlocking-geometric-patterns-produced-with-this-ancient-japanese-woodworking-technique-180986494/"
                      },
                      {
                          "title": "Show HN: Remote-Controlled IKEA Deathstar Lamp",
                          "url": "https://gitlab.com/sephalon/deathstar_lamp"
                      },
                      {
                          "title": "Show HN: Lil digi – play a platformer game as yourself",
                          "url": "https://www.lildigi.me/"
                      },
                      {
                          "title": "Show HN: A Common Lisp implementation in development, supports ASDF",
                          "url": "https://savannah.nongnu.org/p/alisp"
                      },
                      {
                          "title": "Wikipedia: Database Download",
                          "url": "https://en.wikipedia.org/wiki/Wikipedia:Database_download"
                      },
                      {
                          "title": "How to program a text adventure in C",
                          "url": "https://helderman.github.io/htpataic/htpataic01.html"
                      },
                      {
                          "title": "Open-source interactive C tutorial in the browser",
                          "url": "https://www.learn-c.org/"
                      },
                      {
                          "title": "ZFS: Apple's New Filesystem that wasn't (2016)",
                          "url": "https://ahl.dtrace.org/2016/06/15/apple_and_zfs/"
                      },
                      {
                          "title": "Earth's oceans used to be green, and they could turn purple next",
                          "url": "https://newatlas.com/science/earths-oceans-used-to-be-green-and-they-could-turn-purple-next/"
                      },
                      {
                          "title": "Found a simple tool for database modeling: dbdiagram.io",
                          "url": "https://dbdiagram.io"
                      },
                      {
                          "title": "Show HN: Bhvr, a Bun and Hono and Vite and React Starter",
                          "url": "https://bhvr.dev"
                      },
                      {
                          "title": "Icônes",
                          "url": "https://icones.js.org/"
                      },
                      {
                          "title": "Show HN: My self-written hobby OS is finally running on my vintage IBM ThinkPad",
                          "url": "https://github.com/joexbayer/RetrOS-32"
                      },
                      {
                          "title": "Bare metal printf – C standard library without OS",
                          "url": "https://popovicu.com/posts/bare-metal-printf/"
                      },
                      {
                          "title": "An end to all this prostate trouble?",
                          "url": "https://yarchive.net/blog/prostate/"
                      },
                      {
                          "title": "Sigbovik Conference Proceedings 2025 [pdf]",
                          "url": "https://sigbovik.org/2025/proceedings.pdf"
                      },
                      {
                          "title": "Watching o3 guess a photo's location is surreal, dystopian and entertaining",
                          "url": "https://simonwillison.net/2025/Apr/26/o3-photo-locations/"
                      },
                      {
                          "title": "What Porn Did to American Culture",
                          "url": "https://www.theatlantic.com/newsletters/archive/2025/04/what-porn-did-to-american-culture/682610/"
                      },
                      {
                          "title": "Anatomy of a SQL Engine",
                          "url": "https://www.dolthub.com/blog/2025-04-25-sql-engine-anatomy/"
                      },
                      {
                          "title": "Former Disney employee who hacked Disney World menus sentenced to 3 years",
                          "url": "https://databreaches.net/2025/04/24/former-disney-employeedwho-hacked-disney-world-restaurant-menus-in-revenge-sentenced-to-3-years-in-federal-prison/"
                      },
                      {
                          "title": "Bill Gates's Personal Easter Eggs in 8 Bit BASIC (2008)",
                          "url": "https://www.pagetable.com/?p=43"
                      },
                      {
                          "title": "Chongqing, the Largest City – In Pictures",
                          "url": "https://www.theguardian.com/world/gallery/2025/apr/27/chongqing-the-worlds-largest-city-in-pictures"
                      },
                      {
                          "title": "Compiler Reminders",
                          "url": "https://jfmengels.net/compiler-reminders/"
                      },
                      {
                          "title": "Cloth",
                          "url": "https://www.cloudofoz.com/verlet-test/"
                      },
                      {
                          "title": "Australian who ordered radioactive materials walks away from court",
                          "url": "https://www.chemistryworld.com/news/australian-who-ordered-radioactive-materials-over-the-internet-walks-away-from-court/4021306.article"
                      },
                      {
                          "title": "Parity (YC S24) is hiring founding engineers to build an AI SRE (in-person, SF)",
                          "url": "https://www.ycombinator.com/companies/parity/jobs"
                      },
                      {
                          "title": "The Friendship Recession: The lost art of connecting",
                          "url": "https://www.happiness.hks.harvard.edu/february-2025-issue/the-friendship-recession-the-lost-art-of-connecting"
                      }
                  ]
              }
          },
          "screenshots": [],
          "timestamp": "2025-04-27T16:05:28.213Z",
          "queueName": "navigation-jobs"
      },
      "createdAt": 1745769924089,
      "completedAt": 1745769928220
  },
  "timestamp": "2025-04-27T16:12:00.487Z"
}
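
In the completed response above, the scraped records sit under data.result.result, keyed by the name of the extract step. A small Python sketch for pulling them out follows; the function name extract_stories is hypothetical, and the key names hackerNewsStories and stories come from this example's configuration and will differ for other configs.

```python
def extract_stories(job_response, step_name="hackerNewsStories"):
    """Return the list of stories from a completed navigation job response."""
    data = job_response["data"]
    if data["status"] != "completed":
        raise ValueError(f"job is {data['status']!r}, not completed")
    # The outer "result" is the job result; the inner "result" holds one
    # entry per named extract step from the configuration.
    step_output = data["result"]["result"][step_name]
    return step_output["stories"]
```

With the decoded JSON body in a variable named response, the titles can then be listed with: for story in extract_stories(response): print(story["title"], story["url"]).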

More Examples

What Can You Build?

Explore the diverse applications of the Advanced Web Scraper API

Market Research

Gather data on market trends, competitor pricing, product reviews, and customer sentiment from various online sources to inform business strategy.

Price Monitoring

Track prices of products across e-commerce sites, monitor competitor pricing strategies, and identify arbitrage opportunities automatically.

Lead Generation

Extract contact information from business directories, professional networks, or event websites to build targeted outreach lists.

Content Aggregation

Collect articles, blog posts, news updates, or job listings from multiple sources to build curated content platforms and newsletters.

Data Enrichment

Augment existing datasets by scraping additional information from public profiles, company websites, or other online sources.

Automated Testing

Use the navigation engine to automate UI testing, check website functionality, or monitor website uptime and performance.

Comprehensive API

A rich set of endpoints to meet all your data extraction needs

POST /api/v1/ai/generate-config - Generate scraping configurations using AI.
POST /api/v1/navigate - Execute complex browser navigation flows.
POST /api/v1/scrape - Perform direct scraping tasks.
GET /api/v1/jobs - Get all jobs in the queue.
GET /api/v1/jobs/:id - Check the status and retrieve results of asynchronous jobs.
GET /api/v1/proxy/... - Manage and utilize proxies.
GET /api/v1/session/... - Manage persistent browser sessions.
GET /api/v1/templates - List available configuration templates.

Complete API Reference

Ready to Extract the Data You Need?

Start using the Advanced Web Scraper API today and transform how you collect data from the web.

Star on GitHub Read the Docs

Example Scraped Data

See what you can extract with just a few API calls

Google Maps Result

{
  "success": true,
  "searchResults": {
    "listings": [
      {
        "name": "Black Brick Studio",
        "rating": "5.0",
        "reviews": 12,
        "address": "117 Bute St",
        "phone": "029 2252 0485"
      },
      /* ... More Listings ... */
    ]
  },
  "timestamp": "2025-04-13T19:12:14.318Z"
}

HackerNews Result

{
  "success": true,
  "stories": [
    {
      "title": "The surprising size of Tesla's data empire",
      "url": "https://...",
      "points": 314
    },
    {
      "title": "Understanding how HashMap works in Rust",
      "url": "https://...",
      "points": 239
    }
  ],
  "timestamp": "2025-04-27T12:34:56.789Z"
}

Google Trends Result

{
  "success": true,
  "trendsData": {
    "trends": [
      {
        "title": "AI Development Tools",
        "searchVolume": "100K+"
      },
      /* ... More Trends ... */
    ]
  },
  "timestamp": "2025-04-28T10:00:00.000Z"
}

AI Integration

Powerful language model capabilities built right into the API

Multiple Model Support

Access a wide range of AI models through our unified OpenRouter integration, including GPT-4, Claude, Gemini, and more.

Cost Tracking

Built-in token usage tracking and cost estimation for each model, helping you manage your AI budget effectively.

Customizable Parameters

Fine-tune your AI requests with customizable parameters like temperature, max tokens, and provider-specific options.

AI Integration Documentation

Frequently Asked Questions

Answers to common questions about the Advanced Web Scraper API

What is the AI Configuration Generation feature?

This feature allows you to generate a scraping configuration (`config.json`) simply by providing a target URL and a natural language prompt describing what data you want to extract. The API uses a Large Language Model (LLM) to create the configuration, potentially tests it, and can even attempt to fix errors automatically. See the AI API documentation for more details.

How does the Navigation Engine work?

The Navigation Engine executes multi-step browser interactions defined in a configuration file. It can handle complex tasks like logging in, clicking buttons, filling forms, scrolling, handling popups, switching tabs/frames, and executing custom scripts. It uses Playwright for browser automation and supports various step types detailed in the Navigation Types documentation.

What kind of proxies does the API support?

The API features a robust Proxy Manager that can load proxies from various sources (currently JSON files, API sources planned). It supports health checks, automatic rotation based on performance (latency, success rate), and filtering by criteria like protocol or country. You can interact with it via the Proxy API.
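
To make the idea of performance-based rotation concrete, here is a minimal Python sketch of filtering a proxy pool and choosing the best candidate. It illustrates the general technique only and is not the Proxy Manager's actual implementation: the Proxy fields, the pick_proxy function, and the ranking rule (highest success rate first, then lowest latency) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Proxy:
    host: str
    protocol: str
    country: str
    latency_ms: float
    success_rate: float  # fraction of successful requests, 0.0 to 1.0
    healthy: bool = True  # outcome of the most recent health check


def pick_proxy(proxies, protocol=None, country=None):
    """Return the best healthy proxy matching the filters, or None."""
    candidates = [
        p for p in proxies
        if p.healthy
        and (protocol is None or p.protocol == protocol)
        and (country is None or p.country == country)
    ]
    if not candidates:
        return None
    # Rank by success rate first, then prefer lower latency.
    return max(candidates, key=lambda p: (p.success_rate, -p.latency_ms))
```

Calling pick_proxy(pool, country="US") restricts the choice to healthy US proxies, while pick_proxy(pool) considers every healthy proxy in the pool.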

How are browser sessions managed?

The API supports persistent browser sessions. This means cookies, localStorage, and sessionStorage can be saved and reused across different scraping jobs for the same domain. This helps bypass logins and reduces the likelihood of encountering CAPTCHAs on subsequent runs. See the Session Management documentation.

What models are supported for AI Configuration Generation?

The API supports a wide range of models through OpenRouter, including:

OpenAI models (GPT-4, GPT-3.5-turbo)
Anthropic models (Claude-3-7-Sonnet, Claude-3-5-Sonnet)
Google models (Gemini 2.5 Pro)
DeepSeek models (DeepSeek Reasoner, DeepSeek Chat)

You can specify your preferred model in the request options, and the system will automatically route to the appropriate provider. See the AI API documentation for more details.

Is the API suitable for beginners?

Yes! The AI Configuration Generation feature makes it accessible even for users with minimal technical knowledge. You can describe what data you want in plain English, and the API will create a working configuration for you. For more advanced users, the API provides full control over every aspect of the scraping process.

How do I handle websites that require login?

The API includes dedicated step types for handling logins, including the `login` step which simplifies common login flows. Combined with the session management feature, you can log in once and then reuse that authenticated session for subsequent scraping jobs. For examples, see the Navigation Examples.

Web Data Extraction Made Simple