A powerful, AI-driven API that makes web scraping accessible to hobbyists, individual developers, and due diligence teams alike. No complex coding required.

Advanced capabilities that make complex web scraping simple

AI-Powered Configuration
Generate complex scraping configurations from just a URL and natural language prompt. Supports multiple LLM providers through OpenRouter, including GPT-4, Claude, and Gemini.
Learn more →

Execute multi-step browser interactions, handle logins, pagination, conditional logic, and dynamic content with a powerful navigation engine. Includes optional human-like timing randomization for more realistic interactions.
See navigation types →

Automatic proxy rotation, health checks, and support for multiple sources (file, API) to ensure reliable and undetected scraping.
Proxy API details →

Maintain browser sessions across jobs, preserving cookies and localStorage to bypass logins and reduce CAPTCHAs for more efficient scraping.
Session docs →

Extract data precisely using CSS selectors, XPath, regular expressions, or even custom JavaScript functions. Handle complex and nested data structures.
Extraction options →

Monitor asynchronous jobs with granular progress updates, including percentage completion and detailed status messages throughout the scraping process.
Job Status API →

Simulate more realistic user behavior by automatically randomizing delays, pauses, and interaction timing within navigation steps.
Behaviour options →

It only takes a few minutes to start extracting web data
Familiarize yourself with the concepts and available API endpoints. The documentation provides comprehensive details on how to use the various features.
Browse Documentation

Use the AI endpoint to create a scraping configuration by simply describing what data you want to extract.
curl -X POST http://localhost:3000/api/v1/ai/generate-config \
-H "Content-Type: application/json" \
-d '{
"url": "https://news.ycombinator.com/",
"prompt": "Extract the title and URL for each story listed on the front page.",
"previousJobId": "",
"fetchHtmlForRefinement": false,
"options": {
"maxIterations": 3,
"testConfig": true,
"model": "anthropic/claude-3.7-sonnet:thinking",
"maxTokens": 8192,
"temperature": 0.7,
"browserOptions": {
"headless": true,
"proxy": false
},
"interactionHints": []
}
}'
This will return a Job ID that you can use to check the status of your configuration generation.
{
  "success": true,
  "message": "AI configuration generation job queued successfully",
  "data": {
    "jobId": "6675cd68-4d40-4e2a-bf23-0d9fe3d5c157",
    "statusUrl": "/api/v1/jobs/6675cd68-4d40-4e2a-bf23-0d9fe3d5c157"
  },
  "timestamp": "2025-04-27T11:37:21.336Z"
}
Poll the job status endpoint until the status is "completed". This will give you the generated configuration.
curl http://localhost:3000/api/v1/jobs/6675cd68-4d40-4e2a-bf23-0d9fe3d5c157
The response will tell you the status of the job and the generated configuration:
{
"success": true,
"message": "Job status retrieved",
"data": {
"id": "6675cd68-4d40-4e2a-bf23-0d9fe3d5c157",
"name": "generate-config",
"queueName": "config-generation-jobs",
"status": "active",
"progress": {
"percentage": 75,
"status": "Testing Config (Iteration 1)"
},
"result": null,
"estimatedCost": null,
"createdAt": 1745736033517,
"numberInQueue": 0
},
"timestamp": "2025-04-27T06:40:51.035Z"
}
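If you are scripting this step, a minimal polling loop might look like the following sketch (it assumes the jq command-line JSON tool is installed and the same local deployment used above):

JOB_ID="6675cd68-4d40-4e2a-bf23-0d9fe3d5c157"
while true; do
  # Fetch only the job status field from the status endpoint.
  STATUS=$(curl -s "http://localhost:3000/api/v1/jobs/$JOB_ID" | jq -r '.data.status')
  echo "status: $STATUS"
  if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ]; then
    break
  fi
  sleep 2
done

Once the status is "completed", the job result contains the generated configuration: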
{
"startUrl": "https://news.ycombinator.com/",
"steps": [
{
"type": "extract",
"name": "hackerNewsStories",
"selector": "table#hnmain",
"fields": {
"stories": {
"selector": "tr.athing.submission",
"type": "css",
"multiple": true,
"fields": {
"title": {
"selector": ".titleline a:first-child",
"type": "css"
},
"url": {
"selector": ".titleline a:first-child",
"type": "css",
"attribute": "href"
}
}
}
}
}
],
"options": {
"timeout": 30000,
"javascript": false
}
}
After the job completes, take the generated configuration and use it with the navigation endpoint to start scraping data as a human would, with the ability to handle dynamic content, pagination, and conditional logic.
curl -X POST http://localhost:3000/api/v1/navigate \
-H "Content-Type: application/json" \
-d '{
"startUrl": "https://news.ycombinator.com/",
"steps": [
{
"type": "extract",
"name": "hackerNewsStories",
"selector": "table#hnmain",
"fields": {
"stories": {
"selector": "tr.athing.submission",
"type": "css",
"multiple": true,
"fields": {
"title": {
"selector": ".titleline a:first-child",
"type": "css"
},
"url": {
"selector": ".titleline a:first-child",
"type": "css",
"attribute": "href"
}
}
}
}
}
],
"options": {
"timeout": 30000,
"javascript": false
}
}'
This will return another Job ID. Once that job completes, you'll have your scraped data!
{
  "success": true,
  "message": "Navigation job queued successfully",
  "data": {
    "jobId": "891bb3c2-3376-4957-94b8-ce70ae8fc31d",
    "statusUrl": "/api/v1/jobs/891bb3c2-3376-4957-94b8-ce70ae8fc31d"
  },
  "timestamp": "2025-04-27T11:37:21.336Z"
}
After the navigation job completes, you can retrieve the scraped data using the job ID.
curl http://localhost:3000/api/v1/jobs/891bb3c2-3376-4957-94b8-ce70ae8fc31d
This will return the scraped data.
If the job is still processing, you'll receive a 202 status code with a message indicating that the job is still in progress, along with a status URL you can use to check it again. Once the job completes, the extracted data looks like this:
{
"hackerNewsStories": {
"stories": [
{
"title": "Reverse Geocoding Is Hard",
"url": "https://shkspr.mobi/blog/2025/04/reverse-geocoding-is-hard/"
},
{
"title": "Shardines: SQLite3 Database-per-Tenant with ActiveRecord",
"url": "https://blog.julik.nl/2025/04/a-can-of-shardines"
},
/* ... More Listings ... */
]
}
}
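The full job status response wraps this data in execution metadata: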
{
"success": true,
"message": "Job status retrieved",
"data": {
"id": "e420b897-336a-4691-a249-0226c6c17380",
"name": "execute-flow",
"queueName": "navigation-jobs",
"status": "completed",
"progress": {
"percentage": 100,
"status": "Completed"
},
"result": {
"id": "e420b897-336a-4691-a249-0226c6c17380",
"startUrl": "https://news.ycombinator.com/",
"status": "completed",
"stepsExecuted": 1,
"result": {
"hackerNewsStories": {
"stories": [
{
"title": "Reverse Geocoding Is Hard",
"url": "https://shkspr.mobi/blog/2025/04/reverse-geocoding-is-hard/"
},
{
"title": "Shardines: SQLite3 Database-per-Tenant with ActiveRecord",
"url": "https://blog.julik.nl/2025/04/a-can-of-shardines"
},
{
"title": "We're building a dystopia just to make people click on ads [video]",
"url": "https://www.ted.com/talks/zeynep_tufekci_we_re_building_a_dystopia_just_to_make_people_click_on_ads"
},
{
"title": "Mesmerizing Interlocking Geometric Patterns Produced with Japanese Woodworking",
"url": "https://www.smithsonianmag.com/smithsonian-institution/see-the-mesmerizing-interlocking-geometric-patterns-produced-with-this-ancient-japanese-woodworking-technique-180986494/"
},
{
"title": "Show HN: Remote-Controlled IKEA Deathstar Lamp",
"url": "https://gitlab.com/sephalon/deathstar_lamp"
},
{
"title": "Show HN: Lil digi – play a platformer game as yourself",
"url": "https://www.lildigi.me/"
},
{
"title": "Show HN: A Common Lisp implementation in development, supports ASDF",
"url": "https://savannah.nongnu.org/p/alisp"
},
{
"title": "Wikipedia: Database Download",
"url": "https://en.wikipedia.org/wiki/Wikipedia:Database_download"
},
{
"title": "How to program a text adventure in C",
"url": "https://helderman.github.io/htpataic/htpataic01.html"
},
{
"title": "Open-source interactive C tutorial in the browser",
"url": "https://www.learn-c.org/"
},
{
"title": "ZFS: Apple's New Filesystem that wasn't (2016)",
"url": "https://ahl.dtrace.org/2016/06/15/apple_and_zfs/"
},
{
"title": "Earth's oceans used to be green, and they could turn purple next",
"url": "https://newatlas.com/science/earths-oceans-used-to-be-green-and-they-could-turn-purple-next/"
},
{
"title": "Found a simple tool for database modeling: dbdiagram.io",
"url": "https://dbdiagram.io"
},
{
"title": "Show HN: Bhvr, a Bun and Hono and Vite and React Starter",
"url": "https://bhvr.dev"
},
{
"title": "Icônes",
"url": "https://icones.js.org/"
},
{
"title": "Show HN: My self-written hobby OS is finally running on my vintage IBM ThinkPad",
"url": "https://github.com/joexbayer/RetrOS-32"
},
{
"title": "Bare metal printf – C standard library without OS",
"url": "https://popovicu.com/posts/bare-metal-printf/"
},
{
"title": "An end to all this prostate trouble?",
"url": "https://yarchive.net/blog/prostate/"
},
{
"title": "Sigbovik Conference Proceedings 2025 [pdf]",
"url": "https://sigbovik.org/2025/proceedings.pdf"
},
{
"title": "Watching o3 guess a photo's location is surreal, dystopian and entertaining",
"url": "https://simonwillison.net/2025/Apr/26/o3-photo-locations/"
},
{
"title": "What Porn Did to American Culture",
"url": "https://www.theatlantic.com/newsletters/archive/2025/04/what-porn-did-to-american-culture/682610/"
},
{
"title": "Anatomy of a SQL Engine",
"url": "https://www.dolthub.com/blog/2025-04-25-sql-engine-anatomy/"
},
{
"title": "Former Disney employee who hacked Disney World menus sentenced to 3 years",
"url": "https://databreaches.net/2025/04/24/former-disney-employeedwho-hacked-disney-world-restaurant-menus-in-revenge-sentenced-to-3-years-in-federal-prison/"
},
{
"title": "Bill Gates's Personal Easter Eggs in 8 Bit BASIC (2008)",
"url": "https://www.pagetable.com/?p=43"
},
{
"title": "Chongqing, the Largest City – In Pictures",
"url": "https://www.theguardian.com/world/gallery/2025/apr/27/chongqing-the-worlds-largest-city-in-pictures"
},
{
"title": "Compiler Reminders",
"url": "https://jfmengels.net/compiler-reminders/"
},
{
"title": "Cloth",
"url": "https://www.cloudofoz.com/verlet-test/"
},
{
"title": "Australian who ordered radioactive materials walks away from court",
"url": "https://www.chemistryworld.com/news/australian-who-ordered-radioactive-materials-over-the-internet-walks-away-from-court/4021306.article"
},
{
"title": "Parity (YC S24) is hiring founding engineers to build an AI SRE (in-person, SF)",
"url": "https://www.ycombinator.com/companies/parity/jobs"
},
{
"title": "The Friendship Recession: The lost art of connecting",
"url": "https://www.happiness.hks.harvard.edu/february-2025-issue/the-friendship-recession-the-lost-art-of-connecting"
}
]
}
},
"screenshots": [],
"timestamp": "2025-04-27T16:05:28.213Z",
"queueName": "navigation-jobs"
},
"createdAt": 1745769924089,
"completedAt": 1745769928220
},
"timestamp": "2025-04-27T16:12:00.487Z"
}
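To tie the whole flow together, here is a minimal end-to-end sketch in bash. It is not the official client, just an illustration: it assumes jq is installed, that the API is running locally as in the examples above, that the optional request fields shown earlier can be omitted, and that a completed generation job returns the configuration in its data.result field.

#!/usr/bin/env bash
# End-to-end sketch: generate a config with AI, then run it through /navigate.
BASE="http://localhost:3000/api/v1"

# Poll a job until it finishes, then print its final status JSON.
wait_for_job() {
  local job_id="$1" status
  while true; do
    status=$(curl -s "$BASE/jobs/$job_id" | jq -r '.data.status')
    if [ "$status" = "completed" ] || [ "$status" = "failed" ]; then
      break
    fi
    sleep 2
  done
  curl -s "$BASE/jobs/$job_id"
}

# 1. Queue an AI config-generation job.
gen_id=$(curl -s -X POST "$BASE/ai/generate-config" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://news.ycombinator.com/", "prompt": "Extract the title and URL for each story listed on the front page."}' \
  | jq -r '.data.jobId')

# 2. Wait for generation to finish and pull out the configuration
#    (assumed to live in data.result of the completed job).
config=$(wait_for_job "$gen_id" | jq '.data.result')

# 3. Queue a navigation job with the generated configuration.
nav_id=$(curl -s -X POST "$BASE/navigate" \
  -H "Content-Type: application/json" \
  -d "$config" | jq -r '.data.jobId')

# 4. Wait for the navigation job and print the extracted data
#    (data.result.result, matching the completed response shown above).
wait_for_job "$nav_id" | jq '.data.result.result'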
Explore the diverse applications of the Advanced Web Scraper API
Gather data on market trends, competitor pricing, product reviews, and customer sentiment from various online sources to inform business strategy.
Track prices of products across e-commerce sites, monitor competitor pricing strategies, and identify arbitrage opportunities automatically.
Extract contact information from business directories, professional networks, or event websites to build targeted outreach lists.
Collect articles, blog posts, news updates, or job listings from multiple sources to build curated content platforms and newsletters.
Augment existing datasets by scraping additional information from public profiles, company websites, or other online sources.
Use the navigation engine to automate UI testing, check website functionality, or monitor website uptime and performance.
A rich set of endpoints to meet all your data extraction needs
POST /api/v1/ai/generate-config
- Generate scraping configurations using AI.
POST /api/v1/navigate
- Execute complex browser navigation flows.
POST /api/v1/scrape
- Perform direct scraping tasks.
GET /api/v1/jobs
- List all jobs in the queue (example below).
GET /api/v1/jobs/:id
- Check the status and retrieve results of asynchronous jobs.
GET /api/v1/proxy/...
- Manage and utilize proxies.
GET /api/v1/session/...
- Manage persistent browser sessions.
GET /api/v1/templates
- List available configuration templates.
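For example, to list the jobs currently in the queue using the local deployment from the quick start:

curl http://localhost:3000/api/v1/jobs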
Start using the Advanced Web Scraper API today and transform how you collect data from the web.
See what you can extract with just a few API calls
Powerful language model capabilities built right into the API
Access a wide range of AI models through our unified OpenRouter integration, including GPT-4, Claude, Gemini, and more.
Built-in token usage tracking and cost estimation for each model, helping you manage your AI budget effectively.
Fine-tune your AI requests with customizable parameters like temperature, max tokens, and provider-specific options.
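These parameters map directly onto the options block of the generate-config request shown in the quick start, for example:

"options": {
  "model": "anthropic/claude-3.7-sonnet:thinking",
  "maxTokens": 8192,
  "temperature": 0.7
}

Cost estimation, in turn, surfaces through the estimatedCost field of the job status responses shown above.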
Answers to common questions about the Advanced Web Scraper API
This feature allows you to generate a scraping configuration (`config.json`) simply by providing a target URL and a natural language prompt describing what data you want to extract. The API uses a Large Language Model (LLM) to create the configuration, potentially tests it, and can even attempt to fix errors automatically. See the AI API documentation for more details.
The Navigation Engine executes multi-step browser interactions defined in a configuration file. It can handle complex tasks like logging in, clicking buttons, filling forms, scrolling, handling popups, switching tabs/frames, and executing custom scripts. It uses Playwright for browser automation and supports various step types detailed in the Navigation Types documentation.
The API features a robust Proxy Manager that can load proxies from various sources (currently JSON files, API sources planned). It supports health checks, automatic rotation based on performance (latency, success rate), and filtering by criteria like protocol or country. You can interact with it via the Proxy API.
The API supports persistent browser sessions. This means cookies, localStorage, and sessionStorage can be saved and reused across different scraping jobs for the same domain. This helps bypass logins and reduces the likelihood of encountering CAPTCHAs on subsequent runs. See the Session Management documentation.
The API supports a wide range of models through OpenRouter, including GPT-4, Claude, Gemini, and more.
Yes! The AI Configuration Generation feature makes it accessible even for users with minimal technical knowledge. You can describe what data you want in plain English, and the API will create a working configuration for you. For more advanced users, the API provides full control over every aspect of the scraping process.
The API includes dedicated step types for handling logins, including the `login` step which simplifies common login flows. Combined with the session management feature, you can log in once and then reuse that authenticated session for subsequent scraping jobs. For examples, see the Navigation Examples.
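As a rough illustration only (the authoritative schema is in the Navigation Examples; every field name below is an assumption, not confirmed by this page), a login step might look something like:

{
  "type": "login",                          /* step type named above */
  "usernameSelector": "#username",          /* assumed field names, */
  "passwordSelector": "#password",          /* not confirmed by this page */
  "submitSelector": "button[type=submit]",
  "username": "your-username",
  "password": "your-password"
}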