Extract data from websites in LLM-ready JSON or CSV format. Crawl or scrape an entire website with Website Crawler.
The Website Crawler API allows developers to programmatically crawl websites and access structured metadata via a few simple endpoints. The API returns clean JSON responses and real-time crawl updates. The structured JSON produced by the /crawl/cwdata endpoint can be used for a variety of purposes: because the data is in an LLM-ready format, you can use it to train an AI model, build chatbots, audit websites, and more.
To use the API, you'll need an API Key.
How to get one:
https://www.websitecrawler.org/api
POST /crawl/authenticate
Obtain an access token through the API. This token must be included in all subsequent requests.
Key required in the JSON payload:
apiKey (string): Your API key
Sample request to get the token:
curl -X POST https://www.websitecrawler.org/api/crawl/authenticate \
-H "Content-Type: application/json" \
-d '{"apiKey": "your_api_key"}'
{
"token": "api_generated_token"
}
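If you prefer calling the API directly instead of using the SDK, the token request can also be issued from Python. A minimal sketch, assuming the requests library is installed and the payload/response shapes match the sample above:

# Minimal sketch of obtaining a token with the requests library (not the official SDK).
# Assumes the endpoint accepts the {"apiKey": ...} payload and returns {"token": ...}
# exactly as shown in the sample above.
import requests

API_KEY = "your_api_key"  # replace with your real API key

resp = requests.post(
    "https://www.websitecrawler.org/api/crawl/authenticate",
    json={"apiKey": API_KEY},
    timeout=30,
)
resp.raise_for_status()
token = resp.json()["token"]
print(token)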
POST /crawl/start
Initiate a new crawl for a given domain.
Keys required in the JSON payload:
url (string, required): Target website (e.g. example.com), i.e. a non-redirecting main URL of the website.
limit (integer, required): Max pages to crawl (the free tier is restricted to 100).
Sample request to initiate crawling:
curl -X POST https://www.websitecrawler.org/api/crawl/start \
-H "Authorization: Bearer api_generated_token" \
-H "Content-Type: application/json" \
-d '{"url": "your_url","limit":"your_limit"}'
Sample response while the crawl is in progress:
{
"status": "Crawling"
}
Sample response once the crawl has finished:
{
"status": "Completed!"
}
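A minimal Python sketch of the same request, reusing the token obtained from /crawl/authenticate; the token, URL, and limit values are placeholders:

# Minimal sketch of starting a crawl with requests. Assumes /crawl/start accepts
# a {"url": ..., "limit": ...} payload and returns a {"status": ...} body as shown above.
import requests

TOKEN = "api_generated_token"   # token returned by /crawl/authenticate
URL = "https://example.com"     # non-redirecting main URL of the site
LIMIT = 100                     # free tier is restricted to 100 pages

resp = requests.post(
    "https://www.websitecrawler.org/api/crawl/start",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"url": URL, "limit": LIMIT},
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("status"))  # e.g. "Crawling", later "Completed!"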
POST /crawl/cwdata
Retrieve the structured crawl output once crawling has completed.
Required key in the JSON payload:
url (string, required): Target website (e.g. example.com), i.e. a non-redirecting main URL of the website.
Sample request to retrieve the structured data:
curl -X POST https://www.websitecrawler.org/api/crawl/cwdata \
-H "Authorization: Bearer api_generated_token" \
-H "Content-Type: application/json" \
-d '{"url": "your_url"}'
{
"status": [
{
"tt": "WPTLS - WordPress Plugins, themes and related services",
"np": "12",
"h1": "",
"nw": "534",
"h2": "Why learn HTML when there is WordPress?",
"h3": "",
"h4": "",
"h5": "",
"atb": "Why learn HTML when there is WordPress?",
"sc": "200",
"md": "Reviews, comparison, and collection of top WordPress themes, plugins, related services, and useful WP tips.",
"elsc": "",
"textCN": "Websitedata.",
"d": "",
"mr": "follow, index",
"pname": "wptls.com",
"al": "",
"cn": "https://wptls.com/",
"kw": "",
"url": "https://wptls.com",
"at": "",
"external_links": "https://www.facebook.com/wptls",
"tm": "96",
"image_links": "https://wptls.com/wp-content/uploads/2021/12/cropped-wptls-logo.png | https://wptls.com/wp-content/uploads/2021/12/cropped-wptls-logo.png | https://wptls.com/wp-content/uploads/2024/02/Spaceship-768x378.jpg | https://wptls.com/wp-content/uploads/2023/12/AdSense-768x612.png | https://wptls.com/wp-content/uploads/2023/12/Exabytes-768x375.jpg | https://wptls.com/wp-content/uploads/2023/10/HTML-768x112.jpg | https://wptls.com/wp-content/uploads/2023/10/Cloudflare-add-site-768x363.png | https://wptls.com/wp-content/uploads/2023/01/Google-Trends-768x363.webp | https://wptls.com/wp-content/uploads/2022/11/Twenty-Twenty-Three-768x351.webp | https://wptls.com/wp-content/uploads/2022/11/Broken-Link-Checker-768x223.webp | https://wptls.com/wp-content/uploads/2022/11/wordpress_logo.webp | https://wptls.com/wp-content/uploads/2022/11/footer-css-768x327.webp",
"internal_links": "https://wptls.com/why-learn-html-when-there-is-wordpress/ | https://wptls.com/customize-footer-wordpress/",
"nofollow_links": ""
}
]
}
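Since the output is a flat list of per-page objects, it can also be exported as CSV. A minimal Python sketch, assuming the response shape shown above (a top-level "status" key holding the list of pages):

# Minimal sketch: fetch the structured data from /crawl/cwdata and flatten it to CSV.
import csv
import requests

TOKEN = "api_generated_token"
URL = "https://example.com"

resp = requests.post(
    "https://www.websitecrawler.org/api/crawl/cwdata",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"url": URL},
    timeout=60,
)
resp.raise_for_status()
pages = resp.json().get("status", [])  # list of per-page objects, as in the sample above

if pages:
    with open("crawl_data.csv", "w", newline="", encoding="utf-8") as f:
        # Use the first page's keys as columns; ignore any extra keys on later pages.
        writer = csv.DictWriter(f, fieldnames=sorted(pages[0].keys()),
                                restval="", extrasaction="ignore")
        writer.writeheader()
        writer.writerows(pages)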
POST /crawl/currentURL
Get the last crawled/processed URL.
Required key in the JSON payload:
url (string, required): Target website (e.g. example.com), i.e. a non-redirecting main URL of the website.
Sample request to get the last crawled/processed URL:
curl -X POST https://www.websitecrawler.org/api/crawl/currentURL \
-H "Authorization: Bearer api_generated_token" \
-H "Content-Type: application/json" \
-d '{"url": "your_url"}'
{
"currentURL": "https://wptls.com"
}
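A minimal Python sketch that polls this endpoint while a crawl is running; the token, URL, and polling interval are placeholder assumptions:

# Minimal sketch of polling /crawl/currentURL to watch crawl progress.
# Assumes the {"currentURL": ...} response shape shown above.
import time
import requests

TOKEN = "api_generated_token"
URL = "https://example.com"

for _ in range(10):  # poll ten times, two seconds apart
    resp = requests.post(
        "https://www.websitecrawler.org/api/crawl/currentURL",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"url": URL},
        timeout=30,
    )
    resp.raise_for_status()
    print("Last processed URL:", resp.json().get("currentURL"))
    time.sleep(2)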
POST /crawl/clear
Clear the previous job in case you want to rerun the crawler.
Required key in the JSON payload:
url (string, required): Target website (e.g. example.com), i.e. a non-redirecting main URL of the website.
Sample request to clear the job:
curl -X POST https://www.websitecrawler.org/api/crawl/clear \
-H "Authorization: Bearer api_generated_token" \
-H "Content-Type: application/json" \
-d '{"url": "your_url"}'
{
"clearStatus": "Job cannot be cleared as the URL of the entered website is being crawled."
}
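A minimal Python sketch of clearing a job before re-running a crawl, using the same token and URL placeholders as above:

# Minimal sketch of clearing a finished job. Assumes the {"clearStatus": ...} response
# shape shown above; as the sample response indicates, a job that is still being
# crawled cannot be cleared.
import requests

TOKEN = "api_generated_token"
URL = "https://example.com"

resp = requests.post(
    "https://www.websitecrawler.org/api/crawl/clear",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"url": URL},
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("clearStatus"))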
The Python and Java demos below show how to use the WebsiteCrawlerSDK to interact with websitecrawler.org, enabling automated URL submission, status tracking, and retrieval of crawl data via its API.
Install the Website Crawler SDK:
pip install website-crawler-sdk
Change YOUR_API_KEY, YOUR_URL, and YOUR_LIMIT in the following demo script and run it. The script submits a URL to websitecrawler.org, reports the crawl status and the URL currently being processed in real time, and retrieves the structured data once crawling is finished.
import time
from website_crawler_sdk import WebsiteCrawlerConfig, WebsiteCrawlerClient

"""
Author: Pramod Choudhary (websitecrawler.org)
Version: 1.1
Date: July 10, 2025
"""

# Replace with your actual API key, target URL, and limit
YOUR_API_KEY = "YOUR_API_KEY"  # Your API key goes here
URL = "YOUR_URL"               # Enter a non-redirecting URL/domain with https or http
LIMIT = YOUR_LIMIT             # Change YOUR_LIMIT to the maximum number of pages to crawl

def main():
    cfg = WebsiteCrawlerConfig(YOUR_API_KEY)
    client = WebsiteCrawlerClient(cfg)
    # Submit the URL and limit to websitecrawler.org via the API
    client.submit_url_to_website_crawler(URL, LIMIT)
    while True:
        task_status = client.get_task_status()  # Start retrieving data once task_status is True
        print(f"{task_status} << task status")
        time.sleep(2)  # Wait for 2 seconds
        if not task_status:
            break
        if task_status:
            status = client.get_crawl_status()     # get_crawl_status() returns the live crawl status
            currenturl = client.get_current_url()  # get_current_url() returns the URL currently being processed
            data = client.get_crawl_data()         # get_crawl_data() returns the structured data once crawling has completed
            if status:
                print(f"Current Status:: {status}")
            if status == "Crawling":  # "Crawling" is one of the statuses
                print(f"Current URL:: {currenturl}")
            if status == "Completed!":  # "Completed!" (with exclamation) is one of the statuses
                print("Task has been completed... closing the loop and getting the data...")
                if data:
                    print(f"JSON Data:: {data}")
                    time.sleep(20)  # Give extra time for large JSON responses
                    break
    print("Job over")

if __name__ == "__main__":
    main()
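If you want to keep the retrieved data for later processing, a small helper like the following can be added to the demo. This is a sketch that assumes get_crawl_data() returns either a JSON string or an already-parsed object (the demo above only prints it):

# Hypothetical helper: persist the crawl data returned by get_crawl_data() to disk.
import json

def save_crawl_data(data, path="crawl_data.json"):
    if isinstance(data, str):
        data = json.loads(data)  # parse if the SDK hands back a JSON string
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

# e.g. call save_crawl_data(data) inside the "Completed!" branch of the demo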
Download the jar file WebsiteCrawlerSDK-Java-1.0.jar and add it as a dependency in your Java project. Create the WebsiteCrawlerConfig object as shown in the following code, pass it to WebsiteCrawlerClient, and use the WebsiteCrawlerClient object to call the methods.
WebsiteCrawlerConfig config = new WebsiteCrawlerConfig("YOUR_API_KEY");
WebsiteCrawlerClient crawler = new WebsiteCrawlerClient(config);
package wc.WebsiteCrawlerAPIUsageDemo;

import wc.websitecrawlersdk.WebsiteCrawlerClient;
import wc.websitecrawlersdk.WebsiteCrawlerConfig;

/**
 *
 * @author Pramod
 */
public class WebsiteCrawlerAPIUsageDemo {

    public static void main(String[] args) throws InterruptedException {
        String status;
        String currenturl;
        String data;
        WebsiteCrawlerConfig cfg = new WebsiteCrawlerConfig("YOUR_API_KEY"); // replace YOUR_API_KEY with your API key
        WebsiteCrawlerClient client = new WebsiteCrawlerClient(cfg);
        client.submitUrlToWebsiteCrawler(URL, LIMIT); // replace URL with the website you want WebsiteCrawler.org to crawl and LIMIT with the maximum number of URLs
        boolean taskStatus;
        while (true) {
            taskStatus = client.getTaskStatus(); // getTaskStatus() should be true before you call any other methods
            System.out.println(taskStatus + " << task status");
            Thread.sleep(9000);
            if (taskStatus == true) {
                status = client.getCrawlStatus();    // getCrawlStatus() returns the live crawling status
                currenturl = client.getCurrentURL(); // getCurrentURL() returns the URL being processed by WebsiteCrawler.org
                data = client.getcwData();           // getcwData() returns the JSON array of the website data
                System.out.println("Crawl status::");
                if (status != null) {
                    System.out.println(status);
                }
                if (status != null && status.equals("Crawling")) { // status "Crawling" ----> crawl job is in progress
                    System.out.println("Current URL::" + currenturl);
                }
                if (status != null && status.equals("Completed!")) { // status "Completed!" ---> crawl job has completed successfully
                    System.out.println("Task has been completed.. closing the while loop");
                    if (data != null) {
                        System.out.println("Json Data::" + data);
                        Thread.sleep(20000); // JSON data might be huge; wait until the full response is retrieved
                        break; // exits the while(true) loop
                    }
                }
            }
        }
        System.out.println("job over");
    }
}
This section highlights how the XML-Sitemap-Generator project uses the websitecrawler.org API to automate XML sitemap generation.
The following steps outline the flow between the Website Crawler API and the sitemap generation logic:
1. Start Crawling: call the crawl/start endpoint to initiate crawling of your website: https://www.websitecrawler.org/api/crawl/start?url=example.com&limit=100&key=YOUR_API_KEY
2. Fetch Crawled Data: once crawling has finished, retrieve the structured output: https://www.websitecrawler.org/api/crawl/cwdata?url=example.com&key=YOUR_API_KEY
3. Process and Transform: parse the returned JSON and extract the crawled URLs (see the sketch after this list).
4. Generate Sitemap: write the URLs to sitemap.xml for SEO optimization and better search engine indexing.
Check out the full implementation here:
XML-Sitemap-Generator
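As a rough illustration of the "Process and Transform" and "Generate Sitemap" steps, the sketch below turns a /crawl/cwdata response into a sitemap.xml. It is not the XML-Sitemap-Generator project's actual code and assumes the response shape shown earlier (a "status" list whose objects carry a "url" field):

# Illustrative sketch: build sitemap.xml from a parsed /crawl/cwdata response.
import xml.etree.ElementTree as ET

def build_sitemap(cwdata, path="sitemap.xml"):
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for page in cwdata.get("status", []):   # per-page objects from the crawl output
        loc = page.get("url")
        if loc:
            url_el = ET.SubElement(urlset, "url")
            ET.SubElement(url_el, "loc").text = loc
    ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

# build_sitemap(response_json) writes sitemap.xml listing the crawled pages.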
For best results, ensure your API key is valid and your domain permits crawling.
## Feedback & Support
Found a bug or need help? Open an issue or connect via websitecrawler.org