Website Crawler

Extract data from websites in LLM-ready JSON or CSV format. Crawl or scrape entire websites with Website Crawler.


πŸ•·οΈ Website Crawler API

The Website Crawler API allows developers to programmatically crawl websites and access structured metadata via five simple endpoints. The API returns clean JSON responses and real-time crawl updates. The structured output of the /crawl/cwdata endpoint can be used for a variety of purposes: because the data is in an LLM-ready JSON format, you can use it to train an AI model, build chatbots, audit websites, and more.


πŸ” Authentication

To use the API, you’ll need an API Key.

How to get one:

  1. Visit websitecrawler.org
  2. Create an account or log in
  3. Go to the Settings page to generate your API key

🌐 Base URL

https://www.websitecrawler.org/api


πŸ“‘ Endpoints

1. POST /crawl/authenticate

Obtain an access token through the API. This token must be included in all subsequent requests.

  • Key required in the JSON payload:

    • apiKey (string): Your API Key
  • Sample Request to get the token:

curl -X POST https://www.websitecrawler.org/api/crawl/authenticate \
     -H "Content-Type: application/json" \
     -d '{"apiKey": "your_api_key"}'

    • Sample Response:
{
  "token": "api_generated_token"
}
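
For programmatic use outside curl, here is a minimal Python sketch of the same call using the third-party requests library (a sketch only; the apiKey payload field and token response field are taken from the samples above):

import requests

BASE_URL = "https://www.websitecrawler.org/api"

def get_token(api_key: str) -> str:
    # Exchange an API key for an access token (payload/response fields as in the sample above)
    resp = requests.post(f"{BASE_URL}/crawl/authenticate",
                         json={"apiKey": api_key},
                         timeout=30)
    resp.raise_for_status()
    return resp.json()["token"]

token = get_token("your_api_key")
print(token)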


2. POST /crawl/start

Initiate a new crawl for a given domain.

  • Keys required in the JSON payload:

    • url (string, required): Target website (e.g. example.com), i.e. the non-redirecting main URL of the website.
    • limit (integer, required): Max pages to crawl (the free tier is restricted to 100)
  • Sample Request to initiate crawling:

curl -X POST https://www.websitecrawler.org/api/crawl/start \
     -H "Authorization: Bearer api_generated_token" \
     -H "Content-Type: application/json" \
     -d '{"url": "your_url", "limit": "your_limit"}'

    • Sample Response 1:
{
  "status": "Crawling"
}

    • Sample Response 2:
{
  "status": "Completed!"
} 
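
The same request in Python, reusing a token from the authentication sketch above (a sketch only; the limit is sent as a string to mirror the sample payload):

import requests

BASE_URL = "https://www.websitecrawler.org/api"
token = "api_generated_token"  # obtained from /crawl/authenticate

resp = requests.post(f"{BASE_URL}/crawl/start",
                     headers={"Authorization": f"Bearer {token}"},
                     json={"url": "example.com", "limit": "100"},
                     timeout=30)
resp.raise_for_status()
print(resp.json()["status"])  # "Crawling" while in progress, "Completed!" when done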

3. POST /crawl/cwdata

Retrieve the structured crawl output once crawling has completed.

  • Required key in the JSON payload:

    • url (string, required): Target website (e.g. example.com)
  • Sample Request to get data:

curl -X POST https://www.websitecrawler.org/api/crawl/cwdata \
     -H "Authorization: Bearer api_generated_token" \
     -H "Content-Type: application/json" \
     -d '{"url": "your_url"}'

    • Sample Response:
{
  "status": [
    {
      "tt": "WPTLS - WordPress Plugins, themes and related services",
      "np": "12",
      "h1": "",
      "nw": "534",
      "h2": "Why learn HTML when there is WordPress?",
      "h3": "",
      "h4": "",
      "h5": "",
      "atb": "Why learn HTML when there is WordPress?",
      "sc": "200",
      "md": "Reviews, comparison, and collection of top WordPress themes, plugins, related services, and useful WP tips.",
      "elsc": "",
      "textCN": "Websitedata.",
      "d": "",
      "mr": "follow, index",
      "pname": "wptls.com",
      "al": "",
      "cn": "https://wptls.com/",
      "kw": "",
      "url": "https://wptls.com",
      "at": "",
      "external_links": "https://www.facebook.com/wptls",
      "tm": "96",
      "image_links": "https://wptls.com/wp-content/uploads/2021/12/cropped-wptls-logo.png | https://wptls.com/wp-content/uploads/2021/12/cropped-wptls-logo.png | https://wptls.com/wp-content/uploads/2024/02/Spaceship-768x378.jpg | https://wptls.com/wp-content/uploads/2023/12/AdSense-768x612.png | https://wptls.com/wp-content/uploads/2023/12/Exabytes-768x375.jpg | https://wptls.com/wp-content/uploads/2023/10/HTML-768x112.jpg | https://wptls.com/wp-content/uploads/2023/10/Cloudflare-add-site-768x363.png | https://wptls.com/wp-content/uploads/2023/01/Google-Trends-768x363.webp | https://wptls.com/wp-content/uploads/2022/11/Twenty-Twenty-Three-768x351.webp | https://wptls.com/wp-content/uploads/2022/11/Broken-Link-Checker-768x223.webp | https://wptls.com/wp-content/uploads/2022/11/wordpress_logo.webp | https://wptls.com/wp-content/uploads/2022/11/footer-css-768x327.webp",
      "internal_links": "https://wptls.com/why-learn-html-when-there-is-wordpress/ | https://wptls.com/customize-footer-wordpress/",
      "nofollow_links": ""
    }
  ]
}
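
Because the output is plain JSON, converting it to CSV is a single transformation step. A minimal sketch, assuming the response shape shown above (per-page records under the status key) and picking a few representative fields:

import csv
import requests

BASE_URL = "https://www.websitecrawler.org/api"
token = "api_generated_token"  # obtained from /crawl/authenticate

resp = requests.post(f"{BASE_URL}/crawl/cwdata",
                     headers={"Authorization": f"Bearer {token}"},
                     json={"url": "example.com"},
                     timeout=60)
resp.raise_for_status()
pages = resp.json()["status"]  # list of per-page records, as in the sample response

fields = ["url", "tt", "sc", "nw"]  # URL, title, status code, word count
with open("crawl_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    for page in pages:
        writer.writerow({k: page.get(k, "") for k in fields})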

4. POST /crawl/currentURL

Get the last crawled/processed URL.

  • Required key in the JSON payload:

    • url (string, required): Target website (e.g. example.com), i.e. the non-redirecting main URL of the website.
  • Sample Request to get the last crawled/processed URL:

curl -X POST https://www.websitecrawler.org/api/crawl/currentURL \
     -H "Authorization: Bearer api_generated_token" \
     -H "Content-Type: application/json" \
     -d '{"url": "your_url"}'

    • Sample Response:
{
  "currentURL": "https://wptls.com"
}
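
The status and currentURL endpoints can be combined into a simple polling loop for progress feedback. A hedged sketch, assuming (as Sample Responses 1 and 2 suggest) that repeated calls to /crawl/start report the current status of a job that is already running:

import time
import requests

BASE_URL = "https://www.websitecrawler.org/api"
token = "api_generated_token"  # obtained from /crawl/authenticate
headers = {"Authorization": f"Bearer {token}"}
site = "example.com"

while True:
    status = requests.post(f"{BASE_URL}/crawl/start", headers=headers,
                           json={"url": site, "limit": "100"}, timeout=30).json()["status"]
    if status == "Completed!":
        print("Crawl finished")
        break
    current = requests.post(f"{BASE_URL}/crawl/currentURL", headers=headers,
                            json={"url": site}, timeout=30).json()["currentURL"]
    print(f"{status}: {current}")
    time.sleep(2)  # poll interval chosen arbitrarily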

5. POST /crawl/clear

Clear the previous job in case you want to rerun the crawler.

  • Required key in the JSON payload:

    • url (string, required): Target website (e.g. example.com), i.e. the non-redirecting main URL of the website.
  • Sample Request to clear the job:

curl -X POST https://www.websitecrawler.org/api/crawl/clear \
     -H "Authorization: Bearer api_generated_token" \
     -H "Content-Type: application/json" \
     -d '{"url": "your_url"}'

    • Sample Response:
{
  "clearStatus": "Job cannot be cleared as the URL of the entered website is being crawled."
}
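
A short Python equivalent (a sketch only); the clearStatus message indicates whether the job could actually be cleared, and as the sample shows, a job cannot be cleared while its URL is still being crawled:

import requests

BASE_URL = "https://www.websitecrawler.org/api"
token = "api_generated_token"  # obtained from /crawl/authenticate

resp = requests.post(f"{BASE_URL}/crawl/clear",
                     headers={"Authorization": f"Bearer {token}"},
                     json={"url": "example.com"},
                     timeout=30)
print(resp.json()["clearStatus"])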

πŸ•ΈοΈ Website Crawler API Usage Demo

The Python and Java demos showcase how to use the WebsiteCrawlerSDK to interact with websitecrawler.org, enabling automated URL submission, status tracking, and retrieval of crawl data via its API.

Python

Install the Website Crawler SDK:

pip install website-crawler-sdk

Change YOUR_API_KEY, YOUR_URL, and YOUR_LIMIT in the following demo script and run it. The script submits a URL to websitecrawler.org, reports the crawl status and the current URL being processed by websitecrawler in real time, and retrieves the structured data once crawling has finished.

import time
from website_crawler_sdk import WebsiteCrawlerConfig, WebsiteCrawlerClient

"""
Author: Pramod Choudhary (websitecrawler.org)
Version: 1.1
Date: July 10, 2025
"""

# Replace with your actual API key, target URL, and limit
YOUR_API_KEY = "YOUR_API_KEY" #Your API key goes here
URL = "YOUR_URL" #Enter a non redirecting URL/domain with https or http
LIMIT = YOUR_LIMIT #Change YOUR_LIMIT 

def main():
    cfg = WebsiteCrawlerConfig(YOUR_API_KEY)
    client = WebsiteCrawlerClient(cfg)

    # Submit URL to WebsiteCrawler.org for crawling
    client.submit_url_to_website_crawler(URL, LIMIT) #Submit the URL and Limit to websitecrawler via API

    while True:
        task_status = client.get_task_status() #Start retrieving data if the task_status is true
        print(f"{task_status} << task status")
        time.sleep(2)  #Wait for 2 seconds

        if not task_status:
            break

        if task_status:
            status = client.get_crawl_status() #get_crawl_status() method gets the crawl status
            currenturl = client.get_current_url() #get_current_url() method gets the current URL
            data = client.get_crawl_data() # get_crawl_data() method gets the structured data once crawling has completed

            if status:
                print(f"Current Status:: {status}")


            if status == "Crawling": #Crawling is one of the status
                print(f"Current URL:: {currenturl}")

            if status == "Completed!":  #Completed! (with exclamation) is one of the status
                print("Task has been completed... closing the loop and gettint the data...")
                if data:
                    print(f"JSON Data:: {data}")
                    time.sleep(20)  # Give extra time for large JSON response
                    break
            
           

    print("Job over")

if __name__ == "__main__":
    main()

Java


πŸš€ Features

  • Submit any website URL to be crawled
  • Track crawl status in real-time
  • View current URL being crawled
  • Retrieve JSON-formatted crawl data on completion

πŸ“¦ Prerequisites

  • A websitecrawler.org API key (see Authentication above)
  • The WebsiteCrawlerSDK-Java-1.0.jar library added to your project (see below)

How to use the Java library?

Download the jar file WebsiteCrawlerSDK-Java-1.0.jar and add it as a dependency in your Java project. Create the WebsiteCrawlerConfig object as shown in the following code, pass it to WebsiteCrawlerClient, and then call the methods on the WebsiteCrawlerClient object.

WebsiteCrawlerConfig config = new WebsiteCrawlerConfig("YOUR_API_KEY");
WebsiteCrawlerClient crawler = new WebsiteCrawlerClient(config);

package wc.WebsiteCrawlerAPIUsageDemo;

import wc.websitecrawlersdk.WebsiteCrawlerClient;
import wc.websitecrawlersdk.WebsiteCrawlerConfig;

/**
 *
 * @author Pramod
 */
public class WebsiteCrawlerAPIUsageDemo {

    public static void main(String[] args) throws InterruptedException {
        String status;
        String currenturl;
        String data;
        String URL = "YOUR_URL";  // replace with the non-redirecting URL you want WebsiteCrawler.org to crawl
        int LIMIT = 100;          // replace with the number of URLs to crawl
        WebsiteCrawlerConfig cfg = new WebsiteCrawlerConfig("YOUR_API_KEY"); // replace YOUR_API_KEY with your API key
        WebsiteCrawlerClient client = new WebsiteCrawlerClient(cfg);

        client.submitUrlToWebsiteCrawler(URL, LIMIT); // submit the URL and limit to websitecrawler.org via the API
        boolean taskStatus;
        while (true) {
            taskStatus = client.getTaskStatus(); //getTaskStatus() should be true before you call any methods
            System.out.println(taskStatus + "<<task status");
            Thread.sleep(9000);
            if (taskStatus == true) {
                status = client.getCrawlStatus(); // getCrawlStatus() method returns the live crawling status
                currenturl = client.getCurrentURL(); //getCurrentURL() method returns the URL being processed by WebsiteCrawler.org
                data = client.getcwData(); // getcwData() returns the JSON array of the website data;
                System.out.println("Crawl status::");
                if (status != null) {
                    System.out.println(status);
                }
                if (status != null && status.equals("Crawling")) { // status: Crawling  ----> Crawl job is in progresss
                    System.out.println("Current URL::" + currenturl);
                }
                if (status != null && status.equals("Completed!")) { // status: Completed! ---> Crawl job has completed succesfully 
                    System.out.println("Task has been completed.. closing the while loop");
                    if (data != null) {
                        System.out.println("Json Data::" + data);
                        Thread.sleep(20000); // JSON data might be huge. Thread.sleep makes the program wait until json data is retrieved
                        break; // exits the while(true) loop
                    }
                }

            }
        }
        System.out.println("job over");
    }
}

🧩 Integration Example: XML Sitemap Generator

This section highlights how the XML-Sitemap-Generator project uses the websitecrawler.org API to automate XML sitemap generation.

πŸ”„ Integration Workflow

The following steps outline the flow between the Website Crawler API and the sitemap generation logic:

  1. Start Crawling

    • Use the crawl/start endpoint to initiate crawling of your website:
      https://www.websitecrawler.org/api/crawl/start?url=example.com&limit=100&key=YOUR_API_KEY
      
  2. Fetch Crawled Data

    • Once crawling is complete, retrieve data using:
      https://www.websitecrawler.org/api/crawl/cwdata?url=example.com&key=YOUR_API_KEY
      
    • Response includes structured metadata (titles, links, status codes, etc.) in JSON format.
  3. Process and Transform

    • The XML Sitemap Generator parses the response and extracts valid URLs.
  4. Generate Sitemap

    • The extracted URLs are then converted into a compliant sitemap.xml for SEO optimization and better search engine indexing.
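
A minimal sketch of steps 3 and 4, turning a /crawl/cwdata response into a sitemap.xml. This is an illustration only, not the XML-Sitemap-Generator's actual code; it assumes the url and sc (status code) fields shown in the sample response:

from xml.sax.saxutils import escape

def build_sitemap(cwdata: dict) -> str:
    # Keep pages that returned HTTP 200 and wrap their URLs in sitemap <url> entries
    urls = [page["url"] for page in cwdata.get("status", []) if page.get("sc") == "200"]
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{entries}\n"
            "</urlset>\n")

# Example with a record like the one shown earlier in this document
sample = {"status": [{"url": "https://wptls.com", "sc": "200"}]}
print(build_sitemap(sample))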

πŸ“‚ Repository

Check out the full implementation here:
πŸ”— XML-Sitemap-Generator


For best results, ensure your API key is valid and your domain permits crawling.

##πŸ‘‹ Feedback & Support
Found a bug or need help? Open an issue or connect via websitecrawler.org