【2025】How to Use the Firecrawl API to Generate LLMs.txt

Author: LoRA | Time: 11 Mar 2025

Clean, high-quality text data is crucial for training and analyzing large language models (LLMs). Firecrawl's LLMs.txt generator API extracts structured text from a website and produces llms.txt and llms-full.txt files for LLM use. This article explains how the API works, how to call it, and what its key parameters do.

Introduction to Firecrawl LLMs.txt Generator API

Firecrawl's /llmstxt endpoint crawls a specified website and generates text data for LLM training and analysis. The API provides two output formats:

  • llms.txt : Key information and a summary of the website.

  • llms-full.txt : The complete text content of the crawled pages, suited to deeper AI training.

How the Firecrawl API works

1️⃣ Crawl the target website's pages and the links they contain
2️⃣ Extract the core text of each page, removing HTML markup and irrelevant content
3️⃣ Generate text files in two formats (summary and full version)
4️⃣ Return the data through the API for LLM training or analysis
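To make step 2 concrete, here is a minimal sketch of stripping HTML down to visible text using only Python's standard library. It is an illustration of the general idea, not Firecrawl's actual extraction logic; the names TextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A real crawler would also drop navigation, ads, and other page chrome; this sketch only removes markup, scripts, and styles.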

How to generate LLMs.txt using the Firecrawl API

1️⃣ Install the Firecrawl SDK and initialize the API

Python code examples

# Requires the SDK: pip install firecrawl-py
from firecrawl import FirecrawlApp

# Initialize the API client
firecrawl = FirecrawlApp(api_key="your_api_key")

# Define the generation parameters
params = {
    "maxUrls": 2,          # Maximum number of URLs to crawl
    "showFullText": True,  # Whether to also generate llms-full.txt
}

# Generate LLMs.txt
results = firecrawl.generate_llms_text(
    url="https://example.com",
    params=params
)

# Process the returned data
if results['success']:
    print(f"Status: {results['status']}")
    print(f"Generated Data: {results['data']}")
else:
    print(f"Error: {results.get('error', 'Unknown error')}")
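Printing the result is rarely the end goal, so the sketch below writes the generated text to disk. It assumes the data dictionary uses the llmstxt and llmsfulltxt keys shown in the example responses later in this article; save_llms_files is a hypothetical helper, not part of the Firecrawl SDK.

```python
from pathlib import Path


def save_llms_files(data, out_dir="."):
    """Write llms.txt (and llms-full.txt, if present) and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # Key names are assumed from the example responses in this article.
    for key, filename in (("llmstxt", "llms.txt"), ("llmsfulltxt", "llms-full.txt")):
        content = data.get(key)
        if content:  # llmsfulltxt may be missing or empty when showFullText is False
            path = out / filename
            path.write_text(content, encoding="utf-8")
            written.append(path)
    return written
```

With the SDK example above, this could be called as save_llms_files(results['data'], "output") once results['success'] is true.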

2️⃣ Key parameter description

  • url : The URL of the website to extract text from

  • maxUrls (optional): Maximum number of pages to be crawled, range 1-100 (default value 10)

  • showFullText (optional): Whether to generate llms-full.txt (default value False)
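The documented range and defaults can be enforced before any request is sent. The following is a small sketch under the constraints listed above (maxUrls between 1 and 100, default 10; showFullText default False); build_llmstxt_params is a hypothetical helper name, not an SDK function.

```python
def build_llmstxt_params(max_urls=10, show_full_text=False):
    """Return a params dict for the request, validating the documented limits."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_urls, bool) or not isinstance(max_urls, int):
        raise TypeError("max_urls must be an integer")
    if not 1 <= max_urls <= 100:
        raise ValueError("max_urls must be between 1 and 100")
    return {"maxUrls": max_urls, "showFullText": bool(show_full_text)}
```

Validating locally means a typo such as maxUrls=1000 fails immediately instead of after a round trip to the API.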

Monitor LLMs.txt generation status

LLMs.txt generation runs asynchronously. Use the job ID to poll the API and check whether the job has finished.

Use cURL for status checking

curl "https://api.firecrawl.dev/v1/llmstxt/job_id" \
  -H "Authorization: Bearer your_api_key"

Example responses:

  • Processing

 {
  "success": true,
  "data": {
    "llmstxt": "# Firecrawl.dev llms.txt\n\n- [Web Data Extraction Tool](https://www.firecrawl.dev/)...",
    "llmsfulltxt": "# Firecrawl.dev llms-full.txt\n\n"
  },
  "status": "processing",
  "expiresAt": "2025-03-03T23:19:18.000Z"
}

  • Completed

 {
  "success": true,
  "data": {
    "llmstxt": "# http://firecrawl.dev llms.txt\n\n- [Web Data Extraction Tool](https://www.firecrawl.dev/)...",
    "llmsfulltxt": "# http://firecrawl.dev llms-full.txt\n\n## Web Data Extraction Tool..."
  },
  "status": "completed",
  "expiresAt": "2025-03-03T22:45:50.000Z"
}
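The same status check can be automated in Python. The sketch below mirrors the cURL call above using only the standard library and repeats it until the status is no longer "processing". The function names, the polling interval, and the attempt limit are assumptions chosen for illustration, and the fetch argument is injectable so the loop can be exercised without network access.

```python
import json
import time
import urllib.request

API_BASE = "https://api.firecrawl.dev/v1/llmstxt"


def fetch_status(job_id, api_key):
    """GET the job status, mirroring the cURL call in this article."""
    request = urllib.request.Request(
        f"{API_BASE}/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_llmstxt(job_id, api_key, fetch=fetch_status, interval=2.0, max_attempts=30):
    """Poll until the job leaves the "processing" state or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch(job_id, api_key)
        if not result.get("success"):
            raise RuntimeError(result.get("error", "Unknown error"))
        if result.get("status") != "processing":
            return result
        if attempt < max_attempts - 1:
            time.sleep(interval)  # wait before asking again
    raise TimeoutError(f"Job {job_id} still processing after {max_attempts} attempts")
```

Note that a "processing" response may already carry partial data, as in the first example response, so callers that want early output can read it inside their own fetch wrapper.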

⚠️ Known Limitations (Alpha Version)

  • Only public web pages are supported, and content that is restricted or behind paywalls cannot be crawled.

  • A maximum of 5000 URLs can be crawled (alpha version limit).

  • The output format may change, so watch for official Firecrawl updates.

Billing and usage rules

  • Each URL crawled costs 1 credit

  • Control costs through maxUrls. For example, maxUrls=10 consumes up to 10 credits
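The billing rule reduces to simple arithmetic, sketched below. The helper name estimate_credits is hypothetical, and the pages_available argument rests on the assumption that a site with fewer pages than maxUrls is billed only for the pages actually crawled.

```python
def estimate_credits(max_urls, pages_available=None):
    """Estimate credits for one job at 1 credit per crawled URL."""
    if max_urls < 1:
        raise ValueError("max_urls must be at least 1")
    if pages_available is None:
        return max_urls  # worst case: every allowed URL is crawled
    # Assumption: only pages that exist are crawled and billed.
    return min(max_urls, max(pages_available, 0))
```

For instance, estimate_credits(10) gives the worst-case cost of 10 credits, while estimate_credits(10, 4) gives 4 under the stated assumption.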

Conclusion

Firecrawl's LLMs.txt generation API is well suited to AI training and data analysis: it crawls web pages quickly and produces clean, structured text. Whether you need a brief summary (llms.txt) or the full text (llms-full.txt), it covers both use cases.

If you are looking for an automated, efficient way to collect web text, the Firecrawl API is worth trying for your LLM training pipeline.