Clean, high-quality text data is crucial for training large language models (LLMs) and for data analysis. Firecrawl's LLMs.txt generator API extracts structured text from any website and generates llms.txt and llms-full.txt files for use with LLMs. This article explains how it works, how to use it, and what its key parameters do, so that you can start using the tool quickly.
Firecrawl's /llmstxt endpoint crawls the content of a specified website and generates text data for LLM training and analysis. The API provides two text output formats:
llms.txt: Contains the key information and a summary of the website.
llms-full.txt: Contains the complete text content of the pages, suitable for deeper AI training.
1️⃣ Crawl the pages of the target website and the links they contain
2️⃣ Extract the core text of each page, removing HTML markup and irrelevant content
3️⃣ Generate text files in two formats (concise and full version)
4️⃣ Return the data through the API for LLM training or analysis
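To make step 2 concrete, here is a minimal sketch of stripping HTML markup from a page using only Python's standard library. It is an illustration of the general idea, not Firecrawl's actual extraction logic, and the names TextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b>.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

A production crawler would also need to drop navigation, ads, and other boilerplate, which this sketch does not attempt.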
Python code examples
    from firecrawl import FirecrawlApp

    # Initialize the API client
    firecrawl = FirecrawlApp(api_key="your_api_key")

    # Define the generation parameters
    params = {
        "maxUrls": 2,          # Maximum number of URLs to crawl
        "showFullText": True,  # Whether to include the full text
    }

    # Generate LLMs.txt
    results = firecrawl.generate_llms_text(
        url="https://example.com",
        params=params,
    )

    # Process the returned data
    if results['success']:
        print(f"Status: {results['status']}")
        print(f"Generated Data: {results['data']}")
    else:
        print(f"Error: {results.get('error', 'Unknown error')}")
url: The URL of the website to extract text from
maxUrls (optional): Maximum number of pages to crawl, from 1 to 100 (default: 10)
showFullText (optional): Whether to also generate llms-full.txt (default: False)
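The parameter rules above can be checked on the client side before a request is sent. The following is a minimal sketch under that assumption; build_llmstxt_params is a hypothetical helper and not part of the Firecrawl SDK. It only encodes the range and defaults listed above.

```python
def build_llmstxt_params(max_urls=10, show_full_text=False):
    """Build the params dict, enforcing the documented 1-100 range.

    Hypothetical helper: the defaults mirror the values listed in this
    article (maxUrls=10, showFullText=False).
    """
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(max_urls, bool) or not isinstance(max_urls, int):
        raise TypeError("max_urls must be an integer")
    if not 1 <= max_urls <= 100:
        raise ValueError("max_urls must be between 1 and 100")
    return {"maxUrls": max_urls, "showFullText": bool(show_full_text)}


print(build_llmstxt_params(max_urls=2, show_full_text=True))
```

Validating locally avoids spending a request on a value the API would reject anyway.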
LLMs.txt generation runs asynchronously, so the job status has to be polled through the API until the job finishes.
Check the status with cURL
curl "https://api.firecrawl.dev/v1/llmstxt/job_id" -H "Authorization: Bearer your_api_key"
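The same status check can be wrapped in a polling loop. The sketch below reuses the endpoint and Authorization header from the cURL command and relies only on the standard library. The helper names, the polling interval, and the attempt limit are assumptions for illustration, and the loop only knows the "processing" and "completed" statuses that appear in this article.

```python
import json
import time
import urllib.request

# Endpoint taken from the cURL command above.
API_BASE = "https://api.firecrawl.dev/v1/llmstxt"


def fetch_status(job_id, api_key):
    """Send one GET request for the job status and return the parsed JSON."""
    request = urllib.request.Request(
        f"{API_BASE}/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_job(job_id, api_key, fetch=fetch_status,
                 interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll until the job reports "completed" (hypothetical helper)."""
    for attempt in range(max_attempts):
        result = fetch(job_id, api_key)
        if not result.get("success"):
            raise RuntimeError(result.get("error", "Unknown error"))
        status = result.get("status")
        if status == "completed":
            return result
        if status != "processing":
            # Any other status is treated as unexpected in this sketch.
            raise RuntimeError(f"Unexpected status: {status!r}")
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(
        f"Job {job_id} still processing after {max_attempts} checks"
    )
```

Calling wait_for_job("job_id", "your_api_key") would block until the job completes or the attempt limit is reached. The fetch and sleep arguments are injectable so the loop can be exercised without network access.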
Example responses:
Processing
    {
      "success": true,
      "data": {
        "llmstxt": "# Firecrawl.dev llms.txt\n\n- [Web Data Extraction Tool](https://www.firecrawl.dev/)...",
        "llmsfulltxt": "# Firecrawl.dev llms-full.txt\n\n"
      },
      "status": "processing",
      "expiresAt": "2025-03-03T23:19:18.000Z"
    }
Completed
    {
      "success": true,
      "data": {
        "llmstxt": "# http://firecrawl.dev llms.txt\n\n- [Web Data Extraction Tool](https://www.firecrawl.dev/)...",
        "llmsfulltxt": "# http://firecrawl.dev llms-full.txt\n\n## Web Data Extraction Tool..."
      },
      "status": "completed",
      "expiresAt": "2025-03-03T22:45:50.000Z"
    }
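Once a response reports "completed", the two strings under data can be written to disk as llms.txt and llms-full.txt. The sketch below assumes the response shape shown in the examples above. save_llms_files is a hypothetical helper, and the output filenames are a convention chosen here, not something the API dictates.

```python
from pathlib import Path


def save_llms_files(response, out_dir="."):
    """Write the generated text from a completed response to files.

    Hypothetical helper: expects the "status" and "data" fields shown in
    the example responses and returns the list of paths it wrote.
    """
    if response.get("status") != "completed":
        raise ValueError("job has not completed yet")

    data = response.get("data", {})
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    written = []
    for key, filename in (("llmstxt", "llms.txt"),
                          ("llmsfulltxt", "llms-full.txt")):
        content = data.get(key)
        if content:  # llmsfulltxt may be absent when showFullText is off
            path = target / filename
            path.write_text(content, encoding="utf-8")
            written.append(path)
    return written
```

Refusing to write while the job is still processing avoids saving the partial content that the "processing" response can contain.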
Only public web pages are supported; content that is access-restricted or behind a paywall cannot be crawled.
At most 5,000 URLs can be crawled (a limit of the alpha version).
The output format may change, so keep an eye on official Firecrawl updates.
Each crawled URL costs 1 credit.
Use maxUrls to control costs. For example, a job with maxUrls=10 consumes at most 10 credits.
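The billing rule can be expressed as a small calculation. This sketch assumes that only URLs actually crawled are billed, at the 1 credit per URL stated above. estimate_credits is a hypothetical helper, and pages_discovered stands for however many pages the crawl would find.

```python
def estimate_credits(pages_discovered, max_urls=10, credits_per_url=1):
    """Estimate the credits for one job (hypothetical helper).

    Assumes the crawl stops at max_urls and that each crawled URL
    costs credits_per_url credits.
    """
    if pages_discovered < 0 or max_urls < 1:
        raise ValueError("pages_discovered must be >= 0 and max_urls >= 1")
    return min(pages_discovered, max_urls) * credits_per_url


# A large site capped at maxUrls=10 versus a small site with 3 pages
print(estimate_credits(250, max_urls=10))
print(estimate_credits(3, max_urls=10))
```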
Firecrawl's LLMs.txt generation API is well suited to AI training and data analysis: it crawls web pages quickly and produces clean, structured text data. Whether you need a brief summary (llms.txt) or the full text (llms-full.txt), it can cover different LLM needs.
If you are looking for an automated and efficient data crawling solution, try the Firecrawl API to streamline the data preparation for your LLM training!