The Parse endpoint converts documents into structured, machine-readable formats and gives you fine-grained control over the parsing process. It is ideal for extracting cleaned document content to use as context for downstream processing, such as RAG pipelines, custom ingestion pipelines, and embeddings classification.
Unlike processor and workflow runs, Parse is a synchronous endpoint: it returns the parsed content directly in the response. This makes it suitable for workflows that need immediate access to document content without waiting for asynchronous processing. Expected latency depends primarily on file size.
For a deeper guide on how to use the output of this endpoint, jump to Using Parsed Output.
If you already have an Extend file id (for instance, from running a workflow or from a previous file upload), you can pass that file id to the parse endpoint so that it leverages any cached data that might be available.
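As a minimal sketch of that choice, the helper below builds a request body that references either a new file URL or an existing Extend file id. The helper and the field names (`file`, `fileUrl`, `fileId`) are hypothetical and chosen only for illustration; consult the API reference for the actual request schema.

```python
from typing import Any, Dict, Optional


def build_parse_request(file_url: Optional[str] = None,
                        file_id: Optional[str] = None) -> Dict[str, Any]:
    """Build a Parse request body (hypothetical shape, not the official schema)."""
    # Exactly one source must be given: a URL to fetch, or an existing file id.
    if (file_url is None) == (file_id is None):
        raise ValueError("Provide exactly one of file_url or file_id")
    if file_id is not None:
        # Referencing an existing file id lets the service reuse cached data.
        return {"file": {"fileId": file_id}}
    return {"file": {"fileUrl": file_url}}
```

Rejecting the case where both or neither value is supplied keeps it unambiguous which file the request refers to.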
Whether an additional vision model will be utilized for advanced signature detection.
Recommended for most use cases, but should be disabled if signature detection is not necessary and latency is a concern.
The identifier of the file that was parsed. This can be used as a parameter to other Extend endpoints, such as processor runs. This allows downstream processing to reuse a cache of the parsed file content to reduce your usage costs.
The Parse API returns document content in a structured format that provides both high-level formatted content and detailed block-level information. Understanding how to work with this output will help you get the most value from the parsing service.
Each chunk (currently only page-level chunks are supported) contains two key properties:
content: A fully formatted representation of the entire chunk in the target format (e.g., markdown). This is ready to use as-is if you need the complete formatted content of a page.
blocks: An array of individual content blocks that make up the chunk, each with its own formatting, position information, and metadata.
Use chunk.content when:
You need the complete, properly formatted content of a page, with the logical placement of blocks already done for you (e.g., markdown sections grouped and content ordered spatially)
You want to display or process the document content as a whole (and can just combine all chunk.content values)
You’re integrating with systems that expect formatted text (e.g., markdown processors)
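Combining all chunk.content values might look like the sketch below. It assumes the response has already been decoded into a list of chunk dictionaries; the sample values and the `join_chunk_content` helper are made up for illustration.

```python
from typing import Any, Dict, List


def join_chunk_content(chunks: List[Dict[str, Any]], separator: str = "\n\n") -> str:
    """Concatenate the formatted content of every chunk, in order."""
    return separator.join(chunk["content"] for chunk in chunks)


# Sample page-level chunks (illustrative values only).
chunks = [
    {"content": "# Invoice\n\nTotal: $42", "blocks": []},
    {"content": "## Terms\n\nNet 30", "blocks": []},
]
document_markdown = join_chunk_content(chunks)
```

A blank line is used as the separator so that markdown from adjacent pages does not run together.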
Use chunk.blocks when:
You need to work with specific elements of the document (e.g., only tables or figures)
You need spatial information about where content appears on the page, perhaps to build citation systems
You’re building a UI that shows or highlights specific document elements
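Selecting specific elements from chunk.blocks can be sketched as a simple filter. This assumes each block carries a `type` field and its own `content`; the field name `type` and the values `"table"` and `"text"` are assumptions for illustration, not confirmed by this page.

```python
from typing import Any, Dict, Iterator, List


def iter_blocks_of_type(chunks: List[Dict[str, Any]],
                        block_type: str) -> Iterator[Dict[str, Any]]:
    """Yield every block of the given type across all chunks, in document order."""
    for chunk in chunks:
        for block in chunk.get("blocks", []):
            if block.get("type") == block_type:  # `type` is an assumed field name
                yield block


# Sample data (illustrative values only).
sample_chunks = [
    {"content": "...", "blocks": [
        {"type": "text", "content": "Intro"},
        {"type": "table", "content": "| a | b |"},
    ]},
    {"content": "...", "blocks": [
        {"type": "table", "content": "| c | d |"},
    ]},
]
tables = [block["content"] for block in iter_blocks_of_type(sample_chunks, "table")]
```

Because the helper yields whole blocks, any position information or metadata on a block stays available to the caller, for example to build citations or highlights.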
By leveraging both the formatted content and the structured block information, you can build powerful document processing workflows that combine the convenience of formatted text with the precision of block-level access.
We provide custom error codes so that your system can tell what happened when a request fails. The response body also includes a retryable field (true or false), and you can find a per-code breakdown below. Most errors are client errors related to the file provided for parsing and are not retryable.
| Error Code | Description | Retryable |
| --- | --- | --- |
| INVALID_CONFIG_OPTIONS | Invalid combination of options in the incoming config. | ❌ |
| UNABLE_TO_DOWNLOAD_FILE | The system could not download the file from the provided URL. This usually means your presigned URL has expired or is malformed. | ❌ |
| FILE_TYPE_NOT_SUPPORTED | The file type is not supported for parsing. | ❌ |
| FILE_SIZE_TOO_LARGE | The file exceeds the maximum allowed size. | ❌ |
| CORRUPT_FILE | The file is corrupt and cannot be parsed. | ❌ |
| OCR_ERROR | An error occurred in the OCR system. This is a rare error code that would indicate downtime, so requests can be retried. We’d suggest applying a retry with backoff for this error. | ✅ |
| PASSWORD_PROTECTED_FILE | The file is password protected and cannot be parsed. | ❌ |
| FAILED_TO_CONVERT_TO_PDF | The system could not convert the file to PDF format. | ❌ |
| FAILED_TO_GENERATE_TARGET_FORMAT | The system could not generate the requested target format. | ❌ |
| INTERNAL_ERROR | An unexpected internal error occurred. We’d suggest applying a retry with backoff for this error, as it is likely the result of an outage. | ✅ |
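The retry guidance above can be sketched as a small exponential-backoff wrapper. It assumes the error body exposes the `retryable` field alongside a `code` field holding the error code; the `code` field name, the `send_request` callable, and both helpers are hypothetical and stand in for whatever HTTP client you use.

```python
import time
from typing import Any, Callable, Dict, Tuple

# Codes for which the descriptions above suggest retrying with backoff.
RETRYABLE_CODES = {"OCR_ERROR", "INTERNAL_ERROR"}


def should_retry(error_body: Dict[str, Any]) -> bool:
    """Prefer the explicit `retryable` field; fall back to the error code."""
    if "retryable" in error_body:
        return bool(error_body["retryable"])
    return error_body.get("code") in RETRYABLE_CODES  # `code` is an assumed field name


def parse_with_retry(send_request: Callable[[], Tuple[bool, Dict[str, Any]]],
                     max_attempts: int = 4,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Call send_request until it succeeds, fails non-retryably, or attempts run out.

    send_request is a hypothetical callable returning (ok, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        ok, body = send_request()
        if ok:
            return body
        last_attempt = attempt == max_attempts - 1
        if last_attempt or not should_retry(body):
            raise RuntimeError(f"Parse failed: {body.get('code')}")
        # Exponential backoff: base_delay, then 2x, 4x, ...
        sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the wrapper easy to test without real delays. In production you may also want to add jitter so that many clients do not retry in lockstep.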