Parse File

The Parse endpoint converts documents into structured, machine-readable formats with fine-grained control over the parsing process. It is ideal for extracting cleaned document content to be used as context for downstream processing, e.g. RAG pipelines, custom ingestion pipelines, embeddings classification, etc.

Unlike processor and workflow runs, parsing is synchronous: the endpoint returns the parsed content in the response, which makes it suitable for workflows where you need immediate access to document content without waiting for asynchronous processing. Expected latency depends primarily on file size. For a deeper guide on how to use the output of this endpoint, jump to Using Parsed Output.

Body

file

object

required

A file object containing either a URL or base64 encoded content. Must contain either fileUrl or fileId.

fileName

string

The name of the file. If not set, the name is taken from the URL, or generated in the case of a raw upload.

fileUrl

string

A URL for the file. For production use cases, we recommend using presigned URLs with a 5-15 minute expiration time.

fileId

string

If you already have an Extend file ID (for instance, from running a workflow or from a previous file upload), you can pass it to the parse endpoint so that it leverages any cached data that might be available.

config

object

required

Configuration options for the parsing process.

target

string

default:"markdown"

The target format for the parsed content. Supported values:

markdown: Convert document to Markdown format
spatial: Preserve spatial information in the output

chunkingStrategy

object

Strategy for dividing the document into chunks.

type

string

default:"page"

The type of chunking strategy. Supported values:

page: Chunk document by pages.
document: Entire document is a single chunk. Essentially no chunking.
section: Split by logical sections. Not supported for target=spatial.

minCharacters

number

Specify a minimum number of characters per chunk.

maxCharacters

number

Specify a maximum number of characters per chunk.

blockOptions

object

Options for controlling how different block types are processed.

figures

object

Options for figure blocks.

enabled

boolean

default:"true"

Whether to include figures in the output.

figureImageClippingEnabled

boolean

default:"true"

Whether to clip and extract images from figures.

tables

object

Options for table blocks.

enabled

boolean

default:"true"

Whether to include tables in the output.

targetFormat

string

default:"markdown"

The target format for the table blocks. Supported values:

markdown: Convert table to Markdown format
html: Convert table to HTML format

text

object

Options for text blocks.

signatureDetectionEnabled

boolean

default:"true"

Whether an additional vision model is used for advanced signature detection. Recommended for most use cases, but should be disabled if signature detection is not necessary and latency is a concern.

advancedOptions

object

Advanced parsing options.

pageRotationEnabled

boolean

default:"true"

Whether to automatically detect and correct page rotation.
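
The constraints listed above (a file needs fileUrl or fileId, and section chunking is not supported for target=spatial) can be checked client-side before a request is sent. The following is a minimal sketch rather than an official client: buildParseRequest is a hypothetical helper name, and the error messages are illustrative only.

```javascript
// Sketch only: buildParseRequest is a hypothetical helper, not part of any SDK.
// It applies the documented defaults and rejects combinations the reference
// above describes as unsupported.
const TARGETS = ["markdown", "spatial"];
const CHUNK_TYPES = ["page", "document", "section"];

function buildParseRequest(file, config = {}) {
  // The file object must carry a fileUrl or a fileId.
  if (!file || (!file.fileUrl && !file.fileId)) {
    throw new Error("file must contain either fileUrl or fileId");
  }

  // Documented defaults: target "markdown", chunking type "page".
  const target = config.target ?? "markdown";
  const chunkType = config.chunkingStrategy?.type ?? "page";

  if (!TARGETS.includes(target)) {
    throw new Error(`Unsupported target: ${target}`);
  }
  if (!CHUNK_TYPES.includes(chunkType)) {
    throw new Error(`Unsupported chunking type: ${chunkType}`);
  }
  // Section chunking is documented as not supported for target=spatial.
  if (target === "spatial" && chunkType === "section") {
    throw new Error("section chunking is not supported for target=spatial");
  }

  return {
    file,
    config: {
      ...config,
      target,
      chunkingStrategy: { ...config.chunkingStrategy, type: chunkType },
    },
  };
}
```

The returned object can be passed as the request body to the parse endpoint. Server-side validation still applies, so this only catches the most obvious mistakes early.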

Response

object

string

The type of object. Will always be “parser_run”.

string

A unique identifier for the parser run.

fileId

string

The identifier of the file that was parsed. This can be used as a parameter to other Extend endpoints, such as processor runs. This allows downstream processing to reuse a cache of the parsed file content to reduce your usage costs.

chunks

array

An array of chunks extracted from the document.

object

string

The type of object. Will be “chunk”.

type

string

The type of chunk (e.g., “page”).

content

string

The textual content of the chunk in the specified target format.

metadata

object

Metadata about the chunk.

pageRange

object

The page range this chunk covers. Often a chunk covers a single page or only part of one, in which case start and end will be the same.

start

number

The starting page number (inclusive).

end

number

The ending page number (inclusive).

blocks

array

An array of block objects that make up the chunk. See the Block object documentation for more detailed information about block structure and types.

object

string

The type of object. Will be “block”.

string

A unique identifier for the block.

type

string

The type of block. Possible values include:

text: Regular text content
heading: Section or document headings
section_heading: Subsection headings
table: Tabular data with rows and columns
figure: Images, charts, or diagrams

content

string

The textual content of the block.

details

object

Additional details specific to the block type.

metadata

object

Metadata about the block.

page

object

Information about the page this block appears on.

number

The page number.

width

number

The width of the page in inches.

height

number

The height of the page in inches.

polygon

array

An array of points defining the polygon that bounds the block.

boundingBox

object

A simplified bounding box for the block.

status

string

The status of the parser run. Possible values:

PROCESSED: The file was successfully processed
FAILED: The processing failed (see failureReason for details)

failureReason

string

The reason for failure if status is “FAILED”. Will be null for successful runs.

config

object

The configuration used for the parsing process, including any default values that were applied.

metrics

object

Metrics about the parsing process.

processingTimeMs

number

The time taken to process the document in milliseconds.

pageCount

number

The number of pages in the document.

const axios = require("axios");

const parseDocument = async () => {
  try {
    const response = await axios.post(
      "https://api-prod.extend.app/parse",
      {
        file: {
          fileName: "example.pdf",
          fileUrl: "https://example.com/documents/example.pdf",
        },
        config: {
          target: "markdown",
          chunkingStrategy: {
            type: "page",
          },
          blockOptions: {
            figures: {
              enabled: true,
              figureImageClippingEnabled: true,
            },
            tables: {
              enabled: true,
            },
            text: {
              enabled: true,
              styleFormattingEnabled: true,
            },
          },
        },
      },
      {
        headers: {
          Authorization: "Bearer <API_TOKEN>",
          "Content-Type": "application/json",
        },
      }
    );

    console.log("Document parsed successfully:", response.data);
  } catch (error) {
    console.error("Error:", error.response?.data || error.message);
  }
};

parseDocument();
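
Once a response arrives, the status, failureReason, and metrics fields described in the Response section can be condensed into a small summary before the chunks are processed. This is a sketch under the assumption that the response body matches that reference; summarizeParserRun is a hypothetical helper name.

```javascript
// Sketch only: summarizeParserRun is a hypothetical helper that reads the
// documented top-level response fields (status, failureReason, fileId,
// chunks, metrics) and tolerates missing ones.
function summarizeParserRun(run) {
  if (run.status === "FAILED") {
    // failureReason is documented as set only when status is "FAILED".
    return { ok: false, reason: run.failureReason ?? "unknown" };
  }

  return {
    ok: run.status === "PROCESSED",
    fileId: run.fileId, // reusable as a parameter to other endpoints
    chunkCount: Array.isArray(run.chunks) ? run.chunks.length : 0,
    pageCount: run.metrics?.pageCount,
    processingTimeMs: run.metrics?.processingTimeMs,
  };
}
```

Calling summarizeParserRun(response.data) inside the example above would give a compact object suitable for logging.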

Using Parsed Output

The Parse API returns document content in a structured format that provides both high-level formatted content and detailed block-level information. Understanding how to work with this output will help you get the most value from the parsing service.

Working with Chunks

Each chunk (a page by default, depending on the configured chunkingStrategy) contains two key properties:

content: A fully formatted representation of the entire chunk in the target format (e.g., markdown). This is ready to use as-is if you need the complete formatted content of a page.
blocks: An array of individual content blocks that make up the chunk, each with its own formatting, position information, and metadata.

When to use `chunk.content` vs. `chunk.blocks`

Use chunk.content when:
- You need the complete, properly formatted content of a page, with the blocks already placed logically (e.g. markdown sections grouped, spatial layout applied)
- You want to display or process the document content as a whole (you can simply concatenate all chunk.content values)
- You’re integrating with systems that expect formatted text (e.g., markdown processors)

Use chunk.blocks when:
- You need to work with specific elements of the document (e.g., only tables or figures)
- You need spatial information about where content appears on the page, for example to build citation systems
- You’re building a UI that shows or highlights specific document elements
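
As a rough illustration of the two access patterns, the sketch below joins chunk.content values into a single string and, separately, walks chunk.blocks to count blocks per type. Both function names are hypothetical, and the separator is an arbitrary choice.

```javascript
// Sketch only: hypothetical helpers over a parse result shaped like the
// Response reference (chunks[] with content and blocks[]).

// Whole-document view: concatenate the formatted content of every chunk.
function joinChunkContent(parseResult, separator = "\n\n") {
  return (parseResult.chunks ?? [])
    .map((chunk) => chunk.content ?? "")
    .filter((text) => text.length > 0)
    .join(separator);
}

// Element-level view: count blocks per type (text, heading, table, ...).
function countBlocksByType(parseResult) {
  const counts = {};
  for (const chunk of parseResult.chunks ?? []) {
    for (const block of chunk.blocks ?? []) {
      counts[block.type] = (counts[block.type] ?? 0) + 1;
    }
  }
  return counts;
}
```

The joined string suits embedding or prompt context, while the per-type counts are a quick way to check whether a document contains tables or figures worth extracting.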

Example: Extracting specific content types

// Extract all tables from a document
function extractTables(parseResult) {
  const tables = [];
  
  parseResult.chunks.forEach(chunk => {
    chunk.blocks.forEach(block => {
      if (block.type === 'table') {
        tables.push({
          content: block.content,
          pageNumber: block.metadata.page.number,
          position: block.boundingBox
        });
      }
    });
  });
  
  return tables;
}

// Extract all figures with their images
function extractFigures(parseResult) {
  const figures = [];
  
  parseResult.chunks.forEach(chunk => {
    chunk.blocks.forEach(block => {
      if (block.type === 'figure' && block.details.imageUrl) {
        figures.push({
          caption: block.content,
          imageUrl: block.details.imageUrl,
          figureType: block.details.figureType,
          pageNumber: block.metadata.page.number
        });
      }
    });
  });
  
  return figures;
}

Example: Reconstructing content with custom formatting

// Extract headings and their content to create a table of contents
function createTableOfContents(parseResult) {
  const toc = [];
  
  parseResult.chunks.forEach(chunk => {
    chunk.blocks.forEach(block => {
      if (block.type === 'heading' || block.type === 'section_heading') {
        toc.push({
          title: block.content,
          pageNumber: block.metadata.page.number
        });
      }
    });
  });
  
  return toc;
}

Spatial Information

Each block contains spatial information in the form of a polygon (precise outline) and a simplified boundingBox. This information can be used to:

Highlight specific content in a document viewer
Create visual overlays on top of the original document
Understand the reading order and layout of the document

// Create highlight coordinates for a document viewer
function createHighlights(parseResult, searchTerm) {
  const highlights = [];
  
  parseResult.chunks.forEach(chunk => {
    chunk.blocks.forEach(block => {
      if (block.type === 'text' && block.content.includes(searchTerm)) {
        highlights.push({
          pageNumber: block.metadata.page.number,
          boundingBox: block.boundingBox
        });
      }
    });
  });
  
  return highlights;
}

By leveraging both the formatted content and the structured block information, you can build powerful document processing workflows that combine the convenience of formatted text with the precision of block-level access.
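
A document viewer usually renders one page at a time, so it can help to index blocks by page before drawing overlays. The sketch below reads the page number from metadata.page.number as listed in the Response reference; groupBlocksByPage is a hypothetical name, and reading polygon and boundingBox directly from the block is an assumption about where those fields live.

```javascript
// Sketch only: groups blocks by page so a viewer can draw overlays per page.
// Assumes polygon and boundingBox sit directly on each block.
function groupBlocksByPage(parseResult) {
  const byPage = new Map();

  for (const chunk of parseResult.chunks ?? []) {
    for (const block of chunk.blocks ?? []) {
      const pageNumber = block.metadata?.page?.number;
      if (pageNumber === undefined) continue; // skip blocks without page info

      if (!byPage.has(pageNumber)) byPage.set(pageNumber, []);
      byPage.get(pageNumber).push({
        type: block.type,
        polygon: block.polygon,
        boundingBox: block.boundingBox,
      });
    }
  }

  return byPage;
}
```

The result is a Map keyed by page number, so a viewer can look up byPage.get(currentPage) when a page is rendered.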

Error Response Format

When an error occurs, the API returns a structured error response with the following fields:

code

string

A specific error code that identifies the type of error.

message

string

A human-readable description of the error.

requestId

string

A unique identifier for the request, useful for troubleshooting.

retryable

boolean

Indicates whether retrying the request might succeed.

Custom Error Codes

The API may return the following specific error codes:

We provide custom error codes to make it easier for your system to determine what happened when a request fails. The response body also includes a retryable field (true or false), and the table below gives the breakdown per code. Most errors are not retryable: they are client errors related to the file provided for parsing.

Error Code	Description	Retryable
`INVALID_CONFIG_OPTIONS`	Invalid combination of options in the incoming config.	❌
`UNABLE_TO_DOWNLOAD_FILE`	The system could not download the file from the provided URL. This usually means your presigned URL has expired or is malformed.	❌
`FILE_TYPE_NOT_SUPPORTED`	The file type is not supported for parsing.	❌
`FILE_SIZE_TOO_LARGE`	The file exceeds the maximum allowed size.	❌
`CORRUPT_FILE`	The file is corrupt and cannot be parsed.	❌
`OCR_ERROR`	An error occurred in the OCR system. This is a rare error code and would indicate downtime, so requests can be retried. We’d suggest applying a retry with backoff for this error.	✅
`PASSWORD_PROTECTED_FILE`	The file is password protected and cannot be parsed.	❌
`FAILED_TO_CONVERT_TO_PDF`	The system could not convert the file to PDF format.	❌
`FAILED_TO_GENERATE_TARGET_FORMAT`	The system could not generate the requested target format.	❌
`INTERNAL_ERROR`	An unexpected internal error occurred. We’d suggest applying a retry with backoff for this error, as it is likely the result of an outage.	✅
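
The retry-with-backoff suggestion for OCR_ERROR and INTERNAL_ERROR could look roughly like the sketch below. It assumes errors are shaped like axios errors (error.response.data carrying the retryable field); withRetry, maxAttempts, and baseDelayMs are hypothetical names, and the delay values are arbitrary.

```javascript
// Sketch only: exponential backoff driven by the documented retryable flag.
// Assumes axios-style errors, i.e. the error body lives at error.response.data.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(requestFn, { maxAttempts = 4, baseDelayMs = 500 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await requestFn();
    } catch (error) {
      const retryable = error.response?.data?.retryable === true;
      // Give up on non-retryable errors or once the attempt budget is spent.
      if (!retryable || attempt >= maxAttempts) throw error;
      // Wait baseDelayMs, then 2x, then 4x, ... before the next attempt.
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

A call such as withRetry(() => axios.post(url, body, options)) would then retry only the failures the API marks as retryable and rethrow everything else immediately.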

HTTP error codes

The HTTP status codes below correspond to the different types of failures. We generally recommend relying on our custom error codes for programmatic handling.

400 Bad Request

Returned when:

Required fields are missing (e.g., file)
Neither fileUrl nor fileBase64 is provided in the file object
The provided fileUrl is invalid
The provided fileBase64 is invalid
The config contains invalid values (e.g., unsupported target format or chunking strategy)
The file type is not supported
The file size is too large

401 Unauthorized

Returned when:

The API token is missing
The API token is invalid

403 Forbidden

Returned when:

The authenticated workspace doesn’t have permission to use the parse functionality
The API token doesn’t have sufficient permissions

422 Unprocessable Entity

Returned when:

The file is corrupt and cannot be parsed
The file is password protected
The file could not be converted to PDF
The system failed to generate the target format

500 Internal Server Error

Returned when:

An OCR error occurs
A chunking error occurs
Any other unexpected error occurs during parsing
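
One way to follow the recommendation to prefer custom error codes is to consult the response body first and fall back to the HTTP status only when the body is missing or unparseable. In the sketch below, shouldRetry is a hypothetical name, and treating every 5xx status as retryable in the fallback path is an assumption rather than documented behavior.

```javascript
// Sketch only: prefer the documented retryable field; fall back to the
// HTTP status when no usable error body is available.
function shouldRetry(status, body) {
  if (body && typeof body.retryable === "boolean") {
    return body.retryable;
  }
  // Assumption: without a body, only server-side (5xx) failures are retried.
  return status >= 500;
}
```

This keeps the custom error body as the source of truth while still behaving sensibly for responses that never reached the parsing service, such as gateway errors.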

Handling Errors

Here is an example of how to handle errors from the Parse API:

const axios = require("axios");

const parseDocument = async () => {
  try {
    const response = await axios.post(
      "https://api-prod.extend.app/parse",
      {
        file: {
          fileName: "example.pdf",
          fileUrl: "https://example.com/documents/example.pdf",
        },
        config: {
          target: "markdown",
        },
      },
      {
        headers: {
          Authorization: "Bearer <API_TOKEN>",
          "Content-Type": "application/json",
        },
      }
    );

    console.log("Document parsed successfully:", response.data);
    return response.data;
  } catch (error) {
    if (error.response) {
      // Fall back to an empty object if the error body is missing
      const { code, message, requestId, retryable } = error.response.data ?? {};
      
      // Handle specific error codes
      switch (code) {
        case "FILE_TYPE_NOT_SUPPORTED":
          console.error("Unsupported file type. Please use a supported format.");
          break;
        case "PASSWORD_PROTECTED_FILE":
          console.error("The file is password protected. Please provide an unprotected file.");
          break;
        case "CORRUPT_FILE":
          console.error("The file is corrupt and cannot be processed.");
          break;
        case "FILE_SIZE_TOO_LARGE":
          console.error("The file is too large. Please reduce the file size.");
          break;
        default:
          console.error(`Error (${code}): ${message}`);
      }
      
      // Log request ID for troubleshooting
      console.error(`Request ID: ${requestId}`);
      
      // Potentially retry if the error is retryable
      if (retryable) {
        console.log("This error is retryable. Consider retrying the request.");
      }
    } else {
      console.error("Network error:", error.message);
    }
    
    throw error;
  }
};
