API Documentation
The File object
{
  "object": "file",
  "id": "file_1234",
  "name": "example_file",
  "type": "PDF",
  "presignedUrl": "https://s3.example.com/file_1234.pdf",
  "parentFileId": "file_5678", // Optional, only set if this file is a derivative of another file
  "contents": {
    "rawText": "This is the raw text content of the file...",
    "pages": [
      {
        "pageNumber": 1,
        "markdown": "This is the markdown content of the page..."
      }
    ]
  },
  "metadata": {
    "parentSplit": { // Optional, only set if this file is a derivative of another file
      "id": "324kjlfsd",
      "type": "addendum",
      "identifier": "addendum_1",
      "startPage": 7,
      "endPage": 9
    }
  },
  "createdAt": "2024-01-01T00:00:00Z",
  "updatedAt": "2024-01-01T00:00:00Z"
}
The File object represents a file in Extend. Files are created for each workflow run and can also be created directly via the API for use in evaluation sets.
object: The type of the object. For a File, this is always "file".
id: The file ID.
name: The name of the file.
type: The Extend-normalized type of the file. One of IMG, PDF, TXT, DOCX, CSV, EXCEL.
parentFileId: The ID of the parent file. Only included if this file is a derivative of another file, for instance if it was created via a Splitter in a workflow.
presignedUrl: A presigned URL to download the file. Expires after 15 minutes.
contents.rawText: The raw text content of the file. Included for all file types if the rawText query parameter is set to true in the endpoint request.
contents.pages: An array of page objects representing the content of each page in the file.
contents.pages[].pageNumber: The page number of this page in the document.
contents.pages[].markdown: Cleaned and structured markdown content of the page. Available for PDF and IMG file types. Only included if the markdown query parameter is set to true in the endpoint request.
contents.pages[].html: Cleaned and structured HTML content of the page. Available for DOCX file types (that were not auto-converted to PDFs). Only included if the html query parameter is set to true in the endpoint request.
The number of pages in the file. Only set for PDF and DOCX files.
metadata.parentSplit: The split metadata details. Only included if this file is a derivative of another file, for instance if it was created via a Splitter in a workflow.
createdAt: The date and time the file was created.
updatedAt: The date and time the file was last updated.
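The fields above can be mapped onto typed records on the client side. The following is a minimal sketch, not an official client: the class names, `parse_file`, and `presigned_url_expired` are hypothetical helpers, the `html` key on a page is assumed to match the name of its query parameter, and the 15-minute lifetime is assumed to start when the payload is fetched.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Any, Optional

# Assumption: the 15-minute lifetime is counted from the moment the payload was fetched.
PRESIGNED_URL_TTL = timedelta(minutes=15)


@dataclass
class Page:
    page_number: int
    markdown: Optional[str] = None  # present only when requested via the markdown parameter
    html: Optional[str] = None      # assumed key name; not shown in the example payload


@dataclass
class File:
    id: str
    name: str
    type: str
    presigned_url: Optional[str]
    parent_file_id: Optional[str]  # set only for derivative files
    raw_text: Optional[str]        # present only when requested via the rawText parameter
    pages: list[Page]
    created_at: datetime
    updated_at: datetime


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp such as 2024-01-01T00:00:00Z into an aware datetime."""
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11, so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def parse_file(payload: dict[str, Any]) -> File:
    """Convert a decoded File payload (a dict) into a File record."""
    if payload.get("object") != "file":
        raise ValueError("payload is not a File object")
    contents = payload.get("contents") or {}
    pages = [
        Page(p["pageNumber"], p.get("markdown"), p.get("html"))
        for p in contents.get("pages") or []
    ]
    return File(
        id=payload["id"],
        name=payload["name"],
        type=payload["type"],
        presigned_url=payload.get("presignedUrl"),
        parent_file_id=payload.get("parentFileId"),
        raw_text=contents.get("rawText"),
        pages=pages,
        created_at=parse_timestamp(payload["createdAt"]),
        updated_at=parse_timestamp(payload["updatedAt"]),
    )


def presigned_url_expired(fetched_at: datetime, now: datetime) -> bool:
    """Return True once the assumed 15-minute lifetime has elapsed."""
    return now - fetched_at >= PRESIGNED_URL_TTL
```

Optional fields are read with `.get()` so that payloads fetched without the content query parameters still parse. A caller would typically re-fetch the File object when `presigned_url_expired` returns True rather than reuse a stale URL.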
Note: Some deprecated fields remain in the payload for backwards compatibility:
- For IMG files, markdown and rawText fields that are not nested under the pages array. These will still be included in payloads until full deprecation in December 2024.
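Clients that still read the deprecated fields can move to the nested form while keeping a fallback during the transition. The sketch below is illustrative only: `page_markdown` is a hypothetical helper, and it assumes the deprecated markdown for IMG files sits directly under `contents`, which the note above does not state explicitly.

```python
from typing import Any, Optional


def page_markdown(file_payload: dict[str, Any], page_number: int = 1) -> Optional[str]:
    """Return the markdown for one page, preferring the nested pages array."""
    contents = file_payload.get("contents") or {}

    # Preferred location: contents.pages[].markdown
    for page in contents.get("pages") or []:
        if page.get("pageNumber") == page_number and page.get("markdown") is not None:
            return page["markdown"]

    # Legacy fallback. Assumption: deprecated IMG payloads carry markdown
    # directly under "contents" rather than under "pages".
    if file_payload.get("type") == "IMG":
        return contents.get("markdown")
    return None
```

Reading the nested location first means the fallback branch can simply be deleted once the deprecated fields are gone.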