The File object represents a file in Extend. Files are created for each workflow run and can also be created directly via the API for use in evaluation sets.

object
string

The type of the object. For a File object, this is always “file”.

id
string

The file ID.

name
string

The name of the file.

type
string

The Extend-normalized type of the file. One of IMG, PDF, TXT, DOCX, or CSV.

parentFileId
string

The ID of the parent file. Only included if this file is a derivative of another file, for example, if it was created by a Splitter in a workflow.

presignedUrl
string

A presigned URL to download the file. Expires after 15 minutes.
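
Because the URL is short-lived, a client that caches File objects may want to check a URL's age before using it. The following is a minimal sketch, not part of the Extend API. It assumes the 15-minute window starts when the File object is retrieved, which this page does not state. The helper name `presigned_url_usable` and the 30-second safety margin are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Expiry window stated in the field description above.
PRESIGNED_URL_TTL = timedelta(minutes=15)
# Hypothetical margin so a download is not started right before expiry.
SAFETY_MARGIN = timedelta(seconds=30)


def presigned_url_usable(fetched_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if a URL fetched at `fetched_at` is likely still valid.

    Assumption: the 15 minutes are counted from the moment the File
    object was retrieved.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - fetched_at < PRESIGNED_URL_TTL - SAFETY_MARGIN
```

If the check fails, retrieving the File object again should yield a fresh URL.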

contents
object
rawText
string

The raw text content of the file. This is included for all file types if the rawText query parameter is set to true in the endpoint request.

pages
array

An array of page objects representing the content of each page in the file.

pageNumber
number

The page number of this page in the document.

markdown
string

Cleaned and structured markdown content of the page. Available for PDF and IMG file types. Only included if the markdown query parameter is set to true in the endpoint request.

html
string

Cleaned and structured HTML content of the page. Available for DOCX files that were not auto-converted to PDF. Only included if the html query parameter is set to true in the endpoint request.
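
The content fields above are opt-in through the query parameters rawText, markdown, and html. The sketch below shows one way to build that query string and to flatten the returned pages into a single string. It assumes rawText and pages both sit under contents, as the field listing suggests. The helper names `content_query` and `collect_page_text` and the sample payload are hypothetical, and the endpoint path is deliberately left out.

```python
from typing import Any
from urllib.parse import urlencode


def content_query(raw_text: bool = False, markdown: bool = False, html: bool = False) -> str:
    """Build a query string that sets only the requested content flags to true."""
    flags = {"rawText": raw_text, "markdown": markdown, "html": html}
    return urlencode({name: "true" for name, enabled in flags.items() if enabled})


def collect_page_text(file_obj: dict[str, Any]) -> str:
    """Join page content in page order, preferring markdown and then html."""
    contents = file_obj.get("contents") or {}
    pages = contents.get("pages") or []
    parts = []
    for page in sorted(pages, key=lambda p: p.get("pageNumber", 0)):
        text = page.get("markdown") or page.get("html")
        if text:
            parts.append(text)
    if parts:
        return "\n\n".join(parts)
    # No per-page content was returned, so fall back to rawText if present.
    return contents.get("rawText") or ""


# Hypothetical payload fragment; pages are intentionally out of order.
sample_file = {
    "contents": {
        "rawText": "Page 1 Page 2",
        "pages": [
            {"pageNumber": 2, "markdown": "# Page 2"},
            {"pageNumber": 1, "markdown": "# Page 1"},
        ],
    }
}
```

Sorting by pageNumber keeps the output stable even if the pages array is not returned in document order.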

metadata
object
pageCount
number

The number of pages in the file. Only set for PDF and DOCX files.

parentSplit
object

Metadata about the split. Only included if this file is a derivative of another file, for example, if it was created by a Splitter in a workflow.

id
string

The ID of the split.

type
string

The type of the split.

identifier
string

The identifier of the split.

startPage
number

The start page of the split.

endPage
number

The end page of the split.
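
As an illustration of how the split fields fit together, the sketch below computes how many pages of the parent file a derived file covers. It rests on two assumptions that this page does not confirm: that parentSplit is nested under metadata, and that startPage and endPage are both inclusive. The helper name `split_page_count` and the sample values are hypothetical.

```python
from typing import Any, Optional


def split_page_count(file_obj: dict[str, Any]) -> Optional[int]:
    """Return the number of parent pages covered by the split, or None.

    Assumptions: parentSplit lives under metadata, and both startPage
    and endPage are inclusive.
    """
    split = (file_obj.get("metadata") or {}).get("parentSplit")
    if not split:
        # Not a derived file, so there is no split range to measure.
        return None
    start, end = split["startPage"], split["endPage"]
    if end < start:
        raise ValueError(f"endPage {end} is before startPage {start}")
    return end - start + 1


# Hypothetical derived file produced by a Splitter.
derived_file = {
    "parentFileId": "file_parent_example",
    "metadata": {
        "parentSplit": {
            "id": "split_example",
            "type": "invoice",
            "identifier": "invoice_1",
            "startPage": 3,
            "endPage": 5,
        }
    },
}
```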

createdAt
string

The date and time the file was created.

updatedAt
string

The date and time the file was last updated.
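
To make the overall shape easier to picture, here is a hypothetical payload and a small sanity check written against the field descriptions above. All values are invented for illustration. The nesting of pages under contents and pageCount under metadata is inferred from the field listing, the ISO 8601 timestamp format is an assumption, and `check_file_object` is a hypothetical helper, not part of any Extend SDK.

```python
import json

# Normalized file types listed in the `type` field description.
FILE_TYPES = {"IMG", "PDF", "TXT", "DOCX", "CSV"}

# Hypothetical example payload; every value here is illustrative only.
SAMPLE_FILE_JSON = """
{
  "object": "file",
  "id": "file_example123",
  "name": "invoice.pdf",
  "type": "PDF",
  "presignedUrl": "https://example.com/presigned",
  "contents": {
    "pages": [
      {"pageNumber": 1, "markdown": "# Invoice"}
    ]
  },
  "metadata": {"pageCount": 1},
  "createdAt": "2024-01-01T00:00:00Z",
  "updatedAt": "2024-01-01T00:00:00Z"
}
"""


def check_file_object(obj: dict) -> list[str]:
    """Return a list of mismatches against the documented fields."""
    problems = []
    if obj.get("object") != "file":
        problems.append('object should be "file"')
    if obj.get("type") not in FILE_TYPES:
        problems.append(f"type should be one of {', '.join(sorted(FILE_TYPES))}")
    for field in ("id", "name"):
        if not isinstance(obj.get(field), str):
            problems.append(f"{field} should be a string")
    return problems


sample_file = json.loads(SAMPLE_FILE_JSON)
```

An empty list from `check_file_object` means the payload matches the handful of fields checked here; it does not validate the full object.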

Note: Some deprecated fields remain in the payload for backwards compatibility:

  • markdown and rawText for IMG files, returned outside the pages array. These will still be included in payloads until full deprecation in December 2024.