Learn how to configure document processors through our API
Document processors are the core components that analyze and manipulate information from your documents. This guide explains how to configure processors through our API, including detailed examples for each processor type.
We support three types of document processors: extraction, classification, and splitter processors.
We generally recommend using our UI for configuration, but the API is useful in programmatic workflows, when you need to configure a large number of processors, or when you want to keep your configurations versioned in source control.
You can also use our webhook events to consume changes made to configurations in the Extend Dashboard and keep your saved configurations in sync.
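One way to keep configurations in source control is to write them to disk in a consistent format so that diffs stay small. The sketch below assumes a configuration is a plain JSON-serializable dictionary; the `dump_config` helper and the sample keys are hypothetical and are not part of the API.

```python
import json

def dump_config(config: dict) -> str:
    """Serialize a config with fixed indentation, keeping the original key order."""
    # Key order is preserved on purpose: property order may be significant.
    return json.dumps(config, indent=2, ensure_ascii=False) + "\n"

# Hypothetical config contents, for illustration only.
sample_config = {
    "schema": {
        "type": "object",
        "properties": {
            "vendor_name": {"type": ["string", "null"]},
            "invoice_number": {"type": ["string", "null"]},
        },
    }
}

serialized = dump_config(sample_config)
restored = json.loads(serialized)
```

The serialized text can be committed as-is; loading it back yields the same dictionary with the same property order.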
All processor configurations share these base properties:
Unless you specify them, these properties default to the latest available version when the processor is created (see the Changelog for details). Set them explicitly if you need to pin your processor to a specific underlying model version, either for consistent behavior or to use features available only in certain versions.
Extraction processors extract specific fields from documents.
JSON Schema Structure (schema)
This section is relevant for processors using the JSON Schema config type. If you are using the legacy Fields Array config type, please see the Fields Array Structure documentation. If you aren’t sure which config type you are using, please see the Migrating to JSON Schema documentation.
We use JSON Schema to define the structure of the data we extract. Before you get started, we recommend familiarizing yourself with the JSON Schema documentation to understand how to define your schema.
The standard JSON Schema is extremely flexible. We’ve implemented a subset of the standard to support the needs of document extraction. Your schema must follow these rules:
- The root of the schema must be an object type.
- Allowed types are string, number, integer, boolean, object, and array.
- Primitive types (string, number, boolean, integer) must be nullable (use an array type with "null" as an option, e.g. "type": ["string", "null"]).
- Enums must include a null option.
- Custom types are defined by adding an "extend:type": "currency", "extend:type": "signature", or "extend:type": "date" property to the appropriate field type with the required properties. See below for examples.
- Property names can be customized with the "extend:name" property. If supplied, this will override the name of the property as it will appear to the model, but not in the output returned to you. This is useful for providing more descriptive names or instructions to the model without altering the actual keys in your output data structure.
- Enum options can be described with the "extend:descriptions" property.

While we support the JSON Schema structure, we do not support many of its additional features, including anyOf, oneOf, allOf, schema definitions, and recursive schemas.

All primitive types must be nullable.
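The rules above can be checked locally before a schema is submitted. The following is a minimal sketch of such a check, not the validation the API itself performs; the `lint` function, its error messages, and the sample field names are all hypothetical.

```python
ALLOWED_TYPES = {"string", "number", "integer", "boolean", "object", "array"}
PRIMITIVES = {"string", "number", "integer", "boolean"}

def type_names(node: dict) -> list:
    """Return the node's "type" as a list, whether it was a string or a list."""
    t = node.get("type", [])
    return [t] if isinstance(t, str) else list(t)

def lint(node: dict, path: str = "$", is_root: bool = True) -> list:
    """Collect rule violations for a schema node and its children."""
    errors = []
    types = type_names(node)
    non_null = [t for t in types if t != "null"]
    if is_root and "object" not in types:
        errors.append(f"{path}: root must be an object type")
    for t in non_null:
        if t not in ALLOWED_TYPES:
            errors.append(f"{path}: unsupported type {t!r}")
    if any(t in PRIMITIVES for t in non_null) and "null" not in types:
        errors.append(f"{path}: primitive type must be nullable")
    if "enum" in node and None not in node["enum"]:
        errors.append(f"{path}: enum must include null")
    for key in ("anyOf", "oneOf", "allOf"):
        if key in node:
            errors.append(f"{path}: {key} is not supported")
    if "object" in non_null:
        for name, child in node.get("properties", {}).items():
            errors.extend(lint(child, f"{path}.{name}", is_root=False))
    if "array" in non_null:
        errors.extend(lint(node.get("items", {}), f"{path}[]", is_root=False))
    return errors

# Hypothetical schema; "total" deliberately breaks the nullable rule.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": ["string", "null"],
            "extend:name": "Invoice number printed at the top of the page",
        },
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"description": {"type": ["string", "null"]}},
            },
        },
    },
}

violations = lint(schema)
```

Here `violations` contains a single entry for `$.total`, because its type does not include "null". Note that `extend:name` changes only what the model sees; the output key is still `invoice_number`.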
Objects must have properties. If you set a required array listing the properties, we will respect that order when extracting. If you do not set a required array, we will generate one and enforce its order.
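When schemas are built in code, the ordering can be made explicit instead of being left to generation. The sketch below fills in a missing required array from the declared property order. That the generated order follows the declaration order is an assumption, and `with_required` is a hypothetical helper.

```python
import copy

def with_required(schema: dict) -> dict:
    """Return a copy of an object schema with an explicit "required" array."""
    result = copy.deepcopy(schema)
    if "required" not in result:
        # Assumption: fall back to the order in which properties were declared.
        result["required"] = list(result.get("properties", {}))
    return result

# Hypothetical object field, for illustration only.
vendor = {
    "type": "object",
    "properties": {
        "name": {"type": ["string", "null"]},
        "address": {"type": ["string", "null"]},
    },
}

explicit = with_required(vendor)
```

Setting the array yourself keeps the extraction order visible in review and independent of how a default would be derived.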
Array items must be objects.
Enums must include null as an option. Only strings are supported for enums. The extend:descriptions property is an optional array of strings. We recommend using it to give more context for each enum option, which leads to more accurate extraction.
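As an illustration, an enum field and a small check of these constraints might look like the sketch below. The field values are hypothetical, and the pairing of one description per enum option in the same order (including an entry for null) is an assumption rather than documented behavior. Python's `None` serializes to JSON `null`.

```python
# Hypothetical enum field, for illustration only.
payment_status = {
    "enum": ["paid", "unpaid", None],
    "extend:descriptions": [
        "The invoice is marked as paid in full",
        "A balance is still outstanding",
        "No payment status appears in the document",
    ],
}

def check_enum(field: dict) -> list:
    """Return a list of problems found in an enum field definition."""
    problems = []
    values = field.get("enum", [])
    if None not in values:
        problems.append("enum must include null")
    if any(v is not None and not isinstance(v, str) for v in values):
        problems.append("only string enum values are supported")
    descriptions = field.get("extend:descriptions")
    if descriptions is not None and len(descriptions) != len(values):
        # Assumption: one description per enum option, in the same order.
        problems.append("expected one description per enum option")
    return problems

problems = check_enum(payment_status)
```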
The extend:type keyword enables custom pre-processing and post-processing of fields, baking in best practices and heuristics for the field type.
Date fields must be strings and use the extend:type keyword with the value date. This guarantees that the date is always returned as an ISO-compliant date (yyyy-mm-dd).
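A date field following these rules, together with one way to consume its output, could be sketched as follows. The field name and the `parse_extracted_date` helper are hypothetical; the field is declared nullable because primitive types must be nullable.

```python
from datetime import date

# Hypothetical date field: a nullable string tagged with the date type.
invoice_date_field = {"type": ["string", "null"], "extend:type": "date"}

def parse_extracted_date(value):
    """Convert an extracted yyyy-mm-dd string to a date; pass None through."""
    return None if value is None else date.fromisoformat(value)

parsed = parse_extracted_date("2024-03-15")
```

Because the format is fixed, downstream code can parse the value directly without trying several date formats.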
Currency fields must be objects with specific properties.
Signature fields must be objects with specific properties. This will auto-enable our advanced signature detection in the parsing step prior to extraction, and apply a number of prompt and post-processing heuristics to improve accuracy, particularly on reduction of false positives for signature blocks that are not actually signed.
Basic Example
Example with nested fields
Example with nested arrays and objects
Example with signature, currency, and date fields
JSON Schema Type Definitions
Fields Array Structure (fields)
This section is relevant for the Fields Array config type. If you are using the JSON Schema config type, please see the JSON Schema Structure documentation. If you aren’t sure which config type you are using, please see the Migrating to JSON Schema documentation.
Basic Example
Example with Nested Fields
Example with nested arrays and objects
Classification processors categorize documents into predefined types.
Splitter processors divide documents into logical sub-documents based on defined classifications.