Document Processor Configuration Guide

Document processors are the core components that analyze and manipulate information from your documents. This guide explains how to configure processors through our API, including detailed examples for each processor type.

Overview

We support three types of document processors:

Extraction Processors: Extract specific fields from documents.
Classification Processors: Categorize documents.
Splitter Processors: Divide documents into logical sub-documents.

We generally recommend configuring processors in the UI. The API is useful for programmatic workflows, for configuring a large number of processors, or for keeping versioned configurations in source control.

You can also use our webhook events to consume changes made to configurations in the Extend Dashboard and keep your saved configurations in sync.

Schema Definitions

Base Processor Schema

All processor configurations share these base properties:

type BaseProcessorConfig = {
  // Base properties inherited by all processor types
  baseProcessor?: string;
  baseVersion?: string;
};

These properties default to the latest available version when the processor is created, unless you specify otherwise (see the Changelog for details). Set them if you need to pin your processor to a specific underlying model version for consistent behavior, or to use features available only in certain versions.
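
As a minimal sketch of pinning, the helper below copies a config and sets both base properties on the copy. The helper name `pinBase` and the placeholder strings are illustrative assumptions, not real identifiers; valid values are the ones listed in the Changelog.

```typescript
// Shape taken from the Base Processor Schema above.
type BaseProcessorConfig = {
  baseProcessor?: string;
  baseVersion?: string;
};

// Hypothetical helper: returns a copy of `config` with both base properties set.
function pinBase<T extends object>(
  config: T,
  baseProcessor: string,
  baseVersion: string,
): T & Required<BaseProcessorConfig> {
  return { ...config, baseProcessor, baseVersion };
}

// Placeholder values only; substitute identifiers from the Changelog.
const pinned = pinBase(
  { type: "EXTRACT" as const },
  "<base-processor-name>",
  "<base-version>",
);
```

Keeping the pin in one helper makes it easy to update every stored configuration when you decide to move to a newer version.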

Extraction Processor Configuration

Extraction processors extract specific fields from documents.

// Either schema or fields must be provided, but not both
type ExtractionConfig = {
  type: "EXTRACT";
  baseProcessor?: string;
  baseVersion?: string;
  schema?: RootJSONSchema; // See the JSON Schema Structure section below
  fields?: ExtractionField[]; // (Deprecated) See the Fields Array Structure section below
  customDefinitions?: string;
  customExtractionRules?: string;
  customDocumentKind?: string;
  includeBoundingBoxCitations?: boolean;
  includeModelReasoning?: boolean;
};

JSON Schema Structure (`schema`)

This section is relevant for processors using the JSON Schema config type. If you are using the legacy Fields Array config type, please see the Fields Array Structure documentation. If you aren’t sure which config type you are using, please see the Migrating to JSON Schema documentation.

We use JSON Schema to define the structure of the data we extract. Before you get started, we recommend familiarizing yourself with the JSON Schema documentation to understand how to define your schema.

The standard JSON Schema is extremely flexible. We’ve implemented a subset of the standard to support the needs of document extraction. Your schema must follow these rules:

The root must be an object type
Allowed types are string, number, integer, boolean, object, and array
All primitive fields (string, number, boolean, integer) must be nullable (use an array type with "null" as an option, e.g. "type": ["string", "null"])
Maximum nesting level is 3 (each non-root object counts as 1 level)
Property keys and names must only contain lowercase letters, numbers, and underscores
Array items must be objects
Enums must only contain strings and must contain a null option
Custom types are supported by adding an "extend:type": "currency", "extend:type": "signature", or "extend:type": "date" property to the appropriate field type with the required properties. See below for examples.
Property names can be overridden using the "extend:name" property. If supplied, its value replaces the property name as it appears to the model, but not the key in the output returned to you. This is useful for giving the model more descriptive names or instructions without altering the keys in your output data structure.
You can add descriptions to individual enum values using the "extend:descriptions" property.
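
The rules above can be checked locally before a schema is submitted. The following is a minimal sketch, not the service's own validation: the `SchemaNode` type, the `findViolations` name, and the error messages are all assumptions made for illustration, and the nesting count follows one reading of the rule (every object below the root, including array item objects, adds one level).

```typescript
// Loose node shape for illustration; not the official type definitions.
type SchemaNode = {
  type?: string | string[];
  enum?: unknown[];
  properties?: { [key: string]: SchemaNode };
  items?: SchemaNode;
};

const PRIMITIVES = ["string", "number", "integer", "boolean"];
const KEY_PATTERN = /^[a-z0-9_]+$/;
const MAX_NESTING = 3;

function findViolations(root: SchemaNode): string[] {
  const errors: string[] = [];

  // `depth` is the number of non-root objects that enclose `node`.
  function visit(node: SchemaNode, path: string, depth: number): void {
    if (node.enum !== undefined) {
      if (node.enum.indexOf(null) === -1) errors.push(`${path}: enum must include null`);
      if (node.enum.some((v) => v !== null && typeof v !== "string")) {
        errors.push(`${path}: enum values must be strings`);
      }
    } else if (Array.isArray(node.type)) {
      const nonNull = node.type.filter((t) => t !== "null");
      if (nonNull.length === node.type.length) errors.push(`${path}: primitive must allow null`);
      if (nonNull.some((t) => PRIMITIVES.indexOf(t) === -1)) {
        errors.push(`${path}: unsupported type in union`);
      }
    } else if (node.type === "object") {
      if (depth + 1 > MAX_NESTING) errors.push(`${path}: nested deeper than ${MAX_NESTING} levels`);
      visitProperties(node, path, depth + 1);
    } else if (node.type === "array") {
      if (node.items === undefined || node.items.type !== "object") {
        errors.push(`${path}: array items must be objects`);
      } else {
        visit(node.items, `${path}[]`, depth);
      }
    } else if (typeof node.type === "string" && PRIMITIVES.indexOf(node.type) !== -1) {
      errors.push(`${path}: primitive must allow null`);
    } else {
      errors.push(`${path}: unsupported type`);
    }
  }

  function visitProperties(node: SchemaNode, path: string, depth: number): void {
    const properties = node.properties || {};
    Object.keys(properties).forEach((key) => {
      const childPath = path === "" ? key : `${path}.${key}`;
      if (!KEY_PATTERN.test(key)) {
        errors.push(`${childPath}: key must use lowercase letters, numbers, and underscores`);
      }
      visit(properties[key], childPath, depth);
    });
  }

  if (root.type !== "object") errors.push("root: must be an object type");
  else visitProperties(root, "", 0);
  return errors;
}

// Example: three of the four properties below break a rule.
const violations = findViolations({
  type: "object",
  properties: {
    invoice_number: { type: ["string", "null"] },
    TotalAmount: { type: "number" },
    tags: { type: "array", items: { type: ["string", "null"] } },
    status: { enum: ["paid", "pending"] },
  },
});
```

Running a check like this in CI is one way to catch schema mistakes in source-controlled configurations before they reach the API.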

Unsupported Features

While we support the core JSON Schema structure, many additional features are not supported, including:

Schema composition like anyOf, oneOf, allOf, schema definitions, or recursive schemas
Regular expressions and other type-specific validation keywords
Conditional schema validation
Constant values
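
A schema written for another tool may contain some of these features. The sketch below scans a schema for a handful of standard JSON Schema keywords that correspond to the categories listed above. The keyword list is an assumption that maps the prose onto standard keyword names and is not exhaustive; `findUnsupportedKeywords` is a hypothetical helper name.

```typescript
// Assumed mapping from the categories above to standard JSON Schema keywords.
const UNSUPPORTED_KEYWORDS = [
  "anyOf", "oneOf", "allOf", // schema composition
  "$ref", "$defs", "definitions", // schema definitions and recursion
  "pattern", "minLength", "maxLength", "minimum", "maximum", // type-specific validation
  "if", "then", "else", // conditional validation
  "const", // constant values
];

type AnySchema = { [keyword: string]: unknown };

function isSchema(value: unknown): value is AnySchema {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns the path of every unsupported keyword found in the schema.
function findUnsupportedKeywords(root: AnySchema): string[] {
  const found: string[] = [];

  function walk(schema: AnySchema, path: string): void {
    Object.keys(schema).forEach((keyword) => {
      if (UNSUPPORTED_KEYWORDS.indexOf(keyword) !== -1) found.push(`${path}.${keyword}`);
    });
    // Keys under `properties` are field names, not keywords, so only their
    // values are walked. A field may legitimately be called "pattern".
    const properties = schema["properties"];
    if (isSchema(properties)) {
      Object.keys(properties).forEach((name) => {
        const child = properties[name];
        if (isSchema(child)) walk(child, `${path}.properties.${name}`);
      });
    }
    const items = schema["items"];
    if (isSchema(items)) walk(items, `${path}.items`);
  }

  walk(root, "$");
  return found;
}

const unsupported = findUnsupportedKeywords({
  type: "object",
  properties: {
    pattern: { type: ["string", "null"], pattern: "^INV-" },
    total: { oneOf: [{ type: "number" }, { type: "null" }] },
  },
});
```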

Schema Examples

Primitive Schema

All primitive types must be nullable.

{
  "field_name": {
    "type": ["string", "null"],
    "description": "Field description"
  },
  "numeric_field": {
    "type": ["number", "null"],
    "description": "A numeric field"
  },
  "integer_field": {
    "type": ["integer", "null"],
    "description": "An integer field"
  },
  "boolean_field": {
    "type": ["boolean", "null"],
    "description": "A boolean field"
  }
}

Object Schema

Objects must have properties. If you set a required array listing the properties, we respect that order when extracting. If you do not set a required array, we generate one and enforce its order.

{
  "address": {
    "type": "object",
    "properties": {
      "street": {
        "type": ["string", "null"],
        "description": "Street address"
      },
      "city": {
        "type": ["string", "null"],
        "description": "City name"
      }
    },
    "required": ["street", "city"]
  }
}
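
If you would rather make the extraction order explicit than rely on a generated array, a small helper can fill in `required` on your side. This is a sketch under an assumption: it uses the declaration order of `properties`, which may or may not match the order the service would generate. `withExplicitRequired` is a hypothetical name.

```typescript
type ObjectNode = {
  type: "object";
  properties: { [key: string]: unknown };
  required?: string[];
};

// Keeps an existing `required` array as is; otherwise lists every property
// in declaration order.
function withExplicitRequired<T extends ObjectNode>(node: T): T & { required: string[] } {
  const required = node.required !== undefined ? node.required : Object.keys(node.properties);
  return { ...node, required };
}

const address = withExplicitRequired({
  type: "object",
  properties: {
    street: { type: ["string", "null"], description: "Street address" },
    city: { type: ["string", "null"], description: "City name" },
  },
});
// address.required is ["street", "city"]
```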

Array Schema

Array items must be objects.

{
  "line_items": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "description": {
          "type": ["string", "null"],
          "description": "Item description"
        },
        "quantity": {
          "type": ["number", "null"],
          "description": "Item quantity"
        },
        "price": {
          "type": ["number", "null"],
          "description": "Item price"
        }
      },
      "required": ["description", "quantity", "price"]
    },
    "description": "List of items in the invoice"
  }
}

Enum Schema

Enums must include null as an option, and only string values are supported. The extend:descriptions property is an optional array of strings. We recommend using it to give more context for each enum option, which leads to more accurate extraction.

{
  "status": {
    "enum": ["pending", "approved", "rejected", null],
    "extend:descriptions": [
      "Invoice is pending approval",
      "Invoice has been approved",
      "Invoice has been rejected",
      ""
    ],
    "description": "Current status of the invoice"
  }
}
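
In the example above, each description sits at the same index as the enum value it describes, and the null option is paired with an empty string. Assuming that positional pairing is what is expected, a small builder keeps the two arrays in step. The `enumField` helper and `EnumOption` type are illustrative names, not part of the API.

```typescript
type EnumOption = { value: string; description: string };

// Builds an enum property whose descriptions line up index by index with its
// values, then appends the mandatory null option with an empty description.
function enumField(description: string, options: EnumOption[]) {
  const values: (string | null)[] = options.map((option) => option.value);
  const descriptions: string[] = options.map((option) => option.description);
  values.push(null);
  descriptions.push("");
  return {
    enum: values,
    "extend:descriptions": descriptions,
    description,
  };
}

const status = enumField("Current status of the invoice", [
  { value: "pending", description: "Invoice is pending approval" },
  { value: "approved", description: "Invoice has been approved" },
  { value: "rejected", description: "Invoice has been rejected" },
]);
```

Defining each value next to its description makes it harder for the two arrays to drift apart when options are added or reordered.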

Custom Field Types

The extend:type keyword enables custom pre-processing and post-processing that bake in best practices and heuristics for the field type.

Date Schema

Date fields must be strings and use the extend:type keyword with the value date. This guarantees that the value is always returned as an ISO 8601 date (yyyy-mm-dd).

{
  "invoice_date": {
    "type": ["string", "null"],
    "extend:type": "date",
    "description": "The invoice date"
  }
}
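
Because date values arrive as yyyy-mm-dd strings (or null), downstream code usually needs to turn them into date objects. The following is a minimal sketch of one way to do that; `parseExtractedDate` is a hypothetical helper, and interpreting the date as midnight UTC is a choice made for this example.

```typescript
// Converts a yyyy-mm-dd string into a Date at midnight UTC.
// Returns null for null input or for anything that is not a real calendar date.
function parseExtractedDate(value: string | null): Date | null {
  if (value === null) return null;
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (match === null) return null;

  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  const date = new Date(Date.UTC(year, month - 1, day));

  // Date.UTC rolls impossible days over into the next month, so confirm that
  // the components survive the round trip.
  const roundTrips =
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day;
  return roundTrips ? date : null;
}

const invoiceDate = parseExtractedDate("2024-03-15");
```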

Currency Schema

Currency fields must be objects with specific properties.

{
  "price": {
    "type": "object",
    "extend:type": "currency",
    "properties": {
      "amount": {
        "type": ["number", "null"]
      },
      "iso_4217_currency_code": {
        "type": ["string", "null"]
      }
    },
    "required": ["amount", "iso_4217_currency_code"]
  }
}
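
Since every currency field has the same shape, schemas with several monetary amounts can build them from one function. This is a sketch only; `currencyField` is a hypothetical helper and the property names `subtotal` and `total` are made up for the example.

```typescript
// Returns the fixed currency structure shown above, with a caller-supplied description.
function currencyField(description: string) {
  return {
    type: "object" as const,
    "extend:type": "currency" as const,
    description,
    properties: {
      amount: { type: ["number", "null"] },
      iso_4217_currency_code: { type: ["string", "null"] },
    },
    required: ["amount", "iso_4217_currency_code"],
  };
}

const totalsSchema = {
  type: "object" as const,
  properties: {
    subtotal: currencyField("Amount before tax"),
    total: currencyField("Amount including tax"),
  },
  required: ["subtotal", "total"],
};
```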

Signature Schema

Signature fields must be objects with specific properties. Using this type automatically enables our advanced signature detection in the parsing step prior to extraction, and applies a number of prompt and post-processing heuristics to improve accuracy, particularly by reducing false positives for signature blocks that are not actually signed.

{
  "signature": {
    "type": "object",
    "extend:type": "signature",
    "properties": {
      "printed_name": {
        "type": ["string", "null"]
      },
      "signature_date": {
        "type": ["string", "null"],
        "extend:type": "date"
      },
      "is_signed": {
        "type": ["boolean", "null"]
      },
      "title_or_role": {
        "type": ["string", "null"]
      }
    },
    "required": ["printed_name", "signature_date", "is_signed", "title_or_role"]
  }
}
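
It is easy to write a signature-shaped object and forget the "extend:type" marker or one of the four properties. The check below is a sketch of a local sanity test, not the service's validation; `checkSignatureField` and its messages are assumptions made for illustration.

```typescript
const SIGNATURE_PROPERTIES = ["printed_name", "signature_date", "is_signed", "title_or_role"];

type LooseNode = { [keyword: string]: unknown };

// Lists what a node is missing compared with the signature structure above.
function checkSignatureField(node: LooseNode): string[] {
  const problems: string[] = [];
  if (node["type"] !== "object") problems.push('type must be "object"');
  if (node["extend:type"] !== "signature") problems.push('missing "extend:type": "signature"');

  const properties = node["properties"];
  const declared =
    typeof properties === "object" && properties !== null ? Object.keys(properties) : [];
  SIGNATURE_PROPERTIES.forEach((name) => {
    if (declared.indexOf(name) === -1) problems.push(`missing property ${name}`);
  });
  return problems;
}

// This node has the right properties except title_or_role, and no marker.
const signatureProblems = checkSignatureField({
  type: "object",
  properties: {
    printed_name: { type: ["string", "null"] },
    signature_date: { type: ["string", "null"], "extend:type": "date" },
    is_signed: { type: ["boolean", "null"] },
  },
});
```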

Configuration Examples

Basic Example

const basicExtractionConfig = {
  type: "EXTRACT",
  schema: {
    type: "object",
    properties: {
      invoice_number: {
        type: ["string", "null"],
        description: "The unique identifier for this invoice",
      },
      invoice_amount: {
        type: "object",
        "extend:type": "currency",
        description: "The total amount of the invoice",
        properties: {
          amount: {
            type: ["number", "null"],
          },
          iso_4217_currency_code: {
            type: ["string", "null"],
          },
        },
        required: ["amount", "iso_4217_currency_code"],
      },
    },
    required: ["invoice_number", "invoice_amount"],
  },
};

Example with nested fields

const complexExtractionConfig = {
  type: "EXTRACT",
  schema: {
    type: "object",
    properties: {
      line_items: {
        type: "array",
        description: "Individual items in the invoice",
        items: {
          type: "object",
          properties: {
            item_name: {
              type: ["string", "null"],
              description: "Name of the item",
            },
            quantity: {
              type: ["number", "null"],
              description: "Number of items",
            },
            unit_price: {
              type: "object",
              "extend:type": "currency",
              properties: {
                amount: {
                  type: ["number", "null"],
                  description: "Price per unit",
                },
                iso_4217_currency_code: {
                  type: ["string", "null"],
                  description: "Currency code",
                },
              },
              required: ["amount", "iso_4217_currency_code"],
            },
          },
          required: ["item_name", "quantity", "unit_price"],
        },
      },
      payment_status: {
        description: "Current payment status",
        enum: ["PAID", "PENDING", null],
        "extend:descriptions": [
          "Payment has been completed",
          "Payment is pending",
          "",
        ],
      },
    },
    required: ["line_items", "payment_status"],
  },
  customExtractionRules: "- If ...", // Optional custom rules
  customDocumentKind: "invoice", // Optionally specify a document kind
  includeBoundingBoxCitations: true, // Turns on LLM-powered citations and bounding box references
  includeModelReasoning: true, // Exposes the model's chain-of-thought reasoning for each field result
};

Example with nested arrays and objects

const nestedArrayConfig = {
  type: "EXTRACT",
  schema: {
    type: "object",
    properties: {
      orders: {
        type: "array",
        description: "List of customer orders",
        items: {
          type: "object",
          properties: {
            order_id: {
              type: ["string", "null"],
              description: "Unique identifier for the order",
            },
            customer_name: {
              type: ["string", "null"],
              description: "Name of the customer",
            },
            shipments: {
              type: "array",
              description: "List of shipments for this order",
              items: {
                type: "object",
                properties: {
                  tracking_number: {
                    type: ["string", "null"],
                    description: "Shipping tracking number",
                  },
                  ship_date: {
                    type: ["string", "null"],
                    "extend:type": "date",
                    description: "Date the shipment was sent",
                  },
                  carrier: {
                    type: ["string", "null"],
                    description: "Shipping carrier name",
                  },
                },
                required: ["tracking_number", "ship_date", "carrier"],
              },
            },
          },
          required: ["order_id", "customer_name", "shipments"],
        },
      },
    },
    required: ["orders"],
  },
};

Example with signature, currency, and date fields

const customFieldConfig = {
  type: "EXTRACT",
  schema: {
    type: "object",
    properties: {
      invoice_signature: {
        type: "object",
        "extend:type": "signature",
        description: "Details of the invoice signature",
        properties: {
          printed_name: {
            type: ["string", "null"],
            description: "The printed name of the signer",
          },
          signature_date: {
            type: ["string", "null"],
            "extend:type": "date",
            description: "The date the signature was applied",
          },
          is_signed: {
            type: ["boolean", "null"],
            description: "Indicates if the document is signed",
          },
          title_or_role: {
            type: ["string", "null"],
            description: "The title or role of the signer",
          },
        },
        required: [
          "printed_name",
          "signature_date",
          "is_signed",
          "title_or_role",
        ],
      },
      invoice_amount: {
        type: "object",
        "extend:type": "currency",
        description: "The amount of the invoice",
        properties: {
          amount: {
            type: ["number", "null"],
            description: "The numerical value of the amount",
          },
          iso_4217_currency_code: {
            type: ["string", "null"],
            description: "The ISO 4217 currency code (e.g., USD, EUR)",
          },
        },
        required: ["amount", "iso_4217_currency_code"],
      },
      invoice_date: {
        type: ["string", "null"],
        "extend:type": "date",
        description: "The date of the invoice",
      },
    },
    required: ["invoice_signature", "invoice_amount", "invoice_date"],
  },
};

Type Definitions

JSON Schema Type Definitions

// Root schema
type RootJSONSchema = {
  type: "object";
  properties: {
    [key: string]: JSONSchema;
  };
  required: string[];
  additionalProperties?: boolean;
};

// Common schema properties that all types can have
type BaseJSONSchema = {
  "extend:name"?: string;
  description?: string;
};

// Enum schema
type EnumJSONSchema = BaseJSONSchema & {
  enum: (string | null)[]; // null is a required option for enums
  "extend:descriptions"?: string[];
};

// String schema
type StringJSONSchema = BaseJSONSchema & {
  type: ["string", "null"];
  "extend:type"?: "date";
};

// Structure for a date field
type DateStringSchema = BaseJSONSchema & {
  type: ["string", "null"];
  "extend:type": "date";
};

// Number schema
type NumberJSONSchema = BaseJSONSchema & {
  type: ["number", "null"];
};

// Integer schema
type IntegerJSONSchema = BaseJSONSchema & {
  type: ["integer", "null"];
};

// Boolean schema
type BooleanJSONSchema = BaseJSONSchema & {
  type: ["boolean", "null"];
};

// Array schema
type ArrayJSONSchema = BaseJSONSchema & {
  type: "array";
  items: ObjectJSONSchema; // we only support objects in arrays for now
};

// Object schema
type ObjectJSONSchema = BaseJSONSchema & {
  type: "object";
  properties: {
    [key: string]: JSONSchema;
  };
  required: string[];
  additionalProperties?: boolean;
};

// Required structure for a currency field
type CurrencyObjectSchema = BaseJSONSchema & {
  type: "object";
  "extend:type": "currency";
  properties: {
    amount: {
      type: ["number", "null"];
    };
    iso_4217_currency_code: {
      type: ["string", "null"];
    };
  };
  required: ["amount", "iso_4217_currency_code"];
};

// Required structure for a signature field
type SignatureObjectSchema = BaseJSONSchema & {
  type: "object";
  "extend:type": "signature";
  properties: {
    printed_name: {
      type: ["string", "null"];
    };
    signature_date: {
      type: ["string", "null"];
      "extend:type": "date"; // Note: The date itself also uses extend:type
    };
    is_signed: {
      type: ["boolean", "null"];
    };
    title_or_role: {
      type: ["string", "null"];
    };
  };
  required: ["printed_name", "signature_date", "is_signed", "title_or_role"];
};

// Union of all schema types
type JSONSchema =
  | EnumJSONSchema
  | StringJSONSchema
  | NumberJSONSchema
  | IntegerJSONSchema
  | BooleanJSONSchema
  | ArrayJSONSchema
  | ObjectJSONSchema
  | DateStringSchema
  | CurrencyObjectSchema
  | SignatureObjectSchema;
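
To show how these types nest in practice, the sketch below walks a schema and lists the path of every leaf field. It uses a simplified `SchemaNode` type so that the block stands alone, and `listLeafPaths` is a hypothetical helper; the `[]` suffix for array items is just a notation chosen here.

```typescript
// Simplified stand-in for the JSONSchema union above.
type SchemaNode = {
  type?: string | string[];
  enum?: (string | null)[];
  properties?: { [key: string]: SchemaNode };
  items?: SchemaNode;
};

// Objects contribute their children, arrays recurse into their item object,
// and everything else (primitives, enums) counts as a leaf.
function listLeafPaths(node: SchemaNode, prefix: string = ""): string[] {
  if (node.type === "object" && node.properties !== undefined) {
    const properties = node.properties;
    let paths: string[] = [];
    Object.keys(properties).forEach((key) => {
      const childPath = prefix === "" ? key : `${prefix}.${key}`;
      paths = paths.concat(listLeafPaths(properties[key], childPath));
    });
    return paths;
  }
  if (node.type === "array" && node.items !== undefined) {
    return listLeafPaths(node.items, `${prefix}[]`);
  }
  return [prefix];
}

const leafPaths = listLeafPaths({
  type: "object",
  properties: {
    invoice_number: { type: ["string", "null"] },
    line_items: {
      type: "array",
      items: {
        type: "object",
        properties: {
          quantity: { type: ["number", "null"] },
          unit_price: {
            type: "object",
            properties: {
              amount: { type: ["number", "null"] },
              iso_4217_currency_code: { type: ["string", "null"] },
            },
          },
        },
      },
    },
  },
});
```

A flat list of paths like this can be handy when reviewing a large schema or comparing two versions of one in source control.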

Fields Array Structure (`fields`)

This section is relevant for the Fields Array config type. If you are using the JSON Schema config type, please see the JSON Schema Structure documentation. If you aren’t sure which config type you are using, please see the Migrating to JSON Schema documentation.

type ExtractionField = {
  id: string;
  name: string;
  type:
    | "string"
    | "number"
    | "currency"
    | "boolean"
    | "date"
    | "array"
    | "enum"
    | "object"
    | "signature";
  description: string;
  // Required for nested fields (arrays, objects, signatures)
  schema?: ExtractionField[];
  // Required for enums
  enum?: Enum[];
};

type Enum = {
  value: string;
  description: string;
};

Configuration Examples

Basic Example

const extractionConfig = {
  type: "EXTRACT",
  fields: [
    {
      id: "invoice_number",
      name: "Invoice Number",
      type: "string",
      description: "The unique identifier for this invoice",
    },
    {
      id: "amount",
      name: "Total Amount",
      type: "currency",
      description: "The total amount of the invoice",
    },
  ],
};

Example with Nested Fields

const complexExtractionConfig = {
  type: "EXTRACT",
  fields: [
    {
      id: "line_items",
      name: "Line Items",
      type: "array",
      description: "Individual items in the invoice",
      schema: [
        {
          id: "item_name",
          name: "Item Name",
          type: "string",
          description: "Name of the item",
        },
        {
          id: "quantity",
          name: "Quantity",
          type: "number",
          description: "Number of items",
        },
        {
          id: "unit_price",
          name: "Unit Price",
          type: "currency",
          description: "Price per unit",
        },
      ],
    },
    {
      id: "payment_status",
      name: "Payment Status",
      type: "enum",
      description: "Current payment status",
      enum: [
        {
          value: "PAID",
          description: "Payment has been completed",
        },
        {
          value: "PENDING",
          description: "Payment is pending",
        },
      ],
    },
  ],
  customExtractionRules: "- If ...", // Optional custom rules
  customDocumentKind: "invoice", // Optionally specify a document kind
  includeBoundingBoxCitations: true, // Turns on LLM-powered citations and bounding box references
  includeModelReasoning: true, // Exposes the model's chain-of-thought reasoning for each field result
};

Example with nested arrays and objects

const nestedArrayConfig = {
  type: "EXTRACT",
  fields: [
    {
      id: "orders",
      name: "Orders",
      type: "array",
      description: "List of customer orders",
      schema: [
        {
          id: "order_id",
          name: "Order ID",
          type: "string",
          description: "Unique identifier for the order",
        },
        {
          id: "customer_name",
          name: "Customer Name",
          type: "string",
          description: "Name of the customer",
        },
        {
          id: "shipments",
          name: "Shipments",
          type: "array",
          description: "List of shipments for this order",
          schema: [
            {
              id: "tracking_number",
              name: "Tracking Number",
              type: "string",
              description: "Shipping tracking number",
            },
            {
              id: "ship_date",
              name: "Ship Date",
              type: "date",
              description: "Date the shipment was sent",
            },
            {
              id: "carrier",
              name: "Carrier",
              type: "string",
              description: "Shipping carrier name",
            },
          ],
        },
      ],
    },
  ],
};
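
The Fields Array examples above mirror the JSON Schema examples earlier in this guide, which suggests a fairly mechanical correspondence between the two config types. The sketch below illustrates that correspondence and is not the official migration path (see the Migrating to JSON Schema documentation for that). It assumes `id` becomes the property key, leaves the legacy `name` unmapped, and assumes legacy signature fields carry their sub-fields in `schema`. `toObjectSchema` and `toSchema` are hypothetical names.

```typescript
type Enum = { value: string; description: string };
type ExtractionField = {
  id: string;
  name: string;
  type: "string" | "number" | "currency" | "boolean" | "date"
    | "array" | "enum" | "object" | "signature";
  description: string;
  schema?: ExtractionField[];
  enum?: Enum[];
};
type Schema = { [keyword: string]: unknown };

const nullable = (type: string): Schema => ({ type: [type, "null"] });

// Turns a list of legacy fields into an object schema keyed by field id.
function toObjectSchema(fields: ExtractionField[]): Schema {
  const properties: { [key: string]: Schema } = {};
  fields.forEach((field) => {
    properties[field.id] = toSchema(field);
  });
  return { type: "object", properties, required: fields.map((field) => field.id) };
}

function toSchema(field: ExtractionField): Schema {
  const description = field.description;
  switch (field.type) {
    case "string":
    case "number":
    case "boolean":
      return { ...nullable(field.type), description };
    case "date":
      return { ...nullable("string"), "extend:type": "date", description };
    case "enum": {
      const options = field.enum || [];
      const values: (string | null)[] = options.map((option) => option.value);
      return {
        enum: values.concat([null]),
        "extend:descriptions": options.map((option) => option.description).concat([""]),
        description,
      };
    }
    case "array":
      return { type: "array", items: toObjectSchema(field.schema || []), description };
    case "object":
      return { ...toObjectSchema(field.schema || []), description };
    case "signature":
      return { ...toObjectSchema(field.schema || []), "extend:type": "signature", description };
    case "currency":
      return {
        type: "object",
        "extend:type": "currency",
        description,
        properties: { amount: nullable("number"), iso_4217_currency_code: nullable("string") },
        required: ["amount", "iso_4217_currency_code"],
      };
  }
}

const migrated = toObjectSchema([
  { id: "invoice_number", name: "Invoice Number", type: "string", description: "The unique identifier for this invoice" },
  { id: "amount", name: "Total Amount", type: "currency", description: "The total amount of the invoice" },
]);
```

Treat the output as a starting point to review by hand rather than a drop-in replacement for an existing configuration.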

Classification Processor Configuration

Classification processors categorize documents into predefined types.

type ClassificationConfig = {
  type: "CLASSIFY";
  baseProcessor?: string;
  baseVersion?: string;
  classifications: Classification[];
  customClassificationRules?: string;
  contextStrategy?: ContextStrategy;
};

type Classification = {
  id: string;
  type: string;
  description: string;
};

type ContextStrategy =
  | {
      type: "default";
      options: {};
    }
  | {
      type: "fixed"; // Set a fixed number of pages to be considered in the classification task.
      options: {
        pageLimit?: number;
      };
    }
  | {
      type: "max"; // Ensure the entire document is used as context in the classification task.
      options: {};
    };

Configuration Example

const classificationConfig = {
  type: "CLASSIFY",
  classifications: [
    {
      id: "invoice",
      type: "INVOICE",
      description: "Standard invoice document ...",
    },
    {
      id: "bill_of_lading",
      type: "BILL_OF_LADING",
      description: "Bill of Lading document ...",
    },
  ],
  customClassificationRules: "- If ...", // Optional custom rules
};
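
The example above leaves contextStrategy unset. As a sketch of how the fixed strategy might be used, the config below limits classification to a fixed number of pages. The page limit of 2 and the `describeContext` helper are illustrative assumptions; the types are copied from the definitions above (without the base properties) so that the block stands alone.

```typescript
type Classification = { id: string; type: string; description: string };

type ContextStrategy =
  | { type: "default"; options: {} }
  | { type: "fixed"; options: { pageLimit?: number } }
  | { type: "max"; options: {} };

type ClassificationConfig = {
  type: "CLASSIFY";
  classifications: Classification[];
  customClassificationRules?: string;
  contextStrategy?: ContextStrategy;
};

const firstPagesConfig: ClassificationConfig = {
  type: "CLASSIFY",
  classifications: [
    { id: "invoice", type: "INVOICE", description: "Standard invoice document ..." },
    { id: "bill_of_lading", type: "BILL_OF_LADING", description: "Bill of Lading document ..." },
  ],
  // Illustrative value: consider only a fixed number of pages when classifying.
  contextStrategy: { type: "fixed", options: { pageLimit: 2 } },
};

// Hypothetical helper that summarizes a strategy, e.g. for logging.
function describeContext(strategy: ContextStrategy | undefined): string {
  if (strategy === undefined) return "context strategy not set";
  switch (strategy.type) {
    case "default":
      return "default context";
    case "fixed":
      return strategy.options.pageLimit === undefined
        ? "fixed context, no page limit given"
        : `fixed context, ${strategy.options.pageLimit} page(s)`;
    case "max":
      return "entire document";
  }
}

const contextSummary = describeContext(firstPagesConfig.contextStrategy);
```

Because ContextStrategy is a discriminated union on `type`, the compiler only allows `pageLimit` to be read in the fixed branch.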

Splitter Processor Configuration

Splitter processors divide documents into logical sub-documents based on defined classifications.

type SplitterConfig = {
  type: "SPLITTER";
  baseProcessor?: string;
  baseVersion?: string;
  subDocumentClassifications: Classification[];
  customSplitterRules?: string;
  customReminders?: string;
  identifierRules?: string;
};

Configuration Example

const splitterConfig = {
  type: "SPLITTER",
  subDocumentClassifications: [
    {
      id: "purchase_contract",
      type: "PURCHASE_CONTRACT",
      description: "Purchase contract section",
    },
    {
      id: "addendum",
      type: "ADDENDUM",
      description: "Addendum section",
    },
  ],
  customSplitterRules: "- If ...", // Optional custom rules
  identifierRules: "- If ...", // Optional identifier rules
};
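
In the examples in this guide, each classification id is the lowercase form of its type. Assuming you want to keep that convention (it is a pattern in the examples, not a documented requirement), a small helper can derive one from the other when building a splitter config. The `classification` helper is a hypothetical name, and the types are copied from the definitions above (without the base properties) so that the block stands alone.

```typescript
type Classification = { id: string; type: string; description: string };

type SplitterConfig = {
  type: "SPLITTER";
  subDocumentClassifications: Classification[];
  customSplitterRules?: string;
  customReminders?: string;
  identifierRules?: string;
};

// Derives the id from the type by lowercasing it, following the examples' convention.
function classification(type: string, description: string): Classification {
  return { id: type.toLowerCase(), type, description };
}

const contractSplitter: SplitterConfig = {
  type: "SPLITTER",
  subDocumentClassifications: [
    classification("PURCHASE_CONTRACT", "Purchase contract section"),
    classification("ADDENDUM", "Addendum section"),
  ],
};
```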
