Document Processor Configuration Guide

Document processors are the core components that analyze and manipulate information from your documents. This guide explains how to configure processors through our API, including detailed examples for each processor type.

Overview

We support three types of document processors:

  • Extraction Processors: Extract specific fields from documents.
  • Classification Processors: Categorize documents.
  • Splitter Processors: Divide documents into logical sub-documents.

We generally recommend configuring processors in the UI. The API is useful for programmatic workflows, for configuring a large number of processors, or for keeping versioned configurations in source control.

You can also use our webhook events to consume changes made to configurations in the Extend Dashboard and keep your saved configurations in sync.

Best Practices

  1. Field IDs: Use clear, lowercase, underscore-separated identifiers.
  2. Descriptions: Provide detailed descriptions for all fields and classifications.
  3. Field Types: Choose the most specific field type for your use case.
  4. Validation: Test your configurations with sample documents.
  5. Base Processors: Let Extend default to the latest version of a processor on creation, or specify a pinned version to use instead (see the Changelog for more details).
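
The Field IDs convention above can be checked before a configuration is uploaded. The following is a minimal sketch, not part of the Extend API: the helper names and the exact pattern (a leading lowercase letter, then lowercase letters or digits separated by single underscores) are assumptions made for illustration.

```typescript
// Hypothetical pattern for "clear, lowercase, underscore-separated" IDs.
// Assumes IDs start with a letter and never contain consecutive underscores.
const FIELD_ID_PATTERN = /^[a-z][a-z0-9]*(_[a-z0-9]+)*$/;

// Returns true when an ID follows the assumed convention.
function isConventionalFieldId(id: string): boolean {
  return FIELD_ID_PATTERN.test(id);
}

// Returns the IDs that break the convention so they can be renamed.
function findUnconventionalIds(ids: string[]): string[] {
  return ids.filter((id) => !isConventionalFieldId(id));
}
```

For example, findUnconventionalIds(["invoice_number", "Total Amount"]) reports only "Total Amount".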

Error Handling

The API will return validation errors if your configuration is invalid. Common issues include:

  • Missing required fields
  • Invalid field types
  • Malformed enum options
  • Invalid base processor references
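
Several of these issues can be caught locally before a request is sent. The sketch below is a hypothetical pre-flight check for extraction fields, based only on the ExtractionField schema in this guide. The function name, the message wording, and the rule that an enum field needs at least one option are assumptions; the API's own validation remains authoritative.

```typescript
// Field types listed in the ExtractionField schema.
const FIELD_TYPES = [
  "string", "number", "currency", "boolean", "date",
  "array", "enum", "object", "signature",
];

// Loosely typed on purpose: the input may be JSON that has not been checked yet.
type RawField = Record<string, unknown>;

// Hypothetical helper: returns one message per problem found, recursing into nested schemas.
function findFieldProblems(fields: RawField[], path = "fields"): string[] {
  const problems: string[] = [];
  fields.forEach((field, index) => {
    const where = `${path}[${index}]`;
    // id, name, type, and description are required by the schema.
    for (const key of ["id", "name", "type", "description"]) {
      const value = field[key];
      if (typeof value !== "string" || value === "") {
        problems.push(`${where}: missing required property "${key}"`);
      }
    }
    const type = field["type"];
    if (typeof type === "string" && FIELD_TYPES.indexOf(type) === -1) {
      problems.push(`${where}: invalid field type "${type}"`);
    }
    if (type === "enum") {
      const options = field["enum"];
      // Assumption: an enum field needs at least one { value, description } option.
      const wellFormed =
        Array.isArray(options) &&
        options.length > 0 &&
        options.every((o) => typeof o?.value === "string" && typeof o?.description === "string");
      if (!wellFormed) problems.push(`${where}: malformed enum options`);
    }
    const nested = field["schema"];
    if (Array.isArray(nested)) {
      problems.push(...findFieldProblems(nested, `${where}.schema`));
    }
  });
  return problems;
}
```

An empty result means none of the checked issues were found; it does not guarantee that the API will accept the configuration.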

Schema Definitions

Base Processor Schema

All processor configurations share these base properties:

interface BaseProcessorConfig {
  // Base properties inherited by all processor types
  baseProcessor?: string;
  baseVersion?: string;
}

Both properties default to the latest available version when the processor is created, unless you specify otherwise. See the Changelog for more details.
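
As an illustration of pinning, the sketch below adds both base properties to an existing configuration. The helper name and the placeholder strings are hypothetical; substitute the base processor and version you actually intend to pin.

```typescript
interface BaseProcessorConfig {
  baseProcessor?: string;
  baseVersion?: string;
}

// Hypothetical helper (not part of the Extend API): returns a copy of a config
// with both base properties set, leaving the original object untouched.
function pinBase<T extends object>(
  config: T,
  baseProcessor: string,
  baseVersion: string,
): T & Required<BaseProcessorConfig> {
  return { ...config, baseProcessor, baseVersion };
}

// No base properties: the defaults described above apply on creation.
const unpinned = { type: "CLASSIFY", classifications: [] };

// Placeholder values, not real identifiers.
const pinned = pinBase(unpinned, "YOUR_BASE_PROCESSOR", "YOUR_BASE_VERSION");
```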

Extraction Processor Schema

interface ExtractionConfig extends BaseProcessorConfig {
  type: "EXTRACT";
  fields: ExtractionField[];
  customDefinitions?: string;
  customExtractionRules?: string;
  customDocumentKind?: string;
}

interface ExtractionField {
  id: string;
  name: string;
  type: "string" | "number" | "currency" | "boolean" | "date" | 
        "array" | "enum" | "object" | "signature";
  description: string;
  // For nested fields (arrays, objects, signatures)
  schema?: ExtractionField[];
  // For enum fields: the allowed output options
  enum?: Enum[];
}

interface Enum {
  value: string;
  description: string;
}

Classification Processor Schema

interface ClassificationConfig extends BaseProcessorConfig {
  type: "CLASSIFY";
  classifications: Classification[];
  customClassificationRules?: string;
}

interface Classification {
  id: string;
  type: string;
  description: string;
}

Splitter Processor Schema

interface SplitterConfig extends BaseProcessorConfig {
  type: "SPLITTER";
  subDocumentClassifications: Classification[];
  customSplitterRules?: string;
  customReminders?: string;
  identifierRules?: string;
}
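
Because each configuration carries a literal type value, the three interfaces can be combined into a discriminated union and narrowed with a switch. The sketch below is one way to do that; the ProcessorConfig alias and the summarizeConfig helper are assumptions for illustration, and the interfaces are trimmed to the properties the example needs.

```typescript
// Trimmed copies of the schemas above, limited to what this sketch uses.
interface Classification {
  id: string;
  type: string;
  description: string;
}
interface ExtractionConfig {
  type: "EXTRACT";
  fields: { id: string }[];
}
interface ClassificationConfig {
  type: "CLASSIFY";
  classifications: Classification[];
}
interface SplitterConfig {
  type: "SPLITTER";
  subDocumentClassifications: Classification[];
}

// Hypothetical alias: any of the three processor configurations.
type ProcessorConfig = ExtractionConfig | ClassificationConfig | SplitterConfig;

// The switch on `type` narrows `config` to the matching interface in each branch.
function summarizeConfig(config: ProcessorConfig): string {
  switch (config.type) {
    case "EXTRACT":
      return `EXTRACT with ${config.fields.length} top-level field(s)`;
    case "CLASSIFY":
      return `CLASSIFY with ${config.classifications.length} classification(s)`;
    case "SPLITTER":
      return `SPLITTER with ${config.subDocumentClassifications.length} sub-document classification(s)`;
  }
}
```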

Updating Processor Configurations

You can update a processor’s configuration using the following endpoint:

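// POST /processors/:id
// ProcessorConfig is assumed here to be the union of the three config types defined above.
type ProcessorConfig = ExtractionConfig | ClassificationConfig | SplitterConfig;

const updateProcessor = async (processorId: string, config: ProcessorConfig) => {
  const response = await fetch(`https://api-prod.extend.app/processors/${processorId}`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer YOUR_API_KEY", // Replace with your API key
    },
    body: JSON.stringify({ config }),
  });
  // Surface validation and other HTTP errors, including the response body, to the caller
  if (!response.ok) {
    throw new Error(`Processor update failed (${response.status}): ${await response.text()}`);
  }
  return response.json();
};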
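
When many processors need updating, a small loop over processor IDs keeps track of which updates succeeded. The sketch below is hypothetical: updateMany and its result shape are not part of the API. It takes the update function as a parameter and assumes that function throws when an update fails.

```typescript
// Any function that updates one processor and rejects on failure.
type UpdateFn<C> = (processorId: string, config: C) => Promise<unknown>;

interface BulkResult {
  succeeded: string[];
  failed: { processorId: string; error: string }[];
}

// Hypothetical helper: applies each config to its processor, one at a time.
async function updateMany<C>(
  configsById: Record<string, C>,
  update: UpdateFn<C>,
): Promise<BulkResult> {
  const result: BulkResult = { succeeded: [], failed: [] };
  // Sequential on purpose: simple to reason about and avoids bursts of requests.
  for (const processorId of Object.keys(configsById)) {
    try {
      await update(processorId, configsById[processorId] as C);
      result.succeeded.push(processorId);
    } catch (err) {
      // Record the failure and continue with the remaining processors.
      const error = err instanceof Error ? err.message : String(err);
      result.failed.push({ processorId, error });
    }
  }
  return result;
}
```

A call might look like updateMany({ [processorId]: extractionConfig }, updateProcessor), where the keys are your own processor IDs.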

Extraction Processor Configuration

Extraction processors are used to extract specific fields from documents. They support a wide range of field types and nested structures.

Basic Example

const extractionConfig = {
  type: "EXTRACT",
  fields: [
    {
      id: "invoice_number",
      name: "Invoice Number",
      type: "string",
      description: "The unique identifier for this invoice",
    },
    {
      id: "amount",
      name: "Total Amount",
      type: "currency",
      description: "The total amount of the invoice",
    },
  ],
};

All Field Types

Extraction processors support these field types:

  • string: Text values
  • number: Numeric values
  • currency: Monetary values
  • boolean: True/false values
  • date: Date values
  • array: Lists of values
  • enum: Predefined, constrained text output options
  • object: Nested structures
  • signature: Signature information
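
The sketch below illustrates three types that the other examples in this guide do not use: boolean, date, and object. The field IDs, names, and descriptions are made up for illustration. A signature field is left out because this guide does not describe its nested schema.

```typescript
const fieldTypeExamples = {
  type: "EXTRACT",
  fields: [
    {
      id: "is_paid",
      name: "Is Paid",
      type: "boolean",
      description: "Whether the invoice is marked as paid",
    },
    {
      id: "due_date",
      name: "Due Date",
      type: "date",
      description: "Date by which payment is due",
    },
    {
      id: "billing_address",
      name: "Billing Address",
      type: "object",
      description: "Address the invoice is billed to",
      // Object fields describe their properties with a nested schema
      schema: [
        {
          id: "street",
          name: "Street",
          type: "string",
          description: "Street name and number",
        },
        {
          id: "city",
          name: "City",
          type: "string",
          description: "City of the billing address",
        },
      ],
    },
  ],
};
```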

Example with Nested Fields

const complexExtractionConfig = {
  type: "EXTRACT",
  fields: [
    {
      id: "line_items",
      name: "Line Items",
      type: "array",
      description: "Individual items in the invoice",
      schema: [
        {
          id: "item_name",
          name: "Item Name",
          type: "string",
          description: "Name of the item",
        },
        {
          id: "quantity",
          name: "Quantity",
          type: "number",
          description: "Number of items",
        },
        {
          id: "unit_price",
          name: "Unit Price",
          type: "currency",
          description: "Price per unit",
        },
      ],
    },
    {
      id: "payment_status",
      name: "Payment Status",
      type: "enum",
      description: "Current payment status",
      enum: [
        {
          value: "PAID",
          description: "Payment has been completed",
        },
        {
          value: "PENDING",
          description: "Payment is pending",
        },
      ],
    },
  ],
  customExtractionRules: "- If ...", // Optional custom rules
  customDocumentKind: "invoice", // Optionally specify a document kind
};

Example with Nested Arrays and Objects

const nestedArrayConfig = {
  type: "EXTRACT",
  fields: [
    {
      id: "orders",
      name: "Orders",
      type: "array", 
      description: "List of customer orders",
      schema: [
        {
          id: "order_id",
          name: "Order ID",
          type: "string",
          description: "Unique identifier for the order"
        },
        {
          id: "customer_name",
          name: "Customer Name",
          type: "string",
          description: "Name of the customer"
        },
        {
          id: "shipments",
          name: "Shipments",
          type: "array",
          description: "List of shipments for this order",
          schema: [
            {
              id: "tracking_number",
              name: "Tracking Number", 
              type: "string",
              description: "Shipping tracking number"
            },
            {
              id: "ship_date",
              name: "Ship Date",
              type: "date",
              description: "Date the shipment was sent"
            },
            {
              id: "carrier",
              name: "Carrier",
              type: "string", 
              description: "Shipping carrier name"
            }
          ]
        }
      ]
    }
  ]
};
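
Deeply nested schemas can be hard to review by eye. The sketch below is a hypothetical helper that walks a fields array and lists every field as a dotted path, marking arrays with []. The path notation is an assumption made for readability and is not an Extend format.

```typescript
// Minimal shape needed to walk a schema; mirrors part of ExtractionField.
interface FieldNode {
  id: string;
  type: string;
  schema?: FieldNode[];
}

// Hypothetical helper: returns one path per field, parents before children.
function listFieldPaths(fields: FieldNode[], prefix = ""): string[] {
  const paths: string[] = [];
  for (const field of fields) {
    const arrayMarker = field.type === "array" ? "[]" : "";
    const path = (prefix ? `${prefix}.` : "") + field.id + arrayMarker;
    paths.push(path);
    // Recurse into nested fields, using this field's path as the new prefix.
    if (field.schema) {
      paths.push(...listFieldPaths(field.schema, path));
    }
  }
  return paths;
}
```

Applied to nestedArrayConfig.fields, the result starts with orders[] and includes entries such as orders[].shipments[].tracking_number.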

Classification Processor Configuration

Classification processors categorize documents into predefined types.

Example

const classificationConfig = {
  type: "CLASSIFY",
  classifications: [
    {
      id: "invoice",
      type: "INVOICE",
      description: "Standard invoice document ...",
    },
    {
      id: "bill_of_lading",
      type: "BILL_OF_LADING",
      description: "Bill of Lading document ...",
    },
  ],
  customClassificationRules: "- If ...", // Optional custom rules
};
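
In the example above, each id is the lowercase form of its type. If you want to follow the same pattern, a small helper can derive one from the other. This is a sketch of that observed convention, not a requirement stated in this guide; the helper name and the sample descriptions are hypothetical.

```typescript
interface Classification {
  id: string;
  type: string;
  description: string;
}

// Hypothetical helper: derives the id by lowercasing the type,
// mirroring the pattern in the example above (INVOICE -> invoice).
function toClassification(type: string, description: string): Classification {
  return { id: type.toLowerCase(), type, description };
}

// Illustrative descriptions; write detailed ones for real configurations.
const generatedClassifications: Classification[] = [
  toClassification("INVOICE", "Invoice issued by a vendor"),
  toClassification("BILL_OF_LADING", "Carrier-issued shipping document"),
];
```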

Splitter Processor Configuration

Splitter processors divide documents into logical sub-documents based on defined classifications.

Example

const splitterConfig = {
  type: "SPLITTER",
  subDocumentClassifications: [
    {
      id: "purchase_contract",
      type: "PURCHASE_CONTRACT",
      description: "Purchase contract section",
    },
    {
      id: "addendum",
      type: "ADDENDUM",
      description: "Addendum section",
    },
  ],
  customSplitterRules: "- If ...", // Optional custom rules
  identifierRules: "- If ...", // Optional identifier rules
};
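
Splitter and classification configurations both use the Classification shape, so one list of definitions can feed both. The sketch below shows that idea with two hypothetical builder functions; whether sharing definitions suits your documents is a judgment call.

```typescript
interface Classification {
  id: string;
  type: string;
  description: string;
}

// Shared definitions, taken from the splitter example above.
const contractParts: Classification[] = [
  { id: "purchase_contract", type: "PURCHASE_CONTRACT", description: "Purchase contract section" },
  { id: "addendum", type: "ADDENDUM", description: "Addendum section" },
];

// Hypothetical builder: creates a splitter config from shared definitions.
function buildSplitterConfig(parts: Classification[], customSplitterRules?: string) {
  return {
    type: "SPLITTER" as const,
    subDocumentClassifications: parts,
    // Include the optional rules only when they were provided.
    ...(customSplitterRules ? { customSplitterRules } : {}),
  };
}

// Hypothetical builder: creates a classification config from the same definitions.
function buildClassificationConfig(classifications: Classification[]) {
  return { type: "CLASSIFY" as const, classifications };
}

const sharedSplitterConfig = buildSplitterConfig(contractParts);
const sharedClassificationConfig = buildClassificationConfig(contractParts);
```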