Current Config Structure

This section gives background on the current config structure and the new JSON Schema config structure. To skip ahead to the migration steps, go straight to the Migrating to JSON Schema section.

If your organization started using Extend before April 2025, you have likely been using the legacy Fields Array config type.

This means that the config object in the processor has a fields array that contains the processor’s fields. Here is an example config object of this type:

{
  "type": "EXTRACT",
  "fields": [
    {
      "id": "invoice_number",
      "name": "invoice_number",
      "type": "string",
      "description": "The unique identifier for this invoice"
    },
    {
      "id": "amount",
      "name": "amount",
      "type": "currency",
      "description": "The total amount of the invoice"
    }
  ]
  // other fields...
}

This schema has worked well. Since we released it, however, the industry has standardized on JSON Schema as the way to define response schemas. To make our processors easier for developers to use, we are moving to JSON Schema as the way schemas are defined for processors.

New JSON Schema Config Structure

A JSON Schema config object equivalent of the above example is:

{
  "type": "EXTRACT",
  "schema": {
    "type": "object",
    "properties": {
      "invoice_number": {
        "type": ["string", "null"],
        "description": "The unique identifier for this invoice"
      },
      "amount": {
        "type": "object",
        "properties": {
          "amount": {
            "type": ["number", "null"]
          },
          "iso_4217_currency_code": {
            "type": ["string", "null"]
          }
        },
        "required": ["amount", "iso_4217_currency_code"],
        "additionalProperties": false
      }
    },
    "required": ["invoice_number", "amount"],
    "additionalProperties": false
  }
  // other fields...
}

You’ll notice that instead of the fields array, we have a schema object. This object is a JSON Schema object that describes the shape of the output you will receive from the processor.

The JSON Schema standard is extremely flexible. We’ve implemented a subset of the standard to support the needs of document extraction. Your schema must follow these rules:

  • The root must be an object type
  • Allowed types are string, number, integer, boolean, object, and array
  • All primitive fields (string, number, boolean, integer) must be nullable (use an array type with "null" as an option, e.g. "type": ["string", "null"])
  • Maximum nesting level is 3 (each non-root object counts as 1 level)
  • Property keys and names must only contain lowercase letters, numbers, and underscores
  • Array items must be objects
  • Enums must only contain strings and must contain a null option
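
The rules above can be checked mechanically before a schema is saved. The following is a minimal sketch, not Extend's own validator: the SchemaNode type, the validateSchema and visit functions, and the error messages are all hypothetical. It also assumes that an array does not add a nesting level by itself, so only the object inside the array counts.

```typescript
// Minimal node shape for this sketch; not an official Extend type.
type SchemaNode = {
  type?: string | string[];
  properties?: Record<string, SchemaNode>;
  items?: SchemaNode;
  enum?: unknown[];
};

const ALLOWED_TYPES = new Set(["string", "number", "integer", "boolean", "object", "array"]);
const PRIMITIVE_TYPES = new Set(["string", "number", "integer", "boolean"]);
const KEY_PATTERN = /^[a-z0-9_]+$/;
const MAX_NESTING_LEVEL = 3;

// Returns human-readable rule violations; an empty array means every check passed.
function validateSchema(root: SchemaNode): string[] {
  if (root.type !== "object") return ["root: must be an object type"];
  const errors: string[] = [];
  visit(root, "root", 0, errors);
  return errors;
}

// `level` is the nesting level this node has if it is an object (root = 0).
function visit(node: SchemaNode, path: string, level: number, errors: string[]): void {
  const types: string[] =
    node.type === undefined ? [] : Array.isArray(node.type) ? node.type : [node.type];
  const nonNull = types.filter((t) => t !== "null");

  for (const t of nonNull) {
    if (!ALLOWED_TYPES.has(t)) errors.push(`${path}: type "${t}" is not allowed`);
  }
  if (nonNull.some((t) => PRIMITIVE_TYPES.has(t)) && !types.includes("null")) {
    errors.push(`${path}: primitive fields must be nullable`);
  }
  if (node.enum !== undefined) {
    const valid =
      node.enum.includes(null) &&
      node.enum.every((v) => v === null || typeof v === "string");
    if (!valid) errors.push(`${path}: enums must contain only strings plus null`);
  }
  if (nonNull.includes("object")) {
    if (level > MAX_NESTING_LEVEL) {
      errors.push(`${path}: exceeds maximum nesting level of ${MAX_NESTING_LEVEL}`);
    }
    for (const [key, child] of Object.entries(node.properties ?? {})) {
      if (!KEY_PATTERN.test(key)) {
        errors.push(`${path}.${key}: keys may only use lowercase letters, numbers, and underscores`);
      }
      visit(child, `${path}.${key}`, level + 1, errors);
    }
  }
  if (nonNull.includes("array")) {
    const itemTypes = node.items?.type;
    const itemsAreObjects =
      itemTypes === "object" || (Array.isArray(itemTypes) && itemTypes.includes("object"));
    if (!node.items || !itemsAreObjects) {
      errors.push(`${path}: array items must be objects`);
    } else {
      // Assumption: the array itself does not add a level; its item object does.
      visit(node.items, `${path}[]`, level, errors);
    }
  }
}
```

Calling validateSchema with the schema object from a processor config returns an empty array when no rule in this sketch is violated, and one message per violation otherwise.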

While we support the JSON Schema structure, we do not support many of the standard’s additional features.

Current Output Structure

The current output structure for Extraction processors is an object keyed by field name. Each value is an object with the following properties:

  • id: The unique identifier for the field
  • type: The type of the field
  • value: The value of the field
  • confidence: The confidence score of the field
  • insights: The insights for the field
  • references: The references for the field

Here is an example of the output:

{
  "invoice_number": {
    "id": "invoice_number",
    "type": "string",
    "value": "36995",
    "confidence": 0.98,
    "insights": [
      {
        "type": "reasoning",
        "content": "The invoice number is clearly labeled as 'Invoice #36995' at the top right of the document, making it straightforward to extract."
      }
    ],
    "references": [
      {
        "page": 1,
        "boundingBoxes": [
          {
            "left": 296.73359999999997,
            "top": 40.888799999999996,
            "right": 386.4168,
            "bottom": 52.1712
          }
        ],
        "referenceText": "Invoice #36995"
      }
    ]
  },
  "amount": {
    "id": "amount",
    "type": "number",
    "value": 15735.1,
    "confidence": 0.98,
    "insights": [
      {
        "type": "reasoning",
        "content": "The total amount is shown as '$15,735.1' in both the table summary and the bottom right of the document. The currency symbol '$' and the US address indicate the currency is USD. The value is numeric and matches the required format."
      }
    ],
    "references": [
      {
        "page": 1,
        "boundingBoxes": [
          {
            "left": 430.164,
            "top": 722.772,
            "right": 467.27279999999996,
            "bottom": 722.8296
          }
        ],
        "referenceText": "TOTAL  $15,735.1"
      }
    ]
  }
}

In this output, metadata such as confidence, insights, and references is nested inside each field’s object, right next to the value. The benefit is that the metadata for a specific field is very easy to access. The downside is that this structure doesn’t work well for recursive fields like arrays and objects.
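
Because each field carries its own metadata, reading it is a single property lookup. As a small illustration, the sketch below lists the fields whose confidence falls under a threshold. The LegacyFieldOutput and LegacyOutput types and the lowConfidenceFields helper are hypothetical names written for this example, modeled only on the properties listed above.

```typescript
// Hypothetical types modeled on the legacy output example; not official Extend types.
type LegacyFieldOutput = {
  id: string;
  type: string;
  value: unknown;
  confidence: number;
  insights?: { type: string; content: string }[];
  references?: unknown[];
};

type LegacyOutput = Record<string, LegacyFieldOutput>;

// Returns the names of fields whose confidence is below the given threshold.
function lowConfidenceFields(output: LegacyOutput, threshold: number): string[] {
  return Object.entries(output)
    .filter(([, field]) => field.confidence < threshold)
    .map(([name]) => name);
}
```

For the example output above, a threshold of 0.99 would return both field names, since each has a confidence of 0.98.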

New JSON Schema Output Structure

The output structure for JSON Schema processors is composed of two properties: value and metadata.

The value property is the actual data extracted from the document which conforms to the JSON Schema defined in the processor config.

The metadata property contains additional information about the data extracted from the document like confidence scores, reasoning, and citations.

Below is an example of the output you will receive from a JSON Schema processor:

{
  "value": {
    "amount": {
      "amount": 15735.1,
      "iso_4217_currency_code": "USD"
    },
    "invoice_number": "36995"
  },
  "metadata": {
    "amount": {
      "insights": [
        {
          "type": "reasoning",
          "content": "The total amount is shown as '$15,735.1' in both the table summary and the bottom right of the document. The currency symbol '$' and the US address indicate the currency is USD. The value is numeric and matches the required format."
        }
      ],
      "citations": [
        {
          "page": 1,
          "polygon": [
            {
              "x": 430.164,
              "y": 722.772
            },
            {
              "x": 467.27279999999996,
              "y": 722.8296
            },
            {
              "x": 467.2584,
              "y": 731.6351999999999
            },
            {
              "x": 430.1496,
              "y": 731.5776
            }
          ],
          "referenceText": "TOTAL  $15,735.1"
        }
      ],
      "ocrConfidence": 0.992,
      "logprobsConfidence": 1
    },
    "invoice_number": {
      "insights": [
        {
          "type": "reasoning",
          "content": "The invoice number is clearly labeled as 'Invoice #36995' at the top right of the document, making it straightforward to extract."
        }
      ],
      "citations": [
        {
          "page": 1,
          "polygon": [
            {
              "x": 296.73359999999997,
              "y": 40.888799999999996
            },
            {
              "x": 386.4168,
              "y": 40.464000000000006
            },
            {
              "x": 386.4744,
              "y": 52.1712
            },
            {
              "x": 296.7912,
              "y": 52.596000000000004
            }
          ],
          "referenceText": "Invoice #36995"
        }
      ],
      "ocrConfidence": 0.986,
      "logprobsConfidence": 1
    }
  }
}

The benefit of this output structure is that it’s very easy to access the data for a specific field, and the value should be easy to ingest as a typed object because it conforms to the JSON Schema defined in the processor config.
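
Since value and metadata are separate objects that share the same top-level keys in the example above, the two can be joined by key when both are needed. The sketch below is one possible way to do that for top-level fields only. The FieldMetadata, JsonSchemaOutput, and FieldSummary types and the summarizeFields helper are hypothetical, and taking the lower of the two confidence scores is an assumption made for illustration, not a documented recommendation.

```typescript
// Minimal local types for this sketch; they cover only the properties used here.
type FieldMetadata = {
  ocrConfidence?: number | null;
  logprobsConfidence: number | null;
};

type JsonSchemaOutput = {
  value: Record<string, unknown>;
  metadata: Record<string, FieldMetadata | undefined>;
};

type FieldSummary = {
  name: string;
  value: unknown;
  // Lowest available confidence score, or null when none is present.
  minConfidence: number | null;
};

// Pairs each top-level value with the lowest confidence score in its metadata.
function summarizeFields(output: JsonSchemaOutput): FieldSummary[] {
  return Object.entries(output.value).map(([name, value]) => {
    const meta = output.metadata[name];
    const scores = [meta?.ocrConfidence, meta?.logprobsConfidence].filter(
      (score): score is number => typeof score === "number"
    );
    return {
      name,
      value,
      minConfidence: scores.length > 0 ? Math.min(...scores) : null,
    };
  });
}
```

The value object can be passed on to downstream code unchanged, while a summary like this one can drive review queues or logging.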

The TypeScript types for the output are the following:

export type ExtractionOutput = {
  value: ExtractionValue;
  metadata: ExtractionMetadata;
};

export type ExtractionValue = Record<string, any>;
export type ExtractionMetadata = {
  [key: string]: ExtractionMetadataEntry | undefined;
};

export interface ExtractionMetadataEntry {
  ocrConfidence?: number | null;
  logprobsConfidence: number | null;
  citations?: Citation[];
  insights?: OutputInsight[];
}

export type Citation = {
  page?: number;
  referenceText?: string | null;
  polygon?: Point2D[];
};

export type Point2D = {
  /**
   * x coordinate - relative from the left side of the page
   */
  x: number;
  /**
   * y coordinate - relative from the top of the page
   */
  y: number;
};

export type OutputInsight = {
  type: "reasoning";
  content: string;
};
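
Code written against the legacy references may expect a left/top/right/bottom box rather than a polygon. The sketch below derives an axis-aligned box that encloses a citation polygon. The BoundingBox type and the polygonToBoundingBox helper are hypothetical, and the result is not guaranteed to match the legacy boundingBoxes values exactly, because a slightly rotated polygon has a larger enclosing box.

```typescript
// Local copy of the point shape so the sketch is self-contained.
type Point2D = { x: number; y: number };

// Hypothetical type mirroring the legacy boundingBoxes entries.
type BoundingBox = { left: number; top: number; right: number; bottom: number };

// Returns the smallest axis-aligned box enclosing the polygon,
// or null when the polygon has no points.
function polygonToBoundingBox(polygon: Point2D[]): BoundingBox | null {
  if (polygon.length === 0) return null;
  const xs = polygon.map((point) => point.x);
  const ys = polygon.map((point) => point.y);
  return {
    left: Math.min(...xs),
    top: Math.min(...ys),
    right: Math.max(...xs),
    bottom: Math.max(...ys),
  };
}
```

Because Citation.polygon is optional, callers would typically pass citation.polygon ?? [] and handle the null result.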

Migrating to JSON Schema

To migrate a processor from the legacy Fields Array config type to the JSON Schema config type, you will need to:

  1. Go to the processor in Studio that you’d like to migrate.
  2. Click the button with the three vertical dots in the top right corner to open the settings menu.

  3. Click “Migrate to JSON Schema”. This will open a modal where you can select the version and choose the name for the new processor.
  4. In the modal, click “Migrate to JSON Schema”. This will create a new processor with the fields array replaced with a JSON Schema config object.

Please share any feedback you have on the new JSON Schema config type and output structure with us on Slack!