Configure an Extraction processor
How to configure an Extraction processor in the Extend Studio
Builder
Once you have created an Extraction processor, you can navigate to the processor detail view and select the “Build” tab.
If you have previously configured extraction steps in our legacy workflow extraction editor, this view will look and feel familiar.
You can save changes without running, or save and run by clicking the “Save and run extraction” button at the bottom. The results will then show up below.
Fields
To configure a field, add a semantically accurate field name and write a description that explains how to identify and extract that field from the document.
You must also configure the proper field type:
Text
Use the text data type when you want to extract a string of text from a document. For example, if you want to extract the name of a person from a document, you would use the text data type.
Number
Use the number data type when you want to extract a number from a document. For example, if you want to extract the age of a person from a document, you would use the number data type.
Currency
Use the currency data type when you want to extract a currency value from a document. For example, if you want to extract the price of a product from a document, you would use the currency data type.
Boolean
Use the boolean data type when you want to extract a boolean value from a document. For example, if you want to extract whether a product is in stock from a document, you would use the boolean data type.
Date
Use the date data type when you want to extract a date from a document. For example, if you want to extract the date of birth of a person from a document, you would use the date data type.
Signature
Use the signature data type when you want to extract a signature from a document. For example, if you want to extract the signature of a person from a document, you would use the signature data type. Signature fields automatically extract the following details from a document’s signature block:
- is_signed
- printed_name
- signatory_title
- signature_date
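As a rough illustration, the four details listed above could be modeled in downstream code like this. The sub-field names come from the list; the Python types, the use of `None` for missing values, and the `is_fully_executed` helper are assumptions made for this sketch, not a documented output format.

```python
from typing import Optional, TypedDict


class SignatureValue(TypedDict):
    """Hypothetical shape for one extracted signature block."""
    is_signed: bool
    printed_name: Optional[str]
    signatory_title: Optional[str]
    signature_date: Optional[str]  # assumed here to be an ISO 8601 date string


def is_fully_executed(sig: SignatureValue) -> bool:
    """Return True when the block is signed and carries both a name and a date."""
    return bool(sig["is_signed"] and sig["printed_name"] and sig["signature_date"])


# Illustrative values only, not real extraction output.
example: SignatureValue = {
    "is_signed": True,
    "printed_name": "Jane Doe",
    "signatory_title": "Director",
    "signature_date": "2024-03-01",
}
```

A check like `is_fully_executed(example)` is one way a downstream system might decide whether a document needs manual follow-up.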
Object
Use the object data type when you want to extract a set of related fields from a document. For example, if you want to extract the address, name, and birth date of a person from a document, you would use the object data type.
Array
Use the array data type when you want to extract a list of items that share the same set of fields from a document. For example, if you want to extract a list of products that each have a name, price, and quantity from a document, you would use the array data type.
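To make the relationship between the field types more concrete, here is a minimal sketch that models a field configuration in Python. The `FieldSpec` class and its attribute names are hypothetical and exist only for this example; fields are configured in the Studio UI, and this is not the platform's actual schema. Only the eight type names are taken from the list above.

```python
from dataclasses import dataclass, field
from typing import List

# The eight type names mirror the list above; everything else is illustrative.
FIELD_TYPES = {
    "text", "number", "currency", "boolean",
    "date", "signature", "object", "array",
}


@dataclass
class FieldSpec:
    """Hypothetical in-memory model of one configured field."""
    name: str
    field_type: str
    description: str
    children: List["FieldSpec"] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.field_type not in FIELD_TYPES:
            raise ValueError(f"unknown field type: {self.field_type}")
        # In this sketch, only object and array fields may contain nested fields.
        if self.children and self.field_type not in {"object", "array"}:
            raise ValueError(f"{self.field_type} fields cannot have nested fields")


# The product-list example from the Array description, expressed as a spec.
line_items = FieldSpec(
    name="line_items",
    field_type="array",
    description="Every product row in the order table.",
    children=[
        FieldSpec("name", "text", "Product name as printed on the row."),
        FieldSpec("price", "currency", "Unit price for the product."),
        FieldSpec("quantity", "number", "Number of units ordered."),
    ],
)
```

An object field would be modeled the same way, with the difference that it describes one set of related fields rather than a repeating list.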
Configuration table
In the field configuration table, you can use the drag button to move a field up or down. Performance is best when fields that are related in the document are also placed together, in a matching order, in the configuration table.
You can also set a field ID, which is a unique identifier for the field to use in your downstream system, so that you can make changes to the semantic field name without updating your downstream system.
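To illustrate why a stable field ID helps, here is a minimal sketch of downstream code that looks values up by ID instead of by name. The output shape (a list of records with `id`, `name`, and `value` keys) and the ID `invoice_total` are hypothetical; check the actual output format before relying on it.

```python
from typing import Any, Dict, List


def index_by_field_id(fields: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Map each field's stable ID to its extracted value, ignoring the display name."""
    return {f["id"]: f["value"] for f in fields}


# The same field before and after its semantic name was edited (illustrative data).
before_rename = [{"id": "invoice_total", "name": "Total", "value": 120.5}]
after_rename = [{"id": "invoice_total", "name": "Total amount due", "value": 120.5}]

# Downstream code keyed on the ID reads the same value either way.
total_before = index_by_field_id(before_rename)["invoice_total"]
total_after = index_by_field_id(after_rename)["invoice_total"]
```

Code keyed on the `name` value instead would have needed an update after the rename.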
Configuring Custom Settings
In addition to the fields, you can also configure custom settings for each field. These settings allow you to further customize the extraction process to better suit your specific needs. However, please note that these settings are experimental and may not work as expected in all cases.
Before using these settings, we recommend consulting with the Extend team to understand their potential impact on the extraction process.
Using the Run tab
While it often makes sense to run files from the “Build” tab when getting set up, once you are ready to start testing your processor at scale, you should move over to the “Run” tab.
From this tab you can:
- Quickly run any number of files in a batch
- Select the version of the processor you want to run (or default to the saved draft version)
- Run an existing Evaluation set for the processor
Once you run a batch of files, you will be redirected to a results page.
From here you can:
- See at a glance the coverage of fields extracted and average confidence
- Drill down into individual files to see the extracted fields and confidence levels
- Optionally correct or edit the results of each output, then turn the entire batch into an Evaluation set
Note: we recommend not creating Evaluation sets until you have mostly finalized which fields you are extracting, even if you are still iterating on the field descriptions. Evaluation sets are used to compare current and expected outputs, so if you add or remove fields from the processor, the expected outputs in the Evaluation set will no longer match the processor's fields. Everything will still run, but metrics like accuracy and coverage will drop, and you will need to update the Evaluation set to reflect the new expected outputs. Doing this repeatedly is tedious, so it is best to wait until you have mostly finalized the set of fields you are extracting or the set of classification types you are using.
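The effect described in this note can be shown with a small sketch. The `accuracy` function below uses a simplified definition chosen for illustration (the share of fields, across expected and actual outputs, whose values match exactly); it is an assumption and not the formula the platform uses for its metrics.

```python
from typing import Any, Dict


def accuracy(expected: Dict[str, Any], actual: Dict[str, Any]) -> float:
    """Share of fields (across both sides) whose values match exactly."""
    all_fields = set(expected) | set(actual)
    if not all_fields:
        return 1.0
    matches = sum(
        1
        for name in all_fields
        if name in expected and name in actual and expected[name] == actual[name]
    )
    return matches / len(all_fields)


# Expected output saved in the Evaluation set (illustrative data).
expected = {"vendor": "Acme", "total": 120.5}

# Output from the unchanged processor, and from a version with one extra field.
same_fields = {"vendor": "Acme", "total": 120.5}
added_field = {"vendor": "Acme", "total": 120.5, "due_date": "2024-04-01"}

score_unchanged = accuracy(expected, same_fields)  # every field matches
score_after_add = accuracy(expected, added_field)  # 2 of 3 fields match
```

Under this definition the score drops after the field is added even though the two original fields are still extracted correctly, because the new field has no expected value to compare against until the Evaluation set is updated.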
Publishing
See the Publishing page for more information on how to publish and use processors.