Training Data Management (TDM) is where you prepare and manage the data used to train your models. In TDM, you review and annotate documents, build your training dataset, and improve model performance for your specific use case.
In this article, you’ll learn how TDM allows you to prepare, annotate, and manage training data for specialized ORCA VLM models.
Prerequisites
Before you start, ensure the following requirements are met:
The ORCA base model is installed. Follow the steps in Installing ORCA VLMs to install and configure the base model in your instance.
A Semi-structured layout with fields is locked and associated with the flow that is using ORCA. Learn more in Creating Semi-structured Layouts.
A model definition exists for the layout. The model definition links the layout to the training configuration and enables model training. Learn more in Model Definitions.
Learn how to navigate TDM for ORCA VLMs and understand its key sections below.
Model details page
The sections below explain the key information shown in the Overview tab.
Pre-trained model
Before you train a specialized model, the Model summary card will display a pre-trained candidate. This model is the ORCA base model. To learn more, see ORCA (Optical Reasoning and Cognition Agent) VLMs.
Model summary card
The Model summary card provides insights into all models trained for a specific model definition. See the table below to learn about the displayed details:
| Field | Description | Notes |
|---|---|---|
| State | Shows the model’s state. | |
| Projected automation | Displays the performance of the model that’s currently live. | The predicted automation based on the desired target accuracy. The projection is derived from the model’s training data. The system automatically ensures that the same data is not used for both projections and training. |
| Test Target Accuracy | The accuracy percentage used to calculate the projected automation. | Indicates the desired overall system accuracy. |
| Trained | Date the model was trained. | |
| Layout version | The layout version used for this model definition. | Always use the latest locked layout version. |
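The note on Projected automation states that the same data is not used for both projections and training. One common way to get that property is a disjoint holdout split. The sketch below assigns each document to exactly one set based on a hash of its ID. The function name, the 20% holdout share, and the hashing scheme are assumptions made for illustration, not the product's documented behavior.

```python
import hashlib
from typing import Iterable, List, Tuple


def split_documents(doc_ids: Iterable[str],
                    holdout_share: float = 0.2) -> Tuple[List[str], List[str]]:
    """Split document IDs into disjoint training and holdout lists.

    Hypothetical helper: each ID is hashed to a number in [0, 1), so the
    same document always lands in the same set across runs.
    """
    training: List[str] = []
    holdout: List[str] = []
    for doc_id in doc_ids:
        digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
        # Map the first 8 bytes of the hash to a value in [0, 1).
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        if bucket < holdout_share:
            holdout.append(doc_id)
        else:
            training.append(doc_id)
    return training, holdout


ids = [f"doc-{i}" for i in range(200)]
training_ids, holdout_ids = split_documents(ids)
print(len(training_ids), "training,", len(holdout_ids), "held out for projection")
```

Because the assignment depends only on the document ID, adding new documents later does not move existing ones between the two sets.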
Training data card
The Training data card displays information about your dataset. See the table below for more information:
| Field | Description | Notes |
|---|---|---|
| Training status | Indicates the status of your model based on the training data. | |
| Total documents | The number of annotated documents for this model. | |
Projected Automation
The Projected Automation chart displays the performance of the currently live model, compared to the candidate one.
Expand it by clicking the arrow.
The chart displays how the target accuracy affects the automation. The lower the accuracy, the higher the automation, and vice versa.
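The trade-off can be illustrated with a small sketch. It assumes that each extracted field comes with a confidence score, and that fields above a confidence cutoff are automated while the rest go to manual review. The sample data, the function name, and the cutoff scheme are hypothetical and are not taken from the ORCA implementation.

```python
from typing import List, Tuple

# Hypothetical sample: (confidence score, whether the prediction was correct).
SAMPLE: List[Tuple[float, bool]] = [
    (0.99, True), (0.97, True), (0.95, True), (0.92, True), (0.90, False),
    (0.85, True), (0.80, True), (0.70, False), (0.60, True), (0.40, False),
]


def automation_at_target(fields: List[Tuple[float, bool]],
                         target_accuracy: float) -> float:
    """Return the largest share of fields that can be automated while the
    automated fields still meet the target accuracy."""
    # Automate the most confident fields first.
    ranked = sorted(fields, key=lambda f: f[0], reverse=True)
    best = 0.0
    correct = 0
    for count, (_, is_correct) in enumerate(ranked, start=1):
        correct += 1 if is_correct else 0
        # Accuracy over the `count` most confident fields.
        if correct / count >= target_accuracy:
            best = count / len(ranked)
    return best


for target in (0.99, 0.85, 0.75):
    share = automation_at_target(SAMPLE, target)
    print(f"target accuracy {target:.2f} -> automation {share:.0%}")
```

On this sample, lowering the target accuracy from 0.99 to 0.85 raises the automated share from 40% to 70%, which is the same shape of curve the chart shows.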
Margin of Error
The Margin of Error (MoE) indicates the allowable range of inaccuracy in the system's results. It shows you how much the output can differ from the true value while still being acceptable. A smaller margin of error means the system is more accurate. To learn more, see Accuracy and Automation.
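As a rough illustration of how a margin of error relates to the amount of data, the sketch below uses the standard normal-approximation formula for a proportion, z * sqrt(p * (1 - p) / n). This is a general statistical formula assumed here for illustration. The article does not state how the product computes its MoE, and the function name and the 95% confidence default are hypothetical.

```python
import math


def margin_of_error(accuracy: float, sample_size: int, z: float = 1.96) -> float:
    """Approximate margin of error for an accuracy measured on a sample.

    accuracy    -- observed accuracy as a fraction between 0 and 1
    sample_size -- number of evaluated items
    z           -- z-score for the confidence level (1.96 is about 95%)
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return z * math.sqrt(accuracy * (1.0 - accuracy) / sample_size)


for n in (100, 400, 1600):
    print(f"n={n}: 0.95 accuracy, margin of error {margin_of_error(0.95, n):.4f}")
```

Under this formula, quadrupling the number of evaluated items halves the margin of error, which is one reason a larger annotated dataset tends to give a tighter estimate.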
Training Data
The Training Data tab lists all training documents and allows you to annotate and manage them. The interactive demo below will show you how to use this tab:
Tagging documents
You can organize and manage the training documents more efficiently in TDM for ORCA VLMs by adding tags. This feature allows you to add, filter, import, and export tags for documents, making it easier to categorize and find the information you need.
It provides the following key capabilities:
Manual tagging — Hover over the Tags cell in the Training Data table to reveal a + button. Click it to open the drop-down list with all existing tags. From there, you can select an existing tag or create a new tag.
Tag filtering — Filter documents in the Training Data table by tag to find relevant items quickly.
Import tags — If the training data contains tags, they will be automatically imported.
Special-character handling — Tags cannot contain “;” or spaces (spaces are replaced with underscores).
Unused tags — Unassigned tags are automatically deleted.
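If you prepare tags outside of TDM, for example before importing training data, the special-character rules above can be mirrored in a small helper. This is a minimal sketch: the function name is hypothetical, and both trimming surrounding whitespace and rejecting ";" with an error (rather than silently removing it) are assumptions, since the rules above only say that ";" is not allowed and that spaces become underscores.

```python
def normalize_tag(raw: str) -> str:
    """Return a tag that follows the documented rules.

    Hypothetical helper: spaces are replaced with underscores, and a tag
    containing ';' is rejected.
    """
    tag = raw.strip()  # assumption: surrounding whitespace is not meaningful
    if not tag:
        raise ValueError("tag must not be empty")
    if ";" in tag:
        raise ValueError("tag must not contain ';'")
    return tag.replace(" ", "_")


print(normalize_tag("Q1 invoices"))
```

Normalizing tags before import keeps names such as "Q1 invoices" and "Q1_invoices" from being treated as different labels in your own tooling.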
VLM annotation
VLM annotation experience
While the ORCA VLM works out of the box, you can improve its performance by training it on your specific data, using annotated documents. Learn how to annotate documents in the demo below:
Automatically transcribed fields
Some fields may be pre-populated during annotation. This behavior is intended to assist the keyer and reduce manual typing. The keyer should review each value and correct it if necessary to ensure it matches the text in the document exactly. The reviewed values are then used for model training.
History
The History tab lists all models trained for the selected model definition. From this tab, you can deploy, undeploy, or reject models, and view detailed information for each one. Starting in v43, you can also rename specialized models.
Base model entry in the History page
This entry represents the default ORCA-powered extraction configuration for a given layout, before custom training.
Learn how to navigate the History tab from the interactive demo below:
| Column | Description | Notes |
|---|---|---|
| Name | Model name | |
| State | Model state | |
| Compatibility | Compatibility of the most recently live model for this definition. | Learn more about compatibility in our Model Compatibility Logic article. |
| Layout version | The layout version for this model. | |
| Source | Where the model was trained. | |
During training, the system uses the ORCA base model and the annotated documents to learn patterns specific to your use case. The training process produces a candidate model, which can then be evaluated and deployed.
Training does not modify the base ORCA model.
Instead, it creates a model tailored to the layout and dataset used for training. You can rename the specialized model, as shown in the walkthrough above.
Training results
After training completes:
a candidate model appears in the Overview tab
the system calculates projected automation
the candidate model can be reviewed and deployed.
You can retrain the model by adding more annotated documents and running training again.
Deploying the candidate model
To start using the trained model:
Open the Overview tab.
Review the candidate model summary.
Click Deploy from the Actions drop-down to promote the candidate model to Live.
Deploying a model from the History tab
You can also deploy your model from the History tab, as shown in the walkthrough above.
After deployment, the new model will be used for document processing.
Next steps
To train a specialized model for your use case on top of the ORCA base model and evaluate it, follow the instructions in Training a Specialized Model.