This feature is available in v42.3 and later.
ORCA is a Vision Language Model (VLM) that extracts information from documents. To learn more, see [v42.3] ORCA (Optical Reasoning and Cognition Agent) VLMs.
While the base model works out of the box, you can improve its performance by training it on your specific data using annotated documents. In v42.3 and later, ORCA VLMs can be managed and trained directly in Training Data Management (TDM). This article explains how to create a specialized model on top of the ORCA base model for your use case.
ORCA VLM specialization
The ORCA base model provides general-purpose extraction capabilities, but every customer’s documents are different.
Training the model on your documents allows you to:
Improve extraction Accuracy for your specific layouts. Learn more in Accuracy.
Increase Automation rates. To learn more, see Automation.
Reduce the number of fields that require human review.
Specialize the model to your organization’s document formats.
Training creates a model tailored to your use case while still leveraging the capabilities of the ORCA base model.
Accuracy and automation tradeoffs
Creating a specialized model on top of the ORCA base model allows you to decide whether you need more accuracy or more automation for your specific use case. If you use an ORCA VLM without specialization, processing is either fully automated or sent entirely for human review. To learn more, see Accuracy.
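The tradeoff can be pictured as a routing decision. The sketch below is purely illustrative and does not reflect the product's internal logic: it assumes a hypothetical per-field confidence score and a single threshold, where raising the threshold favors accuracy (more fields reviewed) and lowering it favors automation.

```python
def route_fields(fields, threshold=0.90):
    """Illustrative accuracy/automation tradeoff.

    `fields` maps a field name to a hypothetical extraction-confidence
    score in [0, 1]. Fields at or above the threshold are automated;
    the rest are sent for human review.
    """
    automated = {name for name, conf in fields.items() if conf >= threshold}
    review = set(fields) - automated
    return automated, review
```

With a high threshold, only confidently extracted fields skip review; with a low one, nearly everything is automated. Specialization shifts more fields above whatever operating point you choose.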
To specialize an ORCA model:
Upload and annotate training documents.
Train the model.
Review the candidate model.
Deploy the candidate model.
Evaluate the candidate model.
Retrain if needed or promote the model to production.
Before specializing an ORCA model, ensure that:
An ORCA base model is installed. To learn more, see [v42.3] Installing ORCA VLMs.
You’ve configured a Semi-structured layout with the fields you need for your use case. ORCA VLMs extract fields from Semi-structured layouts only.
The latest Layout Version is locked.
A model definition exists for the layout. Learn more in [v42.3] Model Definitions.
Follow the steps below to adapt the ORCA base model to your specific use case.
Upload and annotate training documents
Before specializing an ORCA base model, you must upload and annotate training documents for your layout. Doing so provides the ground-truth values the model will learn from during training.
Dataset requirements
Each uploaded document should not exceed 5 pages. Larger documents may lead to an Out-of-Memory (OOM) error. Supporting larger documents requires increasing GPU memory; contact your Hyperscience representative for more information.
Ensure that all training documents are unique. Duplicate annotated documents are excluded from training.
If too many documents are excluded, the training process may fail due to insufficient data.
Include documents that represent the different patterns in your Dataset.
Annotate enough documents to capture these patterns before training the model.
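The dataset requirements above can be checked before upload. The following sketch is a hypothetical pre-flight script, not part of the product: it assumes each document is represented as a dict with made-up keys (`name`, `page_count`, `content`), flags documents over the 5-page limit, detects duplicates by content hash, and warns when fewer than the 120-document minimum would remain.

```python
import hashlib

MAX_PAGES = 5        # larger documents risk OOM errors during training
MIN_DOCUMENTS = 120  # minimum annotated documents required to train

def validate_dataset(documents):
    """Check a candidate training set against the dataset requirements.

    Returns (accepted, problems): `accepted` lists usable document names,
    and `problems` maps a document name to the reason it would be excluded.
    """
    seen_hashes = {}
    accepted, problems = [], {}
    for doc in documents:
        if doc["page_count"] > MAX_PAGES:
            problems[doc["name"]] = f"exceeds {MAX_PAGES} pages"
            continue
        digest = hashlib.sha256(doc["content"]).hexdigest()
        if digest in seen_hashes:
            problems[doc["name"]] = f"duplicate of {seen_hashes[digest]}"
            continue
        seen_hashes[digest] = doc["name"]
        accepted.append(doc["name"])
    if len(accepted) < MIN_DOCUMENTS:
        problems["__dataset__"] = (
            f"only {len(accepted)} usable documents; "
            f"at least {MIN_DOCUMENTS} are required"
        )
    return accepted, problems
```

Running such a check locally avoids discovering mid-training that duplicates were excluded and the remaining data was insufficient.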
To upload documents:
Go to Models > VLM Field Extraction.
Click the model definition associated with your layout.
On the Model Details page, click the Training Data tab.
In the Actions drop-down menu, click Upload Documents.
Select your documents and click Upload.

Once the documents are loaded in the system, you’ll be able to start the annotation process.
Annotating documents
To access the VLM Annotations experience, click the Document ID link in the Training Data table for each file you want to annotate.
When annotating documents for adapting the ORCA base model, follow these guidelines:
Enter text exactly as it appears in the document.
Do not normalize dates, fix typos, or change the letter case unless specifically instructed.
Avoid formatting or input validation.
Review pre-populated values carefully.
The application may automatically populate fields by detecting text inside a selected area. Always verify that the captured value matches the document exactly and correct it if needed.
Text Segmentation
For faster and easier annotation, use the dotted lines around each value. These boxes represent the text segmentation in the document. Learn more in Text Segmentation. You can also click and drag your cursor to create a bounding box.
Add missing text manually when necessary.
If the automated detection misses part of the value, manually enter the missing characters to ensure the annotation is complete.
When a field appears multiple times in a document, annotate the values in natural reading order:
Top to bottom
Left to right
First page to last page
If the same field appears across multiple pages, treat its values as a single continuous sequence for that document.
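The natural reading order described above amounts to sorting by page, then vertical position, then horizontal position. This is an illustrative sketch, not product code: it assumes each annotated value carries hypothetical `page`, `top`, and `left` coordinates.

```python
def reading_order(values):
    """Sort a repeated field's annotated values into natural reading order:
    first page to last, top to bottom, then left to right.

    Each value is a dict with hypothetical keys "page", "top", "left"
    (page index and bounding-box position) and "text".
    """
    ordered = sorted(values, key=lambda v: (v["page"], v["top"], v["left"]))
    return [v["text"] for v in ordered]
```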
Make sure to click Save on each annotated document.
Navigate between documents by clicking the left or right arrows. Documents appear in the order you’ve sorted them in the Training Data table.

Train the model
During training, the system uses the annotated document values as ground truth and trains a model tailored to your document format.
Number of required documents
Ensure you have at least 120 annotated documents to train the model.
For more stable and reliable performance, we recommend 150–200 or more annotated documents covering the variety of document patterns in your dataset. Using a larger and more diverse dataset generally improves model performance.
To initiate training:
In the Actions drop-down menu, click Train Model.

Training results
Training produces a candidate model that you need to deploy and evaluate against production data. To deploy the candidate:
On the Model Details page, click History.
Find your candidate model and click Deploy.
The candidate model will be Live, and you’ll be able to submit documents to be processed by the adapted ORCA base model.
Evaluate the candidate model
To evaluate the candidate model, you must first deploy it and then process documents through the system. Follow the steps below:
Using testing documents
We recommend setting aside 50-100 representative documents for testing your model’s performance. Doing so allows you to evaluate how the model performs on realistic data.
These documents should reflect the variety of inputs you expect in production.
The model should not have seen these documents (i.e., they should not be included in the training documents).
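Keeping the testing documents out of the training set can be done with a simple deterministic split. This is a generic sketch, not a product feature: it assumes you track documents by ID and want to reserve a fixed-size held-out set with no overlap.

```python
import random

def holdout_split(doc_ids, test_size=75, seed=42):
    """Reserve a held-out test set the model never sees during training.

    Shuffles the document IDs deterministically, sets aside `test_size`
    of them for evaluation, and returns (train_ids, test_ids).
    """
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    test_ids = ids[:test_size]
    train_ids = ids[test_size:]
    assert not set(train_ids) & set(test_ids), "train/test sets must not overlap"
    return train_ids, test_ids
```

A fixed seed keeps the split reproducible, so retrained candidates are always evaluated against the same held-out documents.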
Evaluating
Deploy the candidate model
Deploy the candidate model from the Actions drop-down menu on the Model Details page. Once deployed, the model becomes active and can process documents.

Run your testing documents through the system with a 100% QA sample rate. Follow the steps described in [v42.3] Installing ORCA VLMs to set your sample rate.
This sample rate ensures that each processed document is reviewed in a VLM QA task, allowing you to assess extraction accuracy and automation trends before deployment in production. Learn how to perform the VLM QA task in our Vision Language Model Quality Assurance article.
Evaluate the model’s performance.
ORCA VLM transcriptions
Unlike traditional Identification models, ORCA VLMs directly generate transcriptions for each field.
Use the QA results to assess whether:
The Accuracy rate indicates that the generated transcriptions correctly match the document content.
The Automation rate indicates that documents can be processed without requiring manual transcription review.
Compare the metrics between the new candidate and the previous model.
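Conceptually, both rates are fractions of the reviewed fields. The sketch below is illustrative only and assumes a hypothetical QA export where each reviewed field records whether the transcription was correct and whether it skipped manual review; it is not the product's metric definition.

```python
def qa_metrics(qa_results):
    """Compute illustrative accuracy and automation rates from QA reviews.

    Each result is a dict with hypothetical keys:
      "correct"   - True if the generated transcription matched the document
      "automated" - True if the field skipped manual transcription review
    """
    total = len(qa_results)
    if total == 0:
        return {"accuracy": 0.0, "automation": 0.0}
    return {
        "accuracy": sum(r["correct"] for r in qa_results) / total,
        "automation": sum(r["automated"] for r in qa_results) / total,
    }
```

Computing the same two numbers for the candidate and the previous model makes the comparison in the next step concrete.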
Decide next steps
Based on the evaluation results, you can:
Compare performance with your current production model.
Retrain the model by adding more or higher-quality training data, then re-evaluate as described above.
Promote the model to production if the results meet your requirements.