ORCA (Optical Reasoning and Cognition Agent) VLMs

This feature is available in v41.2 and later.

Accessing this feature

Your access to the feature described in this article depends on your license package and pricing plan.

To learn which features are available to your organization and how to add more, contact your Hyperscience representative.

Our ORCA (Optical Reasoning and Cognition Agent) Vision Language Models (VLMs) leverage the power of GPUs to find and extract data in documents. In this article, you'll learn how to implement these VLMs and use the VLM features available in v41.2.

Benefits and considerations

In Hyperscience, ORCA VLMs work "out of the box" to extract data from documents. Because they do not require training to extract data, VLMs can reduce implementation times, making them a valuable option for use cases that need to be set up quickly or in which models cannot be trained. This flexibility, along with their ability to detect visual elements (e.g., stamps, signatures), expands the data-extraction capabilities of the Hyperscience Platform.

However, these benefits come at a cost — VLMs require GPUs, which are more expensive to run than CPUs. Therefore, carefully evaluate your documents and available resources before incorporating VLMs into your workflow.

In v41.2, Hyperscience provides assistance with the implementation of VLMs. If your Hyperscience representative determines that ORCA VLMs are the best option for your specific use case, they will give you the flow required to incorporate VLMs in the processing of your submissions.

Features available in v41.2

You can take advantage of the following features when using ORCA VLMs in v41.2.

Fine-tuning

With the sample documents you provide, the Hyperscience team will perform annotations, which are then used to generate an archive of use-case-specific weights. The weights are ingested by the layout's VLM flow in your instance, helping the model to detect and transcribe data more accurately than it would have otherwise.

Requirements

  • Sample documents — The team needs at least 40 representative documents (1-3 pages each) for each layout you intend to use VLMs with. Providing additional documents is recommended but not required to complete fine-tuning. While you can use VLMs to process documents containing more than 3 pages, those documents cannot be used for fine-tuning. (One way to check your sample set is sketched after this list.)

  • A list of fields you want to extract from the documents — This list of fields informs the layout-creation process.

  • A list of business- or use-case-specific rules used to process the documents — These rules help the team to make relevant and accurate annotations.
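If you'd like to verify that your sample set meets the size requirements before sending it, the sketch below shows one way to do so. It is illustrative only: it assumes your samples are PDFs in a single folder (the folder name is hypothetical) and uses the open-source pypdf package.

    # Illustrative check of the fine-tuning sample requirements:
    # at least 40 documents per layout, each 1-3 pages.
    from pathlib import Path

    from pypdf import PdfReader

    samples = Path("sample_documents")  # hypothetical folder of samples for one layout
    eligible = [p for p in samples.glob("*.pdf") if 1 <= len(PdfReader(p).pages) <= 3]

    print(f"{len(eligible)} documents of 1-3 pages (at least 40 needed per layout)")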

Thresholding and Quality Assurance

Thresholding allows you to set a target accuracy for the VLM's output. This target determines the volume of Supervision tasks the system generates, as well as which fields are sent to Supervision and which can be processed automatically. To find the threshold for a given target accuracy, the Hyperscience team completes a set of Vision Language Model Quality Assurance (VLM QA) tasks in your instance.
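To illustrate the relationship between a target accuracy and a confidence threshold, the sketch below (conceptual only, not Hyperscience code; the function and data are hypothetical) finds the lowest threshold at which the automatically accepted fields still meet the target:

    # Conceptual sketch: given QA-reviewed predictions as (confidence, is_correct)
    # pairs, find the lowest confidence threshold whose auto-accepted fields
    # meet the target accuracy. Fields below the threshold would go to Supervision.
    def find_threshold(qa_results, target_accuracy):
        for threshold in sorted({conf for conf, _ in qa_results}):
            accepted = [ok for conf, ok in qa_results if conf >= threshold]
            if accepted and sum(accepted) / len(accepted) >= target_accuracy:
                return threshold
        return None  # target unreachable: every field would go to Supervision

    sample = [(0.95, True), (0.90, True), (0.85, True), (0.80, True),
              (0.75, False), (0.60, True), (0.40, False)]
    print(find_threshold(sample, target_accuracy=0.95))  # prints 0.8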

Keyers at your organization can also complete VLM QA tasks after the VLM has been implemented and used to process submissions (see Vision Language Model Quality Assurance for more information). The results of these tasks are used to determine the VLM's accuracy for the layout, which can be found in the Manual Accuracy vs. Machine Accuracy Report (Reporting > Accuracy). To learn more about this report, see Manual Accuracy vs. Machine Accuracy.

If you are using the Fine-tuning feature, we recommend redoing thresholding whenever the fine-tuning weights for the use case change.

Requirements

  • VLM flows — A separate copy of the VLM flow (“Vision Language Model Flow via GPU”) for each layout that will be used with the VLM. You will use these flows to process documents that have those layouts.

Supervision

The system generates Flexible Extraction tasks for fields that receive low-confidence predictions. To learn more about Flexible Extraction tasks, see Transcription.

Requirements

  • Completion of thresholding — Because thresholds help determine which fields are sent to Supervision, thresholding needs to be completed before the system can generate Supervision tasks.

Installing the VLM

In v41.2, the ORCA VLM is installed when a submission is processed through the “Vision Language Model Flow via GPU” flow. The type of model that is installed and used to process submissions is determined by the selection made in the Model Type setting for the flow. Because the Hyperscience team implements VLMs in v41.2, no action is required on your part to install the VLM.

Note that, if you used a VLM installed with the Install LLM/VLM Block in v41.1.3 or earlier, you need to change the model_name in your flow’s code from HSVLM to ORCA and the model_uuid to 14554188-cf8e-4f10-9057-d1df2f710072 after upgrading to v41.2.
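The exact structure of your flow's code depends on how it was built, so the snippet below is only a sketch of the two values to update; install_params is a hypothetical stand-in for wherever your flow defines the block's parameters.

    # Hypothetical excerpt of flow code configuring the Install LLM/VLM Block.
    # Only the two values below change; the surrounding structure is illustrative.

    # Before upgrading (v41.1.3 or earlier):
    install_params = {
        "model_name": "HSVLM",
        "model_uuid": "...",  # your previous model UUID
    }

    # After upgrading to v41.2:
    install_params = {
        "model_name": "ORCA",
        "model_uuid": "14554188-cf8e-4f10-9057-d1df2f710072",
    }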

“Vision Language Model Flow via GPU” settings

The settings available in the “Vision Language Model Flow via GPU” flow are listed below according to their type. To view settings of a particular type, select that type from the Settings Type drop-down list in the flow’s settings.

LLM Install

  • Model Name (required) — The name of the model to be installed, if it is not already present. Installation occurs when a submission is processed through the flow. This same model is used to process the flow’s submissions. ORCA is the only valid value in v41.2.

  • Cloudsmith Key (optional) — The Cloudsmith key for your instance. This setting is not applicable to SaaS deployments. Hidden by default.

Vision Language Model

  • Target Accuracy (required) — The submission-level transcription accuracy targeted by the system, entered as a value between 0.0 and 1.0, inclusive (cannot be blank).

  • Show Machine Predictions in Supervision (optional) — When enabled, predicted transcriptions that the system has low confidence in are pre-populated in this flow’s Supervision tasks. Hidden by default.

  • Max Image Tokens (required) — The maximum number of tokens used to read each page, which should be scaled based on the density of the pages’ content and how difficult it is to read any text (handwritten or printed) on the pages. Hidden by default.

  • Sliding Window Size (required) — The number of pages processed by the model at once, which may affect throughput and GPU-memory usage. Hidden by default.

  • Max New Tokens (required) — The maximum number of tokens used to extract data from each page, which should be scaled based on the number and length of fields to be extracted. Hidden by default.
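To make the Sliding Window Size setting more concrete, the sketch below (not platform code; the function is hypothetical) groups pages into fixed-size windows for inference. Whether the platform's windows overlap is not documented here, so non-overlapping groups are assumed; a larger window lets the model see more pages per call at the cost of more GPU memory.

    # Conceptual sketch of a page window: with window_size=2, the model sees
    # two pages per inference call. Non-overlapping windows are an assumption.
    def page_windows(pages, window_size):
        for start in range(0, len(pages), window_size):
            yield pages[start:start + window_size]

    for window in page_windows(["page-1", "page-2", "page-3", "page-4", "page-5"], 2):
        print(window)  # ['page-1', 'page-2'], ['page-3', 'page-4'], ['page-5']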

Quality Assurance

  • Quality Assurance Flow (optional) — The flow that is called to generate VLM QA tasks. If you would like to include VLM QA tasks in your flow, select Vision Language Models QA in the drop-down list.