ORCA (Optical Reasoning and Cognition Agent) VLMs

Accessing this feature

Your access to the feature described in this article depends on your license package and pricing plan.

To learn which features are available to your organization and how to add more, contact your Hyperscience representative.

Our ORCA (Optical Reasoning and Cognition Agent) Vision Language Models (VLMs) leverage the power of GPUs to find and extract data in documents. In this article, you'll learn how to implement these VLMs and use the VLM features available in v42.

Benefits and considerations

In Hyperscience, ORCA VLMs work "out of the box" to extract data from documents. Because they do not require training to extract data, VLMs can reduce implementation times, making them a valuable option for use cases that need to be set up quickly or in which models cannot be trained. This flexibility, along with their ability to detect visual elements (e.g., stamps, signatures), expands the data-extraction capabilities of the Hyperscience Platform.

However, these benefits come at a cost: VLMs require GPUs, which are more expensive to run than CPUs. Therefore, carefully evaluate your documents and available resources before incorporating VLMs into your workflow.

Features available in v42

You can take advantage of the following features when using ORCA VLMs in v42.

Fine-tuning

With the sample documents you provide, the Hyperscience team will perform annotations, which are then used to generate an archive of use-case-specific weights. The weights are ingested by the layout's VLM flow in your instance, helping the model to detect and transcribe data more accurately than it would have otherwise.

Requirements

  • Sample documents — The team needs at least 40 representative documents (1-3 pages each) for each layout you intend to use VLMs with. Providing additional documents is recommended but not required to complete fine-tuning. While you can use VLMs to process documents containing more than 3 pages, those documents cannot be used for fine-tuning.

  • A list of fields you want to extract from the documents — This list of fields informs the layout-creation process.

  • A list of business- or use-case-specific rules used to process the documents — These rules help the team to make relevant and accurate annotations.

Thresholding and Quality Assurance

Thresholding allows you to set a target accuracy for the VLM's output. This target determines the volume of Supervision tasks the system generates, as well as which fields are sent to Supervision and which can be processed automatically. To find the threshold for a given target accuracy, the Hyperscience team completes a set of Vision Language Model Quality Assurance (VLM QA) tasks in your instance.

Keyers at your organization can also complete VLM QA tasks after the VLM has been implemented and used to process submissions (see Vision Language Model Quality Assurance for more information). The results of these tasks are used to determine the VLM's accuracy for the layout, which appears in the Manual Accuracy vs. Machine Accuracy Report (Reporting > Accuracy). To learn more about this report, see Manual Accuracy vs. Machine Accuracy.

If you are using the Fine-tuning feature, we recommend completing new thresholding any time the fine-tuning weights change for the use case.
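
As a rough sketch of the mechanics described above, the example below shows how a single confidence threshold could split VLM field predictions between automation and Supervision. The field names, confidence values, and route_fields helper are hypothetical illustrations, not part of the Hyperscience API, and the real threshold is set through VLM QA rather than chosen by hand.

    # Hypothetical illustration of threshold-based routing; not the Hyperscience API.
    # Each field prediction carries a model confidence score between 0 and 1.
    predictions = [
        {"field": "invoice_number", "value": "INV-10021", "confidence": 0.97},
        {"field": "total_amount", "value": "$1,250.00", "confidence": 0.62},
    ]

    THRESHOLD = 0.85  # placeholder value; the real threshold comes from VLM QA

    def route_fields(predictions, threshold):
        """Split predictions into auto-processed fields and Supervision candidates."""
        auto, supervision = [], []
        for p in predictions:
            (auto if p["confidence"] >= threshold else supervision).append(p)
        return auto, supervision

    auto, supervision = route_fields(predictions, THRESHOLD)
    # Raising the threshold raises expected accuracy but sends more fields to keyers.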

Requirements

  • VLM flows — A separate copy of the VLM flow ("Vision Language Model Flow via GPU") or your own custom VLM flow for each layout that will be used with the VLM. You will use these flows to process documents that have those layouts.

Supervision

The system generates Flexible Extraction tasks for fields with low-confidence predictions. To learn more about Flexible Extraction tasks, see Transcription.

In v42, the system indicates the approximate location where it predicts each field appears in the document.

In previous versions, keyers had no assistance in locating fields. By helping keyers to find fields, this update can reduce the time required to complete Flexible Extraction tasks created from VLM outputs.
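
For illustration only, a low-confidence prediction carrying a location hint might look like the sketch below. The field name and the normalized bounding-box format are assumptions made for this example; they do not reflect the platform's internal data structures.

    # Hypothetical shape of a low-confidence prediction with a location hint.
    prediction = {
        "field": "policy_number",
        "value": "PN-4481-X",
        "confidence": 0.58,  # below the threshold, so a task is generated
        "approx_location": {  # normalized page coordinates (0 to 1); assumed format
            "page": 1,
            "x0": 0.62, "y0": 0.11,
            "x1": 0.91, "y1": 0.14,
        },
    }
    # A Flexible Extraction task can highlight this region so the keyer verifies
    # the value instead of scanning the whole page for it.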

Requirements

  • Completion of thresholding — Because thresholds help determine which fields are sent to Supervision, thresholding work needs to be completed before the system can generate Supervision tasks.

General Prompting Block

With the General Prompting Block, you can apply ORCA VLMs to use cases that extend beyond data extraction. For example, you can use them in place of LLMs in Document Chat, or you can leverage them to send multiple prompts when completing complex tasks that span several pages or documents. To learn more, see Using the General Prompting Block.
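
As a sketch of the multi-prompt pattern this enables, the example below breaks a complex, multi-page task into focused prompts. The ask_vlm function and the prompt wording are hypothetical stand-ins for the block, not its actual interface.

    # Hypothetical sketch of the multi-prompt pattern; ask_vlm stands in for the
    # General Prompting Block and is not a real Hyperscience API.
    def ask_vlm(prompt, pages):
        """Placeholder for a VLM call that takes a prompt and a list of page images."""
        raise NotImplementedError  # illustration only

    def summarize_contract(pages):
        # Break one complex task into focused, independently answerable prompts.
        prompts = {
            "parties": "List the parties named in this contract.",
            "start_date": "What is the effective date of this agreement?",
            "renewal": "Does the contract renew automatically? Answer yes or no.",
        }
        return {key: ask_vlm(text, pages) for key, text in prompts.items()}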

ORCA Composite Block

The original ORCA VLM flow required several Code Blocks, making the flow complex and difficult to modify. To simplify the integration of VLMs into custom flows, we created the ORCA Composite Block.

When the ORCA Composite Block is included in a document-processing flow, it can replace the Machine Identification and Machine Transcription steps for Semi-structured documents. It also allows the Reprocessing feature to be applied to any ORCA extraction task.
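
As a rough before-and-after outline of that structural change (the step names mirror this article, but the list form below is a simplification, not actual Hyperscience flow syntax):

    # Simplified, hypothetical flow outlines; not actual Hyperscience flow syntax.
    code_block_flow = [
        "Classification",
        "Machine Identification",  # handled by the ORCA Composite Block below
        "Machine Transcription",   # handled by the ORCA Composite Block below
        "Supervision",
    ]

    composite_flow = [
        "Classification",
        "ORCA Composite Block",  # identification and transcription for
                                 # Semi-structured documents in a single block
        "Supervision",
    ]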

Installing the VLM

The ORCA VLM is installed when a submission is processed through the "VLM Vision Language Model Flow via GPU" flow or through a custom flow that includes the Install LLM/VLM Block. The type of model that is installed and used to process submissions is determined by the selection made in the Model Type setting for the Install LLM/VLM Block or for the flow.

Note that if you used a VLM installed with the Install LLM/VLM Block in v41.1.3 or earlier, you need to change the model_name in your flow's code from HSVLM to ORCA and the model_uuid to 14554188-cf8e-4f10-9057-d1df2f710072 after upgrading to v42.
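
The snippet below illustrates that change. The surrounding settings structure is a hypothetical example; the model_name and model_uuid values are the ones given above.

    # Before upgrading to v42 (v41.1.3 or earlier); structure is illustrative only.
    settings = {
        "model_name": "HSVLM",
        "model_uuid": "<previous UUID>",  # varies by installation; value elided
    }

    # After upgrading to v42:
    settings = {
        "model_name": "ORCA",
        "model_uuid": "14554188-cf8e-4f10-9057-d1df2f710072",
    }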