Accessing this feature
Your access to the feature described in this article depends on your license package and pricing plan.
To learn which features are available to your organization and how to add more, contact your Hyperscience representative.
Our ORCA (Optical Reasoning and Cognition Agent) Vision Language Models (VLMs) leverage the power of GPUs to find and extract data in documents. In this article, you'll learn how to implement these VLMs and use the VLM features available in v42.3.
This article describes features available in v42.3. If you are using ORCA in v42.0-v42.2, see [v42.0-v42.2] ORCA (Optical Reasoning and Cognition Agent).
Benefits and considerations
In Hyperscience, ORCA VLMs work "out of the box" to extract data from documents. Because they do not require training to extract data, VLMs can reduce implementation times, making them a valuable option for use cases that need to be set up quickly or in which training a model is not possible. This flexibility, along with their ability to detect visual elements (e.g., stamps, signatures), expands the data-extraction capabilities of the Hyperscience Platform.
All VLMs, ORCA included, require GPU hardware, which may need to be purchased for on-premise implementations or provisioned through a cloud provider. Note that ORCA requires a GPU with at least 24GB of VRAM for inference and at least 80GB of VRAM for fine-tuning. GPUs with less VRAM than these respective amounts are not supported by ORCA.
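To make the hardware minimums concrete, the sketch below encodes them as a simple check. The constant and function names are illustrative assumptions, not part of the Hyperscience platform; only the 24GB and 80GB figures come from this article.

```python
# Illustrative only: ORCA's documented VRAM minimums, and a check of a given
# GPU against them. These names are hypothetical, not Hyperscience APIs.
ORCA_MIN_VRAM_GB = {"inference": 24, "fine-tuning": 80}

def meets_orca_requirement(vram_gb: float, workload: str) -> bool:
    """Return True if a GPU with vram_gb gigabytes of VRAM supports the workload."""
    return vram_gb >= ORCA_MIN_VRAM_GB[workload]

print(meets_orca_requirement(24, "inference"))    # True: meets the 24GB inference minimum
print(meets_orca_requirement(40, "fine-tuning"))  # False: below the 80GB fine-tuning minimum
```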
Infrastructure requirements
An application machine with a GPU is required to use ORCA VLMs. To learn more about the technical requirements for GPUs, see Infrastructure Requirements (for Docker and Podman) or Kubernetes Installation Overview.
Features available in v42.3
You can take advantage of the following features when using ORCA VLMs in v42.3.
Creating model definitions in the application
In v42.3 and later, you can create your own model definitions in the application, based on the layouts of the documents you would like to process through ORCA. These definitions are then used in combination with VLM QA results to achieve model performance that meets the target accuracy you specify. This training process replaces the fine-tuning and thresholding work that was managed by Hyperscience in previous versions.
To learn more about model definitions, see [v42.3] Model Definitions.
Supervision
Flexible Extraction tasks are generated for fields for which the system makes low-confidence predictions. To learn more about Flexible Extraction tasks, see Transcription.
In these tasks, the system indicates the approximate location where it predicts the field appears in the document:

Requirements
Completion of training — Because thresholds help determine which fields are sent to Supervision, thresholding or training work needs to be completed before the system can generate Supervision tasks.
General Prompting Block
With the General Prompting Block, you can apply ORCA VLMs to use cases that extend beyond data extraction. For example, you can use them in place of LLMs in Document Chat, or you can leverage them to send multiple prompts when completing complex tasks that span several pages or documents. To learn more, see Using the General Prompting Block.
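The multi-prompt pattern described above can be sketched as a simple loop. Here, `ask_vlm` is a hypothetical stand-in for whatever call the General Prompting Block ultimately makes; it is not a Hyperscience API.

```python
def ask_vlm(prompt: str, pages: list) -> str:
    # Hypothetical stand-in for a VLM call; returns a placeholder answer.
    return f"answer to {prompt!r} over {len(pages)} page(s)"

def run_prompts(prompts: list, pages: list) -> dict:
    """Send several prompts against the same set of pages and collect the answers."""
    return {prompt: ask_vlm(prompt, pages) for prompt in prompts}

answers = run_prompts(
    ["What is the invoice total?", "Is the document signed?"],
    ["page-1", "page-2", "page-3"],
)
for prompt, answer in answers.items():
    print(prompt, "->", answer)
```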
ORCA Composite Block
The original ORCA VLM flow required several Code Blocks, making the flow complex and difficult to modify. To simplify the integration of VLMs into custom flows, we created the ORCA Composite Block.
When the ORCA Composite Block is included in a document-processing flow, it can replace the Machine Identification and Machine Transcription steps for Semi-structured documents. It also allows the Reprocessing feature to be applied to any ORCA extraction task.
Setting up ORCA
To use ORCA in the processing of your submissions, you first need to configure your system’s infrastructure and install the VLM.
1. Configure your infrastructure for ORCA.
If you have an on-premise deployment of Hyperscience, follow the instructions in our “Enabling Application Machines with GPUs” article for Docker, Podman, or Kubernetes.
If you have a SaaS deployment of Hyperscience, contact your Hyperscience representative to have them make any necessary changes to your infrastructure.
2. Install the ORCA base model.
In v42.3 and later, you do not need to install ORCA through a flow; you can install it from the Assets page (Administration > Assets) in the application. Information on installed VLMs is provided on the VLM Field Extraction page (Models > VLM Field Extraction).

For more information about installing and activating ORCA base models, see [v42.3] Installing ORCA VLMs.
Using ORCA to process submissions
To process submissions through ORCA, you’ll need one of the following:
the “Vision Language Model Flow via GPU” flow obtained from Hyperscience
a flow that uses the “Document Processing with ORCA Subflow,” which is included in Hyperscience v42.3 and later
a custom flow that contains the ORCA blocks required for your use case
Unless you are using ORCA only through the ORCA Composite Block in a custom flow, your flow needs a release with at least one layout in order to use ORCA.
ORCA can extract data from fields only; it cannot be used to extract data from tables.
It also cannot process Structured documents. To process Structured documents in a flow that includes an ORCA VLM, you need to create a custom flow that separates the processing of Structured and Semi-structured documents.
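Because ORCA cannot process Structured documents, such a custom flow needs branching logic that routes Structured documents through standard machine processing and Semi-structured documents through ORCA. The sketch below is a self-contained illustration of that routing; the class and function names are assumptions and do not come from the Hyperscience flow SDK.

```python
# Hypothetical routing sketch; these names are illustrative only.
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    layout_type: str  # "structured" or "semi_structured"

def route_documents(documents: list) -> dict:
    """Split documents into a standard branch and an ORCA branch."""
    branches = {"standard": [], "orca": []}
    for doc in documents:
        if doc.layout_type == "structured":
            branches["standard"].append(doc)  # classic identification/transcription
        else:
            branches["orca"].append(doc)      # ORCA VLM extraction
    return branches

branches = route_documents([
    Document("doc-1", "structured"),
    Document("doc-2", "semi_structured"),
])
print(len(branches["standard"]), len(branches["orca"]))  # 1 1
```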
“Vision Language Model Flow via GPU” settings
The settings available in the “Vision Language Model Flow via GPU” flow are listed below according to their type. To view settings of a particular type, select that type from the Settings Type drop-down list in the flow’s settings.
LLM Install
| Name | Required? | Description |
|---|---|---|
| Model Name | Yes | The name of the model to be installed, if it is not already present. Installation occurs when a submission is processed through the flow. This same model is used to process the flow’s submissions. ORCA is the only valid value in v42. |
| Cloudsmith Key | No | The Cloudsmith key for your instance. This setting is not applicable to SaaS deployments. Hidden by default. |
Vision Language Model
| Name | Required? | Description |
|---|---|---|
| Target Accuracy | Yes | The submission-level transcription accuracy targeted by the system, entered as a value between 0.0 and 1.0, inclusive (cannot be blank). |
| Show Machine Predictions in Supervision | No | When enabled, predicted transcriptions that the system has low confidence in are pre-populated in this flow’s Supervision tasks. Hidden by default. |
| Max Image Tokens | Yes | The maximum number of tokens used to read each page. Scale this value based on the density of the pages’ content and how difficult the pages’ text (handwritten or printed) is to read. Hidden by default. |
| Sliding Window Size | Yes | The number of pages processed by the model at once, which may affect throughput and GPU-memory usage. Hidden by default. |
| Max New Tokens | Yes | The maximum number of tokens used to extract data from each page. Scale this value based on the number and length of the fields to be extracted. Hidden by default. |
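As a way to summarize the constraints in the table above, here is a hedged sketch of how these settings could be validated before a flow run. The `VLMSettings` class is purely illustrative and not part of the product; only the Target Accuracy range (0.0 to 1.0, inclusive) comes from the table, and treating the token and window settings as positive integers is an assumption.

```python
from dataclasses import dataclass

@dataclass
class VLMSettings:
    # Illustrative container for the "Vision Language Model" settings; the
    # class and the positivity checks are assumptions, not product behavior.
    target_accuracy: float                   # required; 0.0-1.0, inclusive
    max_image_tokens: int                    # required; scale with page density
    sliding_window_size: int                 # required; pages processed at once
    max_new_tokens: int                      # required; scale with field count/length
    show_machine_predictions: bool = False   # optional; hidden by default

    def validate(self) -> None:
        if not 0.0 <= self.target_accuracy <= 1.0:
            raise ValueError("Target Accuracy must be between 0.0 and 1.0, inclusive")
        for name in ("max_image_tokens", "sliding_window_size", "max_new_tokens"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be a positive integer")

VLMSettings(target_accuracy=0.95, max_image_tokens=2048,
            sliding_window_size=2, max_new_tokens=1024).validate()  # passes
```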
Quality Assurance
| Name | Required? | Description |
|---|---|---|
| Quality Assurance Flow | No | The flow that is called to generate VLM QA tasks. If you would like to include VLM QA tasks in your flow, select Vision Language Models QA in the drop-down list. |
Document Processing with ORCA Subflow
This subflow is available in v42.3 and later and is designed to be used with the “Document Processing” top-level flow that is also included in Hyperscience instances.
More information about this subflow and its settings can be found in Document Processing with ORCA Subflow Settings.