This feature is available in v41.2 and later.
Accessing this feature
Your access to the feature described in this article depends on your license package and pricing plan.
To learn which features are available to your organization and how to add more, contact your Hyperscience representative.
Our ORCA (Optical Reasoning and Cognition Agent) Vision Language Models (VLMs) leverage the power of GPUs to find and extract data in documents. In this article, you'll learn how to implement these VLMs and use the VLM features available in v41.2.
Benefits and considerations
In Hyperscience, ORCA VLMs work "out of the box" to extract data from documents. Because they do not require training, VLMs can reduce implementation times, making them a valuable option for use cases that need to be set up quickly or in which models cannot be trained. This flexibility, along with their ability to detect visual elements (e.g., stamps, signatures), expands the data-extraction capabilities of the Hyperscience Platform.
However, these benefits come at a cost: VLMs require GPUs, which are more expensive to run than CPUs. Therefore, carefully evaluate your documents and available resources before incorporating VLMs into your workflow.
In v41.2, Hyperscience provides assistance with the implementation of VLMs. If the Hyperscience team determines that ORCA VLMs are the best option for your specific use case, your Hyperscience representative will give you the flow required to incorporate VLMs into the processing of your submissions.
Features available in v41.2
You can take advantage of the following features when using ORCA VLMs in v41.2.
Fine-tuning
With the sample documents you provide, the Hyperscience team will perform annotations, which are then used to generate an archive of use-case-specific weights. The weights are ingested by the layout's VLM flow in your instance, helping the model to detect and transcribe data more accurately than it would have otherwise.
Requirements
Sample documents — The team needs at least 40 representative documents (1-3 pages each) for each layout you intend to use VLMs with. Providing additional documents is recommended but not required to complete fine-tuning. While you can use VLMs to process documents containing more than 3 pages, those documents cannot be used for fine-tuning.
A list of fields you want to extract from the documents — This list of fields informs the layout-creation process.
A list of business- or use-case-specific rules used to process the documents — These rules help the team to make relevant and accurate annotations.
Thresholding and Quality Assurance
Thresholding allows you to set a target accuracy for the VLM's output. This target determines the volume of Supervision tasks the system generates, as well as which fields are sent to Supervision and which can be processed automatically. To find the threshold for a given target accuracy, the Hyperscience team completes a set of Vision Language Model Quality Assurance (VLM QA) tasks in your instance.
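To illustrate how a threshold routes fields, the sketch below splits hypothetical field predictions into automatically processed output and Supervision work. It is purely illustrative: the threshold value, field names, and confidence scores are assumptions, and the actual thresholding logic is internal to the Hyperscience Platform.

```python
# Illustrative sketch only: a threshold derived from the target accuracy
# splits field predictions into auto-processed output and Supervision tasks.
THRESHOLD = 0.87  # hypothetical value found via VLM QA for a given target accuracy

predictions = [
    {"field": "invoice_number", "value": "INV-1042", "confidence": 0.98},
    {"field": "total_amount", "value": "$1,250.00", "confidence": 0.71},
]

auto_processed = [p for p in predictions if p["confidence"] >= THRESHOLD]
sent_to_supervision = [p for p in predictions if p["confidence"] < THRESHOLD]

# A higher target accuracy raises the threshold, routing more fields to
# Supervision; a lower target accuracy does the opposite.
```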
Keyers at your organization can also complete VLM QA tasks after the VLM has been implemented and used to process submissions (see Vision Language Model Quality Assurance for more information). The results of these tasks are used to determine the accuracy of the VLM's layout, which can be found in the Manual Accuracy vs. Machine Accuracy Report (Reporting > Accuracy). To learn more about this report, see Manual Accuracy vs. Machine Accuracy.
If you are using the Fine-tuning feature, we recommend completing thresholding again any time the fine-tuning weights for the use case change.
Requirements
VLM flows — A separate copy of the VLM flow (“Vision Language Model Flow via GPU”) for each layout that will be used with the VLM. You will use these flows to process documents that have those layouts.
Supervision
The system generates Flexible Extraction tasks for fields with low-confidence predictions. To learn more about Flexible Extraction tasks, see Transcription.
Requirements
Completion of thresholding — Because thresholds help determine which fields are sent to Supervision, thresholding must be completed before the system can generate Supervision tasks.
Installing the VLM
In v41.2, the ORCA VLM is installed when a submission is processed through the “Vision Language Model Flow via GPU” flow. The model that is installed and used to process submissions is determined by the selection made in the flow’s Model Name setting. Because the Hyperscience team implements VLMs in v41.2, no action is required on your part to install the VLM.
Note that, if you used a VLM installed with the Install LLM/VLM Block in v41.1.3 or earlier, you need to change the `model_name` in your flow’s code from `HSVLM` to `ORCA` and the `model_uuid` to `14554188-cf8e-4f10-9057-d1df2f710072` after upgrading to v41.2.
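As a rough illustration, the change might look like the following in a flow’s code. This is a hypothetical sketch: the exact variable names and structure depend on how your flow was built, and your previous `model_uuid` value will differ.

```python
# Hypothetical sketch of the post-upgrade change; the structure of your
# flow's code may differ.

# Before upgrading to v41.2:
model_name = "HSVLM"
model_uuid = "..."  # the UUID of the previously installed model

# After upgrading to v41.2:
model_name = "ORCA"
model_uuid = "14554188-cf8e-4f10-9057-d1df2f710072"
```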
“Vision Language Model Flow via GPU” settings
The settings available in the “Vision Language Model Flow via GPU” flow are listed below according to their type. To view settings of a particular type, select that type from the Settings Type drop-down list in the flow’s settings.
LLM Install
Name | Required? | Description |
---|---|---|
Model Name | Yes | The name of the model to be installed, if it is not already present. Installation occurs when a submission is processed through the flow. This same model is used to process the flow’s submissions. ORCA is the only valid value in v41.2. |
Cloudsmith Key | No | The Cloudsmith key for your instance. This setting is not applicable to SaaS deployments. Hidden by default. |
Vision Language Model
Name | Required? | Description |
---|---|---|
Target Accuracy | Yes | The submission-level transcription accuracy targeted by the system, entered as a value between 0.0 and 1.0, inclusive (cannot be blank). |
Show Machine Predictions in Supervision | No | When enabled, predicted transcriptions that the system has low confidence in are pre-populated in this flow’s Supervision tasks. Hidden by default. |
Max Image Tokens | Yes | The maximum number of tokens used to read each page. Scale this value based on the density of the pages' content and how difficult the text on them (handwritten or printed) is to read. Hidden by default. |
Sliding Window Size | Yes | The number of pages processed by the model at once, which may affect throughput and GPU-memory usage. Hidden by default. |
Max New Tokens | Yes | The maximum number of tokens used to extract data from each page, which should be scaled based on the number and length of fields to be extracted. Hidden by default. |
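To make the scaling guidance in the table concrete, a hypothetical configuration for a dense, multi-page layout with many fields might look like the sketch below. The keys and values are illustrative assumptions only, not a real Hyperscience API or recommended numbers.

```python
# Hypothetical values for the "Vision Language Model" settings; tune them
# for your own documents rather than copying these numbers.
vlm_settings = {
    "target_accuracy": 0.95,          # required; between 0.0 and 1.0, inclusive
    "show_machine_predictions": True, # pre-populate low-confidence predictions
    "max_image_tokens": 2048,         # scale up for dense or hard-to-read pages
    "sliding_window_size": 1,         # pages processed at once; affects GPU memory
    "max_new_tokens": 1024,           # scale up for many or lengthy fields
}
```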
Quality Assurance
Name | Required? | Description |
---|---|---|
Quality Assurance Flow | No | The flow that is called to generate VLM QA tasks. If you would like to include VLM QA tasks in your flow, select Vision Language Models QA in the drop-down list. |