This article describes the settings available in the Document Processing with ORCA Subflow included in v42.3. The settings available in custom flows may differ from those described here, depending on which blocks are included in those flows. To learn more about the settings for individual blocks, see Flow Blocks.
As part of our efforts to give you more precise control over your Hyperscience processes, we’ve made many of our settings configurable on the flow level.
While you can build custom flows, each instance of Hyperscience includes a Document Processing flow. To learn more about the version of this flow that comes with v42, see Document Processing Flow in V42.
The Document Processing flow contains several subflows, including the Document Processing with ORCA Subflow. This article focuses on the settings available in that subflow.
View the subflow’s settings
To view the settings of the Document Processing with ORCA Subflow:
Click Flows in the left-hand sidebar, and click on the name of the Document Processing flow that contains the Document Processing with ORCA Subflow whose settings you would like to view.
Click Edit Flows.
On the Flow Studio canvas, click the Start Document Processing Subflow Block.
Click the Settings Type drop-down list, and click on a setting type.
Edit the subflow’s settings
After you’ve viewed the subflow’s settings, you can make any necessary changes, and then click Save in the upper-right corner of the page. You can save changes to multiple setting types at once.
Available settings
The sections below describe the settings available for each setting type.
ORCA
Setting | Description | Default Value |
|---|---|---|
ORCA Base Model | The base model that will be used when processing submissions through this block. Note that the selected base model must be installed in the instance before using the Document Processing with ORCA Subflow to process submissions. To learn more, see Installing ORCA VLMs. | ORCA 1.0 |
ORCA Target Accuracy | Your desired accuracy for the extraction of field data. If the estimated accuracy of the model's prediction for a field is below this value, the system will send the field and all its occurrences (if any) to Flexible Extraction Supervision. Note that the target accuracy is applied only if a specialized model definition has been created for this flow from the base model. | 95 |
Quality Assurance Flow | The subflow that will be used to generate Vision Language Model QA tasks for submissions processed through this flow. | None |
Task Restrictions | Determines which users can access Supervision tasks created by this block. To learn more, see Task Restrictions Overview. | None selected |
ORCA QA Sample Rate | The percentage of documents that the system will randomly select for Vision Language Model QA. This setting is available only if a flow is selected in Quality Assurance Flow. | 5 |
File Filter
Setting | Description | Default Value |
|---|---|---|
All Files or Images Only | Determines whether the filters in this block are applied to all files in submissions (Apply to all files) or only to image files (i.e., files whose MIME type is image) (Apply to images only). | Apply to all files |
Minimum Image Width (px) | The minimum width in pixels that an image needs to have in order to be allowed by the filter. This filter applies only to images (i.e., files whose MIME type is image) and has no impact on other files. | (Blank) |
Minimum Image Height (px) | The minimum height in pixels that an image needs to have in order to be allowed by the filter. This filter applies only to images (i.e., files whose MIME type is image) and has no impact on other files. | (Blank) |
Minimum File Size (KB) | The minimum size in kilobytes that a file needs to have in order to be allowed by the filter. | (Blank) |
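File Extension Action | Select one of the following options: Do not filter files by extension, Allow only these file extensions, or Deny files with these extensions. | Do not filter files by extension |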
File Extensions | A list of file extensions that the filter will allow or deny, based on the option selected in File Extension Action. Select the checkboxes for the file extensions that you would like to filter by. If zip is selected as a file extension, the filter will not decompress ZIP files included in submissions. Each ZIP file will be treated as an individual file, regardless of the number or types of files compressed within it. If there are file extensions that you want to filter by that do not appear in the drop-down list, select other, and enter the extensions in Other File Extensions. This field only appears if Allow only these file extensions or Deny files with these extensions is selected in File Extension Action. | (Does not appear) |
Other File Extensions | A comma-separated list of file extensions that do not appear in File Extensions. This field only appears if other is selected in File Extensions. | (Does not appear) |
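The filter rules in the table above can be sketched in Python. This is an illustration only, not Hyperscience's implementation: the `FileInfo` and `FilterSettings` types, their field names, the `"none"`/`"allow"`/`"deny"` action values, and the choice to treat a blank minimum as "no limit" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class FileInfo:
    """Hypothetical description of one file in a submission."""
    name: str
    mime_type: str
    size_kb: float
    width_px: Optional[int] = None   # only meaningful for images
    height_px: Optional[int] = None  # only meaningful for images


@dataclass
class FilterSettings:
    """Hypothetical mirror of the File Filter settings; None stands for (Blank)."""
    images_only: bool = False          # All Files or Images Only
    min_width_px: Optional[int] = None
    min_height_px: Optional[int] = None
    min_size_kb: Optional[float] = None
    extension_action: str = "none"     # assumed values: "none", "allow", "deny"
    extensions: Set[str] = field(default_factory=set)


def parse_other_extensions(raw: str) -> Set[str]:
    """Parse a comma-separated Other File Extensions value into a set."""
    return {p.strip().lstrip(".").lower() for p in raw.split(",") if p.strip()}


def is_allowed(f: FileInfo, s: FilterSettings) -> bool:
    is_image = f.mime_type.startswith("image/")
    # "Apply to images only": non-image files skip every filter.
    if s.images_only and not is_image:
        return True
    # Width and height minimums apply only to images.
    if is_image:
        if s.min_width_px is not None and (f.width_px or 0) < s.min_width_px:
            return False
        if s.min_height_px is not None and (f.height_px or 0) < s.min_height_px:
            return False
    if s.min_size_kb is not None and f.size_kb < s.min_size_kb:
        return False
    # A ZIP file is judged by its own extension; it is never unpacked here.
    ext = f.name.rsplit(".", 1)[-1].lower() if "." in f.name else ""
    if s.extension_action == "allow":
        return ext in s.extensions
    if s.extension_action == "deny":
        return ext not in s.extensions
    return True
```

For example, with a 100 px minimum width and a deny list built from `parse_other_extensions("exe, .BAT")`, a 50 px wide PNG and a file named `run.exe` would both be rejected, while a PDF would pass.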
Submission Bootstrap
AWS S3
S3 Submission Retrieval Store Configuration
If you are using an S3 bucket as your submission retrieval store and you are not authenticating through IAM roles, provide your AWS access key ID and secret access key in the S3 Submission Retrieval Store Configuration field.
To enter your credentials:
Click Edit value.
Enter your credentials in JSON format:
{ "aws_access_key_id": "<your_access_key_id>", "aws_secret_access_key": "<your_secret_key>" }
You can authenticate requests using AWS Signature Version 2 (SigV2). To use AWS Signature Version 2, add the following variable and value to the S3 Submission Retrieval Store Configuration field:
"s3_signature_version":"s3"
Click Done.
Click Save in the upper-right corner of the page.
In the dialog box that appears, click Save & Deploy.
For more information about AWS access key IDs and secret access keys, see Amazon's Understanding and getting your AWS credentials.
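If you prefer to generate the JSON rather than type it by hand, a small script can help avoid quoting mistakes. The sketch below is a convenience only; the function name `build_s3_store_config` is hypothetical, and the keys are the ones shown in the steps above.

```python
import json
from typing import Optional


def build_s3_store_config(access_key_id: str, secret_access_key: str,
                          signature_version: Optional[str] = None) -> str:
    """Return JSON text for the S3 Submission Retrieval Store Configuration field."""
    config = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    # Optional: the value "s3" selects AWS Signature Version 2, as described above.
    if signature_version is not None:
        config["s3_signature_version"] = signature_version
    return json.dumps(config, indent=2)


# Placeholders only; substitute your own credentials before pasting the output.
print(build_s3_store_config("<your_access_key_id>", "<your_secret_key>", "s3"))
```

Using `json.dumps` ensures that any special characters in the secret key are escaped correctly.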
S3 Submission Retrieval Endpoint URL
If your submission retrieval store is not in the public cloud (i.e., its URL does not point to s3.amazonaws.com — for example, a government cloud or an S3-compatible internal setup), enter its URL in S3 Submission Retrieval Endpoint URL. You do not need to edit your “.env” file to update this URL.
To edit the endpoint URL for your S3 submission retrieval store:
Enter the URL in the S3 Submission Retrieval Endpoint URL field or edit its contents.
Click Save in the upper-right corner of the page.
In the dialog box that appears, click Save & Deploy.
If the bucket you’re using as your submission retrieval store is in a public cloud (as opposed to a government cloud or an S3-compatible internal setup), leave this field blank.
OCS
If you are using an OCS submission retrieval store, enter the configuration details for your file store in these fields.
When you are finished entering or editing these fields’ values, click Save in the upper-right corner of the page. Then, in the dialog box that appears, click Save & Deploy.
[v42.0-v42.2] OCS Configuration
To enter your configuration details:
Click Edit value.
Enter the configuration details in JSON format:
{ "host_url": "<your_host_url>", "username": "<your_username>", "password": "<your_password>", "ssl_cert": "<CA_bundle_filename_OR_SKIP>" }
The value of ssl_cert should match the CA bundle filename inside the $HS_PATH/certs directory. To disable certificate validation, set this value to SKIP.
Click Done.
Click Save in the upper-right corner of the page.
In the dialog box that appears, click Save & Deploy.
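Before saving, it can be useful to confirm that the ssl_cert value will resolve. The following sketch assumes you can run Python on the host where $HS_PATH is defined. The function name `check_ssl_cert_value` is hypothetical, and treating SKIP as case-sensitive is an assumption.

```python
from pathlib import Path


def check_ssl_cert_value(ssl_cert: str, hs_path: str) -> bool:
    """Return True if ssl_cert is SKIP or names a file in <hs_path>/certs."""
    if ssl_cert == "SKIP":
        return True  # certificate validation is disabled
    # The setting expects a bare filename, not a path.
    if Path(ssl_cert).name != ssl_cert:
        return False
    return (Path(hs_path) / "certs" / ssl_cert).is_file()
```

For example, `check_ssl_cert_value("ca-bundle.pem", "/opt/hs")` returns True only if `/opt/hs/certs/ca-bundle.pem` exists; both the filename and the directory here are placeholders.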
v42.3
Name | Required? | Description |
|---|---|---|
OCS Host URL | Yes, if using an OCS submission retrieval store | The OCS host URL for the submission retrieval store. |
OCS Username | Yes, if using an OCS submission retrieval store | The OCS username for authenticating into the submission retrieval store. |
OCS Password | Yes, if using an OCS submission retrieval store | The OCS password for authenticating into the submission retrieval store. |
OCS SSL Certificate | Yes, if using an OCS submission retrieval store | The CA bundle filename inside the |
Generic Web Storage (HTTP/HTTPS)
Generic Web Storage (HTTP/HTTPS) Configuration
If you are using a generic web storage submission file store, enter the configuration details for your file store in this field.
We use Basic Authentication for Generic Web Storage Configuration.
To enter your configuration details:
Click Edit value.
Enter the configuration details in JSON format:
{ "username": "<your_username>", "password": "<your_password>", "ssl_cert": "<CA_bundle_filename_OR_SKIP>" }
The value of ssl_cert should match the CA bundle filename inside the $HS_PATH/certs directory. To disable certificate validation, set this value to SKIP.
Click Done.
Click Save in the upper-right corner of the page.
In the dialog box that appears, click Save & Deploy.
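For reference, the standard HTTP Basic Authentication scheme (RFC 7617) sends the username and password as a Base64-encoded `username:password` pair in the `Authorization` header. The sketch below shows how that header value is formed, which can help when testing your web storage endpoint independently. It illustrates the general scheme and is not a description of Hyperscience internals; the function name is hypothetical.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Authorization header for Basic Authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"
```

Because Base64 is an encoding and not encryption, Basic Authentication should be used over HTTPS wherever possible.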
Azure Blob Storage
If you are using Azure Blob Storage as your submission retrieval store, you can use the fields described below to configure the system’s connection to the blob.
Azure Blob Storage Authentication Type
From the Azure Blob Storage Authentication Type drop-down list, select the authentication type the system should use to access the blob:
SAS Token Only
Service Principal
Managed Identity
Account Key
When you select an authentication type, additional settings appear.
Settings for SAS Token Only authentication
Name | Required? | Description |
|---|---|---|
Azure Blob Storage Account URL | Yes | The URL of the storage account (e.g., https://<account_name>.blob.core.windows.net) |
Settings for Service Principal authentication
Name | Required? | Description |
|---|---|---|
Azure Blob Storage Account URL | Yes | The URL of the storage account (e.g., https://<account_name>.blob.core.windows.net) |
Azure Blob Storage Tenant ID | No | The tenant ID of the service principal |
Azure Blob Storage Client ID | No | The client ID of the service principal. If multiple client IDs exist for the service principal, and Azure Blob Storage Client ID is left blank, the default client ID will be used. |
Azure Blob Storage Client Secret | No | The client secret for the service principal |
Azure Blob Storage Authority Host | No | The host of the Microsoft Entra authority for the storage account. If omitted, the host of the Azure Public Cloud authority (login.microsoftonline.com) is used. For a list of valid values, see Microsoft’s azure.identity.AzureAuthorityHosts class. |
Settings for Managed Identity authentication
Name | Required? | Description |
|---|---|---|
Azure Blob Storage Account URL | Yes | The URL of the storage account (e.g., https://<account_name>.blob.core.windows.net) |
Azure Blob Storage Client ID | No | The client ID of the managed identity. If multiple client IDs exist for the managed identity, and Azure Blob Storage Client ID is left blank, the default client ID will be used. |
Settings for Account Key authentication
Name | Required? | Description |
|---|---|---|
Azure Blob Storage Account URL | Yes | The URL of the storage account (e.g., https://<account_name>.blob.core.windows.net) |
Azure Blob Storage Account Key | No | The access key for the storage account |
Azure Blob Storage Account Name | No | The name of the storage account |
If incorrect authentication information is entered, the flow runs for the affected file-ingestion attempts will fail. The flow runs’ output will contain error messages passed to the system by Azure.
For more information about troubleshooting flow runs, see Testing and Debugging Flows.
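The per-authentication-type tables above can be captured as a simple lookup, which is handy if you keep your intended configuration in a script or checklist. This is a sketch only: the `AZURE_SETTINGS` dictionary and `check_settings` function are hypothetical helpers, and the required flags simply restate the Required? column.

```python
from typing import Dict, List

ACCOUNT_URL = "Azure Blob Storage Account URL"

# Settings shown for each authentication type; True marks those listed as required.
AZURE_SETTINGS: Dict[str, Dict[str, bool]] = {
    "SAS Token Only": {ACCOUNT_URL: True},
    "Service Principal": {
        ACCOUNT_URL: True,
        "Azure Blob Storage Tenant ID": False,
        "Azure Blob Storage Client ID": False,
        "Azure Blob Storage Client Secret": False,
        "Azure Blob Storage Authority Host": False,
    },
    "Managed Identity": {
        ACCOUNT_URL: True,
        "Azure Blob Storage Client ID": False,
    },
    "Account Key": {
        ACCOUNT_URL: True,
        "Azure Blob Storage Account Key": False,
        "Azure Blob Storage Account Name": False,
    },
}


def check_settings(auth_type: str, values: Dict[str, str]) -> List[str]:
    """Return a list of problems: missing required settings and unknown names."""
    known = AZURE_SETTINGS[auth_type]
    problems = [f"missing required setting: {name}"
                for name, required in known.items()
                if required and not values.get(name)]
    problems += [f"not a setting for {auth_type}: {name}"
                 for name in values if name not in known]
    return problems
```

An empty list means every required setting is present and no setting belongs to a different authentication type. It does not confirm that the credentials themselves are valid.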
GCS Storage
If you are using GCS Storage as your submission retrieval store, you can use the fields described below to configure the system’s connection to the bucket.
Name | Required? | Description |
|---|---|---|
Use Workload Identity | Must be selected if no value for GCS Service Account JSON is provided | Credentials obtained by using Workload Identity Federation, which applies to Hyperscience installations inside GKE clusters. |
GCS Service Account JSON | Yes, if Use Workload Identity is deselected | The service account JSON credential that allows access to the retrieval-store bucket. To enter the JSON: |
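The relationship between the two GCS settings is an either-or rule: one of them must supply credentials. A minimal sketch of that rule follows, with a basic check that the pasted credential parses as a JSON object. The function name `validate_gcs_settings` is hypothetical, and the check does not verify the contents of the service account key.

```python
import json
from typing import Optional


def validate_gcs_settings(use_workload_identity: bool,
                          service_account_json: Optional[str]) -> None:
    """Raise ValueError if neither credential source is usable."""
    if use_workload_identity:
        return  # credentials come from Workload Identity Federation
    if not service_account_json or not service_account_json.strip():
        raise ValueError(
            "Provide GCS Service Account JSON or select Use Workload Identity.")
    try:
        parsed = json.loads(service_account_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"GCS Service Account JSON is not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("GCS Service Account JSON must be a JSON object.")
```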
Other settings
Setting | Description | Default Value |
|---|---|---|
Enable File Page-Limit Check | Allows you to specify the maximum number of pages that submissions’ files can have. Enabling this option reveals the Maximum Pages Allowed Per File setting, where you can specify the maximum number of pages each file can have. | Disabled |
Maximum Pages Allowed Per File | The maximum number of pages each file in a submission can have. If a file has more pages than this maximum, the submission will fail. This setting is only available if Enable File Page-Limit Check is enabled. | (None) |
Collation
Setting | Description | Default Value |
|---|---|---|
Replace Case Data From Duplicate File Names | If this option is enabled, and a file added to a case has the same name as a file currently in the case, the case will retain data from the most recently submitted version of the file. Note that the data found in the existing version of the file is not deleted; it is only removed from the case. This option is only relevant when adding files to cases. | Disabled |
Retention Period | The number of days to retain cases after they are updated by this block. If no value is provided, no changes to the deletion date will be made. | (Blank) |
Refresh Retention Period | If enabled, applies the retention period in Retention Period to all cases in the block. If disabled, only cases without a pre-existing deletion date are updated. | Enabled |
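The interaction between Retention Period and Refresh Retention Period can be sketched as a small function. This is an interpretation of the descriptions above, not the product's code: the function name and parameters are hypothetical, and counting the retention period from the moment the case is updated is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional


def new_deletion_date(updated_at: datetime,
                      existing_deletion_date: Optional[datetime],
                      retention_days: Optional[int],
                      refresh: bool = True) -> Optional[datetime]:
    """Return the case's deletion date after it is updated by the block."""
    # Blank Retention Period: the deletion date is left unchanged.
    if retention_days is None:
        return existing_deletion_date
    # Refresh disabled: only cases without a deletion date are updated.
    if existing_deletion_date is not None and not refresh:
        return existing_deletion_date
    return updated_at + timedelta(days=retention_days)
```

With a 30-day retention period, a case updated on January 1 would receive a deletion date of January 31, unless Refresh Retention Period is disabled and the case already has a deletion date.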
Classification
Setting | Description | Default Value |
|---|---|---|
Structured Layout Match Threshold | The minimum confidence score a page must have in order for it to be matched to a layout. If the page's confidence score is below this value, the system sends it to Classification Supervision (if enabled) or marks it as "No Layout Found." | 0.6 |
Semi-structured Classification | Enables the management of a model that automatically classifies Semi-structured and Additional documents. | Enabled |
Manual Classification Supervision | Enables Classification Supervision. | Disabled |
Semi-structured Classification Target Accuracy | Your desired accuracy for the classification of Semi-structured and Additional documents. If the estimated accuracy of the model's prediction for a document is below this value, the system will send the document to Classification Supervision (if enabled) or mark it as "No Layout Found." | 99 |
Semi-structured Classification Grouping Logic | Determines how multiple pages that are matched to the same layout variation in a given submission will be handled. To learn more about this setting, see Document Classification Settings. | Consecutive pages as a document |
Semi-structured Classification QA Sample Rate | The percentage of documents that the system will randomly select for Classification QA. | 5 |
Image Correction | Identifies and corrects the orientation of Semi-structured images by rotating them. Cannot be enabled if Faster PDF Transcription is enabled. | Enabled |
Mobile Processing | Improves machine readability of Semi-structured documents captured by mobile devices. To rotate and properly process Semi-structured documents captured by mobile devices, we recommend enabling both Mobile Processing and Image Correction. Before enabling Mobile Processing, make sure that the majority of the pages you will be processing are captured by mobile devices. Contact your Hyperscience representative for more information. | Disabled |
Faster PDF Transcription | If enabled, the system processes pages in PDF files in their native format, allowing for faster transcription. If disabled, the system processes PDF pages by creating images of them and extracting data from those images. To ensure that this feature works as intended, only enable Faster PDF Transcription when submitting PDFs whose pages are correctly oriented and do not require rotation before processing. Cannot be enabled if Image Correction is enabled. If you are processing PDFs and other file types in your flow, consider creating a custom flow that routes PDFs to a Machine Classification Block that has Faster PDF Transcription enabled. | Disabled |
Validate Classification Using LayoutID | Enabling this setting allows Structured documents to be matched using a layout identifier. When this setting is enabled, the system checks for a matching layout identifier in the document. If the identifier matches the expected one in the layout variation, the document is classified accordingly. If it doesn't match, the document is either sent for further review or to Document Drift Management, preventing misclassification. | Disabled |
Bypass Validation if LayoutID is Missing | This setting should be enabled when certain layouts do not contain a layout identifier. It bypasses validation by layout identifier if the matched layout variation doesn’t have an identifier specified. In these cases, the bypass allows the system to continue classifying documents even without layout identifiers, ensuring that documents are still processed but not necessarily tied to a specific layout variation. This setting is available only if Validate Classification Using LayoutID is enabled. | Disabled |
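Two of the rules in the table above lend themselves to a short sketch: the Structured Layout Match Threshold routing, and the mutual exclusion between Image Correction and Faster PDF Transcription. The function names and return strings below are hypothetical, and treating a score exactly equal to the threshold as a match is an assumption based on the word "minimum".

```python
def route_page(confidence: float, threshold: float = 0.6,
               supervision_enabled: bool = False) -> str:
    """Decide what happens to a page based on its layout-match confidence."""
    if confidence >= threshold:
        return "Matched to layout"
    # Below the threshold: Supervision if enabled, otherwise no layout.
    if supervision_enabled:
        return "Classification Supervision"
    return "No Layout Found"


def check_pdf_options(image_correction: bool, faster_pdf_transcription: bool) -> None:
    """Raise ValueError if both mutually exclusive options are enabled."""
    if image_correction and faster_pdf_transcription:
        raise ValueError(
            "Image Correction and Faster PDF Transcription cannot both be enabled.")
```

With the default settings (threshold 0.6, Manual Classification Supervision disabled), a page scoring 0.5 would be marked "No Layout Found" under this sketch.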
Flexible Extraction
Setting | Description | Default Value |
|---|---|---|
Flexible Extraction Transcription Masking | If enabled, prevents keyers from entering invalid characters in Flexible Extraction tasks. | Enabled |
Default Task Restrictions | Determines which users can access Flexible Extraction tasks created by this block. To learn more, see Task Restrictions Overview. | None selected |
Flexible Extraction Show Machine Predictions | If enabled, low-confidence machine transcriptions are populated in Flexible Extraction tasks so they can be reviewed by keyers. | Enabled |
Document Rendering
Setting | Description | Default Value |
|---|---|---|
Document Rendering Enabled | If enabled, allows you to download a PDF file from submissions that have gone through Machine or Manual Classification. After a submission is completed, a download URL is available in the submission’s JSON output. To download the documents: | Disabled |
Document Measurement Unit | The measurement unit used for Document Width and Document Height. This setting is available only if Document Rendering Enabled is enabled. | Inches |
Document Width | The width of pages in the generated PDF file, in the units specified in Document Measurement Unit. The Document Renderer Block has a page-width limit of 600mm. Many common formats exceed this limit. If you encounter issues rendering larger pages, consider adjusting your workflow or resizing the pages so they are smaller than 600mm x 600mm. This setting is available only if Document Rendering Enabled is enabled. | 8.5 |
Document Height | The height of pages in the generated PDF file, in the units specified in Document Measurement Unit. The Document Renderer Block has a page-height limit of 600mm. Many common formats exceed this limit. If you encounter issues rendering larger pages, consider adjusting your workflow or resizing the pages so they are smaller than 600mm x 600mm. This setting is available only if Document Rendering Enabled is enabled. | 11 |
Quality | The desired quality of the generated PDF file. This setting is available only if Document Rendering Enabled is enabled. | 50 |
Output Image Mode | Determines how images appear in the generated PDF: Keep original colors, Convert to grayscale, or Convert to black & white. Selecting Convert to grayscale or Convert to black & white can reduce the final file size of the PDF, optimizing performance. The Quality setting applies only when Output Image Mode is set to Keep original colors or Convert to grayscale. This setting is available only if Document Rendering Enabled is enabled. | Keep original colors |
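Because the Document Renderer Block's limits are stated in millimeters while the default measurement unit is inches, a quick conversion helps confirm that a page size fits. The sketch below uses the exact factor 1 inch = 25.4 mm. The unit keys, function names, and the choice to treat exactly 600 mm as acceptable are assumptions made for illustration.

```python
from typing import Tuple

# Conversion factors to millimeters; the "millimeters" key is an assumed option name.
MM_PER_UNIT = {"inches": 25.4, "millimeters": 1.0}
RENDERER_LIMIT_MM = 600.0


def page_size_mm(width: float, height: float, unit: str = "inches") -> Tuple[float, float]:
    """Convert a page size to millimeters."""
    factor = MM_PER_UNIT[unit]
    return width * factor, height * factor


def within_renderer_limit(width: float, height: float, unit: str = "inches") -> bool:
    """Check both dimensions against the 600 mm limit."""
    w_mm, h_mm = page_size_mm(width, height, unit)
    return w_mm <= RENDERER_LIMIT_MM and h_mm <= RENDERER_LIMIT_MM
```

The default 8.5 x 11 inch page is about 215.9 x 279.4 mm, well within the limit. An A0 sheet (841 x 1189 mm) would exceed it in both dimensions.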
Reprocessing
Setting | Description | Default Value |
|---|---|---|
Reprocessing Enabled | If enabled, keyers can click Mark Layout Variation Incorrect during Flexible Extraction tasks to indicate that a document has been matched to an incorrect layout variation. Then, the document is sent back to Document Classification to be matched to the correct layout variation by a keyer. | Enabled |