Document Processing with ORCA Subflow Settings


This article describes the settings available in the Document Processing with ORCA Subflow included in v42.3. The settings available in custom flows may differ from those described here, depending on which blocks are included in those flows. To learn more about the settings for individual blocks, see Flow Blocks.

As part of our efforts to give you more precise control over your Hyperscience processes, we’ve made many of our settings configurable on the flow level.

While you can build custom flows, each instance of Hyperscience includes a Document Processing flow. To learn more about the version of this flow that comes with v42, see Document Processing Flow in V42.

The Document Processing flow contains several subflows, including the Document Processing with ORCA Subflow. This article focuses on the settings available in that subflow.

View the subflow’s settings

To view the settings of the Document Processing with ORCA Subflow:

  1. Click Flows in the left-hand sidebar, and click on the name of the Document Processing flow that contains the Document Processing with ORCA Subflow whose settings you would like to view.

  2. Click Edit Flows.

  3. On the Flow Studio canvas, click the Start Document Processing Subflow Block.

  4. Click the Settings Type drop-down list, and click on a setting type.

Edit the subflow’s settings

After you’ve viewed the subflow’s settings, you can make any necessary changes, and then click Save in the upper-right corner of the page. You can save changes to multiple setting types at once.

Available settings

The sections below describe the settings available for each setting type.

ORCA

Setting

Description

Default Value

ORCA Base Model

The base model that will be used when processing submissions through this block.

Note that the selected base model must be installed in the instance before using the Document Processing with ORCA Subflow to process submissions. To learn more, see Installing ORCA VLMs.

ORCA 1.0

ORCA Target Accuracy

Your desired accuracy for the extraction of field data. If the estimated accuracy of the model's prediction for a field is below this value, the system will send the field and all its occurrences (if any) to Flexible Extraction Supervision.

Note that the target accuracy is applied only if a specialized model definition has been created for this flow from the base model.

95

Quality Assurance Flow

The subflow that will be used to generate Vision Language Model QA tasks for submissions processed through this flow.

Because there is no default selection for this setting, you must manually select ORCA Quality Assurance Flow in order for QA tasks to be generated and accuracy to be measured for this flow.

None

Task Restrictions

Determines which users can access Supervision tasks created by this block.

To learn more, see Task Restrictions Overview.

None selected

ORCA QA Sample Rate

The percentage of documents that the system will randomly select for Vision Language Model QA.

This setting is available only if a flow is selected in Quality Assurance Flow.

5
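
The way ORCA Target Accuracy and ORCA QA Sample Rate act on a field or document can be sketched as follows. This is only an illustration under stated assumptions, not the product's implementation: the function names are hypothetical, and the sketch assumes that estimated accuracy and the sample rate are both expressed as percentages on a 0-100 scale.

```python
import random
from typing import Optional


def needs_supervision(estimated_accuracy: float, target_accuracy: float = 95.0) -> bool:
    """Return True if a field would be sent to Flexible Extraction Supervision.

    Assumption: both values are percentages (0-100). A field is routed to
    Supervision only when its estimated accuracy is below the target.
    """
    return estimated_accuracy < target_accuracy


def selected_for_qa(sample_rate: float = 5.0, rng: Optional[random.Random] = None) -> bool:
    """Randomly select a document for QA at the given percentage rate.

    Assumption: selection is an independent random draw per document.
    """
    rng = rng or random.Random()
    # rng.random() is in [0.0, 1.0), so a rate of 0 never selects
    # and a rate of 100 always selects.
    return rng.random() * 100 < sample_rate
```

With the default values shown in the table, a field with an estimated accuracy of 94.9 would be routed to Supervision, and roughly 5 in every 100 documents would be sampled for QA.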

File Filter

Setting

Description

Default Value

All Files or Images Only

Determines whether the filters in this block are applied to all files in submissions (Apply to all files) or only to image files, that is, files whose MIME type is image (Apply to images only).

Apply to all files

Minimum Image Width (px)

The minimum width in pixels that an image needs to have in order to be allowed by the filter.

This filter applies only to images (i.e., files whose MIME type is image) and has no impact on other files.

(Blank)

Minimum Image Height (px)

The minimum height in pixels that an image needs to have in order to be allowed by the filter.

This filter applies only to images (i.e., files whose MIME type is image) and has no impact on other files.

(Blank)

Minimum File Size (KB)

The minimum size in kilobytes that a file needs to have in order to be allowed by the filter.

(Blank)

File Extension Action

Select one of the following options:

  • Do not filter files by extension

  • Allow only these file extensions

  • Deny files with these extensions

Do not filter files by extension

File Extensions

A list of file extensions that the filter will allow or deny, based on the option selected in File Extension Action. Select the checkboxes for the file extensions that you would like to filter by.

If zip is selected as a file extension, the filter will not decompress ZIP files included in submissions. Each ZIP file will be treated as an individual file, regardless of the number or types of files compressed within it.

If there are file extensions that you want to filter by that do not appear in the drop-down list, select other, and enter the extensions in Other File Extensions.

This field only appears if Allow only these file extensions or Deny files with these extensions is selected in File Extension Action.

(Does not appear)

Other File Extensions

A comma-separated list of file extensions that do not appear in File Extensions.

This field only appears if other is selected in File Extensions.

(Does not appear)
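
The File Filter settings above combine into a single allow-or-reject decision per file. The sketch below is one possible reading of that logic, with hypothetical class and field names. It assumes that, when Apply to images only is selected, non-image files pass through unfiltered, and that a blank setting means "no limit."

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class FileInfo:
    name: str
    mime_type: str
    size_kb: float
    width_px: Optional[int] = None   # only known for images
    height_px: Optional[int] = None  # only known for images


@dataclass
class FileFilter:
    images_only: bool = False            # "Apply to images only"
    min_width_px: Optional[int] = None   # blank -> no minimum
    min_height_px: Optional[int] = None  # blank -> no minimum
    min_size_kb: Optional[float] = None  # blank -> no minimum
    extension_action: str = "none"       # "none", "allow", or "deny"
    extensions: Set[str] = field(default_factory=set)

    def allows(self, f: FileInfo) -> bool:
        is_image = f.mime_type.startswith("image/")
        # Assumption: with "images only", other files skip every filter.
        if self.images_only and not is_image:
            return True
        if self.min_size_kb is not None and f.size_kb < self.min_size_kb:
            return False
        # Width and height minimums apply only to images.
        if is_image:
            if self.min_width_px is not None and (f.width_px or 0) < self.min_width_px:
                return False
            if self.min_height_px is not None and (f.height_px or 0) < self.min_height_px:
                return False
        ext = f.name.rsplit(".", 1)[-1].lower() if "." in f.name else ""
        if self.extension_action == "allow" and ext not in self.extensions:
            return False
        if self.extension_action == "deny" and ext in self.extensions:
            return False
        return True
```

For example, `FileFilter(extension_action="deny", extensions={"zip"})` rejects `report.zip` as a single file without looking inside it, which matches the ZIP behavior described above.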

Submission Bootstrap

AWS S3

S3 Submission Retrieval Store Configuration

If you are using an S3 bucket as your submission retrieval store and you are not authenticating through IAM roles, provide your AWS access key ID and secret access key in the S3 Submission Retrieval Store Configuration field.

To enter your credentials:

  1. Click Edit value.

  2. Enter your credentials in JSON format:

    {
    "aws_access_key_id": "<your_access_key_id>",
    "aws_secret_access_key": "<your_secret_key>"
    }

    You can also authenticate requests using AWS Signature Version 2 (SigV2). To use SigV2, add the following key and value to the JSON in the S3 Submission Retrieval Store Configuration field:

    "s3_signature_version": "s3"

  3. Click Done.

  4. Click Save in the upper-right corner of the page.

  5. In the dialog box that appears, click Save & Deploy.

For more information about AWS access key IDs and secret access keys, see Amazon's Understanding and getting your AWS credentials.
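
If you prefer to generate the JSON rather than type it by hand, a small helper such as the one below can assemble it. This is a minimal sketch: the function name and its `use_sigv2` parameter are hypothetical, while the three JSON keys are the ones shown in the steps above.

```python
import json


def build_s3_store_config(access_key_id: str, secret_access_key: str,
                          use_sigv2: bool = False) -> str:
    """Return the JSON text to paste into the configuration field."""
    config = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    if use_sigv2:
        # Key and value documented above for AWS Signature Version 2.
        config["s3_signature_version"] = "s3"
    return json.dumps(config, indent=4)


# Placeholders only; substitute real credentials before pasting.
print(build_s3_store_config("<your_access_key_id>", "<your_secret_key>", use_sigv2=True))
```

Generating the value with `json.dumps` guarantees that the commas and quotation marks are valid, which is the most common source of errors when an optional key is added manually.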

S3 Submission Retrieval Endpoint URL

If your submission retrieval store is not in the public cloud (i.e., its URL does not point to s3.amazonaws.com, as is the case with a government cloud or an S3-compatible internal setup), enter its URL in S3 Submission Retrieval Endpoint URL. You do not need to edit your “.env” file to update this URL.

To edit the endpoint URL for your S3 submission retrieval store:

  1. Enter the URL in the S3 Submission Retrieval Endpoint URL field or edit its contents.

  2. Click Save in the upper-right corner of the page.

  3. In the dialog box that appears, click Save & Deploy.

If the bucket you’re using as your submission retrieval store is in a public cloud (as opposed to a government cloud or an S3-compatible internal setup), leave this field blank.  

OCS

If you are using an OCS submission retrieval store, enter the configuration details for your file store in these fields.

When you are finished entering or editing these fields’ values, click Save in the upper-right corner of the page. Then, in the dialog box that appears, click Save & Deploy.

[v42.0-v42.2] OCS Configuration

To enter your configuration details:

  1. Click Edit value.

  2. Enter the configuration details in JSON format:  

    {
    "host_url": "<your_host_url>", 
    "username": "<your_username>", 
    "password": "<your_password>", 
    "ssl_cert": "<CA_bundle_filename_OR_SKIP>"
    }

    The value of ssl_cert should match the CA bundle filename inside the $HS_PATH/certs directory. To disable certificate validation, set this value to SKIP.

  3. Click Done.

  4. Click Save in the upper-right corner of the page.

  5. In the dialog box that appears, click Save & Deploy.
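
Before pasting the OCS configuration, you may want to confirm that it parses as JSON and contains every key shown above. The helper below is a sketch with a hypothetical name. It assumes that all four keys need a non-empty value, mirroring the v42.3 fields, which are all required when an OCS store is used.

```python
import json
from typing import List

# Keys taken from the JSON example above.
REQUIRED_OCS_KEYS = ("host_url", "username", "password", "ssl_cert")


def missing_ocs_keys(raw_json: str) -> List[str]:
    """Return the required keys that are absent or empty in the given JSON text."""
    config = json.loads(raw_json)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(config, dict):
        raise ValueError("OCS configuration must be a JSON object")
    return [key for key in REQUIRED_OCS_KEYS if not config.get(key)]
```

An empty list means that the text is valid JSON and that no key is missing. It does not verify that the host is reachable or that the credentials are correct.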

v42.3

Name

Required?

Description

OCS Host URL

Yes, if using an OCS submission retrieval store

The OCS host URL for the submission retrieval store.

OCS Username

Yes, if using an OCS submission retrieval store

The OCS username for authenticating into the submission retrieval store.

OCS Password

Yes, if using an OCS submission retrieval store

The OCS password for authenticating into the submission retrieval store.

OCS SSL Certificate

Yes, if using an OCS submission retrieval store

The CA bundle filename inside the $HS_PATH/certs directory. To disable certificate validation, set this value to SKIP.

Generic Web Storage (HTTP/HTTPS)

Generic Web Storage (HTTP/HTTPS) Configuration

If you are using generic web storage as your submission retrieval store, enter the configuration details for your file store in this field.

We use Basic Authentication for Generic Web Storage Configuration.

To enter your configuration details:

  1. Click Edit value.

  2. Enter the configuration details in JSON format:  

    { 
    "username": "<your_username>", 
    "password": "<your_password>", 
    "ssl_cert": "<CA_bundle_filename_OR_SKIP>"
    }

    The value of ssl_cert should match the CA bundle filename inside the $HS_PATH/certs directory. To disable certificate validation, set this value to SKIP.

  3. Click Done.

  4. Click Save in the upper-right corner of the page.

  5. In the dialog box that appears, click Save & Deploy.
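
To make the roles of the three keys concrete, the sketch below shows how a client could turn them into an HTTP Basic Authentication header and a certificate-verification setting. It illustrates standard Basic Authentication, not the system's internal code: the function names are hypothetical, and `hs_path` stands in for the value of $HS_PATH.

```python
import base64
import os
from typing import Dict, Union


def basic_auth_header(username: str, password: str) -> Dict[str, str]:
    """Build a standard HTTP Basic Authentication header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def resolve_ssl_verify(ssl_cert: str, hs_path: str) -> Union[bool, str]:
    """Map the ssl_cert value to a verification setting.

    Returns False when validation is disabled with SKIP; otherwise returns
    the path of the CA bundle inside the certs directory under hs_path.
    """
    if ssl_cert == "SKIP":
        return False
    return os.path.join(hs_path, "certs", ssl_cert)
```

Because Basic Authentication only encodes the credentials and does not encrypt them, setting ssl_cert to SKIP removes the only check that the server receiving them is the one you expect.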

Azure Blob Storage

If you are using Azure Blob Storage as your submission retrieval store, you can use the fields described below to configure the system’s connection to the blob.

Azure Blob Storage Authentication Type

From the Azure Blob Storage Authentication Type drop-down list, select the authentication type the system should use to access the blob:

  • SAS Token Only

  • Service Principal

  • Managed Identity

  • Account Key

When you select an authentication type, additional settings appear.

Settings for SAS Token Only authentication

Name

Required?

Description

Azure Blob Storage Account URL

Yes

The URL of the storage account (e.g., https://<account_name>.blob.core.windows.net)

Settings for Service Principal authentication

Name

Required?

Description

Azure Blob Storage Account URL

Yes

The URL of the storage account (e.g., https://<account_name>.blob.core.windows.net)

Azure Blob Storage Tenant ID

No

The tenant ID of the service principal

Azure Blob Storage Client ID

No

The client ID of the service principal.

If multiple client IDs exist for the service principal, and Azure Blob Storage Client ID is left blank, the default client ID will be used.

Azure Blob Storage Client Secret

No

The client secret for the service principal

Azure Blob Storage Authority Host

No

The host of the Microsoft Entra authority for the storage account.

If omitted, the host of the Azure Public Cloud authority (login.microsoftonline.com) is used.

For a list of valid values, see Microsoft’s azure.identity.AzureAuthorityHosts class.

Settings for Managed Identity authentication

Name

Required?

Description

Azure Blob Storage Account URL

Yes

The URL of the storage account (e.g., https://<account_name>.blob.core.windows.net)

Azure Blob Storage Client ID

No

The client ID of the managed identity.

If multiple client IDs exist for the managed identity, and Azure Blob Storage Client ID is left blank, the default client ID will be used.

Settings for Account Key authentication

Name

Required?

Description

Azure Blob Storage Account URL

Yes

The URL of the storage account (e.g., https://<account_name>.blob.core.windows.net)

Azure Blob Storage Account Key

No

The access key for the storage account

Azure Blob Storage Account Name

No

The name of the storage account

If incorrect authentication information is entered, the flow runs for the attempted file ingestions will fail. The flow runs’ output will contain error messages passed to the system by Azure.

For more information about troubleshooting flow runs, see Testing and Debugging Flows.
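
The four tables above can be summarized as a lookup from authentication type to the settings it uses. The sketch below encodes that lookup and checks a set of values against it. The short setting names (such as `account_url`) and the function name are hypothetical shorthand for the field labels in the tables.

```python
from typing import Dict, List

# Required and optional settings per authentication type, per the tables above.
AZURE_AUTH_FIELDS = {
    "SAS Token Only": {"required": {"account_url"}, "optional": set()},
    "Service Principal": {
        "required": {"account_url"},
        "optional": {"tenant_id", "client_id", "client_secret", "authority_host"},
    },
    "Managed Identity": {"required": {"account_url"}, "optional": {"client_id"}},
    "Account Key": {"required": {"account_url"}, "optional": {"account_key", "account_name"}},
}


def check_azure_settings(auth_type: str, settings: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the settings fit the type."""
    spec = AZURE_AUTH_FIELDS[auth_type]
    problems = []
    for name in sorted(spec["required"]):
        if not settings.get(name):
            problems.append(f"missing required setting: {name}")
    allowed = spec["required"] | spec["optional"]
    for name in sorted(set(settings) - allowed):
        problems.append(f"setting not used by {auth_type}: {name}")
    return problems
```

A check like this catches only structural mistakes. Incorrect values still surface as failed flow runs with the Azure error messages described above.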

GCS Storage

If you are using Google Cloud Storage (GCS) as your submission retrieval store, you can use the fields described below to configure the system’s connection to the bucket.

Name

Required?

Description

Use Workload Identity

Must be selected if no value for GCS Service Account JSON is provided

If selected, the system uses credentials obtained through Workload Identity Federation, which applies to Hyperscience installations inside GKE clusters.

GCS Service Account JSON

Yes, if Use Workload Identity is deselected

The service account JSON credential that allows access to the retrieval-store bucket.

To enter the JSON:

  1. Click Edit value.

  2. Enter your Service Account credentials in valid JSON format.

  3. Click Done.

  4. Click Save in the upper-right corner of the page.

  5. In the dialog box that appears, click Save & Deploy.
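
The dependency between Use Workload Identity and GCS Service Account JSON can be expressed as a short check. The function below is a sketch with a hypothetical name. It tests only that one of the two options is supplied and that the JSON parses as an object, not that the credential is accepted by Google Cloud.

```python
import json
from typing import Optional


def gcs_credentials_ok(use_workload_identity: bool,
                       service_account_json: Optional[str]) -> bool:
    """Return True if the two GCS settings form a usable combination."""
    if use_workload_identity:
        # Workload Identity supplies the credentials; no JSON is needed.
        return True
    if not service_account_json:
        # Neither option provided.
        return False
    try:
        return isinstance(json.loads(service_account_json), dict)
    except json.JSONDecodeError:
        return False
```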

Other settings

Setting

Description

Default Value

Enable File Page-Limit Check

Allows you to specify the maximum number of pages that submissions’ files can have.

Enabling this option reveals the Maximum Pages Allowed Per File setting, where you can specify the maximum number of pages each file can have.

Disabled

Maximum Pages Allowed Per File

The maximum number of pages each file in a submission can have. If a file has more pages than this maximum, the submission will fail.

This setting is only available if Enable File Page-Limit Check is enabled.

(None)
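
The page-limit rule applies to each file individually but fails the submission as a whole. A minimal sketch of that rule, using a hypothetical function name and treating an unset maximum as "no limit," is:

```python
from typing import Iterable, Optional


def submission_passes_page_limit(page_counts: Iterable[int],
                                 check_enabled: bool,
                                 max_pages: Optional[int]) -> bool:
    """Return False if any file in the submission exceeds the page maximum."""
    if not check_enabled or max_pages is None:
        return True
    # A file with exactly max_pages pages is still allowed.
    return all(count <= max_pages for count in page_counts)
```

For instance, with a maximum of 10 pages, a submission whose files have 3 and 11 pages fails even though the first file is within the limit.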

Collation

Setting

Description

Default Value

Replace Case Data From Duplicate File Names

If this option is enabled, and a file added to a case has the same name as a file currently in the case, the case will retain data from the most recently submitted version of the file.

Note that the data found in the existing version of the file is not deleted; it is only removed from the case. This option is only relevant when adding files to cases.

Disabled

Retention Period

The number of days to retain cases after they are updated by this block.

If no value is provided, no changes to the deletion date will be made.

(Blank)

Refresh Retention Period

If enabled, applies the retention period in Retention Period to all cases in the block.

If disabled, only cases without a pre-existing deletion date are updated.

Enabled
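
Retention Period and Refresh Retention Period together determine whether a case's deletion date changes. The sketch below shows one reading of that interaction. The function name is hypothetical, and it assumes that the new deletion date is the time of the update plus the retention period in days.

```python
from datetime import datetime, timedelta
from typing import Optional


def updated_deletion_date(updated_at: datetime,
                          current_deletion_date: Optional[datetime],
                          retention_days: Optional[int],
                          refresh: bool = True) -> Optional[datetime]:
    """Return the case's deletion date after the block updates the case."""
    if retention_days is None:
        # Blank Retention Period: the deletion date is left unchanged.
        return current_deletion_date
    if current_deletion_date is not None and not refresh:
        # Refresh disabled: cases that already have a deletion date keep it.
        return current_deletion_date
    # Assumption: retention is counted from the moment of the update.
    return updated_at + timedelta(days=retention_days)
```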

Classification

Setting

Description

Default Value

Structured Layout Match Threshold

The minimum confidence score a page must have in order for it to be matched to a layout. If the page's confidence score is below this value, the system sends it to Classification Supervision (if enabled) or marks it as "No Layout Found."

0.6

Semi-structured Classification

Enables the management of a model that automatically classifies Semi-structured and Additional documents.

Enabled

Manual Classification Supervision

Enables Classification Supervision.

Disabled

Semi-structured Classification Target Accuracy

Your desired accuracy for the classification of Semi-structured and Additional documents. If the estimated accuracy of the model's prediction for a document is below this value, the system will send the document to Classification Supervision (if enabled) or mark it as "No Layout Found."

99

Semi-structured Classification Grouping Logic

Determines how multiple pages that are matched to the same layout variation in a given submission will be handled.

To learn more about this setting, see Document Classification Settings.

Consecutive pages as a document

Semi-structured Classification QA Sample Rate

The percentage of documents that the system will randomly select for Classification QA.

5

Image Correction

Identifies and corrects the orientation of Semi-structured images by rotating them.

Cannot be enabled if Faster PDF Transcription is enabled.

Enabled

Mobile Processing

Improves machine readability of Semi-structured documents captured by mobile devices. To rotate and properly process Semi-structured documents captured by mobile devices, we recommend enabling both Mobile Processing and Image Correction.

Before enabling Mobile Processing, make sure that the majority of the pages you will be processing are captured by mobile devices. Contact your Hyperscience representative for more information.

Disabled

Faster PDF Transcription

If enabled, the system processes pages in PDF files in their native format, allowing for faster transcription. If disabled, the system processes PDF pages by creating images of them and extracting data from those images.

To ensure that this feature works as intended, only enable Faster PDF Transcription when submitting PDFs whose pages are correctly oriented and do not require rotation before processing.

Cannot be enabled if Image Correction is enabled. If you are processing PDFs and other file types in your flow, consider creating a custom flow that routes PDFs to a Machine Classification Block that has Faster PDF Transcription enabled.

Disabled

Validate Classification Using LayoutID

Enabling this setting allows Structured documents to be matched using a layout identifier. When this setting is enabled, the system checks for a matching layout identifier in the document. If the identifier matches the expected one in the layout variation, the document is classified accordingly. If it doesn't match, the document is either sent for further review or to Document Drift Management, preventing misclassification.

Disabled

Bypass Validation if LayoutID is Missing

This setting should be enabled when certain layouts do not contain a layout identifier. It bypasses validation by layout identifier if the matched layout variation doesn’t have an identifier specified. In these cases, the bypass allows the system to continue classifying documents even without layout identifiers, ensuring that documents are still processed but not necessarily tied to a specific layout variation.

This setting is available only if Validate Classification Using LayoutID is enabled.

Disabled
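
Two of the decisions in this table lend themselves to a short sketch: routing a page based on Structured Layout Match Threshold, and validating a match with a layout identifier. The function names and return labels below are hypothetical. The sketch also assumes that, when the bypass is disabled and the layout variation has no identifier, the document is treated like a mismatch.

```python
from typing import Optional


def route_structured_page(confidence: float, threshold: float = 0.6,
                          supervision_enabled: bool = False) -> str:
    """Decide what happens to a page based on its layout-match confidence."""
    if confidence >= threshold:
        return "matched"
    # Below the threshold: Supervision if enabled, otherwise no layout.
    return "classification_supervision" if supervision_enabled else "no_layout_found"


def layout_id_check(expected_id: Optional[str], found_id: Optional[str],
                    bypass_if_missing: bool = False) -> str:
    """Validate a matched layout variation using its layout identifier."""
    if expected_id is None:
        # The variation has no identifier specified.
        # Assumption: without the bypass, this is handled like a mismatch.
        return "classified" if bypass_if_missing else "review_or_drift_management"
    if found_id == expected_id:
        return "classified"
    return "review_or_drift_management"
```

With the defaults in the table (threshold 0.6, Manual Classification Supervision disabled), a page scoring 0.59 would be marked "No Layout Found," while a page scoring exactly 0.6 would be matched.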

Flexible Extraction

Setting

Description

Default Value

Flexible Extraction Transcription Masking

If enabled, prevents keyers from entering invalid characters in Flexible Extraction tasks.

Enabled

Default Task Restrictions

Determines which users can access Flexible Extraction tasks created by this block.

To learn more, see Task Restrictions Overview.

None selected

Flexible Extraction Show Machine Predictions

If enabled, low-confidence machine transcriptions are populated in Flexible Extraction tasks so they can be reviewed by keyers.

Enabled

Document Rendering

Setting

Description

Default Value

Document Rendering Enabled

If enabled, allows you to download a PDF file from submissions that have gone through Machine or Manual Classification.

After a submission is completed, a download URL is available in the submission’s JSON output. To download the documents:

  1. Go to the Submissions page.

  2. Open the submission whose documents you want to download.

  3. Click Actions, and then click View JSON Output.

  4. Use your browser’s search function to locate download_url.

  5. Copy the URL and append it to your environment’s URL (e.g., example.hyperscience.com/api/<URL>).

  6. Choose a folder on your local machine to save the file, and the download will begin.

Disabled

Document Measurement Unit

The measurement unit used for Document Width and Document Height:

  • Inches

  • Millimeters

This setting is available only if Document Rendering Enabled is enabled.

Inches

Document Width

The width of pages in the generated PDF file, in the units specified in Document Measurement Unit.

The Document Renderer Block has a page-width limit of 600mm. Many common formats exceed this limit. If you encounter issues rendering larger pages, consider adjusting your workflow or resizing the pages so they are smaller than 600mm x 600mm.

This setting is available only if Document Rendering Enabled is enabled.

8.5

Document Height

The height of pages in the generated PDF file, in the units specified in Document Measurement Unit.

The Document Renderer Block has a page-height limit of 600mm. Many common formats exceed this limit. If you encounter issues rendering larger pages, consider adjusting your workflow or resizing the pages so they are smaller than 600mm x 600mm.

This setting is available only if Document Rendering Enabled is enabled.

11

Quality

The desired quality of the generated PDF file.

By default, the quality is set to 50%, which balances image clarity and file size. We recommend using this default setting for best results. Lowering the quality reduces the file size but may make images less clear, while increasing the quality creates larger files with sharper images. For example, a document originally 1 MB in size can grow to 40 MB when rendered in high resolution.

This setting is available only if Document Rendering Enabled is enabled.

50

Output Image Mode

Determines how images appear in the generated PDF:

  • Keep original colors

  • Convert to grayscale

  • Convert to black & white

Selecting Convert to grayscale or Convert to black & white can reduce the final file size of the PDF, optimizing performance.

The Quality setting applies only when Output Image Mode is set to Keep original colors or Convert to grayscale.

This setting is available only if Document Rendering Enabled is enabled.

Keep original colors
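
Because Document Width and Document Height can be entered in inches while the Document Renderer Block's limit is stated in millimeters, a quick conversion helps confirm that a page size is within range. The helper below is a sketch with hypothetical names, and it assumes that a side of exactly 600 mm is acceptable.

```python
MM_PER_INCH = 25.4
MAX_SIDE_MM = 600.0  # page-width and page-height limit stated above


def to_mm(value: float, unit: str) -> float:
    """Convert a dimension to millimeters based on Document Measurement Unit."""
    if unit == "Inches":
        return value * MM_PER_INCH
    if unit == "Millimeters":
        return value
    raise ValueError(f"unknown unit: {unit}")


def within_render_limit(width: float, height: float, unit: str = "Inches") -> bool:
    """Return True if both sides are within the 600 mm limit."""
    # Assumption: a side of exactly 600 mm is still allowed.
    return to_mm(width, unit) <= MAX_SIDE_MM and to_mm(height, unit) <= MAX_SIDE_MM
```

The default size of 8.5 x 11 inches converts to 215.9 x 279.4 mm, which is well inside the limit. A 24 x 36 inch page is 914.4 mm on its long side and would not fit.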

Reprocessing

Setting

Description

Default Value

Reprocessing Enabled

If enabled, keyers can click Mark Layout Variation Incorrect during Flexible Extraction tasks to indicate that a document has been matched to an incorrect layout variation. Then, the document is sent back to Document Classification to be matched to the correct layout variation by a keyer.

Enabled