Semi-structured Document Classification

Accessing this feature
Your access to the feature described in this article depends on your license package and pricing plan.
To learn which features are available to your organization and how to add more, contact your Hyperscience representative.

Classification models in Hyperscience

Classification models are a crucial part of document processing as they help the system determine which layout should be used to process each page you upload. In Hyperscience, we have two types of document classification:

Semi-structured Document Classification - Automatically classifies documents that don’t follow a consistent layout pattern (e.g., invoices, bank statements).

Structured Document Classification - Automatically classifies documents that follow a consistent layout pattern (e.g., tax forms, standardized applications) by assigning them to the correct layout in Hyperscience. To learn more, see Structured Document Classification.

In this article, you’ll learn when and how to use Semi-structured Document Classification.

How it works

Semi-structured classification is handled by the Non-Structured Layout Classifier (NLC), which relies on the words in the document to predict the most likely layout group. Note that the NLC works at the page level.

Flow Settings
Semi-structured Classification should be enabled in your flow. To learn more, see the Classification section in our Document Processing Subflow Settings article.

For each page, the model outputs a confidence score that reflects its certainty about the classification. This score is then compared against the configured Target Accuracy Threshold:

High-confidence matches - the document is classified automatically.
Low-confidence matches - behavior depends on your flow configuration:
- If Manual Classification is enabled, low-confidence pages can trigger a Document Classification Task for manual review. Learn more about the Document Classification task in Structured Document Classification.
- If Manual Classification is disabled, low-confidence pages are marked as No Layout Found.

This approach ensures that high-confidence predictions are automated, while low-confidence cases get the human validation needed to maintain accuracy. To learn more about target accuracy, see our Accuracy article.
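The routing described above can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the names PagePrediction, Outcome, and route_page are hypothetical, and the sketch assumes the configured target accuracy has already been translated into a numeric confidence threshold between 0 and 1, with scores at or above it treated as high-confidence.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    AUTO_CLASSIFIED = "auto_classified"
    MANUAL_TASK = "document_classification_task"
    NO_LAYOUT_FOUND = "no_layout_found"


@dataclass(frozen=True)
class PagePrediction:
    layout: str        # predicted layout name (hypothetical field)
    confidence: float  # model confidence, assumed to lie in [0, 1]


def route_page(pred: PagePrediction, threshold: float,
               manual_classification: bool) -> Outcome:
    """Decide what happens to one page based on its confidence score."""
    # Assumption: a score equal to the threshold counts as high-confidence.
    if pred.confidence >= threshold:
        return Outcome.AUTO_CLASSIFIED
    # Low confidence: the result depends on the flow configuration.
    if manual_classification:
        return Outcome.MANUAL_TASK
    return Outcome.NO_LAYOUT_FOUND
```

For example, under these assumptions, a page scored 0.62 against a threshold of 0.90 would produce a manual task when Manual Classification is enabled and would be marked No Layout Found when it is disabled.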

Semi-structured Classification Grouping Logic

When classifying Semi-structured documents, the system needs to decide how to group consecutive pages that are matched to the same layout. This behavior is controlled by the Semi-structured Classification Grouping Logic setting in your flow. Learn more in Document Processing Subflow Settings.

You can choose between three options:

Consecutive pages as a document - All consecutive pages classified to the same layout are grouped together into a single multi-page document.
Consecutive pages as separate documents - Each page classified to the same layout is treated as a separate document, even if the pages are consecutive.
Manual review of consecutive pages - The system flags consecutive pages for human review, allowing users to manually confirm whether they should be grouped into one document or kept separate. Ensure that the Manual Classification Supervision setting is enabled in your flow. Learn more in Document Processing Subflow Settings.

By default, the flow-level grouping logic inherits the grouping configuration defined at the layout level. To learn more, see Auto-Splitting.
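To make the difference between the three options concrete, here is a small Python sketch. It is an illustration under stated assumptions rather than the system's actual logic: pages are represented as hypothetical (page number, layout name) pairs, the mode strings "document" and "separate" are invented for this example, and manual review is modeled simply as listing every run of two or more consecutive pages with the same layout.

```python
from itertools import groupby
from typing import List, Tuple

Page = Tuple[int, str]  # (page number, layout name); hypothetical shape


def _runs(pages: List[Page]) -> List[List[int]]:
    """Split pages into runs of consecutive pages sharing the same layout."""
    return [[num for num, _ in run]
            for _, run in groupby(pages, key=lambda p: p[1])]


def group_pages(pages: List[Page], mode: str) -> List[List[int]]:
    """Return documents as lists of page numbers for the two automatic modes."""
    if mode == "document":   # consecutive pages as a document
        return _runs(pages)
    if mode == "separate":   # consecutive pages as separate documents
        return [[num] for num, _ in pages]
    raise ValueError(f"unknown mode: {mode!r}")


def runs_for_review(pages: List[Page]) -> List[List[int]]:
    """Runs a reviewer would need to confirm (two or more pages)."""
    return [run for run in _runs(pages) if len(run) > 1]


pages = [(1, "Invoice"), (2, "Invoice"), (3, "Bank Statement"), (4, "Invoice")]
print(group_pages(pages, "document"))  # [[1, 2], [3], [4]]
print(group_pages(pages, "separate"))  # [[1], [2], [3], [4]]
print(runs_for_review(pages))          # [[1, 2]]
```

Note that page 4 is not merged with pages 1 and 2 even though it shares their layout, because only consecutive pages are candidates for grouping.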

Releases and Classification models

Each release in Hyperscience contains a set of layouts. Learn more in our What is a Release? article.

When you create a new release, the system automatically generates a new Classification Model for all Semi-structured and Additional layouts in that release.

If a new release contains the same combination of Semi-structured and Additional layouts as an existing release, it reuses the existing Classification Model.
If the release introduces a new set of Semi-structured or Additional layouts, a new model is created automatically.
Adding or removing Structured layouts has no impact on model compatibility.
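One way to picture these compatibility rules is to key each model by the set of Semi-structured and Additional layouts in a release, ignoring Structured layouts. The Python sketch below illustrates that idea only; ModelRegistry, model_key, the layout type strings, and the integer model IDs are all hypothetical and do not describe how Hyperscience stores models.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Layout = Tuple[str, str]  # (layout name, layout type); hypothetical shape

# Layout types that affect the Classification Model in this sketch.
TRAINABLE_TYPES = {"semi-structured", "additional"}


def model_key(layouts: Iterable[Layout]) -> FrozenSet[Layout]:
    """Build an order-independent key from the layouts that matter."""
    return frozenset(l for l in layouts if l[1] in TRAINABLE_TYPES)


class ModelRegistry:
    """Hands out one model ID per distinct combination of trainable layouts."""

    def __init__(self) -> None:
        self._models: Dict[FrozenSet[Layout], int] = {}

    def model_for_release(self, layouts: Iterable[Layout]) -> int:
        key = model_key(layouts)
        if key not in self._models:
            # New combination: a new model is created.
            self._models[key] = len(self._models) + 1
        # Known combination: the existing model is reused.
        return self._models[key]


registry = ModelRegistry()
release_a = [("Invoice", "semi-structured"), ("Cover Sheet", "additional"),
             ("W-9", "structured")]
release_b = [("Cover Sheet", "additional"), ("Invoice", "semi-structured")]
release_c = release_b + [("Bank Statement", "semi-structured")]

print(registry.model_for_release(release_a))  # 1
print(registry.model_for_release(release_b))  # 1 (Structured layout ignored)
print(registry.model_for_release(release_c))  # 2 (new layout, new model)
```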

Over time, as you add layouts and retrain, your Semi-structured Classification model evolves to handle a wider range of documents.

Training the Classification model

To classify Semi-structured and Additional layouts accurately, the system needs to be trained on examples of these documents. Training Data comes from three sources:

Submitted documents - Pages already matched to a layout, whether classified by the machine or by a user. These pages have been processed and passed QA.

Send documents to Training Data Management
To use submitted documents for model training, enable the Send documents to Training Data Management setting in Administration > System Settings. When enabled, Submission data can be sampled into TDM, regardless of QA.

Training Data Management - Documents you add directly from the Classification Model detail page.
“Excluded” documents - Pages used to train the model on what not to classify. This reduces unnecessary manual effort by keeping irrelevant documents out of processing.

Model performance depends on how close new submissions are to the training data:

If documents differ significantly (e.g., new invoice formats), add training pages and retrain the model.
If changes are unexpected, Continuous Model Improvement (if enabled) helps the model adapt automatically, though automation may temporarily dip after a new model is deployed. Learn more about Continuous Model Improvement in Application Settings Overview.

For step-by-step instructions and training limits, see TDM for Classification Models and Product Limits and Guidelines.

To maintain accuracy, a percentage of classified documents is sampled for Classification QA. These tasks ask users to confirm that the model’s predictions are correct and are required for reporting automation and accuracy.
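As a rough illustration of percentage-based sampling, the Python sketch below picks a fixed share of classified documents for review. It rests on assumptions: sample_for_qa, the rate parameter, and the seeded random selection are hypothetical and are not a description of how the system chooses documents for Classification QA.

```python
import random
from typing import List, Sequence


def sample_for_qa(doc_ids: Sequence[str], rate: float, seed: int = 0) -> List[str]:
    """Pick roughly `rate` of the classified documents for QA review."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    # Number of documents to review, rounded to the nearest whole document.
    k = round(len(doc_ids) * rate)
    # A seeded generator keeps the sketch reproducible.
    return random.Random(seed).sample(list(doc_ids), k)


docs = [f"doc-{i}" for i in range(200)]
picked = sample_for_qa(docs, rate=0.05)
print(len(picked))  # 10
```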

Well-trained models reduce manual workload. If the model isn’t trained enough, more pages fall below the threshold and generate Supervision tasks.

In rare cases, the model may make high-confidence mistakes, which can be flagged as incorrect during any ID or Transcription Supervision task. To learn more, see the Reprocessing section of the Structured Document Classification article.

Adding training pages for these cases improves future performance.

Blank Pages - Pages with little text are automatically marked as Blank.

Learn more about improving the model’s performance in Model Validation Tasks.