Structured Document Classification


Classification models in Hyperscience

Classification models are a crucial part of document processing as they help the system determine which layout should be used to process each page you upload. In Hyperscience, we have two types of document classification:

  • Structured Document Classification - Automatically classifies documents that follow a consistent layout pattern (e.g., tax forms, standardized applications) by assigning them to the correct layout in Hyperscience. Once classified, the documents can proceed to the next step - Field Identification Task, Transcription Task, or Flexible Extraction, depending on your business logic.

  • Semi-structured Document Classification - Automatically classifies documents that don’t follow a consistent layout pattern (e.g., invoices, bank statements). To learn more, see Semi-structured Document Classification and TDM for Classification Models.

In this article, you’ll learn how to work with Structured Document Classification. Learn more about Identification in TDM for Identification Models. To learn about Transcription and Flexible Extraction, see our Transcription and Flexible Extraction articles.

Structured Document Classification

How it works

When a document is submitted to the system, it is first split into individual pages. Structured Document Classification then runs through the following steps:

  1. Visual Page Classifier (VPC):

    • The system runs VPC, which returns layout page-level candidates for each submission page. At this stage, only candidates are proposed, not full documents.

      • For every page, the VPC produces a list of possible layout matches, ranked by confidence. To learn more about structured layouts, see Creating Structured Layouts.

        • Example: Submission page 1 may match layout page 1 of Layout A; submission page 2 may match layout page 2 of Layout A, etc.

  2. Distribute to Forms:

    • Using the list of candidates generated from the VPC, the system attempts to group pages into complete documents.

    • The goal is to minimize the number of documents while maximizing confidence scores across all pages.

    • Pages that fail to align with any layout at this stage are passed to Semi-structured classification (Non-Structured Layout Classifier (NLC)). To learn more, see Semi-structured Document Classification.

  3. Registration:

    • For each proposed distribution, the system runs a Registration step to validate the page-to-layout matches.

      • If confidence for all candidates is above the acceptance threshold (e.g., >0.6), the document is accepted.

      • If one or more candidates are rejected, the system re-runs Distribute to Forms with alternative matches.

  4. Re-distribution & Finalization:

    • The Distribution - Registration cycle runs up to three times:

      • First attempt: Initial grouping of candidates.

      • Second attempt: Re-distribution if a candidate is rejected.

      • Third attempt: Final re-distribution.

    • If all candidates are successfully registered, the current distribution is finalized and used.

    • If some candidates are rejected, the system generates a new distribution and retries registration.

    • If none of the three attempts produces a fully valid distribution, the system returns the best-scoring distribution from those attempts.

  5. Manual Review:

    • Any pages that fail to classify after these steps are marked as No Layout Found and routed to Semi-structured Classification. Depending on your flow settings and your use case, these pages can be handled through a Document Classification Supervision task. To learn more, see Semi-structured Document Classification.
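
The Distribution - Registration cycle described in the steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not Hyperscience's implementation: the function name `run_classification_cycle`, the `distribute` and `register` callables, and the use of mean confidence as the "best-scoring" measure are all assumptions made for the example. Treating a score exactly at the threshold as rejected is also an assumption, since the article only says that accepted candidates score above it.

```python
from typing import Callable, List, Set, Tuple

Pair = Tuple[int, str]            # (submission page number, layout page label)
Scored = Tuple[int, str, float]   # the pair plus its registration confidence


def run_classification_cycle(
    distribute: Callable[[Set[Pair]], List[Pair]],  # hypothetical grouping step
    register: Callable[[int, str], float],          # hypothetical validation step
    threshold: float = 0.6,   # example acceptance threshold from the article
    max_attempts: int = 3,    # the article states the cycle runs up to three times
) -> Tuple[List[Scored], bool]:
    """Return (distribution, fully_valid)."""
    rejected: Set[Pair] = set()
    best_score, best = -1.0, []
    for _ in range(max_attempts):
        proposal = distribute(rejected)
        if not proposal:
            break  # nothing left to propose
        scored = [(page, layout, register(page, layout)) for page, layout in proposal]
        failures = {(page, layout) for page, layout, conf in scored if conf <= threshold}
        if not failures:
            return scored, True  # every candidate registered: finalize
        # Assumption: a distribution's score is its mean confidence.
        score = sum(conf for _, _, conf in scored) / len(scored)
        if score > best_score:
            best_score, best = score, scored
        rejected |= failures  # exclude rejected matches from the next attempt
    return best, False  # no fully valid attempt: fall back to the best one
```

In this sketch, a rejected page-to-layout match is excluded from later attempts, so the second and third attempts are forced to try alternative candidates.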

Blank Pages

If a page contains very little text, VPC may match it as blank and will not attempt to classify it to a layout.

Structured Layout Match Threshold

The Structured Layout Match Threshold defines the minimum confidence score required for a page to be automatically matched to a structured layout.

  • The default threshold is 0.6, but it can be adjusted to fit your use case.

  • Lowering the threshold may increase the risk of incorrect matches.
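
To illustrate why lowering the threshold increases the risk of incorrect matches, here is a minimal sketch of a threshold check. The function `match_page` and the candidate format are hypothetical and are not part of the Hyperscience product or API; only the 0.6 default comes from this article. The sketch assumes that a score exactly at the threshold does not match.

```python
DEFAULT_THRESHOLD = 0.6  # default Structured Layout Match Threshold from the article


def match_page(candidates, threshold=DEFAULT_THRESHOLD):
    """Return the best-scoring layout name, or None when no candidate clears the threshold.

    `candidates` is a list of (layout_name, confidence) pairs (hypothetical format).
    """
    if not candidates:
        return None
    layout, confidence = max(candidates, key=lambda c: c[1])
    # Assumption: the score must be strictly above the threshold to match.
    return layout if confidence > threshold else None


page_candidates = [("Layout A", 0.55), ("Layout B", 0.30)]
default_match = match_page(page_candidates)                 # no automatic match
lowered_match = match_page(page_candidates, threshold=0.5)  # "Layout A" is accepted
```

With the default threshold, the 0.55 candidate is left for review. With a threshold of 0.5, the same weak candidate is matched automatically, which is where incorrect matches can slip through.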

Layout Matching Confidence

Confidence in layout matching directly affects accuracy. The more confident the system is in its layout match, the more reliable the extracted data will be. To learn more, see our Accuracy article.

Contact your Hyperscience representative to determine the best threshold for your use case.

Expand the sections below to learn more about the structured document classification settings and layout identifiers.

Structured Document Classification Settings

Before you start, configure structured document classification behavior in your flow.

  • Enable Manual Classification Supervision to use the Document Classification task, as described in Document Processing Subflow Settings.

  • Structured Layout Match Threshold

    • This threshold controls whether a Structured page is matched to a layout.

      • Pages with confidence below the threshold are sent to Document Classification or marked as No Layout Variation Found.

      • Pages with a confidence score above the threshold are automatically assigned to a layout.

  • Validate Classification Using Layout ID - Enabling this setting allows Structured documents to be matched using a layout identifier. When this setting is enabled, the system checks for a matching layout identifier in the document.

  • Bypass Validation if Layout ID is Missing - This setting should be enabled when certain layouts do not contain a layout identifier.

Learn more about these settings in the Classification section of our Document Processing Subflow Settings article.

Layout Identifiers

Classifying Variations

Some layouts can look almost identical, with only minor visual differences. To avoid misclassification in these cases, you can create layout variations. Each variation represents a small difference in the layout’s pattern, while still belonging to the same overall layout group. Learn more in Adding a Variation to a Layout.

Layout Identifiers

Even with variations, the system may sometimes classify incorrectly. To improve accuracy, Hyperscience can use Layout identifiers to force the correct match.

  • If the identifier in the document matches the expected ID in a layout variation, the system will classify the document to that variation, regardless of the confidence score.

  • If the identifier does not match, the document is routed to either a Document Classification task or Document Drift Management (Layout Triage), depending on your flow settings.
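
The identifier rules above, together with the Bypass Validation if Layout ID is Missing setting, can be summarized as a small decision function. This is a hedged sketch, not the product's logic: `check_layout_id`, its return labels, and the case-insensitive substring comparison are assumptions made for illustration. The outcome when a variation has no identifier and bypass is disabled is also assumed, because the article does not state it.

```python
from typing import Optional


def check_layout_id(expected_id: Optional[str], page_text: str, bypass_if_missing: bool) -> str:
    """Return 'classified', 'fallback_to_confidence', or 'needs_review' (hypothetical labels).

    Assumes Validate Classification Using Layout ID is enabled.
    """
    if expected_id is None:
        # The layout variation has no identifier configured.
        # Assumption: with bypass enabled, the confidence score decides the match.
        return "fallback_to_confidence" if bypass_if_missing else "needs_review"
    # Assumption: a simple case-insensitive substring comparison.
    if expected_id.lower() in page_text.lower():
        return "classified"  # matched regardless of the confidence score
    # Routed to a Document Classification task or Layout Triage,
    # depending on flow settings.
    return "needs_review"
```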

Using Layout Identifiers

  1. Go to Library > Layouts.

  2. Find the layout to which you want to add a layout ID, and click on its name.

  3. Find the variation to which you want to add a Layout ID, and click on its name.

  4. Click Fields in the toolbar, and then click Layout IDs.

  5. Click and drag to draw bounding boxes around each layout ID.

    • Once you draw a box, the machine will read and transcribe the value inside it.

      • Any incorrect transcriptions can be edited in the field list.

  6. When you’re finished making changes to the variation, do one of the following:

    • If you’re ready to apply your changes to the variation, click Commit Changes and save it as a new version. Learn more in Editing and Finalizing a Layout Version.

    • If you’re not ready to apply your changes, click the X button in the upper-right corner of the page.

Layout Identifiers Best Practices

The following best practices for using layout identifiers will ensure the highest levels of accuracy for the classification of Structured documents.

  • Use unique text at the top or the bottom of the page as a layout identifier.

    • The ideal layout identifier is a piece of unique text placed at the top or bottom of a page. Many documents include layout-specific versions, dates, or other identifiers in these areas. Using such identifiers helps reduce errors and improve the accuracy of matching pages to Structured layouts.

  • If no unique text appears at the top or bottom of the page, use another piece of variation-specific text elsewhere.

    • If no identifier is available at the top or bottom of the page, look for unique text within the document that appears only in a specific variation. For example, a clause included in just one variation can serve as a distinguishing factor. Using this text as a layout identifier helps the system correctly classify that variation.

  • Add only the layout identifiers required to tell similar layouts apart.

    • Each layout page should have only one identifier. In rare cases where multiple layouts share the same identifier, you can add a second one to distinguish between them. Avoid adding extra identifiers, as they reduce matching accuracy.

    • Structured layouts are limited to two identifiers. If you try to add more, the system displays a warning message.

  • Keep the number of layout identifiers consistent across all variations of the same layout.

    • If a layout has multiple variations, use the same number of identifiers across all of them. Otherwise, variations with more identifiers will receive a higher confidence boost, which may cause the wrong layout to be matched.
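
Two of the best practices above lend themselves to an automated check: the two-identifier limit and keeping the identifier count consistent across variations. The sketch below assumes you keep your own record of identifiers per variation in a dictionary. The function `check_identifier_counts` and the data format are hypothetical and are not a Hyperscience API, and applying the limit per variation is an assumption.

```python
from typing import Dict, List

MAX_IDENTIFIERS = 2  # the article states structured layouts are limited to two identifiers


def check_identifier_counts(variations: Dict[str, List[str]]) -> List[str]:
    """Return warnings for a layout, given a mapping of variation name to its identifiers."""
    warnings = []
    for name, identifiers in variations.items():
        if len(identifiers) > MAX_IDENTIFIERS:
            warnings.append(
                f"{name}: {len(identifiers)} identifiers exceeds the limit of {MAX_IDENTIFIERS}"
            )
    # Variations with more identifiers would receive a larger confidence boost.
    if len({len(identifiers) for identifiers in variations.values()}) > 1:
        warnings.append("variations use different numbers of identifiers")
    return warnings
```

For example, a layout whose first variation has one identifier and whose second variation has two would produce the "different numbers of identifiers" warning.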

Document Classification task

Document Classification is a Supervision task used to group pages into documents and assign the correct layout. These tasks are created when automatic classification isn’t possible (e.g., when no model is available or the model’s confidence is low).

Expand the section below to learn how to navigate the interface and complete a Document Classification task.

Document Classification Interface

To open a Document Classification task, go to the Tasks section and click Perform Tasks under the Supervision task type table.

Document Classification allows you to:

  • Add uncategorized pages to grouped documents.

  • Reorder pages in grouped documents.

  • Remove pages from grouped documents.

  • Classify (apply layouts to) manually grouped documents.

  • Reclassify (apply different layouts to) machine-misclassified documents.

Left Panel - Uncategorized

The left panel displays all uncategorized pages - pages that haven’t yet been grouped into a document.

Order of Uncategorized pages

Pages in this panel appear in the order they were submitted to the system, with a thumbnail preview for quick scanning.

  • Select pages using your mouse or keyboard shortcuts.

    • For a full list of shortcuts, click the keyboard shortcuts button.

  • You can also select all uncategorized pages by clicking the Select All button.

  • Click the Create New Doc (OPTION+N) button to group the selected pages into a new document.

    • To submit the Document Classification task, all pages from the left panel must be categorized.

Middle Panel - Grouped Documents

The middle panel shows all documents you’ve grouped.

In this panel, you can:

  • Select a layout from the drop-down.

    • If a page doesn’t match any of the available layouts, you can use one of the categories below:

Searching for a specific layout

You can search for a specific layout using the Layout Group drop-down.

  • Select a layout variation from the drop-down.

Searching for layout variations

If layout variations exist for the selected layout group, you can search for one and select it from the Layout Variation drop-down.

  • Create a new document using the selected pages.

  • Reclassify documents that were incorrectly grouped.

Page Numbers

All pages in the left and middle panels show their submission page number based on the order in which they were uploaded.

  • Grouped documents are sorted by the submission number of their first page.

  • When you create a new document manually, the pages are automatically ordered by submission number.

  • Add, remove, and reorder pages within a document.

    • Add a page from one document to another, using the Add to Doc button. Note that this will be treated as a manual match, and the content on the page will require manual extraction.

      • To move a page between documents, make sure both documents use the same layout variation.

    • To move a page to the last position in the document, click the Append page button. Note that if you choose to append it, the machine’s confidence will drop, and it will be sent for manual extraction.

    • Remove a page by clicking its remove button or pressing DEL on your keyboard.

    • Reorder pages using the keyboard shortcuts or by dragging: select the page you want to reorder and drag it to the desired position.
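
The ordering rules described under Page Numbers can be sketched as follows. This is a minimal illustration that assumes each document is represented as a list of submission page numbers; the function `order_documents` is hypothetical, and the sketch only models newly created documents, before any manual reordering.

```python
def order_documents(documents):
    """Order pages within each document, then order documents by their first page.

    `documents` is a list of lists of submission page numbers (hypothetical format).
    """
    # New documents have their pages ordered by submission number.
    ordered = [sorted(pages) for pages in documents if pages]
    # Grouped documents are sorted by the submission number of their first page.
    return sorted(ordered, key=lambda pages: pages[0])


grouped = order_documents([[5, 3], [2, 7], [4]])
```

Here, the document containing pages 2 and 7 comes first, followed by the document with pages 3 and 5, and then the single-page document with page 4.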

Right Panel - Document Preview

The right panel shows a zoomable full-page view of the selected page.

  • Rotate the page by clicking the rotate button or using your keyboard.

Machine-classified grouped documents rotation

Pages in machine-classified grouped documents cannot be rotated in place. To rotate such a page, first remove it from the document. Once it appears under Uncategorized, you can rotate it.

  • You can rotate all selected Uncategorized pages 90° clockwise.

  • During Machine Classification, the page image may be adjusted in order to obtain a match. To reset the image to its submission state, deselect Machine Adjusted Image by clicking its button on the page’s preview in the left or right panel.

  • To complete the task, click Submit (CMD+Enter).

Reprocessing Misclassified documents

If a document is marked as Layout Incorrect during Flexible Extraction, Identification, or Transcription Supervision, you can reprocess it and send it back to Document Classification for rework.

Instead of being marked as Complete, the submission remains in progress until all documents are properly classified.

Using Reprocessing

You can trigger reprocessing from any Identification Supervision, Transcription Supervision, or Flexible Extraction task. To learn more, see Supervision and QA Introduction.

  1. Go to Submissions.

  2. Find the mismatched submission by its ID.

  3. Click on the Perform Tasks link in the Tasks column for the submission.

  4. In the right sidebar of the task, expand the document information and click Mark Layout Variation Incorrect.

    • Doing so removes the layout from all pages of the current document and creates a Document Classification task.

  5. Click Continue on the warning message.

    • The document status changes to Reprocessing and is sent back to Document Classification.

      • On the Document Classification task, you’ll see the following message:

    • If multiple documents are affected, all must be completed before the submission status updates.

  6. Reclassify your documents by following the guidance in the Document Classification Interface section.

After you’ve classified your Structured documents, the next step is to extract the data you need. Learn how to work with Flexible Extraction by reading our Flexible Extraction article.