Hyperscience extracts data from documents and converts them into a machine-readable format. We support Structured, Semi-structured, and Additional documents. To learn how to differentiate between the document types, see Understand Document Types.
Semi-structured use cases
Layouts for Semi-structured documents help identify and extract data from pages that do not have a consistent structure or fixed visual templates. While the information you need to extract (e.g., identification number, address) remains the same, its location may vary and appear under different names or labels. Examples of Semi-structured documents are paystubs and invoices. They contain key pieces of information that are always present, but their placement can vary significantly across different versions of the document.
Use Semi-structured layouts in the following scenarios:
When field positions vary
When tables vary in size or structure across documents
Train a Field Identification (Field Locator) model or a Table Identification (Table Locator) model to extract data from Semi-structured documents using annotated examples in Training Data Management (TDM).
This article walks you through the full process — from preparing your data to training and evaluating your model.
Using features for Semi-structured documents
This article mentions features used in the processing of Semi-structured documents. Your access to those features depends on your license package and pricing plan.
To learn which features are available to your organization and how to add more, contact your Hyperscience representative.
Requirements for training an ID model
Model training is handled by the Trainer, which operates independently from the main application to prevent performance degradation during document processing. Learn more about the Trainer in our Trainer article.
For optimal performance, the trainer requires a dedicated machine with at least 64GB of RAM and 16 CPU cores. To learn more, see Infrastructure Requirements.
Learn more about the product’s limits in Product Limits and Guidelines.
The system's default minimum requirement to run a model training is 100 documents.
To train a new Identification model, it is generally recommended to have at least 400 training documents.
Step 1 — Sampling documents
Review your documents
Having a diverse, representative Training Set is crucial for a high-quality Identification model. Selecting the appropriate documents for training will optimize your semi-structured model’s performance.
Determine the common types of documents you'll be processing, and ensure you have at least:
15 documents per document type for a Field Identification model, or
20 documents per document type for a Table Identification model.
To create a more generalized model that can handle a wide range of different documents, you need a diverse dataset.
Within each document type, use documents with similar visual patterns so the model can learn them reliably.
Provide examples of each type of document you want to include.
For example, if you are training a model to process invoices from multiple vendors, make sure to include at least 20 invoices per Vendor to ensure high model performance.
Become familiar with the edge cases (i.e., documents that are completely different from the main ones) and determine their variety. Exclude them if they are not suitable for your use case.
Remove documents that would reduce model performance (e.g., documents containing unrelated information; highly distorted pages that are noisy, skewed, or pixelated; and duplicates).
Set aside 50-100 documents for testing purposes.
Note that these documents should be representative of the data you expect in production.
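As a rough sketch of the sampling guidance above, you might hold out a test set that stays representative by sampling evenly across document types. This example is illustrative only; the function name, document-type labels, and sizes are assumptions, not part of Hyperscience:

```python
import random

def split_train_test(documents, test_size=50, seed=42):
    """Split documents into training and held-out test sets.

    'documents' is a list of (doc_id, doc_type) pairs; the test set is
    sampled evenly across document types so it remains representative
    of the documents expected in production.
    """
    rng = random.Random(seed)
    by_type = {}
    for doc_id, doc_type in documents:
        by_type.setdefault(doc_type, []).append(doc_id)

    # Take an equal share of the test set from each document type
    per_type = max(1, test_size // len(by_type))
    test_ids = set()
    for doc_ids in by_type.values():
        rng.shuffle(doc_ids)
        test_ids.update(doc_ids[:per_type])

    train = [d for d in documents if d[0] not in test_ids]
    test = [d for d in documents if d[0] in test_ids]
    return train, test
```

Stratifying the held-out set this way helps ensure that no document type is missing from your evaluation.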
Reusing an existing model vs. training a new one
Adding more vendors’ documents to a model can be effective, but if the document structures vary too much, it may impact your model’s performance. If the formats and field requirements are similar, a single model can work well. Otherwise, training a separate model can help ensure more consistent results.
For example, if your model is trained on invoices where key fields like Invoice Number, Total Amount, and Due Date consistently appear in similar locations, adding vendors’ documents with the same structure is straightforward. But if documents from a new vendor introduce a visual layout where those fields appear in different locations — such as the Due Date in the footer or payment details split across sections — we recommend training a separate model to maintain accuracy. Learn more about model performance in Monitoring Model Performance and Improving Model Performance.
Review your fields and columns
Your fields and columns should be representative of the information you want to extract.
Ensure they are present in your documents to achieve a high-performance model.
Review any interchangeable fields or columns; annotating them inconsistently might result in poor model performance.
For optimal results, the data in your training set should be representative of the data expected in production.
Step 2 — Build a layout, add it to a release, and assign it to a flow
Once you’ve identified the fields you want to extract, create a layout to define how those fields are captured.
Your layout determines:
which fields are extracted
how the model learns to identify them
Example
Consider the following insurance claim form:

This is a typical semi-structured document where key information is spread across sections and may appear in different locations depending on the format.
Identification models are designed to extract this type of information consistently, even when layouts vary.
Use the interactive walkthrough below to create your layout and configure the required fields.
Guidelines
Use unique names for your fields or columns to avoid model training failure and simplify the Annotation process.
Make sure to set the proper data type for each field or column you create to obtain a high-performance model.
Learn more about data types and how to choose them in What is a Data Type? and Choosing a Data Type.
Ensure your configurations are suitable for the fields and columns for extraction:
Check Multiple Occurrences if your fields have more than one occurrence.
Enable the Multiline setting if required.
Set Identification Supervision to Always for each field you want to guarantee a manual review for.
Enabling this setting will always generate Field ID tasks, regardless of the machine’s confidence. To learn more, see Scoring Field Output Accuracy.
Set Transcription Supervision to Always if there are issues in the document that could prevent the machine from reading the field or the column. That way, the system will always send it to Manual Transcription, ensuring review from your keyers. Learn more about accuracy in Accuracy and Transcription Accuracy and Automation.
Find more field configurations in the Defining field metadata section of Creating Semi-structured Layouts.
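The supervision behavior described above can be sketched as a small routing rule: a field set to Always goes to manual review no matter how confident the machine is, while other fields are routed only when confidence falls below the threshold. This is a conceptual illustration with assumed parameter names, not Hyperscience's actual implementation:

```python
def needs_manual_review(supervision, confidence, threshold):
    """Return True if the field output should generate a manual task.

    With supervision set to 'always', a task is created regardless of
    the machine's confidence; otherwise only low-confidence output is
    routed to a keyer. All names here are illustrative.
    """
    if supervision == "always":
        return True  # guaranteed manual review
    return confidence < threshold
```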
Layout versions
Identification training always uses the latest live layout version, not the latest committed draft.
Example: If the live version of a layout does not have the multi-line checkbox enabled, but the latest committed version does, the model will still train on the live version (without the checkbox).
To learn more about flows and releases, see Flows Overview and What is a Release?.
Step 3 — Training Data Management (TDM)
Use the tools in Training Data Management to control, manage, and adjust the ground truth of your training sets for Identification and Classification models. In this section, you will learn how to upload your data using TDM. Before you start:
Ensure you meet the requirements outlined in Requirements for training an ID model section of this article.
Make sure to keep 50-100 documents for testing purposes.
Note that they should be representative of the data expected in production. You’ll upload them after the model training is completed.
Ground truth is manually annotated data used to train our machine-learning models. We use a subset of this data to assess the performance of your models.
Use the interactive walkthrough below to upload documents to TDM:
All uploaded documents will appear on the Training Data card.
Table models
If you have a table in your layout, switch to the Table Identification tab.
Note that the status of your documents will be Ready to annotate. Learn more about statuses in Training Data Management.
Using Submission data in TDM
The Send documents to Training Data Management setting for Identification and Classification models allows you to control whether submission data is used for model training.
It is disabled by default and can be managed from the System Settings (Administration > System Settings).
Step 4 — Analyze your data
Training Data Analysis allows you to group your training documents and receive recommendations to improve the quality of your dataset. Learn more in Training Data Analysis.
Running Training Data Analysis
We recommend running training data analysis once you've uploaded your documents. The system will create groups based on the similarity of your training documents, which improves the efficiency of the annotation process.
Learn how to receive insights for improving your training data in the interactive walkthrough below:
Do NOT edit or upload documents while the analysis is taking place, as they’ll be excluded from the analysis.
The results will appear in the Training Data Health card. Learn more in TDM for Identification Models.
Analysis results
The results show you the eligibility and importance of each document. Learn more in Training Data Management Features.
Re-analyze your data
The system does not re-analyze the training data automatically. Make sure to re-analyze when:
you upload new documents, or
you edit existing documents.
Groups - Training data analysis groups your training set by visual similarity. For the best data representation, we recommend having at least 10 groups of each document type.
Groups with Excess Documents
Having a group with excess documents (e.g., more than 15 samples for Field ID and 20 samples for Table ID) does not necessarily mean that you need to remove the excess data. Depending on the specific use case and the performance of your model, you may want to enrich the annotations by adding more annotated examples from a particular group. Contact your Hyperscience representative for more information.
Importance - The Training Data Curator labels each training document as having high or low importance.
Training Data Curator
The importance is calculated by determining which data would best contribute to the model’s performance. For each group of documents, the system labels the most impactful ones as having high importance. The goal is to improve the efficiency of the annotation process by requesting an optimal subset that reflects the variety of documents whose data you expect to identify with the model. Learn more about how data is curated in Training Data Curator.
Eligibility - With Document Eligibility Filtering, you can see which documents are incompatible with training and why, allowing you to address any issues accordingly and achieve better model performance. Learn more in Document Eligibility Filtering.
Detect anomalies - Re-analyze your data and find inconsistencies across your annotations with Labeling Anomaly Detection. For more information, see our Labeling Anomaly Detection article.
Consistent annotations are crucial for a high-performance locator model.
Learn how to annotate fields in Field Identification and how to annotate tables in Table Identification.
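The group-size checks described above can be sketched as a simple summary over the analysis groups: count documents per group, flag underrepresented groups, and flag groups that exceed the recommended sample count. The thresholds, function name, and data shape are illustrative assumptions:

```python
from collections import Counter

def review_groups(doc_groups, min_per_group=2, recommended=15):
    """Summarize training-data groups by size.

    'doc_groups' maps document IDs to group labels (as produced by
    training data analysis). Groups under 'min_per_group' are flagged
    as underrepresented; groups over 'recommended' may hold excess
    data (e.g., 15 samples for Field ID, 20 for Table ID).
    """
    sizes = Counter(doc_groups.values())
    under = [g for g, n in sizes.items() if n < min_per_group]
    excess = [g for g, n in sizes.items() if n > recommended]
    return dict(sizes), under, excess
```

A summary like this can help you decide where to enrich annotations and which groups need more documents.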
Step 5 — Annotate your documents
Learn how to annotate your documents by following the best practices listed below.
General guidelines
Once you analyze the data, you’ll be able to annotate by group. Doing so provides you with more control over the dataset. Annotating by group and by priority helps you determine which groups have more documents and which groups are underrepresented.
After annotating 2-3 documents per group, you'll be able to use guided data labeling. This feature provides machine-generated suggestions that help you annotate more quickly.
Follow the general rule for annotating: left to right, top to bottom.
Make sure to maintain consistent annotations for your fields or columns. When a single value of a field or a column appears in different sections of the document, annotate it strictly in one location to avoid confusing the model.
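The "left to right, top to bottom" annotation order above is the standard reading order for bounding boxes. As a rough sketch (the coordinate format and row tolerance are assumptions for illustration):

```python
def reading_order(boxes, row_tolerance=10):
    """Sort bounding boxes left to right, top to bottom.

    'boxes' is a list of (x, y) top-left coordinates in pixels; boxes
    whose y values fall within the same 'row_tolerance' bucket are
    treated as one row. A simplified model of the annotation order,
    not an exact layout algorithm.
    """
    # Bucket y into coarse rows, then sort by (row, x)
    return sorted(boxes, key=lambda b: (b[1] // row_tolerance, b[0]))
```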
Search by Text Segment
Search by Text Segment is available in TDM for Field Identification and Table Identification models.
This feature allows you to search for fields or cells by specific text segments directly within the Training Data Management interface. Search by Text Segment helps you locate, review, and annotate data faster during the model-training process. Learn how to use Search by Text Segment in Field Identification. To learn more about segments, see our Text Segmentation article.
Always use the machine predictions when drawing the bounding box. Avoid drawing it manually.
Adjust the machine predictions ONLY if the bounding boxes are overlapping and preventing the proper extraction of the data.
Do NOT interchange fields or columns, as doing so may lead to uncertainty for the model.
If a field or a table cell is not present, do not replace it with a similar value.
If you don’t see a box made of dashed lines around a value, do NOT annotate it. If there is no such box, it means that our internal ML models are not reading any values for that field or cell.
The annotations serve as Ground truth labels that guide the model through the training process. Aligning the annotations with the machine's predictions will ensure that the model learns from accurate and consistent information. Inconsistencies, such as annotating the same information in different locations within a document, can affect the model's ability to learn patterns accurately, which may result in lower performance or incorrect predictions. To learn more, see our Text Segmentation article.
Field Identification
Annotate fields with Multiple Occurrences only when multiple instances of a field are present. Learn more about Multiple Occurrences in Field Identification.
Use multiple bounding boxes when logically connected text is split across separate areas. Learn more in the Multiple bounding boxes for fields section of Field Identification.
If you don’t see a value for a field (i.e., the field is blank), do NOT annotate it.
Table Identification
When annotating a table, make sure to select a row where all data is present. The row you select is your template row, or the row in your table that is most representative of the table’s content.
The template row doesn't need to be the first row in the table. Hyperscience uses the copycat tool to populate the annotation from the template row to the rest of the rows. The copycat is not always accurate, so make sure to double-check the annotations before you submit.
Always find your table's first and last rows and ensure they are properly annotated.
Always press the Esc key before submitting a table to ensure the annotations are correct.
Draw one large bounding box capturing all rows of your table, and press the S key on your keyboard. That way, you'll activate the Split tool and be able to define or correct the rows of your table faster. Make sure to double-check the annotations.
Tags
Starting in v43, you can manage documents by assigning tags during annotation.
Learn more tips and tricks on annotating tables in Table Identification.
Once you're ready with your annotations, re-analyze the data, and use Anomaly Detection to ensure that your annotations are correct and consistent. Learn more in Labeling Anomaly Detection. You can re-analyze your data after each iteration to maximize the quality of the training set.
Next steps
Check if all training documents are eligible for training.
The number next to Eligible for training on the model details page is the number of documents that will be used in your training set. This number may change as documents are annotated and each time you analyze your training data.
Ensure you have the required number of training documents.
The number of Required documents shown on the model details page is the number of additional documents you need to upload and annotate to run a model training.
Step 6 - Review your flow and train your model
Once you’ve reviewed your annotations and addressed any potential anomalies, you’re ready to initiate model training.
Before you run a model training:
Review the flow’s configurations:
Your system might consist of several workflows, called flows. Each flow contains blocks, representing important stages of the data-extraction process. Learn more in Flows Overview.
For more precise control over the process, you can configure your flow’s settings.
Set your Target Accuracy to achieve better performance.
The system uses QA data and the Field Identification Target Accuracy or Table Identification Target Accuracy values to calculate the optimal confidence threshold that will allow the system to reach the target accuracy with the minimum amount of manual effort. We recommend using the default values (95% for Field ID and 96% for Table ID) for the initial training to compare the results with the next iterations and adjust accordingly later. Change the target accuracy as follows after the first iteration:
If you want to achieve high automation, set a lower percentage.
If you need high accuracy, set a higher value.
To learn more, see Accuracy and Automation.
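The accuracy-versus-automation trade-off above can be sketched as a threshold search: using QA results, find the lowest confidence threshold whose automated output still meets the target accuracy, since a lower threshold automates more fields. This is a simplified illustration of the idea, not the product's actual algorithm, and all names are assumptions:

```python
def optimal_threshold(qa_results, target_accuracy=0.95):
    """Pick the lowest confidence threshold that meets a target accuracy.

    'qa_results' is a list of (confidence, was_correct) pairs from QA
    review. Fields at or above the threshold are automated; we return
    the lowest threshold whose automated subset meets the target
    accuracy, maximizing automation.
    """
    candidates = sorted({c for c, _ in qa_results})
    for t in candidates:  # lowest threshold first -> most automation
        automated = [ok for c, ok in qa_results if c >= t]
        if automated and sum(automated) / len(automated) >= target_accuracy:
            return t
    return None  # no threshold reaches the target
```

This also shows why raising the target accuracy tends to raise the threshold and reduce automation, and vice versa.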
Run Training
Initiate a model training by clicking the Run Training button.
The button will be grayed out if you don’t have the minimum number of required documents.
You'll receive a notification in the Notification section, which is located in the upper-right corner of the application, once the training is completed.
A single Trainer attached to your instance will train one model at a time. For example, if you run a model training for Field ID, and then start a model training for Table ID, the one that you’ve started first will be running, and the second one will be queued. To learn more, see our What is the Trainer? article.
Monitor the training jobs in the Running and Queued cards on the Trainer page (Administration > Trainer).
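The single-Trainer behavior described above is a first-in, first-out queue: one job runs, and later submissions wait their turn. A minimal sketch, purely for illustration (the real Trainer is a separate service, and these names are assumptions):

```python
from collections import deque

class TrainerQueue:
    """Minimal model of a single Trainer: one job runs, the rest queue."""

    def __init__(self):
        self.running = None
        self.queued = deque()

    def submit(self, job):
        """Start the job immediately if idle; otherwise queue it."""
        if self.running is None:
            self.running = job
        else:
            self.queued.append(job)

    def finish(self):
        """Complete the running job and start the next queued one."""
        done, self.running = self.running, None
        if self.queued:
            self.running = self.queued.popleft()
        return done
```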
Step 7 — Evaluate the training results
Deploy your model
Once the model training is complete, you'll find the candidate model on the model details page. To deploy it, click on your candidate model, then click Deploy Model.
The model is now live and ready for document processing. You can see insights on the automation and accuracy on the model details page. Learn more in Model Validation Tasks and Evaluating Model Training Results.
Always verify layout-version compatibility when switching between model versions
The Live version of a model always uses the most recent layout version, regardless of which layout version it was originally trained with.
This pairing can lead to unexpected behavior, especially if changes were made to the layout after training (e.g. new fields, field-setting updates).
Example: If a model is trained on v3 of a layout, and v4 of that layout is created after the training, the model will use v4 of the layout when deployed.
Evaluate the performance
Use the documents you’ve chosen for testing purposes to evaluate the performance of your model. Note that, to measure the performance accurately, these documents should not be ones that were used for the training. To learn more about evaluating the model’s performance, see our Monitoring Model Performance article.
Before you start, ensure that your flow configurations match the ones you expect in production.
Enable Manual Identification Supervision if you have fields that you want to review manually. Doing so will generate Manual Identification tasks, which should be performed by a keyer.
If required, enable any combination of Field Identification Quality Assurance, Table Identification Quality Assurance, and Transcription Quality Assurance.
Set the QA sample rate for each type of quality assurance you enable (Field Identification QA Sample Rate, Table Identification QA Sample Rate, or Transcription QA Sample Rate).
QA Sample Rate
The QA Sample Rate value represents the percentage of documents selected for Field ID, Table ID, or Transcription QA tasks. To learn more, see our Accuracy article.
Learn more about flow-level configurations in the Document Processing Subflow Settings article.
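The QA sample rate described above is rate-based sampling: each document is independently selected for QA with a probability equal to the configured percentage. A hedged sketch of the concept (function name, seeding, and selection mechanics are assumptions, not the product's implementation):

```python
import random

def select_for_qa(doc_ids, sample_rate, seed=0):
    """Select roughly 'sample_rate' percent of documents for QA tasks.

    Each document is independently chosen with probability
    sample_rate / 100, so the realized share fluctuates around the
    configured rate.
    """
    rng = random.Random(seed)
    return [d for d in doc_ids if rng.random() < sample_rate / 100]
```

For example, a 20% sample rate over 10,000 documents selects about 2,000 of them for QA review.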
Upload your testing documents
Upload your documents as submissions by following the steps below:
Go to Submissions.
Click Create Submission.
Upload the testing documents. If you’re uploading multiple documents at once, select One Submission per file to evaluate the performance for each document.
Click Next.
Choose the flow you’re using for the model from the Flow drop-down list.
Choose the layout used for the model from the Layout drop-down list.
Click Upload.
Results
Observe the results based on your flow settings on the Document Output page. Learn more in Document Output Page.
Next Steps
If the model is performing poorly, we suggest going over the training documents, as described in Improving Model Performance:
Check for potential annotation errors and inconsistencies and fix them.
Re-run training data analysis to use the labeling anomaly detection for more accurate results.
Based on the results, you can also decide to enrich the training set by adding more documents.
If the model is performing well and the projected automation meets the target accuracy, we do not recommend retraining the model unless:
changes are made to the layout or
the data distribution of incoming documents has changed (e.g., new visual template). Learn more in the section below.
Retraining existing models
When you adjust a semi-structured layout, some of the changes require retraining your Identification Models, while others do not.
Retraining ensures that the model can correctly recognize new or updated fields and tables, and helps prevent unexpected behavior in production.
This section explains which types of layout changes require retraining and which can be applied without it. Understanding this distinction helps you save time and maintain model accuracy.
To learn how to train an Identification model, see Training a Semi-structured model.
Learn how to monitor and improve your models in our Monitoring Model Performance and Improving Model Performance articles.
When retraining is required
Field Identification Models
Retraining is required if:
A new field is added.
The Multiline setting for an existing field is toggled. Learn more in Creating Semi-structured Layouts.
The Multiple Occurrences setting is toggled.
Retraining required
Adding a new field without retraining will not make it functional.
You must retrain the model with documents that include annotations for the new or updated field. Until retrained, the new field will remain unsupported. You’ll see a message in Training Data Management when new fields are missing from the training data:
Table Identification Models
Retraining is required if:
A new column has been added.
The Multiline setting for an existing column is toggled.
Retraining required
Adding a new column without retraining is not enough. The model must be retrained with annotated documents that contain the new or updated column. You’ll see a message in Training Data Management when new columns are missing from the training data:
When retraining is not required
Updating any of the following settings in an existing field does not require retraining your Field or Table ID model:
Output name
Transcription Supervision
Identification Supervision
Required
Not in English

Field ID models
No retraining needed if:
A field is removed
A field’s data type is changed. To learn more, see Data Types.
Table ID models
A Table ID model does not need to be retrained if:
A column is removed
A column’s data type is changed
Additional Considerations
If you retrain with existing training data only, new fields or columns, or ones with an updated Multiline setting, will not be included. Retrain with enough annotated examples covering the new or modified layout elements.

