Identification models in Hyperscience are machine learning models that extract data from semi-structured documents by learning how text segments correspond to specific fields or table entries (to learn more, see Text Segmentation). Using annotated examples, they learn patterns based on both document content and layout, which enables consistent data extraction across documents whose layouts vary.
Structured and Semi-structured use cases
Use the layout of your documents to determine the right approach:
Structured use cases
The layout of your documents is highly consistent. Data appears in predictable locations across documents, with little to no variation.
Best suited for structured layouts, where data can be extracted based on fixed positions. To learn more, see Building a Structured Use Case.
Semi-structured use cases
The information is present across documents, but its position or labeling may vary. Layouts are similar, but not identical.
Best suited for Identification models, which learn patterns to extract data despite variation. Learn how to train an Identification model in our Training an Identification Model article.
Learn more about document types in Understanding Document Types.
In this article, you’ll learn:
When Identification models are the right choice for your use case.
Types of Identification models and when to use them.
Types of Identification models
Identification models are divided into two types based on how data appears in a document:
Field Identification focuses on extracting individual data points, such as names, dates, or IDs, based on learned layout patterns.
Table Identification focuses on extracting repeating groups of related data, such as line items or transactions, where multiple values belong together within each entry.
The key difference is whether the data appears as standalone values or as part of a table.
Model training
Identification models are trained at the layout level. Learn more in Creating Semi-structured Layouts.
Field Identification
A Field Identification model learns how individual data points correspond to specific fields within semi-structured documents. Using annotated training examples, it identifies and extracts each field independently, based on learned patterns from both the text and layout of the document. It is designed to handle cases where:
Data points appear independently across a document.
The same field may appear once or multiple times within the document.
Layouts vary, but each field can be identified on its own.
Use Field Identification when you need to extract individual data points that are not associated with other fields, such as:
Single-value fields (e.g., address, number, date, name).
Repeated standalone fields (e.g., multiple dates or IDs across a document).
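As an illustration of what "standalone" means here, the sketch below models Field Identification results as plain Python data. It is a minimal sketch under assumptions: `FieldValue`, `group_by_field`, and the field names are hypothetical and do not represent Hyperscience's output schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldValue:
    """One extracted data point (hypothetical structure, not a Hyperscience type)."""
    field: str  # name of the field the text segment was matched to
    text: str   # extracted text
    page: int   # page on which the segment was found


def group_by_field(values: list[FieldValue]) -> dict[str, list[str]]:
    """Collect extracted texts per field; a field may occur once or several times."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for value in values:
        grouped[value.field].append(value.text)
    return dict(grouped)


# Hypothetical extraction result: "date" is a repeated standalone field.
extracted = [
    FieldValue("invoice_number", "INV-1001", page=1),
    FieldValue("date", "2024-01-05", page=1),
    FieldValue("date", "2024-02-05", page=2),
]
fields = group_by_field(extracted)
```

Each value is identified on its own: nothing in this structure links one date to any other field, which is the property that separates standalone fields from table entries.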
Table Identification
A Table Identification model learns how repeating groups of related data are organized within semi-structured documents. Using annotated examples, it identifies data points that belong together (such as quantities, prices, and descriptions within a single line item) and extracts them as grouped entries. It is designed to handle cases where:
Data appears in repeating groups across a document.
Multiple values are associated with each entry (e.g., a line item).
Layouts vary, but the relationship between data points remains consistent.
Use Table Identification when you need to extract repeating, related data points that are grouped together, such as:
Line items (e.g., description, quantity, price per item).
Transactions (e.g., date, amount, description).
Entries in reports or listings.
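The grouping of values into entries can be pictured as rows in which every value belongs to the same line item. The following is a minimal sketch, assuming a hypothetical `LineItem` structure and column names; it is not Hyperscience's output schema.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LineItem:
    """One table entry (hypothetical structure): values that belong together."""
    description: str
    quantity: int
    unit_price: Decimal


def line_total(item: LineItem) -> Decimal:
    """Multiply quantity by unit price; only valid if both come from the same row."""
    return item.quantity * item.unit_price


# Hypothetical extracted rows for an invoice with two line items.
rows = [
    LineItem("Widget", 2, Decimal("9.50")),
    LineItem("Gasket", 10, Decimal("0.25")),
]
totals = [line_total(row) for row in rows]
```

A calculation such as `line_total` only makes sense when the quantity and price are extracted as part of the same entry, which is why this kind of data is a better fit for Table Identification than for independent fields.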
How Identification models learn
Annotation process
Identification models learn and improve over time from annotated training documents:
To train an Identification model, you first annotate documents in Training Data Management (TDM), defining the fields and tables you want to extract.
The training documents must be representative of your production data. Learn more about document sampling in Training an Identification Model.
Analyze the training documents to ensure quality and consistency in the annotations. To learn more about the features used for the annotation process, see Training Data Management.
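One simple way to think about annotation consistency is to look for documents where an expected field has no annotation. The sketch below is illustrative only: the `annotations` mapping, the document IDs, and `find_annotation_gaps` are hypothetical, and it assumes you have annotation data available as plain Python data (it does not call TDM or any Hyperscience API).

```python
def find_annotation_gaps(
    annotations: dict[str, set[str]],
    expected_fields: set[str],
) -> dict[str, set[str]]:
    """Return, per document, the expected fields that have no annotation."""
    gaps: dict[str, set[str]] = {}
    for doc_id, annotated in annotations.items():
        missing = expected_fields - annotated
        if missing:
            gaps[doc_id] = missing
    return gaps


# Hypothetical annotation summary: document ID -> annotated field names.
annotations = {
    "doc-001": {"invoice_number", "date", "total"},
    "doc-002": {"invoice_number", "total"},
    "doc-003": {"invoice_number", "date", "total"},
}
gaps = find_annotation_gaps(annotations, {"invoice_number", "date", "total"})
```

A gap is not necessarily an error, since a field can be legitimately absent from a document. The result is best read as a list of documents to review.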
Training process
Once you have enough consistently annotated examples, you can train a model. Training uses annotated data from TDM, allowing the model to learn from a consistent set of examples.
Training resources
Model training requires computing resources and is handled by the Trainer. Learn more in our Trainer article.
After training is initiated, it runs as a background Job, and the model becomes available once training is complete. It is common to run multiple rounds of training as you refine annotations and expand your Dataset.
Evaluation and maintenance
After the model is live, you can evaluate its performance through Quality Assurance (QA), review how it behaves on unseen production documents, and improve it over time by refining annotations or adding more training examples. Learn more about model performance in our Model Maintenance category.
Model performance
The quality, consistency, and coverage of your training data directly impact model performance.
Learn how to annotate your data and train a model in our Training an Identification Model article.
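Coverage can be reasoned about by comparing how often each document variant appears in the training set with how often it appears in production. The sketch below is a rough illustration under assumptions: the variant labels, the `tolerance` value, and both function names are hypothetical, and the check is not part of Hyperscience.

```python
from collections import Counter


def variant_shares(variants: list[str]) -> dict[str, float]:
    """Return the fraction of documents that belong to each variant."""
    counts = Counter(variants)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {variant: count / total for variant, count in counts.items()}


def underrepresented(
    training: list[str],
    production: list[str],
    tolerance: float = 0.10,  # arbitrary illustrative threshold
) -> list[str]:
    """List variants whose training share trails their production share by more than `tolerance`."""
    train = variant_shares(training)
    prod = variant_shares(production)
    return sorted(
        variant
        for variant, share in prod.items()
        if train.get(variant, 0.0) < share - tolerance
    )


# Hypothetical variant labels, one per document.
training_docs = ["vendor_a"] * 8 + ["vendor_b"] * 2
production_docs = ["vendor_a"] * 4 + ["vendor_b"] * 4 + ["vendor_c"] * 2
needs_more_examples = underrepresented(training_docs, production_docs)
```

Variants flagged this way are candidates for additional annotated examples before the next round of training.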