This article is intended for users who continue to experience poor model performance even after completing the initial monitoring and retraining steps. Before using the guidance in this article, make sure you’ve completed the following:
You’ve identified a reduction in performance using the steps described in Monitoring Model Performance.
You’ve addressed common model performance issues by applying the steps outlined in Improving Model Performance.
If you’ve completed all of the above and the model still underperforms, you’re likely facing one of two advanced challenges:
Overcomplicated model — You’re trying to apply your business rules through the Identification model.
Model saturation — Your dataset is too diverse for the model to generalize effectively.
In this article, you’ll learn how to:
Distinguish between business logic and identification logic.
Recognize signs of saturation and complexity.
Apply strategies such as refining how your documents are ingested and using a Classification model to group them to meet your needs.
Business logic vs identification logic
To improve performance, it’s critical to separate business logic from identification logic.
Business logic defines how extracted data should be interpreted or used, based on outside context like vendor rules, system processes, or downstream conditions.
Example: The same field “Reference Number” may mean “Customer Number” for one vendor and “Account Number” for another. The model cannot determine the meaning; it depends on who sent the document.
Vendor
A vendor is a third-party entity that conducts business with your company and sends you documents. The meaning of the data can vary depending on the vendor.
Identification logic defines what the model should extract and where to find it on the page, based on relative position to other text.
Example: Extract the “Reference Number” from a specific zone, regardless of what the number represents.
Handle business logic outside the model
Apply business rules upstream (before ingestion) or downstream (post-processing). Contact your Hyperscience representative for more information.
By keeping business rules out of the model and focusing only on consistent text and layout patterns, you ensure more reliable results. The next sections explain how to achieve this in practice and how to handle complex datasets.
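To illustrate the separation, here is a minimal Python sketch of a downstream (post-processing) step. The vendor names, field names, and mapping are hypothetical, and the code is not part of the Hyperscience platform; it only shows business logic being applied after the model has extracted the raw value.

```python
# Illustrative post-processing step (hypothetical names; not a Hyperscience API).
# The model extracts the raw "Reference Number"; this vendor-based rule decides
# what that number means after extraction.

VENDOR_FIELD_MEANING = {
    "Vendor A": "Customer Number",
    "Vendor B": "Account Number",
}

def interpret_reference_number(vendor_name: str, reference_number: str) -> dict:
    """Apply business logic downstream: map the extracted value to its meaning."""
    meaning = VENDOR_FIELD_MEANING.get(vendor_name, "Reference Number")
    return {meaning: reference_number}

# The same extracted value is interpreted differently depending on the vendor.
print(interpret_reference_number("Vendor A", "12345"))  # {'Customer Number': '12345'}
print(interpret_reference_number("Vendor B", "12345"))  # {'Account Number': '12345'}
```

The model's job ends with locating and extracting the value; the interpretation lives entirely in the rule above, where it can be changed without retraining.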
How to achieve reliable results
To obtain reliable results, it's essential to apply consistent rules when annotating:
Choose one way of identifying a field across documents and don’t deviate from it, even if the label or exact placement varies slightly.
Use the same approach across documents that belong to the same group, such as different vendors with similar layouts.
Avoid encoding business context or subjective decisions in your annotations.
If a value can only be identified by who sent the document or by other outside knowledge, this is business logic and should be handled outside the model. Learn more about the annotation process in Training a Semi-structured Model.
Understanding the root cause
Based on the distinction between business and identification logic, you can determine the root cause of poor model performance. The most common scenarios are:
Overcomplicating the model - You’re trying to apply business logic in the model instead of identification logic.
Model saturation - You’re training the model on a dataset that’s too diverse or complex to generalize from.
The next sections will help you recognize which of these scenarios applies to your case and what to do about it.
Overcomplicating the model
In some cases, model underperformance is caused by how the model is being used. This scenario often arises when the model is expected to perform tasks beyond its design, such as applying business logic or making decisions that should occur outside the platform. In those situations, the meaning of the extracted data depends on business context and on information outside of the documents, and the model can't learn these kinds of rules.
Main goal of Identification models
The main goal of an identification model is to locate and extract fields, not to validate data, interpret meaning, or apply conditions based on business context. When annotation rules are based on business context rather than document patterns, the model delivers unstable results.
Mitigating model underperformance
Focus identification logic on what's explicitly on the page (e.g., the text and its location) instead of on assumptions that require outside context or knowledge. Based on your use case, we recommend the following steps:
Understand the business goal
Start by clarifying what you're trying to achieve. Ensure you’ve defined:
the business rules,
the data you need to extract, and
the conditions on how the extracted data should be used.
Move business rules outside of the model
Apply business conditions, such as “only if status is Delivered” or “use this value only for Vendor X,” in post-processing or using Custom Code Blocks. Learn more in Modifying Custom Code Blocks.
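As an illustration, the following sketch shows a business condition applied in post-processing. The field names and record structure are assumptions for this example, not the platform's schema; in practice, the equivalent logic would live in your post-processing step or Custom Code Block.

```python
# Post-processing sketch with assumed field names and record structure.
# The model extracts "Status" and "Delivery Date" unconditionally; the business
# condition "only if status is Delivered" is applied here, not in the annotations.

def apply_delivery_rule(record: dict) -> dict:
    """Keep the extracted delivery date only when the status is 'Delivered'."""
    if record.get("Status") != "Delivered":
        return {**record, "Delivery Date": None}
    return record

print(apply_delivery_rule({"Status": "Delivered", "Delivery Date": "2024-05-01"}))
print(apply_delivery_rule({"Status": "Pending", "Delivery Date": "2024-05-01"}))
```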
Business rules should not be incorporated into the annotation process
These rules can be applied upstream or in post-processing. Contact your Hyperscience representative for more information on applying custom logic to your flows.
Review your annotation rules
Ensure that each rule is based on text and location — not on business conditions or context that is not explicit in the document.
Annotate based on consistent dependencies between fields when they’re visible on the page.
Example: “Always annotate the ‘Bill to’ address as the one below the invoice address.” The model can learn and generalize from such relative positioning. To learn more, see Text Segmentation.
Avoid rules that require external context or knowledge outside the document.
Example: “If the invoice comes from Wales, annotate the address to the right of the invoice address; otherwise, annotate the one below.” This rule depends on information the model cannot see and will lead to poor performance.
Avoid conditional annotations
Fields should always be annotated based on layout patterns and text — not on meaning or conditions.
If the same field is annotated with one meaning in some documents and a different meaning in others, the model will learn inconsistent patterns and deliver unstable results.
Prioritize average handling time (AHT) over automation
Not all documents need to be fully automated. We recommend prioritizing stability and accuracy for high-volume cases. In many cases, it’s more efficient to focus on a portion of the data that covers a large part of your volume, even if it doesn’t completely represent the diversity across documents.
For example, 10% of your vendors may account for 90% of your documents, as illustrated in the sketch below. To learn more, see Accuracy and Automation.
Handle complex edge cases manually or through business logic, not through overcomplicated models.
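As a rough illustration of this volume-first approach, the sketch below uses made-up per-vendor document counts to find the smallest set of vendors that covers 90% of the volume.

```python
# Made-up per-vendor document counts. The goal: find the smallest set of
# vendors that covers ~90% of the volume and prioritize those for the model.
from collections import Counter

doc_counts = Counter({"Vendor A": 900, "Vendor B": 60, "Vendor C": 25, "Vendor D": 15})
total = sum(doc_counts.values())

covered = 0
for vendor, count in doc_counts.most_common():
    covered += count
    print(f"{vendor}: cumulative coverage {covered / total:.0%}")
    if covered / total >= 0.9:
        break  # remaining vendors can be handled manually or via business logic
```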
Model saturation
Even if your field definitions are clear and your annotation strategy is consistent, the model may still underperform. In these cases, the issue is often related to dataset complexity, not to logic misuse.
Model saturation happens when a model is trained on documents that are too diverse. There’s so much variation that the model can’t find stable patterns to learn from. Even with correct annotations, the model performs poorly because the task is too broad or too complex.
Model saturation can usually be recognized by symptoms like low accuracy after retraining, inconsistent results, or datasets that break down into many small groups after training data analysis.
For example, a group with fewer than 15 documents is typically considered too small to teach the model a reliable pattern. To learn more, see Improving Model Performance.
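A quick check like the one below (with made-up group names and counts) can flag groups that fall under the ~15-document guideline after training data analysis.

```python
# Made-up group sizes from a training data analysis. Groups below the
# ~15-document guideline are unlikely to teach the model a stable pattern.

MIN_GROUP_SIZE = 15

group_sizes = {
    "Invoices - Layout 1": 240,
    "Invoices - Layout 2": 38,
    "Receipts - Vendor X": 9,
}

too_small = {name: size for name, size in group_sizes.items() if size < MIN_GROUP_SIZE}
print("Groups unlikely to generalize:", too_small)
```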
Handling complex tasks
When a model is trained on too many diverse documents, performance becomes unstable. In these situations, it’s better to reduce the scope or simplify the task. This means breaking it down into several smaller tasks that are easier to manage and train. The goal is to reduce variation, improve consistency, and ensure the model sees repeatable patterns during training. We recommend the following solutions, based on your use case:
Separate the dataset into smaller groups
Break down large, diverse datasets into smaller groups of documents that follow a similar pattern. Each group should represent a clearly defined use case with consistent field positioning and formatting.
Route documents to separate flows using multiple input connections
When possible, configure multiple folder or queue listeners to route different documents (e.g., documents with different patterns) into dedicated flows.
Doing so allows each flow to process a manageable group of documents and train its own model if needed. This approach is preferred when documents can be separated upstream, before they enter Hyperscience. To learn more about document ingestion, see Input Blocks.
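As a conceptual sketch of upstream separation, the following script sorts incoming files into per-pattern folders before ingestion so that each folder can feed a dedicated flow. The folder names and file-name keywords are assumptions; your routing criteria and connection setup will differ.

```python
# Upstream routing sketch. Folder paths and file-name keywords are assumptions;
# files are sorted into per-pattern folders before ingestion so that each
# folder listener can feed a dedicated flow.
import shutil
from pathlib import Path

INBOX = Path("inbox")
ROUTES = {"invoice": Path("flows/invoices"), "receipt": Path("flows/receipts")}
FALLBACK = Path("flows/review")

for document in INBOX.glob("*.pdf"):
    destination = next(
        (folder for keyword, folder in ROUTES.items() if keyword in document.name.lower()),
        FALLBACK,
    )
    destination.mkdir(parents=True, exist_ok=True)
    shutil.move(str(document), str(destination / document.name))
```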
Use a Classification model to route documents into layout-based groups
Hyperscience provides a Classification model that helps you organize your documents into logical subgroups before training. The model works at the text level and can split documents based on differences in wording, headers, or field labels. This approach is especially useful when your documents follow a similar pattern but can still be distinguished by specific text features. To learn more, see Document Classification and TDM for Classification Models.
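The Classification model itself is trained and configured within Hyperscience, but the underlying idea can be illustrated with a simple sketch: documents are assigned to groups based on text features such as headers or field labels. The keywords and group names below are hypothetical.

```python
# Conceptual illustration only; the actual Classification model is trained in
# Hyperscience. The idea: split documents into groups using text features such
# as headers or field labels. Keywords and group names here are hypothetical.

LAYOUT_KEYWORDS = {
    "purchase_orders": ["purchase order", "po number"],
    "invoices": ["invoice number", "bill to"],
}

def classify_by_text(document_text: str) -> str:
    """Assign a document to the first group whose keywords appear in its text."""
    text = document_text.lower()
    for group, keywords in LAYOUT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return group
    return "unclassified"

print(classify_by_text("INVOICE NUMBER: 0042\nBill To: Example Corp"))  # invoices
```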
Next steps
After you’ve grouped the dataset into documents with similar patterns, you can train a dedicated model for each group following the guidelines described in Training a Semi-structured Model.
Important limits
To ensure stability and reliable results, keep your dataset within the following platform limits:
Training set: up to ~5,000 pages
Training Data Analysis: up to ~2,000 documents
Document groups: avoid training one model on more than ~100 groups.
As a guideline, each group should include at least 15–20 documents. Groups below that threshold often result in inconsistent behavior and poor generalization.
To learn more about the platform’s limits, see our Product Limits and Guidelines article or contact your Hyperscience representative.
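If it's helpful, a quick script like the one below (with placeholder dataset statistics) can sanity-check a dataset against these guidelines before training.

```python
# Placeholder dataset statistics checked against the guidelines above.

MAX_TRAINING_PAGES = 5000
MAX_TDA_DOCUMENTS = 2000
MAX_GROUPS = 100
MIN_DOCS_PER_GROUP = 15

def check_dataset(total_pages: int, total_documents: int, group_sizes: dict) -> list:
    warnings = []
    if total_pages > MAX_TRAINING_PAGES:
        warnings.append("Training set exceeds ~5,000 pages.")
    if total_documents > MAX_TDA_DOCUMENTS:
        warnings.append("Training Data Analysis limit of ~2,000 documents exceeded.")
    if len(group_sizes) > MAX_GROUPS:
        warnings.append("More than ~100 document groups in one model.")
    warnings += [
        f"Group '{name}' has only {size} documents."
        for name, size in group_sizes.items()
        if size < MIN_DOCS_PER_GROUP
    ]
    return warnings

print(check_dataset(4200, 1800, {"Vendor A": 120, "Vendor B": 8}))
```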
Monitor the performance of your models by following the steps outlined in Monitoring Model Performance.
Improve your model if needed by following the best practices in Improving Model Performance.