Training Data Analysis is a tool in Training Data Management (TDM) that helps you understand the quality of your dataset before training a model. It analyzes the uploaded documents to identify patterns based on text and location. The details you can see from the analysis are:
Groups — Training Data Analysis groups similar documents based on text and location. This streamlines the annotation process and ensures consistency in the training dataset. Groups can also help you determine the diversity of the dataset when the model’s performance needs improvement. To learn more, see Improving Model Performance.
Importance — Importance is calculated by estimating how much each document would contribute to the model’s performance. Based on text and location, the system labels the most impactful documents as having high importance. Learn more in Training Data Curator.
Anomalies — Training Data Analysis provides information on your annotations by highlighting inconsistencies and gaps that could affect the model’s quality. For more information, see Labeling Anomaly Detection.
Eligibility — Training Data Analysis indicates whether a document can be used for training, based on internal system checks and machine learning criteria. It also explains why certain documents were excluded from the training set. To learn more, see Document Eligibility Filtering.
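The four outputs above can be pictured as fields on a per-document record. This is only an illustrative sketch of the information the analysis surfaces; TDM does not expose results in this form, and every name below is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DocumentAnalysis:
    """Illustrative per-document analysis record (field names are assumptions)."""
    document_id: str
    group_id: int                  # which text/layout group the document fell into
    importance: str                # e.g. "high", "medium", or "low"
    anomalies: list = field(default_factory=list)   # labeling inconsistencies found
    eligible: bool = True          # whether the document passed eligibility checks
    exclusion_reason: Optional[str] = None          # set only when eligible is False

# A document with no anomalies that passed all eligibility checks:
doc = DocumentAnalysis("invoice_001.pdf", group_id=3, importance="high")
```

Thinking of the results this way makes it clear that each output is per-document, which is why adding or removing documents and rerunning the analysis can change every field.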
Use Training Data Analysis to:
Identify groups of similar documents and understand variation in your training dataset.
Prioritize documents that are most valuable for training.
Detect missing or inconsistent examples before training.
Reduce redundant annotation effort.
Make informed decisions when preparing data for retraining.
Running Training Data Analysis before training or retraining helps improve data quality and reduce the risk of poor or unstable model performance. This article explains how documents are grouped and how to use those groups during the annotation process.
Running Training Data Analysis
Training Data Analysis runs on the documents in your training dataset. To begin the analysis:
Upload your dataset to TDM.
Click the Analyse Data button.
Adding or editing documents during analysis
Do NOT edit or upload documents while the analysis is taking place, as they’ll be excluded from the analysis.

The results appear in the Training Data Health card.
Reanalyze data
The analysis evaluates the current state of the dataset and generates groups, importance scores, anomaly indicators, and eligibility results. Because the analysis is relative to the current dataset, results may change each time you rerun it. Make sure to reanalyze the data each time you:
add or remove training documents,
update annotations, or
prepare the dataset for retraining.
Grouping logic
As described above, Training Data Analysis groups documents based on similarities in text and location. Each group represents a distinct pattern within the current training dataset and helps you determine how documents are distributed. Grouping is relative to the dataset at the time the analysis is run. When documents are added, removed, or updated and the analysis is rerun, group composition can change: new groups may appear, existing groups may merge, or documents may shift between groups.
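TDM’s actual grouping algorithm is not documented here, but the general idea of clustering documents by text similarity can be sketched as follows. The Jaccard measure, the 0.5 threshold, and the greedy assignment are all illustrative assumptions, not the product’s implementation.

```python
def jaccard(a, b):
    """Similarity of two token collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def group_documents(docs, threshold=0.5):
    """Greedily place each document in the first group whose first member
    is similar enough; otherwise start a new group."""
    groups = []  # each group is a list of (name, tokens) pairs
    for name, tokens in docs:
        for group in groups:
            if jaccard(tokens, group[0][1]) >= threshold:
                group.append((name, tokens))
                break
        else:
            groups.append([(name, tokens)])
    return groups

docs = [
    ("invoice_a", ["invoice", "number", "total", "due", "date"]),
    ("invoice_b", ["invoice", "number", "total", "vat", "date"]),
    ("receipt_1", ["receipt", "cash", "change", "thank", "you"]),
]
groups = group_documents(docs)  # two invoices cluster; the receipt stands alone
```

Note how the result depends on the whole dataset: adding a new document can create a new group or enlarge an existing one, which mirrors why rerunning the analysis after dataset changes can reshuffle group composition.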
Using groups during annotation
Annotating documents by group improves both annotation efficiency and the consistency of the training data. Since documents within a group share similarities, annotating them together helps maintain consistent field labeling.
When preparing training data:
Review all document groups before beginning annotation to identify similarities and edge cases.
Apply the same annotation approach across documents in the same group, for example documents from different vendors that share a similar layout.
Avoid introducing ambiguity in the annotation rules.
Annotate group by group to identify gaps in the dataset diversity.
Identification models work with patterns and consistent annotations
Identification logic defines what the model should extract and where to find it on the page, based on relative position to other text.
Avoid encoding business context or subjective decisions in your annotations. If a value can only be identified by who sent the document or by other outside knowledge, this is business logic and should be handled outside the model. Learn more about the annotation process in Training an Identification Model.
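The idea of extracting a value by its position relative to other text can be illustrated with a toy anchor-based lookup. This is a hand-rolled sketch, not how the identification model actually works: the page text, label, and regex are all made up for the example.

```python
import re

def extract_after_label(text, label):
    """Return the token that follows an anchor label, e.g. the value
    immediately after 'Invoice No:' on the same line."""
    match = re.search(re.escape(label) + r"\s*([A-Za-z0-9-]+)", text)
    return match.group(1) if match else None

page = "ACME Corp\nInvoice No: INV-2041\nTotal Due: 118.00"
value = extract_after_label(page, "Invoice No:")
```

The rule here is purely positional ("the token after this anchor"); it knows nothing about who sent the document. That is the distinction the guidance above draws: position-based patterns belong in annotations, sender-specific knowledge does not.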
Groups and dataset diversity
The number and size of groups provide insights into dataset diversity.
Several large groups suggest consistency within the dataset.
Many small groups indicate high document diversity, that is, documents that differ widely in their patterns. Note that this might affect automation performance.
The same type of document across multiple groups suggests different layout patterns.
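To make "a few large groups vs. many small ones" concrete, a dataset's group-size distribution could be summarized with a few simple statistics. This is an illustrative heuristic you could compute yourself, not a TDM feature; the metric names are invented.

```python
from collections import Counter
import math

def diversity_summary(group_ids):
    """Summarize how documents spread across groups: group count,
    largest-group share, and normalized entropy (near 0 = one group
    dominates, near 1 = documents spread evenly across groups)."""
    sizes = Counter(group_ids)
    total = sum(sizes.values())
    shares = [n / total for n in sizes.values()]
    entropy = -sum(p * math.log(p) for p in shares)
    max_entropy = math.log(len(sizes)) if len(sizes) > 1 else 1.0
    return {
        "groups": len(sizes),
        "largest_share": max(shares),
        "normalized_entropy": entropy / max_entropy,
    }

# Seven documents spread across three groups of sizes 4, 2, and 1:
summary = diversity_summary([1, 1, 1, 1, 2, 2, 3])
```

A low normalized entropy with one dominant group suggests a consistent dataset; a high value with many small groups suggests the diverse, pattern-heavy dataset described above.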
Understanding group distribution
Understanding group distribution helps you assess whether the dataset reflects expected real-world examples. To learn how to investigate model performance issues using the groups, see Improving Model Performance.