Training Data Management

Accessing this feature

Your access to the feature described in this article depends on your license package and pricing plan.

To learn which features are available to your organization and how to add more, contact your Hyperscience representative.

Training Data Management (TDM) lets you improve and supervise models by working directly with the training data (“ground truth”) obtained from each document in the training set. You can group documents, identify incompatible ones, annotate representative parts of them, and detect potential inconsistencies.

The performance of your models depends on the quality of the pages, the diversity of the documents, and the consistency of the annotations. For more information on model-training results, see Evaluating Model Training Results.

TDM includes tools for controlling and managing the performance of Identification and Classification models. Learn more about model performance in Monitoring Model Performance and Improving Model Performance.

TDM for Identification models

TDM for Identification models includes the following features: 

  • Document Eligibility Filtering: indicates whether a document is eligible for training, based on internal checks in the application and our machine learning logic. It also provides additional information about documents that were excluded from the training set.

  • Training Data Curator: labels each training document as having high or low importance. The importance is calculated by determining which data would best contribute to the model’s performance.

  • Labeling Anomaly Detection for Fields and Tables: identifies potential discrepancies in the training data before you run model training. Once the annotations are ready, you can analyze the data to find inconsistencies and help ensure a top-performing locator model.

  • Search by Text Segment: lets you search for fields or cells by specific text segments directly within the Training Data Management interface, so you can locate, review, and annotate data faster during the model-training process.

Learn how to use these features to maximize the performance of your Identification model in Training a Semi-Structured Model.

TDM for Classification models

TDM for Classification models lets you add, remove, and update the training pages those models use. Learn more in TDM for Classification.

Accessing Training Data Management tools

If you have the View Training Data permission (given to System Admin and Business Admin permission groups by default), you can access the Training Data Management tools for a model. Learn more in Permission Groups.

  1. Go to the Models section. Learn more about the models table in Model Management.

  2. Click the tab for the type of model you want to view:

    • Classification

    • Identification

    • Text Classification

    • Transcription

  3. Click the name of the model whose training data you want to view:

    • For Identification models:

      • Click the Field Identification or the Table Identification tab, depending on the type of training data you would like to view.

      • The Training Data Management tools are located on the Training Data Health card. 

    • For Classification models: 

      • Click on the Training Data tab to edit the documents used for training. 

Continuous Model Training

When you import a model from another environment while Continuous Field Locator model improvement or Continuous Classification model improvement is enabled, the model’s automation rates may decrease. Models learn only from the training data available in their current environment. If the environment you import the model into contains limited or no training data, the imported model may be replaced by a lower-performing version.

To maintain optimal performance:

  • Train models manually after import.

  • Keep Continuous Field Locator model improvement and Continuous Classification model improvement disabled, unless specifically advised otherwise by a Hyperscience representative.
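
The guidance above can be summarized as a simple decision rule. The sketch below is purely illustrative: ImportSettings and post_import_actions are hypothetical names invented for this example and are not part of any Hyperscience API or configuration format. It only models the two recommendations listed in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImportSettings:
    """Hypothetical view of the settings in the environment receiving the model."""
    continuous_field_locator_improvement: bool
    continuous_classification_improvement: bool
    # True only if a Hyperscience representative advised keeping the settings on.
    advised_by_representative: bool = False


def post_import_actions(settings: ImportSettings) -> List[str]:
    """Return the follow-up actions suggested by the guidance in this section."""
    # Manual training after import is recommended in every case.
    actions = ["Train the model manually after import."]

    # Keep the continuous-improvement settings as they are if specifically advised to.
    if settings.advised_by_representative:
        return actions

    if settings.continuous_field_locator_improvement:
        actions.append("Disable Continuous Field Locator model improvement.")
    if settings.continuous_classification_improvement:
        actions.append("Disable Continuous Classification model improvement.")
    return actions


if __name__ == "__main__":
    for action in post_import_actions(ImportSettings(True, False)):
        print(action)
```

The function returns a list of reminders rather than changing anything, because both settings are managed in the application itself.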