Preparing training data

The quality of your Training Data directly affects how well your Identification model performs. Before training a model, prepare a Dataset that accurately represents the documents you expect to process in production.

While you'll first create a Semi-structured Layout to upload documents in Training Data Management (TDM), the quality, diversity, and consistency of your training data have the greatest impact on model performance.

In this step, you'll prepare a representative dataset that reflects the document patterns, formats, and edge cases your model is expected to process.

Well-prepared dataset
A well-prepared dataset improves model performance and helps avoid common issues later in the process, such as poor extraction Accuracy or repeated retraining.

In this article, you’ll learn:

how to select a representative set of documents for training
what makes a high-quality training dataset
how to handle different document patterns and edge cases
when to separate documents into different models
what to exclude from your training data.

Choose representative documents

Selecting the right documents is the most important part of preparing your training data. A representative dataset allows the model to learn patterns that reflect the documents it will process in production.

Identify your document types

Start by identifying the document types (e.g., invoices, paystubs, claim forms):

Determine the main document patterns (e.g., structures, fields, tables).
Ensure that the fields or tables you need to extract are present in these documents.

Number of documents in your dataset
To train a model, you need at least 100 documents.
For reliable performance, we recommend using at least 500 training documents, distributed across all document types.

Document distribution

Your training data should reflect the real distribution of documents you expect to process.
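As a rough illustration, the sketch below compares the share of each document type in a dataset with the share you expect in production. The type labels and the expected shares are hypothetical inputs you would supply yourself; nothing here is a TDM feature.

```python
from collections import Counter


def share_gaps(dataset_types, production_shares):
    """Return (dataset share - expected production share) for each type.

    dataset_types: one type label per document in the dataset (hypothetical labels).
    production_shares: mapping of type label -> expected fraction in production.
    """
    if not dataset_types:
        raise ValueError("dataset_types is empty")
    counts = Counter(dataset_types)
    total = len(dataset_types)
    labels = sorted(set(counts) | set(production_shares))
    return {
        label: counts.get(label, 0) / total - production_shares.get(label, 0.0)
        for label in labels
    }


if __name__ == "__main__":
    dataset = ["invoice"] * 300 + ["paystub"] * 100
    expected = {"invoice": 0.5, "paystub": 0.5}
    for label, gap in share_gaps(dataset, expected).items():
        # A positive gap means the type is over-represented in the dataset.
        print(f"{label}: {gap:+.2f}")
```

A large positive or negative gap suggests the dataset over- or under-represents that type compared with what you expect to process.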

Include examples from each document type you expect to process with this model:

For Field Identification models: include at least 15 documents per type.
For Table Identification models: include at least 20 documents per type.
If your use case includes multiple formats (e.g., different Vendors/suppliers), include enough examples for each.
- For example, if you process invoices from multiple vendors, include at least 15 invoices per vendor for a Field Identification model, or at least 20 per vendor for a Table Identification model.
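The thresholds above can be checked with a short script before you upload anything. This is a minimal sketch, assuming you keep one type label per document (for example, a document type combined with a vendor name); the function name, the label format, and the "field"/"table" keys are illustrative and not part of the product.

```python
from collections import Counter

# Thresholds taken from this article.
MIN_TOTAL = 100          # minimum needed to train a model
RECOMMENDED_TOTAL = 500  # recommended for reliable performance
MIN_PER_TYPE = {"field": 15, "table": 20}  # hypothetical keys for the two model kinds


def check_dataset_counts(doc_types, model_kind="field"):
    """Return a list of warnings for an inventory of document type labels."""
    warnings = []
    total = len(doc_types)
    if total < MIN_TOTAL:
        warnings.append(f"only {total} documents; at least {MIN_TOTAL} are required")
    elif total < RECOMMENDED_TOTAL:
        warnings.append(f"{total} documents; {RECOMMENDED_TOTAL} or more are recommended")

    per_type_min = MIN_PER_TYPE[model_kind]
    for label, count in sorted(Counter(doc_types).items()):
        if count < per_type_min:
            warnings.append(f"{label}: {count} documents; at least {per_type_min} are needed")
    return warnings


if __name__ == "__main__":
    inventory = ["invoice/vendor_a"] * 90 + ["invoice/vendor_b"] * 12
    for line in check_dataset_counts(inventory, model_kind="field"):
        print(line)
```

An empty list means the inventory meets both the overall and the per-type counts described in this section.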

Ensure documents contain the required data

Your documents must support the data you want to extract.

Ensure that every field you want to extract appears in your dataset.
Include enough examples of each field across documents.
- Ensure that your training data reflects how fields appear in production documents.
If documents with missing information are expected in production, include representative examples in your training dataset. This helps the model learn patterns that reflect production data.
- For example, if some invoices are received without a Purchase order number, include representative examples of those invoices in your training data.
Each file should contain only one document from the intended use case.
Do not include documents that mix different document types (e.g., invoices with packing lists or purchase orders).
For multi-page documents, ensure all pages are included. The order of pages in your training data should reflect how documents are received in production.

Model training
The model can only learn from the data available in your documents. Ensure important data is consistently present and readable across documents.
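One way to check that each field is present often enough is to count how many documents contain it. The sketch below assumes you have a simple record of which fields appear in which document; that mapping and the field names are hypothetical and would come from your own review of the documents, not from TDM.

```python
from collections import Counter


def field_coverage(annotations, required_fields):
    """Count how many documents contain each required field.

    annotations: mapping of document name -> set of field names present
                 (a hypothetical structure; adapt it to your own records).
    """
    counts = Counter()
    for fields in annotations.values():
        counts.update(set(fields) & set(required_fields))
    # Report every required field, including those that never appear.
    return {name: counts.get(name, 0) for name in required_fields}


if __name__ == "__main__":
    docs = {
        "inv_001.pdf": {"invoice_number", "total", "po_number"},
        "inv_002.pdf": {"invoice_number", "total"},
        "inv_003.pdf": {"invoice_number"},
    }
    required = ["invoice_number", "total", "po_number", "due_date"]
    for name, count in field_coverage(docs, required).items():
        print(f"{name}: {count} of {len(docs)} documents")
```

A field with a count of zero cannot be learned from this dataset, and a field with a very low count is a signal to collect more examples.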

Ensure dataset diversity

Your training dataset should reflect the range of document patterns your model will encounter in production. Including diverse examples helps the model generalize and perform consistently across different document formats.

Generalized model
A well-generalized model learns patterns from a diverse training dataset and can accurately process a wider variety of documents it has not seen before.

Include diverse document patterns

Documents of the same type may differ in how they present information. That’s why it is important to include documents where:

fields appear in different locations
sections are arranged differently
tables vary in size.

Document representation
Identification models are designed to handle variability, but only if it is represented in the training data.
Prioritize patterns that represent the majority of your production volume.
Avoid over-focusing on rare or edge-case documents.

Capture differences in labels and formatting

The same information may appear under different names or formats.

Include documents with patterns such as:

different field labels (e.g., “Invoice No” vs “Invoice ID”)
different date or number formats

Formatting and label differences
This helps the model recognize the same data across different representations.

Handle edge cases

Not all documents should be treated equally. Some documents differ significantly from the main dataset and can negatively impact model performance if included without consideration.

For example, if most documents follow a standard invoice layout but a small subset uses a significantly different pattern, consider whether those documents should be handled by a separate model. Learn more in Model Mitigation and Saturation.

Identify edge cases

Review your dataset and look for documents that do not follow the common structure. Edge cases may include:

documents with completely different layouts
documents where key fields appear in unusual locations
documents with missing or inconsistent structure

Edge cases
These documents do not represent the typical patterns your model should learn.

Decide whether to include or exclude them

Once identified, decide how to handle each edge case.

Prioritize document types that best represent your production data.
Exclude edge cases that are rare or not relevant to your use case.

Model performance
Including irrelevant edge cases can introduce noise and reduce model performance.

Avoid mixing incompatible patterns

Identification models can learn document patterns. However, documents that differ significantly from the main dataset may require a different approach. For example, if most documents in your dataset are invoices, but a small subset consists of insurance claims, consider handling those documents separately. Learn more in Model Mitigation and Saturation.

Clean your dataset

Before using your documents for training, remove any data that could negatively impact model performance. Low-quality or irrelevant documents can introduce noise and reduce the model’s ability to learn consistent patterns.

Image correction
If Image correction is enabled when uploading documents in TDM, documents that are upside down or rotated by 90 degrees are automatically corrected during processing. You do not need to manually rotate these documents before adding them to your dataset.

Remove low-quality documents

Exclude documents that are difficult to read or poorly formatted, such as:

distorted or skewed scans
pixelated or low-resolution images
documents with heavy noise or artifacts.

Poor quality documents
Poor-quality documents can lead to inaccurate annotations and unreliable model behavior.

Remove duplicates

Avoid including identical documents, such as repeated samples of the same document.

Learning value
Duplicates do not add learning value and can reduce the model's ability to generalize to different document patterns.
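Exact copies can be found by hashing file contents. The following is a minimal sketch that assumes your documents sit in a local folder before upload; it only catches byte-identical files, so two separate scans of the same page will not be flagged.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(folder):
    """Group files under `folder` whose bytes are identical."""
    by_digest = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only the groups that contain more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Calling `find_exact_duplicates("training_docs")` (a hypothetical folder name) returns a list of groups; keeping one file from each group removes the exact duplicates.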

Exclude irrelevant documents

Remove documents that do not match your use case:

documents containing unrelated information
documents outside the intended document type

Irrelevant data
Irrelevant data introduces noise and reduces overall model performance.

Prepare a testing set

Before training your model, set aside a portion of your documents for evaluation. A testing set allows you to measure model performance accurately after training.

Reserve documents for testing

Select a subset of your dataset to use only for evaluation. Set aside 50–100 documents for testing:

Do not include these documents in your training set.
Ensure the testing documents reflect your production data.
Include edge cases.

Testing set representation
As a guideline, your testing set should represent approximately 20% of your training dataset.
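A reproducible way to reserve testing documents is to hold out a fixed fraction of each document type. The sketch below reads the 20% guideline as a share of each type, which is an assumption; the function name and the (name, type) input format are also illustrative.

```python
import random
from collections import defaultdict


def split_by_type(documents, test_fraction=0.2, seed=0):
    """Split (name, doc_type) pairs into training and testing lists.

    Each document type contributes roughly `test_fraction` of its documents
    to the testing set, so both sets cover the same types.
    """
    rng = random.Random(seed)  # a fixed seed keeps the split reproducible
    by_type = defaultdict(list)
    for name, doc_type in documents:
        by_type[doc_type].append(name)

    train, test = [], []
    for doc_type in sorted(by_type):
        names = sorted(by_type[doc_type])
        rng.shuffle(names)
        n_test = round(len(names) * test_fraction)
        if len(names) > 1:
            n_test = max(1, n_test)  # hold out at least one document per type
        test.extend(names[:n_test])
        train.extend(names[n_test:])
    return train, test


if __name__ == "__main__":
    docs = [(f"invoice_{i}.pdf", "invoice") for i in range(100)]
    docs += [(f"claim_{i}.pdf", "claim") for i in range(10)]
    train, test = split_by_type(docs)
    print(len(train), "training documents,", len(test), "testing documents")
```

Because the split is made per type, no document appears in both lists, and rare types are still represented in the testing set.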

Keep training and testing data separate

Avoid using the same documents for both training and evaluation.

Misleading results
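Using the same data for both training and testing leads to misleading results: the model is evaluated on documents it has already seen, so the evaluation can overstate performance and hide False Positives and False Negatives that would occur on new documents.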

Ensure representative coverage

Your testing set should mirror the variability of your dataset:

Include all relevant document types.
Capture the pattern diversity in your documents.

A representative testing set ensures that your evaluation reflects real-world performance.

Common issues

Using too few documents

Training with a small dataset limits the model’s ability to learn patterns and generalize to new documents.

Make sure you meet the minimum requirement (100 documents) and aim for a larger, well-distributed dataset for better performance.

Unbalanced representation of document types

Avoid over-representing a single document type, vendor, or format.

Once a document type is sufficiently represented, adding more similar documents is unlikely to improve model performance and may increase annotation effort.

Ensure each document type is sufficiently represented in your dataset.
Combining documents with significantly different layouts in a single model can confuse the model and reduce accuracy.

Training separate models
If document structures vary too much, consider training separate models. To learn more, see Model Mitigation and Saturation.

Consider separate models when:

fields consistently appear in different sections
layouts follow unrelated structures
annotations require different logic.

Ignoring layout patterns/formats

Training on a limited set of formats can lead to poor performance when new patterns/formats are introduced.

Include documents with different layouts, field placements, and labels.

Including low-quality or noisy documents

Distorted, skewed, or low-resolution documents can negatively impact both annotation and training.

Remove documents that are difficult to read or contain visual noise.

Including irrelevant or incomplete documents

Documents that contain unrelated information, or that are missing key fields in ways you do not expect to see in production, can introduce noise into the dataset.

Only include documents that match your use case and contain the required data.
