The quality of your training data directly affects how well your Identification model performs. Before training a model, prepare a dataset that accurately represents the documents you expect to process in production.
Your first step is to create a Semi-structured Layout so that you can upload documents in Training Data Management (TDM), but the quality, diversity, and consistency of the documents you upload have the greatest impact on model performance.
In this step, you'll prepare a representative dataset that covers the document patterns, formats, and edge cases your model is expected to process.
Well-prepared dataset
A well-prepared dataset improves model performance and helps avoid common issues later in the process, such as poor extraction accuracy or repeated retraining.
In this article, you’ll learn:
how to select a representative set of documents for training
what makes a high-quality training dataset
how to handle different document patterns and edge cases
when to separate documents into different models
what to exclude from your training data.
Choose representative documents
Selecting the right documents is the most important part of preparing your training data. A representative dataset allows the model to learn patterns that reflect the documents it will process in production.
Identify your document types
Start by identifying the document types you need to process (e.g., invoices, paystubs, claim forms). For each type:
Determine the main document patterns (e.g., structures, fields, tables).
Ensure that the fields or tables you need to extract are present in these documents.
Number of documents in your dataset
To start model training, you need at least 100 documents.
For reliable performance, we recommend using at least 500 training documents, distributed across all document types.
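The two thresholds above can be checked before you upload anything. The following is a minimal sketch in Python; the function name and the status labels are hypothetical and are not part of the product.

```python
MIN_DOCUMENTS = 100          # minimum needed to start model training
RECOMMENDED_DOCUMENTS = 500  # recommended for reliable performance


def dataset_size_status(document_count: int) -> str:
    """Classify a dataset size against the two thresholds above."""
    if document_count < MIN_DOCUMENTS:
        return "too_small"
    if document_count < RECOMMENDED_DOCUMENTS:
        return "trainable"
    return "recommended"


print(dataset_size_status(80))   # too_small
print(dataset_size_status(250))  # trainable
print(dataset_size_status(500))  # recommended
```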
Document distribution
Your training data should reflect the real distribution of documents you expect to process.
Include examples from the document types you expect to process with this model
For Field Identification models: include at least 15 documents per type.
For Table Identification models: include at least 20 documents per type.
If your use case includes multiple formats (e.g., different vendors or suppliers), include enough examples for each.
For example, if you process invoices from multiple vendors, include at least 15 invoices per vendor for a Field Identification model, or at least 20 per vendor for a Table Identification model.
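If you keep a simple list of which type or vendor each document belongs to, the per-type minimums can be checked with a few lines of Python. This is an illustrative sketch only: the label list, the vendor names, and the function name are assumptions made for the example.

```python
from collections import Counter

# Minimum documents per type (or per vendor/format) from the guidance above.
MIN_PER_TYPE = {"field": 15, "table": 20}


def underrepresented_types(doc_types, model_kind):
    """Return {type: count} for every type below the minimum for this model kind."""
    minimum = MIN_PER_TYPE[model_kind]
    counts = Counter(doc_types)
    return {doc_type: n for doc_type, n in counts.items() if n < minimum}


# Hypothetical manifest: one vendor label per training document.
labels = ["vendor_a"] * 40 + ["vendor_b"] * 18 + ["vendor_c"] * 9

print(underrepresented_types(labels, "field"))  # {'vendor_c': 9}
print(underrepresented_types(labels, "table"))  # {'vendor_b': 18, 'vendor_c': 9}
```

Any type reported here needs more examples before training, or a decision about whether it belongs in this model at all.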
Ensure documents contain the required data
Your documents must support the data you want to extract.
Ensure that every field you want to extract appears in your dataset.
Include enough examples of each field across documents.
Ensure that your training data reflects how fields appear in production documents.
If documents with missing information are expected in production, include representative examples in your training dataset. This helps the model learn patterns that reflect production data.
For example, if some invoices are received without a purchase order number, include representative examples of those invoices in your training data.
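One way to review field coverage is to count how many documents contain each field you plan to extract. The sketch below assumes you have recorded, for each document, the set of fields that appear in it; the file names and field names are hypothetical.

```python
def field_coverage(documents, required_fields):
    """Count how many documents contain each required field.

    `documents` maps a document name to the set of fields present in it.
    """
    return {
        field: sum(field in present for present in documents.values())
        for field in required_fields
    }


# Hypothetical review notes: which fields appear in each invoice.
docs = {
    "inv_001.pdf": {"invoice_number", "total", "po_number"},
    "inv_002.pdf": {"invoice_number", "total"},
    "inv_003.pdf": {"invoice_number", "total"},
}

coverage = field_coverage(docs, ["invoice_number", "total", "po_number", "due_date"])
print(coverage)
# {'invoice_number': 3, 'total': 3, 'po_number': 1, 'due_date': 0}

never_present = [field for field, n in coverage.items() if n == 0]
print(never_present)  # ['due_date']
```

A field with a count of zero cannot be learned from this dataset. A field with a low count, such as the purchase order number here, may be fine if that reflects how often it is missing in production.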
Each file should contain only one document from the intended use case.
Do not include documents that mix different document types (e.g., invoices with packing lists or purchase orders).
For multi-page documents, ensure all pages are included. The order of pages in your training data should reflect how documents are received in production.
Model training
The model can only learn from the data available in your documents. Ensure important data is consistently present and readable across documents.
Ensure dataset diversity
Your training dataset should reflect the range of document patterns your model will encounter in production. Including diverse examples helps the model generalize and perform consistently across different document formats.
Generalized model
A well-generalized model learns patterns from a diverse training dataset and can accurately process a wider variety of documents it has not seen before.
Include diverse document patterns
Documents of the same type may differ in how they present information. That’s why it is important to include documents where:
fields appear in different locations
sections are arranged differently
tables vary in size.
Document representation
Identification models are designed to handle variability, but only if it is represented in the training data.
Prioritize patterns that represent the majority of your production volume.
Avoid over-focusing on rare or edge-case documents.
Capture differences in labels and formatting
The same information may appear under different names or formats.
Include documents with patterns such as:
different field labels (e.g., “Invoice No” vs “Invoice ID”)
different date or number formats.
Formatting and label differences
This helps the model recognize the same data across different representations.
Handle edge cases
Not all documents should be treated equally. Some documents differ significantly from the main dataset and can negatively impact model performance if included without consideration.
For example, if most documents follow a standard invoice layout but a small subset uses a significantly different pattern, consider whether those documents should be handled by a separate model. Learn more in Model Mitigation and Saturation.
Identify edge cases
Review your dataset and look for documents that do not follow the common structure. Edge cases may include:
documents with completely different layouts
documents where key fields appear in unusual locations
documents with missing or inconsistent structure.
Edge cases
These documents do not represent the typical patterns your model should learn.
Decide whether to include or exclude them
Once you have identified the edge cases, decide how to handle each one:
Prioritize document types that best represent your production data.
Exclude edge cases that are rare or not relevant to your use case.
Model performance
Including irrelevant edge cases can introduce noise and reduce model performance.
Avoid mixing incompatible patterns
Identification models can learn document patterns. However, documents that differ significantly from the main dataset may require a different approach. For example, if most documents in your dataset are invoices, but a small subset consists of insurance claims, consider handling those documents separately. Learn more in Model Mitigation and Saturation.
Clean your dataset
Before using your documents for training, remove any data that could negatively impact model performance. Low-quality or irrelevant documents can introduce noise and reduce the model’s ability to learn consistent patterns.
Image correction
If Image correction is enabled when uploading documents in TDM, documents that are upside down or rotated by 90 degrees are automatically corrected during processing. You do not need to manually rotate these documents before adding them to your dataset.
Remove low-quality documents
Exclude documents that are difficult to read or poorly formatted, such as:
distorted or skewed scans
pixelated or low-resolution images
documents with heavy noise or artifacts.
Poor-quality documents
Poor-quality documents can lead to inaccurate annotations and unreliable model behavior.
Remove duplicates
Avoid including identical documents, such as multiple copies of the same file or scan.
Learning value
Duplicates do not add learning value and can reduce the model's ability to generalize to different document patterns.
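Exact duplicates can be found by hashing file contents. The following sketch uses SHA-256 from the Python standard library and works on file contents held in memory; the file names and bytes are made up for the example. In practice you would read each file's bytes from disk first.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(files):
    """Group file names whose contents are byte-for-byte identical.

    `files` maps a file name to its raw bytes.
    """
    by_digest = defaultdict(list)
    for name, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical file contents for illustration.
sample = {
    "invoice_a.pdf": b"%PDF-1.7 invoice A",
    "invoice_a_copy.pdf": b"%PDF-1.7 invoice A",
    "invoice_b.pdf": b"%PDF-1.7 invoice B",
}

print(find_exact_duplicates(sample))
# [['invoice_a.pdf', 'invoice_a_copy.pdf']]
```

This approach only catches byte-identical files. The same page scanned twice produces different bytes and still needs a visual review.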
Exclude irrelevant documents
Remove documents that do not match your use case:
documents containing unrelated information
documents outside the intended document type.
Irrelevant data
Irrelevant data introduces noise and reduces overall model performance.
Prepare a testing set
Before training your model, set aside a portion of your documents for evaluation. A testing set allows you to measure model performance accurately after training.
Reserve documents for testing
Select a subset of your dataset to use only for evaluation. Set aside 50–100 documents for testing:
Do not include these documents in your training set.
Ensure the testing documents reflect your production data.
Include the edge cases you expect to see in production.
Testing set representation
As a guideline, your testing set should be approximately 20% of the size of your training dataset.
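A reproducible way to reserve testing documents is to sample within each document type, so that every type appears in both sets. The sketch below is one possible approach, not a product feature. As a simplification it takes the fraction from all collected documents, and the function name, file names, and default values are assumptions; adjust the fraction to match your own target.

```python
import random
from collections import defaultdict


def reserve_test_set(doc_types, test_fraction=0.2, seed=0):
    """Split document names into (train, test), sampling within each type.

    `doc_types` maps a document name to its document type.
    """
    by_type = defaultdict(list)
    for name, doc_type in doc_types.items():
        by_type[doc_type].append(name)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for names in by_type.values():
        names = sorted(names)
        rng.shuffle(names)
        # Reserve at least one test document per type, unless the type has
        # only a single document, which then stays in the training set.
        n_test = max(1, round(len(names) * test_fraction)) if len(names) > 1 else 0
        test.extend(names[:n_test])
        train.extend(names[n_test:])
    return train, test


# Hypothetical dataset: 40 invoices and 10 paystubs.
doc_types = {f"invoice_{i:03d}.pdf": "invoice" for i in range(40)}
doc_types.update({f"paystub_{i:03d}.pdf": "paystub" for i in range(10)})

train, test = reserve_test_set(doc_types)
print(len(train), len(test))  # 40 10
```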
Keep training and testing data separate
Avoid using the same documents for both training and evaluation.
Misleading results
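Using the same documents for both training and testing leads to misleading results. The model has already seen the testing documents, so its measured performance looks better than it will be in production, and real false positives and false negatives can go unnoticed.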
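Before training, you can confirm that no testing document also appears in the training set, even under a different file name. This sketch compares content hashes and assumes file contents are available as bytes; the function name and sample data are hypothetical.

```python
import hashlib


def leaked_test_documents(train_files, test_files):
    """Return names of test files whose bytes also appear in the training set.

    Both arguments map a file name to its raw bytes.
    """
    train_digests = {
        hashlib.sha256(content).hexdigest() for content in train_files.values()
    }
    return sorted(
        name
        for name, content in test_files.items()
        if hashlib.sha256(content).hexdigest() in train_digests
    )


# Hypothetical contents: one test file is a renamed copy of a training file.
train_files = {"inv_001.pdf": b"invoice 1", "inv_002.pdf": b"invoice 2"}
test_files = {"eval_a.pdf": b"invoice 2", "eval_b.pdf": b"invoice 3"}

print(leaked_test_documents(train_files, test_files))  # ['eval_a.pdf']
```

An empty list means that no byte-identical document is shared between the two sets.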
Ensure representative coverage
Your testing set should mirror the variability of your dataset:
Include all relevant document types.
Capture the pattern diversity in your documents.
A representative testing set ensures that your evaluation reflects real-world performance.
Common issues
Using too few documents
Training with a small dataset limits the model’s ability to learn patterns and generalize to new documents.
Make sure you meet the minimum requirement (100 documents), and aim for at least 500 well-distributed documents for reliable performance.
Unbalanced representation of document types
Avoid over-representing a single document type, vendor, or format.
Once a document type is sufficiently represented, adding more similar documents is unlikely to improve model performance and may increase annotation effort.
Ensure each document type is sufficiently represented in your dataset.
Mixing incompatible layouts
Combining documents with significantly different layouts in a single model can confuse the model and reduce accuracy.
Training separate models
If document structures vary too much, consider training separate models. To learn more, see Model Mitigation and Saturation.
Consider separate models when:
fields consistently appear in different sections
layouts follow unrelated structures
annotations require different logic.
Ignoring layout patterns and formats
Training on a limited set of formats can lead to poor performance when new patterns or formats are introduced.
Include documents with different layouts, field placements, and labels.
Including low-quality or noisy documents
Distorted, skewed, or low-resolution documents can negatively impact both annotation and training.
Remove documents that are difficult to read or contain visual noise.
Including irrelevant or incomplete documents
Documents that are missing key fields or contain unrelated information can introduce noise into the dataset.
Only include documents that match your use case and contain the required data.