To extract meaningful data from documents, it is necessary to identify both context-based and structured information within text. The Entity Recognition Block is a processing block used within a flow to identify and extract this information from your documents. It combines two complementary approaches:
Context-based recognition — identifying entities based on their meaning and context within the text.
Pattern-based detection — a rule-based approach that identifies structured data using predefined patterns and keywords.
By combining both approaches, flows can extract a wider range of information from document text with greater accuracy and control. In this article, you’ll learn how to leverage the Entity Recognition Block for your use case.
Context-based recognition
The Entity Recognition Block uses a context-based approach to identify entities based on their meaning and context within the text. Because the block relies on contextual understanding, it is well-suited for extracting information that does not follow a fixed or predictable format, including:
Names of people
Names of organizations
Addresses
How it works
The block analyzes the surrounding text to determine whether a word or phrase represents a specific type of entity. Instead of relying on predefined patterns, it uses a trained model to classify entities based on context. Use this approach when extracting information that:
does not follow a fixed format,
depends on context for correct interpretation, or
may appear in different forms across documents.
What affects the results
Performance depends on the quality and structure of the input text. Key factors include:
Text quality — errors in transcription (e.g., OCR mistakes) can reduce recognition accuracy. To learn more, see Text Segmentation.
Context availability — entities are identified based on surrounding words, so limited context may reduce accuracy.
Variability in wording — unusual phrasing or formatting can make entity recognition more difficult.
Limitations of context-based recognition
This approach does not rely on predefined patterns, so it may not consistently detect highly structured values. For example, it is not well suited for:
account numbers,
IDs, or
other values that follow a strict format.
In such cases, pattern-based approaches provide more reliable results.
Pattern-based detection
Pattern-based detection is a rule-based approach that identifies structured data using predefined patterns and keywords. It detects entities by:
Matching text against regular expressions (regex).
Optionally validating matches using keywords.
This approach is suitable for extracting information that:
Follows a consistent and predictable format.
Can be defined using patterns (regular expressions).
May require precise control over how values are detected.
Typical use cases include the following:
account numbers
IDs
dates
emails and other formatted values
How it works
Pattern-based detection processes text input (typically from a Transcription block) and applies the configured rules to detect matching values. It supports two main configuration approaches:
Regex-based detection — identifies values based on their format.
Keyword-based detection — narrows down matches using surrounding keywords.
Starting with v43, the block can detect entities that span multiple lines or pages. This improves accuracy for real-world documents, where values may be split across lines or continue onto the next page.
What affects the results
Pattern-based detection performance depends on how well the detection rules are defined and how closely the input text matches those rules.
Key factors include:
Regex accuracy — incorrectly defined patterns may result in missed or incorrect matches.
Keyword configuration — using relevant keywords can improve precision by narrowing down matches.
Input consistency — this approach performs best when the data follows a predictable format.
Error-tolerance settings — allowing variations in pattern matching can increase coverage but may introduce false positives.
Limitations
Pattern-based detection does not interpret meaning and cannot rely on context to identify entities. As a result:
It may detect values that match a pattern but are not relevant.
It requires manual configuration and tuning.
It is less effective for extracting information that varies significantly in wording or structure.
Example
The Entity Recognition Block can extract both context-based and structured values from a customer application form:
[Image: customer application form]
From this document, the Entity Recognition Block identifies:
John Doe → Person name (context-based)
John Doe Inc → Organization (context-based)
123 Example Street, Example City → Address (context-based)
APP-1001 → Application ID (pattern-based)
john.doe@email.com → Email (pattern-based)
+1 415 555 0123 → Phone number (pattern-based)
10 Mar 2026 → Date (pattern-based)
Regex-based detection
The pattern-based approach allows the system to identify values based on their format, regardless of context.
For example, the pattern:
[A-Z]{3}-\d{4}
detects:
“APP-1001” and similar structured IDs that match the defined format
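The pattern above can be tried outside the block with Python's standard `re` module. This is a minimal sketch rather than the block's own implementation: the sample text and the added word boundaries (`\b`) are assumptions made for illustration.

```python
import re

# Pattern from the text: three uppercase letters, a dash, four digits.
# The \b word boundaries are an added assumption so that longer tokens
# are not partially matched.
ID_PATTERN = re.compile(r"\b[A-Z]{3}-\d{4}\b")

# Hypothetical sample text, loosely based on the example form.
sample_text = (
    "Application ID: APP-1001\n"
    "Related case: REF-2047\n"
    "Phone: +1 415 555 0123"
)

matches = ID_PATTERN.findall(sample_text)
print(matches)  # ['APP-1001', 'REF-2047']
```

Both values are returned because the pattern looks only at format: `REF-2047` matches even though it is not an application ID.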
Using predefined patterns
The approach includes predefined patterns and keyword types for common entities such as emails, phone numbers, and identification numbers, reducing the need for custom configuration.
From the same document:
john.doe@email.com → Email
+1 415 555 0123 → Phone number
10 Mar 2026 → Date
These entities can be detected using predefined configurations without defining custom regex patterns.
Unlike context-based recognition, the pattern-based approach does not rely on context and will match any value that fits the defined pattern. This makes it highly effective for structured data, but its results depend on correct configuration.
Handling variations in structured data
Pattern-based detection can be configured to allow small variations in matched values by introducing tolerance in pattern matching. For example, a pattern can be set to match values that are slightly different from the expected format, such as missing characters or minor deviations in structure. Doing so can improve coverage when data is inconsistent but may also increase the risk of incorrect matches.
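The trade-off between coverage and incorrect matches can be illustrated with a hand-relaxed regular expression. This sketch does not reproduce the block's tolerance setting. It assumes one possible form of tolerance (an optional dash and one missing digit), and the sample text is hypothetical.

```python
import re

# Strict format: three uppercase letters, a dash, four digits.
STRICT = re.compile(r"\b[A-Z]{3}-\d{4}\b")

# Hand-relaxed variant (an assumption, not the block's actual tolerance
# mechanism): the dash is optional and one digit may be missing.
TOLERANT = re.compile(r"\b[A-Z]{3}-?\d{3,4}\b")

text = "IDs: APP-1001, APP1002, APP-103, and flight code KLM605."

strict_hits = STRICT.findall(text)
tolerant_hits = TOLERANT.findall(text)

print(strict_hits)    # ['APP-1001']
print(tolerant_hits)  # ['APP-1001', 'APP1002', 'APP-103', 'KLM605']
```

The relaxed pattern recovers the two malformed IDs, but it also picks up `KLM605`, which is not an ID at all.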
Keyword-based detection
In addition to matching values based on their format, the Entity Recognition Block can use keywords to determine the meaning of those values within the document.
From the example:
APP-1001 → Application ID
+1 415 555 0123 → Phone number
10 Mar 2026 → Date
These values are identified based on their structure and their association with nearby keywords, such as Application ID, Phone Number, or Submission Date.
Combining patterns and keywords
The block can combine patterns with keywords to improve precision. For example, a pattern may match multiple values in a document, but keywords such as Account, ID, or Number can help associate the detected value with the correct field.
This combination helps reduce false positives and ensures that detected values are associated with the correct fields.
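One way to picture this combination is to accept a pattern match only when a keyword appears before it on the same line, and to use that keyword to choose the field. The sketch below rests on several assumptions: the field names, the keyword lists, the digit pattern, and the same-line rule are all hypothetical and do not describe the block's internal logic.

```python
import re

# Hypothetical value pattern: any run of six or more digits.
VALUE_PATTERN = re.compile(r"\b\d{6,}\b")

# Hypothetical field-to-keyword mapping, for illustration only.
FIELD_KEYWORDS = {
    "account_number": ["account", "acc"],
    "routing_number": ["routing"],
}

def label_matches(text):
    """Assign each match to the field whose keyword appears closest
    before it on the same line; drop matches with no keyword."""
    labeled = {}
    for line in text.splitlines():
        for match in VALUE_PATTERN.finditer(line):
            before = line[:match.start()].lower()
            best_field, best_pos = None, -1
            for field, keywords in FIELD_KEYWORDS.items():
                for keyword in keywords:
                    # Simple substring search; a real rule would likely
                    # match whole words only.
                    pos = before.rfind(keyword)
                    if pos > best_pos:
                        best_field, best_pos = field, pos
            if best_field is not None:
                labeled.setdefault(best_field, []).append(match.group())
    return labeled

text = (
    "Account Number: 00123456\n"
    "Routing No: 133563585\n"
    "Receipt 99887766 issued."
)
result = label_matches(text)
print(result)
# {'account_number': ['00123456'], 'routing_number': ['133563585']}
```

All three numbers fit the pattern, but the receipt number is discarded because no configured keyword precedes it.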
Regex patterns (core formats)
Learn how to configure the Entity Recognition Block using regular expressions to match specific value formats.
Regex-based configuration
The Entity Recognition Block uses regular expressions (regex) to define patterns for structured values. These patterns match values based on format, regardless of context:
| Parameter | Description | Example |
|---|---|---|
| Key | Label used to group detected values | |
| Regex pattern | Defines the structure of the value to match | |
| Matching behavior | Matches any value that fits the pattern | AB1234 567 |
Keyword-based configuration
| Parameter | Description |
|---|---|
| | Base keywords used to identify a field (e.g., "Account", "Acc") |
| | Expands keywords with variations (e.g., "ID", "Number", "#") |
| | Allows matching root keywords without a suffix |
| | Additional keywords outside the main keyword pattern |
| | Defines the structure of the value to match |
| | Controls whether keyword matching is case-sensitive |
Advanced detection parameters
Entity recognition provides additional configuration options to refine detection behavior and handle more complex document layouts:
| Parameter | Description |
|---|---|
| | Allows small variations in matched values (e.g., missing or extra characters) |
| | Enables matching based on table column headers |
| | Searches for the closest value below a keyword |
| | Allows matching values located above the keyword |
| | Controls whether words between the keyword and the value are allowed |
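To make the "closest value below a keyword" behavior more concrete, the sketch below searches the lines under a keyword and picks the match that is nearest by row, then by column. It only illustrates the idea. The function name `value_below`, the distance rule, and the sample layout are assumptions, not the block's actual algorithm.

```python
import re

def value_below(lines, keyword, pattern):
    """Return the pattern match on a later line that is closest to the
    keyword, comparing row distance first and column distance second."""
    regex = re.compile(pattern)
    for row, line in enumerate(lines):
        col = line.lower().find(keyword.lower())
        if col == -1:
            continue  # keyword is not on this line
        candidates = []
        for below_row in range(row + 1, len(lines)):
            for match in regex.finditer(lines[below_row]):
                distance = (below_row - row, abs(match.start() - col))
                candidates.append((distance, match.group()))
        if candidates:
            return min(candidates)[1]
    return None

# Hypothetical column layout: headers on the first line, values below.
lines = [
    "Application ID    Submission Date",
    "APP-1001          10 Mar 2026",
    "REF-2047          11 Mar 2026",
]

app_id = value_below(lines, "Application ID", r"[A-Z]{3}-\d{4}")
date = value_below(lines, "Submission Date", r"\d{1,2} [A-Z][a-z]{2} \d{4}")
print(app_id, "|", date)  # APP-1001 | 10 Mar 2026
```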
Supported regex patterns (examples)
| Pattern | Description | Example |
|---|---|---|
| | Two letters, four digits, space, three digits | AB1234 567 |
| | Three letters, dash, four digits | APP-1001 |
| | Nine-digit number | 133563585 |
| | Alphanumeric string (four or more characters) | AB12, A123B |
| | EIN format | 10-55583948 |
| | SSN format | 051-54-5373 |
| | Credit card format | 4556-7515-7353-2924 |
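Several of the formats described in the table above can be written as ordinary regular expressions. The patterns below are hypothetical: they were written from the descriptions and checked only against the listed examples, so the block's own pattern definitions may differ. The dictionary keys and the helper `matching_patterns` are likewise illustrative names.

```python
import re

# Hypothetical patterns written from the descriptions above; the
# block's own pattern definitions may differ.
CANDIDATE_PATTERNS = {
    "two_letters_four_digits_space_three_digits": r"[A-Z]{2}\d{4} \d{3}",
    "three_letters_dash_four_digits": r"[A-Z]{3}-\d{4}",
    "nine_digit_number": r"\d{9}",
    "alphanumeric_min_four": r"[A-Za-z0-9]{4,}",
    "ssn_style": r"\d{3}-\d{2}-\d{4}",
    "card_style": r"\d{4}-\d{4}-\d{4}-\d{4}",
}

def matching_patterns(value):
    """Return the names of all candidate patterns that match the
    whole value."""
    return [
        name
        for name, pattern in CANDIDATE_PATTERNS.items()
        if re.fullmatch(pattern, value)
    ]

print(matching_patterns("APP-1001"))
# ['three_letters_dash_four_digits']
print(matching_patterns("133563585"))
# ['nine_digit_number', 'alphanumeric_min_four']
```

The second call shows that one value can satisfy more than one pattern, which is where keywords help decide the field a value belongs to.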
Supported entity types
The block includes predefined patterns and keyword types for common entities, reducing the need for custom configuration.
| Entity type | Description | Example | Notes |
|---|---|---|---|
| Email | Detects standard email formats | john.doe@example.com | |
| Phone number | Detects common phone formats | +1 415 555 0123 | |
| Date | Supports multiple date formats | 10 Mar 2026 | |
| Account number | Matches numeric values with separators | 12345, 123-456 | Common suffixes such as NUMBER, NO, NUM, or # may be used |
| Customer ID | Matches alphanumeric identifiers | AB12 345G | Common suffixes such as NUMBER, NO, NUM, or # may be used. Identifier variations such as ID, I.D, IDENTIFIER, or IDENTIFICATION may be used |
| Employee ID | Matches numeric identifiers | 58391 | Identifier variations such as ID, I.D, IDENTIFIER, or IDENTIFICATION may be used |
| Employer ID | Matches EIN-style identifiers | 10-55583948 | Identifier variations such as ID, I.D, IDENTIFIER, or IDENTIFICATION may be used |
| Passport number | Matches alphanumeric passport formats | 123456A37B | |
| Credit card number | Matches standard card formats | 4556-7515-7353-2924 | |
| Routing number | Matches 9-digit numbers | 133563585 | Common suffixes such as NUMBER, NO, NUM, or # may be used |
| Application number | Matches alphanumeric values with separators | AA69994B22 | Matches alphanumeric values (min length 4), allowing spaces and dashes |
| File number | Matches alphanumeric values with separators | AA-69994-B22 | Matches alphanumeric values (min length 4), allowing spaces and dashes |