Entity Recognition Block

To extract meaningful data from documents, it is necessary to identify both context-based and structured information within text. The Entity Recognition Block is a processing block used within a flow to identify and extract this information from your documents. It combines two complementary approaches:

Context-based recognition — identifying entities based on their meaning and context within the text.
Pattern-based detection — a rule-based approach that identifies structured data using predefined patterns and keywords.

By combining both approaches, flows can extract a wider range of information from document text with greater accuracy and control. In this article, you’ll learn how to leverage the Entity Recognition Block for your use case.

Context-based recognition

The Entity Recognition Block uses a context-based approach to identify entities based on their meaning and context within the text. Because the block relies on contextual understanding, it is well-suited for extracting information that does not follow a fixed or predictable format, including:

Names of people
Names of organizations
Addresses

How it works

The block analyzes the surrounding text to determine whether a word or phrase represents a specific type of entity. Instead of relying on predefined patterns, it uses a trained model to classify entities based on context. Use this approach when extracting information that:

does not follow a fixed format,
depends on context for correct interpretation, or
may appear in different forms across documents.

What affects the results

Performance depends on the quality and structure of the input text. Key factors include:

Text quality — errors in transcription (e.g., OCR mistakes) can impact recognition. To learn more, see Text Segmentation.
Context availability — entities are identified based on surrounding words, so limited context may reduce accuracy.
Variability in wording — unusual phrasing or formatting can make entity recognition more difficult.

Limitations of context-based recognition

This approach does not rely on predefined patterns, so it may not consistently detect highly structured values. For example, it is not well suited for:

account numbers,
IDs, or
other values that follow a strict format.

In such cases, pattern-based approaches provide more reliable results.

Pattern-based detection

Pattern-based detection is a rule-based approach that identifies structured data using predefined patterns and keywords. It detects entities by:

Matching text against regular expressions (regex).
Optionally validating matches using keywords.

This approach is suitable for extracting information that:

Follows a consistent and predictable format.
Can be defined using patterns (regular expressions).
May require precise control over how values are detected.

Typical use cases include the following:

account numbers
IDs
dates
emails and other formatted values

How it works

Pattern-based detection processes text input (typically from a Transcription block) and applies the configured rules to detect matching values. It supports two main configuration approaches:

Regex-based detection — identifies values based on their format.
Keyword-based detection — narrows down matches using surrounding keywords.

Starting with v43, the block can detect entities that span multiple lines or pages. This improves accuracy for real-world documents, where values may be split across lines or continue onto the next page.
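To see why values split across lines need special handling, the following minimal Python sketch compares searching each line separately with searching the joined text. It does not describe how the block handles this internally; the function names and the sample lines are hypothetical, and only the pattern is taken from this article.

```python
import re

# Pattern from this article: two letters, four digits, whitespace, three digits.
ACCOUNT = re.compile(r"[A-Z]{2}\d{4}\s\d{3}")

def find_per_line(lines: list[str]) -> list[str]:
    """Search each line on its own; a value split by a line break is missed."""
    return [m.group() for line in lines for m in ACCOUNT.finditer(line)]

def find_across_lines(lines: list[str]) -> list[str]:
    """Search the joined text; \\s also matches the newline between lines."""
    joined = "\n".join(lines)
    # Collapse the line break inside each reported value into a single space.
    return [" ".join(m.group().split()) for m in ACCOUNT.finditer(joined)]

# Hypothetical transcription in which the value wraps onto the next line.
lines = ["Account ID: AB1234", "567 (primary)"]
print(find_per_line(lines))      # []
print(find_across_lines(lines))  # ['AB1234 567']
```

In this sketch the pattern itself is unchanged; only the unit of text being searched differs.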

What affects the results

Pattern-based detection performance depends on how well the detection rules are defined and how closely the input text matches those rules.

Key factors include:

Regex accuracy — incorrectly defined patterns may result in missed or incorrect matches.
Keyword configuration — using relevant keywords can improve precision by narrowing down matches.
Input consistency — this approach performs best when the data follows a predictable format.
Error-tolerance settings — allowing variations in pattern matching can increase coverage but may introduce false positives.

Limitations

Pattern-based detection does not interpret meaning and cannot rely on context to identify entities. As a result:

It may detect values that match a pattern but are not relevant.
It requires manual configuration and tuning.
It is less effective for extracting information that varies significantly in wording or structure.

Example

The Entity Recognition Block can extract both context-based and structured values from a customer application form:

From this document, the Entity Recognition Block identifies:
- John Doe → Person name (context-based)
- John Doe Inc → Organization (context-based)
- 123 Example Street, Example City → Address (context-based)
- APP-1001 → Application ID (pattern-based)
- john.doe@email.com → Email (pattern-based)
- +1 415 555 0123 → Phone number (pattern-based)
- 10 Mar 2026 → Date (pattern-based)

Regex-based detection

The pattern-based approach allows the system to identify values based on their format, regardless of context.

For example, the pattern `[A-Z]{3}-\d{4}` detects “APP-1001” and any other ID that matches the same format.
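The pattern can be tried out with Python's standard `re` module. This is a minimal sketch rather than the block's implementation: the `\b` word boundaries, the `find_ids` helper, and the sample sentence are additions made for illustration.

```python
import re

# Pattern from the text: three uppercase letters, a dash, four digits.
# The \b word boundaries are an addition so that longer tokens are not
# matched partially.
APPLICATION_ID = re.compile(r"\b[A-Z]{3}-\d{4}\b")

def find_ids(text: str) -> list[str]:
    """Return every substring that fits the ID format, in document order."""
    return APPLICATION_ID.findall(text)

sample = "Application ID: APP-1001. Reference REF-2044 was issued on 10 Mar 2026."
print(find_ids(sample))  # ['APP-1001', 'REF-2044']
```

Both values are returned because the pattern checks format only. Deciding which of them is the application ID requires keywords or other context.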

Using predefined patterns

The pattern-based approach includes predefined patterns and keyword types for common entities such as emails, phone numbers, and identification numbers, reducing the need for custom configuration.

From the same document:

john.doe@email.com → Email
+1 415 555 0123 → Phone number
10 Mar 2026 → Date

These entities can be detected using predefined configurations without defining custom regex patterns.

Unlike context-based recognition, the pattern-based approach does not rely on context and matches any value that fits the defined pattern. This makes it highly effective for structured data, but its results depend on correct configuration.

Handling variations in structured data

Pattern-based detection can be configured to allow small variations in matched values by introducing tolerance in pattern matching. For example, a pattern can be set to match values that are slightly different from the expected format, such as missing characters or minor deviations in structure. Doing so can improve coverage when data is inconsistent but may also increase the risk of incorrect matches.
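The trade-off can be illustrated with a small Python sketch. It is not the block's matching algorithm: it assumes, for illustration only, that tolerance is measured as the edit distance between the "shape" of a candidate (letters mapped to `A`, digits to `9`) and the expected shape. The helper names and the candidate values are hypothetical.

```python
def shape(token: str) -> str:
    """Reduce a token to its format: letters -> 'A', digits -> '9', others kept."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch
        for ch in token
    )

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

def matches_with_tolerance(token: str, expected_shape: str, tolerance: int) -> bool:
    """Accept a token whose shape is within `tolerance` edits of the expected one."""
    return edit_distance(shape(token), expected_shape) <= tolerance

EXPECTED = "AAA-9999"  # shape of values matching [A-Z]{3}-\d{4}
for candidate in ["APP-1001", "APP-101", "APP1001", "AP-10"]:
    print(candidate, matches_with_tolerance(candidate, EXPECTED, tolerance=1))
# APP-1001 True
# APP-101 True
# APP1001 True
# AP-10 False
```

With a tolerance of 0, only `APP-1001` is accepted. Raising the tolerance to 1 recovers the values with a missing digit or a missing dash, but it would equally accept an unrelated token of similar shape, which is where the extra false positives come from.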

Keyword-based detection

In addition to matching values based on their format, the Entity Recognition Block can use keywords to determine the meaning of those values within the document.

From the example:

APP-1001 → Application ID
+1 415 555 0123 → Phone number
10 Mar 2026 → Date

These values are identified based on their structure and their association with nearby keywords, such as Application ID, Phone Number, or Submission Date.

Combining patterns and keywords

The block can combine patterns with keywords to improve precision. For example, a pattern may match multiple values in a document, but keywords such as Account, ID, or Number can help associate the detected value with the correct field.

This combination helps reduce false positives and ensures that detected values are associated with the correct fields.
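As a rough illustration of this combination, the Python sketch below keeps a matched value only when a field keyword precedes it on the same line. The keyword sets, the same-line rule, and the `assign_fields` helper are assumptions made for this example and are not the block's documented behavior.

```python
import re

# One pattern that fits several values in the document.
VALUE = re.compile(r"\b[A-Z]{3}-\d{4}\b")

# Hypothetical keyword sets, one per target field.
FIELD_KEYWORDS = {
    "application id": re.compile(r"application\s+(?:id|number|#)", re.IGNORECASE),
    "account id": re.compile(r"account\s+(?:id|number|#)", re.IGNORECASE),
}

def assign_fields(text: str) -> dict[str, list[str]]:
    """Attach each matched value to the nearest keyword before it on the same line."""
    results: dict[str, list[str]] = {}
    for line in text.splitlines():
        for match in VALUE.finditer(line):
            preceding = line[:match.start()]
            best_field, best_end = None, -1
            for field, keyword in FIELD_KEYWORDS.items():
                for hit in keyword.finditer(preceding):
                    if hit.end() > best_end:  # keyword ending closest to the value
                        best_field, best_end = field, hit.end()
            if best_field is not None:  # values without a keyword are dropped
                results.setdefault(best_field, []).append(match.group())
    return results

sample = (
    "Application ID: APP-1001\n"
    "Account Number: ACC-2044\n"
    "See form REF-3000 for details."
)
print(assign_fields(sample))
# {'application id': ['APP-1001'], 'account id': ['ACC-2044']}
```

`REF-3000` fits the pattern but has no keyword next to it, so it is not reported. This is the kind of false positive that keywords help to filter out.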

Regex patterns (core formats)

Learn how to configure the Entity Recognition Block using regular expressions to match specific value formats.

Regex-based configuration

The Entity Recognition Block uses regular expressions (regex) to define patterns for structured values. These patterns match values based on format, regardless of context:

Parameter	Description	Example
Key	Label used to group detected values	`"account id"`
Regex pattern	Defines the structure of the value to match	`"[A-Z]{2}\\d{4}\\s\\d{3}"`
Matching behavior	Matches any value that fits the pattern	AB1234 567
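The doubled backslashes in the example pattern are string escaping, as used in JSON. After parsing, the regex contains single backslashes. The Python sketch below shows this with a hypothetical configuration fragment: the field names `key` and `regex_pattern` and the `extract` helper are illustrative and do not reflect the block's actual configuration schema.

```python
import json
import re

# Hypothetical configuration fragment written as JSON. Inside a JSON string,
# each regex backslash must be written as a double backslash.
raw_config = r'''
[
  {"key": "account id", "regex_pattern": "[A-Z]{2}\\d{4}\\s\\d{3}"}
]
'''

def extract(text: str, rules: list[dict]) -> dict[str, list[str]]:
    """Group every value that fits a rule's pattern under the rule's key."""
    found: dict[str, list[str]] = {}
    for rule in rules:
        values = re.findall(rule["regex_pattern"], text)
        if values:
            found.setdefault(rule["key"], []).extend(values)
    return found

rules = json.loads(raw_config)
print(rules[0]["regex_pattern"])  # [A-Z]{2}\d{4}\s\d{3}
print(extract("Primary AB1234 567, secondary CD9876 543", rules))
# {'account id': ['AB1234 567', 'CD9876 543']}
```

Both values are grouped under the same key because the rule matches on format alone.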

Keyword-based configuration

Parameter	Description
`root_regexes`	Base keywords used to identify a field (e.g., "Account", "Acc")
`suffix_sequence`	Expands keywords with variations (e.g., "ID", "Number", "#")
`is_suffix_optional`	Allows matching root keywords without a suffix
`custom_targets`	Additional keywords outside the main keyword pattern
`entity_regex`	Defines the structure of the value to match
`is_case_sensitive`	Controls whether keyword matching is case-sensitive

Advanced detection parameters

The Entity Recognition Block provides additional configuration options to refine detection behavior and handle more complex document layouts:

Parameter	Description
`entity_regex_error_tolerance`	Allows small variations in matched values (e.g., missing or extra characters)
`enable_table_search`	Enables matching based on table column headers
`enable_closest_word_line_below`	Searches for the closest value below a keyword
`enable_block_above_search`	Allows matching values located above the keyword
`allow_intervening_words`	Controls whether words between the keyword and the value are allowed

Supported regex patterns (examples)

Pattern	Description	Example
`[A-Z]{2}\d{4}\s\d{3}`	Two letters, four digits, space, three digits	AB1234 567
`[A-Z]{3}-\d{4}`	Three letters, dash, four digits	APP-1001
`\d{9}`	Nine-digit number	133563585
`[A-Z0-9]{4,}`	Alphanumeric string (four or more characters)	AB12, A123B
`\d{2}-\d{8}`	EIN format	10-55583948
`\d{3}-\d{2}-\d{4}`	SSN format	051-54-5373
`\d{4}[-\s]\d{4}[-\s]\d{4}[-\s]\d{4}`	Credit card format	4556-7515-7353-2924
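The patterns and examples in this table can be checked against each other with Python's `re` module. This is a verification sketch only: the `failing_examples` helper is hypothetical, and Python's regex dialect is assumed to behave the same as the block's for these simple patterns.

```python
import re

# (pattern, example values) pairs copied from the table above.
EXAMPLES = [
    (r"[A-Z]{2}\d{4}\s\d{3}", ["AB1234 567"]),
    (r"[A-Z]{3}-\d{4}", ["APP-1001"]),
    (r"\d{9}", ["133563585"]),
    (r"[A-Z0-9]{4,}", ["AB12", "A123B"]),
    (r"\d{2}-\d{8}", ["10-55583948"]),
    (r"\d{3}-\d{2}-\d{4}", ["051-54-5373"]),
    (r"\d{4}[-\s]\d{4}[-\s]\d{4}[-\s]\d{4}", ["4556-7515-7353-2924"]),
]

def failing_examples(examples: list[tuple[str, list[str]]]) -> list[tuple[str, str]]:
    """Return (pattern, value) pairs where the whole value does not match."""
    return [
        (pattern, value)
        for pattern, values in examples
        for value in values
        if re.fullmatch(pattern, value) is None
    ]

print(failing_examples(EXAMPLES))  # []

# Without boundaries, a pattern also matches inside a longer value:
print(re.search(r"\d{9}", "1234567890").group())  # 123456789
```

The last line shows why short numeric patterns benefit from keywords or boundaries: a nine-digit pattern also matches the first nine digits of a ten-digit number.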

Supported entity types

The block includes predefined patterns and keyword types for common entities, reducing the need for custom configuration.

Entity type	Description	Example	Notes
Email	Detects standard email formats	john.doe@example.com
Phone number	Detects common phone formats	+1 415 555 0123
Date	Supports multiple date formats	10 Mar 2026
Account number	Matches numeric values with separators	12345, 123-456	Common suffixes such as NUMBER, NO, NUM, or # may be used
Customer ID	Matches alphanumeric identifiers	AB12 345G	Common suffixes such as NUMBER, NO, NUM, or # may be used. Identifier variations such as ID, I.D, IDENTIFIER, or IDENTIFICATION may also be used.
Employee ID	Matches numeric identifiers	58391	Identifier variations such as ID, I.D, IDENTIFIER, or IDENTIFICATION
Employer ID	Matches EIN-style identifiers	10-55583948	Identifier variations such as ID, I.D, IDENTIFIER, or IDENTIFICATION
Passport number	Matches alphanumeric passport formats	123456A37B
Credit card number	Matches standard card formats	4556-7515-7353-2924
Routing number	Matches 9-digit numbers	133563585	Common suffixes such as NUMBER, NO, NUM, or # may be used
Application number	Matches alphanumeric values with separators	AA69994B22	Matches alphanumeric values (min length 4), allowing spaces and dashes
File number	Matches alphanumeric values with separators	AA-69994-B22	Matches alphanumeric values (min length 4), allowing spaces and dashes
