Entity Recognition Block


To extract meaningful data from documents, it is necessary to identify both context-based and structured information within text. The Entity Recognition Block is a processing block used within a flow to identify and extract this information from your documents. It combines two complementary approaches:

  • Context-based recognition — identifying entities based on their meaning and context within the text.

  • Pattern-based detection — a rule-based approach that identifies structured data using predefined patterns and keywords.

By combining both approaches, flows can extract a wider range of information from document text with greater accuracy and control. In this article, you’ll learn how to leverage the Entity Recognition Block for your use case.

Context-based recognition

The Entity Recognition Block uses a context-based approach to identify entities based on their meaning and context within the text. Because the block relies on contextual understanding, it is well-suited for extracting information that does not follow a fixed or predictable format, including:

  • Names of people

  • Names of organizations

  • Addresses

How it works

The block analyzes the surrounding text to determine whether a word or phrase represents a specific type of entity. Instead of relying on predefined patterns, it uses a trained model to classify entities based on context. Use this approach when extracting information that:

  • does not follow a fixed format,

  • depends on context for correct interpretation, or

  • may appear in different forms across documents.

What affects the results

Performance depends on the quality and structure of the input text. Key factors include:

  • Text quality — errors in transcription (e.g., OCR mistakes) can reduce recognition accuracy. To learn more, see Text Segmentation.

  • Context availability — entities are identified based on surrounding words, so limited context may reduce accuracy.

  • Variability in wording — unusual phrasing or formatting can make entity recognition more difficult.

Limitations of context-based recognition

Because this approach does not rely on predefined patterns, it may not consistently detect highly structured values. For example, it is not well suited for:

  • account numbers,

  • IDs, or

  • other values that follow a strict format.

In such cases, pattern-based approaches provide more reliable results.

Pattern-based detection

Pattern-based detection is a rule-based approach that identifies structured data using predefined patterns and keywords. It detects entities by:

  • Matching text against regular expressions (regex).

  • Optionally validating matches using keywords.

This approach is suitable for extracting information that:

  • Follows a consistent and predictable format.

  • Can be defined using patterns (regular expressions).

  • May require precise control over how values are detected.

Typical use cases include the following:

  • account numbers

  • IDs

  • dates

  • emails and other formatted values

How it works

Pattern-based detection processes text input (typically from a Transcription block) and applies the configured rules to detect matching values. It supports two main configuration approaches:

  • Regex-based detection — identifies values based on their format.

  • Keyword-based detection — narrows down matches using surrounding keywords.

Starting with v43, the block can detect entities that span multiple lines or pages. This improves accuracy for real-world documents, where a value may be split across lines or continue onto the next page.
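As a rough illustration of why multi-line support matters, the sketch below compares searching line by line with searching the whole text. It uses plain Python regular expressions, not the block's actual mechanism, which the article does not describe; the sample text and the function names are made up for the example.

```python
import re

# Credit card pattern from the "Supported regex patterns" table.
# Note that \s also matches a newline, so a separator may be a line break.
CARD = re.compile(r"\d{4}[-\s]\d{4}[-\s]\d{4}[-\s]\d{4}")

# Hypothetical transcription in which the value wraps onto the next line.
text = "Card number: 4556 7515\n7353 2924"

def search_per_line(document: str) -> list[str]:
    """Search each line in isolation (misses values split across lines)."""
    return [m.group() for line in document.splitlines() for m in CARD.finditer(line)]

def search_whole_text(document: str) -> list[str]:
    """Search the full text so a match may continue past a line break."""
    return [m.group() for m in CARD.finditer(document)]

print(search_per_line(text))    # []
print(search_whole_text(text))  # ['4556 7515\n7353 2924']
```

A per-line search never sees both halves of the value at once, which is the situation multi-line detection is meant to address.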

What affects the results

Pattern-based detection performance depends on how well the detection rules are defined and how closely the input text matches those rules.

Key factors include:

  • Regex accuracy — incorrectly defined patterns may result in missed or incorrect matches.

  • Keyword configuration — using relevant keywords can improve precision by narrowing down matches.

  • Input consistency — this approach performs best when the data follows a predictable format.

  • Error-tolerance settings — allowing variations in pattern matching can increase coverage but may introduce false positives.

Limitations

Pattern-based detection does not interpret meaning and cannot rely on context to identify entities. As a result:

  • It may detect values that match a pattern but are not relevant.

  • It requires manual configuration and tuning.

  • It is less effective for extracting information that varies significantly in wording or structure.

Example

The Entity Recognition Block can extract both context-based and structured values from a single document. From a customer application form, for example, it identifies:

  • John Doe → Person name (context-based)

  • John Doe Inc → Organization (context-based)

  • 123 Example Street, Example City → Address (context-based)

  • APP-1001 → Application ID (pattern-based)

  • john.doe@email.com → Email (pattern-based)

  • +1 415 555 0123 → Phone number (pattern-based)

  • 10 Mar 2026 → Date (pattern-based)

Regex-based detection

Regex-based detection identifies values by their format, regardless of context. For example, the pattern [A-Z]{3}-\d{4} detects “APP-1001” and any other structured ID that matches the same format.
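The behavior of this pattern can be tried out with Python's standard re module. This is only a sketch of format-based matching, not the block's implementation; the sample text, the find_ids helper, and the added word boundaries (\b) are assumptions made for the example.

```python
import re

# Pattern from the article: three uppercase letters, a dash, four digits.
# The \b word boundaries are an addition so longer tokens are not partially matched.
APPLICATION_ID = re.compile(r"\b[A-Z]{3}-\d{4}\b")

def find_ids(text: str) -> list[str]:
    """Return every substring of `text` that fits the ID format."""
    return APPLICATION_ID.findall(text)

sample = "Application ID: APP-1001, reference REF-2040, phone +1 415 555 0123"
print(find_ids(sample))  # ['APP-1001', 'REF-2040']
```

REF-2040 is returned even though it is not an application ID: a regex alone matches on format only, which is why keywords are useful for narrowing results.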

Using predefined patterns

Pattern-based detection includes predefined patterns and keyword types for common entities such as emails, phone numbers, and identification numbers, reducing the need for custom configuration.

From the same document:

  • john.doe@email.com → Email

  • +1 415 555 0123 → Phone number

  • 10 Mar 2026 → Date

These entities can be detected using predefined configurations without defining custom regex patterns.

Unlike context-based recognition, the pattern-based approach does not rely on context and matches any value that fits the defined pattern. This makes it highly effective for structured data, but its results depend on correct configuration.

Handling variations in structured data

Pattern-based detection can be configured to allow small variations in matched values by introducing tolerance in pattern matching. For example, a pattern can be set to match values that are slightly different from the expected format, such as missing characters or minor deviations in structure. Doing so can improve coverage when data is inconsistent but may also increase the risk of incorrect matches.
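One way to picture tolerance is as an edit distance between a value and the expected format. The sketch below is not how the block implements its error tolerance (the article does not say); it is a minimal stand-in that counts how many characters would have to be added, removed, or replaced for a value to fit [A-Z]{3}-\d{4}. All names in it are hypothetical.

```python
from typing import Callable, Sequence

CharClass = Callable[[str], bool]

def is_upper(c: str) -> bool: return "A" <= c <= "Z"
def is_digit(c: str) -> bool: return "0" <= c <= "9"
def is_dash(c: str) -> bool: return c == "-"

# Slot-by-slot equivalent of the pattern [A-Z]{3}-\d{4}.
TEMPLATE: list[CharClass] = [is_upper] * 3 + [is_dash] + [is_digit] * 4

def edit_errors(value: str, template: Sequence[CharClass]) -> int:
    """Minimum number of missing, extra, or wrong characters in `value`."""
    prev = list(range(len(template) + 1))  # empty value: every slot is missing
    for i, ch in enumerate(value, start=1):
        curr = [i]  # empty template: every character so far is extra
        for j, accepts in enumerate(template, start=1):
            wrong_char = prev[j - 1] + (0 if accepts(ch) else 1)
            extra_char = prev[j] + 1
            missing_char = curr[j - 1] + 1
            curr.append(min(wrong_char, extra_char, missing_char))
        prev = curr
    return prev[-1]

def matches_with_tolerance(value: str, max_errors: int) -> bool:
    return edit_errors(value, TEMPLATE) <= max_errors

print(edit_errors("APP-1001", TEMPLATE))  # 0
print(edit_errors("APP-101", TEMPLATE))   # 1
print(matches_with_tolerance("APP-10O1", max_errors=1))  # True
```

Raising max_errors accepts more damaged values (for example, a letter O read in place of a zero) but also lets unrelated tokens through, which is the coverage versus precision trade-off described above.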

Keyword-based detection

In addition to matching values based on their format, the Entity Recognition Block can use keywords to determine the meaning of those values within the document.

From the example:

  • APP-1001 → Application ID

  • +1 415 555 0123 → Phone number

  • 10 Mar 2026 → Date

These values are identified based on their structure and their association with nearby keywords, such as Application ID, Phone Number, or Submission Date.

Combining patterns and keywords

The block can combine patterns with keywords to improve precision. For example, a pattern may match multiple values in a document, but keywords such as Account, ID, or Number can help associate the detected value with the correct field.

This combination helps reduce false positives and ensures that detected values are associated with the correct fields.
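The effect of combining a pattern with keywords can be sketched in a few lines of Python. The example is an approximation under stated assumptions: the 30-character window, the same-line rule, the sample text, and the values_near_keyword helper are all invented for illustration and do not describe the block's internal logic.

```python
import re

# A format that several different fields share: any nine-digit number.
VALUE = re.compile(r"\b\d{9}\b")
# Keywords that mark the field of interest (hypothetical choice).
KEYWORD = re.compile(r"\b(?:Account|Acc)\s*(?:ID|Number|#)?", re.IGNORECASE)

def values_near_keyword(text: str, window: int = 30) -> list[str]:
    """Keep only values preceded by a keyword on the same line."""
    hits = []
    for line in text.splitlines():
        for m in VALUE.finditer(line):
            preceding = line[max(0, m.start() - window):m.start()]
            if KEYWORD.search(preceding):
                hits.append(m.group())
    return hits

doc = "Routing Number: 133563585\nAccount Number: 987654321\nCall 415 555 0123"
print(VALUE.findall(doc))        # ['133563585', '987654321']
print(values_near_keyword(doc))  # ['987654321']
```

The pattern alone returns both nine-digit numbers; adding the keyword check keeps only the one labeled as an account number.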

Regex patterns (core formats)

Learn how to configure the Entity Recognition Block using regular expressions to match specific value formats.

Regex-based configuration

The Entity Recognition Block uses regular expressions (regex) to define patterns for structured values. These patterns match values based on format, regardless of context:

| Parameter | Description | Example |
| --- | --- | --- |
| Key | Label used to group detected values | "account id" |
| Regex pattern | Defines the structure of the value to match | "[A-Z]{2}\\d{4}\\s\\d{3}" |
| Matching behavior | Matches any value that fits the pattern | AB1234 567 |
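The doubled backslashes in the example pattern suggest that it is written as a JSON-style escaped string; that is an assumption, as are the field names "key" and "regex_pattern" used below. Under that assumption, this sketch shows how the escaped text decodes to the actual regex and matches the example value.

```python
import json
import re

# Hypothetical configuration fragment; the field names are illustrative only.
config_text = r'{"key": "account id", "regex_pattern": "[A-Z]{2}\\d{4}\\s\\d{3}"}'

config = json.loads(config_text)  # JSON decoding turns \\d into \d
pattern = re.compile(config["regex_pattern"])

print(config["regex_pattern"])                      # [A-Z]{2}\d{4}\s\d{3}
print(pattern.fullmatch("AB1234 567") is not None)  # True
print(pattern.fullmatch("AB1234567") is not None)   # False (the space is required)
```

If a pattern that works in a regex tester stops matching after being pasted into a configuration, a missing or extra level of backslash escaping is a common cause.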

Keyword-based configuration

| Parameter | Description |
| --- | --- |
| root_regexes | Base keywords used to identify a field (e.g., "Account", "Acc") |
| suffix_sequence | Expands keywords with variations (e.g., "ID", "Number", "#") |
| is_suffix_optional | Allows matching root keywords without a suffix |
| custom_targets | Additional keywords outside the main keyword pattern |
| entity_regex | Defines the structure of the value to match |
| is_case_sensitive | Controls whether keyword matching is case-sensitive |
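To make the interplay of these parameters more concrete, the sketch below assembles a keyword regex from root_regexes, suffix_sequence, is_suffix_optional, and is_case_sensitive, then looks for an entity_regex value right after the keyword. The parameter names come from the table; how the block actually combines them is not documented here, so the functions build_keyword_regex and extract_values and their semantics are assumptions.

```python
import re

def build_keyword_regex(root_regexes, suffix_sequence,
                        is_suffix_optional=True, is_case_sensitive=False):
    """Assumed semantics: any root keyword, followed by any one suffix."""
    roots = "|".join(root_regexes)  # roots are treated as regexes already
    suffixes = "|".join(re.escape(s) for s in suffix_sequence)
    suffix_part = rf"(?:\s*(?:{suffixes}))"
    if is_suffix_optional:
        suffix_part += "?"
    flags = 0 if is_case_sensitive else re.IGNORECASE
    return re.compile(rf"\b(?:{roots}){suffix_part}", flags)

def extract_values(text, keyword, entity_regex):
    """Return values matching entity_regex that directly follow a keyword."""
    value = re.compile(rf"[\s:]*({entity_regex})")
    found = []
    for kw in keyword.finditer(text):
        m = value.match(text, kw.end())  # start looking where the keyword ends
        if m:
            found.append(m.group(1))
    return found

keyword = build_keyword_regex(["Account", "Acc"], ["ID", "Number", "#"])
text = "account number: 12345\nAcc # 123-456\nInvoice 99999"
print(extract_values(text, keyword, r"\d+(?:-\d+)?"))  # ['12345', '123-456']
```

The value 99999 is skipped because no keyword precedes it. Setting is_suffix_optional to False in this sketch would additionally require a suffix such as "Number" or "#" after the root keyword.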

Advanced detection parameters

Entity recognition provides additional configuration options to refine detection behavior and handle more complex document layouts:

| Parameter | Description |
| --- | --- |
| entity_regex_error_tolerance | Allows small variations in matched values (e.g., missing or extra characters) |
| enable_table_search | Enables matching based on table column headers |
| enable_closest_word_line_below | Searches for the closest value below a keyword |
| enable_block_above_search | Allows matching values located above the keyword |
| allow_intervening_words | Controls whether words between the keyword and the value are allowed |

Supported regex patterns (examples)

| Pattern | Description | Example |
| --- | --- | --- |
| [A-Z]{2}\d{4}\s\d{3} | Two letters, four digits, space, three digits | AB1234 567 |
| [A-Z]{3}-\d{4} | Three letters, dash, four digits | APP-1001 |
| \d{9} | Nine-digit number | 133563585 |
| [A-Z0-9]{4,} | Alphanumeric string (four or more characters) | AB12, A123B |
| \d{2}-\d{8} | EIN format | 10-55583948 |
| \d{3}-\d{2}-\d{4} | SSN format | 051-54-5373 |
| \d{4}[-\s]\d{4}[-\s]\d{4}[-\s]\d{4} | Credit card format | 4556-7515-7353-2924 |
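A quick way to sanity-check patterns like these before adding them to a configuration is to run each one against its example value. The sketch below does this with Python's re module; the EXAMPLES dictionary simply restates the table, and the check_examples helper is hypothetical.

```python
import re

# Patterns and example values copied from the table above.
EXAMPLES = {
    r"[A-Z]{2}\d{4}\s\d{3}": ["AB1234 567"],
    r"[A-Z]{3}-\d{4}": ["APP-1001"],
    r"\d{9}": ["133563585"],
    r"[A-Z0-9]{4,}": ["AB12", "A123B"],
    r"\d{2}-\d{8}": ["10-55583948"],
    r"\d{3}-\d{2}-\d{4}": ["051-54-5373"],
    r"\d{4}[-\s]\d{4}[-\s]\d{4}[-\s]\d{4}": ["4556-7515-7353-2924"],
}

def check_examples(examples: dict[str, list[str]]) -> dict[str, bool]:
    """Map each pattern to True if it fully matches all of its examples."""
    return {
        pattern: all(re.fullmatch(pattern, value) for value in values)
        for pattern, values in examples.items()
    }

results = check_examples(EXAMPLES)
print(all(results.values()))  # True

# Without anchoring, a pattern also matches inside longer values:
print(re.search(r"\d{9}", "1234567890") is not None)  # True
```

The last line is a reminder that a short pattern such as \d{9} can match part of a longer number unless boundaries or keywords restrict it.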

Supported entity types

The block includes predefined patterns and keyword types for common entities, reducing the need for custom configuration.

| Entity type | Description | Example | Notes |
| --- | --- | --- | --- |
| Email | Detects standard email formats | john.doe@example.com | |
| Phone number | Detects common phone formats | +1 415 555 0123 | |
| Date | Supports multiple date formats | 10 Mar 2026 | |
| Account number | Matches numeric values with separators | 12345, 123-456 | Common suffixes such as NUMBER, NO, NUM, or # may be used |
| Customer ID | Matches alphanumeric identifiers | AB12 345G | Common suffixes such as NUMBER, NO, NUM, or # may be used; identifier variations such as ID, I.D, IDENTIFIER, or IDENTIFICATION |
| Employee ID | Matches numeric identifiers | 58391 | Identifier variations such as ID, I.D, IDENTIFIER, or IDENTIFICATION |
| Employer ID | Matches EIN-style identifiers | 10-55583948 | Identifier variations such as ID, I.D, IDENTIFIER, or IDENTIFICATION |
| Passport number | Matches alphanumeric passport formats | 123456A37B | |
| Credit card number | Matches standard card formats | 4556-7515-7353-2924 | |
| Routing number | Matches 9-digit numbers | 133563585 | Common suffixes such as NUMBER, NO, NUM, or # may be used |
| Application number | Matches alphanumeric values with separators | AA69994B22 | Matches alphanumeric values (min length 4), allowing spaces and dashes |
| File number | Matches alphanumeric values with separators | AA-69994-B22 | Matches alphanumeric values (min length 4), allowing spaces and dashes |