Keywords: ElasticSearch | nGram | partial search
Abstract: This article examines technical solutions for implementing partial word search in ElasticSearch, focusing on the configuration and application of the nGram tokenizer. By comparing standard queries with the nGram approach, it explains how to correctly set up analyzers, tokenizers, and filters to solve the common problem of a search for "Doe" failing to match documents containing "Doeman" and "Doewoman". The article provides complete configuration examples and code to help developers understand ElasticSearch's text analysis mechanism and improve both search efficiency and accuracy.
Introduction
In practical applications of ElasticSearch, users often need to search for parts of words, such as finding documents containing "Doe" to match "John Doeman" and "Jane Doewoman". However, the default standard analyzer does not support such partial matching, so these queries return no results. Drawing on the accepted answer to the original question, this article explores how to implement efficient partial word search using nGram technology.
Problem Analysis
When users attempt simple queries (e.g., curl http://localhost:9200/my_idx/my_type/_search?q=Doe) or term queries (e.g., {"query": {"term": {"name": "Doe"}}}), no results are returned. This is because ElasticSearch defaults to the standard analyzer, which tokenizes text into complete words: "Doeman" is indexed as the single token "doeman", so the substring "Doe" is never indexed on its own. In addition, term queries are not analyzed at all, so the capitalized term "Doe" cannot even match the lowercased token. The analysis strategy therefore needs adjustment to support partial matching.
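To make the failure concrete, here is a minimal sketch (not ElasticSearch's actual implementation) of what the standard analyzer roughly does to a name field: it splits on word boundaries and lowercases, so only whole words end up in the index.

```python
def standard_analyze(text):
    """Rough stand-in for the standard analyzer: split into words, lowercase."""
    return {word.lower() for word in text.split()}

tokens = standard_analyze("John Doeman")
print(tokens)            # {'john', 'doeman'}
print("doe" in tokens)   # False: the substring was never indexed as a token
```

Because only `john` and `doeman` are indexed, a search for the exact term "Doe" has nothing to match against.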
nGram Technology Principle
nGram is a technique that splits text into continuous character sequences. By setting minimum and maximum gram lengths, it can generate all possible substrings of a word. For example, for the word "Doeman", with min_gram=2 and max_gram=50, tokens like "Do", "oe", "em", "ma", "an", "Doe", "oem" are generated, thus indexing the "Doe" part. This allows search queries to match any substring within words.
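The sliding-window idea above can be sketched in a few lines of Python. This is an illustration of the n-gram concept, not ElasticSearch's internal code; the function name and defaults are chosen to mirror the article's min_gram=2, max_gram=50 example.

```python
def char_ngrams(text, min_gram=2, max_gram=50):
    """Generate all character n-grams of `text` with lengths in [min_gram, max_gram]."""
    text = text.lower()  # mirror the lowercase token filter
    grams = []
    for n in range(min_gram, min(max_gram, len(text)) + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams

tokens = char_ngrams("Doeman")
print(tokens[:5])        # ['do', 'oe', 'em', 'ma', 'an']
print("doe" in tokens)   # True: the substring is now an indexable token
```

Once "doe" exists as a token in the index, an analyzed query for "Doe" can match it like any other term.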
Configuration Implementation
Based on the best answer, we adopt a scheme combining the standard tokenizer with an nGram filter. Here is a complete index configuration example:
{
  "index": {
    "index": "my_idx",
    "type": "my_type",
    "analysis": {
      "index_analyzer": {
        "my_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "mynGram"
          ]
        }
      },
      "search_analyzer": {
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "mynGram"
          ]
        }
      },
      "filter": {
        "mynGram": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 50
        }
      }
    }
  }
}

In this configuration, the index analyzer converts document text to lowercase and applies the nGram filter to generate substring tokens; the search analyzer ensures query terms undergo the same processing so they can match. Setting max_gram=50 handles long words (e.g., German compound words), and users can adjust it to their actual needs. Note that this index_analyzer/search_analyzer settings layout and the camel-case filter type "nGram" come from older ElasticSearch releases; recent versions spell the filter type "ngram" and attach analyzers to fields via the analyzer and search_analyzer mapping parameters.
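The interaction between index-time and search-time analysis can be simulated outside ElasticSearch. The sketch below (an illustration, not the engine's actual matching code) applies the same analysis chain to both the document and the query, then checks whether they share any token, which is the essence of why "Doe" now matches "John Doeman".

```python
def analyze(text, min_gram=2, max_gram=50):
    """Rough stand-in for the custom analyzer: word tokenization,
    lowercasing, then an n-gram token filter applied to each word."""
    grams = set()
    for word in text.lower().split():  # crude "standard" tokenizer
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            for i in range(len(word) - n + 1):
                grams.add(word[i:i + n])
    return grams

index_tokens = analyze("John Doeman")  # analysis applied at index time
query_tokens = analyze("Doe")          # same analysis applied at search time
print(bool(index_tokens & query_tokens))  # True: shared tokens -> match
```

Using the same (or compatible) analysis on both sides is the key point: if the query were left unanalyzed, the capitalized "Doe" would again match nothing.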
Performance and Optimization
Using nGram may increase index size and query time, but compared to wildcard queries (e.g., *Doe* in query_string), it is generally more efficient on large datasets. Wildcard queries require scanning the entire index, potentially degrading performance, while nGram supports fast lookups through pre-computed substrings. It is recommended to weigh the choice based on data volume and query patterns: nGram is preferred for high-frequency partial searches, while wildcard queries are simpler for ad-hoc needs.
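The index-size cost mentioned above is easy to quantify. For a word of length L (with L below max_gram), the number of generated grams is the sum of (L - n + 1) over each gram length n, so a single token fans out into dozens or hundreds of index entries. A small illustrative calculation:

```python
def ngram_count(length, min_gram=2, max_gram=50):
    """Number of character n-grams produced for a word of the given length."""
    top = min(max_gram, length)
    return sum(length - n + 1 for n in range(min_gram, top + 1))

for word in ["Doe", "Doeman", "Rindfleischetikettierung"]:
    print(word, ngram_count(len(word)))
# "Doe" (3 chars) -> 3 grams; "Doeman" (6 chars) -> 15 grams;
# the 24-char German compound -> 276 grams from a single token
```

This is why min_gram and max_gram should be tuned to the data: every extra gram length adds another pass over each word to the index.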
Supplementary Solutions
Other answers provide alternative methods: using query_string queries (e.g., {"query": {"query_string": {"default_field": "name", "query": "*Doe*"}}}) can achieve partial matching, but performance risks should be noted. Additionally, avoid using the nGram tokenizer alone to prevent over-tokenization that returns all documents—as in the user's initial attempt with min_gram=1 and max_gram=1, which generated single-character tokens and caused excessive matches.
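The over-tokenization failure mode is easy to reproduce in miniature. With min_gram=1 and max_gram=1, every character becomes its own token, so a query for "Doe" degenerates into "any document containing d, o, or e", which in practice is nearly every document. A sketch (illustrative only):

```python
def unigrams(text):
    """min_gram=1, max_gram=1: every letter becomes its own token."""
    return {c for c in text.lower() if c.isalpha()}

docs = ["John Doeman", "Jane Doewoman", "Completely unrelated"]
query = unigrams("Doe")  # {'d', 'o', 'e'}
hits = [doc for doc in docs if unigrams(doc) & query]
print(hits)  # all three documents share at least one letter with the query
```

This is exactly the behavior described above: the query "matches" everything, which is why a minimum gram length of 2 or more is the sensible starting point.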
Conclusion
By properly configuring the nGram analyzer, ElasticSearch can effectively support partial word search. This article details the configuration steps, principles, and optimization suggestions to help developers solve practical search challenges. In practice, it is advisable to combine data characteristics and query requirements to select the most suitable solution, balancing search accuracy and system performance.