Clarification relevant to wildcard query

I’m aware of both OpenSearch and Elasticsearch best practices specific to wildcard queries (e.g., these are expensive queries, avoid beginning patterns with * or ?, etc.). However, until I inadvertently submitted a wildcard query on a text field today, it was my understanding that wildcard queries would ONLY work with fields mapped as either keyword or wildcard, and would generate an error when run against fields mapped as text.

I found that I was able to perform wildcard queries against fields mapped as text fields. I tested multiple variations and combinations of wildcard queries using * and ? wildcards in various positions. Search result sets yielded 100% accurate results.

I would appreciate clarification:

As the documentation refers to wildcard queries as term-level queries, should I AVOID using these queries with fields mapped as text fields? I have reviewed both OpenSearch and Elasticsearch documentation in-depth and cannot find any documentation that speaks to wildcard searches against fields mapped as text fields. I found a few Stack Overflow articles, but nothing that points back to official documentation on the subject.

I appreciate any clarifications you can provide.

Hi @maxfriz

Let’s assume OpenSearch is running on localhost:9200.

# OpenSearch host shortcut
export OSH='localhost:9200'

# -------------------------------------
# Check we can talk to cluster.
curl -I ${OSH}

# -------------------------------------
# Let's just clean up the cluster (delete all indexes).
curl -X DELETE ${OSH}/_all

# -------------------------------------
# Let's create a new index and define example
# analyzers for it.
curl -X PUT \
     -H "Content-Type: application/json" \
     ${OSH}/index_example -d \
'{
    "settings": {
        "analysis": {
            "analyzer": {
                "analyzer_one": {
                    "type": "keyword"
                },
                "analyzer_two": {
                    "type": "standard"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "field_one": {
                "type": "text",
                "analyzer": "analyzer_one"
            },
            "field_two": {
                "type": "text",
                "analyzer": "analyzer_two"
            }
        }
    }
}'

Now we have an empty index index_example with two analyzers defined: analyzer_one, which works the same way as the keyword analyzer, and analyzer_two, which works as the standard analyzer (this is what text fields use by default).

Hint: If you are using zsh (e.g., you are a Mac user), then I recommend running unsetopt nomatch on your CLI before continuing. For more details see https://github.com/ohmyzsh/ohmyzsh/issues/31#issuecomment-359728582

Let's see how these analyzers work on text.
First, for the analyzer_one analyzer (keyword):

curl -X GET \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_analyze?pretty -d \
'{
    "text": "The quick brown fox jumps over the lazy dog",
    "analyzer": "analyzer_one"
}'
# Returns
{
  "tokens" : [
    {
      "token" : "The quick brown fox jumps over the lazy dog",
      "start_offset" : 0,
      "end_offset" : 43,
      "type" : "word",
      "position" : 0
    }
  ]
}

Second, for the analyzer_two analyzer (standard):

curl -X GET \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_analyze?pretty -d \
'{
    "text": "The quick brown fox jumps over the lazy dog",
    "analyzer": "analyzer_two"
}'
# Returns
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "fox",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "jumps",
      "start_offset" : 20,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 26,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "the",
      "start_offset" : 31,
      "end_offset" : 34,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "lazy",
      "start_offset" : 35,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "dog",
      "start_offset" : 40,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 8
    }
  ]
}

As you can see, the first analyzer produced a single token, while the second analyzer produced several individual tokens (one for each word). These tokens are the key information units that get stored in Lucene fields.

The wildcard query is a term-level query, which means it operates on the individual terms stored in the document's fields.

Now, let’s index a document and test how it works:

# -------------------------------------
# Index the following document.
curl -X PUT \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_doc/1 -d \
'{
    "field_one": "The quick brown fox jumps over the lazy dog",
    "field_two": "The quick brown fox jumps over the lazy dog"
}'

# -------------------------------------
# Test wildcard search on field_one
curl -X GET \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_search?filter_path=hits.hits\&pretty -d \
'{
    "query": {
        "wildcard": {
            "field_one": "?he*"
        }
    }
}'
# Returns
{
  "hits" : {
    "hits" : [
      {
        "_index" : "index_example",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "field_one" : "The quick brown fox jumps over the lazy dog",
          "field_two" : "The quick brown fox jumps over the lazy dog"
        }
      }
    ]
  }
}

But if you replace ?he* with ?he or ?uic?, the document will not be found.
However, these patterns work fine on field_two. For instance:
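To illustrate, here is the same search on field_one with one of those patterns (same index and document as above):

```shell
# -------------------------------------
# A pattern that cannot span the whole value fails on
# field_one, because the keyword analyzer stored the
# entire sentence as ONE token, and "?uic?" must match
# a complete token (exactly five characters).
curl -X GET \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_search?filter_path=hits.hits\&pretty -d \
'{
    "query": {
        "wildcard": {
            "field_one": "?uic?"
        }
    }
}'
```

The response contains no hits: the pattern would have to match the single token "The quick brown fox jumps over the lazy dog" in its entirety, and it cannot.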

# -------------------------------------
# Test wildcard search on field_two
curl -X GET \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_search?filter_path=hits.hits\&pretty -d \
'{
    "query": {
        "wildcard": {
            "field_two": "?uic?"
        }
    }
}'
# Returns
{
  "hits" : {
    "hits" : [
      {
        "_index" : "index_example",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "field_one" : "The quick brown fox jumps over the lazy dog",
          "field_two" : "The quick brown fox jumps over the lazy dog"
        }
      }
    ]
  }
}

In short, you can use term-level queries against text fields (or any other field that stores terms), but you need to keep in mind that the query operates at the term level. I demonstrated above how you can learn how many terms are created from your text. The more terms stored in the field, the more resources will be needed to iterate over them. It is not that you should avoid doing it, but you should measure the impact of it.
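If you want both full-text search and whole-value wildcard matching on the same field, one common approach is a keyword sub-field via the multi-field `fields` mapping parameter. This is only a sketch; the index name index_example_2 and the sub-field name raw are just examples, not anything from the thread above:

```shell
# -------------------------------------
# Hypothetical mapping: the main field is analyzed as
# text (one term per word), while the "raw" sub-field
# stores the whole value as a single keyword term.
curl -X PUT \
     -H "Content-Type: application/json" \
     ${OSH}/index_example_2 -d \
'{
    "mappings": {
        "properties": {
            "field_two": {
                "type": "text",
                "fields": {
                    "raw": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}'
# Wildcard queries can then target "field_two" to match
# individual words, or "field_two.raw" to match against
# the entire original value.
```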

HTH,
Lukáš


@lukas-vlcek Thank you for this easy-to-follow, detailed explanation. This provides a lot of clarity and helps me make some decisions about analyzers, mappings, and queries. Grateful for your extended detail.

Thank you again.

Max
