DLQ Query if field contains GUID

I have a field named chainOfIds. This field has up to 15 GUIDs with indices appended and separated by commas.

See example:

chainOfIds: caf5d635-20cf-4fd0-99a1-d8ef7a94721e.0, 77c5eb92-17d5-4382-ba89-41e053a5f520.1, 4ded4e92-9c28-4f71-8963-385e6d310293.2…

I am trying to create a query that checks whether chainOfIds contains a particular GUID. So far, my efforts have been futile since I couldn't figure out how to escape the hyphen character and make the query work.

I started with the following query which returned 0 results

"query": {
    "simple_query_string": {
        "query": "caf5d635-20cf-4fd0-99a1-d8ef7a94721e",
        "fields": ["chainOfIds"]
    }
}

I tried querying for the last part of the GUID and got even more confused.

"query": "d8ef7a94721e*" returns 198 results. One wildcard after the string works.
"query": "*d8ef7a94721e*" returns 0 results. Two wildcards do not.

Furthermore, if I add an escaped hyphen to the query that returned results, I get 0 again.

"query": "\-d8ef7a94721e*" returns 0 results

What am I missing? Is there even a way to create a "CONTAINS" type of query in OpenSearch?

Hi @mmkt

Can you please share the mapping of the field?
My initial thought would be to create a custom analyzer for this field. This analyzer would remove the commas and break the field value into individual tokens on whitespace. This will probably lead to the use of term-level queries.

Regards,
Lukáš

Hi, I have the same issue here.
Which option keeps the hyphen?
Should I change the tokenizer?

Hi,

I will try to give a more detailed answer:

In such a use case it is important to pay attention to the field mapping, the text analysis configuration and the query type used.

Let's explore it with an example. Assume OpenSearch is running on localhost:9200. In our case it will be the minimal distribution (v2.2.0), so we do not need to worry about setting up security.

First, let’s prepare the cluster:

# OpenSearch host shortcut
export OSH='localhost:9200'

# Check how many indices we have in the cluster
curl -X GET ${OSH}/_cat/indices | wc -l
0

# If there are some indices, this command can be used to drop them all
curl -X DELETE ${OSH}/_all

We will work with documents that look like this:

// Doc #1
{
    "name": "foo",
    "chain": "Abc-1.0, Bcd-1.1, Cde-1.2"
}
// Doc #2
{
    "name": "bar",
    "chain": "Abc-1.0, Bcd-2.1, Cde-2.2"
}

// In this case both documents share the first item in the chain field,
// the rest of the items are unique. We will use this later when querying...

Text Analysis Process

Before we create the mapping for the index that these documents will be stored in, let's talk about how the text is analyzed during indexing. The text goes through a process called Analysis (that is Lucene terminology) and you can find a lot of details about it in the Lucene Analysis package documentation.

The main concept is that Analysis transforms the input text into a set of tokens that are stored in the Lucene index along with some other data.

As explained in the Lucene documentation, the Analysis has three (optional) steps:

  1. CharFilter: It can transform the text before it is tokenized. Typically, it can be used to remove or replace specific characters. A good example would be the HTMLStripCharFilter that can strip out HTML entities from the text.
  2. Tokenizer: Its function is to break the incoming text into individual tokens (hence creating a "token stream"). Basically, the incoming text "Quick fox jumps" will be broken into three tokens: ["Quick", "fox", "jumps"]. In this case the tokenizer broke the text on whitespace. There are various tokenizers to fulfill different purposes.
  3. TokenFilter: Once a token is created it can be subject to additional processing. For example, language processing (stemming, lemmatization, injecting synonyms, lowercasing, etc.).

Unfortunately, at this moment the text analysis process and its options are not well covered in the online documentation. We need to improve it.

Analyze API in Action

OpenSearch provides the /_analyze API (I think it is also missing in the online documentation ATM) to experiment with the Analysis components and test the effect they have on a particular text. We will use it now.

By default, text fields are processed by a built-in analyzer called "standard". I believe that in Lucene terminology this now corresponds to the ClassicAnalyzer (which was called StandardAnalyzer in the past). It uses the ClassicTokenizer. (Check the Lucene API documentation to see how the ClassicTokenizer splits tokens on punctuation and hyphens.)

See what the standard analyzer does to our chain field content:

curl -s -X GET \
         -H "Content-Type: application/json" \
         ${OSH}/_analyze\?pretty\&filter_path=tokens.token -d \
'{
    "analyzer": "standard",
    "text": "Abc-1.0, Bcd-1.1, Cde-1.2"
}'

# Or better, filter the output through jq, like this: | jq '.tokens[].token'
# to get only the individual tokens.

curl -s -X GET \
         -H "Content-Type: application/json" \
         ${OSH}/_analyze\?pretty\&filter_path=tokens.token -d \
'{
    "analyzer": "standard",
    "text": "Abc-1.0, Bcd-1.1, Cde-1.2"
}' | jq '.tokens[].token'

# Output:
"abc"
"1.0"
"bcd"
"1.1"
"cde"
"1.2"

Right now you can see that matching such tokens in a query can lead to many false positives. For example, a query for Cde-1.1 will "incorrectly" yield a match. The reason is that the standard tokenizer splits tokens on hyphens.

Now, let’s do some “magic” (do not worry, I will explain…):

curl -s -X GET \
         -H "Content-Type: application/json" \
         ${OSH}/_analyze\?pretty\&filter_path=tokens.token -d \
'{
    "char_filter": [
        {
        "type": "mapping",
        "mappings": [
            ", => "
        ]
        }
    ],
    "tokenizer": "whitespace",
    "text": "Abc-1.0, Bcd-1.1, Cde-1.2"
}' | jq '.tokens[].token'

# Output:
"Abc-1.0"
"Bcd-1.1"
"Cde-1.2"

What I did was, instead of using one of the pre-defined analyzers, configure a custom analyzer on the fly by putting together individual analyzer components. (In particular a CharFilter and a Tokenizer; no TokenFilter was defined.)

I started with a custom CharFilter whose only purpose is to remove commas. I assume that we can safely remove commas, which leaves us with chain items separated only by a space (or any member of the whitespace family). I do this because then I can safely use the WhitespaceTokenizer to get a stream of correctly split tokens.

The result is a token stream of all our chain items nicely separated. Depending on how you want to do the token matching during the query phase, you might want to consider adding a LowerCaseFilter, which is a TokenFilter (i.e. the third step in the analysis process).

The complete analysis setup would then look like this:

curl -s -X GET \
         -H "Content-Type: application/json" \
         ${OSH}/_analyze\?pretty\&filter_path=tokens.token -d \
'{
    "char_filter": [
        {
        "type": "mapping",
        "mappings": [
            ", => "
        ]
        }
    ],
    "tokenizer": "whitespace",
    "filter": ["lowercase"],
    "text": "Abc-1.0, Bcd-1.1, Cde-1.2"
}' | jq '.tokens[].token'

# Output:
"abc-1.0"
"bcd-1.1"
"cde-1.2"

Note that at this point I assume capital letters do not have any specific meaning in your IDs. In other words: AaBb == aabb.

Indexing Time

Now, we can finally create an index with a predefined custom analyzer for the chain field:

curl -X PUT \
     -H "Content-Type: application/json" \
     ${OSH}/index_example\?pretty -d \
'{
    "settings": {
        "analysis": {
            "char_filter": {
                "ID_normalization": {
                    "type": "mapping",
                    "mappings": [
                        ", => "
                    ]
                }
            },
            "analyzer": {
                "custom_chain_analyzer": {
                    "type": "custom",
                    "char_filter": [
                        "ID_normalization"
                    ],
                    "tokenizer": "whitespace",
                    "filter": ["lowercase"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "standard"
            },
            "chain": {
                "type": "'text'",
                "analyzer": "custom_chain_analyzer"
            }
        }
    }
}'

And index our documents:

curl -X POST \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_doc -d \
'{
    "name": "foo",
    "chain": "Abc-1.0, Bcd-1.1, Cde-1.2"
}'

curl -X POST \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_doc -d \
'{
    "name": "bar",
    "chain": "Abc-1.0, Bcd-2.1, Cde-2.2"
}'

Query Time

There are two query types that you should consider in this use case:

Term-level queries

At their core, term-level queries take the query term as-is and use it for matching without any additional processing. This means that in most cases the user input should directly match the tokens stored in the index. I said SHOULD because there are some more flexible options (see below).

Let’s start with a basic Term query example:

curl -X GET \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_search\?pretty -d \
'{
    "query": {
        "term": {
            "chain": "abc-1.0"
        }
    }
}'

In this case the query term has to match exactly. But we can use the case_insensitive option like this:

curl -X GET \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_search\?pretty -d \
'{
    "query": {
        "term": {
            "chain": {
                "value": "Abc-1.0",
                "case_insensitive": true
            }
        }
    }
}'

Note that both documents match, as they both share the Abc-1.0 ID.

We can also use more flexible queries like Prefix or Wildcards:

# The following query will match both documents, which is correct because
# both of them contain a common chain item starting with `Abc` (using the
# case_insensitive option again).

curl -X GET \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_search\?pretty -d \
'{
    "query": {
        "prefix": {
            "chain": {
                "value": "Abc",
                "case_insensitive": true
            }
        }
    }
}'

# The following query will match only the second document because the first
# document's Bcd chain item ends with `-1.1` and not with `-2.1`.
curl -X GET \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_search\?pretty -d \
'{
    "query": {
        "wildcard": {
            "chain": {
                "value": "B*-2.?",
                "case_insensitive": true
            }
        }
    }
}'

Full-text queries

When it comes to full-text queries, the Match, Match boolean prefix, Match phrase and Match phrase prefix queries seem to be the most useful candidates.

curl -X GET \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_search\?pretty -d \
'{
    "query": {
        "match": {
            "chain": {
                "query": "Bcd-1.1",
                "analyzer": "custom_chain_analyzer"
            }
        }
    }
}'

Note that the Match query allows you to specify an analyzer. Most of the time you want to use the same analyzer that was used during indexing.

You can benefit from the phrase query type if the order of IDs is important. For example if you want to find a document whose IDs were in this order: "Abc-1.0, Cde-1.2, Bcd-1.1" then you can use query like this:

curl -X GET \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_search\?pretty -d \
'{
    "query": {
        "match_phrase": {
            "chain": {
                "query": "Abc-1.0, Cde-1.2, Bcd-1.1",
                "analyzer": "custom_chain_analyzer"
            }
        }
    }
}'

And as expected it will not find any matching document. Note that you can also use the same query without commas: "query": "Abc-1.0 Cde-1.2 Bcd-1.1". This is possible because the analyzer is explicitly specified and we know what it does to the text (i.e. both queries are identical).

But now you can specify a slop to allow for some tolerance.

curl -X GET \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_search\?pretty -d \
'{
    "query": {
        "match_phrase": {
            "chain": {
                "query": "Abc-1.0 Cde-1.2 Bcd-1.1",
                "analyzer": "custom_chain_analyzer",
                "slop": 2
            }
        }
    }
}'

and voilà, the query matches the first document despite the fact that the last two IDs are swapped.

Other things to consider

Besides the basic concepts described so far, there are also a few alternatives, briefly mentioned below.

Pattern Tokenizer

An alternative to using the Whitespace tokenizer would be the Pattern tokenizer (there are a couple of them), but because it is not documented yet I will not go into much detail. Basically, you can replace the comma-removing CharFilter plus the Whitespace tokenizer with a single Pattern tokenizer whose expression matches a comma followed by optional whitespace, as sketched below.
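
Just to illustrate the idea, here is a minimal _analyze sketch using the pattern tokenizer. The regular expression (a comma followed by optional whitespace) is my assumption; adjust it to your data:

curl -s -X GET \
     -H "Content-Type: application/json" \
     ${OSH}/_analyze\?pretty\&filter_path=tokens.token -d \
'{
    "tokenizer": {
        "type": "pattern",
        "pattern": ",\\s*"
    },
    "filter": ["lowercase"],
    "text": "Abc-1.0, Bcd-1.1, Cde-1.2"
}' | jq '.tokens[].token'

# Expected output (the same tokens as with the CharFilter + whitespace tokenizer):
"abc-1.0"
"bcd-1.1"
"cde-1.2"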

Split Processor

Another great alternative would be using an ingest pipeline with a split processor. But again, the documentation for this is unfortunately missing.
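
In case you want to experiment with it anyway, a rough sketch could look like this (the pipeline name split_chain and the separator regex are just my assumptions):

# Create an ingest pipeline that splits the comma separated chain field
# into an array of individual IDs.
curl -X PUT \
     -H "Content-Type: application/json" \
     ${OSH}/_ingest/pipeline/split_chain -d \
'{
    "description": "Split the chain field into an array of IDs",
    "processors": [
        {
            "split": {
                "field": "chain",
                "separator": ",\\s*"
            }
        }
    ]
}'

# Test it with the simulate API (no document is actually indexed).
curl -s -X POST \
     -H "Content-Type: application/json" \
     ${OSH}/_ingest/pipeline/split_chain/_simulate\?pretty -d \
'{
    "docs": [
        { "_source": { "name": "foo", "chain": "Abc-1.0, Bcd-1.1, Cde-1.2" } }
    ]
}' | jq '.docs[0].doc._source.chain'

# Expected output:
[
  "Abc-1.0",
  "Bcd-1.1",
  "Cde-1.2"
]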

Fuzzy term-level query

When using term-level queries (as discussed above) you can add one more query type to your toolbox. As of writing it is for some reason missing from https://opensearch.org/docs/2.3/opensearch/query-dsl/term/ but there is also a Fuzzy term-level query.
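
A minimal sketch of such a query against our example index (the misspelled value abd-1.0 is intentional, to show the fuzziness in action):

# The fuzzy query tolerates small differences between the query term and the
# indexed tokens, here a single-character typo (abd instead of abc).
curl -X GET \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_search\?pretty -d \
'{
    "query": {
        "fuzzy": {
            "chain": {
                "value": "abd-1.0",
                "fuzziness": "AUTO"
            }
        }
    }
}'

# Both documents should match because they both contain the abc-1.0 token,
# which is within the allowed edit distance of the misspelled query value.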

Mind the similar looking characters

If your IDs contain hyphens, consider that there are several characters that look almost identical: the em dash (—), the en dash (–) and the hyphen (-).

For example, your document may contain an ID like Abc–1.0 while the user queries for Abc-1.0. One of them uses an en dash and the other a hyphen. Can you tell the difference? In any case, these two will not match!

In order to mitigate this problem you can consider mapping them to a single common character.

For example, we can extend the mappings in the CharFilter to transform both the em dash and the en dash into a plain hyphen:

"char_filter": {
    "ID_normalization": {
        "type": "mapping",
        "mappings": [
            ", => ",
            "— => -",
            "– => -"
        ]
    }
}

Or maybe the delimiter is not really important for querying, in which case you can simply remove any kind of dash-like character:

"char_filter": {
    "ID_normalization": {
        "type": "mapping",
        "mappings": [
            ", => ",
            "— => ",
            "– => ",
            "- => ",
        ]
    }
}

Aggregations & Sorting (the Elephant in the room)

Maybe at some point you would like to use aggregations to get some statistics about individual tokens.

For example, you could use the terms aggregation to get the most frequently occurring IDs (in case they are not all unique). But you can run into an unexpected error:

curl -X GET \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_search -d \
'{
    "size": 0,
    "aggs": {
        "chain_items": {
            "terms": {
                "field": "chain",
                "size": 10
            }
        }
    }
}'

ERROR: Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [chain] in order to load field data by uninverting the inverted index. Note that this can use significant memory.
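
If you just want to experiment, one option that the error message itself offers is to enable fielddata on the chain text field (mind the memory warning). A quick sketch, assuming the index_example mapping from above:

# Enable fielddata on the existing text field (this can use significant memory!)
curl -X PUT \
     -H "Content-Type: application/json" \
     ${OSH}/index_example/_mapping -d \
'{
    "properties": {
        "chain": {
            "type": "text",
            "analyzer": "custom_chain_analyzer",
            "fielddata": true
        }
    }
}'

# After this the terms aggregation above should return buckets of the
# individual (analyzed) tokens, with "abc-1.0" being the most frequent one.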

A proper treatment of aggregations and sorting is, however, a topic for another discussion…

HTH,
Lukáš