Hi @maxfriz
Let’s assume that OpenSearch is running on localhost:9200.
# OpenSearch host shortcut
export OSH='localhost:9200'
# -------------------------------------
# Check we can talk to cluster.
curl -I ${OSH}
# -------------------------------------
# Let's clean up the cluster first.
curl -X DELETE ${OSH}/_all
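If you plan to re-run these steps, you can also delete just the example index instead of wiping everything:
# -------------------------------------
# Alternative: delete only the example index.
curl -X DELETE ${OSH}/index_example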
# -------------------------------------
# Let's create a new index and define example
# analyzers for it.
curl -X PUT \
-H "Content-Type: application/json" \
${OSH}/index_example -d \
'{
  "settings": {
    "analysis": {
      "analyzer": {
        "analyzer_one": {
          "type": "keyword"
        },
        "analyzer_two": {
          "type": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "field_one": {
        "type": "text",
        "analyzer": "analyzer_one"
      },
      "field_two": {
        "type": "text",
        "analyzer": "analyzer_two"
      }
    }
  }
}'
Now we have an empty index index_example with two analyzers defined: analyzer_one, which works the same way as the keyword analyzer, and analyzer_two, which works the same way as the standard analyzer (this is what text fields use by default).
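If you want to double-check that both analyzers were registered, you can read the index settings back:
# -------------------------------------
# Verify the analyzers in the index settings.
curl ${OSH}/index_example/_settings?pretty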
Hint: If you are using zsh (e.g. you are a Mac user) then I recommend running unsetopt nomatch in your shell before continuing. For more details see https://github.com/ohmyzsh/ohmyzsh/issues/31#issuecomment-359728582
Let's see how these analyzers work on text. First, the analyzer_one analyzer (keyword):
curl -X GET \
-H "Content-Type: application/json" \
${OSH}/index_example/_analyze?pretty -d \
'{
  "text": "The quick brown fox jumps over the lazy dog",
  "analyzer": "analyzer_one"
}'
# Returns
{
  "tokens" : [
    {
      "token" : "The quick brown fox jumps over the lazy dog",
      "start_offset" : 0,
      "end_offset" : 43,
      "type" : "word",
      "position" : 0
    }
  ]
}
Second, for the analyzer_two analyzer (standard):
curl -X GET \
-H "Content-Type: application/json" \
${OSH}/index_example/_analyze?pretty -d \
'{
  "text": "The quick brown fox jumps over the lazy dog",
  "analyzer": "analyzer_two"
}'
# Returns
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "fox",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "jumps",
      "start_offset" : 20,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 26,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "the",
      "start_offset" : 31,
      "end_offset" : 34,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "lazy",
      "start_offset" : 35,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "dog",
      "start_offset" : 40,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 8
    }
  ]
}
As you can see, the first analyzer produced a single token, while the second produced a number of individual tokens (one for each word). These tokens are the key units of information that get stored in Lucene fields.
The wildcard query is a term-level query, which means it operates on the individual terms stored in the document fields.
Now, let’s index a document and test how it works:
# -------------------------------------
# Index the following document.
curl -X PUT \
-H "Content-Type: application/json" \
${OSH}/index_example/_doc/1 -d \
'{
  "field_one": "The quick brown fox jumps over the lazy dog",
  "field_two": "The quick brown fox jumps over the lazy dog"
}'
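As a side note, if you want to inspect exactly which terms ended up stored for each field of this document, the _termvectors API can show you (a minimal sketch; the term vectors are computed on the fly here because we did not store them in the mapping):
# -------------------------------------
# Inspect the terms stored for document 1.
curl ${OSH}/index_example/_termvectors/1?fields=field_one,field_two\&pretty
You should see a single term for field_one and eight distinct terms for field_two ("the" occurs twice, so its term frequency is 2).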
# -------------------------------------
# Test wildcard search on field_one
curl -X GET \
-H "Content-Type: application/json" \
${OSH}/index_example/_search?filter_path=hits.hits\&pretty -d \
'{
  "query": {
    "wildcard": {
      "field_one": "?he*"
    }
  }
}'
# Returns
{
  "hits" : {
    "hits" : [
      {
        "_index" : "index_example",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "field_one" : "The quick brown fox jumps over the lazy dog",
          "field_two" : "The quick brown fox jumps over the lazy dog"
        }
      }
    ]
  }
}
But if you replace the ?he* with ?he or ?uic? then the document will not be found, because the only term stored in field_one is the entire sentence.
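You can verify this yourself; the request is the same as above with just the pattern swapped:
# -------------------------------------
# Negative test: "?uic?" cannot match the single
# keyword term stored in field_one.
curl -X GET \
-H "Content-Type: application/json" \
${OSH}/index_example/_search?filter_path=hits.hits\&pretty -d \
'{
  "query": {
    "wildcard": {
      "field_one": "?uic?"
    }
  }
}'
# Returns no hits, because "?uic?" would have to match
# the whole sentence, character for character.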
These patterns will, however, work fine on field_two. For instance:
# -------------------------------------
# Test wildcard search on field_two
curl -X GET \
-H "Content-Type: application/json" \
${OSH}/index_example/_search?filter_path=hits.hits\&pretty -d \
'{
  "query": {
    "wildcard": {
      "field_two": "?uic?"
    }
  }
}'
# Returns
{
  "hits" : {
    "hits" : [
      {
        "_index" : "index_example",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "field_one" : "The quick brown fox jumps over the lazy dog",
          "field_two" : "The quick brown fox jumps over the lazy dog"
        }
      }
    ]
  }
}
In short, you can use term-level queries against text fields (or any other fields that contain terms), but you need to keep in mind that the query operates at the term level. I demonstrated above how you can find out how many terms are created from your text. The more terms are stored in a field, the more resources are needed to iterate over them. That is not to say you should avoid it, but you should measure its impact.
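For instance, a quick way to measure how many terms a given piece of text produces is to pipe the _analyze output through jq (assuming you have jq installed):
# -------------------------------------
# Count the tokens produced by an analyzer.
curl -s -X GET \
-H "Content-Type: application/json" \
${OSH}/index_example/_analyze -d \
'{
  "text": "The quick brown fox jumps over the lazy dog",
  "analyzer": "analyzer_two"
}' | jq '.tokens | length'
# Prints 9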
HTH,
Lukáš