Usage of hyphenation_decompounder

Hello,
I want to implement search over documents that contain many German terms, and German is full of compound words.
It seems I can use the hyphenation_decompounder token filter for that:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
      "word_list": ["Kaffee", "zucker", "tasse"]
    }
  ],
  "text": "Kaffeetasse"
}

My problem: I’m using managed OpenDistro from Deutsche Telekom, and I don’t understand how to load hyphenation_patterns.xml onto the cluster. The request above fails with:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Exception while reading hyphenation_patterns_path."
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Exception while reading hyphenation_patterns_path.",
    "caused_by" : {
      "type" : "no_such_file_exception",
      "reason" : "/rds/datastore/elasticsearch/v7.6.2/package/elasticsearch-7.6.2/config/analysis/hyphenation_patterns.xml"
    }
  },
  "status" : 400
}

The Elasticsearch extension endpoints do not seem to be accessible, and neither is POST /_security/api_key.

Any ideas or suggestions on how to implement search for compound words in OpenDistro?

Hi @Meshka and welcome,

In this specific case I would recommend experimenting with this feature locally first. This will help you understand exactly what you need to have in place.

I haven’t used this token filter myself (I have not needed to work with German-like languages, which it is primarily used for), but I think I see what the issue is.

It is looking for the file “analysis/hyphenation_patterns.xml”, which was not found at “/rds/datastore/elasticsearch/v7.6.2/package/elasticsearch-7.6.2/config/analysis/hyphenation_patterns.xml”.

So it seems to expect the file at “<elasticsearch_install_folder>/config/analysis/hyphenation_patterns.xml”. This is why I recommend testing locally first, so that you better understand which file needs to go where.
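
To make the path resolution concrete, here is a minimal Python sketch. It assumes that a relative hyphenation_patterns_path is resolved against the node's config directory, which is what the error message suggests. The function name resolve_patterns_path is hypothetical and is not part of Elasticsearch or OpenDistro.

```python
from pathlib import PurePosixPath


def resolve_patterns_path(config_dir: str, patterns_path: str) -> str:
    """Join a patterns path onto the node's config directory.

    Assumption: a relative hyphenation_patterns_path is resolved against
    <elasticsearch_install_folder>/config, as the error message suggests.
    An absolute patterns_path is returned unchanged by the "/" operator.
    """
    return str(PurePosixPath(config_dir) / patterns_path)


# Config directory copied from the error message in the question.
config_dir = "/rds/datastore/elasticsearch/v7.6.2/package/elasticsearch-7.6.2/config"
expected = resolve_patterns_path(config_dir, "analysis/hyphenation_patterns.xml")
print(expected)
```

The printed path is where the node would look for the file under that assumption, so the file has to exist at that location on every node that runs the analysis.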

I do not know what “managed OpenDistro from Deutsche Telekom” offers you, but I would guess that the “config/analysis/hyphenation_patterns.xml” file is simply not in place. The reason is that in most cases you need to install it yourself before starting the node. Several hyphenation files are available for download on the web, but they usually come with a specific license, and in many cases that license does not play well with AL2 (Apache License 2.0), which means the files cannot easily be distributed with OpenDistro (or OpenSearch). That is why users have to download and install them manually.

BTW: you might also want to check the uschindler/german-decompounder repository on GitHub (data files of the German Decompounder for Apache Lucene / Apache Solr / Elasticsearch) for more info.

I think if you can break this process down into smaller steps that you (or anyone else) can replicate locally, it will be easier to help you as well.

HTH,
Lukáš

Thanks for the response @lukas-vlcek

Finding the file is not the problem; I have it.
I also tried this locally with Docker: I can easily copy the file in, and it works. So having SSH access to the server would solve the problem.

The question is how to copy a config file to the managed version of OpenDistro from Deutsche Telekom.
I don’t have SSH access there, so I can’t just copy the file over. I was looking for API endpoints that could do it, but I cannot find a way to extend the cluster the way Elastic Cloud allows via extensions, for example: https://cloud.elastic.co/deployment-features/extensions
I don’t see any API in OpenDistro that allows such things.

That’s my challenge.

Hi @Meshka,

Unfortunately, I do not know what “OpenDistro from Deutsche Telekom” is. If it is a cloud service, then I recommend contacting their support.
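
One possible workaround, untested against the managed service discussed here: the dictionary_decompounder token filter also splits compounds against a word list but takes no hyphenation_patterns_path, so with an inline word_list it should not need any file on the node. The Python sketch below only builds and prints a body for GET /_analyze and does not contact a cluster. The helper name build_analyze_body is hypothetical, and the word list simply reuses the one from the question, lowercased as a precaution.

```python
import json


def build_analyze_body(text, words):
    """Build a GET /_analyze body that uses dictionary_decompounder.

    Assumption: an inline word_list is enough, so no file has to be
    copied into the node's config directory.
    """
    return {
        "tokenizer": "standard",
        "filter": [
            # Lowercase the tokens first so they match the lowercased word list.
            "lowercase",
            {
                "type": "dictionary_decompounder",
                "word_list": [w.lower() for w in words],
            },
        ],
        "text": text,
    }


body = build_analyze_body("Kaffeetasse", ["Kaffee", "zucker", "tasse"])
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a GET /_analyze request. Dictionary-based splitting takes a brute-force approach, so it is likely to be slower than the hyphenation-based filter and may split words more aggressively; whether that is acceptable depends on the data.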

(Not sure if I helped you much :( )
Lukáš
