Creating an Autocomplete Using OpenSearch.org Docs

Hello,

Thank you for checking out my post. I need some help writing a an aggregation query! I started off using the docs on opensearch.org under the search experience. I used the edge-ngram-filter as described.

Everything is going great. But at the end, I want to group/aggregate the results. This is where I need help. The last phase.

I’m using OpenSearch to index UK postcode data as part of a auto-complete. Here’s the what the data looks like:

{
  "address_1": "1",
  "address_2": "Ashwood Park",
  "address_3": "Bridge Of Don",
  "address_4": "ABERDEEN",
  "postcode": "AB22 8PR",
  "point": "POINT(57.200058,-2.122866)",
  "latitude": "57.200058",
  "longitude": "-2.122866"
}

There are about 33 millions docs.

Using some indexing examples from the OpenSearch.org website, here is the index I created filters for “edge-ngram-filter” and “lowercasing”:

{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "analysis": {
        "analyzer": {
          "autocomplete": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "edge_ngram_filter"]
          }
        },
        "filter": {
          "edge_ngram_filter": {
            "type": "edge_ngram",
            "min_gram": 2,
            "max_gram": 12
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "address_1": {
        "type": "text",
        "analyzer": "autocomplete"
      },
      "address_2": {
        "type": "text",
        "analyzer": "autocomplete"
      },
      "address_4": {
        "type": "text",
        "analyzer": "autocomplete"
      },
      "postcode": {
        "type": "text",
        "analyzer": "autocomplete"
      },
      "point": {
        "type": "geo_point"
      },
      "latitude": {
        "type": "double"
      },
      "longitude": {
        "type": "double"
      }
    }
  }
}

My assignment requires me to search over 4 or 5 “weighted” fields:

  • “postcode^8”
  • “address_2^4”
  • “address_3^2”
  • “address_4^6”

Therefore, I am using a multi_match query body as follows:

{
  "from": 0,
  "size": 100, 
  "query": {
    "multi_match": {
      "query": "AB15 8PS",
      "fields": [
        " postcode^6",
        " address_2",
        " address_3^2",
        " address_4^4"
      ]
    }
  },

  "highlight": {
    "fields": {
      " postcode": {},
      " address_4": {},
      " address_3": {},
      " address_2": {},
    }
  }
}

Everything works until this point. the results come back with each document. But I need to group the _source docs and supply a unique postcode count for each grouped results.

[
  {
    "_index": "postcode-index",
    "_type": "_doc",
    "_id": "53926",
    "_score": 66.87001,
    "_source": {
      "address_1": "5 Small Holdings Whitemyres",
      " address_2": "",
      " address_3": "Kingswells",
      " address_4": "ABERDEEN",
      " postcode": "AB15 8PS",
      " point": "POINT(57.148816,-2.192512)",
      " latitude": "57.148816",
      " longitude": "-2.192512"
    },
    "highlight": {
      " postcode": ["<em>AB15</em> <em>8PS</em>"]
    }
  },
  {
    "_index": "postcode-index-example4",
    "_type": "_doc",
    "_id": "53927",
    "_score": 66.87001,
    "_source": {
      "address_1": "6 Small Holdings Whitemyres",
      " address_2": "",
      " address_3": "Kingswells",
      " address_4": "ABERDEEN",
      " postcode": "AB15 8PS",
      " point": "POINT(57.148816,-2.192512)",
      " latitude": "57.148816",
      " longitude": "-2.192512"
    },
    "highlight": {
      " postcode": ["<em>AB15</em> <em>8PS</em>"]
    }
  }
]

However, I need to group the results, and send back 10 “grouped” results, and the highlight, along with a unique count of postcodes for each grouping. Here is an example output that I am trying to create:
image

I could really use the help, ASAP. Happy to work with a contractor if necessary.

If you don’t know, please pass this on to someone you think might know?

Thank you!

1 Like

Hi Daniel,

I’m trying to understand exactly what you need here. Do you just need to aggregate the “postcode” field? I did the following. I changed the “postcode” field in your mapping to add a keyword field

      "postcode": {
        "type": "text",
        "analyzer": "autocomplete",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },

Then I changed your query

GET test1/_search
{
  "from": 0,
  "size": 100, 
  "query": {
    "multi_match": {
      "query": "AB22 8PR",
      "fields": [
        "postcode^6",
        "address_2",
        "address_3^2",
        "address_4^4"
      ]
    }
  },
  "aggs": {
    "postcode": {
      "terms": {
        "field": "postcode.keyword",
        "size": 10
      }
    }
  }, 
  "highlight": {
    "fields": {
      " postcode": {},
      " address_4": {},
      " address_3": {},
      " address_2": {}
    }
  }
}

Gives these results (for the 1 doc you have above).

{
  "took" : 140,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 5.967817,
    "hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 5.967817,
        "_source" : {
          "address_1" : "1",
          "address_2" : "Ashwood Park",
          "address_3" : "Bridge Of Don",
          "address_4" : "ABERDEEN",
          "postcode" : "AB22 8PR",
          "point" : "POINT(57.200058 -2.122866)",
          "latitude" : "57.200058",
          "longitude" : "-2.122866"
        }
      }
    ]
  },
  "aggregations" : {
    "postcode" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "AB22 8PR",
          "doc_count" : 1
        }
      ]
    }
  }
}

Does that help?

2 Likes

Thank you soo much, Jon!

No, that isn’t quite what I need, but thank you for beig the first to respond!
OpenSearch is a worthy tool!

To put it simply, here is the flow for my autocomplete tool… (basically, suggests grouped addresses by postcode, as you type. With each keystroke the process starts again, and updats the suggestions.

This is how I am currently approaching the query …using node and creating a second query which is not ideal.

    1. Multi-Match query [“four fields ^weighted”]
      returns results of weighted fields
    1. then get “unique postcodes” from those results
    1. then send a new query using “unique postcodes” to get the address
    1. I want to return the grouped address with the “count” of how many postcodes there are.
      We get the grouping by removing the addressLine1 field (house number field) and a few other fields.
    1. return the grouped set back to node for final processing and then return as an object
 "_source" : {
   " address_2" : "Netherly Place",
  " address_3" : "",
  " address_4" : "BALLATER",
  " postcode" : "AB35 5QE",
  " postcode_count" : "16",
}

I was careful to use the indexing methods on the OpenSearch.com Search: autocomplete docs. In my case, I’m using the n-edge-gram method. I also have the .keyword available as well to use in the aggregation of the postcode.

One problem, at this point, is the matches return the _source which has all the fields, including the fields that I dont’ want. So groupiong is very difficult. Here is the complete _source…

"_source" : {
  "address_1" : "Balmoral Bar, 1",
  " address_2" : "Netherly Place",
  " address_3" : "",
  " address_4" : "BALLATER",
  " postcode" : "AB35 5QE",
  " point" : "POINT(57.049628,-3.039931)",
  " latitude" : "57.049628",
  " longitude" : "-3.039931"
}

I know there is a way to continue adding more query or aggregation; as in a “pipeline” in Query DSL, but I am not a wizzard, yet. I’m looking for a wizard to show me how to make this query work for me.

Happy to change the query around to using filters instead or aggs or whatever else, as long as the mult search fields are possible, and the query is fast. there are about 33 million docs_.

Happy to share my screen and do a walk-through using the Dev-Tools in Dashboard if that helps!

Many thanks!

ps - the way I am currently doing it, is I am taking the 1st results in node, parsing the unique postcodes into an array and then .map to a simple query that takes a postcode value. This gives me a ‘stacked’ query that I can send to _msearch. Then, those results are returned to node where I futher create the final objects, which are the key and count value in each bucket.

This method has it’s caveots: such as) I will first need to get a count of the first query (_count) because without specifying the size, the detault returned is 10. If I specify a high number, and there are much less, that causes problems in node …all taking more time on the thread. I would like to take the 6 steps and create just one DSL Query to send to OpenSearch. Also, that is a lot of data to pass back and forth, only to reduce it down and group it. I know there is a way with OpenSaearch, but this truely a job, for a master. :wink:

I can add the count of addreses to the end of the results for each item. However, I don’t want to return the same address for house number: 1, 2, 3

1 mystreet, mytown, mylocality, mypostcode
2 mystreet, mytown, mylocality, mypostcode
3 mystreet, mytown, mylocality, mypostcode
4 mystreet, mytown, mylocality, mypostcode
1 newstreet, mytown, mylocality, mypostcode
2 newstreet, mytown, mylocality, mypostcode
3 newstreet, mytown, mylocality, mypostcode
...

I need to remove the house number and only return the one address line with the postcode count.

This is what I am looking to return.

mystreet, mytown, mylocality, mypostcode (18 matches)
newstreet, mytown, mylocality, newpostcode (10 matches)
oldstreet, mytown, mylocality, oldpostcode (5 matches)
histreet, mytown, mylocality, hipostcode (10 matches)
lowstreet, mytown, mylocality,lowpostcode (210 matches)

Essentially, this is a grouped address (removing house number and sum(postcode). The _source contains the entire address fields. How do I group the fields in _source?

In SQL it would look something like this…

SELECT  addressLine2, addressLine3, addressLine4, postcode, sum(postcode)
FROM    address_source
WHERE   [pardon the cut here but to save time...]
GroupBy addressLine2, addressLine3, addressLine4, postcode

Not sure how to do this with Query DSL.

Thank you very much for your time!
Best,
Daniel