Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch v2.11
Describe the issue:
Hi all, I’m looking for some advices about how to design indexing for documents with “unlimited” fields. We have an excel like application, users can create their own table and add unlimited columns and rows. Typical user can have hundreds of columns and tens of thousands of rows per table. If we have an index per user account then the data is indexed with this format { "id": 1, "table1_col1_name": "", "table1_col2_location": "", "table2_col1_name": "", "table2_col2_phone": "", ... }
. This results in having hundreds of mapped fields and quickly hit the Limit of total fields [1000] in index [ap-index] has been exceeded
error as expected. If we use table level indexing, still we have the same scalability issue.
How can we design an indexing strategy that can support unlimited columns and also has less number of indexes as much as possible? It should also support all feature listed in constraints below.
What’d the best practices in this scenario?
Constraint:
We want to use almost all features of OpenSearch because we have a very flexible search tool, i.e. multi match, term, wildcard, bool_prefix, search as you type, geo distance, aggregation, sorting, pagination etc.
Potential solution:
I’m thinking to have a field to keep all columns in an array but I’m not sure if this limits querying or aggregation somehow. For example the following document:
{
"table1_col1_0_address": {
"addressLine1": "Maecenas tempus mauris"
},
"table1_col1_0_autoid": {
"value": 4028,
"display": "0004028"
},
"table1_col1_0_email": {
"email": "abdulhalim1@assetpanda.com"
},
"table1_col1_0_gps_location": "1,1",
"table1_col1_0_ip": "1.1.1.1",
"table1_col1_0_colours": [
"red"
],
"table1_col1_0_deleted_time": "15:00:00.000+00:00",
"table1_col1_0_updated_date": "2024-01-04",
"table1_col1_1_added_datetime": "2024-01-01T00:00:00.000Z",
"id": "01HPJ56KXAFN7WXF7YGF11STCF"
}
Would be:
{
"id": "01HPJ56KXAFN7WXF7YGF11STCF",
"columns": [
{
"name": "table1_col1_0_address",
"value": {
"address": {
"addressLine1": "Maecenas tempus mauris"
}
}
},
{
"name": "table1_col1_0_autoid",
"value": {
"autoid": {
"value": 4028,
"display": "0004028"
}
}
},
{
"name": "table1_col1_0_email",
"value": {
"email": {
"address": "ap@example.com"
}
}
},
{
"name": "table1_col1_0_gps_location",
"value": {
"gps": "1,1"
}
},
{
"name": "table1_col1_0_deleted_time",
"value": {
"time": "15:00:00.000+00:00"
}
},
...
]
}
Example query:
GET aps-index/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"nested": {
"path": "columns",
"query": {
"bool": {
"must": [
{
"term": {
"columns.name.keyword": "table1_col1_0_email"
}
},
{
"term": {
"columns.value.email.address": "ap@example.com"
}
}
]
}
}
}
},
{
"nested": {
"path": "columns",
"query": {
"bool": {
"must": [
{
"term": {
"columns.name.keyword": "table1_col1_0_autoid"
}
},
{
"range": {
"columns.value.autoid.value": {
"gte": 10,
"lte": 200000
}
}
}
]
}
}
}
}
]
}
},
"aggs": {
"nested_aggs": {
"nested": {
"path": "columns"
},
"aggs": {
"filter_by_column_name": {
"filter": {
"term": {
"columns.name.keyword": "table1_col1_0_autoid"
}
},
"aggs": {
"group_by": {
"terms": {
"field": "columns.value.autoid.value",
"size": 10
}
}
}
}
}
}
}
}
This is an example index mapping:
{
"aps-index": {
"aliases": {},
"mappings": {
"properties": {
"columns": {
"type": "nested",
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"value": {
"properties": {
"address": {
"properties": {
"addressLine1": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"autoid": {
"properties": {
"value": {
"type": "long"
},
"display": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"date": {
"type": "date"
},
"phone": {
"properties": {
"countryCode": {
"type": "long"
},
"number": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
},
"id": {
"type": "long"
}
}
},
"settings": {
"index": {
"replication": {
"type": "DOCUMENT"
},
"number_of_shards": "5",
"provided_name": "aps-index",
"number_of_replicas": "1"
}
}
}
}
This seems working but I wonder what would be the cons of this approach when compared to the previous document schema? One downside of this is that the queries are more complex than the flat structure. Is there any other cons, i.e. performance, scalability etc.?
Thanks in advance!