How to design an index with infinite columns support

ap-h · March 23, 2024, 3:01pm

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch v2.11

Describe the issue:
Hi all, I’m looking for some advices about how to design indexing for documents with “unlimited” fields. We have an excel like application, users can create their own table and add unlimited columns and rows. Typical user can have hundreds of columns and tens of thousands of rows per table. If we have an index per user account then the data is indexed with this format { "id": 1, "table1_col1_name": "", "table1_col2_location": "", "table2_col1_name": "", "table2_col2_phone": "", ... }. This results in having hundreds of mapped fields and quickly hit the Limit of total fields [1000] in index [ap-index] has been exceeded error as expected. If we use table level indexing, still we have the same scalability issue.

How can we design an indexing strategy that can support unlimited columns and also has less number of indexes as much as possible? It should also support all feature listed in constraints below.

What’d the best practices in this scenario?

Constraint:
We want to use almost all features of OpenSearch because we have a very flexible search tool, i.e. multi match, term, wildcard, bool_prefix, search as you type, geo distance, aggregation, sorting, pagination etc.

Potential solution:
I’m thinking to have a field to keep all columns in an array but I’m not sure if this limits querying or aggregation somehow. For example the following document:

{
  "table1_col1_0_address": {
    "addressLine1": "Maecenas tempus mauris"
  },
  "table1_col1_0_autoid": {
    "value": 4028,
    "display": "0004028"
  },
  "table1_col1_0_email": {
    "email": "abdulhalim1@assetpanda.com"
  },
  "table1_col1_0_gps_location": "1,1",
  "table1_col1_0_ip": "1.1.1.1",
  "table1_col1_0_colours": [
    "red"
  ],
  "table1_col1_0_deleted_time": "15:00:00.000+00:00",
  "table1_col1_0_updated_date": "2024-01-04",
  "table1_col1_1_added_datetime": "2024-01-01T00:00:00.000Z",
  "id": "01HPJ56KXAFN7WXF7YGF11STCF"
}

Would be:

{
  "id": "01HPJ56KXAFN7WXF7YGF11STCF",
  "columns": [
    {
      "name": "table1_col1_0_address",
      "value": {
        "address": {
          "addressLine1": "Maecenas tempus mauris"
        }
      }
    },
    {
      "name": "table1_col1_0_autoid",
      "value": {
        "autoid": {
          "value": 4028,
          "display": "0004028"
        }
      }
    },
    {
      "name": "table1_col1_0_email",
      "value": {
        "email": {
          "address": "ap@example.com"
        }
      }
    },
    {
      "name": "table1_col1_0_gps_location",
      "value": {
        "gps": "1,1"
      }
    },
    {
      "name": "table1_col1_0_deleted_time",
      "value": {
        "time": "15:00:00.000+00:00"
      }
    },
    ...
  ]
}

Example query:

GET aps-index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "columns",
            "query": {
              "bool": {
                "must": [
                  {
                    "term": {
                      "columns.name.keyword": "table1_col1_0_email"
                    }
                  },
                  {
                    "term": {
                      "columns.value.email.address": "ap@example.com"
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "nested": {
            "path": "columns",
            "query": {
              "bool": {
                "must": [
                  {
                    "term": {
                      "columns.name.keyword": "table1_col1_0_autoid"
                    }
                  },
                  {
                    "range": {
                      "columns.value.autoid.value": {
                        "gte": 10,
                        "lte": 200000
                      }
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "nested_aggs": {
      "nested": {
        "path": "columns"
      },
      "aggs": {
        "filter_by_column_name": {
          "filter": {
            "term": {
              "columns.name.keyword": "table1_col1_0_autoid"
            }
          },
          "aggs": {
            "group_by": {
              "terms": {
                "field": "columns.value.autoid.value",
                "size": 10
              }
            }
          }
        }
      }
    }
  }
}

This is an example index mapping:

{
  "aps-index": {
    "aliases": {},
    "mappings": {
      "properties": {
        "columns": {
          "type": "nested",
          "properties": {
            "name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "value": {
              "properties": {
                "address": {
                  "properties": {
                    "addressLine1": {
                      "type": "text",
                      "fields": {
                        "keyword": {
                          "type": "keyword",
                          "ignore_above": 256
                        }
                      }
                    }
                  }
                },
                "autoid": {
                  "properties": {
                    "value": {
                      "type": "long"
                    },
                    "display": {
                      "type": "text",
                      "fields": {
                        "keyword": {
                          "type": "keyword",
                          "ignore_above": 256
                        }
                      }
                    }
                  }
                },
                "date": {
                  "type": "date"
                },
                "phone": {
                  "properties": {
                    "countryCode": {
                      "type": "long"
                    },
                    "number": {
                      "type": "text",
                      "fields": {
                        "keyword": {
                          "type": "keyword",
                          "ignore_above": 256
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        },
        "id": {
          "type": "long"
        }
      }
    },
    "settings": {
      "index": {
        "replication": {
          "type": "DOCUMENT"
        },
        "number_of_shards": "5",
        "provided_name": "aps-index",
        "number_of_replicas": "1"
      }
    }
  }
}

This seems working but I wonder what would be the cons of this approach when compared to the previous document schema? One downside of this is that the queries are more complex than the flat structure. Is there any other cons, i.e. performance, scalability etc.?

Thanks in advance!

gaobinlong · March 26, 2024, 9:09am

Flat object suits your case, take a look at the documentation for more details: Flat object - OpenSearch Documentation.

ap-h · March 26, 2024, 9:38am

Hey @gaobinlong thanks for the reply. Unfortunately we cant use flat objects because as mentioned in the original post we have a very flexible search functionality and it uses different query, sort, pagination and aggregation features of OpenSearch. From documentation:

Flat objects do not support:

Type-specific parsing.

Numerical operations, such as numerical comparison or numerical sorting.

Text analysis.

Highlighting.

Aggregations of subfields using dot notation.

Filtering by subfields.

Topic		Replies	Views
Index settings total_fields.limit still exist? Index Management configure	1	2264	February 7, 2023
Increase index.mapping.total_fields.limit OpenSearch	0	139	March 5, 2025
How an explosion takes place on an opensearch index/cluster... 🥳 General Feedback discuss , index-management	2	686	September 19, 2023
Configuration recommendation OpenSearch configure , index-management	2	76	November 4, 2024
Query failing for large volume General Feedback	2	224	May 23, 2024

How to design an index with infinite columns support

Related topics