Correlation Between Two Nested Fields in OpenSearch

Hi all,
I am trying to calculate the correlation between two different variables using OpenSearch. Based on the documentation, I found two relevant aggregation methods: bucket_correlation and matrix_stats . Since bucket_correlation is used for correlating bucket counts with expected values, it does not fit my use case. However, matrix_stats provides correlation between two different numeric fields, which is what I need.

The challenge I am facing is that the OpenSearch documentation demonstrates matrix_stats for two numeric fields at the same level. However, my requirement is to compute the correlation between two different values within a nested field. Below is my index mapping for reference:

Answers field Mapping

{
  "mappings": {
    "properties": {
      "answers": {
        "type": "nested",
        "properties": {
          "question_id": { "type": "integer" },
          "value": { "type": "integer" }
        }
      }
    }
  }
}

Here, answers is an array containing question_id and value, representing responses to different questions. This structure is similar to a Google Form survey, where value holds the response to a given question_id. My goal is to calculate the correlation between responses (value) for two different question_ids.

Query

{
  "size": 0,
  "aggs": {
    "filtered_answers": {
      "nested": {
        "path": "answers"
      },
      "aggs": {
        "filtered": {
          "filter": {
            "terms": { "answers.question_id": [303, 320] }
          },
          "aggs": {
            "matrix_stats_agg": {
              "matrix_stats": {
                "fields": ["answers.value"]
              }
            }
          }
        }
      }
    }
  }
}

This query returns only a single bucket where answers.value is correlated with itself, resulting in a correlation of 1. The response is as follows:

Response

"filtered_answers": {
      "doc_count": 12138,
      "filtered": {
        "doc_count": 3020,
        "matrix_stats_agg": {
          "doc_count": 3020,
          "fields": [
            {
              "name": "answers.value",
              "count": 3020,
              "mean": 4.143046357615902,
              "variance": 0.8314686894700023,
              "skewness": -1.3974214836739975,
              "kurtosis": 5.343217510860543,
              "covariance": {
                "answers.value": 0.8314686894700023
              },
              "correlation": {
                "answers.value": 1
              }
            }
          ]
        }
      }
    }

Since the same field is used for both variables in the correlation calculation, OpenSearch is only computing the correlation of answers.value with itself.

Questions

  1. How can I compute the correlation between values corresponding to two different question_ids within the nested answers field?
  2. Is there an alternative approach to achieving this in OpenSearch?
  3. Does OpenSearch use Pearson’s correlation in matrix_stats aggregation? While the response includes covariance and variance, the documentation does not explicitly mention Pearson’s correlation.

I have explored various query modifications, including using Painless scripts, but have not yet found a working solution. Any guidance on this issue would be highly appreciated.

Thanks!

Hi @Rafee, Looks like it treats the answers.value as a single field rather than distinguishing between different question_ids. Have you considered restructuring the data so that values for different question_ids are stored as separate fields? This would allow matrix_stats to compute correlation directly.

Best,
mj

Hi @Mantas ,

Thank you for your response. The index is already actively used for multiple requirements with the current mapping, making reindexing with a different mapping not feasible. Therefore, I wanted to explore whether it is possible to perform correlation within nested types.

Regards,
@Rafee

Hi @Mantas ,

I would like to know if it is possible to perform correlation analysis between nested fields with different values in OpenSearch. If this cannot be achieved within OpenSearch, I may proceed with using Python libraries like SciPy. I would appreciate your opinion on this.

Thank You!
Regards,
@Rafee