Hi all,
I am trying to calculate the correlation between two different variables using OpenSearch. Based on the documentation, I found two relevant aggregation methods: bucket_correlation and matrix_stats . Since bucket_correlation is used for correlating bucket counts with expected values, it does not fit my use case. However, matrix_stats provides correlation between two different numeric fields, which is what I need.
The challenge I am facing is that the OpenSearch documentation demonstrates matrix_stats for two numeric fields at the same level. However, my requirement is to compute the correlation between two different values within a nested field. Below is my index mapping for reference:
Answers field Mapping
{
"mappings": {
"properties": {
"answers": {
"type": "nested",
"properties": {
"question_id": { "type": "integer" },
"value": { "type": "integer" }
}
}
}
}
}
Here, answers is an array containing question_id and value, representing responses to different questions. This structure is similar to a Google Form survey, where value holds the response to a given question_id. My goal is to calculate the correlation between responses (value) for two different question_ids.
Query
{
"size": 0,
"aggs": {
"filtered_answers": {
"nested": {
"path": "answers"
},
"aggs": {
"filtered": {
"filter": {
"terms": { "answers.question_id": [303, 320] }
},
"aggs": {
"matrix_stats_agg": {
"matrix_stats": {
"fields": ["answers.value"]
}
}
}
}
}
}
}
}
This query returns only a single bucket where answers.value is correlated with itself, resulting in a correlation of 1. The response is as follows:
Response
"filtered_answers": {
"doc_count": 12138,
"filtered": {
"doc_count": 3020,
"matrix_stats_agg": {
"doc_count": 3020,
"fields": [
{
"name": "answers.value",
"count": 3020,
"mean": 4.143046357615902,
"variance": 0.8314686894700023,
"skewness": -1.3974214836739975,
"kurtosis": 5.343217510860543,
"covariance": {
"answers.value": 0.8314686894700023
},
"correlation": {
"answers.value": 1
}
}
]
}
}
}
Since the same field is used for both variables in the correlation calculation, OpenSearch is only computing the correlation of answers.value with itself.
Questions
- How can I compute the correlation between values corresponding to two different question_ids within the nested answers field?
- Is there an alternative approach to achieving this in OpenSearch?
- Does OpenSearch use Pearson’s correlation in matrix_stats aggregation? While the response includes covariance and variance, the documentation does not explicitly mention Pearson’s correlation.
I have explored various query modifications, including using Painless scripts, but have not yet found a working solution. Any guidance on this issue would be highly appreciated.
Thanks!