Use case - new to OpenSearch - OpenSearch multilingual hybrid search

I am a young technical entrepreneur with a SaaS idea that depends heavily on OpenSearch. I would really appreciate a review of my use case, along with guidance and feedback on how you would approach it, to help me accelerate my MVP development. I have AWS credits and will use OpenSearch 2.11 (standard) on AWS.

Specifically, I need assistance and feedback on:

End-to-End Indexing Schema and Mapping:

  • Multilingual Support: Help configuring OpenSearch to handle multilingual data effectively, including indexing and searching across multiple languages.
  • Semantic Search: Guidance on implementing semantic search capabilities to find conceptually similar incidents, not just keyword matches.
  • Relevance Scoring: Advice on customizing relevance scoring to balance keyword matching, semantic similarity, and language consistency.
  • Performance Optimization: Recommendations for optimizing search performance given the potentially large dataset and the need for real-time results.
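
To make the schema question concrete, here is a minimal sketch of what an index body for these records could look like, written as a Python dict. Everything named here is an assumption for illustration: the field names (`incident_id`, `title_local`, `embedding`, and so on), the embedding dimension of 384, the choice of the built-in `standard` and `english` analyzers, and the `hnsw` method on the `lucene` engine. It is a starting point for feedback, not a recommended configuration.

```python
# Hypothetical index body for incident records. Field names, analyzers and
# the embedding dimension are placeholders chosen for illustration only.
EMBEDDING_DIM = 384  # assumption: depends entirely on the embedding model used


def build_incident_index_body(embedding_dim: int = EMBEDDING_DIM) -> dict:
    """Return settings and mappings for a hypothetical 'incidents' index."""
    return {
        # k-NN must be enabled on the index for knn_vector fields to be searchable.
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                # Exact-match fields used for lookups and filters.
                "incident_id": {"type": "keyword"},
                "location": {"type": "keyword"},
                # Free text. 'standard' is language-neutral; a per-language
                # analyzer may work better where the language is known.
                "title_local": {"type": "text", "analyzer": "standard"},
                "title_en": {"type": "text", "analyzer": "english"},
                "description_local": {"type": "text", "analyzer": "standard"},
                # Dates used by the review-period filter.
                "occurred_on": {"type": "date"},
                "reported_on": {"type": "date"},
                # Dense vector from a (multilingual) embedding model.
                "embedding": {
                    "type": "knn_vector",
                    "dimension": embedding_dim,
                    "method": {
                        "name": "hnsw",
                        "engine": "lucene",
                        "space_type": "cosinesimil",
                    },
                },
            }
        },
    }
```

With the `opensearch-py` client, a body like this would typically be passed to `client.indices.create(index="incidents", body=build_incident_index_body())`; the index name is again a placeholder.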

Use case:
In the aviation industry, a Safety Management System (SMS) is critical to ensure compliance with regulatory standards and to maintain the highest levels of safety. An Incident refers to any event or occurrence that deviates from standard operating procedures or poses a potential risk during flight operations or ground handling. Documenting incidents is essential for understanding safety issues, investigating root causes, and preventing recurrence. For a global airline or aviation organization with multiple bases and operations, incidents may be recorded across different locations, in multiple languages, and with varying levels of detail. Identifying similar incidents that have occurred in the past is crucial for efficient root cause analysis, trend identification, and the implementation of corrective actions.

Objective:

I aim to develop a SaaS that leverages AWS OpenSearch to allow Safety Managers and Analysts (SMA) to search for incidents from their location and find similar past incidents. This will enable a better understanding of recurring safety issues, prevent further occurrences, and enhance the overall safety process. The system must handle multilingual data and provide relevant results even when incident titles and descriptions are recorded in different languages or contain mixed languages.

Functional Requirements:

  1. Incident Record Structure: The system should store and manage incident records with the following key attributes:
  • Incident ID: A unique identifier for each incident.
  • Location: The airport or base where the incident occurred.
  • Title (Local Language): A short description of the incident, often in the local language.
  • Title (English): An English translation or version of the title.
  • Description (Local Language): A detailed explanation of the incident, typically in the local language.
  • Date of Occurrence: The date when the incident occurred.
  • Date of Reporting: The date when the incident was reported in the system.
  2. Multilingual Data Handling
  • Incident titles and descriptions may be written in English, local languages, or a mixture of both.
  • Users should be able to search in any supported language.
  • The system should return relevant results regardless of the language used in the original incident record.
  3. Search for Similar Incidents
  • The system must allow users to search for incidents similar to a newly recorded incident or based on specific search input.
  • Search results should be based on both keyword matches and semantic similarity.
  • The system should score and rank results based on relevance, prioritizing incidents that are more closely related to the search input.
  4. Location-Specific Results
  • The search functionality should limit results to incidents that occurred within the same location where the new incident is being reported or based on the user’s assigned location.
  • Users should not see incidents from other locations unless explicitly allowed or if the incidents are relevant across multiple locations.
  5. Date-Based Filtering
  • The system must allow users to filter incidents by a date range.
  • Users should be able to search for similar incidents within a specific review period (e.g., incidents reported in the past 6 months) to identify recent trends.
  6. Relevance Scoring: The system must provide a relevance score for each returned incident based on:

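To illustrate the kind of relevance scoring described in the requirements above, here is a small self-contained sketch of one common way to blend keyword and semantic scores: min-max normalize each result list to the 0 to 1 range, then take a weighted average. The function names, the 0.3/0.7 weights, and the sample scores are all made up for illustration, and this is not a claim about how OpenSearch computes scores internally.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lowest, highest = min(scores.values()), max(scores.values())
    if highest == lowest:
        return {doc_id: 1.0 for doc_id in scores}
    return {
        doc_id: (score - lowest) / (highest - lowest)
        for doc_id, score in scores.items()
    }


def combine_scores(
    keyword_scores: dict[str, float],
    semantic_scores: dict[str, float],
    keyword_weight: float = 0.3,   # assumption: illustrative weight
    semantic_weight: float = 0.7,  # assumption: illustrative weight
) -> list[tuple[str, float]]:
    """Return (doc_id, combined_score) pairs, best match first."""
    keyword_norm = min_max_normalize(keyword_scores)
    semantic_norm = min_max_normalize(semantic_scores)
    total_weight = keyword_weight + semantic_weight
    combined = {}
    # A document missing from one list contributes 0.0 for that list.
    for doc_id in set(keyword_norm) | set(semantic_norm):
        combined[doc_id] = (
            keyword_weight * keyword_norm.get(doc_id, 0.0)
            + semantic_weight * semantic_norm.get(doc_id, 0.0)
        ) / total_weight
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: BM25-style keyword scores and cosine-style semantic scores.
ranked = combine_scores(
    {"A": 12.0, "B": 4.0, "C": 8.0},
    {"A": 0.60, "B": 0.90, "D": 0.75},
)
```

Normalizing first matters because keyword scores are unbounded while vector similarity scores usually sit in a narrow range, so a raw sum would let one side dominate.
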
End-to-End Search and Retrieval Workflow:

  1. Initiation
  • The user enters the Incident ID.
  • The system retrieves the incident’s title and description from the SQL data warehouse.
  2. Search Request
  • The title and description are sent via REST API requests, first to an embedding service to obtain embeddings, and then, together with those embeddings, to OpenSearch.
  • Apply a location filter to restrict results to a specific site.
  • Apply a date range filter to restrict results to a particular review period.
  3. Results Retrieval
  • The user receives a list of incidents (ID, title, and description), ranked by relevance based on keyword and semantic similarity scores.
  4. Review
  • The user can review detailed information about past incidents, including their titles, descriptions, and dates of occurrence.
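
The search request step of this workflow could be sketched as the two request bodies below: a search pipeline that normalizes and combines scores, and a `hybrid` query that pairs a keyword clause with a k-NN clause. This is a rough sketch under several assumptions: the field names (`location`, `reported_on`, `embedding`, `title_local`, and so on) are hypothetical, the weights and `k` are arbitrary, and the location and date filters are repeated inside each sub-query because the `hybrid` clause has to be the top-level query. The exact behavior on 2.11 should be checked against the documentation.

```python
from datetime import date


def build_search_pipeline_body(keyword_weight: float = 0.3,
                               semantic_weight: float = 0.7) -> dict:
    """Search pipeline that normalizes and combines hybrid sub-query scores."""
    return {
        "description": "Hypothetical pipeline for incident hybrid search",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        # Weights follow the order of the hybrid sub-queries.
                        "parameters": {"weights": [keyword_weight, semantic_weight]},
                    },
                }
            }
        ],
    }


def build_hybrid_search_body(query_text: str, query_vector: list[float],
                             location: str, reported_from: date,
                             reported_to: date, size: int = 10,
                             k: int = 50) -> dict:
    """Hybrid (keyword + k-NN) query restricted to one location and date range."""
    filters = [
        {"term": {"location": location}},
        {"range": {"reported_on": {"gte": reported_from.isoformat(),
                                   "lte": reported_to.isoformat()}}},
    ]
    keyword_query = {
        "bool": {
            "must": [{
                "multi_match": {
                    "query": query_text,
                    "fields": ["title_local", "title_en", "description_local"],
                }
            }],
            "filter": filters,
        }
    }
    semantic_query = {
        "knn": {
            "embedding": {
                "vector": query_vector,
                "k": k,
                # Filter applied during the k-NN search itself.
                "filter": {"bool": {"filter": filters}},
            }
        }
    }
    return {
        "size": size,
        "_source": ["incident_id", "title_local", "title_en",
                    "description_local", "occurred_on"],
        "query": {"hybrid": {"queries": [keyword_query, semantic_query]}},
    }
```

The pipeline body would be registered once (for example with a `PUT /_search/pipeline/<name>` request) and then referenced by name through the `search_pipeline` parameter on each search request.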