The purpose of this RFC (request for comments) is to gather community feedback on a proposal to automatically update the GeoIP database used by the GeoIP processor.
opened 07:36PM - 13 Jan 23 UTC
enhancement
RFC
# Problem Statement
There is a need to add location information, such as city name, country name, or coordinates, for a given IP address during data ingestion in an OpenSearch cluster. Because IP addresses are assigned to organizations and those assignments change over time, the mapping between an IP address and its location information keeps changing. To keep the location information for a given IP address accurate, the mapping data needs to be updated periodically. However, OpenSearch uses static mapping data that never gets updated.
# Current State
OpenSearch has a GeoIP processor with which a user can add location data such as city name, country name, latitude/longitude, and more to a document, based on an IP address in that document. As the mapping data from IP address to location information, OpenSearch uses the GeoLite2 databases that MaxMind provided on 2019/11/19.
OpenSearch gets the GeoLite2 Country, GeoLite2 City, and GeoLite2 ASN database files from a Maven repository and includes them in the build artifact. When a node starts, it prepares the list of available databases by reading the GeoLite2 databases from local disk. When the GeoIP processor is called for the first time after the node starts, it loads the appropriate database into memory and uses it. Users can put their own database in the $OS_CONFIG/ingest-geoip folder, either to override an existing database or to add a new one. However, users have to restart every node to reload the updated database files from disk.
MaxMind updates the GeoLite2 databases twice weekly, but OpenSearch users cannot benefit from the updates because there is no easy way to update the database automatically without restarting the nodes in a cluster.
# Proposal
We want OpenSearch to have a feature that regularly updates the mapping data from IP address to location information without manual intervention, so that users get more accurate location information for an IP address during data ingestion with minimal effort.
# Approach
1. We will run a free database distribution server that hosts the MaxMind database files. The files on the server will be updated regularly. The server will have a manifest file for each database.
2. A user calls an API to create a GeoIP policy with the URL of a manifest file.
3. The OpenSearch cluster parses the manifest file, downloads the GeoIP database file, and makes it ready to be used by a GeoIP processor.
4. While the OpenSearch cluster is preparing the GeoIP database, a request to create a GeoIP processor using the GeoIP policy will fail.
5. Once the GeoIP database is ready, a user can create a GeoIP processor using the GeoIP policy name.
6. In the background, the OpenSearch cluster updates the GeoIP database at the given interval.
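The policy lifecycle in steps 2 to 5 can be sketched as a small in-memory model. This is only an illustration of the "creation fails until the database is ready" behavior, not the proposed API: the class names, the `PREPARING`/`AVAILABLE` states, the example URL, and the returned dictionary are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class PolicyState(Enum):
    PREPARING = "preparing"  # database is still being downloaded and prepared
    AVAILABLE = "available"  # database is ready for GeoIP processors


class PolicyNotReadyError(Exception):
    """Raised when a processor is created before its policy is ready."""


@dataclass
class GeoIpPolicy:
    name: str
    manifest_url: str
    state: PolicyState = PolicyState.PREPARING


class Cluster:
    """Hypothetical stand-in for the cluster-side bookkeeping."""

    def __init__(self):
        self.policies = {}

    def create_policy(self, name, manifest_url):
        # Step 2: the user registers a policy that points at a manifest file.
        policy = GeoIpPolicy(name, manifest_url)
        self.policies[name] = policy
        return policy

    def finish_preparing(self, name):
        # Step 3 completed: the database was downloaded and made ready.
        self.policies[name].state = PolicyState.AVAILABLE

    def create_processor(self, policy_name):
        # Steps 4 and 5: creation fails until the policy is available.
        policy = self.policies[policy_name]
        if policy.state is not PolicyState.AVAILABLE:
            raise PolicyNotReadyError(f"policy {policy_name!r} is still preparing")
        return {"policy_name": policy_name}  # illustrative shape only
```

A caller would create the policy, wait until it becomes available, and only then create processors that refer to it by name.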
## Data flow diagram
<img width="676" alt="Screen Shot 2023-01-13 at 11 17 15 AM" src="https://user-images.githubusercontent.com/1809492/212402515-9e58f357-d4f3-4be8-9c7f-7da307d585d3.png">
## API design
https://github.com/opensearch-project/OpenSearch/issues/5860
# Data format
## Option1. MaxMind Format
One option is to keep using the MaxMind data format and the MaxMind SDK to read the database file, as we do today. A cluster manager node will download the GeoIP database from an external endpoint and store it in an index. It will then notify all ingest nodes to download the new database file from the index. Once every ingest node is ready to use the new database file, the manager node will update a flag in an index. Each ingest node checks the flag in the index to decide whether to start using the new database.
<img width="554" alt="Screen Shot 2022-12-21 at 11 10 44 AM" src="https://user-images.githubusercontent.com/1809492/212402832-7a5575c7-ee0d-41e0-93f0-f5e84c3d30d3.png">
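The flag-based switch described above can be sketched as follows. This is a simplified in-memory model under stated assumptions: the class names are hypothetical, and the way ingest nodes report readiness (an acknowledgement set stored next to the flag) is an assumption, since the RFC does not specify it.

```python
class DatabaseIndex:
    """Stands in for the index that holds the database file and the flag."""

    def __init__(self):
        self.database_version = None  # version currently stored in the index
        self.ready_version = None     # the flag updated by the manager node
        self.acks = set()             # assumed readiness reports from nodes


class IngestNode:
    def __init__(self, node_id, active_version):
        self.node_id = node_id
        self.active_version = active_version
        self.downloaded_version = None

    def download(self, index):
        # Fetch the new database from the index and report readiness.
        self.downloaded_version = index.database_version
        index.acks.add(self.node_id)

    def maybe_switch(self, index):
        # Switch only after the manager has flagged the version as ready.
        if index.ready_version is not None and index.ready_version == self.downloaded_version:
            self.active_version = self.downloaded_version


class ManagerNode:
    def publish(self, index, version):
        # Store a newly downloaded database and reset readiness tracking.
        index.database_version = version
        index.acks.clear()

    def mark_ready_if_all_acked(self, index, nodes):
        # Update the flag once every ingest node holds the new database.
        if all(node.node_id in index.acks for node in nodes):
            index.ready_version = index.database_version
            return True
        return False
```

Under this model, no node starts using the new database until every node has it, so all ingest nodes resolve a given IP address against the same database version.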
### Pros
* Small IP address to location information mapping data size (GeoLite2-City: 70 MB)
* Just a few seconds to prepare data to be used by a GeoIP processor
* Fast data ingestion time (0.05 ms/doc)
### Cons
* Dependency on the MaxMind data format
* Dependency on MaxMind data
### Possible future improvement
* We can have our own binary format and SDK for the mapping data. This will remove the dependency on the MaxMind data format in OpenSearch.
## Option2. OpenSearch Index (Preferred)
In this option, we will use an OpenSearch index. The OpenSearch cluster will download a file in CSV format from an external endpoint. After the download completes, it will put the data into a new index in the cluster. It will also create an index alias pointing to the newly created index.
The index will have a single shard with auto_expand_replicas set to 0-all, so that every node holds a copy and the index can be queried on the same node, which keeps processing time low.
The GeoIP processor will query the index internally to populate location data at ingest time.
<img width="561" alt="Screen Shot 2023-01-05 at 2 17 39 PM" src="https://user-images.githubusercontent.com/1809492/212403016-5040c9ff-1f85-4e0c-afef-c8ad66a9a110.png">
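The combination of "build a new index, then point the alias at it" and "look up an IP address by CIDR range" can be sketched with Python's standard `ipaddress` module. This is an in-memory illustration only: the `GeoIpStore` class, the index names, and the linear scan are assumptions and do not reflect how an OpenSearch index would execute the query.

```python
import ipaddress


class GeoIpStore:
    """In-memory stand-in for the GeoIP indices plus the alias."""

    def __init__(self):
        self.indices = {}  # index name -> list of (network, location data)
        self.alias = None  # name of the index the processor currently reads

    def load(self, index_name, rows):
        # Build the new index completely before moving the alias,
        # so lookups never see a partially loaded database.
        self.indices[index_name] = [
            (ipaddress.ip_network(cidr), data) for cidr, data in rows
        ]
        self.alias = index_name

    def lookup(self, ip):
        # Return the data of the first range containing the address.
        if self.alias is None:
            return None
        address = ipaddress.ip_address(ip)
        for network, data in self.indices[self.alias]:
            if address in network:
                return data
        return None
```

Because lookups always go through the alias, a database update only becomes visible once the new index is complete, and the previous index can then be deleted.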
### CSV file format
1. The first line contains the field name of each column.
2. The first column must be an IP range in CIDR format.
```
//Example
cidr latitude longitude city country
1.0.0.0/24 -37.8333 145.2375 Seattle "United States"
```
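A file in this format could be parsed along the following lines. This is a minimal sketch that assumes the space-separated, double-quoted layout shown in the example; the function name and the `delimiter` parameter are hypothetical, and a comma-separated file would only need a different delimiter.

```python
import csv
import io
import ipaddress

# The sample rows from the example above.
SAMPLE = '''cidr latitude longitude city country
1.0.0.0/24 -37.8333 145.2375 Seattle "United States"
'''


def parse_geoip_file(text, delimiter=" "):
    """Return (header, rows); each row is (network, dict of other fields)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)  # rule 1: the first line holds the field names
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        fields = dict(zip(header, record))
        # Rule 2: the first column is an IP range in CIDR format.
        network = ipaddress.ip_network(fields.pop(header[0]))
        rows.append((network, fields))
    return header, rows
```

Validating the first column with `ipaddress.ip_network` rejects malformed ranges at load time rather than at lookup time.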
### Pros
* No dependency on the MaxMind data format.
* Can use GeoIP databases from providers other than MaxMind, for example the IP2Location database provided by Hexasoft. We cannot provide a free database distribution server for the IP2Location database due to its license, but users can set up their own server.
* Benefits out of the box from future indexing performance improvements in OpenSearch.
### Cons
* Larger data size than option 1. (GeoLite2-City: 400 MB in a segment file, which is 3.5 times larger than option 1)
* Slower to prepare data for use in a processor because of indexing time. (GeoLite2-City: 300 seconds)
* Slower data ingestion than option 1 (0.059 ms/doc), and slower still if the cluster has separate ingest nodes that are not data nodes (0.071 ms/doc).
* Repeated indexing load on the cluster during each database update.
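To put the per-document times from the two options side by side, the following sketch computes the relative slowdown and the implied lookup throughput. It uses only the figures quoted above and assumes, for illustration, that the GeoIP lookup is the only cost and that lookups run one at a time.

```python
# Per-document times quoted in the Pros and Cons lists, in milliseconds.
OPTION1_MS_PER_DOC = 0.05
OPTION2_MS_PER_DOC = 0.059
OPTION2_SEPARATE_INGEST_MS_PER_DOC = 0.071


def overhead_percent(baseline_ms, candidate_ms):
    """Relative slowdown of candidate versus baseline, in percent."""
    return round((candidate_ms / baseline_ms - 1) * 100, 1)


def docs_per_second(ms_per_doc):
    """Sequential lookups per second implied by a per-document time."""
    return round(1000 / ms_per_doc)


# Option 2 is about 18% slower per document than option 1,
# and about 42% slower when ingest nodes are separate from data nodes.
same_node_overhead = overhead_percent(OPTION1_MS_PER_DOC, OPTION2_MS_PER_DOC)
separate_node_overhead = overhead_percent(
    OPTION1_MS_PER_DOC, OPTION2_SEPARATE_INGEST_MS_PER_DOC
)
```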
### Possible future improvement
* Indices created by a GeoIP policy could be stored on ingest nodes. This would prevent the increase in query latency when a cluster has ingest nodes separate from data nodes.
* We could generate the index using the Lucene library directly to reduce indexing time.
# Questions to community
1. Do you want to use your own GeoIP database, or any GeoIP database other than the free GeoLite2 database provided by MaxMind?
2. Which GeoIP databases have you used in OpenSearch: GeoLite2-City, GeoLite2-Country, GeoLite2-ASN, or all of them?
3. How frequently do you want to update the GeoIP database? Once a day or once a week?
4. Do you have ingest nodes that are separate from data nodes in your cluster?
kris
March 23, 2023, 11:54pm
thanks for posting @vamshin - let’s hope you get good feedback from the community!