Accroding to the comment 16 bil docs in 15mins by @mhoydis ,
Changes I made:
- Lower the master to 3 only.
- Enabled KNN
- Increased the refresh interval
- Drops more fields in the _source
- Increase the shards to use more NVME for an index.
30~50 billion docs per day and total size between 5.1~6TB with 120 shards. Each shard size is around 40~50GB. The indexing rate is slower than log generating rate from 1 thousand web-service hosts.
green open filebeat-object-server-2022.05.28 u99sQ4ItSuqNAsQj-njd9w 120 0 50,575,992,235 0 6.5tb 6.5tb
#My indexed log example
I’m looking for a way to compare the OpenSearch performance between deploying approaches.
- Running OpenSearch on virtual machine
- Running OpenSearch on the physical machine hosts.
- Running OpenSearch by helm chart in a k8s.
In Using message queue like Kafka from @mhoydis , what’s the benefit having Kafka between log generaotrs and OpenSearch client node?
What’s the best practice for OpenSearch installation?
Is there a basic measure unit such
Indexing per second (IPS) or
docs per second ?
Regards // Hugo
My experience in running Opendistro 7.10 in the past.
3 node master, 18 data nodes. Each node has multiple disks since each is limited to 300 IOPS ( Ceph storage ), run in OpenStack VM
- It is around 4-5Tb log per day before compress
- I use HDD to store those logs:
- For Kong API logs: I use Kafka as data streaming since it can be hit 100k requests per second and Opendistro can not handle them easy
- For K8S logs: ship them directly to open distro with a fluent bit.
I do recommend running on a virtual machine if the budget is limited, for higher performance I would recommend a dedicated machine. And HDD or SSD is depend how you would use it, in my case I use HDD because 90% request is write, 10% request is read
In the other thread you say that you have “12 powerful physical machines”. What exactly is the spec of these “powerful machines”?
Assuming that they truly are powerful, you will get the best performance and easily handle 300K/sec of these relatively simple records by doing the following…
- Get rid of K8s and the network and storage latency that it brings. You can still run OpenSearch in a container and bind mount it to the storage. As with VMs, K8s is for flexibility not ultimate performance.
- Run one OpenSearch node per server. The number of shards should be 2x the number of data nodes, in this case 24. (NOTE: see below for multi-socket servers.)
- Setup the NVMe drives in RAID-0 and let OpenSearch handle redundancy.
- Use ISM to rollover the index when the primary shard size reaches 20GB.
IMPORTANT: There is an exception to #2 above for multi-socket servers. Crossing the socket boundary will kill OpenSearch performance. If servers are multi-socket, run one OpenSearch node per socket (each with its own dedicate storage mount point) and bind each to a specific socket. Use rack-awareness features of OpenSearch to ensure that primaries and replicas are not placed on data nodes that reside on the same physical server.
Hi @robcowart thx
What exactly is the spec of these “powerful machines”?
6TB x 13 NVMEs
/dev/nvme1n1 5.8T 90M 5.5T 1% /mnt/nvme1n1
/dev/nvme2n1 5.8T 90M 5.5T 1% /mnt/nvme2n1
/dev/nvme3n1 5.8T 90M 5.5T 1% /mnt/nvme3n1
/dev/nvme4n1 5.8T 89M 5.5T 1% /mnt/nvme4n1
/dev/nvme5n1 5.8T 101M 5.5T 1% /mnt/nvme5n1
/dev/nvme6n1 5.8T 101M 5.5T 1% /mnt/nvme6n1
/dev/nvme7n1 5.8T 101M 5.5T 1% /mnt/nvme7n1
/dev/nvme8n1 5.8T 101M 5.5T 1% /mnt/nvme8n1
/dev/nvme9n1 5.8T 101M 5.5T 1% /mnt/nvme9n1
/dev/nvme10n1 5.8T 90M 5.5T 1% /mnt/nvme10n1
/dev/nvme11n1 5.8T 89M 5.5T 1% /mnt/nvme11n1
processor : 79
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
stepping : 7
you will get the best performance and easily handle 300K/sec of these relatively simple records
Very close to our record so far. But NVME disk utilizaiton is pretty low so I think the bottlenect is somewhere else.
Use ISM to rollover the index when the primary shard size reaches 20GB
Under the design, I control the per shard size between 30~50GB (as recommendation from the document). Since the data nodes are running as k8s pod, will the rollove helps? At the same time, it’s always the number of shards handling the incoming traffic. I mean if the shards are 50, it’s still 50 data nodes work for 50 shards even if it’s rollovered.
Setup the NVMe drives in RAID-0 and let OpenSearch handle redundancy
If the bottleneck from the diskIO, this might help. But it doesn’t look like the bottleneck is on diskIO so far.
# sudo numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 515792 MB
node 0 free: 202072 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 516080 MB
node 1 free: 405139 MB
node 0 1
0: 10 32
1: 32 10
Your problem is that you are running WAY TOO MANY data nodes. The fact that your servers are multi-socket makes the issue even worse (due to fabric/memory access latency).
The RAID-0 recommendation is not because of throughput performance, it is because if you reduce the number of data nodes, each node will need access to multiple drives. RAID-0 would be the way to do this.
K8s introduces network and storage latency (this is not the same thing as throughput or utilization), which will increase the wait time and negatively impact any synchronous operations. This makes your excessive number of data nodes even worse. All of inter-node communication has the added latency of the CNI which further hinders it.
We have a simple 5 node cluster, each with 16-core EPYC processors, 128GB RAM and NVMe storage. We can easily beat your ingest rates with our cluster, and that is with network flow records which are much much more complex than your logs.
While it may be “cool” to have 100s of instances running in a K8s cluster, it is NEVER going to offer you the best performance. In the case the simpler solution is the better choice.
@robcowart All the suggestions are really helpful. Single node bare metal gives an impressive indexing rate to our use case from my testing yesterday. To process the log line into fields and output to OpenSearch from 1300 web application servers. Here with the numbers for sharing to the community.
- Per Filebeat can process 4000 lines in a second when there’s no backpressure on the OpenSearch side.
*The OpenSearch on K8s (190 opensearch-data with NVME for each, 150 opensearch-client and 3 masters) - 400k~410k/sec documents. The CPU usage on k8s hosts are not very busy. This proves that the bottleneck may from the latency as @robcowart explained.
- The single bara metal OpenSearch dimr 10NVMEs in path.data - 280K/sec . It’s CPU bound as all CPU threads hit 94~100% which is what I’m looking for.
Monitoring metrics of k8s OpenSearch
Monitoring metrics of bare metal OpenSearch
- To design a new OpenSearch cluster on bare metal. To think about how to full utilize all the 12~24 NVMEs on each.
- Each machine has 768G~1TB RAM. If running an OpenSearch per server, will it gain the benefit from this much memory? To my knowledge, the HEAP memory has limitations of 32 or 64GB per JVM.
- To set up the proper ISM rollover for our use case. 4~8TB per day per index. I’ll need to figure out the right number of shards.
- To have better understanding how rack-awareness features work.
Thanks to @robcowart and @BlackMetalz for sharing the knowledge. I’ll keep posting here for the final performance improvement as a reference for the community.
[Solved] The OpenSearch on dockers(k8s) can run as fast as bare metal with a proper settings.
It’s a bit tricky to configure it. The nature of k8s is for flexibility rather than ultimate performance as @robcowart pointed out earlier.
- Refresh interval setting should be configured to reflect your use case. I’m using 300sec.
- Don’t run OpenSearch with a huge amount of nodes. In case of us with 190 data + 120 client+ 3 master. We’re going to redesign the architecture for the production environment.
- To set exact number of CPU reservation and limit to OpenSearch workloads in k8s. Don’t leave it empty.
According to the test, having all available CPUs to the OpenSearch pod on a single node k8s can achieve similar performance results as bare metal.
Hope this help and thanks for everyone @robcowart @BlackMetalz @reta