Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
2.12+
Describe the issue:
We recently upgraded from OpenSearch 2.11 to 2.12.
Previously we were able to build a Docker image that started OpenSearch, populated the index, and saved the result as a new image. We could then run that pre-populated image in our tests.
Since upgrading to 2.12, we get the following Lucene error when starting the pre-populated image:
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2024-05-27T22:02:00.14795819Z, (lock=NativeFSLock(path=/usr/share/opensearch/data/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2024-05-27T22:01:23.629914316Z))
at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:179)
at org.opensearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:1149)
at org.opensearch.env.NodeEnvironment.nodeDataPaths(NodeEnvironment.java:900)
at org.opensearch.env.NodeEnvironment.assertCanWrite(NodeEnvironment.java:1373)
at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:376)
at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:301)
at org.opensearch.node.Node.<init>(Node.java:535)
at org.opensearch.node.Node.<init>(Node.java:417)
at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242)
at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242)
at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404)
at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:181)
at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:172)
at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104)
at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138)
at org.opensearch.cli.Command.main(Command.java:101)
at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:138)
at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:104)
For complete error details, refer to the log at /usr/share/opensearch/logs/opensearch-cluster.log
It seems that the lock file node.lock was created during the initial Docker image build that populates the index. When the pre-baked image is then started, Lucene appears to object that the lock file's creation time no longer matches the one it recorded when the lock was taken.
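One way to test this theory would be to print the lock file's timestamps wherever the file exists, for example during the image build and again in a container started from the pre-baked image. The following is only a sketch: it assumes GNU coreutils `stat` is available in the image, `show_lock_times` is a helper name made up here, and the default path is copied from the error message above.

```shell
#!/bin/sh
# Print the birth, modification and status-change times of a file.
# Assumes GNU coreutils stat; %w prints "-" when the filesystem
# does not report a birth (creation) time.
show_lock_times() {
    stat -c 'birth=%w modify=%y change=%z' "$1"
}

# Default path taken from the error message; override via LOCK_FILE.
LOCK_FILE="${LOCK_FILE:-/usr/share/opensearch/data/nodes/0/node.lock}"
if [ -e "$LOCK_FILE" ]; then
    show_lock_times "$LOCK_FILE"
fi
```

If the timestamps printed in the two environments differ for the same file, that would at least be consistent with the creation-time mismatch reported in the exception.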
Our Dockerfile to create the pre-baked image looks like this:
FROM opensearchproject/opensearch:latest
RUN opensearch -p pid_file -E discovery.type=single-node -E http.port=9201 -d && \
# script which populates index && \
cd /usr/share/opensearch && \
kill `cat pid_file` && \
rm -f /usr/share/opensearch/data/nodes/0/node.lock
Once this is built, simply running it as follows fails:
docker run -it my-prebaked-os-image:2.12-SNAPSHOT
I’ve tried removing the script which populates the index, but I get the same error (i.e. simply building a new Docker image in which OpenSearch has been started once prevents it from being run again in a later test).
This used to work fine in OpenSearch 2.11. I assume it may be related to the new version of Lucene - but I see the file lock check code in Lucene is many years old:
ore/src/java/org/apache/lucene/store/NativeFSLockFactory.java#L180
Is there a way I can cleanly shut down OpenSearch and have it restart without hitting this error? I’ve tried killing the process and physically removing the lock file, but the error remains.
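One thing I have not ruled out is that `kill` returns as soon as the signal is sent, so the build step may finish while OpenSearch is still shutting down. Below is a minimal sketch of a stop-and-wait helper that could replace the plain `kill` in the Dockerfile. `stop_and_wait` is a hypothetical name, the 60-second default is arbitrary, and I don't know whether this has any effect on the lock error.

```shell
#!/bin/sh
# Send SIGTERM to a process and wait until it has exited.
# Arguments: PID, and an optional timeout in seconds (default 60).
# Returns 0 once the process is gone, 1 if it is still running
# after the timeout.
stop_and_wait() {
    pid="$1"
    timeout="${2:-60}"
    # kill fails if the process no longer exists (or cannot be signalled);
    # in that case there is nothing to wait for.
    kill "$pid" 2>/dev/null || return 0
    # kill -0 sends no signal; it only checks that the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$timeout" -le 0 ]; then
            return 1
        fi
        sleep 1
        timeout=$((timeout - 1))
    done
    return 0
}
```

In the Dockerfile above this would be called as `stop_and_wait "$(cat pid_file)" 60` before removing node.lock, so that the lock file is only deleted after the process has actually gone.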
Configuration:
Relevant Logs or Screenshots: