# Polkadot validator went offline
Due to a memory issue on v0.9.2, the Polkadot chain halted at 11:30 UTC on May 24th, 2021. Parity issued an emergency announcement to downgrade to v0.8.30.
We saw that our Polkadot nodes had stopped syncing and that a restart did not help.
The Polkadot deployment is done with Terraform and Google Cloud Build. We use a custom entrypoint shell called `push_containers`. A custom container is built on top of the Polkadot container version passed as a parameter, with the entrypoint script injected and some extra packages like `jq` and `xxd` added.
We performed the downgrade by passing version v0.8.30 to the Terraform module variable responsible for MIDL Polkadot deployments. However, after the deployment completed, the Polkadot version was still v0.9.2.
We verified that the image pull policy of the statefulset was set to `Always` and that at each pod restart, the right image digest would indeed be pulled.
We then attempted several things:
Meanwhile, the epoch ended and the validator was marked offline.
The custom container Dockerfile contained an `apt-get upgrade` instruction, which upgraded the container to v0.9.2 even though the origin container had version v0.8.30. As a result, even with the origin container set to the right version, the final image always had the latest released version.
We removed the `apt-get upgrade` instruction, and the pipeline now behaves as expected. We then re-declared the nominator's intention to validate at 17:55 UTC on May 24th, 2021.
Commit of the fix: https://github.com/midl-dev/polkadot-k8s/commit/05364f80e21636094d3e39283b546b649dce4ef2
A couple of weeks ago we noticed that the container version was not always set as expected based on the config, but we did not take further action.
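A mismatch like this could be caught automatically by comparing the version configured for the deployment with the version the built image actually reports. The following is only a minimal sketch of such a check, not part of our pipeline. The helper names are hypothetical, and it assumes the output of `polkadot --version` has a shape like `polkadot 0.9.2-<commit>-x86_64-linux-gnu`.

```python
import re
from typing import Optional

# Assumption: both the configured tag (e.g. "v0.8.30") and the output of
# `polkadot --version` contain a plain x.y.z version number.
_VERSION_RE = re.compile(r"(\d+\.\d+\.\d+)")


def parse_version(text: str) -> Optional[str]:
    """Return the first x.y.z version number found in text, or None."""
    match = _VERSION_RE.search(text)
    return match.group(1) if match else None


def image_matches_config(expected_tag: str, version_output: str) -> bool:
    """True when the version reported by the image equals the configured tag."""
    expected = parse_version(expected_tag)
    actual = parse_version(version_output)
    return expected is not None and expected == actual
```

A build step could run the freshly built image with `--version`, pass the captured output to `image_matches_config`, and fail the build when it returns `False`.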
We have been following a policy of rebuilding containers to inject entrypoints. In hindsight, it appears safer to always use the origin containers whenever possible, and inject the entrypoint with a configmap, because:
We will be doing that going forward.
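To illustrate the configmap approach, here is a rough sketch that generates the two Kubernetes objects involved: a ConfigMap holding the entrypoint script, and a pod spec fragment that runs the unmodified origin image with that script mounted. The object names, the mount path `/scripts`, the image name used in the usage note, and the presence of `/bin/sh` in the origin image are all assumptions made for illustration. This is not our actual manifest.

```python
from typing import Any, Dict


def entrypoint_configmap(name: str, script: str) -> Dict[str, Any]:
    """Build a ConfigMap manifest that stores the entrypoint script."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": {"entrypoint.sh": script},
    }


def pod_spec_fragment(image: str, configmap_name: str) -> Dict[str, Any]:
    """Build a pod spec fragment that runs the origin image unmodified."""
    return {
        "containers": [
            {
                "name": "polkadot",
                # The origin image is used as-is, so no rebuild step can
                # silently change the binary version.
                "image": image,
                # Assumption: the origin image ships /bin/sh.
                "command": ["/bin/sh", "/scripts/entrypoint.sh"],
                "volumeMounts": [
                    {"name": "entrypoint", "mountPath": "/scripts"}
                ],
            }
        ],
        "volumes": [
            {
                "name": "entrypoint",
                # 0o755 serializes as 493, the decimal form JSON requires.
                "configMap": {"name": configmap_name, "defaultMode": 0o755},
            }
        ],
    }
```

For example, `pod_spec_fragment("parity/polkadot:v0.8.30", "polkadot-entrypoint")` pins the container to the requested tag, and serializing either dict with `json.dumps` gives a manifest that `kubectl apply -f` accepts. One trade-off is that the entrypoint script can then only rely on tools already present in the origin image.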