Sat 28. Aug 2021
This week there was an Ethereum node split due to outdated clients. This resulted in the following comment on the ETHSecurity Community channel:
Personally, I think this is true. I also think that little effort is currently put into instructions or shared approaches for running nodes. This becomes even more essential with Ethereum 2. As such, I think it makes sense as a community to share our approaches for a more resilient network. What follows in the next sections is my attempt.
One way to receive software updates is to rely on container image repositories. Instead of downloading new releases and patching our client software by hand, we separate the concerns of persistent storage and the actual process. In my case, I run the erigon client, which receives periodic updates on DockerHub at thorax/erigon. I update these images roughly once a week by pulling the stable tag: /bin/podman pull docker.io/thorax/erigon:stable. I do this via systemd units:
[Unit]
Description=Ethereum 1 mainnet client
Requires=network-online.target
After=network-online.target
[Service]
Restart=always
RestartSec=5s
User=core
Type=simple
ExecStartPre=-/bin/podman kill erigon erigon-rpcdaemon lighthouse lighthouse-vc
ExecStartPre=-/bin/podman rm erigon erigon-rpcdaemon lighthouse lighthouse-vc
ExecStartPre=/bin/podman pull docker.io/thorax/erigon:stable
ExecStart=/bin/podman run \
--name erigon \
-v /var/mnt/ssdraid/eth1/erigon-mainnet:/data:z \
docker.io/thorax/erigon:stable erigon \
--metrics --metrics.port=6060 \
--pprof --pprof.port=6061 \
--private.api.addr=localhost:4090 \
--datadir /data \
--chain mainnet
[Install]
WantedBy=multi-user.target
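For completeness, and assuming the unit above is saved as /etc/systemd/system/erigon.service (the exact path is my assumption, not spelled out here), installing and starting it looks like this:

# reload systemd so it picks up the new unit, then start it on every boot
systemctl daemon-reload
systemctl enable --now erigon.service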
In the unit file you can also see the separation of persistent data: the -v /var/mnt/ssdraid/eth1/erigon-mainnet:/data:z flag mounts a directory on the block device which holds the data the process needs. In theory, I only ever need to point the client process at this directory, and the client process can receive upgrades independently of the data it mounts.
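As a small sketch of what that independence buys, here is the manual equivalent of an image upgrade against the same data directory (names and paths follow the unit above):

# remove the old container; the chain data on the mounted volume is untouched
/bin/podman rm -f erigon
# pull a newer image and run it against the very same data directory
/bin/podman pull docker.io/thorax/erigon:stable
/bin/podman run --name erigon \
  -v /var/mnt/ssdraid/eth1/erigon-mainnet:/data:z \
  docker.io/thorax/erigon:stable erigon --datadir /data --chain mainnet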
I can manually upgrade my client by running systemctl restart erigon, which kills the dependent containers, removes them, and then pulls the newest version. The update propagates to the rpcdaemon and its lighthouse dependents through the systemd Requires and After directives that reference erigon.service in their unit files. For example, consider the rpcdaemon process:
[Unit]
Description=Ethereum 1 client rpcdaemon
Requires=erigon.service
After=erigon.service
[Service]
Restart=always
RestartSec=5s
User=core
Type=simple
ExecStart=/bin/podman run \
--net=container:erigon \
--pid=container:erigon \
--ipc=container:erigon \
-v /var/mnt/ssdraid/eth1/erigon-mainnet:/data:z \
--name erigon-rpcdaemon \
docker.io/thorax/erigon:stable rpcdaemon \
--datadir /data \
--private.api.addr=localhost:4090 \
--http.api=eth,erigon,web3,net,debug,trace,txpool,shh \
--http.addr=0.0.0.0
[Install]
WantedBy=multi-user.target
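To verify which units get pulled along when erigon.service is restarted, systemd can list the reverse dependencies of the unit:

# show the units that Require (and therefore restart together with) erigon.service
systemctl list-dependencies --reverse erigon.service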
Here, the [Unit] section of the unit file has the following lines:
[Unit]
Description=Ethereum 1 client rpcdaemon
Requires=erigon.service
After=erigon.service
This means that erigon-rpcdaemon.service will only launch after erigon.service is running, and, because of Requires, that it is stopped and restarted whenever erigon.service is. The same pattern propagates to lighthouse.service:
[Unit]
Description=Ethereum 2 mainnet client
Requires=erigon-rpcdaemon.service
After=erigon-rpcdaemon.service
...
And from there on to the validator:
[Unit]
Description=Ethereum 2 mainnet client validator
Requires=lighthouse.service
After=lighthouse.service
...
This way, a simple systemctl restart erigon command will cause the whole stack to upgrade.
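As a concrete example of that workflow (the validator unit name lighthouse-vc.service is my guess from the container name; it is not spelled out above):

# restart erigon; Requires= propagates the restart down the chain
systemctl restart erigon.service
# confirm that the whole stack came back up
systemctl status erigon.service erigon-rpcdaemon.service lighthouse.service lighthouse-vc.service --no-pager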
While running the restart operation could be a cron job, the reality is that the nodes also run other software. This includes, but is not limited to, podman, i.e. the container runtime, and the kernel of the system. These updates are arguably as important as the client upgrades to avoid Shellshock-like vulnerabilities creeping in over time.
To avoid dependency conflicts between different processes, and with them the risk of maintenance cascading into chaos, I have taken the approach popularised by the CoreOS Linux distribution. Here, the general idea is that everything except the kernel and the container runtime is a container. And while CoreOS as a company does not exist anymore, the distribution is kept alive by Fedora as Fedora CoreOS.
But how does this help? Well, CoreOS also auto-upgrades the kernel by periodically polling for new releases, and you can configure how these updates are rolled out. By enabling the service with systemctl enable erigon.service, the operating system will start the process on each boot. And in CoreOS, each unattended reboot corresponds to a system upgrade, which, thanks to the podman rm and podman pull operations in the unit, also upgrades the clients automatically. What is thus achieved is that roughly each week, when Fedora releases a new CoreOS version, the nodes download the patches, reboot, and then upgrade the Ethereum client processes. This allows unattended upgrades across all the nodes that I maintain, effectively avoiding chain splits.
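On Fedora CoreOS this is handled by the Zincati update agent driving rpm-ostree; as a rough, manual illustration of what one unattended cycle does:

# inspect the current and any staged OS deployments
rpm-ostree status
# fetch and stage the newest release (normally Zincati does this on its own)
rpm-ostree upgrade
# reboot into the new deployment; enabled units such as erigon.service start
# again, and their ExecStartPre lines pull fresh client images
systemctl reboot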
In theory, this all works fine, but in practice backward compatibility with the mounted persistent storage sometimes breaks. And sometimes the command-line arguments change, which may also cause downtime. The way to catch this is by introducing monitoring to the cluster, for which process-specific approaches already exist via Prometheus and Grafana.
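For instance, the --metrics.port=6060 flag in the erigon unit above exposes a Prometheus endpoint that can be scraped or spot-checked by hand; the path below follows the geth-style convention and may differ between erigon versions:

# quick sanity check that the client is up and exporting metrics
curl -s http://localhost:6060/debug/metrics/prometheus | head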
But Prometheus and Grafana only help as long as the node itself can recover from configuration errors, which it often cannot, because there is no programmatic way to apply patches to the filesystem and the process arguments. To resolve this, we should first test that the upgrade does not cause downtime, and only then apply it. While I have not configured these to work automatically, the tools already exist in the form of Linux checkpointing.
Luckily, the checkpoint operations also apply to containers via CRIU. In essence, CRIU allows containers to be frozen and their state pushed to a remote computer while the host itself tries to upgrade the system. Interestingly, this also extends to kernel upgrades via seamless kernel updates. This way, it could be possible to devise an approach that works roughly as follows:
1. Checkpoint the running containers with CRIU before touching the host (a sketch of this step follows below).
2. Test whether the upgraded system comes up via kexec, and if so, boot into it and pull the latest container images. Otherwise, notify the ops team (that's me!) that the upgrade breaks something.
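For the checkpoint step, podman already exposes CRIU through its checkpoint and restore subcommands; a minimal sketch, with the archive path and standby hostname as placeholders:

# freeze the running client and export its state to an archive (needs root and CRIU)
podman container checkpoint --export /tmp/erigon-checkpoint.tar.gz erigon
# move the state to a standby host while this one upgrades (placeholder host)
scp /tmp/erigon-checkpoint.tar.gz standby-host:/tmp/
# on the standby host, resume the container where it left off
podman container restore --import /tmp/erigon-checkpoint.tar.gz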