How to monitor your Ethereum Node in under 5 minutes

Jul 4, 2021 • Odysseas Lamtzidis • ethereum, devops, netdata


This piece is a blog post version of a workshop I gave at EthCC about monitoring an Ethereum Node using Netdata.

Disclaimer: Although we use Netdata, this guide is generic. We talk about metrics that can be surfaced by many other tools, such as Prometheus/Grafana or Datadog.

The contents are as follows:

Introduction to Ethereum Nodes
What is Netdata
How to monitor a system that runs go-ethereum (Geth)
How to monitor go-ethereum (Geth)

Ethereum Nodes

Running a node is no small feat, as it requires more and more resources to store the state of the blockchain and quickly process new transactions.

Nodes are useful for both those who develop on Ethereum (dapp developers) and users.

For users, running a node is crucial because it lets them independently verify the state of the chain. Moreover, with their own node they can both send transactions and read the current state of the blockchain more efficiently. This is important, as a range of activities require the lowest possible latency (e.g. MEV).

For developers, it’s important to run a Node so that they can easily look through the state of the blockchain.

Given this reality, services like Infura or Alchemy have been created to offer “Ethereum Node-as-a-Service”, so that a developer or user can use the provider's Ethereum Node to read the chain or send transactions.

This is not ideal, as users and developers need both the speed of their own node and the lack of dependency on an external actor who can go offline at any time.

Running the Ethereum Node

Thus, running an Ethereum Node is not as fringe an activity as an outsider might expect, but rather a common practice for experienced users and developers. On top of that, running an Ethereum node is one of the core principles of decentralisation. If it becomes very hard or complex, the system becomes increasingly centralised, as fewer and fewer parties will have the capital and expertise required to run a node.

Geth is the most widely-used implementation of the Ethereum Node, written in Go.

The Netdata Agent

The Netdata Agent was released back in 2016 as an open-source project and since then it has gathered over 55K GitHub ✨.

TL;DR of Netdata monitoring:

You run a single command to install the agent.
Netdata will auto-configure itself and detect all available data sources. It will also create sane default alarms for them.
It will gather every metric, every second.
It will produce, instantly, stunning charts about those metrics.

In other words, you don’t have to set up:

a) A dashboard agent
b) A time series database (TSDB)
c) An alert system.

Netdata is all three.

How to monitor your Ethereum Node

EthCC was a blast, not only for the energy of the ecosystem, but also for how our workshop was received by node operators from dozens of projects.

I was stunned to see how many professionals are struggling with monitoring their infrastructure, often using some outdated Grafana Dashboard or the default monitoring system of a cloud provider.

Let’s get right into it.

Preparation

The first order of business is to install Netdata on a machine that is already running Geth.

Note: Make sure you run Geth with the --metrics flag. Netdata expects the metrics server to listen on port 6060 and be reachable on localhost. If you have modified that, you will need to change the collector's configuration so that it points to your custom port.
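Before installing Netdata, it can be worth confirming that the metrics endpoint actually answers. The sketch below is one way to do that, assuming Geth serves Prometheus-format metrics under /debug/metrics/prometheus and that curl is installed. The helper names geth_metrics_url and check_geth_metrics are made up for this example, and the flags shown in the comment are an assumption about how your node is started.

```shell
# Hypothetical helpers for checking Geth's metrics endpoint.
# Assumption: Geth was started with something like
#   geth --metrics --metrics.addr 127.0.0.1 --metrics.port 6060

# Print the metrics URL for a host and port (defaults: 127.0.0.1 and 6060).
geth_metrics_url() {
  printf 'http://%s:%s/debug/metrics/prometheus\n' "${1:-127.0.0.1}" "${2:-6060}"
}

# Return 0 if the endpoint answers successfully within 5 seconds.
check_geth_metrics() {
  curl --silent --fail --max-time 5 --output /dev/null "$(geth_metrics_url "$@")"
}

# Usage, on the node itself:
#   check_geth_metrics && echo "metrics reachable" || echo "metrics NOT reachable"
#   check_geth_metrics 127.0.0.1 6061   # if you moved the metrics server to port 6061
```

If the check fails on the default port but succeeds on a custom one, that custom port is the value the collector configuration needs to point to.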

To install Netdata, run:

bash <(curl -Ss https://my-netdata.io/kickstart.sh)

Visit the Netdata dashboard at <node_ip>:19999.

For illustration purposes, we run a public test Geth server at http://163.172.166.66:19999.
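If you would rather confirm from the terminal that the agent is up before opening a browser, a small check along these lines may help. It is a sketch that assumes the agent listens on the default port 19999 and answers on its /api/v1/info endpoint. The function names netdata_info_url and netdata_is_up are chosen for this example only.

```shell
# Hypothetical helpers for checking that the Netdata agent is reachable.

# Print the agent's info URL for a host and port (defaults: localhost and 19999).
netdata_info_url() {
  printf 'http://%s:%s/api/v1/info\n' "${1:-localhost}" "${2:-19999}"
}

# Return 0 if the agent answers successfully within 5 seconds.
netdata_is_up() {
  curl --silent --fail --max-time 5 --output /dev/null "$(netdata_info_url "$@")"
}

# Usage:
#   netdata_is_up && echo "agent is up" || echo "agent is not answering"
#   netdata_is_up <node_ip>   # check a remote node instead of localhost
```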

Action plan

We will not cover every single metric that is surfaced by Netdata. Instead, we will focus on a few important ones.

For these metrics, we will:

Talk about what the particular system metric means in general.
Discuss how to read these system metrics, no matter the workload.
Analyze how Geth affects these system metrics.

How to read the dashboard

The dashboard is organized into 4 main areas:

The top utility bar, which is particularly important for accessing the time picker and the running alerts.
The main section where the charts are displayed.
The right menu, which organizes our charts into sections and submenus. For example, the system overview section has many different submenus (e.g. cpu) and each submenu has different charts.
The left menu, which concerns Netdata Cloud.

System Overview section

First, we take a look at the System Overview section.

Screenshot: System Overview section

Top-level Gauges

The top-level gauges give a quick overview of the whole system. During sync, we expect to see elevated Disk Read/Write and Net inbound/outbound. CPU usage will be elevated only if there is high use of Geth’s RPC server.
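The same numbers can also be pulled from the agent's REST API, which is handy when you want to watch a sync from a terminal. The sketch below assumes the standard chart IDs system.cpu, system.io and system.net, and the agent's /api/v1/data endpoint. The helper name netdata_chart_url is invented for this example.

```shell
# Hypothetical helper: print the URL that returns the average value of a chart
# over the last N seconds (default: 60) as a single JSON data point.
netdata_chart_url() {
  chart="$1"
  seconds="${2:-60}"
  host="${3:-localhost:19999}"
  printf 'http://%s/api/v1/data?chart=%s&after=-%s&points=1&group=average&format=json\n' \
    "$host" "$chart" "$seconds"
}

# Usage: average CPU, disk I/O and network traffic over the last minute.
#   for chart in system.cpu system.io system.net; do
#     curl --silent "$(netdata_chart_url "$chart")"
#   done
```

Averaging over a window rather than reading a single second smooths out short spikes, which makes it easier to tell sustained sync load from momentary noise.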
