Raspberry Pi Ceph Cluster – Testing Part 1

It’s time to run some tests on the Raspberry Pi Ceph cluster I built. I’m not sure if it’ll be stable enough to actually test, but I’d like to find out and try to tune things if needed.

Pool Creation:

I want to test both standard replicated pools and Ceph’s newer erasure coded pools. I configured the replicated pools with 2 replicas. Erasure coded pools are more like RAID: instead of N full replicas spread across the cluster, data is split into chunks, coding (parity) chunks are computed from them, and the data and coding chunks are then spread across the pool. I configured the erasure coded pool with data split into 2 chunks plus 1 additional coding chunk. Practically, this means that both pools should tolerate the same failures (the loss of one copy or one chunk), but the erasure coded pool does so with 1.5x storage overhead instead of 2x.
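The overhead figures above fall out of simple arithmetic. Here is a minimal sketch of it; the function names are mine and are not part of any Ceph tooling.

```python
def replicated_overhead(replicas: int) -> float:
    """Raw bytes stored per byte of user data in a replicated pool."""
    return float(replicas)


def erasure_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data with k data and m coding chunks."""
    return (k + m) / k


# The two pool layouts described in this post.
rep = replicated_overhead(2)
ec = erasure_overhead(k=2, m=1)

print(f"replicated, 2 copies: {rep:.1f}x raw storage")
print(f"erasure coded, k=2 m=1: {ec:.1f}x raw storage")
```

With k=2 and m=1, any one of the three chunks can be lost and the data rebuilt from the other two, which is the same single-failure tolerance as keeping 2 replicas.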

I created one 16GiB pool of each type to test. Why not always use erasure coded pools? They’re more computationally complex, which might be bad on compute-constrained devices such as the Raspberry Pi. They also don’t support the full set of operations. For example, RBD can’t reside completely on an erasure coded pool. There is a workaround: the metadata resides on a replicated pool, with the data on the erasure coded pool.

Baseline Tests:

To get a basic idea of the performance I could expect from the hardware underlying Ceph, I ran network and storage performance tests.

I tested network throughput with iperf between each Raspberry Pi 3 B+ and an external computer connected over Gigabit Ethernet. Each got around 250Mb/s, which is reasonable for a gigabit chip connected via USB2.0. For comparison, a Raspberry Pi 3 B (not the plus version) with Fast Ethernet tested around 95Mb/s. As a control, I also tested the same iperf client against another computer connected over full gigabit at 950Mb/s.
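Network numbers are in megabits and storage numbers are in megabytes, so it helps to convert before comparing them. A small sketch using the three iperf results above (the labels are just my shorthand for the test setups):

```python
def mbit_to_mbyte(mbit_per_s: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbit_per_s / 8


# iperf results quoted in this post, in Mb/s.
links = {
    "Pi 3 B+ (gigabit chip over USB 2.0)": 250,
    "Pi 3 B (Fast Ethernet)": 95,
    "Control machine (full gigabit)": 950,
}

for name, mbit in links.items():
    print(f"{name}: {mbit} Mb/s = {mbit_to_mbyte(mbit):.2f} MB/s")
```

So each Pi 3 B+ tops out at roughly 31MB/s on the wire, before any Ceph or protocol overhead.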

Disk throughput was tested against a flash key with an XFS filesystem, using dd for sequential reads and writes and sysbench’s fileio module for random reads and writes. XFS used to be the recommended filesystem for Ceph until Ceph released BlueStore, and it’s still used for BlueStore’s metadata storage partition. The 32GB flash keys performed at 32.7MB/s sequential read and 16.5MB/s sequential write. Random read and write with 16KiB operations yielded 15MB/s and 0.9MB/s (that’s 900kB/s) respectively.
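The random I/O figures are easier to judge as operations per second. A rough conversion, assuming MB here means 10^6 bytes (the post does not say whether the tool reported MB or MiB):

```python
OP_SIZE_BYTES = 16 * 1024  # 16KiB operations, as used in the random I/O test


def iops(mb_per_s: float, op_size_bytes: int = OP_SIZE_BYTES) -> float:
    """Approximate operations per second, assuming 1 MB = 1,000,000 bytes."""
    return mb_per_s * 1_000_000 / op_size_bytes


random_read_iops = iops(15)
random_write_iops = iops(0.9)

print(f"random read:  ~{random_read_iops:.0f} IOPS")
print(f"random write: ~{random_write_iops:.0f} IOPS")
```

That works out to several hundred random reads per second but only a few dozen random writes per second on each flash key.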

RBD Tests:

RADOS Block Device (RBD) is a block storage service, so you can run your own filesystem but have Ceph’s replication protect the data as well as spread access over multiple drives for increased performance. Ceph currently doesn’t support using pure erasure coded pools for RBD (or CephFS). Instead, the data is stored in the erasure coded pool and the metadata in a replicated pool. Partial writes also need to be enabled per-pool, as per the docs.
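As a sketch of what that setup can look like, the script below only prints the commands rather than running them. The pool, profile, and image names and the placement group count are all hypothetical, since the post doesn't record the exact ones used; treat it as an illustration of the replicated-metadata plus erasure-coded-data layout, not as the commands actually run here.

```python
import shlex

# Hypothetical names and PG count, chosen only for illustration.
META_POOL = "rbd-meta"      # replicated pool holding the RBD metadata
DATA_POOL = "rbd-ec-data"   # erasure coded pool holding the image data
PROFILE = "k2m1"            # erasure code profile: 2 data + 1 coding chunk
IMAGE = "test-image"
PG_NUM = "32"

commands = [
    # Define the erasure code profile and create the EC data pool from it.
    ["ceph", "osd", "erasure-code-profile", "set", PROFILE, "k=2", "m=1"],
    ["ceph", "osd", "pool", "create", DATA_POOL, PG_NUM, PG_NUM, "erasure", PROFILE],
    # Partial writes (overwrites) must be enabled on the EC pool for RBD.
    ["ceph", "osd", "pool", "set", DATA_POOL, "allow_ec_overwrites", "true"],
    # Replicated pool with 2 replicas for the image metadata.
    ["ceph", "osd", "pool", "create", META_POOL, PG_NUM, PG_NUM, "replicated"],
    ["ceph", "osd", "pool", "set", META_POOL, "size", "2"],
    # The image lives in the replicated pool; its data goes to the EC pool.
    ["rbd", "create", "--size", "16G", "--data-pool", DATA_POOL, f"{META_POOL}/{IMAGE}"],
]

for cmd in commands:
    print(shlex.join(cmd))
```

Printing instead of executing keeps the sketch safe to run anywhere; check each command against the docs for your Ceph release before pasting it into a shell on a monitor node.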

Once the pools were created, I mounted each and started running tests. The first thing I wanted to test was just sequential read and write of data. To do this, I made an XFS filesystem with the defaults and mounted it on a test client (Arch Linux, kernel 5.1.14, quad core, 16GiB RAM, 1xGbE, Ceph 13.2.1) and wrote a file using dd if=/dev/zero of=/mnt/test/test.dd bs=4M count=3072 iflag=direct. Initially, writes to the replicated rbd image looked decent, averaging a whopping 6.4MB/s. And then the VM suddenly got really cranky.
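For a sense of scale, here is how much data that dd invocation writes and roughly how long it takes at the observed rate. This assumes dd's `4M` means 4MiB blocks (which is how GNU dd interprets it) and that the 6.4MB/s figure is in units of 10^6 bytes.

```python
BLOCK_SIZE = 4 * 1024 * 1024  # bs=4M
BLOCK_COUNT = 3072            # count=3072

total_bytes = BLOCK_SIZE * BLOCK_COUNT
total_gib = total_bytes / 1024**3

rate_bytes_per_s = 6.4 * 1_000_000  # observed average write speed
minutes = total_bytes / rate_bytes_per_s / 60

print(f"file size: {total_gib:.0f} GiB ({total_bytes} bytes)")
print(f"time at 6.4MB/s: ~{minutes:.0f} minutes")
```

In other words, a single uninterrupted pass at that speed is a bit over half an hour of sustained writing.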

The Ceph manager dashboard’s health status. Not what I’d been hoping for during a performance test.

One host, rpi-node2, had seemingly dropped from the network. Ceph went into recovery mode to keep my precious zeroes intact, and IO basically ground to a halt as the cluster recovered at a blazing 1.3MiB/s. I couldn’t reach the node over SSH, so I power-cycled it. When it came back up, Ceph realized that the OSD was back and cut short the re-balancing of the cluster. I decided to run the write test again, so I deleted the test file and ran fstrim -v /mnt/test, which tells Ceph that the blocks can be freed. Ceph then releases that space on the OSDs, which let me re-run the test fresh.

The second test ended similarly to the first, with dd writing happily at 6.1MB/s until rpi-node3 stopped responding (including to SSH) at 3.9GB written. This time I stopped dd immediately and waited for the node to come back, which it did after almost two minutes. I checked the logs and saw that the system was running out of memory and the ceph-osd process was getting OOM killed. I also noticed that both nodes that had failed were the ones running the active ceph-mgr instance serving the dashboards.

I ran the test again, this time generating a 1GiB file instead, and confirmed that it was the node with the ceph-mgr instance running out of memory. I also let the write finish, testing how well Ceph ran in the degraded state. At 2.8MB/s, no performance records were being set, but the cluster was still ingesting data with one node missing.

I had two options. The first was to move the ceph-mgr daemon to another device, but I wanted the cluster to be self-contained on three nodes, so I opted for the second: lowering the memory usage of the ceph-osd daemon. I looked at the documentation for the BlueStore config, and saw that the default cache size is 1GiB, or as much RAM as the Pi has. That just won’t do, so I added bluestore_cache_size = 536870912 and bluestore_cache_kv_max = 268435456 to my ceph.conf file and restarted the OSDs. This means that BlueStore will use at most 512MiB of RAM for its caches, with only 256MiB maximum for the RocksDB metadata cache.
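Those byte values are just MiB counts multiplied out. The sketch below derives them and prints the lines in ceph.conf form; placing them under an `[osd]` section is my assumption, as the post doesn't say which section they went into.

```python
MIB = 1024 * 1024

# Cache limits from this post, expressed in MiB and converted to bytes.
settings = {
    "bluestore_cache_size": 512 * MIB,
    "bluestore_cache_kv_max": 256 * MIB,
}

# Assumed section header; the options themselves are the ones named above.
lines = ["[osd]"] + [f"{key} = {value}" for key, value in settings.items()]
print("\n".join(lines))
```

Computing the values this way makes it harder to mistype a nine-digit number when trying other cache sizes.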

I reran the 1GiB file test and got a 3.5MB/s write speed with no OSDs getting killed. With the 16GiB file, writes averaged 3MB/s, but around the halfway mark RAM usage eventually got the OSD running on the same node as the active ceph-mgr killed. Again, the cluster survived, just in a less happy state until the killed OSD daemon restarted. I disabled the dashboard, and while this helped RAM usage, the ceph-osd daemon was still getting killed. I further dropped the BlueStore cache to 256MiB, with 128MiB for the metadata store. This time I locked two of the three nodes up hard and needed to power-cycle them.

During the 1GiB file tests, this was the unhappiest the cluster got. The PGs were simply being slow replicating and caught up quickly once the write load subsided.

With the replicated testing being close to a complete failure, I moved on to testing the erasure coded pool. I expected it to be worse due to the increased amount of compute resources needed. I was wrong. I was able to successfully write the test file, and the worst I got was an OSD being slow, not responding to the monitor fast enough, and then recovering a few seconds later. Sequential writes averaged 5.7MB/s and sequential reads averaged 6.1MB/s, but I still had two nodes go down at different times. It seems that erasure coded pools perform slightly better, but can still cause system memory exhaustion.

One thing to note is that even with only three nodes, Ceph never lost any data and was still accepting writes, just very slowly. I hadn’t intended to test Ceph’s resiliency, as that has already been well tested, but it was nice to see that it kept serving reads and writes.

At this point, I don’t think that the 1GiB of RAM is enough to run RBD properly. Sequential writes looked to be around 6MB/s when the cluster hadn’t lost any OSDs or whole hosts. I never attempted to test random access, due to the issues with sequential reads and writes.

CephFS and Rados Gateway:

With RBD being a bit of a bust, I wanted to see if CephFS and Rados Gateway performed better. As this post is getting long, CephFS and RadosGW results are in a second post, along with a conclusion.
