Having built a Ceph cluster with three Raspberry Pi 3 B+s and unsuccessfully tested it with RBD, I’m ready to try CephFS and Rados Gateway. The cluster setup post is here and the test setup and RBD results post is here. Given how poorly RBD performed, I’m expecting similarly poor performance from CephFS. Using the object storage gateway might work better, but I don’t have high hopes for the cluster staying stable under even small loads in either test.
I’m using the same test setup as in the RBD tests: two pools, one using 2x replication and the other using erasure coding. The test client is an Arch Linux system running a 5.1.14 kernel with Ceph 13.2.1, with a quad core CPU, 16GiB of RAM, and a 1GbE connection to the cluster. I’m also running the OSDs with their cache limited to a maximum of 256MiB and the metadata cache limited to 128MiB.
CephFS requires a metadata pool, so I created a replicated pool for metadata. Why not set up filesystems on both data pools at once? CephFS doesn’t currently support multiple filesystems on the same cluster, though there is experimental support for it.
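For reference, the pool and filesystem setup described here can be sketched roughly as below. The pool names, the PG count of 16, the filesystem name, the monitor hostname, and the secret file path are all placeholders picked for illustration, not values taken from this cluster. The helper only prints each command unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: create a replicated metadata pool, build a CephFS filesystem on an
# existing data pool, and mount it with the kernel client.
# Every name below is an illustrative placeholder (assumption).

META_POOL="${META_POOL:-cephfs_metadata}"   # hypothetical metadata pool name
DATA_POOL="${DATA_POOL:-cephfs_data}"       # hypothetical data pool name
FS_NAME="${FS_NAME:-cephfs}"                # hypothetical filesystem name
MON_HOST="${MON_HOST:-pi-mon1}"             # hypothetical monitor hostname
PG_NUM="${PG_NUM:-16}"                      # assumed PG count for a tiny cluster

# Print the command; only execute it when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Create the metadata pool, then tie it and the data pool into a filesystem.
create_cephfs() {
    run ceph osd pool create "$META_POOL" "$PG_NUM" "$PG_NUM" replicated
    run ceph fs new "$FS_NAME" "$META_POOL" "$DATA_POOL"
}

# Mount the filesystem root at /mnt/test using the kernel CephFS client.
mount_cephfs() {
    run mount -t ceph "${MON_HOST}:6789:/" /mnt/test \
        -o name=admin,secretfile=/etc/ceph/admin.secret
}
```

Calling create_cephfs followed by mount_cephfs with the default DRY_RUN prints the three commands so they can be reviewed before anything touches the cluster.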
With the pool created and the CephFS filesystem mounted on my test client, I started the dd write test with a 12GiB file using dd if=/dev/zero of=/mnt/test/test.dd bs=4M count=3072. The first run completed almost instantly, apparently fitting entirely in the filesystem cache. CephFS doesn’t support iflag=direct in dd, so I simply reran the write test knowing that the cache was pretty full at this point. Almost instantly, less than 1GiB into the test, two of the nodes simultaneously fell over and died. They were completely unreachable over the network, but this time I connected an HDMI monitor to them to see the console. I quickly saw that kthread had been blocked for over 120 seconds, and the system was essentially unusable. A USB keyboard was recognized, but key-presses weren’t registering. I power-cycled the systems after waiting at least five minutes, and they came up fine.
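One way to keep the client cache from hiding the real write rate is to have dd flush before it reports. The sketch below assumes GNU dd from coreutils and is not the command used in the runs above; the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: size a dd write test and make dd flush before reporting throughput.
# Assumes GNU dd (coreutils). Function names are hypothetical.

# Number of blocks needed to write size_mib MiB using bs_mib MiB blocks.
dd_count() {
    local size_mib="$1" bs_mib="$2"
    echo $(( size_mib / bs_mib ))
}

# Write size_mib MiB of zeros to target in 4MiB blocks.
# conv=fdatasync makes dd call fdatasync() on the output file before exiting,
# so the reported rate includes flushing cached data out of the client.
write_test() {
    local target="$1" size_mib="$2"
    dd if=/dev/zero of="$target" bs=4M \
        count="$(dd_count "$size_mib" 4)" conv=fdatasync
}

# Example (12GiB = 12288MiB, the same size as the test above):
#   write_test /mnt/test/test.dd 12288
```

The block count for a 12GiB file with 4MiB blocks comes out to 3072, the same count used in the dd command above.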
I tried running the test again, and both hosts quickly locked up. I was running top as they did so, and both hosts rapidly consumed their memory. Despite having the OSD’s cache set to a maximum of 256MiB, the ceph-osd process was using around 750MiB before the system became unresponsive. The OOM killer didn’t kill ceph-osd to save the system in this case, possibly because the kernel couldn’t allocate the memory it needed internally to do so. CephFS seems to hang the Raspberry Pis hard.
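Something like the following could log the same memory growth that top showed, leaving the last readings on screen after a host hangs. It is a sketch that assumes procps ps, which reports rss in KiB; the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: log the resident memory of every ceph-osd process once a second.
# Assumes procps ps, where the "rss" column is in KiB.

# Convert "PID RSS_KiB" lines on stdin into "pid N: M MiB" (rounded down).
format_rss() {
    awk '{ printf "pid %s: %d MiB\n", $1, $2 / 1024 }'
}

# Loop until interrupted, printing a timestamp and the RSS of each ceph-osd.
watch_osd_rss() {
    while true; do
        date +%T
        ps -C ceph-osd -o pid=,rss= | format_rss
        sleep 1
    done
}

# Example, run over SSH on an OSD host:
#   watch_osd_rss
```

Running it in an SSH session from the test client keeps the scrollback on a machine that stays up when the Pi does not.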
I decided to test a bunch of smaller files, because CephFS is meant more as a general filesystem with a bunch of files, whereas RBD tends to get used to store large VM images. I used sysbench to create 1024 16MiB files for a total of 16GiB on my 40GiB CephFS filesystem. Initially, things seemed to work fine. Sysbench reported that it created the 1024 files at 6.4MB/s. While this was just test preparation, it seemed to be a good sign.
What didn’t seem such a good sign was when I actually started running the sysbench write test and Ceph started complaining about slow MDS ops. A lot of them. The sysbench write test immediately failed, citing an IO error on the file. Running ls -la showed a lot of 0-byte files, with a couple of 16MiB files. Ugh. I recreated the test setup, this time with writes at a blazing 540kB/s. When it finally finished several hours later, attempting to run the write tests showed the same truncation of files to 0B as before. This seemed to be a sysbench issue, but I didn’t spend much time troubleshooting it.
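To put the two preparation rates in perspective, here is a rough estimate of how long writing the full 16GiB takes if each rate held constant. The exact sysbench options aren't given above, so the invocation in the comments is an assumption based on sysbench 1.0 fileio syntax, and write_seconds is a made-up helper.

```shell
#!/usr/bin/env bash
# Sketch: estimate how long the sysbench prepare step takes at a constant rate.
# Integer arithmetic only; rates are in bytes per second (1kB = 1000 bytes).

# Seconds needed to write total_bytes at rate bytes per second (rounded down).
write_seconds() {
    local total_bytes="$1" rate="$2"
    echo $(( total_bytes / rate ))
}

TOTAL=$(( 1024 * 16 * 1024 * 1024 ))   # 1024 files x 16MiB = 16GiB

# Assumed sysbench 1.0 invocation for a comparable fileio test:
#   sysbench fileio --file-num=1024 --file-total-size=16G prepare
#   sysbench fileio --file-num=1024 --file-total-size=16G \
#       --file-test-mode=seqwr run

echo "at 6.4MB/s: $(write_seconds "$TOTAL" 6400000) s"
echo "at 540kB/s: $(write_seconds "$TOTAL" 540000) s"
```

That works out to roughly 45 minutes at 6.4MB/s and nearly nine hours at 540kB/s, which fits the "several hours" the second preparation run took.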
For completeness, I also tried CephFS on an erasure coded data pool. As with RBD, metadata can’t be stored on an erasure coded pool and needs to be on a replicated pool. Results initially looked better, but the OSDs still exhausted their memory and froze the hosts, though only after a longer time and with more data successfully ingested.
I had intended to test RadosGW with the S3 API, but I decided against it. With two different failed tests behind me, the chances of any test finishing without the cluster dying are pretty low.
The Raspberry Pi 3 B+ doesn’t have enough RAM to run Ceph. While everything technically works, any attempts at sustained data transfer to or from the cluster fail. It seems like a Raspberry Pi, or other SBC, with 2GB+ of RAM would actually be stable, but still slow. The RAM issue is likely exacerbated by the 900kB/s random write rate the flash keys are capable of, but I don’t have faster flash keys or spare USB hard drives to test with.
Erasure coding seems to be better on RAM limited systems, and while it still failed, it always failed later, with more data successfully written. While it may have been more taxing on the Raspberry Pi’s limited CPU resources, these resources were typically in low contention, with usage averaging around 25% under maximum load across all 4 cores.
The release of the Raspberry Pi 4 B happened while I was writing this series of blog posts. I’d love to rerun these tests on three or four of the 4GB models with actual storage drives. The extra RAM should keep the OSDs from running out of memory and dying or taking down the whole system, and the USB 3.0 interconnect means that storage and network access will be considerably faster. They might be good enough to run a small, yet stable cluster, and I look forward to testing on them soon.