Three Node Ceph Cluster at Home

I’ve always wanted to use Ceph at home, but triple replication meant that it was out of my budget. When Ceph added Erasure Coding, it meant I could build a more cost-effective Ceph cluster. I had a working file-server, so I didn’t need to build a full-scale cluster, but I did some tests on Raspberry Pi 3B+s to see if they’d allow for a usable cluster with one OSD per Pi. When that didn’t even work, I shelved the idea as I couldn’t justify the expense of building a cluster.

When my file-server started getting full, I decided to build a Ceph cluster to replace it. I’d get more redundancy, easier expansion and refreshed hardware (some of my drives will be 9 years old this summer). I briefly looked at ZFS, but between its limited features and the open legal question of running it on Linux, I quickly ruled it out.

The Cluster – Hardware

Three nodes is generally considered the minimum for Ceph. I briefly tested a single-node setup, but it wasn’t really better than my file-server. So my minimal design is three nodes, with the ability to add more nodes and OSDs if and when my storage needs grow.

A stack of boxes with computer hardware in them, waiting to be built into a storage cluster, sitting on a dining room table.
Some of the hardware waiting to be built into cluster nodes sitting on my dining room table. Not pictured is the first node I built for a single-node test.

I wanted each node to be small and low-ish power, so I was mainly looking at mATX cases that could take 8 3.5″ drives. After much searching, I realized that mATX is basically dead and there aren’t many mATX cases or motherboards out there. Presumably people either get mITX cases and motherboards if they want something small or get full ATX boards if they want lots of on-board peripherals and expandability, leaving mATX as an awkward middle ground.

For the motherboard, I needed (either onboard or with space to add via add-in cards): 8x SATA (or SAS) ports, 2x M.2 slots (at least one 22110), an 8-core CPU (or better), support for 32GB RAM sticks and a single 10GbE port (not SFP+). I found three boards that would work: a SuperMicro board that was incredibly expensive but had everything except the CPU onboard, an AMD X570-based board and an Intel Z390-based board. I quickly ruled out the SuperMicro board on price and the Intel board on the lack of low-power CPUs (to get 8 cores, I was looking at either the 9700K, which isn’t low-power or particularly fast, or the 9900T, which no one could get me). I chose the ASRock X570M Pro4. It had what I needed, and better yet, it supported a 65W, 8-core third-generation Ryzen CPU. I’d been bitten by serious hardware bugs in first-generation Ryzen, so I was a bit wary of trying it again, but Intel had nothing anywhere near competitive.

HDD choice was relatively easy: I created a spreadsheet of every drive model I could find and picked the best $/TB 7200RPM model. That ended up being the 6TB Seagate Ironwolf, which was discontinued in favour of an inferior model after I ordered mine. Luckily, with Ceph, replacements don’t need to be the same size the way they would in RAID6.

SSDs were a little more challenging. I chose inexpensive, but good, M.2 SSDs for the OS drives. I also wanted some SSD-based OSDs. Ceph apparently does not do well on consumer-level SSDs: without power-loss protection they have very slow fsync performance, so I needed a fancier SSD than the Samsung 970 Pros I had been intending to use. I found the Intel P4510/P4511 series and decided on a 2.5″ U.2 P4510. This required an M.2 to mini-SAS adapter and a mini-SAS to U.2 cable to get it connected to the board’s open M.2 slot. Why not use the P4511? No stock on it.

Part | Count | Notes
Fractal Design Node 804 – Case | 3 | 8×3.5″, 2×2.5″, mATX, full ATX PSU
ASRock X570M Pro4 – Motherboard | 3 | mATX, 8x SATA, 2x M.2, PCIe 4.0 x16, x1, x4
Corsair RM550x – Power Supply | 3 | ATX PSU
Seagate 6TB Ironwolf – Hard Drive | 18 | OSD: 6 per node, 108TB raw, ST6000VN0033
Kingston SC2000 250GB – M.2 SSD | 3 | OS/boot drive
Intel P4510 1TB – U.2 SSD | 3 | OSD: 1 per node
AMD Ryzen 7 3700X – CPU | 3 | 65W, 8-core
64GB Corsair LPX – RAM | 3 | One 2x32GB DDR4-3200 kit per node
GeForce GT 710 – Video Card | 3 | One per node
StarTech M.2 to U.2 Adapter Board | 3 | Connects the P4510 SSD to the motherboard
Mini-SAS to U.2 Adapter Cable | 3 | Cable from the StarTech adapter to the SSD
4x SATA Splitter Cable | 6 | One per bank of 4 drives, 2 per node
Corsair ML120 120mm 4-pin Fan | 3 | One on the front of each drive compartment
Aquantia AQ-107 – 10GbE NIC | 3 | One per node
Cat 6a Patch Cable | 3 | One per node
Netgear 8-port 10GbE Switch | 1 | Model XS708E, one for the whole cluster
Cluster hardware.

A small note on networking: I elected not to have separate public and cluster networks; everything uses the same 10GbE network. This simplified setup, both in the host/Ceph configuration and in the physical cabling and switch setup. For a small cluster, the difference shouldn’t matter.
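In ceph.conf terms this just means setting a single public network and leaving cluster_network unset, so replication traffic shares the same interface. A minimal sketch, with the subnet being a placeholder rather than my actual addressing:

[global]
    # no cluster_network defined: OSD replication traffic shares the public network
    public_network = 10.0.0.0/24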

An Ikea Omar rack with three computers as well a UPS at the bottom, two network switches at the top and a bunch of network cabling.
Three cluster nodes in an Ikea Omar wire rack. At the bottom is a 1500VA APC UPS with a 3kVA additional battery. At the top is my core switch, and the cluster’s 10GbE switch.

Other hardware notes: the Fractal Design Node 804 HDD mounts are missing one of the two standard screw holes. The spec for 3.5″ drives only requires the two end holes, but the Node 804’s mounts only support the optional middle hole and the hole nearest the connectors. Most drives 6TB and over lack the middle hole, apparently to allow another platter.

The Cluster – Setup

Setup was pretty straightforward. I used Arch Linux as a base, running Ceph version 14.2.8 (then current). I installed the cluster using the Manual Deployment instructions.
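For the curious, the monitor bootstrap from the Manual Deployment docs boils down to roughly the following, run on the first node; the IP address here is a placeholder and the fsid comes from uuidgen:

# generate the cluster fsid and the initial keyrings
FSID=$(uuidgen)
ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *'
ceph-authtool /tmp/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
# build the initial monitor map and create the first monitor (Alecto, in my case)
monmaptool --create --add alecto 10.0.0.1 --fsid $FSID /tmp/monmap
sudo -u ceph mkdir /var/lib/ceph/mon/ceph-alecto
sudo -u ceph ceph-mon --mkfs -i alecto --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring

After that, OSDs can be created one drive at a time with ceph-volume lvm create --data /dev/sdX.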

Once the cluster was running, it was time to create pools and set up the CephFS filesystem I planned to migrate to. Ceph correctly assigned my hard drives to the hdd device class and the SSDs to the ssd class. I planned to have CephFS backed by an erasure-coded pool, with the durability requirement of surviving the loss of either two drives or one host (but not both at once). CephFS doesn’t allow EC pools to be used as the metadata store, so I created a replicated pool on the P4510 SSDs. I’m keeping the metadata on the P4510s because suggestions online were that putting the metadata pool on SSDs increases performance; I’m not sure it actually makes much difference with the low number of drives I have.
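The metadata pool creation itself is a one-liner; the 32 PGs here are inferred from the cluster status at the end of this post (160 PGs total minus the 128 in the EC data pool) rather than anything carefully tuned:

ceph osd pool create cephfs-metadata 32 32 replicated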

I created a CRUSH rule that will place data only on SSDs with: ceph osd crush rule create-replicated ssd-only default host ssd.

To make sure this CRUSH rule worked, I tested it by:

  1. Dumping a copy of the crushmap using ceph osd getcrushmap -o crushmap.orig
  2. Running crushtool --test -i crushmap.orig --rule 2 --show-mappings --x 1 --num-rep 3 (the number after --rule is the index of the rule to test).
    Results: CRUSH rule 2 x 1 [18,20,19], which are the OSD numbers of my SSDs, exactly as intended.

Finally, I assigned the rule to the metadata pool with ceph osd pool set cephfs-metadata crush_rule ssd-only. Excellent! On to the EC pool.

Three Node Cluster – EC CRUSH Rules

The EC pool took a little more effort to get working. My design goal was for the cluster to survive the failure of either a single node or any two OSDs. To do this, I would minimally need to split each object into four data chunks plus two parity chunks, with no more than two chunks on any one host: losing a host then costs at most two chunks, and losing any two OSDs also costs at most two chunks, either of which a 4+2 erasure code can tolerate.

What the OSD tree looks like:

$ ceph osd tree
ID CLASS WEIGHT    TYPE NAME          STATUS REWEIGHT PRI-AFF
-1       100.97397 root default
-3        33.65799     host Alecto
 0   hdd   5.45799         osd.0          up  1.00000 1.00000
 1   hdd   5.45799         osd.1          up  1.00000 1.00000
 2   hdd   5.45799         osd.2          up  1.00000 1.00000
 3   hdd   5.45799         osd.3          up  1.00000 1.00000
 4   hdd   5.45799         osd.4          up  1.00000 1.00000
 5   hdd   5.45799         osd.5          up  1.00000 1.00000
20   ssd   0.90999         osd.20         up  1.00000 1.00000
-5        33.65799     host Megaera
 6   hdd   5.45799         osd.6          up  1.00000 1.00000
 7   hdd   5.45799         osd.7          up  1.00000 1.00000
 9   hdd   5.45799         osd.9          up  1.00000 1.00000
11   hdd   5.45799         osd.11         up  1.00000 1.00000
14   hdd   5.45799         osd.14         up  1.00000 1.00000
16   hdd   5.45799         osd.16         up  1.00000 1.00000
19   ssd   0.90999         osd.19         up  1.00000 1.00000
-7        33.65799     host Tisiphone
 8   hdd   5.45799         osd.8          up  1.00000 1.00000
10   hdd   5.45799         osd.10         up  1.00000 1.00000
12   hdd   5.45799         osd.12         up  1.00000 1.00000
13   hdd   5.45799         osd.13         up  1.00000 1.00000
15   hdd   5.45799         osd.15         up  1.00000 1.00000
17   hdd   5.45799         osd.17         up  1.00000 1.00000
18   ssd   0.90999         osd.18         up  1.00000 1.00000

I found a blog post from 2017 describing a CRUSH rule that would make this happen, but it was a little light on how to actually apply it. Here’s the process:

  1. Get a copy of the existing crushmap: ceph osd getcrushmap -o crushmap.orig
  2. Decompile the crushmap to plaintext to edit: crushtool -d crushmap.orig -o crushmap.decomp
  3. Edit the crushmap using a text editor. This is the rule I added:
rule ec-rule {
        id 1
        type erasure
        min_size 6
        max_size 6
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 3 type host
        step choose indep 2 type osd
        step emit
}

Some explanation on this rule config:

  • min_size and max_size of 6 restrict this rule to pools that spread their data across exactly six OSDs, matching the k+m=6 of the EC profile below.
  • step take default class hdd restricts placement to the hdd device class, so CRUSH won’t place any chunks on the SSDs.
  • The lines step choose indep 3 type host and step choose indep 2 type osd tell CRUSH to first choose three hosts and then choose two OSDs on each of those hosts.

4. Compile the modified crushmap: crushtool -c crushmap.decomp -o crushmap.new

5. Test the new crushmap: crushtool --test -i crushmap.new --rule 1 --show-mappings --x 1 --num-rep 6.
In my case, this resulted in CRUSH rule 1 x 1 [9,16,8,13,5,0], which shows placement on 6 OSDs, with two per host.

6. Insert the new crushmap into the cluster: ceph osd setcrushmap -i crushmap.new

More information on this can be found in the CRUSH Maps documentation.
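Once the new map is injected, the rule can also be double-checked against the live cluster, which is a quick way to confirm the edit survived the decompile/compile round trip:

ceph osd crush rule ls            # ec-rule should now be listed alongside the defaults
ceph osd crush rule dump ec-rule  # echoes the rule's steps back as JSON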

With the rule created, next came creating a pool with the rule:

  • Create an erasure code profile for the EC pool: ceph osd erasure-code-profile set ec-profile_m2-k4 m=2 k=4. This is a profile with k=4 and m=2: four data chunks and two parity chunks, for a total of six OSDs per object.
  • Create the pool with the CRUSH rule and EC profile: ceph osd pool create cephfs-ec-data 128 128 erasure ec-profile_m2-k4 ec-rule. I chose 128 PGs because it seemed like a reasonable number.
  • As CephFS requires a non-default configuration option to use EC pools as data storage, run: ceph osd pool set cephfs-ec-data allow_ec_overwrites true.
  • The final step was to create the CephFS filesystem itself: ceph fs new cephfs-ec cephfs-metadata cephfs-ec-data --force, with --force being required to use an EC pool for data.
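A couple of quick sanity checks at this point, shown without their output: the first confirms the filesystem is using the right metadata and data pools, the second that the EC pool picked up the expected profile and CRUSH rule.

ceph fs ls
ceph osd pool ls detail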

Conclusion

Once that was all done, I finally had a healthy cluster with two pools and a CephFS filesystem:

  cluster:
    id:     5afc28eb-61b3-45f1-b441-6f913fd70505
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum alecto,megaera,tisiphone (age 11h)
    mgr: alecto(active, since 11h), standbys: megaera, tisiphone
    mds: cephfs-ec:1 {0=alecto=up:active} 2 up:standby
    osd: 21 osds: 21 up (since 11h), 21 in (since 11h)

  data:
    pools:   2 pools, 160 pgs
    objects: 22 objects, 3.9 KiB
    usage:   22 GiB used, 101 TiB / 101 TiB avail
    pgs:     160 active+clean

Instead of running synthetic benchmarks, I decided to copy some of my data from the old server onto the new cluster. Performance isn’t quite what I was hoping for; I’ll need to dig into why, but I haven’t done any performance tuning yet. Basic tests got about 40-60MB/s write with a mix of file sizes from a few MB to a few dozen GB. I was hoping to max out my file-server’s 1GbE link, but 60MB/s of random writes on spinning rust isn’t bad, especially with only 18 drives in total.
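When I do dig into it, the obvious built-in starting point is rados bench against the data pool; a rough sketch, with the 30-second run length and 16 threads being arbitrary choices:

rados bench -p cephfs-ec-data 30 write -t 16 --no-cleanup
rados bench -p cephfs-ec-data 30 seq -t 16
rados -p cephfs-ec-data cleanup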

If you’re curious about the names I chose for my 3 hosts, Alecto, Megaera and Tisiphone are the names of the three Greek Furies. If I add more hosts, I’m going to be a bit stuck for names, but adding the Greek Fates should get me another 3 nodes.

One final note: when I was trying to mount CephFS, I kept getting the error mount error: no mds server is up or the cluster is laggy, which wasn’t terribly helpful. dmesg suggested I was having authentication issues, but the key was right. It turns out I also needed to specify the username for the mount to work.
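For reference, the mount that eventually worked looked roughly like this; the monitor address, mount point, client name and secret file path are all placeholders:

# kernel CephFS client; name= is the CephX user, which is what I was missing
mount -t ceph alecto:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret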

7 thoughts on “Three Node Ceph Cluster at Home”

  1. How is the performance of the metadata pool, are you sure it’s not still suffering the fsync() related slowness issue? Great article, by the way, thanks for writing it!


  2. I’m kind of confused by your ec rule. Why are you selecting 3 nodes and 2 OSDs each if you want to survive the crash of 1 host or 2 OSDs? It seems like you should select 2 nodes and 3 OSDs in the crushmap?


    1. The EC rule is spreading data across 6 OSDs with 2 OSDs as overhead for the erasure coding. So if I’m only using two hosts with three OSDs per host, any host loss means I’m 3 OSDs down and I’ve lost data. This way, there are only two OSDs per host, which means any single host can be lost and only take out two OSDs.

