I’ve always wanted to use Ceph at home, but triple replication put it out of my budget. When Ceph added erasure coding, a more cost-effective cluster became possible. I had a working file-server, so I didn’t need to build a full-scale cluster, but I did some tests on Raspberry Pi 3B+s to see if they’d allow for a usable cluster with one OSD per Pi. When that didn’t even work, I shelved the idea, as I couldn’t justify the expense of building a proper cluster.
When my file-server started getting full, I decided to build a Ceph cluster to replace it. I’d get more redundancy, easier expansion, and refreshed hardware (some of my drives will be 9 years old this summer). I briefly looked at ZFS, but between its limited features and the legal uncertainty of running it on Linux, I quickly ruled it out.
The Cluster – Hardware
Three nodes is generally considered the minimum for Ceph. I briefly tested a single-node setup, but it wasn’t really better than my file-server. So my minimal design is three nodes, with the ability to add more nodes and OSDs if and when my storage needs grow.

I wanted each node to be small and low-ish power, so I was mainly looking at mATX cases that could take 8 3.5″ drives. After much searching, I realized that mATX is basically dead and there aren’t many mATX cases or motherboards out there. Presumably people either get mITX cases and motherboards if they want something small or get full ATX boards if they want lots of on-board peripherals and expandability, leaving mATX as an awkward middle ground.
For the motherboard, I needed (either onboard or via add-in cards): 8x SATA (or SAS) ports, 2x M.2 slots (at least one 22110), an 8-core (or better) CPU, support for 32GB RAM sticks, and a single 10GbE port (not SFP+). I found three boards that would work: a SuperMicro board that was incredibly expensive but had everything except the CPU onboard, an AMD X570 based board, and an Intel Z390 based board. I quickly ruled out the SuperMicro board based on price, and the Intel board based on the lack of low-power CPUs (to get 8 cores I was looking at either the 9700K, which is neither low-power nor particularly fast, or the 9900T, which no one could get me). I chose the ASRock X570M Pro4. It had what I needed, and better yet, it supported a 65W, 8-core third-generation Ryzen CPU. I’d been bitten by serious hardware bugs in first-generation Ryzen, so I was a bit wary, but Intel had nothing anywhere near competitive.
HDD choice was relatively easy: I put every drive model I could find into a spreadsheet and picked the best $/TB 7200RPM model. That ended up being the 6TB Seagate Ironwolf, which was discontinued in favour of an inferior model after I ordered mine. Luckily, with Ceph, replacement drives don’t need to be the same size the way they do in RAID6.
SSDs were a little more challenging. I chose inexpensive, but good, M.2 SSDs for the OS drives. I also wanted some SSD-based OSDs. Ceph apparently does not do well on consumer-level SSDs without power-loss protection and consequently has very slow fsync performance, so I needed a fancier SSD than the Samsung 970 Pros I had been intending to use. I found the Intel P4510/P4511 series and decided on a 2.5″ U.2 P4510. This required an M.2 to mini-SAS adapter board and a mini-SAS to U.2 cable to connect it to the board’s open M.2 slot. Why not use the P4511? No stock on it.
Part | Count | Notes |
---|---|---|
Fractal Design Node 804 – Case | 3 | 8×3.5″, 2×2.5″, mATX, full ATX PSU |
ASRock X570M Pro4 – Motherboard | 3 | mATX, 8xSATA, 2x M.2, PCIe4.0 x16, x1, x4 |
Corsair RM550x – Power Supply | 3 | ATX PSU |
Seagate 6TB Ironwolf – Hard Drive | 18 | OSD: 6 per node, 108TB Raw, ST6000VN0033 |
Kingston SC2000 250GB – M.2 SSD | 3 | OS/Boot Drive |
Intel P4510 1TB – U.2 SSD | 3 | OSD: 1 per node |
AMD Ryzen 3700X – CPU | 3 | 65W, 8-core |
64GB Corsair LPX – RAM | 3 | One 2x32GB DDR4 3200 per node |
Geforce GT710 – Video Card | 3 | One per node |
Startech M.2 to U.2 Adapter Board | 3 | To connect P4510 SSD to motherboard |
Mini-SAS to U.2 Adapter Cable | 3 | Cable to connect from Startech adapter to SSD |
4x SATA Splitter Cable | 6 | One per each bank of 4 drives, 2 per node |
Corsair ML120 120mm 4-pin fan | 3 | One each on front of drive compartment |
Aquantia AQC107 – 10GbE NIC | 3 | One per node |
Cat 6a Patch Cable | 3 | One per node |
Netgear 8-port 10GbE Switch | 1 | Model: XS708E, one for the whole cluster |
A small note on networking: I elected not to have separate public and cluster networks; everything uses the same 10GbE network. This simplified setup, both on the host/Ceph side and in the physical cabling and switch configuration. For a cluster this small, the difference shouldn’t matter.
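For illustration, this is roughly what that looks like in `ceph.conf`; the subnet below is a placeholder, not my actual network:

```ini
[global]
    # One flat network: with no cluster_network defined, Ceph carries both
    # client traffic and replication/recovery traffic over the public network.
    public_network = 10.0.10.0/24
    # cluster_network deliberately omitted
```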

Other hardware notes: the Fractal Design Node 804’s HDD mounts are missing one of the two standard screw holes. The 3.5″ spec only requires the two end holes, but the Node 804’s mounts only provide the optional middle hole and the hole nearest the connectors. Most drives 6TB and over lack the middle hole, apparently to make room for another platter, so those drives end up secured only at the connector end.
The Cluster – Setup
Setup was pretty straightforward. I used Arch Linux as a base, running Ceph version 14.2.8 (then current). I installed the cluster using the Manual Deployment instructions.
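The OSD portion of that deployment boils down to one `ceph-volume` call per drive. This is a sketch rather than my exact commands, and the device paths are just examples:

```sh
# One Bluestore OSD per drive; ceph-volume handles the LVM setup,
# the auth key, and the systemd unit. Device paths are examples only.
ceph-volume lvm create --data /dev/sda      # repeat for each of the 6 HDDs
ceph-volume lvm create --data /dev/nvme1n1  # the U.2 P4510
```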
Once the cluster was running, it was time to create pools and set up the CephFS filesystem I planned to migrate to. Ceph correctly assigned my hard drives to the `hdd` device class and the SSDs to the `ssd` class. I had planned to have CephFS backed by an erasure coded pool, with a durability requirement of being able to lose either two drives or one host (but not both). CephFS doesn’t allow EC pools to be used for the CephFS metadata store, so I created a replicated pool on the P4510 SSDs. I’m using the P4510s to store the metadata because suggestions online were that keeping the metadata pool on SSDs would improve performance. I’m not sure this actually makes much difference with the low number of drives I have.

I created a CRUSH rule that will place data only on SSDs with: `ceph osd crush rule create-replicated ssd-only default host ssd`.
To make sure this CRUSH rule worked, I tested it by:
- Dumping a copy of the crushmap using `ceph osd getcrushmap -o crushmap.orig`
- Running `crushtool --test -i crushmap.orig --rule 2 --show-mappings --x 1 --num-rep 3` (the number after `--rule` is the index of the rule to test).

Results: `CRUSH rule 2 x 1 [18,20,19]`, which are the OSD numbers of my SSDs, exactly as intended.
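As an aside, if you're not sure which index to pass to `--rule`, it can be read straight from the cluster; the `grep` here is just for convenience:

```sh
# The decompiled crushmap also shows each rule's id, but this avoids
# the decompile step when all you need is the index for crushtool.
ceph osd crush rule dump ssd-only | grep rule_id
```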
Finally, after creating the `cephfs-metadata` pool, point it at the new rule: `ceph osd pool set cephfs-metadata crush_rule ssd-only`. Excellent! On to the EC pool.
Three Node Cluster – EC CRUSH Rules
The EC pool took a little more work to get going. My design goal is for the cluster to survive the failure of either a single node or any two OSDs (but not both at once). To achieve this, I would minimally need to split each block into four data chunks plus two parity chunks, with two chunks on each of the three hosts: losing a whole host costs exactly two chunks, and losing any two OSDs costs at most two chunks, both of which a k=4, m=2 profile can tolerate.
What the OSD tree looks like:
```
$ ceph osd tree
ID  CLASS  WEIGHT     TYPE NAME           STATUS  REWEIGHT  PRI-AFF
-1         100.97397  root default
-3          33.65799      host Alecto
 0  hdd      5.45799          osd.0           up   1.00000  1.00000
 1  hdd      5.45799          osd.1           up   1.00000  1.00000
 2  hdd      5.45799          osd.2           up   1.00000  1.00000
 3  hdd      5.45799          osd.3           up   1.00000  1.00000
 4  hdd      5.45799          osd.4           up   1.00000  1.00000
 5  hdd      5.45799          osd.5           up   1.00000  1.00000
20  ssd      0.90999          osd.20          up   1.00000  1.00000
-5          33.65799      host Megaera
 6  hdd      5.45799          osd.6           up   1.00000  1.00000
 7  hdd      5.45799          osd.7           up   1.00000  1.00000
 9  hdd      5.45799          osd.9           up   1.00000  1.00000
11  hdd      5.45799          osd.11          up   1.00000  1.00000
14  hdd      5.45799          osd.14          up   1.00000  1.00000
16  hdd      5.45799          osd.16          up   1.00000  1.00000
19  ssd      0.90999          osd.19          up   1.00000  1.00000
-7          33.65799      host Tisiphone
 8  hdd      5.45799          osd.8           up   1.00000  1.00000
10  hdd      5.45799          osd.10          up   1.00000  1.00000
12  hdd      5.45799          osd.12          up   1.00000  1.00000
13  hdd      5.45799          osd.13          up   1.00000  1.00000
15  hdd      5.45799          osd.15          up   1.00000  1.00000
17  hdd      5.45799          osd.17          up   1.00000  1.00000
18  ssd      0.90999          osd.18          up   1.00000  1.00000
```
I found a blog post from 2017 describing a CRUSH rule that does exactly this, but it was a little light on how to actually put it in place.
1. Get a copy of the existing crushmap: `ceph osd getcrushmap -o crushmap.orig`
2. Decompile the crushmap to plaintext for editing: `crushtool -d crushmap.orig -o crushmap.decomp`
3. Edit the crushmap with a text editor. This is the rule I added:

```
rule ec-rule {
    id 1
    type erasure
    min_size 6
    max_size 6
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step choose indep 3 type host
    step choose indep 2 type osd
    step emit
}
```
Some explanation of this rule config:
- `min_size` and `max_size` being 6 is how many OSDs we want to split the data over.
- `step take default class hdd` means that CRUSH won’t place any blocks on the SSDs.
- The lines `step choose indep 3 type host` and `step choose indep 2 type osd` tell CRUSH to first choose three hosts, and then choose two OSDs on each of those hosts.
4. Compile the modified crushmap: `crushtool -c crushmap.decomp -o crushmap.new`
5. Test the new crushmap: `crushtool --test -i crushmap.new --rule 1 --show-mappings --x 1 --num-rep 6`. In my case, this resulted in `CRUSH rule 1 x 1 [9,16,8,13,5,0]`, which shows placement on six OSDs, two per host.
6. Insert the new crushmap into the cluster: `ceph osd setcrushmap -i crushmap.new`

More information on this can be found in the CRUSH Maps documentation.
With the rule created, next came creating a pool that uses it:
- Create an erasure code profile for the EC pool: `ceph osd erasure-code-profile set ec-profile_m2-k4 m=2 k=4`. This is a profile with k=4 and m=2, so four data chunks and two parity chunks for a total of six OSDs.
- Create the pool with the CRUSH rule and EC profile: `ceph osd pool create cephfs-ec-data 128 128 erasure ec-profile_m2-k4 ec-rule`. I chose 128 PGs because it seemed like a reasonable number.
- CephFS requires a non-default configuration option to use EC pools as data storage: `ceph osd pool set cephfs-ec-data allow_ec_overwrites true`.
- The final step was to create the CephFS filesystem itself: `ceph fs new cephfs-ec cephfs-metadata cephfs-ec-data --force`, with the force being required to use an EC pool for data.
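For what it’s worth, a couple of read-only commands make it easy to double-check the result:

```sh
# Confirm each pool's crush_rule and erasure profile, and that the
# filesystem is backed by the expected metadata and data pools.
ceph osd pool ls detail
ceph fs status cephfs-ec
```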
Conclusion
Once that was all done, I finally had a healthy cluster with two pools and a CephFS filesystem:
```
  cluster:
    id:     5afc28eb-61b3-45f1-b441-6f913fd70505
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum alecto,megaera,tisiphone (age 11h)
    mgr: alecto(active, since 11h), standbys: megaera, tisiphone
    mds: cephfs-ec:1 {0=alecto=up:active} 2 up:standby
    osd: 21 osds: 21 up (since 11h), 21 in (since 11h)

  data:
    pools:   2 pools, 160 pgs
    objects: 22 objects, 3.9 KiB
    usage:   22 GiB used, 101 TiB / 101 TiB avail
    pgs:     160 active+clean
```
Instead of running synthetic benchmarks, I decided to copy some of my data from the old server into the new cluster. Performance isn’t quite what I was hoping for; I’ll need to dig into why, but I haven’t done any performance tuning yet. Basic tests got about 40-60MB/s of writes with a mix of file sizes from a few MB to a few dozen GB. I was hoping to max out my fileserver’s 1GbE link, but 60MB/s of random writes on spinning rust isn’t bad, especially with only 18 drives in total.
If you’re curious about the names I chose for my 3 hosts, Alecto, Megaera and Tisiphone are the names of the three Greek Furies. If I add more hosts, I’m going to be a bit stuck for names, but adding the Greek Fates should get me another 3 nodes.
One final note: when I was trying to mount CephFS, I kept getting the error `mount error: no mds server is up or the cluster is laggy`, which wasn’t terribly helpful. dmesg seemed to suggest I was having authentication issues, but the key was right. It turns out that I also needed to specify the username for it to work.
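For completeness, this is roughly the shape of the mount command that ended up working; the monitor address, client name, and secret file path below are placeholders rather than my actual values:

```sh
# Kernel CephFS mount; without an explicit name= the client falls back to a
# default user, which fails authentication with the unhelpful error above.
mount -t ceph alecto:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret
```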