Ceph Cluster Updates

The Ceph cluster is continuing to run fine and generally meets all of my needs, though I could definitely use more IOPS, especially for RBD volumes attached to VMs. I have encountered a couple of operational issues that may crop up in a home environment, outside a datacenter with redundant internet connectivity and power. If you’d like to read more on the cluster hardware choice, setup and testing, you can here.

Timekeeping

I had a nearly 3-day Internet outage caused by Telus’ general sloppiness. While I was offline, my cluster wasn’t able to keep time using NTP. Although the cluster computers were pulling NTP from a local NTP server (running on my pfSense router), all of that server’s sources of time were on the Internet, apart from its own onboard clock. At some point, NTP on the Ceph servers decided that the upstream was bad and dropped it. Over a day, the clocks on the servers drifted enough that Ceph started having serious issues. I ended up needing to shut down the cluster, and with it all the VMs using it for storage, because it had degraded to the point of being unusable due to how Ceph handles clock skew.

Luckily, I had a USB-based GPS unit handy, and I was able to set up a backup stratum 1 NTP server using the GPS and a Raspberry Pi. While USB GPS units generally have significantly more jitter than serial ones due to the vagaries of USB’s timing, that jitter was still much smaller than the drift of the clocks on the servers, and this satisfied Ceph. With higher quality timekeeping, my cluster was back to being healthy and able to serve writes.

I ended up using GPSD to access the GPS and Chrony to act as the NTP server. I found this guide and this one useful to get things working.

Lessons learned: Time synchronization is important for a healthy Ceph cluster. Have a local backup time source. While Ceph can be tuned to be more tolerant of poor time sync, there’s no replacement for stable time infrastructure.
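
As a rough illustration of keeping an eye on this, here is a minimal sketch that compares a host's clock offset against Ceph's monitor tolerance. It assumes the Ceph hosts themselves run Chrony (the post only says Chrony runs on the Raspberry Pi), and the function names and the MAX_OFFSET variable are made up for this example. The 0.05 second default mirrors Ceph's default mon_clock_drift_allowed.

```shell
# Sketch: warn when this host's clock offset exceeds Ceph's monitor tolerance.
# MAX_OFFSET mirrors Ceph's default mon_clock_drift_allowed (0.05 s);
# change it if the cluster has been tuned differently.
MAX_OFFSET="${MAX_OFFSET:-0.05}"

# Print the absolute offset in seconds, taken from the "System time" line
# of `chronyc tracking` (assumes chrony is the NTP client on this host).
clock_offset() {
    chronyc tracking | awk '/^System time/ { print $4 }'
}

# Return 0 if the offset is within MAX_OFFSET, 1 if it is not,
# and 2 if no offset could be read.
check_clock() {
    offset="$(clock_offset)"
    if [ -z "$offset" ]; then
        echo "could not read offset from chronyc" >&2
        return 2
    fi
    if awk -v o="$offset" -v m="$MAX_OFFSET" 'BEGIN { exit !(o <= m) }'; then
        echo "clock offset ${offset}s is within ${MAX_OFFSET}s"
    else
        echo "clock offset ${offset}s exceeds ${MAX_OFFSET}s" >&2
        return 1
    fi
}
```

Calling check_clock from cron on each host would be one way to get an early warning. On the Ceph side, ceph time-sync-status shows the skew as the monitors see it.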

Power

A month or so after the Internet outage, I had a power outage that lasted 1.5 hours. My backup power for the cluster has about 75 minutes of runtime, so the cluster was shut down. I powered down the VMs and then set the following flags on the cluster (source):

ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover

To bring the cluster back, I started the three hosts back up and unset the flags. However, when everything came back up, I realized that I hadn’t waited long enough before powering down the hosts, and some of the VMs’ RBD disks hadn’t finished syncing. Cue several hours of running xfs_repair on a bunch of RBD devices. In the end, I didn’t lose any data, just had lightly corrupted filesystems.
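
Since the same three flags get set before shutdown and unset afterwards, they can be wrapped in one small helper. This is only a sketch; the function name ceph_flags is invented for this example, and the flags are the ones listed above.

```shell
# Sketch: set or unset the three maintenance flags as a group.
# Usage: ceph_flags set    (before a planned shutdown)
#        ceph_flags unset  (once all hosts and OSDs are back up)
ceph_flags() {
    action="$1"
    case "$action" in
        set|unset) ;;
        *) echo "usage: ceph_flags set|unset" >&2; return 2 ;;
    esac
    # Stop at the first failure so a half-applied state is noticed.
    for flag in noout nobackfill norecover; do
        ceph osd "$action" "$flag" || return 1
    done
}
```

After running ceph_flags unset, ceph status should stop listing the flags in its health output.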

Lessons learned: Safe shutdown needs to be handled automatically, with attention paid to how the cluster is being used. I wrote a safe shutdown script to use with Proxmox and apcupsd that shuts down the VMs, calls sync twice and then waits a (hopefully very conservative) 10 minutes before powering off the hosts, to ensure that all data has been properly written to disk. As my hosts are not set to power back on when power comes back, I’m manually un-setting the flags when the systems come up.
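
The script itself isn't reproduced here, but the sequence described above could look roughly like the following sketch. Everything in it is an assumption: the function names, the SETTLE_SECONDS and VM_TIMEOUT variables, the decision to set the flags from the script, and the fact that it only handles the VMs on the host it runs on (qm list is per node, so each host would need to run it). How it gets triggered from apcupsd is left out.

```shell
# Sketch of a shutdown sequence, meant to run as root on a Proxmox host.
# SETTLE_SECONDS and VM_TIMEOUT are illustrative defaults, not tested values.
SETTLE_SECONDS="${SETTLE_SECONDS:-600}"   # the 10 minute wait described above
VM_TIMEOUT="${VM_TIMEOUT:-300}"           # seconds to wait for each guest

# Print the IDs of running VMs, parsed from the `qm list` table
# (columns: VMID NAME STATUS ...; the first line is a header).
running_vms() {
    qm list | awk 'NR > 1 && $3 == "running" { print $1 }'
}

safe_shutdown() {
    # Ask every running guest on this host to shut down cleanly.
    for vmid in $(running_vms); do
        qm shutdown "$vmid" --timeout "$VM_TIMEOUT" \
            || echo "VM $vmid did not stop cleanly" >&2
    done

    # Keep Ceph from rebalancing while the hosts are down.
    for flag in noout nobackfill norecover; do
        ceph osd set "$flag"
    done

    # Flush dirty pages twice, then give the cluster time to settle.
    sync
    sync
    sleep "$SETTLE_SECONDS"

    poweroff
}
```

The long sleep is deliberately crude. It trades UPS runtime for not having to work out when the last write has actually reached the disks.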

Memory

I also learned that 64GB of memory was not enough to run Ceph and VMs on the same hosts, as Ceph could easily use 40+GB of RAM on its own, leading to RAM exhaustion on the hosts. I upgraded to 128GB of RAM by adding another of the same kit to each host, which is the maximum amount the board and CPU support.

I’m still having issues with RAM exhaustion under heavy VM and CephFS load, so it seems like 256GB of RAM may actually be a good idea, but that’s out of my budget. I was seeing Ceph using 80+GB of RAM under heavy CephFS metadata workloads, in this case a backup operation determining whether files had changed since the last backup. My general fix has been to better spread VMs across hosts and to limit how many metadata operations I’m putting through CephFS at once.
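
For a sense of scale, here is a back-of-the-envelope sketch of the baseline RAM the OSD daemons alone aim for on one host. It assumes the 21 OSDs are spread evenly over the three hosts (7 each) and that osd_memory_target is at Ceph's default of 4 GiB; neither is confirmed by the numbers above. The function name is made up for this example.

```shell
# Sketch: rough per-host RAM budget for OSD daemons, in whole GiB.
# Arguments: number of OSDs on the host, osd_memory_target in bytes.
osd_ram_budget_gib() {
    osds="$1"
    target_bytes="$2"
    echo $(( osds * target_bytes / 1024 / 1024 / 1024 ))
}

# Example: 7 OSDs at the default 4 GiB target.
# osd_ram_budget_gib 7 4294967296    # prints 28
```

The current target can be read with ceph config get osd osd_memory_target. It is a best-effort target rather than a hard cap, and the MDS cache (mds_cache_memory_limit) comes on top of it, so real usage under metadata-heavy load can be well above this figure.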

Other Tweaks

Sometimes Ceph would go into HEALTH_WARN, showing OSDs with slow ops, and dmesg was full of errors like nvme nvme0: I/O 42 QID 47 timeout, completion polled. This appears to be due to a firmware bug in the Intel P4510 SSDs I’m using, which has since been fixed. So I did some baseline tests and then updated the firmware on the hosts and rebooted.
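
To check whether a host is affected, the timeout messages can be counted in the kernel log. A minimal sketch, with a made-up function name and a pattern based only on the error line quoted above:

```shell
# Sketch: count NVMe timeout messages in kernel log text read from stdin.
# The pattern is derived from the single example line in this post, so
# other wordings of the same error may not match.
count_nvme_timeouts() {
    grep -c 'nvme[0-9]*: I/O .* timeout'
}
```

Usage would be dmesg | count_nvme_timeouts on each host, before and after the firmware update.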

  • Download the Intel Memory and Storage Tool (MAS), CLI version for Linux from Intel’s website. Strangely, Proxmox doesn’t have this packaged (licensing issues).
  • Unzip it.
  • Install the Debian package using apt install ./intelmas_<version>_amd64.deb.
  • Find the SSD using MAS: intelmas show -intelssd.
  • Update the firmware: intelmas load -intelssd <SSD #>.
  • Reboot.
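
The two MAS steps above can be wrapped as a small helper. This is a sketch with an invented function name; it takes the SSD number as an argument because that has to be read off the show output first, and it leaves the reboot as a manual step.

```shell
# Sketch: list the SSDs, then load new firmware onto one of them.
# $1 is the SSD number as reported by `intelmas show -intelssd`.
update_ssd_firmware() {
    ssd_index="$1"
    if [ -z "$ssd_index" ]; then
        echo "usage: update_ssd_firmware <SSD #>" >&2
        return 2
    fi
    # Show the drives first so the index can be double-checked.
    intelmas show -intelssd || return 1
    intelmas load -intelssd "$ssd_index"
}
```

On a Ceph host it is probably wise to do one host at a time and wait for HEALTH_OK before moving on to the next.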

This update tripled the I/O throughput of the NVMe-only pool residing on the SSDs, both in IOPS and raw bitrate. Maximum latency, as reported by sysbench, went from 30 seconds to 90 milliseconds. And, no more errors in dmesg or slow ops on my OSDs. Incidentally, this performance increase was triggering the OOM conditions mentioned in the previous section, as the higher throughput needed more RAM.

Conclusion

I’m still pretty happy with Ceph at home, but home use certainly raises some considerations that don’t normally come up in a datacenter. They should probably be planned for even with redundant power and connectivity, in the event of a true disaster.

Below is my current cluster usage. I’ve got a 37TiB CephFS filesystem running on it, with the remainder being used for RBD images for VMs and the like.

  cluster:
    id:     5afc28eb-61b3-45f1-b441-6f913fd70505
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum megaera,tisiphone,alecto (age 5w)
    mgr: alecto(active, since 5w), standbys: megaera, tisiphone
    mds: cephfs-ec:1 {0=tisiphone=up:active} 2 up:standby
    osd: 21 osds: 21 up (since 5w), 21 in (since 3M)

  data:
    pools:   4 pools, 320 pgs
    objects: 23.78M objects, 35 TiB
    usage:   57 TiB used, 44 TiB / 101 TiB avail
    pgs:     320 active+clean
