A presentation prepared for the Northern Colorado Linux Users’ Group, April 2024.
Ceph is a free (as in freedom and as in beer) software-defined storage system. It creates a reliable and scalable storage cluster out of commodity components.
Ceph is not an acronym and is correctly capitalized throughout this slide deck. Ceph is short for “cephalopod,” a class of marine animals including octopodes (Octopus is a Greek word and should get a Greek plural. Makes no sense to do a Latin plural for a Greek word. Yes, I know I’m going to lose this one), squid, cuttlefish, and nautilus. These creatures all have tentacles – how this relates to storage software will be explained later in this presentation.
Ceph was created by Sage Weil for his doctoral research while at the University of California, Santa Cruz. The earliest code dates from 2004.
After completing his PhD, Sage founded a consultancy called Inktank Storage, focused on further developing Ceph and on providing consulting services for its users.
Ceph is the premier free-software large-scale storage solution. It is deployed globally. Voluntary telemetry results in 2023 reported greater than 1 exbibyte of storage managed in more than 2500 clusters. Current numbers from 2024 show 1.39 exbibytes of storage across more than 3100 deployments.
With more than 3000 clusters reporting home, Ceph is widely used.
It’s data storage. Ceph takes data in and reliably gives you back what was written.
It’s very scalable. Tens of gigabytes to hundreds of terabytes to multiple petabytes.
It can provide great performance (more than 1 tebibyte per second has been documented recently) if built well. More on that in a later slide.
Several client interface options are available.
Let’s begin our dive into the details. Goggles on!
There are a lot of details. Completely grokking them is not required to use Ceph.
A Ceph storage cluster is made of at least one cluster monitor daemon (3 is common in production, occasionally 5), multiple object storage daemons (one is needed for each copy of the stored data), and a few other components.
The cluster’s monitors keep track of essential cluster state:
Lots of maps, right? But one need not be a deep sea cartographer to understand Ceph.
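As a quick illustration (not on the slide), here is a sketch of asking the monitors for their maps using the librados Python bindings. It assumes a reachable cluster, a readable /etc/ceph/ceph.conf, and a keyring with permission to query the monitors.

    import json
    import rados

    # Connect using the local cluster configuration (assumed to exist).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # mon_command() sends a JSON-formatted command to the monitor quorum and
    # returns (return_code, output_buffer, status_string).
    ret, out, err = cluster.mon_command(
        json.dumps({'prefix': 'mon dump', 'format': 'json'}), b'')
    print('monitor map epoch:', json.loads(out)['epoch'])

    ret, out, err = cluster.mon_command(
        json.dumps({'prefix': 'osd dump', 'format': 'json'}), b'')
    print('OSDs in the OSD map:', len(json.loads(out)['osds']))

    cluster.shutdown()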
As mentioned, the monitors keep track of where the other cluster components live. They reliably form a consensus on this state data using an implementation of the Paxos part-time parliament protocol. Read the paper (source below); it is very interesting, and very approachable for a computer science paper.
No special 99.9999% reliable hardware is required for any Ceph cluster component, including the monitors. Cluster reliability is accomplished using software algorithms that can deal with failure of individual components.
While flaky hardware or panicking kernels beneath a Ceph monitor may cause some slowness, the integrity of the cluster’s data is ensured.
Ceph object storage daemons (hereafter “OSDs”) actually hold the cluster’s data. In a Ceph cluster there is a 1:1 correspondence between a data disk and an OSD. That is, the OSD software manages the actual storage of bits on a disk.
In the past, Ceph objects were stored as files on an OSD disk. Known as “FileStore,” this was usually done using a mounted XFS filesystem. Current releases of Ceph use an on-disk format called BlueStore. (Blue like the ocean.) BlueStore has many advantages over the older FileStore backend, which is considered obsolete.
Yeah, there are a multitude of maps. The OSD map specifically details the correspondence between an OSD and a not-yet-introduced concept called a “placement group.” We’ll get to those later.
A Ceph manager keeps track of runtime metrics (storage utilization, performance, system load, and the like) and provides an environment where plugins coded in Python can be run. Failure of all managers will not make data unavailable – they are not part of the data storage/retrieval path. That said, they are important and perform some ongoing cluster maintenance tasks, such as making sure the storage utilization of the OSDs remains balanced across the cluster.
Not mentioned yet is that Ceph has three primary client access methods: block, S3 object, and a networked filesystem. In the filesystem access method, files are individual Ceph objects that are found by their object ID number. (Does this sound like an inode at all?) The directory hierarchy, mode bits, and file names are all metadata accessed through the cluster’s metadata servers (MDSes). Network location of these MDSes is stored in the monitors’ MDS map.
CRUSH is Ceph’s data distribution algorithm. It maps a Ceph data object to specific OSDs in the cluster. CRUSH is an acronym for Controlled Replication Under Scalable Hashing. This is all documented in another academic paper, again listed in the sources slide.
CRUSH is key to Ceph’s scalability and reliability. Discussed more in a future slide.
Ah, well that is a good point. Let’s explore how to use it a bit. But first a question… How do you want to store and retrieve your data? Ceph has 3 primary access facilities:
It looks like a disk to the client system. That client can be a Linux box, the QEMU system emulator (and KVM if you’re running with that), and even Windows.
Actually, it looks like a disk with some additional features. RBD “images” are thin provisioned, snapshottable, and cloneable. Thin provisioning means a 1 TiB RBD image only consumes space in the cluster for blocks that have been written to. Reads from unwritten blocks all return zeros. Just like a sparse file in UNIX.
Snapshots are RBD image contents from a specific point in time. An RBD image can be rolled back to its snapshotted contents. A snapshot can also be used to clone the contents of an RBD image. This is useful for making N copies of a disk filled with data and presenting them to multiple systems. Fast provisioning of virtual machines can be accomplished using this feature.
Snapshots can also be copied from one cluster to another. Multi-geography redundancy and off-site backups are cool.
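For the curious, a minimal sketch of these features using the rbd Python bindings. The pool and image names are placeholders; it assumes a running cluster, /etc/ceph/ceph.conf, and an RBD pool named “rbd”.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')            # assumes a pool named 'rbd' exists

    # Create a 1 TiB image; thin provisioning means no space is consumed yet.
    rbd.RBD().create(ioctx, 'demo-image', 1 << 40)

    # Snapshot it, protect the snapshot, and clone it into a new image.
    image = rbd.Image(ioctx, 'demo-image')
    image.create_snap('golden')
    image.protect_snap('golden')
    image.close()
    rbd.RBD().clone(ioctx, 'demo-image', 'golden', ioctx, 'demo-clone')

    ioctx.close()
    cluster.shutdown()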
If you’re a web person in 2024, you’re sure to be familiar with Amazon’s S3. Ceph lets you build your own. No more charges for data ingress or egress. Or for API calls.
The Ceph radosgw or RGW service can serve static web content (images, CSS, ECMAScript libraries, videos, etc.) to clients. Basically any web object that does not need server-side processing is a good fit. It also makes a nice target for backups.
Similar to RBD, RGW can autonomously copy data from one Ceph cluster to another. For the same reasons: multi-site redundancy has numerous uses.
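Because RGW speaks the S3 API, ordinary S3 tooling works against it. A minimal sketch with boto3; the endpoint URL and credentials below are made-up placeholders for a local RGW.

    import boto3

    # Point a standard S3 client at the local radosgw instead of Amazon.
    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.test:8080',   # hypothetical RGW endpoint
        aws_access_key_id='ACCESS_KEY_PLACEHOLDER',
        aws_secret_access_key='SECRET_KEY_PLACEHOLDER',
    )

    s3.create_bucket(Bucket='backups')
    s3.put_object(Bucket='backups', Key='hello.txt', Body=b'served by radosgw')
    print(s3.get_object(Bucket='backups', Key='hello.txt')['Body'].read())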
This client interface satisfies a need similar to NFS or SMB: multiple clients needing read/write access to a shared network filesystem. It aims to be more POSIX-compliant than NFS.
It integrates features similar to RBD in some regards: snapshots of the filesystem can be created ad hoc or autonomously, and filesystem contents can be shipped to remote clusters. Additionally, disk usage quotas can be inflicted on clients if desired.
At the lowest level, Ceph implements RADOS, the Reliable Autonomic Distributed Object Store. RADOS data objects are distributed across the cluster in such a way that maximum durability and availability are achieved. Data objects are moved around by Ceph itself as needed when storage hardware is added to or removed from the cluster.
The client access methods from the previous slides are all implemented as layers built on top of RADOS. RGW, RBD, and CephFS are all built atop the stable foundation RADOS provides.
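To make “layers on top of RADOS” concrete, here is a tiny sketch that talks to RADOS directly through the librados Python bindings. The pool name is a placeholder; it assumes a running cluster and the usual configuration file.

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Everything RBD, RGW, and CephFS do ultimately turns into object
    # reads and writes like these.
    ioctx = cluster.open_ioctx('mypool')         # assumes a pool named 'mypool'
    ioctx.write_full('greeting', b'stored reliably by RADOS')
    print(ioctx.read('greeting'))

    ioctx.close()
    cluster.shutdown()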
This part is the key to Ceph’s scalability. There is not a single, shared “head” that all the stored bits flow through. Instead, clients talk to the OSD that holds their objects directly. When a client connects to the cluster it retrieves a copy of the current CRUSH and OSD maps. These two items are key to mapping a RADOS object to an individual OSD.
RADOS objects are identified by an opaque object ID. That ID is hashed to a specific placement group, and the CRUSH algorithm then maps that PG to a set of OSDs using the cluster topology carried in the CRUSH and OSD maps. This is done computationally on the client side, not as a query to a central directory server. Thus, clients figure out for themselves which OSDs to talk to. The cluster’s aggregate bandwidth scales more-or-less linearly with the number of OSDs it contains.
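A toy sketch of the idea (this is not Ceph’s real hash or CRUSH code – Ceph uses the rjenkins hash and straw2 buckets – but it shows why no central directory is needed): every client running the same function against the same maps computes the same placement.

    import hashlib

    PG_NUM = 128                    # placement groups in a hypothetical pool
    NUM_OSDS = 12                   # OSDs in a hypothetical cluster

    def object_to_pg(name):
        # Hash the object ID down to a placement group, purely client-side.
        digest = hashlib.md5(name.encode()).digest()
        return int.from_bytes(digest[:4], 'little') % PG_NUM

    def pg_to_osds(pg, replicas=3):
        # Stand-in for CRUSH: a deterministic, pseudo-random pick of OSDs.
        return [(pg * 7 + i * 13) % NUM_OSDS for i in range(replicas)]

    pg = object_to_pg('rbd_data.1234abcd.0000000000000042')
    print('PG', pg, '-> OSDs', pg_to_osds(pg))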
The previous slide mentions placement groups. These deserve a bit of explanation now.
And I am out of time for writing slides, so will ad lib this at presentation time.
Lots of components involved. It is a complex beast.
The documentation can always be improved.
Bugs do sometimes get past the regression test suite. Be conservative deploying new versions.
The command line interface options are not always consistent. This can make learning a challenge.
It is not always easy to deploy. Every upstream release adds features to improve this aspect, but it can still be a challenge.
Ceph’s complexity may make it not a good fit for your needs. There are other free software options that implement similar client interfaces:
Linux user since 1992 (yep, I’m old)
Founding member of NCLUG (Well, I was at the very first meeting in the Ramskeller at CSU. And have been a mostly regular attendee since then.)
Professional sysadmin sort of person. Storage, networks, operating systems, security…
Built first Ceph cluster at home in 2018
Spent 5.5 years at GoDaddy doing storage, mostly Ceph
Currently employed by Koor Technologies – providing Ceph expertise for customers with interesting data storage needs
Allo Communications
Comcast/Xfinity
Hurricane Electric (if you’re hitting this over IPv6 transit)
Debian GNU/Linux
haproxy
Apache HTTPD