All your bits are belong to Ceph

A presentation prepared for the Northern Colorado Linux Users’ Group, April 2024.

Ceph is a free (as in freedom and as in beer) software defined storage system. It creates a reliable and scalable storage cluster out of commodity components.

That’s a funny name

Ceph is not an acronym and is correctly capitalized throughout this slide deck. Ceph is short for “cephalopod,” a class of marine animals including octopodes (Octopus is a Greek word and should get a Greek plural. Makes no sense to do a Latin plural for a Greek word. Yes, I know I’m going to lose this one), squid, cuttlefish, and nautilus. These creatures all have tentacles – how this relates to storage software will be explained later in this presentation.

Where did it come from?

Ceph was created by Sage Weil for his doctoral research while at the University of California, Santa Cruz. The earliest code dates from 2004.

After completing his PhD, Sage founded a consultancy called Inktank Storage focused on further developing Ceph as well as consulting services for its users.

Where is it now?

Ceph is the premier free-software large scale storage solution. It is deployed globally. Voluntary telemetry results in 2023 report greater than 1 Exibyte of storage managed in more than 2500 clusters. Current numbers from 2024 show 1.39 Exibytes of storage across more than 3100 deployments.

With more than 3000 clusters reporting home, Ceph is widely used.

Some high profile users

Digital Ocean
Cisco
Bloomberg
Dreamhost
CERN
Deutsche Telekom/T-Mobile

What can it do for me?

It’s data storage. Ceph takes data in and reliably gives you back what was once written.

It’s very scalable. Tens of gigabytes to hundreds of terabytes to multiple petabytes.

It can provide great performance (1+ Tibibytes/sec has been documented recently) if built well. More on that in a later slide.

Several client interface options available.

Why do I want it instead of $MYCURRENTSTORAGE?

Own your storage destiny
- Add or remove capacity as needs change
- One unified system can provide block, S3, and network file system storage services
- Choose hardware from your current vendor of choice
Avoid appliance based storage operating expenses
- Commercial SAN and Filer storage has large ongoing, often ever increasing maintenance and support costs
- And (sometimes) support just isn’t very good at providing solutions
- Free software, support it in house. Or choose from a variety of vendors to help

More reasons to want it

Works well with other Linux and free software solutions.
Multi-site data replications options included.
Support is available from public mailing lists as well as dedicated vendors.

How does it do all that?

Let’s begin our dive into the details. Goggles on!

There are a lot of details. Completely grokking them is not required to use Ceph.

Basic cluster components

A Ceph storage cluster is made of at least one cluster monitor daemon (3 is common in production, occasionally 5), multiple object storage daemons (one is needed for each copy of the stored data), and a few other components.

Ceph monitor

The cluster’s monitors keep track of essential cluster state:

the monitor map
object storage daemon map
manager map
filesystem metadata server map
CRUSH map

Lots of maps, right? But one need not be a deep sea cartographer to understand Ceph.

Monitors are the brains of the cluster

As mentioned, the monitors keep track of where the other cluster components live. They reliably form a concensus for these state data using an implementation of the Paxos part-time parliament protocol. Read the paper (source below), it is very interesting. And very approachable for a computer science paper.

Monitors are just Linux machines

No special 99.9999% reliable hardware is required for any Ceph cluster component, including the monitors. Cluster reliability is accomplished using software algorithms that can deal with failure of individual components.

While flaky hardware or panicking kernels beneath a Ceph monitor may cause some slowness, the integrity of the cluster’s data is ensured.

Object Storage Daemons

Ceph object storage daemons (hereafter “OSDs”) actually hold the cluster’s data. In a Ceph cluster there is a 1:1 correspondence between a data disk and an OSD. That is, the OSD software manages the actual storage of bits on a disk.

In the past, the Ceph objects were stored as files on an OSD disk. Known as “file store,” this was usually done using a mounted XFS filesystem. Current releases of Ceph use an on-disk format called Bluestore. (Blue like the ocean.) Bluestore has many advantages over the older filestore backend, which is considered obsolete.

You mentioned a map?

Yeah, there are a multitude of maps. The OSD map specfically details the correspondence between an OSD and a not-yet-introduced concept called a “placement group.” We’ll get to those later.

Monitors are brains, OSDs store data, what’s a manager?

A Ceph manager keeps track of runtime metrics (storage utilization, performance, system load, and the like) and provides a environment where plugins coded in Python can be run. Failure of all managers will not make data unavailable – they are not part of the data storage/retrieval path. That said, they are important and perform some ongoing cluster maintenance tasks, such as making sure the storage utilization of the OSDs remains balanced across the cluster.

Filesystem metadata servers? What dat?

Not mentioned yet is that Ceph has three primary client access methods: block, S3 object, and a networked filesystem. In the filesystem access method, files are individual Ceph objects that are found by their object ID number. (Does this sound like an inode at all?) The directory hierarchy, mode bits, and file names are all metadata accessed through the cluster’s metadata servers (MDSes). Network location of these MDSes is stored in the monitors’ MDS map.

CRUSH map? Orange drink can? What?

CRUSH is Ceph’s data distribution algorithm. It maps a Ceph data object to specific OSDs in the cluster. CRUSH is an acronym for Controlled Replication Under Scalable Hashing. This is all documented in another academic paper, again listed in the sources slide.

CRUSH is key to Ceph’s scalability and reliability. Discussed more in a future slide.

What else to monitors do?

Keep track of cluster “identities” which are used to authenticate and authorize cluster components and clients.
Hold encryption keys for any OSDs that make use of the Linux kernel’s dmcrypt subsystem (cryptsetup/LUKS)

Lots of details. I just want to store my bits!

Ah, well that is a good point. Let’s explore how to use it a bit. But first a question… How do you want to store and retrieve your data? Ceph has 3 primary access facilities:

RADOS block device/RBD. This essentially looks like a disk to the client system.
Amazon AWS S3 storage. “malloc() for the web”
Ceph Filesystem. An alternative to NFS or SMB where one or more clients can access a shared remote file namespace. CephFS aims to be more POSIX compliant than NFS.

RADOS block device/RBD

It looks like a disk to the client system. That client can be a Linux box, the Qemu system emulator (and KVM if you’re running with that), and even Windows.

Actually, it looks like a disk with some additional features. RBD “images” are thin provisioned, snapshotable, and cloneable. Thin provisioning means a 1Tibyte RBD image only consumes space in the cluster for blocks that have been written to. Reads from unwritten blocks all return zeros. Just like a sparse file in UNIX.

Additional RBD features

Snapshots are RBD image contents from a specific point in time. An RBD image can be rolled back to its snapshotted contents. A snapshot can also be used to clone the contents of an RBD image. This is useful for making N copies of a disk filled with data and presenting it to multiple systems. Fast provisioning of virtual machines can be accomplished using this feature.

Snapshots can also be copied from one cluster to another. Multi geography redundancy and off-site backups are cool.

S3 object gateway/RGW

If you’re a web person in 2024, you’re sure to be familiar with Amazon’s S3. Ceph lets you build your own. No more charges for data ingress or egress. Or for API calls.

The Ceph radosgw or RGW service can serve static web content (images, CSS, ECMAscript libraries, videos, etc) to clients. Basically any web object that does not need server side processing is a good fit. It also makes a nice target for backups.

Similar to RBD, RGW can autonomously copy data from one Ceph cluster to another. For the same reasons: multi-site redundancy has numerous uses.

Ceph FS, a shared network filesystem

This client interface satisfies a need similar to NFS or SMB. Multiple clients needing read/write access to a shared network filesystem. It aims to be more POSIX compliant than NFS.

It integrates similar features to RBD in some regards: snapshots of the filesystem can be created ad-hoc or autonomously. Filesystem contents can be shipped to remote clusters. Additionally, disk usage quotas can be inflicted on clients if desired.

a few more, less common client interfaces

The low level Ceph RADOS object layer is available to client systems. SQLite has a RADOS backend that makes use of this.
There is an iSCSI target available built on top of RBD. Want to back your VMware clusters with free software storage?
Samba has a CephFS protocol backend. Ceph cluster storage can be exported without the double buffering effect of mounting a CephFS and re-exporting that via Samba
A user space NFS server called “NFS Ganesha” exists that can map NFS to CephFS as well.

Lower level details

At the lowest level, Ceph implements RADOS, the Reliable Autonomic Distributed Object Store. RADOS data objects distributed across the cluster in such a way that maximum durability and availabilty are acheived. Data objects are moved around by Ceph itself as needed when storage hardware is added to or removed from the cluster.

The client access methods from the previous slides are all implemented as layers built on top of RADOS. RGW, RBD, and CephFS are all built atop the stable foundation RADOS provides.

OSDs keep my data. What bits go where?

This part is the key to Ceph’s scalability. There is not a single, shared “head” that all the stored bits flow through. Instead, clients talk to the OSD that holds their objects directly. When a client connects to the cluster it retrieves a copy of the current CRUSH and OSD maps. These two items are key to mapping a RADOS object to an individual OSD.

RADOS objects are identified by an opaque integer ID. The CRUSH algorithm maps that opaque integer to a specific placement group. That PG is mapped to a set of OSDs in the OSD map. This is done computationally on the client side, not as a query to a central directory server. Thus, clients figure out which OSD to talk to. The cluster’s aggregate bandwidth scales more-or-less linearly with the number of OSDs it contains.

Mutliple copies for reliablility

The previous slide mentions placement groups. These deserve a bit of explanation now.

And I am out of time for writing slides, so will ad lib this at presentation time.

RADOS does what now?

Reliable
Autonomous
Distributed
Object
Store

Things are not all perfect

Lots of components involved. It is a complex beast.

The documentation can always be improved.

Bugs do sometimes get past the regression test suite. Be conservative deploying new versions.

The command line interface options are not always consistent. This can make learning a challenge.

It is not always easy to deploy. Every upstream release adds features to improve this aspect, but it can still be a challenge.

Other options to explore?

Ceph’s complexity may make it not a good fit for your needs. There are other free software options that implement similar client interfaces:

ZFS is Sun/Oracle’s Zettabyte File System. It is local only, but provides very robust data integrity and availabity features to a single server with directly attached disks.
MinIO is “high-performance, S3 compatible object store” available under the AGPL and commercial licenses.
LINSTOR from Linbit, the people behind DRBD. This is a block-only clustered storage solution.

Presenter information

Linux user since 1992 (yep, I’m old)

Founding member of NCLUG (Well, I was at the very first meeting in the Ramskeller at CSU. And have been a mostly regular attendee since then.)

Professional sysadmin sort of person. Storage, networks, operating systems, security…

Built first Ceph cluster at home in 2018

Spent 5.5 years at GoDaddy doing storage, mostly Ceph

Currently employed by Koor Technologies – providing Ceph expertise for customers with interesting data storage needs

References

https://en.wikipedia.org/wiki/Ceph_(software)
https://ceph.io/en/news/blog/2023/telemetry-celebrate-1-exabyte/
https://telemetry-public.ceph.com/d/ZFYuv1qWz/telemetry?orgId=1
https://ubuntu.com/ceph/managed
https://docs.ceph.com/en/latest/start/intro/

More references

https://en.wikipedia.org/wiki/Paxos_(computer_science)
https://ceph.io/assets/pdfs/weil-crush-sc06.pdf
https://ceph.io/assets/pdfs/weil-rados-pdsw07.pdf

This presentation was made with

vim
Markdown
Pandoc
S5

Served to you by

Allo Communications
Comcast/Xfinity
Hurricane Electric (if you’re hitting this over IPv6 transit)
Debian GNU/Linux
haproxy
Apache HTTPD