This blog post is the second in a series concerning Ceph.
Creating data redundancy
One of the main concerns when dealing with large sets of data is data durability. We do not want a cluster in which a simple disk failure leads to data loss. What Ceph aims for instead is fast recovery from any type of failure within a given failure domain.
Ceph is able to ensure data durability by using either replication or erasure coding.
Replication
For those of you who are familiar with RAID, you can think of Ceph's replication as RAID 1 but with subtle differences.
The data is replicated onto a number of different OSDs, nodes, or racks, depending on your cluster configuration. The original data and the replicas are split into many small chunks and evenly distributed across your cluster using the CRUSH algorithm. If you have chosen to keep three replicas on a six-node cluster, those replicas will be spread out across all six nodes, rather than sitting on three nodes that each hold a full copy.
It is important to choose the right level of data replication. If you are running a single-node cluster, replication at the node level is impossible, and without any replication a single OSD failure would mean data loss. In this case, you would replicate data across the OSDs available on that node.
On a multi-node cluster, your replication factor decides how many OSDs or nodes you can afford to lose without losing data. Of course, replication also reduces the total amount of space available in your cluster. If you choose a replication factor of 3 at the node level, only one third of your raw storage is actually usable.
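To make this a bit more concrete, here is a minimal sketch of what an OSD-level replicated pool could look like when the cluster is managed by Rook (which the rest of this post uses); the pool name and size are example values, and a multi-node cluster would typically use host or rack as the failure domain instead:
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool-osd   # example name
  namespace: rook-ceph    # assuming Rook runs in the rook-ceph namespace
spec:
  failureDomain: osd      # spread replicas across OSDs, e.g. on a single-node cluster
  replicated:
    size: 3               # three copies of every object, so one third of raw capacity is usable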
Replication in Ceph is fast and only limited by the read/write performance of the OSDs. However, some people are not content with "only" being able to use a fraction of their raw capacity. For them, Ceph also offers erasure coding.
Erasure Coding
Erasure coding encodes your original data in such a way that you only need a subset of the stored fragments to recreate the original information. It splits objects into k data fragments and then computes m parity fragments. I will provide an example.
Let us say that the value of our data is 52. We could split it into:
x = 5
y = 2
The encoding process will then compute a number of parity fragments. In this example, these are the following equations:
x + y = 7
x - y = 3
2x + y = 12
Here, k = 2 and m = 3: k is the number of data fragments and m is the number of parity fragments. If a disk or node fails and the data needs to be recovered, we only need any two of the five elements we store (the two data fragments and the three parity fragments) to reconstruct the original data. This is what ensures data durability when using erasure coding.
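For example, suppose the disks holding both data fragments fail and only the parity fragments survive. From x + y = 7 and x - y = 3 alone, adding the two equations gives 2x = 10, so x = 5, and substituting back gives y = 2, which reconstructs the original value 52.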
Now, why does this matter? It matters because the parity fragments take up significantly less space than full replicas of the data. Here is a table that shows the overhead of different erasure coding schemes. The overhead is calculated as m / k.
| Erasure coding scheme (k+m) | Minimum number of nodes | Storage overhead |
| --- | --- | --- |
| 4+2 | 6 | 50% |
| 6+2 | 8 | 33% |
| 8+2 | 10 | 25% |
| 6+3 | 9 | 50% |
As we can see in the table, the 8+2 scheme lets you lose two of your nodes without losing any data, and that with only a 25% storage overhead (m / k = 2/8 = 25%).
From a storage space optimization standpoint, this is a much better use of your capacity. However, it is not without downsides. The parity fragments take time for the cluster to calculate, so read/write operations are slower than with replication. Erasure coding is therefore usually recommended for clusters that deal with large amounts of cold data.
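As a rough sketch of how this could be configured with Rook, an erasure-coded pool using the 4+2 scheme from the table might look like the following; the pool name is an example:
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ec-pool           # example name
  namespace: rook-ceph    # assuming Rook runs in the rook-ceph namespace
spec:
  failureDomain: host
  erasureCoded:
    dataChunks: 4         # k = 4 data fragments
    codingChunks: 2       # m = 2 parity fragments, so any two hosts can fail
Note that for block storage, an erasure-coded pool is normally used as a data pool alongside a small replicated pool for metadata; see the Rook documentation for the exact setup.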
Using Ceph
A natural part of deployments on Kubernetes is creating persistent volume claims (PVCs). A PVC claims a volume that a pod can then use as storage for its data. In order to create a PVC, you first need to define a StorageClass in Kubernetes. With Rook, the StorageClass points at a Ceph block pool, so the example below defines both.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
spec:
  failureDomain: host
  replicated:
    size: 3
    requireSafeReplicaSize: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph # namespace:cluster
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/fstype: ext4
allowVolumeExpansion: true
reclaimPolicy: Delete
In this file, you can see that we first create a replica pool that keeps 3 replicas in total and uses host as the failure domain. After that, in the StorageClass itself, we define whether or not volume expansion is allowed after a volume has been created, and what the reclaim policy should be. The reclaim policy determines whether the data stored in the volume is deleted or retained when the claim is deleted. In this case, I have chosen Delete.
# kubectl get storageclass -n rook-ceph
NAME              PROVISIONER                  RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
rook-ceph-block   rook-ceph.rbd.csi.ceph.com   Delete          Immediate           true                   10m
Now that the StorageClass has been created, we can create a PVC:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-ceph-block
This creates a PVC, which is now bound on our Kubernetes cluster:
# kubectl get pvc -n rook-ceph
NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
rbd-pvc   Bound    pvc-56c45f01-562f-4222-8199-43abb856ca94   1Gi        RWO            rook-ceph-block   37s
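If you are curious about the Ceph side of things, and assuming the Rook toolbox deployment (rook-ceph-tools) is installed, something along these lines should list the RBD image that backs the claim in the replicapool pool:
# kubectl -n rook-ceph exec deploy/rook-ceph-tools -- rbd ls replicapool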
We will now deploy a pod that uses this PVC:
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: web-server
      image: nginx
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: rbd-pvc
        readOnly: false
After deploying this pod, you can see it in the pod list:
# kubectl get pods -n rook-ceph
NAME       READY   STATUS    RESTARTS   AGE
demo-pod   1/1     Running   0          118s
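To double-check that the volume is mounted where we expect it, you could run something like the following (the mount path comes from the pod spec above):
# kubectl -n rook-ceph exec demo-pod -- df -h /var/lib/www/html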
That is how you deploy pods that use persistent volume claims backed by your Ceph cluster!