Ceph is a comprehensive storage solution that includes its own file system, the Ceph file system (CephFS). It makes it possible to store data across various components within a distributed network, and the data can be physically kept in separate storage areas. Ceph supports a wide variety of storage devices and offers high scalability.
What you should know about Ceph and its most important features
Ceph was conceived by Sage A. Weil, who developed it while writing his dissertation and published it in 2006. He then led the project with his company Inktank Storage. In 2014, the company was acquired by Red Hat, with Weil staying on as chief architect in charge of the software’s development.
Ceph only works on Linux systems, for example CentOS, Debian, Fedora, Red Hat Enterprise Linux (RHEL), openSUSE, and Ubuntu. Windows systems cannot access it directly, but access is possible through iSCSI (Internet Small Computer System Interface). As such, Ceph is particularly suitable for data centers that make their storage space available over servers, and for cloud solutions of any kind that use software to provide storage.
We have compiled a list of the most important features of Ceph:
- High scalability
- Data security through redundant storage
- High reliability through distributed data storage
- Software-based increase in availability through an integrated algorithm for locating data
- Continuous memory allocation
- Minimal hardware requirements (set-up possible with 1 GB RAM on a computer with a single-core processor and only a few GB of available storage space, depending on the task in the network)
Ceph requires several computers that are connected to one another in what is called a cluster. Each connected computer within that network is referred to as a node.
The following tasks must be distributed among the nodes within the network:
- Monitor nodes: Monitor the status of individual nodes in the cluster, especially the managers, object storage devices, and metadata servers (MDS). In order to ensure maximum reliability, at least three monitor nodes are recommended.
- Managers: Manage the status of storage usage, system load, and the capacity of the nodes.
- Ceph-OSDs (Object Storage Devices): The background applications for the actual data management; they are responsible for the storage, duplication, and restoration of data. At least three OSDs are recommended for a cluster.
- Metadata servers (MDSs): For performance reasons, these store the metadata of files kept in CephFS, including storage paths, file names, and time stamps. They are POSIX-compatible and can be queried using Unix commands such as ls, find, and the like.
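The recommendation of at least three monitor nodes can be illustrated with some simple arithmetic. The sketch below assumes that the monitors must form a strict majority (a quorum) in order to agree on the cluster state; the function names are hypothetical and chosen only for this illustration.

```python
def quorum_size(monitors: int) -> int:
    """Smallest strict majority among `monitors` nodes."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """Number of monitors that may fail while a majority still remains."""
    return monitors - quorum_size(monitors)


for count in (1, 2, 3, 5):
    print(f"{count} monitor(s): quorum {quorum_size(count)}, "
          f"tolerates {tolerated_failures(count)} failure(s)")
```

Under this assumption, one or two monitors tolerate no failure at all, while three monitors tolerate the loss of one, which is why three is the smallest sensible number.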
The centerpiece of the data storage is an algorithm called CRUSH (Controlled Replication Under Scalable Hashing). It uses an allocation table called the CRUSH Map to find an OSD with the requested file.
Ceph distributes files pseudo-randomly, meaning that they appear to be placed arbitrarily. In fact, CRUSH chooses the most suitable storage location based on fixed criteria, after which the files are duplicated and saved on physically separate media. The administrator of the network can set the relevant criteria.
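The idea of a placement that looks random but is fully repeatable can be sketched in a few lines. The example below is not the CRUSH algorithm; it uses a simpler technique (rendezvous hashing) as a stand-in, and the cluster layout in OSD_HOSTS as well as the function names are assumptions made for illustration. The only criterion enforced here is that no two copies land on the same host.

```python
import hashlib

# Hypothetical cluster layout: OSD name -> host it runs on.
OSD_HOSTS = {
    "osd.0": "host-a", "osd.1": "host-a",
    "osd.2": "host-b", "osd.3": "host-b",
    "osd.4": "host-c", "osd.5": "host-c",
}


def score(key: str, osd: str) -> int:
    """Deterministic pseudo-random score for a (key, OSD) pair."""
    digest = hashlib.sha256(f"{key}:{osd}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def place(key: str, replicas: int = 3) -> list[str]:
    """Pick `replicas` OSDs for `key`, each on a different host."""
    ranked = sorted(OSD_HOSTS, key=lambda osd: score(key, osd), reverse=True)
    chosen: list[str] = []
    used_hosts: set[str] = set()
    for osd in ranked:
        host = OSD_HOSTS[osd]
        if host in used_hosts:
            continue  # criterion: copies must sit on separate hosts
        chosen.append(osd)
        used_hosts.add(host)
        if len(chosen) == replicas:
            break
    return chosen


print(place("report.pdf"))
```

Because the result depends only on the name and the layout, any client can compute the location itself instead of asking a central directory.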
Files are organized into placement groups. To assign a file to a group, its name is converted into a hash value. Another property used for organization is the number of copies kept of each file.
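A minimal sketch of how a name could be mapped to a placement group is shown below. It assumes SHA-256 and a plain modulo operation for simplicity, which is not necessarily what Ceph does internally; PG_NUM and placement_group are hypothetical names.

```python
import hashlib

PG_NUM = 128  # hypothetical number of placement groups


def placement_group(name: str, pg_num: int = PG_NUM) -> int:
    """Map a name to a placement group ID in the range 0..pg_num-1."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % pg_num


for name in ("invoice-2023.pdf", "photo.jpg", "backup.tar"):
    print(name, "->", placement_group(name))
```

Grouping files this way means the cluster only has to track the location of a limited number of groups rather than of every individual file.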
Hash values are fixed-length strings that a hash function computes from its input. A simple approach is to calculate a checksum from the raw data. In practice, however, more complex algorithms are used that create a practically unique digital fingerprint out of data of any length. The output always has the same compact length and does not contain any unwanted symbols, which also makes it suitable for processing file names.
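The fixed output length is easy to verify. The following sketch uses SHA-256 from the Python standard library purely as an example of a hash function; it is not meant to represent the function Ceph uses, and fingerprint is a hypothetical helper name.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


short_input = fingerprint(b"a")
long_input = fingerprint(b"a" * 1_000_000)

# Both digests have the same length, regardless of the input size.
print(len(short_input), len(long_input))
print(short_input == long_input)
```

The digest consists only of the characters 0-9 and a-f, so it can be used safely wherever special characters in file names would cause problems.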
To safeguard the data, journaling is used at the OSD level. Every file to be saved is first written to a journal, where it is held until it has been properly saved on the intended OSD.
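The principle of journaling can be illustrated with a toy example. The class below is a simplified sketch and not Ceph's implementation: it assumes a single local directory, stores the data as JSON, and uses hypothetical names throughout. Each write is recorded in a journal first, then applied, and the journal is cleared only once the data is safely in place. On start-up, any entries left over from an interrupted write are applied again.

```python
import json
import os
import tempfile


class JournaledStore:
    """Toy write-ahead journal: record the write first, apply it second."""

    def __init__(self, directory: str) -> None:
        self.journal_path = os.path.join(directory, "journal.log")
        self.data_path = os.path.join(directory, "data.json")
        self.replay()

    def _load(self) -> dict:
        if not os.path.exists(self.data_path):
            return {}
        with open(self.data_path, encoding="utf-8") as fh:
            return json.load(fh)

    def _apply(self, entry: dict) -> None:
        data = self._load()
        data[entry["name"]] = entry["content"]
        tmp_path = self.data_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump(data, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, self.data_path)  # atomic swap

    def write(self, name: str, content: str) -> None:
        entry = {"name": name, "content": content}
        # Step 1: store the pending write in the journal.
        with open(self.journal_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")
            fh.flush()
            os.fsync(fh.fileno())
        # Step 2: save it in its final location.
        self._apply(entry)
        # Step 3: the write is complete, so the journal can be cleared.
        os.remove(self.journal_path)

    def replay(self) -> None:
        """Re-apply journal entries left behind by an interrupted write."""
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path, encoding="utf-8") as fh:
            for line in fh:
                try:
                    self._apply(json.loads(line))
                except json.JSONDecodeError:
                    pass  # incomplete entry: the write was never confirmed
        os.remove(self.journal_path)

    def read(self, name: str) -> str:
        return self._load()[name]


store = JournaledStore(tempfile.mkdtemp())
store.write("notes.txt", "hello")
print(store.read("notes.txt"))
```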
Accessing stored data
The foundation of the Ceph storage architecture is called RADOS (Reliable Autonomic Distributed Object Store), a reliable, distributed object store made up of self-healing, self-managing, intelligent storage nodes.
There are several ways to access stored data:
- librados: Native access is possible by using the librados software libraries through APIs in programming and scripting languages, such as C/C++, Python, Java, and PHP.
- radosgw: This gateway allows data to be read or written over HTTP.
- CephFS: This is Ceph's own POSIX-compatible file system. It offers a kernel module for computers with access, and also supports FUSE (Filesystem in Userspace, an interface for running file systems in user space rather than in the kernel).
- RADOS Block Device: Integration using block storage through a kernel module or a virtual system like QEMU or KVM.
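As an illustration of the first access method, the following sketch writes and reads a single object through the Python binding for librados. It assumes that the rados module is installed, that a configuration file exists at the given path, and that the pool passed in already exists; the function name store_and_fetch, the default path, and the pool name in the usage note are placeholders, and the code cannot run without a reachable cluster.

```python
def store_and_fetch(pool: str, name: str, payload: bytes,
                    conffile: str = "/etc/ceph/ceph.conf") -> bytes:
    """Write `payload` as object `name` in `pool`, then read it back."""
    import rados  # Python binding for librados; imported here so the
                  # module loads even where the binding is not installed

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)  # I/O context bound to one pool
        try:
            ioctx.write_full(name, payload)        # replace whole object
            return ioctx.read(name, len(payload))  # read it back
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

On a machine with access to a cluster, a call such as store_and_fetch("mypool", "greeting", b"hello") would be expected to return the bytes that were written.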
Alternatives to Ceph
The most popular alternative is GlusterFS, which also belongs to the Linux distributor Red Hat and can likewise be used at no cost. Gluster follows a similar approach, aggregating distributed storage into a unified storage location within the network. Both GlusterFS and Ceph have their own pros and cons.
There are other free alternatives, such as XtreemFS and BeeGFS. Microsoft offers commercial, software-based storage solutions for Windows servers, including Storage Spaces Direct (S2D).
Pros and cons of Ceph
Ceph is a good choice in many situations, but this method for storing data also comes with some disadvantages.
Pros of Ceph
Ceph is free and well established, despite its comparatively short development history. You can find a large amount of helpful information online regarding its setup and maintenance. In addition, the application has been extensively documented by its developers. The acquisition by Red Hat makes it likely that it will continue to be developed for the foreseeable future. Its scalability and integrated redundancy provide data security and flexibility within the network. On top of that, the CRUSH algorithm helps to keep the data available.
Redundancy in this sense means “surplus.” In computer technology, it is used to designate additional, surplus data. Data redundancy is often created deliberately in order to ensure data security and reliability. This is possible on both the software and hardware levels: On the one hand, data or information that is required for data reconstruction can be stored several times in the memory; on the other hand, physically separate storage components can be made available in several places in order to compensate for the failure of individual computers.
Cons of Ceph
Due to the variety of components involved, an extensive network is required in order to make full use of Ceph’s functionality. In addition, the setup is relatively time-consuming, and the user cannot be entirely sure where the data is physically stored.