As worldwide data usage rises, more and more companies are moving away from dedicated data servers and opting instead for more holistic solutions in the form of centrally stored data networks. These are commonly implemented as storage area networks, or SANs. SAN users benefit from quick data access and comprehensive hardware redundancy.
Distributed file systems are a solution for storing and managing data that no longer fits onto a typical server. Lack of capacity can be due to more factors than just data volume. For example, if the data to be stored is unstructured, a classic file system with a fixed directory structure will not do.
Saving large volumes of data – GlusterFS and Ceph make it possible
With bulk data, the actual volume of data is unknown at the beginning of a project. Systems must therefore be easy to expand with additional servers that are integrated seamlessly into the existing storage system while it is operating. To a user, a distributed file system looks like a single conventional file system. Users are unaware that individual files, or even a large part of the overall data, may actually reside on several servers, sometimes in different geographical locations. Since GlusterFS and Ceph run as software layers on Linux operating systems, they do not place any special demands on the hardware. Linux runs on virtually every standard server and supports all common types of hard drives.
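The expandability requirement can be illustrated with a minimal Python sketch. It uses rendezvous hashing, a generic placement technique chosen here for illustration only; it is not the algorithm that GlusterFS or Ceph actually implements. The server names and file names are hypothetical.

```python
import hashlib


def score(server: str, key: str) -> int:
    """Deterministic pseudo-random score for a (server, key) pair."""
    digest = hashlib.sha256(f"{server}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(key: str, servers: list[str]) -> str:
    """Store the key on the server with the highest score for it."""
    return max(servers, key=lambda s: score(s, key))


servers = ["node-a", "node-b", "node-c"]      # hypothetical server names
keys = [f"file-{i}" for i in range(1000)]     # hypothetical file names

# Placement before and after a fourth server joins the running system.
before = {k: place(k, servers) for k in keys}
after = {k: place(k, servers + ["node-d"]) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} files moved, all of them to node-d:",
      all(after[k] == "node-d" for k in moved))
```

Under this scheme, adding a server relocates only the files that now belong on the new server. Everything else stays where it is, which is what makes expansion during operation practical.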
High availability is decisive
High availability is an important topic when it comes to distributed file systems. Hardware malfunctions must be avoided as much as possible, and the software required for operation must keep running without interruption even while new components are being added. It must be possible to perform maintenance while the system is operating, and the all-important metadata should not be stored in a single central location. Access to metadata must be decentralized, and data redundancy must be ensured at all times. A server malfunction should never compromise the consistency of the entire system. GlusterFS and Ceph are two systems with different approaches that can be expanded to almost any size and can be used to collect and search the data of large projects in one system.
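The following toy model sketches two of these requirements in Python: redundant copies of every item, and placement that each client can compute from the name alone, without a central metadata server. The `Cluster` class, the node names, and the replication factor of 2 are assumptions made for this illustration and do not reflect how GlusterFS or Ceph is implemented.

```python
import hashlib


class Cluster:
    """Toy in-memory cluster; each node is a dict of name -> bytes."""

    def __init__(self, node_names, replicas=2):
        self.nodes = {name: {} for name in node_names}
        self.down = set()          # names of nodes that have failed
        self.replicas = replicas   # assumed replication factor

    def _targets(self, name):
        # Targets are derived from the name itself, so no central
        # lookup table is consulted.
        names = sorted(self.nodes)
        start = int(hashlib.sha256(name.encode()).hexdigest(), 16) % len(names)
        return [names[(start + i) % len(names)] for i in range(self.replicas)]

    def put(self, name, data):
        # Write one copy to every target node.
        for node in self._targets(name):
            self.nodes[node][name] = data

    def get(self, name):
        # Read from the first target node that is still available.
        for node in self._targets(name):
            if node not in self.down:
                return self.nodes[node][name]
        raise IOError(f"all replicas of {name!r} are unavailable")


cluster = Cluster(["node-a", "node-b", "node-c"], replicas=2)
cluster.put("report.csv", b"id,value\n1,42\n")

# Simulate the failure of the node holding the first copy.
cluster.down.add(cluster._targets("report.csv")[0])
print(cluster.get("report.csv"))  # still readable from the second copy
```

With two copies on two different nodes, a single node failure leaves the data readable. Only when every node holding a copy is down does the read fail.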
The term “big data” is used in relation to very large, complex, and unstructured bulk data that is collected from scientific sensors (for example, GPS satellites), weather networks, or statistical sources. In addition to storage, efficient search options and the systematization of the data also play a vital role with big data.
A short introduction to GlusterFS
GlusterFS is a distributed file system with a modular design. Various servers are connected to one another using a TCP/IP network. As a POSIX (Portable Operating System Interface)-compatible file system, GlusterFS can easily be integrated into existing Linux server environments. This is also the case for FreeBSD, OpenSolaris, and macOS, which support POSIX. Integration into Windows environments can only be achieved in the roundabout way of using a Linux server as a gateway.
Functionalities of GlusterFS
In its early days, GlusterFS was a classic file-based storage system. It later became object-based, at which point particular importance was placed on optimal integration into the well-known open-source cloud solution OpenStack. GlusterFS still operates on a file basis in the background, meaning that each file is assigned an object that is integrated into the file system through a hard link. Users do not deal with dedicated servers. Instead, they save their data through their own interfaces to GlusterFS, which appears to them as one complete system.
|Advantages||Disadvantages|
|Easy integration into Linux systems||Integration into Windows systems can only be done indirectly|
|Supports FUSE (Filesystem in Userspace)|
Short introduction to Ceph
The distributed open-source storage solution Ceph is an object-based storage system that operates on binary objects, thereby eliminating the rigid block structure of classic storage media. Physically, Ceph also uses hard drives, but it has its own algorithm for managing the binary objects, which can be distributed among several servers and later reassembled.
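As a rough illustration of splitting data into objects, distributing them, and reassembling them, consider the Python sketch below. A plain hash-modulo rule stands in for the placement step; Ceph's actual algorithm (CRUSH) is considerably more involved. The object size, server names, and function names are assumptions made for this example.

```python
import hashlib

OBJECT_SIZE = 4  # bytes; deliberately tiny so the example stays readable


def split_into_objects(name, data, size=OBJECT_SIZE):
    """Cut data into fixed-size binary objects with sortable IDs."""
    return {f"{name}.{i // size:08d}": data[i:i + size]
            for i in range(0, len(data), size)}


def server_for(object_id, servers):
    """Pick a server from the object ID alone (stand-in placement rule)."""
    h = int(hashlib.sha256(object_id.encode()).hexdigest(), 16)
    return servers[h % len(servers)]


def reassemble(name, stores):
    """Collect all objects of `name` from every server and join them in order."""
    parts = {oid: chunk
             for store in stores.values()
             for oid, chunk in store.items()
             if oid.startswith(name + ".")}
    return b"".join(parts[oid] for oid in sorted(parts))


servers = ["osd-0", "osd-1", "osd-2"]     # hypothetical server names
stores = {s: {} for s in servers}         # each server's local object store
payload = b"distributed object storage"

for oid, chunk in split_into_objects("demo", payload).items():
    stores[server_for(oid, servers)][oid] = chunk

restored = reassemble("demo", stores)
print(restored == payload)
```

The zero-padded index in each object ID keeps the objects sortable, so the original byte order can be recovered no matter which server an object ended up on.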
Functionalities of Ceph
Every component is decentralized, and all OSDs (Object Storage Daemons) are equal to one another. As such, any number of servers with different hard drives can be connected to create a single storage system. Ceph can be integrated into existing system environments in several ways using three major interfaces: CephFS as a Linux file system driver, RADOS Block Devices (RBD) as Linux devices that can be integrated directly, and RADOS Gateway, which is compatible with Swift and Amazon S3.
|Advantages||Disadvantages|
|Easy integration into all systems, irrespective of the operating system being used||Weaker file system functions|
|Block device for Linux||Higher integration effort needed due to completely new storage structures|
|CephFS file system for Linux|
|Amazon S3 API|
|Seamless connection to Keystone authentication|
|FUSE module (Filesystem in Userspace) to support systems without a CephFS client|
Comparison: GlusterFS vs. Ceph
Due to the technical differences between GlusterFS and Ceph, there is no clear winner. Ceph is essentially object-based storage for unstructured data, whereas GlusterFS uses hierarchies of file system trees in block storage. GlusterFS has its origins in a highly efficient, file-based storage system that continues to be developed in a more object-based direction. In contrast, Ceph was developed as binary object storage from the start and not as a classic file system, which can make standard file system operations weaker.
|GlusterFS||Ceph|
|File system strengths||Object storage strengths|
|Quicker storage algorithm||Better performance on simpler hardware|
|No central metadata server necessary||Easy integration into all systems, no matter the operating system being used|
|Lower complexity||Block device for Linux|
|Better suitability for saving larger files (starting at around 4 MB per file)||Easier possibilities to create customer-specific modifications|
|Better suitability for data with sequential access||RADOS compatibility|
When should which system be used?
Because of its diverse APIs, Ceph works well in heterogeneous networks in which other operating systems are used alongside Linux. The strengths of GlusterFS, on the other hand, come to the forefront when storing a large quantity of classic files, including larger ones. Since Ceph was developed as an open-source solution from the very start, it could be integrated in many places earlier than GlusterFS, which only later became open source. A major application for distributed storage is cloud solutions. In this regard, OpenStack is one of the most important software projects offering architectures for cloud computing. GlusterFS and Ceph both work equally well with OpenStack.