GlusterFS is a distributed, arbitrarily scalable file system that aggregates storage from several servers into a single, uniform file system. File systems work in the background, and hardly anyone thinks about them once they have been installed. That usually changes only when data is lost or the file system reaches its limits, for example because a partition has hit its maximum size or because the depth of directory paths is restricted.
Who and what is behind GlusterFS?
The name “Gluster” is a combination of “GNU” (itself an acronym for “GNU’s not Unix!”) and “cluster.” The system was published under the GNU General Public License (GNU GPL), making it free of charge to use. In relation to data carriers, the term “cluster” describes a combination of physical storage units. In relation to computers, it indicates a connected network of several systems. GlusterFS merges these concepts by combining storage space from computers connected over a network and using it as a single logical entity.
The project was published in 2005 by Gluster Inc. In 2011, the Linux distributor Red Hat took over the company and has since continued to develop the file system. Version 7 of GlusterFS made its debut in January 2020, and precompiled packages are available for several Linux distributions.
The restriction to Unix-like systems stems from the file system’s integration with the FUSE module, which has yet to be made adequately stable for Windows systems.
FUSE stands for Filesystem in Userspace. Operating systems usually distinguish between user mode and kernel mode. Kernel mode is specially protected, and file systems normally run there, so mounting and managing drives is typically reserved for users with administrator rights. FUSE lets a file system run as an ordinary user-space program, which allows non-privileged users to mount and manage it.
A computer can act as both a server and a client. The file system can also be accessed through other supported protocols, such as NFS (Network File System) and SMB/CIFS (Server Message Block/Common Internet File System).
A distributed file system only really makes sense when several computers are connected to each other. The documentation published by GlusterFS states that at least three servers are required. However, the term “server” in this sense should not be taken literally. Virtually any kind of physical or emulated hardware can be integrated. Besides normal computers, the use of virtual machines is also feasible. This also comes with many benefits, especially with regard to flexibility.
Integrated servers act as nodes that are connected to each other over a TCP/IP network. Together they form a so-called trusted storage pool, whose storage is provided in the form of bricks. Volumes are then built from these bricks and can subsequently be mounted and used like normal data carriers. Computers that access a volume are called clients, and one machine can be both a server and a client.
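The workflow described above (pool, bricks, volume, mount) can be sketched as a small Python helper that only assembles the corresponding command lines without running them. The host names, brick path, volume name, and mount point are hypothetical placeholders, and `setup_plan` is an illustrative helper, not part of GlusterFS. The `gluster` and `mount` invocations follow the commonly documented syntax; check them against the documentation for your version before use.

```python
from typing import List


def peer_probe(host: str) -> List[str]:
    """Command that adds `host` to the trusted storage pool."""
    return ["gluster", "peer", "probe", host]


def volume_create(name: str, bricks: List[str], replica: int = 0) -> List[str]:
    """Command that builds a volume from bricks given as 'host:/path'.

    replica=0 requests a plain distributed volume; replica=N requests N copies.
    """
    cmd = ["gluster", "volume", "create", name]
    if replica:
        cmd += ["replica", str(replica)]
    return cmd + bricks


def volume_start(name: str) -> List[str]:
    """Command that brings the volume online."""
    return ["gluster", "volume", "start", name]


def mount_native(server: str, volume: str, mountpoint: str) -> List[str]:
    """Command a client would use to mount the volume through FUSE."""
    return ["mount", "-t", "glusterfs", f"{server}:/{volume}", mountpoint]


def setup_plan(hosts: List[str], brick_path: str, volume: str,
               mountpoint: str, replica: int = 0) -> List[List[str]]:
    """Ordered commands, assuming they are issued from the first host."""
    first, *others = hosts
    plan = [peer_probe(h) for h in others]            # build the pool
    bricks = [f"{h}:{brick_path}" for h in hosts]     # one brick per node
    plan.append(volume_create(volume, bricks, replica))
    plan.append(volume_start(volume))
    plan.append(mount_native(first, volume, mountpoint))
    return plan


if __name__ == "__main__":
    # Hypothetical three-node setup; nothing is executed, only printed.
    for cmd in setup_plan(["node1", "node2", "node3"], "/data/brick1",
                          "gv0", "/mnt/gv0", replica=3):
        print(" ".join(cmd))
```

Keeping the commands as argument lists means they could later be passed to `subprocess.run` unchanged, but printing them first makes it easy to review the plan.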
A special feature is the software’s tremendous scalability. Any number of nodes and bricks can be added later on, and the size of the storage space can be adjusted to new requirements. The managed storage space can grow to several petabytes.
In addition, GlusterFS provides reliability through redundancy. The risk of failure is first of all spread across several systems, which can also be physically separated from one another. RAID-like setups are possible as well. For this, a replicated volume must be created instead of the default distributed volume. Each file is then stored more than once (twice with two replicas), which corresponds to RAID mirroring.
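The difference between a distributed and a replicated volume can be illustrated with a toy placement model. This is a simplified sketch: the SHA-256-based `pick_brick` is only a stand-in and is not the hashing scheme GlusterFS actually uses, and the brick names are hypothetical.

```python
import hashlib
from typing import Dict, List


def pick_brick(filename: str, bricks: List[str]) -> str:
    """Choose one brick by hashing the file name (toy stand-in)."""
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    return bricks[int.from_bytes(digest[:8], "big") % len(bricks)]


def place_distributed(files: List[str], bricks: List[str]) -> Dict[str, List[str]]:
    """Distributed volume: every file lives on exactly one brick."""
    return {f: [pick_brick(f, bricks)] for f in files}


def place_replicated(files: List[str], bricks: List[str]) -> Dict[str, List[str]]:
    """Replicated volume: every file is copied to all bricks of the set."""
    return {f: list(bricks) for f in files}


def surviving_files(placement: Dict[str, List[str]], failed_brick: str) -> List[str]:
    """Files that still have at least one copy after `failed_brick` is lost."""
    return sorted(
        f for f, holders in placement.items()
        if any(b != failed_brick for b in holders)
    )


if __name__ == "__main__":
    bricks = ["node1:/brick", "node2:/brick"]
    files = [f"report-{i}.txt" for i in range(6)]
    print(surviving_files(place_distributed(files, bricks), bricks[0]))
    print(surviving_files(place_replicated(files, bricks), bricks[0]))
```

In the replicated case every file survives the loss of one brick, whereas in the distributed case only the files that happened to land on the remaining brick are still available.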
Redundant Array of Independent Disks (RAID) is an arrangement of physically independent hard drives that are combined into one logical drive. Depending on the objective, the focus can be on speed or on data security. Usable storage space is reduced accordingly, because data is stored several times or because additional information needed to restore files is kept.
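The capacity cost of mirroring can be estimated with a little arithmetic. The sketch below assumes that consecutive bricks are grouped into replica sets and that each set can hold only as much as its smallest brick. The function name and the sizes are illustrative, and metadata overhead is ignored.

```python
from typing import List


def usable_capacity(brick_sizes_gb: List[int], replica: int = 1) -> int:
    """Rough usable capacity in GB for a given replica count.

    replica=1 models a purely distributed volume (no redundancy).
    """
    if replica < 1:
        raise ValueError("replica must be at least 1")
    if len(brick_sizes_gb) % replica != 0:
        raise ValueError("number of bricks must be a multiple of replica")
    # Group consecutive bricks into replica sets of `replica` bricks each.
    sets = [brick_sizes_gb[i:i + replica]
            for i in range(0, len(brick_sizes_gb), replica)]
    # Each set stores every file on all of its bricks, so its capacity
    # is bounded by the smallest brick in the set.
    return sum(min(s) for s in sets)


if __name__ == "__main__":
    bricks = [1000, 1000, 1000, 1000]
    print(usable_capacity(bricks))             # distributed only
    print(usable_capacity(bricks, replica=2))  # two copies of every file
```

With four 1000 GB bricks, the distributed layout offers 4000 GB, while keeping two copies of every file halves that to 2000 GB.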
For operations within the storage space, GlusterFS provides ten predefined translators, which translate user requests into the operations to be executed. Two examples are the “storage” translator, which stores data on the local file system and controls access to it, and the “encryption” translator.
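The idea of stacking translators can be illustrated with a minimal Python model in which each layer handles a request and passes it on to the layer below. The class names are hypothetical, this is not the GlusterFS translator API, and the XOR step merely stands in for an encryption layer (it is not real encryption).

```python
from typing import Dict


class StorageTranslator:
    """Bottom of the stack: a dict stands in for the local file system."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self.files[path] = data

    def read(self, path: str) -> bytes:
        return self.files[path]


class XorTranslator:
    """Toy 'encryption' layer that transforms data on its way down and up."""

    def __init__(self, below: StorageTranslator, key: int) -> None:
        if not 1 <= key <= 255:
            raise ValueError("key must be a non-zero byte value")
        self.below = below
        self.key = key

    def _xor(self, data: bytes) -> bytes:
        return bytes(b ^ self.key for b in data)

    def write(self, path: str, data: bytes) -> None:
        self.below.write(path, self._xor(data))   # scramble, then pass down

    def read(self, path: str) -> bytes:
        return self._xor(self.below.read(path))   # fetch, then unscramble


if __name__ == "__main__":
    storage = StorageTranslator()
    stack = XorTranslator(storage, key=0x5A)
    stack.write("notes.txt", b"hello")
    print(stack.read("notes.txt"))    # what the user sees
    print(storage.files["notes.txt"]) # what actually sits on the brick
```

Because every layer exposes the same `read` and `write` methods, further layers could be added or removed without changing the code that uses the top of the stack.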
A newer function is geo-replication, which asynchronously replicates data between servers in different locations. This provides additional protection against physical threats to the servers, such as fire or theft. In this setup, one computer acts as the master and another as the slave. Data transfer is secured by SSH (Secure Shell).
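What "asynchronous" means here can be shown with a small model: writes complete locally at once and are shipped to the remote site later. This is only a conceptual sketch. The class names `PrimarySite` and `SecondarySite` (corresponding to the master and slave roles) are hypothetical, and the real feature transfers changes over SSH rather than through an in-memory queue.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class SecondarySite:
    """Remote copy that only receives changes."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}

    def apply(self, path: str, data: bytes) -> None:
        self.files[path] = data


class PrimarySite:
    """Local site: accepts writes and records them for later shipping."""

    def __init__(self, secondary: SecondarySite) -> None:
        self.files: Dict[str, bytes] = {}
        self.pending: Deque[Tuple[str, bytes]] = deque()
        self.secondary = secondary

    def write(self, path: str, data: bytes) -> None:
        self.files[path] = data             # completes immediately
        self.pending.append((path, data))   # replication is deferred

    def sync(self) -> int:
        """Ship all pending changes in order; returns how many were sent."""
        shipped = 0
        while self.pending:
            path, data = self.pending.popleft()
            self.secondary.apply(path, data)
            shipped += 1
        return shipped


if __name__ == "__main__":
    remote = SecondarySite()
    local = PrimarySite(remote)
    local.write("a.txt", b"one")
    local.write("b.txt", b"two")
    print(remote.files)   # still empty: nothing has been shipped yet
    print(local.sync())   # number of changes transferred
    print(remote.files)   # now matches the local site
```

The gap between `write` and `sync` is the window in which the remote copy lags behind, which is the trade-off asynchronous replication accepts in exchange for not slowing down local writes.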
Pros and cons of GlusterFS
We’ve compiled a few pros and cons of a distributed file system in comparison to conventional network storage in the table below:
|Pros of Gluster|Cons of Gluster|
|---|---|
|Good utilization of existing capacities|Creation of a complex network structure|
|Increased reliability|Increased administrative effort during set-up|
|Network load distribution|Fast network infrastructure required|
|Very good scalability|Additional effort required for technical security|
Applications of GlusterFS
GlusterFS essentially creates a classic storage cloud: storage space within a network is made available to connected clients. This is particularly suitable for large networks that already have enough resources available to form such a pool.
Since devices are connected via the Internet Protocol, a distributed file system is especially suitable for companies with several branch offices. In local networks, it can also make dedicated network storage unnecessary without giving up redundancy.
Would you like to work with GlusterFS yourself? IONOS has written a comprehensive GlusterFS how-to article for installing and setting up the file system.
One notable alternative to GlusterFS is Ceph, which is freely available and also offers many of the aforementioned benefits of distributed file systems. Ceph and Gluster each have their own differing pros and cons.
BeeGFS (formerly FhGFS) was developed by the Fraunhofer Society in Germany specifically for powerful computer systems. It is available free of charge and focuses on easy usability.
In the commercial sector, there are additional systems such as Storage Spaces Direct (S2D) by Microsoft. However, the use of this system is limited to fee-based, licensed Windows servers.