Less redundancy thanks to database normalization
Normalization is one of the basic concepts of relational data modelling. In the relational database model, good database design is characterized by minimal redundancy. The reason for this is that redundant data leads to semantic anomalies, which, in turn, make automatic data processing and database maintenance difficult. Normalization is a strategy to eliminate redundancies in relational databases. We'll show you how to implement database normal forms.
- What is database normalization? A definition
- Database normalization: examples of how to reconfigure a database
- Other normal forms
- Advantages and disadvantages of normalization
What is database normalization? A definition
Normalization is an approach to database design used in relational databases to avoid redundancy.
The relational database model is the most widely used concept in computerized data management. In relational databases, information is stored as records in tables related by keys. A data record consists of several value ranges that are assigned to specific attributes using table columns.
The following table shows the stored invoice data of a fictitious office equipment supplier. John Public has ordered 10 monitors, 12 mouse pads, and 1 office chair for his company. The order from Jane Doe includes 2 laptops and 2 headsets.
In the online shop database, the invoice data is assigned to the attributed invoice number (“Inv. no.”), date, customer, customer number (“Cust. no.”), address, invoice item number (“Inv. item no.”), product, product number (“Prod. no.”), quantity (“No.”), and price. Each row on the table represents one data record. This kind of data set is called a tuple.
The database section shown above is an example of poor database design. At first glance, it’s obvious that the table shows numerous redundancies. In addition, the values in the customer and address columns contain multi-value data. This is called a denormalized database. In other words, it doesn’t follow database normalization rules.
The primary disadvantage of denormalized databases is the increased memory requirement due to redundant values. In addition, attributes that contain multi-value data are difficult to read and don’t relate easily to one another.
Example: Both customers in the database section listed above are located in Springfield, Maine. However, since this information isn’t separated out, the database can’t easily be filtered by customers from the same location.
To avoid duplicate and multi-valued value ranges, three sequential database normal forms have been developed for relational database models.
A database normal form is a defined target state. Special requirements have been defined for each normal form, which must be met if this target state is to occur. A database corresponds exactly to the first, second, or third normal form if all requirements for the respective normal form are fulfilled.
Normalization is the conversion of a database table to a higher degree of normal form. Conversion to a lesser degree of normal form is called denormalization.
Database normalization: examples of how to reconfigure a database
To illustrate the conversion of a relational database into the first, second, and third normal forms, we’ll walk through the individual stages of relational database normalization using the data in the table above as an example.
First normal form (1NF)
A table in a relational database complies with the first normal form (1NF) when it fulfills the following criteria:
- All data is atomic
- All table columns contain identical values
A data set is considered atomic if each item of information is assigned to a separate data field.
In the below table of billing data, all value ranges that are either non-atomic or don’t contain equivalent data have been highlighted in red.
As demonstrated in the highlighted cells, the data in the example table fails to meet either requirement for first normal form compliance.
The following procedure should be implemented to normalize these sections:
- Divide all multi-value data into separate columns
- Check the values in each column for similarity
To convert the data records in the example table into atomic form, the customer and address fields have to be divided into the more specific attributes of first name and last name, and street address, city, state, and ZIP code, respectively.
A value is considered atomic depending on the context of its use. If it isn’t necessary to separate first and last names, a person’s full name can be considered atomic. But in practice, it’s considered best to divide multi-part values into the smallest possible units.
The result is a table that complies with the first normal form but still won’t result in efficient processing due to double values. Converting the table to the second normal form is then recommended to eliminate the redundancies.
The first normal form prescribes atomic value ranges and enables database queries. Data that is part of a non-atomic value range cannot be queried separately.
Second normal form (2NF)
A table in line with the second normal form must fulfill all the requirements of the first normal form in addition to the following:
- Each non-key attribute must be fully functional, dependent on the primary key
In the introduction, a relational database is defined as a system of individual tables that are related to each other by means of keys.
Keys are used in relational databases to uniquely identify data records (tuples). A key that allows you to uniquely name the individual lines of a database table is called a super key. Such a key can represent the values of a single column or the combined values of several columns.
In the given example, a possible super key results from the invoice number (“Inv. no.”), customer number (“Cust. no.”), and invoice item number (“Inv. item no.”) attributes, as highlighted in the table below.
However, not all information in the selected super key is required for a unique identification. A combination of invoice number and invoice item number – that is, a subset of the super key – would suffice to identify individual data records. Such keys with a minimum number of attributes are called key candidates or alternate keys.
As a rule, one key candidate per table is selected to represent the table. Sequential numbering is ideal for this. Such a key is called a primary key and specifies the sequence of the data records.
Like any key candidate, the primary key can be a one-part key or – as in the given example – a composite key. The sample table uses a composite primary key that is comprised of the invoice number and invoice item number.
To convert a database table to the second normal form, you not only need to determine the primary key and all non-key attributes, but also their relationship to one another. Follow these steps:
- Check whether all non-key attributes are entirely functionally dependent on the primary key. Such dependency only exists if all primary key attributes are necessary to uniquely identify the non-key attribute. This also means that tables with one-part primary keys automatically correspond to the second normal form if all prerequisites for the first normal form are fulfilled.
- Move all non-key attributes that are not entirely functionally dependent on the full primary key to separate tables.
Taking a close look at the example table, note that the prerequisites for the second normal form are not fulfilled because the date column is only dependent on the invoice number (“Inv. no.”), not on the invoice item number (“Inv. item no.”). The same applies to the first name, last name, street address, city, state, and ZIP code.
To convert the data table to the second normal form, all attributes entirely dependent on the invoice number have been moved to a separate table called “Invoice.”
Following normalization, the invoice number (“Inv. no.”) is found in both tables and links them together. While the attribute functions as the primary key in the “Invoice” table, it is used as a foreign key in the “Invoice item” table and is also part of the table’s composite primary key.
The link via the foreign key enables both tables to be queried together. This is called a join.
The example data now complies with the second normal form. However, it hasn’t yet been possible to completely eliminate redundancies. The goal of normalization is usually then the third normal form.
Third normal form (3NF)
If a table is to be converted to the third normal form, all prerequisites of the first and second normal form must be fulfilled as well as the following:
- No non-key attribute may be transitively dependent on a key candidate
A transitive dependency occurs when a non-key attribute is dependent on another non-key attribute and so indirectly on its key candidate.
The given database template violates the conditions of the third normal form in several places:
In the “Invoice” table, the first and last name, and the street address, city, state, and ZIP code, depend not only on the primary key (the invoice number), but also on the customer number.
In the “Invoice Item” table, the product and price attributes depend not only on the primary key, derived from the invoice number and invoice item number, but also on the product number. This specific condition also violates the third normal form.
To remove all dependencies between non-key attributes, the relevant attributes have been moved to separate tables, linked to each other by foreign keys. This results in the four normalized tables: “Invoice,” “Customer,” “Invoice item” and “Product.”
The primary key of the “Invoice” table is a sequential invoice number. Each invoice number is assigned a date of invoicing and a customer number.
The “Invoice item” table is a central table in the sample database, containing information about which products should appear on which invoice, as well as how many of the items were ordered. The sequential primary key on the “Invoice item” table is derived from the invoice number and the invoice item number. The respective products are only listed as product numbers that act as foreign keys, and link the “Invoice item” table with the “Products” table.
In the example, splitting two tables into four may not seem very efficient. And indeed, redundancies in the data of only two customers are of little importance. But imagine you want to consistently process several hundred thousand customer or product records in a relational database without contradictions. This is usually only possible with a database formula that corresponds to the third normal form.
Duplicate values in relational databases are often unavoidable. In reviewing the example as its conversion unfolds, it’s evident that the linking of database tables by foreign keys can be connected with redundancies. These are known as key redundancies.
Even if database normalization requires greater programming effort, 3NF – the third normal form – is generally regarded as the standard for relational database formulas, and is only deviated from in exceptional cases. For example, databases complying with the third normal form are sometimes denormalized to the second normal form. This is because joins across multiple tables are time-consuming for very large databases. Denormalization reduces the number of tables and with it the query time.
Other normal forms
In practice, normalization usually ends with the third normal form. The following normal forms refer to database schemata with special conditions and are then used only in exceptional cases.
Boyce Codd normal form (3.5NF)
Boyce Codd normal form is a tightening of the third normal form. For 3NF:
- No non-key attribute may be transitively dependent on a key candidate
In Boyce Codd normal form, however:
- No attribute may be transitively dependent on a key candidate unless it is a trivial dependency
Boyce Codd normal form is only relevant for database tables with several compound key candidates in which the keys overlap, i.e. if one and the same attribute is a subset of two key candidates.
Database tables complying with the third normal form without multiple key candidates automatically then represent Boyce Codd normal form.
The table below shows two key candidates, each composed of two attributes.
- Vendor number and product number
- Vendor and product number
Both keys make it possible to identify each individual data record. The only non-key attribute is the number. Since the number attribute is not transitively dependent on any of the key candidates, the table is 3NF compliant.
On the other hand, it is not compliant with Boyce Codd normal form, because there is a dependency between the vendor number (“V no.”) and vendor (“Vendor”) attributes. The vendor number attribute is transitively dependent on the key candidate that combines the vendor and product number; conversely, the vendor attribute results from the key candidate that combines the vendor number (“V no.”) and product number (“Prod. no.”).
A trivial dependency occurs when an attribute is completely functionally dependent on itself. Since this is always the case for each attribute in all database conditions, trivial dependencies correspond to the logic of a tautology.
Fourth normal form (4NF)
A database table complies with the fourth normal form if the requirements of Boyce Codd normal form are fulfilled in addition to the following:
- There are no multivalued dependencies unless they are trivial
A multivalued dependency always exists if two unrelated attributes are dependent on the same attribute, as illustrated in the example below:
The following table shows which products have been ordered per customer and to which ZIP code they must be delivered.
For example, the customer with the customer number 234 ordered articles 1-0023-D and 2-0023-D, which are to be delivered to his address at ZIP code 12345. For customer 567, articles 1-0023-D, 3-0023-D, 4-0023-D, and 5-0023-D will be delivered to the ZIP code 56789.
The data records can only be identified with a super key resulting from all three attributes – customer number, product number, and ZIP code. Since there is no non-key attribute, the database is 3NF compliant. Furthermore, as there are no non-trivial transitive dependencies, it is also 3.5NF compliant. However, there are multivalued dependencies: both the product number attribute and the ZIP code attribute are dependent on the customer number attribute, but are not related to each other.
The disadvantage of such a database design is that every time a new product is added to the customer’s record, the ZIP code must also be added, which results in redundant data.
These redundancies can be eliminated by converting the table to 4NF. To do this, you have to divide the table in such a way that there are no or only trivial multivalued dependencies. This is possible because the product number and ZIP code are in no way related.
As shown in the example, the fourth normal form eliminates redundancy caused by multivalued dependencies, in this case specifically in the ZIP code column.
In this (admittedly somewhat contrived) example, the assumption has been made that only one ZIP code applies for each customer. However, if customers have the option of ordering products for delivery to multiple locations, there would be a dependence between the product number and the ZIP code, in which case the output table would already be 4NF compliant.
Fifth normal form (5NF)
A database table is compliant with the fifth normal form if it satisfies the conditions of the fourth normal form in addition to the following:
- The table cannot be split further without losing information
Below is an example demonstrating such a case in which a company operates a TYPO3-based website and a Magento web shop. Three employees are responsible for the software projects: Mary Smith, George Miller, and Joe Davis, each with different qualifications.
The table shows which employee’s qualification applies to which software project’s requirements.
The table has only one key composed of all three attributes, meaning that it at least complies with 3NF and the Boyce Codd normal forms. Since there are no dependencies between all three attributes, the table also complies with the fourth normal form.
To check whether the table is also 5NF compliant, divide the output table “Employee qualification for project deployment” into the three tables: “Project deployment,” “Employee qualification,” and “Project requirements.”
The “Project deployment” table shows which employee is involved in which software project.
Reconstructing the output table, you can assume that each employee involved in the project will use each of his or her qualifications, provided that these are required by the respective project. However, in doing so, the information that Joe Davis worked alone in PHP programming for the TYPO3 project has been lost. This means that the output table can’t be broken down without information loss, making it compliant with the fifth normal form.
In practice, you’ll rarely come across database formulas that meet the requirements for 4NF but aren’t compliant with the fifth normal form. However, 5NF is interesting for applications in which new information is obtained from existing data.
In the example, both Mary Smith and George Miller are proficient in PHP, which they could also contribute to the TYPO3 project in the future. The company could use this information to make software development in this project more efficient.
Advantages and disadvantages of normalization
The aim of normalization is to reduce instances of double values. By transferring a database to one of the listed normal forms, the target schema benefits from less redundancy than the source schema. Normalization also makes database maintenance easier.
On the other hand, database normalization always involves storing attributes in separate tables. This may require the integration of foreign keys, which can lead to key redundancies. The primary disadvantage, however, is that in a normalized database, logically related data is no longer stored together. A join is required to merge data that has been split into different tables.
Complex information can be filtered out via database queries using joins. However, joins are more complex to implement than simple queries. This also takes much more time if joins are made using a large number of database tables.