Data mining: analysis methods for big data
Data plays a fundamental role in the e-commerce sector. To optimize their sales processes, many online stores diligently collect data. With the help of analysis tools, they compile figures on customer behavior, products, and shopping carts. But a vast collection of data alone offers no added value to an online business. Those looking to optimize sales methods and increase profit need to be able to evaluate the information efficiently. This is where an analytical approach called data mining comes into play.
What is data mining?
To arrive at a definition of data mining, it helps to unpack the image behind the term. If the output of online visitor tracking tools is viewed as a seemingly useless pile of data, data mining is the means of tapping into that pile and extracting the relevant information. Unlike actual mining operations, however, the tools are statistical methods, which allow trends and other relationships to be identified.
Data mining is considered a sub-step of the knowledge discovery in databases (KDD) process, which consists of the following steps:
- Selecting the database
- Pre-processing with the goal of data cleansing
- Transforming data into the form needed for the selected analysis method
- Analyzing the data using mathematical methods (the data mining step itself)
- Interpreting the analysis results
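The five KDD steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: the records, field names, and the `threshold` value are hypothetical, and a plain average stands in for a real analysis method.

```python
from statistics import mean

# Hypothetical tracking records; the field names are illustrative only.
RAW_ORDERS = [
    {"customer": "a", "channel": "web", "total": "20.0"},
    {"customer": "b", "channel": "web", "total": None},  # missing value
    {"customer": "c", "channel": "app", "total": "35.5"},
    {"customer": "d", "channel": "web", "total": "40.0"},
]

def select(records, channel):
    """Step 1: choose the part of the database to study."""
    return [r for r in records if r["channel"] == channel]

def clean(records):
    """Step 2: pre-process, here by dropping records with a missing total."""
    return [r for r in records if r["total"] is not None]

def transform(records):
    """Step 3: convert to the form the analysis needs (a list of floats)."""
    return [float(r["total"]) for r in records]

def analyze(values):
    """Step 4: the mathematical analysis; an average is a stand-in here."""
    return mean(values)

def interpret(average, threshold=25.0):
    """Step 5: turn the number into a statement that supports a decision."""
    return "above target" if average > threshold else "at or below target"

def run_kdd(records, channel="web"):
    """Chain the five steps in order and return the result with its reading."""
    values = transform(clean(select(records, channel)))
    average = analyze(values)
    return average, interpret(average)

if __name__ == "__main__":
    print(run_kdd(RAW_ORDERS))
```

Keeping each step as its own function makes it easy to swap in a different cleansing rule or analysis method without touching the rest of the chain.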
Ultimately, the findings from KDD can feed into the online store’s strategic focus and marketing decisions. The fields in which these insights can be applied are correspondingly diverse.
Applications of data mining
Data mining makes it possible to optimize e-commerce on a scientific basis. The large data sets that accrue form the basis for explanations and forecasts. Statistically prepared and clearly visualized, the results allow operators of online stores to identify the factors that matter for a successful online business. To this end, data mining is used to:
- Divide markets into segments
- Analyze shopping cart data
- Create consumer profiles
- Forecast contract periods
- Analyze demand
- Identify errors in the purchasing process
Data mining methods
To extract relevant business information from large data sets, a number of methods have become established that identify important relationships, patterns, and trends. The same methods are also used in statistical analysis more generally.
- Outlier detection: extreme values that stand out from the rest of the data are known as outliers. In data mining, outlier detection is used to identify atypical records. In practice, this method can, for instance, reveal credit card fraud by exposing suspicious transactions.
- Cluster analysis: a cluster is a group of objects that are similar to one another in some respect. The goal of this analysis is to segment unstructured data. To this end, algorithms search large data sets for structural similarities in order to identify new clusters. In contrast to classification (see below), cluster analysis aims to discover new ways of forming groups. A record that cannot be assigned to any cluster can be interpreted as an outlier. A classic application of cluster analysis is identifying user groups.
- Classification: while cluster analysis primarily aims to identify new groups, classification works with predefined classes. Records are assigned to these classes on the basis of matching characteristics. A decision tree is a common method for classifying data automatically. At each node, a property of the object is queried, and the result determines which node is visited next. In e-commerce, this process can be used to divide customers into different segments.
- Association analysis: an association analysis aims to identify relationships within data sets that can be formulated as inference rules. In e-commerce, this method can be used to identify how individual products within shopping carts relate to one another, along the pattern of ‘if product A is bought, then product B will also be bought’.
- Regression analysis: regression analyses help create models that explain a dependent variable through one or more independent variables. In practice, this means that a forecast of a product’s sales performance can be created by relating it to the product price and the average customer income level in a regression model.
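Two of the methods above are simple enough to sketch in a few lines. The first function scores an association rule of the form "if A is bought, then B is bought" by its support (share of all carts containing both products) and confidence (share of carts with A that also contain B). The second flags outliers with a robust rule based on the median absolute deviation. The function names, sample data, and the `cutoff` of 3.5 are assumptions made for illustration, not part of any particular tool.

```python
from statistics import median

# Hypothetical shopping carts and order totals, for illustration only.
BASKETS = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter", "jam"},
    {"bread", "butter"},
]
ORDER_TOTALS = [20, 22, 19, 21, 20, 500]

def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    n = len(baskets)
    with_a = sum(1 for b in baskets if antecedent in b)
    with_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    support = with_both / n if n else 0.0
    confidence = with_both / with_a if with_a else 0.0
    return support, confidence

def mad_outliers(values, cutoff=3.5):
    """Return values whose modified z-score exceeds the cutoff.

    The score is 0.6745 * |x - median| / MAD, where MAD is the median
    of the absolute deviations from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # The score is undefined when MAD is zero, so nothing is flagged.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > cutoff]

if __name__ == "__main__":
    print(rule_metrics(BASKETS, "bread", "butter"))
    print(mad_outliers(ORDER_TOTALS))
```

With the sample data, the rule "bread, then butter" has a support of 0.6 and a confidence of 0.75, and only the order total of 500 is flagged. A median-based rule is used here instead of the mean and standard deviation because a single extreme value can inflate the standard deviation enough to hide itself.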
The limits of data mining
Data mining employs statistical procedures that make a fundamentally objective analysis of the available data possible. However, the choice of analysis method, along with its algorithms and parameters, is rather subjective and made with specific goals in mind, which can lead to distorted results. Such effects can be reduced by outsourcing data mining to external service providers.
One of the most important factors for the quality of data mining results is the quality of the underlying data. Representative results can generally only be gained from representative data. For this reason, data mining normally requires data sets to be processed in advance, so that missing values and biases can be ironed out.
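As one small example of such advance processing, missing values can be filled in from the values that were observed. The sketch below uses the median as the fill value; this is only one possible choice, and the function name and sample figures are hypothetical.

```python
from statistics import median

def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = median(observed)
    return [fill if v is None else v for v in values]

if __name__ == "__main__":
    # Hypothetical order totals with one missing entry.
    print(impute_missing([10.0, None, 30.0, 20.0]))
```

Filling gaps keeps records in the analysis that would otherwise be dropped, but it also adds values that were never measured, so the choice of fill rule is itself one of the subjective decisions that can shape the results.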
Finally, it’s important to note that data mining only delivers results in the form of patterns and cross-connections. Answers emerge only once the analysis results are interpreted with regard to the original questions and goals.