ETL software
For data collection and integration as part of an ETL process, the open source programs Pentaho DI, Talend OS, and Jasper ETL are available.
- Pentaho DI: Also known as ‘Kettle’, the ETL tool Pentaho Data Integration (DI) is part of the larger Pentaho BI suite, but it can also be used independently of the other Pentaho components as a standalone application in a data warehouse architecture. The tool provides a graphical user interface that lets users without prior programming knowledge manage and administer ETL processes. Pentaho DI also offers a comprehensive range of processing modules for defining the individual steps of an ETL process. It supports all common database systems, and flat files such as CSV, Excel, or text files can also serve as data sources. In addition, the tool provides interfaces to proprietary BI suites from SAS and SAP as well as to analysis software such as Google Analytics.
- Talend OS: The open source ETL tool from the software provider Talend is comparable to Pentaho DI. Talend Open Studio (OS) lets users define data collection and integration processes with the help of parameterized modules (so-called ‘jobs’). The program offers interfaces for all common data sources and various data transformation functions. A map editor allows users to transfer heterogeneous raw data into a predefined target structure. As with Pentaho DI, Talend OS users without prior programming knowledge can work through a graphical user interface.
- Jasper ETL: Jasper ETL is the result of a cooperation between the software producers Jaspersoft and Talend. The ETL tool is largely based on Talend OS, the market leader among open source data integration tools. Using it in a DWH architecture is especially recommended if other BI products from Jaspersoft are already in use.
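The tools above hide the ETL steps behind graphical modules. As a rough illustration of what such a job does, the following minimal sketch extracts rows from a flat file, maps them onto a predefined target structure, and loads them into a database. It uses only the Python standard library and does not reflect the API of Pentaho DI, Talend OS, or Jasper ETL. The sample data, the `orders` table, and the function names are assumptions made for this example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from an operational system (stands in for a CSV file).
RAW_CSV = """order_id;customer;amount;order_date
1001; Alice ;19,90;03.01.2024
1002;BOB;5,00;04.01.2024
1003;alice;;05.01.2024
"""


def extract(text):
    """Extract: read semicolon-separated rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))


def transform(rows):
    """Transform: clean values and map them onto the target structure."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # skip records that carry no amount
        day, month, year = row["order_date"].strip().split(".")
        cleaned.append((
            int(row["order_id"]),
            row["customer"].strip().title(),          # normalize spelling
            float(row["amount"].replace(",", ".")),   # decimal comma to point
            f"{year}-{month}-{day}",                  # DD.MM.YYYY to ISO date
        ))
    return cleaned


def load(conn, records):
    """Load: write the cleaned records into the (hypothetical) target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```

In a real ETL tool, each of the three functions would roughly correspond to one or more configurable modules, and the in-memory SQLite database would be replaced by the data warehouse.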
OLAP applications
Pentaho Mondrian and Jedox are two established OLAP tools under an open source license.
- Pentaho Mondrian: Mondrian is a Java-based OLAP server. Originally developed as a separate open source project, Mondrian has been part of the Pentaho BI suite since 2006, but users still have the option of using it as a standalone application. Mondrian is also used in the BI solutions of other open source providers, such as Jaspersoft. Users benefit from a bundling of open source resources that makes joint projects possible, like the Mondrian Schema Workbench or the OLAP4J interface. The Mondrian project follows the relational online analytical processing (ROLAP) approach: relational databases form the data basis, with tables organized as star or snowflake schemata. Access takes place through multidimensional queries written in MDX (Multidimensional Expressions), either via XML for Analysis (XMLA) or via the Java interface OLAP4J. The Mondrian Schema Workbench additionally provides a graphical user interface with which Mondrian schemata can be conveniently developed and tested on a desktop.
- Jedox: With its BI suite of the same name, the software developer Jedox offers a complete solution for business intelligence and performance management applications. A central component of this software is a high-performance, in-memory OLAP server, which can be integrated by way of interfaces for Java, PHP, C/C++, or .NET. Of particular use to small and medium-sized enterprises (SMEs) are the Excel add-ins, which connect the OLAP server to Microsoft's well-known spreadsheet software. Office applications are very common among small and medium-sized businesses, where they often act as the basis for data storage. Excel integration therefore reduces the time and effort spent on employee induction and training.
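To make the ROLAP idea more concrete, the following sketch stores a tiny star schema in a relational database and answers a multidimensional question with a plain SQL aggregation. It is an illustration only: it does not use Mondrian, MDX, or OLAP4J, and the table names, sample figures, and the `revenue_by` helper are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A minimal star schema: one fact table surrounded by two dimension tables.
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    revenue    REAL
);
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Hardware'), (2, 'Mouse', 'Hardware'), (3, 'Office Suite', 'Software');
INSERT INTO dim_date VALUES (1, 2023, 'Q4'), (2, 2024, 'Q1');
INSERT INTO fact_sales VALUES
    (1, 1, 1200.0), (2, 1, 25.0), (3, 1, 300.0), (1, 2, 1100.0), (3, 2, 450.0);
""")

# Dimension levels that may be used for grouping (whitelist, so no user
# input is ever interpolated into the SQL text).
LEVELS = {
    "category": "p.category",
    "name": "p.name",
    "year": "d.year",
    "quarter": "d.quarter",
}


def revenue_by(conn, *levels):
    """Aggregate revenue along the given dimension levels."""
    cols = ", ".join(LEVELS[level] for level in levels)
    sql = f"""
        SELECT {cols}, SUM(f.revenue)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_date d    ON d.date_id = f.date_id
        GROUP BY {cols}
        ORDER BY {cols}
    """
    return conn.execute(sql).fetchall()


by_category = revenue_by(conn, "category")               # coarse view
by_category_year = revenue_by(conn, "category", "year")  # drill down by year
print(by_category)
print(by_category_year)
```

Adding a further level to the call corresponds to a drill-down, while removing one corresponds to a roll-up. A ROLAP server generates comparable SQL from multidimensional queries, although the exact statements it produces will differ.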
Data mining
Several products are also available under an open source license in the area of data mining. Two of these are RapidMiner and Weka.
- RapidMiner: The analysis platform RapidMiner from the software company of the same name provides users with an integrated environment for machine learning, sentiment and time series analysis, and forecast models, as well as for data, text, and web mining. The product follows an open core model. Support covers all steps of the data mining process, including data processing, visualization, validation, and optimization. Users for whom the free Community version, limited to one logical processor and an analysis scope of 10,000 data records, is not sufficient can switch to the fee-based Enterprise license. The program is written in Java and provides a user interface with which an analysis workflow can be defined and carried out in just a few mouse clicks.
- Weka: Weka (Waikato Environment for Knowledge Analysis) is an open source project from the University of Waikato in New Zealand. The analysis tool offers users various machine learning algorithms. Alongside classic data mining methods like classification, association, regression, or cluster analysis, Weka also features various components for data preprocessing and visualization. The program, which is written in Java, offers a graphical user interface, and all software features can also be run from the command line. If required, Weka can also be integrated into other software solutions via a Java interface.
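As a small illustration of one of the data mining methods mentioned above, the following sketch computes support and confidence for a single association rule over a handful of shopping baskets. It is a hand-written example in plain Python and is not the implementation used by RapidMiner or Weka. The sample transactions and function names are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]


def support_counts(transactions):
    """Count how often each single item and each item pair occurs."""
    counts = Counter()
    for basket in transactions:
        for item in basket:
            counts[frozenset([item])] += 1
        for pair in combinations(sorted(basket), 2):
            counts[frozenset(pair)] += 1
    return counts


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    counts = support_counts(transactions)
    both = counts[frozenset([antecedent, consequent])]
    support = both / len(transactions)                  # share of all baskets
    confidence = both / counts[frozenset([antecedent])]  # share of antecedent baskets
    return support, confidence


support, confidence = rule_metrics(transactions, "bread", "butter")
print(f"bread -> butter: support={support:.2f}, confidence={confidence:.2f}")
```

Dedicated tools apply the same measures to far larger data sets and search the space of candidate rules automatically instead of evaluating a single, manually chosen rule.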
Reporting systems
Two recommended open source tools in the area of reporting systems are BIRT and SQL Power Wabit. Alongside the classic monthly, quarterly, and annual reports, they also offer ad hoc functions that allow you to compile relevant information in real time.
- BIRT: BIRT (Business Intelligence and Reporting Tools) is an open source project from the non-profit Eclipse Foundation that provides BI reporting functions for rich clients and web applications. The software is suitable for Java-based applications and covers the broad areas of data visualization and reporting. Designs for BIRT reports are created in a graphical interface, which is based on the open source development environment Eclipse, and are then saved as XML files.
- SQL Power Wabit: With the reporting tool SQL Power Wabit, users can compile reports based on classic database queries. OLAP cubes are only supported if a description of the data structure is present. The tool supports standard reports, ad hoc queries, user-defined overview pages, and drill-down operations in the context of online analytical processing. Functions like drag-and-drop controls, real-time updating of result reports, a global search function, and a WYSIWYG editor for drafting reports make SQL Power Wabit suitable even for users without SQL skills. Comprehensive reports can be compiled in just a few mouse clicks and, if required, personalized in terms of fonts, text color, and layout.
Integrated BI solutions
Apart from the fee-based BI suites from established providers like SAP, Oracle, IBM, SAS, HP, and Microsoft, there are also software projects on the open source market that offer users data warehousing solutions as an integrated program package. This digital guide recommends Pentaho CE, Jaspersoft, and SpagoBI.
- Pentaho Community Edition (CE): Alongside a selection of in-house developments, the Pentaho BI package contains a number of previously existing open source projects, which were acquired bit by bit and integrated into the product portfolio. The main focus of the project is on data integration and the automation of reports. The following programs are featured in the package:
- Pentaho Business Analytics Platform: The BA Platform is a web application that makes it possible for users to merge all information through a central platform.
- Pentaho Data Integration: Pentaho DI refers to the ETL tool described above.
- Pentaho Report Designer (PRD): PRD is the evolved version of the project JFreeReport. The open source reporting solution supports a range of output formats like PDF, Excel, HTML, plain text, Rich Text Format (RTF), XML, and CSV.
- Pentaho Marketplace: The Marketplace allows users to extend the Pentaho platform with plug-ins in just a few clicks.
- Pentaho Aggregation Designer (PAD): Through PAD, users can set up and adjust database content. Central to this tool is the OLAP server Mondrian.
- Pentaho Schema Workbench (PSW): This is a graphical design interface that allows users to create and test schemata for Mondrian OLAP cubes.
- Pentaho Metadata Editor (PME): PME supports the detailed description of underlying data structures with the help of an XML file.
Pentaho Enterprise Edition (EE) is a fee-based version of the BI suite with a range of additional features and professional support.
- Jaspersoft: Jaspersoft also offers various DWH applications as part of an integrated BI solution. The collection of programs includes:
- JasperReports Server: This is a report server offering OLAP functions via an adjusted Mondrian server.
- JasperReports Library: Jaspersoft provides a library for generating reports.
- Jaspersoft Studio: This is the editor provided by the BI suite for designing reports.
- Jaspersoft ETL: The Talend OS-based ETL tool has already been described above.
- Mobile BI: Mobile BI is a native app for iPhone and Android devices that lets users access reports and dashboards from mobile devices.
Jaspersoft also offers the possibility of a further range of functions through its fee-based, commercial version.
- SpagoBI: Unlike Pentaho and Jaspersoft, which market their products under a dual license, the IT initiative SpagoWorld offers only an open source solution. However, business users do have the option of paying for professional setup and customization of the software. The program is made up of the following components:
- SpagoBI Server: At the core of this open source resource is the SpagoBI server that provides all of the various analysis tools and functions.
- SpagoBI Studio: The program includes an integrated development environment.
- SpagoBI Meta: SpagoBI Meta offers users an environment for metadata management.
- SpagoBI SDK: Through the SpagoBI SDK, the SpagoBI suite has an integration layer that makes it possible to incorporate various external tools, e.g., Talend OS (ETL); Jedox or Mondrian (OLAP); Weka or R (data mining); and BIRT or JasperReports Library (reporting systems).
Data retention
When it comes to data retention, users also have a choice of various open source alternatives to proprietary database management systems like Microsoft SQL Server, IBM DB2, or the solutions offered by Oracle and Teradata. Prominent data stores include the relational database systems MySQL and MariaDB as well as the object-relational database PostgreSQL. Last but not least, Pivotal offers Greenplum Database, an optimized further development designed specifically for data warehouse architectures and available under an open source license.