Big data for databases and reporting

Big data is the key technical challenge of recent years. Big data technology makes solutions possible that previously seemed conceivable at best in a few niches of research. The driving force behind this development is the fact that more data has been accumulated in recent years than in the whole of previous human history - and that this volume of data continues to grow exponentially every year.

This raises the question of why the development did not happen earlier - after all, there has always been "a lot of data". On closer inspection, however, it turns out that technical developments of recent years have made big data possible in the first place, because big data typically requires storage and processing solutions that differ fundamentally from those traditionally used.

Processors are no longer getting faster

Since the sixties of the last century, hardware development has been determined by Moore's Law, which states that the number of transistors on a chip doubles every 12 to 24 months. For many years this also meant exponential growth in CPU clock rates: processors became faster year after year, and clock speed (MHz, later GHz) was a key feature of new processors.

As the figure shows, this trend was broken in the middle of the last decade. The clock speed of new processors is no longer increasing. Moore's Law, however, remains unbroken: the number of transistors per chip continues to grow exponentially. Chip designers now use the growing transistor budget to build multi-core CPUs. A single chip no longer contains one ever-faster CPU - instead, it contains multiple computing cores.

The consequence: Parallel processing is necessary

Whereas hardware development used to mean that even old software automatically ran faster on new computers, this is no longer true today: to make full use of multi-core processors, work has to be processed in parallel. Older applications, by contrast, were typically programmed sequentially. This is especially true for database interfaces, where sequential execution is often enforced to ensure the consistency of changes. For example, all update transactions in SAP R/3 were processed in a single thread. Parallel processing therefore also means a new way of developing applications.
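By way of illustration, the following Python sketch distributes a CPU-bound aggregation over all available cores using the standard multiprocessing module; the function and variable names are invented for this example, and the sequential equivalent would simply be sum(x * x for x in data).

```python
# Minimal sketch: spreading a CPU-bound aggregation over all cores.
# Names (chunk_sum, data) are illustrative only.
from multiprocessing import Pool, cpu_count

def chunk_sum(chunk):
    # CPU-bound work on one slice of the data
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(10_000_000))
    cores = cpu_count()
    size = len(data) // cores + 1
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(cores) as pool:
        partial = pool.map(chunk_sum, chunks)   # one task per core
    print(sum(partial))
```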

Memory will continue to become cheaper

Storage media have followed a decades-long trend as well. While one megabyte of main memory on core memory boards cost over five million US dollars in the sixties, the same capacity on a current 16-gigabyte DIMM costs just half a US cent.

Fifteen years ago, that was roughly the price of one megabyte of hard disk capacity (0.58 US cents on a Maxtor M96147U8). It has therefore become affordable to keep data in main memory that was previously relegated to the hard disk. In principle, it is now economically feasible to hold all of a company's data in in-memory databases.

This first of all represents an enormous speed advantage: access times in RAM are about 100,000 times lower than on hard drives. This is comparable to the difference between exchanging information by telephone (5 seconds) and by letter (6 days, about 518,400 seconds).

In addition, RAM offers flexible access to the stored data. A hard drive with rotating platters and a movable read head, by contrast, must read data in segments that are as large as possible in order to avoid time-consuming repositioning of the head.

Column-oriented databases

A simple in-memory database is, at first, no more efficient than a traditional relational database equipped with a correspondingly large cache. However, the flexible access to main memory opens up further opportunities for performance improvements.

Traditional databases are static and row-oriented: the values of a record are stored consecutively (on disk or in memory). The same amount of space is reserved for each field of a record, even if its value is empty (NULL), and a read generally fetches the whole record.

Alternatively, a database can store its data column-oriented, so that the values of a column lie consecutively in memory. The flexible memory access allows compact storage - empty cells require only a minimum of space - and data compression can reduce memory consumption even further. Column orientation also makes it easier to parallelise typical selections across multiple CPU cores: once the data has been loaded into a column store, the task of identifying the relevant values can be divided among the cores.
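To make the idea concrete, here is a deliberately simplified sketch of a column-oriented layout in Python - not the implementation of SAP HANA or HBase, just an illustration with invented table and column names: each column is a separate array, a small dictionary encoding compresses the repetitive column, and a selection only touches the columns it actually needs.

```python
# Minimal sketch of a column store: one array per column instead of one
# record per row. Names and data are illustrative only.
orders_rows = [
    {"id": 1, "country": "DE", "amount": 120.0},
    {"id": 2, "country": "DE", "amount": 80.0},
    {"id": 3, "country": "FR", "amount": 200.0},
]

# Column-oriented layout: the values of one column lie consecutively.
ids     = [r["id"] for r in orders_rows]
amounts = [r["amount"] for r in orders_rows]

# Simple dictionary encoding compresses the repetitive country column.
country_dict  = sorted(set(r["country"] for r in orders_rows))        # ["DE", "FR"]
country_codes = [country_dict.index(r["country"]) for r in orders_rows]

# A selection such as "sum of amounts for DE" touches only two columns
# and never reads the id column at all.
de = country_dict.index("DE")
total_de = sum(a for a, c in zip(amounts, country_codes) if c == de)
print(total_de)   # 200.0
```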

These mechanisms are used in the SAP HANA database, for example. They also form the basis of Apache HBase, a popular big data database built on Hadoop.

These technical solutions not only allow huge amounts of data to be processed with high performance; they also enable new approaches to typical tasks in traditional IT, for example in the way databases are used.

The consequence: Hybrid database usage

In the use of databases, a distinction is currently made between OLTP (On-Line Transactional Processing) and OLAP (On-Line Analytical Processing). In OLTP, the database serves as the store in which an application keeps its data, either because the data can no longer be held in main memory or because it must remain permanently available. OLAP refers to the analysis of data for reporting. Traditionally, two separate databases are used for OLTP and OLAP. This prevents the application from being slowed down by long-running reports, makes it easier to combine data from multiple systems, and facilitates the archiving of historical states.

The main drawback of this approach is that two different data models have to be designed, developed, and maintained. In addition, the data must be transferred from one database to the other; these ETL applications (Extract, Transform, Load) must likewise be designed, developed, and maintained. Moreover, the run times of such ETL jobs are often considerable, which is why they are typically executed at night so as not to put additional load on the application database. The data in the OLAP database are therefore never fully up to date.
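A heavily simplified sketch of such an ETL step might look as follows; SQLite stands in for both databases so that the example is self-contained, and the table and column names are invented.

```python
# Minimal ETL sketch: copy data from an "OLTP" database into an "OLAP"
# reporting database. SQLite keeps the example self-contained; table and
# column names are invented for illustration.
import sqlite3

oltp = sqlite3.connect(":memory:")   # stands in for the application database
olap = sqlite3.connect(":memory:")   # stands in for the reporting database

oltp.execute("CREATE TABLE orders (id INTEGER, country TEXT, amount REAL)")
oltp.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "DE", 120.0), (2, "DE", 80.0), (3, "FR", 200.0)])

# Extract ...
rows = oltp.execute("SELECT country, amount FROM orders").fetchall()

# ... Transform (aggregate per country) ...
totals = {}
for country, amount in rows:
    totals[country] = totals.get(country, 0.0) + amount

# ... Load into the separate reporting schema.
olap.execute("CREATE TABLE sales_by_country (country TEXT, total REAL)")
olap.executemany("INSERT INTO sales_by_country VALUES (?, ?)", totals.items())

print(olap.execute("SELECT * FROM sales_by_country ORDER BY country").fetchall())
```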

The performance of column-oriented in-memory databases makes it technically possible to abandon this separation of OLTP and OLAP. The increased efficiency allows a single in-memory database to meet reporting requirements without impairing the application. This is called "Hybrid Transactional Analytical Processing" (HTAP).
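In contrast to the ETL sketch above, a hypothetical HTAP setup keeps only one store: transactional writes and reporting queries run against the same live data. Again, SQLite merely keeps the example self-contained and the names are invented.

```python
# Minimal HTAP sketch: the same in-memory store serves transactional
# inserts and analytical queries, with no ETL in between.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, country TEXT, amount REAL)")

# Transactional side: the application writes orders as they occur.
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, "DE", 120.0), (2, "FR", 200.0)])
db.commit()

# Analytical side: reporting runs directly on the live data,
# so the result always reflects the latest transactions.
print(db.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country"
).fetchall())
```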

Big data technology at S&N

S&N is actively analysing and evaluating these new technologies and has already gained practical experience with them in several customer projects. In addition, S&N is driving development forward as part of a ZIM project funded by the Federal Ministry of Economics.

Text Analysis

In a joint project with the University of Paderborn, S&N is working on a tool for the acquisition and analysis of texts. In the project, which is funded by the Federal Ministry of Economics through the Central Innovation Programme for SMEs, a library of standard methods for text analysis has, among other things, been compiled and integrated into the tool.

The information contained in texts cannot readily be processed by a computer without preparation. One way of preparing texts for computer processing is to extract statistical properties - such as the frequency of words or word groups. Alternatively, word stems and the base forms of words can be used, so that different inflected forms of the same word are grouped together.
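The following sketch shows this kind of statistical preparation in its simplest form - word frequencies plus a deliberately naive stemmer that strips a few English suffixes. It is a generic illustration, not the library developed in the project.

```python
# Word frequencies plus a crude suffix-stripping stemmer that groups a
# few English inflections. Purely illustrative.
from collections import Counter
import re

def naive_stem(word):
    # Deliberately naive suffix stripping, for illustration only.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

text = "Reports are reported daily; the reporting of reports never stops."
words = re.findall(r"[a-zA-Z]+", text.lower())

word_freq = Counter(words)                         # frequency of word forms
stem_freq = Counter(naive_stem(w) for w in words)  # forms grouped by stem

print(word_freq.most_common(3))
print(stem_freq.most_common(3))
```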

In the project, a collection of standard procedures for these kinds of analyses was compiled into a library. To organise the processing, an overarching type system was developed. This type system also includes types for more complex analyses that identify the mood of a text (sentiment analysis) or the mood towards individual subtopics of a text (aspect-based sentiment analysis). Implementing such methods is one of the next milestones in the project.
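To indicate how a sentiment analysis can work in its most basic form, here is a common lexicon-based baseline with invented word lists; it is not the method being developed in the project.

```python
# Simplest possible lexicon-based sentiment baseline: count positive and
# negative words from small, invented word lists. Illustrative only.
POSITIVE = {"good", "great", "fast", "reliable"}
NEGATIVE = {"bad", "slow", "broken", "unreliable"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("The new reporting is fast and reliable"))   # positive
print(sentiment("The export is slow and often broken"))      # negative
```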

Contact: Dr Klaus Schröder