What is “big data”?
The first known use of the term “big data” can be found in a publication from 1997. NASA scientists described “quite large data sets” that were posing interesting challenges to the computer systems of the time, since these data sets would reach the limits of available memory, local disk storage and even external memory.
Of course, these limits to available memory, local disk storage and external memory have shifted significantly since 1997. But the notion of big data as expressed back then still remains relevant:
Wikipedia describes big data as data sets that are too large or too complex to be processed with traditional data-processing software.
Better than this purely negative definition, however, is a description of big data in terms of its particular challenges. Several popular descriptions use handy formulations such as the three (or more) V’s:
- Volume: the sheer volume of data makes it difficult or impossible to process data using traditional methods.
- Velocity: the speed with which data is collected or needs to be evaluated is the main challenge here.
- Variety: big data sources, unlike conventional data sources, often do not deliver data in a single, defined format. The format to be evaluated may not even have been designed for processing by computer (free text, for example). The evaluation cannot simply discard divergent data formats; instead, it needs to find ways to process them.
- Veracity: asks whether the data to be evaluated is reliable. Data sources can include random and systematic errors.
- Value: not all existing data necessarily helps to solve a problem. Big data must therefore develop methods to extract the relevant data.
Machine learning makes it possible to extract models from large, unstructured datasets and use these models to generate predictions or take decisions. The method for determining the predictions or decisions is not defined by the developer; instead, the machine learns the method by analysing the raw data. This is why this technique is often used in the big data context.
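A minimal sketch of this idea, assuming nothing about the tools involved, is a one-nearest-neighbour classifier in plain Python: the decision rule is never written by hand, it is implied entirely by the labelled examples.

```python
import math

def nearest_neighbour_predict(examples, query):
    """Predict the label of `query` by copying the label of the
    closest training example (one-nearest-neighbour)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    features, label = min(examples, key=lambda ex: dist(ex[0], query))
    return label

# Toy data: (features, label). The "method" is not programmed by the
# developer; it emerges entirely from the examples.
training = [
    ((1.0, 1.0), "returns product"),
    ((1.2, 0.9), "returns product"),
    ((8.0, 9.0), "keeps product"),
    ((9.1, 8.5), "keeps product"),
]

print(nearest_neighbour_predict(training, (1.1, 1.0)))  # returns product
print(nearest_neighbour_predict(training, (8.5, 9.2)))  # keeps product
```

Changing the training data changes the learned behaviour without touching a single line of code, which is exactly the shift in responsibility the paragraph above describes.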
Typical questions addressed by machine learning include:
- Regression: predicting a value, for example, a stock price or the midday temperature on the basis of past values.
- Binary classification: predicting a simple yes or no, for example, whether a customer will return a purchased product or whether a business partner is creditworthy.
- Multi-class classification: a case is assigned to one of several classes, for example, by identifying the language of a text or determining a text’s topic.
- Ranking: this determines the arrangement of cases according to certain criteria. Typical applications involve user-specific arrangement of search results or products offered.
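As a toy illustration of the regression case, a least-squares line fitted to past values can predict the next one; real applications would of course use far richer features and models:

```python
def fit_line(values):
    """Ordinary least squares fit of y = a + b*t to a series,
    where t = 0, 1, 2, ... is the time index."""
    n = len(values)
    ts = range(n)
    mean_t = sum(ts) / n
    mean_y = sum(values) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, values))
    var = sum((t - mean_t) ** 2 for t in ts)
    b = cov / var
    a = mean_y - b * mean_t
    return a, b

def predict_next(values):
    """Regression in miniature: predict the next value from past ones."""
    a, b = fit_line(values)
    return a + b * len(values)

temps = [18.0, 19.5, 21.0, 22.5]   # toy midday temperatures, rising linearly
print(predict_next(temps))          # 24.0
```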
Some machine learning methods provide additional information beyond a mere decision or prediction. Access to the rules generated by machine learning makes it possible to check the plausibility of the machine’s learned results. Other learning methods provide additional information on the reliability of the learned results.
Big-Data Development at S&N
Big data and the availability of large amounts of data have dramatically expanded what IT can do. Data is no longer used for just one purpose. Being able to access both internal and external sources allows existing data to be analysed from many different angles. The usable formats are likewise becoming more diverse. The focus is no longer primarily on numbers (amounts, quantities, performance indicators, account numbers, etc.); instead, the analysis of other formats (audio, text, images, human utterances) plays a role as well.
Big data thus raises completely new questions about the scope of automated analyses and offers the opportunity to find completely new answers to old questions, such as assessing the reputation of your company or its brands, estimating the future price performance of a security, or assessing the creditworthiness of a business partner. The basis for these analytical decisions has traditionally been historical data or estimates purchased from pollsters, credit agencies and the like.
Researching and analysing text-based sources online provides an alternative to these traditional data-driven answers. In recent years, the availability of such texts has exploded. On the one hand, this expands the opportunities for research because more data is available; on the other hand, the sheer amount of data makes it next to impossible to acquire, filter and analyse all of the relevant sources manually.
One solution is to use big-data processes. These methods have become a major driver of innovation and their technical foundations are now well advanced. They offer particular advantages when large numbers of similar analyses need to be performed repeatedly. To date, however, the available software has either been restricted to very narrow applications (estimating a reputation, for example) or is so complex that it can only be implemented with major effort and detailed expert knowledge.
S&N has therefore decided to develop, in cooperation with the University of Paderborn, a software tool that supports the analysis and use of text data through smart-data processes tailored to the particular application. The project is funded by the Central Innovation Programme for SMEs (ZIM) of the German Ministry of Economics.
At the core of the tool are solutions for the four essential steps that we have identified in the creation of such text-based methods:
- acquisition of internet sources
- filtering of relevant sources
- extraction of the relevant text properties
- determining the actual analytic function by machine learning
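These four steps can be sketched as a pipeline. The function names and the simple keyword filter below are illustrative assumptions, not the tool’s actual API; in particular, the “learned” analytic function is stood in for by a fixed word list:

```python
def acquire_sources():
    # Step 1 (acquisition): in practice, crawl web pages, feeds or
    # social-media APIs; here, a hard-coded toy corpus.
    return [
        "Y AG is an attractive place to work",
        "The weather today is sunny",
        "I find product X pretty lame",
    ]

def filter_relevant(texts, keywords):
    # Step 2 (filtering): keep a source if any keyword
    # (company, brand or product name) appears in it.
    return [t for t in texts if any(k.lower() in t.lower() for k in keywords)]

def extract_features(text):
    # Step 3 (extraction): a minimal text property, the bag of words.
    return set(text.lower().split())

def analyse(features):
    # Step 4 (analytic function): a stand-in for a trained model.
    positive, negative = {"attractive"}, {"lame"}
    score = len(features & positive) - len(features & negative)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

for text in filter_relevant(acquire_sources(), ["Y AG", "product X"]):
    print(text, "->", analyse(extract_features(text)))
```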
The tool is targeted at companies that want to investigate questions with big-data processes but are not themselves experts in data analysis. A key challenge, consequently, was the design of the tool’s user interface. In addition, we needed to investigate approaches to accountability and visualisation.
The goal of reputation analysis is to make statements about the reputation of companies, products or brands. To do this, social media sources are analysed, in particular, Facebook and Twitter. In addition, discussion forums, newsgroups and press coverage are included in the analysis. Determining relevant sources is relatively simple in these cases: any source where the company, brand or product name appears is relevant.
Reputation analysis starts with simple collocation analyses to determine which evaluative terms are mentioned in relation to the company or brand, for example “I find product X pretty lame” or “Y AG is an attractive place to work.” The evaluative terms (“lame”, “attractive”) are then ranked in simple word lists. These word lists can also contain inflected forms (not just “attractive” but also “more attractive”); alternatively, each word is reduced to its root before being assigned to the word list.
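A minimal sketch of such a collocation analysis counts the words that appear within a few tokens of each product mention; a real system would add lemmatisation, stop-word removal and a sentiment lexicon:

```python
from collections import Counter

def collocations(texts, target, window=3):
    """Count words occurring within `window` tokens of each mention
    of `target` -- a very simple collocation analysis."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[tokens[j]] += 1
    return counts

posts = [
    "i find product x pretty lame",
    "product x is really lame",
    "product x looks attractive",
]
counts = collocations(posts, "x")
print(counts["lame"], counts["attractive"])  # 2 1
```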
This simple process is then often complemented with a trend analysis that can show, for example, that the attribute “lame” has increased significantly in recent weeks, thus indicating emerging issues. In-depth analyses can also take into account the authors of the various texts in order to identify those opinion-makers whose utterances reach a particularly wide audience.
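The trend component can be sketched as a simple comparison of a term’s recent weekly frequency with its earlier average (the weekly counts below are toy numbers, not real measurements):

```python
def trend_ratio(weekly_counts, recent_weeks=2):
    """Ratio of a term's average frequency in the most recent weeks
    to its average over the preceding weeks; values well above 1
    flag a spike worth investigating."""
    recent = weekly_counts[-recent_weeks:]
    earlier = weekly_counts[:-recent_weeks]
    recent_avg = sum(recent) / len(recent)
    earlier_avg = sum(earlier) / len(earlier)
    return recent_avg / earlier_avg if earlier_avg else float("inf")

# Weekly mention counts of "lame" alongside the brand (toy data).
lame_counts = [3, 2, 4, 3, 9, 12]
print(f"{trend_ratio(lame_counts):.1f}x")  # 3.5x -- an emerging issue
```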
Overall, reputation analysis is one of the best-known big-data processes and has accordingly seen wide use because it answers an important question in a traceable manner. At the same time, it is limited to what is publicly discussed by name. Methods capable of handling more complex situations have yet to appear.