The concepts of a data lake and a process data warehouse (PIMS, Historian) are often treated as synonyms, and even professionals confuse them. The reason is their shared purpose: collecting and storing data. That, however, is the only thing they have in common. The two systems differ significantly, from their architecture to the tasks they are built for.
Three key differences between a process data warehouse and a data lake are:
Criterion | Data lake | Process data warehouse |
Data structure | Raw data | Processed data |
Purpose of data | Not known in advance how and when it will be used | Needed to solve specific problems |
Users | Data engineers | Managers, business analysts |
Let’s take a closer look at the key differences:
1. Raw or processed data
First, let’s understand what is meant by raw data. This is data from multiple controllers and systems, stored in its original format. It is this data that a data lake collects. The main risk is that the lake turns into a swamp, from which only a “muddy mire of numbers” can be pulled instead of the “golden fish of knowledge”. Another disadvantage is that data lakes place higher demands on server capacity and data transmission networks: a lake needs more disk space than a data warehouse.
Process data warehouses (PIMS, Historian) work differently. They are built from the start around a clearly defined production information model, in which each object and parameter is assigned a specific data set with specified accuracy characteristics and processing algorithms. There are no random indicators in the repository. All data acquisition channels, together with the rules for verifying and mathematically processing the data, are defined when the goals of the repository are set and the unified information model is formed. This leads to the main drawback of warehouses compared with lakes: structuring the data and standardizing the rules for processing and storing it is highly labor-intensive.
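To make the idea of an information model more concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular PIMS or Historian product works: the tag names, limits, units and the `process_sample` function are all hypothetical. It shows the principle described above: a parameter is accepted only if it is defined in the model, and it is verified and rounded according to rules fixed in advance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TagDefinition:
    """One parameter in a hypothetical production information model."""
    name: str      # unique parameter name
    unit: str      # engineering unit
    low: float     # lowest plausible value
    high: float    # highest plausible value
    decimals: int  # precision with which the value is stored


# Hypothetical model: only parameters listed here are ever stored.
MODEL = {
    "boiler1.steam_pressure": TagDefinition(
        "boiler1.steam_pressure", "bar", 0.0, 40.0, 1),
    "boiler1.feedwater_temp": TagDefinition(
        "boiler1.feedwater_temp", "degC", 0.0, 250.0, 1),
}


def process_sample(tag: str, value: float) -> Optional[float]:
    """Return the value to store, or None if the sample is rejected."""
    definition = MODEL.get(tag)
    if definition is None:
        return None  # parameter is not part of the model
    if not (definition.low <= value <= definition.high):
        return None  # fails the verification rule (out of range)
    return round(value, definition.decimals)
```

In this sketch, `process_sample("boiler1.steam_pressure", 12.34)` is stored as `12.3`, while an out-of-range value or an unknown tag is rejected. Defining such rules for every parameter is exactly the labor-intensive part mentioned above.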
2. Purpose of data
If a company does not know which specific tasks it will solve with its data, the choice usually falls on the lake. A lake lets you accumulate a large amount of heterogeneous information, which machine learning systems or data engineers can later use to find patterns that are not currently obvious. Such an approach is justified if the company plans to use machine learning and artificial intelligence (AI) technologies.
The process data warehouse is tailor-made to solve specific problems, so only the data that consumers actually need is collected there, for example for monitoring technological processes, equipment diagnostics, operational production planning, and generating production reports.
3. Data users
The process data warehouse is as easy to understand as a table, so data from it, directly or through related systems, can be used by any specialist, from a dispatcher to a business analyst.
Working with raw data from the lake requires specially trained data engineers with excellent knowledge of programming and query languages (SQL, Python, Scala) and data storage platforms (SQL databases, Hadoop). Extracting benefits and new knowledge requires extensive tooling and a variety of data analysis methods, which means more time and additional specialists.
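As a small illustration of why raw data needs programming skills, the sketch below brings two records in different original formats into one common shape before any analysis can start. The record formats, field names and the `normalize` function are assumptions made up for this example, not taken from any specific system.

```python
import json

# Hypothetical raw records as they might sit in a lake: the same kind of
# measurement, but delivered by two sources in two different formats.
RAW_RECORDS = [
    ("json", '{"tag": "pump3.flow", "ts": "2024-01-01T00:00:00", "val": 12.5}'),
    ("csv", "pump3.flow;2024-01-01T00:01:00;12,7"),
]


def normalize(fmt: str, payload: str) -> dict:
    """Convert one raw record into a common dictionary shape."""
    if fmt == "json":
        obj = json.loads(payload)
        return {"tag": obj["tag"], "timestamp": obj["ts"],
                "value": float(obj["val"])}
    if fmt == "csv":
        # Semicolon-separated fields with a decimal comma in the value.
        tag, timestamp, value = payload.split(";")
        return {"tag": tag, "timestamp": timestamp,
                "value": float(value.replace(",", "."))}
    raise ValueError(f"unknown record format: {fmt}")


# All records in one shape, ready for further analysis.
NORMALIZED = [normalize(fmt, payload) for fmt, payload in RAW_RECORDS]
```

Every new source in the lake may need its own branch like these, which is part of the extra time and specialist effort described above.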
As you can see, each approach has its own advantages and disadvantages. Which approach suits you better?