Everyone is talking about the benefits of big data these days. As a result, businesses try to work with large-scale databases but run into a problem: the data is heterogeneous and unstructured, and processing it before loading it into a database takes a long time. Working with big data then turns out to be too complicated and expensive, and some of the data is lost even though it could have been useful later.
Data lakes can help with this: they let you work with large amounts of unstructured data quickly and inexpensively. Let’s look at their features, the key differences between lakes and conventional databases, and the areas where they are most useful.
What is a data lake
A data lake is a huge repository in which different data is stored in “raw” form, that is, unordered and unprocessed. The data in a data lake is like fish in a lake that got there from a river: you don’t know exactly what kind of fish is there or where it is. And in order to “cook” the fish, that is, to process the data, you first need to catch it.
Most of the data we come across in everyday life is unstructured. Videos, books, magazines, Word and PDF documents, audio recordings and photographs are all unstructured data, and all of them can be stored in a data lake.
How a data lake works
A data lake is a huge storage that accepts files of any format. The source of the data does not matter either. A data lake can receive data from CRM or ERP systems, product catalogs, banking software, sensors or smart devices, that is, from any system a business uses.
Once the data is saved, you can work with it: extract it according to a specific template into classical databases, or analyze and process it right inside the data lake.
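As a rough illustration of these two steps, the Python sketch below uses a local folder as a stand-in for the lake and an in-memory SQLite database as the “classical” database. The folder layout, the sample files, the `customers` table and the helper names `ingest` and `extract_customers` are all assumptions made up for this example and do not describe any particular product.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def ingest(lake: Path, source: str, name: str, payload: bytes) -> Path:
    """Store a file in the lake exactly as received, grouped by source system."""
    target = lake / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing, no validation: the data stays raw
    return target


def extract_customers(lake: Path, db: sqlite3.Connection) -> int:
    """Copy records that match a simple template (name + city) into a table."""
    db.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, city TEXT)")
    loaded = 0
    for path in sorted(lake.rglob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        if isinstance(record, dict) and {"name", "city"}.issubset(record):
            db.execute(
                "INSERT INTO customers (name, city) VALUES (?, ?)",
                (record["name"], record["city"]),
            )
            loaded += 1
    db.commit()
    return loaded


# Hypothetical sample data arriving from three different systems.
lake_dir = Path(tempfile.mkdtemp(prefix="lake_"))
ingest(lake_dir, "crm", "client_1.json", b'{"name": "Anna", "city": "Kazan"}')
ingest(lake_dir, "sensors", "reading_1.json", b'{"temp": 71.5}')
ingest(lake_dir, "cameras", "frame_1.jpg", b"\xff\xd8\xff\xe0 not a real image")

database = sqlite3.connect(":memory:")
loaded_count = extract_customers(lake_dir, database)
rows = database.execute("SELECT name, city FROM customers").fetchall()
print(loaded_count, rows)
```

Only the CRM record fits the template and ends up in the table. The sensor reading and the camera frame stay in the folder untouched and remain available for other kinds of processing.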
To do this, you can use Hadoop, software that lets you process large amounts of data of various types and structures. With its help you can distribute and structure the collected data, set up analytics to build models and test hypotheses, and apply machine learning.
Another example of a tool for processing data in a data lake is BI systems, which help businesses with in-depth analytics (data mining), predictive modeling and visualization of the results. They are used in many areas, from financial management to risk management and marketing.
“To work with a data lake, a company must have technical specialists: a Data Scientist, a Data Developer, a business analyst. Such specialists have access to the data in the data lake and can process it using various analytical systems and approaches. Data in the data lake can be processed without extraction: it is enough to set up the analysis systems right inside the lake,” – Konstantin Savchuk, Managing Partner of Constanta.
How data lakes differ from regular databases
The key difference between data lakes and regular databases is their structure. Databases store only clearly structured data, while lakes store unstructured data that has not been systematized or ordered in any way.
Example: Let’s say there is a loose, free-form description of your target audience: “Women aged 20 to 30, unmarried, usually without children, working in low management positions. And men aged 18 to 25, married, without children, without a clear place of work.” Such a description is unstructured data that can be loaded into a data lake.
For this target audience data to become structured, it needs to be processed and converted into a table:
Portrait | Sex | Age | Marital status | Children | Work |
Portrait 1 | female | 20-30 | unmarried | No | low management position |
Portrait 2 | male | 18-25 | married | No | any |
In a classic database, you have to define the type of data, analyze it, structure it – and only then write it to a well-defined place in the database. We can create an algorithm that works with specific cells because we clearly know what is stored in those cells.
In the case of a data lake, structure is applied on the way out, at the moment you retrieve or analyze the data. The analysis does not change the data in the lake itself: it remains unstructured, so it is still convenient to store and to reuse for other purposes.
To put it simply, you can think of a data lake as your hard drive, where all your files are stored, and of a database as a table in which all of these files are recorded.
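This way of applying structure only at read time is commonly called “schema on read”. The sketch below is a minimal illustration in Python, assuming the raw data is a list of JSON lines. The sample events and the reader functions `read_purchases` and `read_page_views` are hypothetical and exist only for this example.

```python
import json

# Raw events as they might sit in the lake: one JSON document per line,
# with fields that differ from record to record (hypothetical sample data).
RAW_EVENTS = [
    '{"user": "u1", "action": "purchase", "amount": "1200"}',
    '{"user": "u2", "action": "view", "page": "/catalog"}',
    '{"user": "u1", "action": "purchase", "amount": "300", "coupon": "SPRING"}',
]


def read_purchases(raw_lines):
    """Apply a 'purchase' structure while reading; raw lines are not modified."""
    for line in raw_lines:
        event = json.loads(line)
        if event.get("action") != "purchase":
            continue  # this reader skips events it does not need
        yield {"user": event["user"], "amount": float(event["amount"])}


def read_page_views(raw_lines):
    """A second reader imposes a different structure on the same raw data."""
    for line in raw_lines:
        event = json.loads(line)
        if event.get("action") == "view":
            yield {"user": event["user"], "page": event["page"]}


purchases = list(read_purchases(RAW_EVENTS))
page_views = list(read_page_views(RAW_EVENTS))
total_spent = sum(p["amount"] for p in purchases)
print(purchases, page_views, total_spent)
```

Both readers work on the same untouched raw lines, and each one decides for itself which fields matter and how to type them.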
There are other differences between databases and data lakes:
Usefulness of data. All data in a database is useful and relevant to the company right now. Data that seems useless for now is discarded and lost forever.
Lakes also store data that looks useless today: it may become useful in the future or may never be needed.
Data types. Databases store tables with specific numbers and text, arranged in a clear structure.
Lakes contain any data: pictures, video, sound, files, documents, heterogeneous tables.
Flexibility. A database has low flexibility: you need to determine its data types and structure at the very start. If data in new formats appears, the database has to be rebuilt.
Lakes are as flexible as it gets, because nothing needs to be determined in advance. If you suddenly decide to record new data, for example video from cameras for face recognition, the lake does not have to be rebuilt.
Cost. Databases are more expensive, especially if you need to store a lot of data. You have to set up a complex infrastructure and filtering, and all of this costs money.
A data lake is much cheaper: you pay only for the gigabytes you occupy.
Comprehensibility and availability of data. Data in a database can be easily read and understood by any company employee, and business analysts can work with it.
Structuring data in a lake requires a technical specialist such as a Data Scientist.
Usage scenarios. Databases are ideal for storing important information that should always be at hand, or for basic analytics.
Data lakes are good for keeping archives of raw information that may come in handy in the future. They are also a good place to build a large base for large-scale analytics.
Who needs data lakes and why
Data lakes can be used by any business that collects data. Marketing, retail, IT, manufacturing, logistics – in all these areas, you can collect big data and upload it to the data lake for further work or analysis.
Lakes are often used to store important information that is not yet used in analytics, or even data that seems useless now but is likely to be useful to the company in the future.
“A data lake allows you to accumulate data “in reserve” rather than for a specific business request. Because the data is always “at hand”, the company can quickly test any hypothesis or use the data for its own purposes. For example, to optimize logistics and manage the supply chain efficiently: from more detailed planning and sales forecasting to deliveries in the right quantity, of the right quality, at the right time and at minimal cost,” – Alexey Kuleshov, Director of the Department of Organizational Development and Technologies of the IT company OTR.
For example, suppose your production uses complex equipment that often breaks down. You introduce IoT, the Internet of Things, and install sensors to monitor the state of the equipment. Data from these sensors can be collected in the data lake without filtering. When enough data has accumulated, you can analyze it to understand what causes breakdowns and how to prevent them.
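A very simplified version of such an analysis might look like the Python sketch below. The sensor log, the maintenance log, the two-hour window and the helper name `split_by_breakdown` are invented for illustration. A real analysis would use far more data and proper statistical or machine learning methods.

```python
from statistics import mean

# Hypothetical raw sensor log: (hour, machine, temperature in °C).
READINGS = [
    (1, "press-1", 70.0), (2, "press-1", 72.0), (3, "press-1", 91.0),
    (4, "press-1", 95.0), (6, "press-1", 69.0), (7, "press-1", 71.0),
    (8, "press-1", 93.0), (9, "press-1", 96.0),
]
# Hypothetical maintenance log: hours at which the machine broke down.
BREAKDOWNS = [5, 10]


def split_by_breakdown(readings, breakdowns, window=2):
    """Separate readings taken within `window` hours before a breakdown."""
    before, normal = [], []
    for hour, _machine, temp in readings:
        near = any(0 < b - hour <= window for b in breakdowns)
        (before if near else normal).append(temp)
    return before, normal


before_failure, normal_operation = split_by_breakdown(READINGS, BREAKDOWNS)
avg_before = mean(before_failure)
avg_normal = mean(normal_operation)
print(avg_before, avg_normal)
```

In this made-up data the temperature shortly before a breakdown is noticeably higher than during normal operation, which is the kind of pattern that could prompt a closer look at overheating.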
Or you can use a data lake in marketing. For example, in retail and e-commerce you can store scattered information about customers in the data lake: time spent on the site, activity in a social media group, tone of voice when calling a manager, and regularity of purchases. This information can then be used for large-scale analytics and for predicting customer behavior.
Thus, data lakes are needed for flexible data analysis and for building hypotheses. They let you collect as much data as possible so that later, using machine learning and analytics tools, you can compare different facts, make predictions, analyze information from different angles and extract more value from the data.
The study “Angling for Insight in Today’s Data Lake” shows that companies that have implemented a data lake are 9% ahead of their competitors in revenue. So we can say that data lakes are needed by companies that want to earn more by analyzing their own data.
“Leading companies are using advanced approaches to analyzing data stored in the data lake, such as machine learning. Information from various sources is suitable for this: logs (event logs), data from social networks, data from various devices (smartphones, smart watches, tablets) and others. Using this approach to data analysis, a company can obtain useful insights of various kinds, identify patterns and anticipate how certain scenarios may unfold in the future,” – Konstantin Savchuk, Managing Partner of Constanta.
Why are data lakes dangerous?
Data lakes have one major problem. Any data entering the data lake gets there almost without control, which means its quality cannot be determined. If a company has no clear data model, that is, no understanding of the types of data structures and the methods for processing them, the lake is poorly managed and huge amounts of uncontrolled, mostly useless data quickly accumulate in it. It is no longer clear where the data came from and when, how relevant it is, or whether it can be used for analytics.
As a result, the lake turns into a data swamp: useless, devouring company resources and bringing no benefit. All you can do with it is erase it completely and start collecting data again.
To prevent the lake from becoming a swamp, it is necessary to establish a data governance process in the company. The main component of this process is determining the reliability and quality of the data even before loading it into the data lake. There are several ways to do this:
- cut off sources with deliberately inaccurate data;
- restrict upload access for employees who are not authorized to load data;
- check some parameters of files, for example, do not let pictures that weigh tens of gigabytes into the lake.
Setting up such filtering is easier than structuring data every time it is loaded into a database. If the process is well established, only relevant data will get into the data lake, which means the lake itself will be reliable.
Data governance is not optional; it is a priority. The company should have a dedicated employee responsible for data governance. This is usually the Chief Data Officer (CDO).
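The three checks listed above can be sketched as a small admission filter in Python. The trusted sources, the allowed uploaders, the 5 GiB size limit, the `Upload` record and the function `rejection_reasons` are hypothetical examples of a governance policy, not recommended values.

```python
from dataclasses import dataclass

# All three settings below are hypothetical examples of a governance policy.
TRUSTED_SOURCES = {"crm", "erp", "sensors"}
ALLOWED_UPLOADERS = {"etl-service", "data-engineer"}
MAX_FILE_BYTES = 5 * 1024**3  # upper limit of 5 GiB per file


@dataclass
class Upload:
    source: str
    uploader: str
    filename: str
    size_bytes: int


def rejection_reasons(upload: Upload) -> list[str]:
    """Return the reasons an upload fails the checks (empty list = admit)."""
    reasons = []
    if upload.source not in TRUSTED_SOURCES:
        reasons.append("untrusted source")
    if upload.uploader not in ALLOWED_UPLOADERS:
        reasons.append("uploader has no permission")
    if upload.size_bytes > MAX_FILE_BYTES:
        reasons.append("file too large")
    return reasons


good = Upload("crm", "etl-service", "clients.csv", 10_000)
bad = Upload("forum-scrape", "intern", "photo.tiff", 40 * 1024**3)
print(rejection_reasons(good))
print(rejection_reasons(bad))
```

Note that the filter looks only at metadata about the upload and never at the structure of the content, which is why it is cheaper to maintain than a full database schema.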
“Accumulating data on the principle of “we’ll figure out later why we need it and how to use it” is wrong. It will then be difficult and costly to isolate anything useful from this huge array of completely different data. Therefore, when designing any data lake, you still need to decide “on the shore”, first of all, what purposes it is being built for,” – Aleksey Kuleshov, Director of the Department of Organizational Development and Technologies of the IT company OTR.
Key facts about data lakes
- A data lake is a repository that collects unstructured information of any format from different sources.
- Data lakes are cheaper, more flexible, and easier to scale than conventional databases.
- Data lakes can be used for any purpose: analysis, forecasting, business process optimization.
- Data can be extracted from the lake according to certain characteristics or analyzed right inside the lake using analytics systems.
- If you collect too much data “just in case” and do not work with it in any way, the lake can become a useless swamp. It is therefore important to determine in advance what exactly you are collecting the data for.