A data lake is an element of Big Data infrastructure: a repository for large amounts of unstructured data generated or collected by a single company or government agency. As a rule, data in a lake is stored in unorganized form. Simply put, this is data that is “a pity to throw away, but there is nowhere to put it.”
Companies create data lakes for several reasons, including the need to keep all records on hand in case of an audit, the potential future value of the data, and legal requirements.
Data lakes can be located on the company’s own servers or in the cloud. As a rule, all employees have access to the data, and the level of security is low. Maintaining such a repository is inexpensive.
Today, specialized companies handle the storage and administration of data lakes, among them Teradata, Zaloni, HVR, Podium Data, and Snowflake. Most of them provide not only storage capacity but also tools for structuring lakes and processing data.
According to a Markets and Markets forecast, the data lakes market will grow to $8.81 billion by 2021 at an annual growth rate of 28.3%. Today, lakes are a necessary part of any corporate Big Data infrastructure.
The main problem with data lakes is that, like natural bodies of water, they can become polluted and turn into swamps. In other words, a repository can become so unstructured and so littered with heterogeneous data that it is impossible to make sense of it, let alone extract valuable information.
In such a situation, company data may be duplicated from department to department or, conversely, lost.
Such lakes need to be “cleaned” and structured so that the storage does not turn into a dump of dead information.
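One part of this cleanup, spotting data duplicated across departments, can be sketched in a few lines of Python. This is only an illustration: it assumes the lake is a directory tree on a local file system and that two files count as duplicates when their contents are byte-for-byte identical. The function names `file_digest` and `find_duplicates` are made up for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(root: Path) -> Dict[str, List[Path]]:
    """Group files under root by content hash; keep groups of two or more."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Calling `find_duplicates(Path("/data/lake"))` (a hypothetical path) returns a mapping from content hash to the list of files sharing that content. A real lake in cloud object storage would need the storage provider's own listing and checksum facilities instead.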
Ken Tsai gives four key tips to keep the data lake from turning into a swamp.
- Entrust the work to specialists
If your company is just about to start its own lake, entrust the job to professionals. There are plenty of specialized firms on the market that, for a small fee, will take care of structuring and properly storing a data lake. The results can more than cover the costs.
- Decide what your data lakes are for
Which specialists or departments will query the data lake, and how often? How will particular types of data be used? What results do you expect? All of these questions need to be answered before you start filling your information reservoir and releasing fish into it.
- Plan your data storage
The most important component of a “clean” data lake is metadata. This is service information that records the date and time files were created and modified, the names of the users who last touched them, and similar details. In addition, metadata indicates where the data belongs structurally, along with its format and type. Based on this information, any data set can be easily fished out of the lake and used for the benefit of the company. All this requires a clear storage plan.
- Decide how many lakes you need
Perhaps the company does not need a single lake into which the data of all departments and production processes is dumped. Organizations often create a separate lake for each department or line of business. This can be convenient both for the employees themselves and for whoever manages the repositories and cleans them up.
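The metadata described under “Plan your data storage” can be pictured as a small catalog. The sketch below is a minimal in-memory illustration, not a real catalog product: the class names `DatasetRecord` and `Catalog`, the field names, and the sample values are all assumptions chosen to mirror the kinds of metadata listed above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class DatasetRecord:
    """Metadata for one data set in the lake (hypothetical field names)."""
    path: str
    created: datetime
    modified: datetime
    last_user: str
    department: str   # structural affiliation of the data
    data_format: str  # e.g. "csv" or "json"
    data_type: str    # e.g. "transactions" or "logs"


class Catalog:
    """A toy in-memory catalog that finds data sets by their metadata."""

    def __init__(self) -> None:
        self._records: List[DatasetRecord] = []

    def register(self, record: DatasetRecord) -> None:
        self._records.append(record)

    def find(self, **criteria: str) -> List[DatasetRecord]:
        """Return records whose fields equal every given criterion."""
        return [
            record for record in self._records
            if all(getattr(record, key) == value for key, value in criteria.items())
        ]


# Example usage with made-up records.
catalog = Catalog()
catalog.register(DatasetRecord(
    path="sales/2020/orders.csv",
    created=datetime(2020, 1, 5, 9, 0),
    modified=datetime(2020, 3, 1, 17, 30),
    last_user="a.ivanova",
    department="sales",
    data_format="csv",
    data_type="transactions",
))
catalog.register(DatasetRecord(
    path="it/web/access.json",
    created=datetime(2020, 2, 1, 0, 0),
    modified=datetime(2020, 2, 2, 0, 0),
    last_user="b.petrov",
    department="it",
    data_format="json",
    data_type="logs",
))

sales_csv = catalog.find(department="sales", data_format="csv")
```

Here `sales_csv` holds only the orders file. The same department field would also make it straightforward to split one catalog into per-department lakes if that layout suits the company better.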
By following these simple rules, you can not only keep data lakes pristine, but also reap great benefits from them in the future.