Data Warehouse vs Data Lake
A data warehouse (DWH) is a convenient solution for enterprises and organizations, and this article covers the principles behind it. Drawing on our own experience building data warehouses for financial institutions, we also present the benefits of a DWH as clearly as possible and compare it with its “competitor”: cloud storage.
A data warehouse is a subject-oriented database whose data is processed and stored in a single hardware and software environment. It provides quick access to operational and historical information, multidimensional data analysis, forecasts and statistics, all against consistent reference data. It is built on database management systems and decision support systems. Data that enters the warehouse is usually read-only. A traditional warehouse architecture often has three tiers:
• Bottom tier: the database server.
• Middle tier: the OLAP server, which transforms data into a structure better suited to analysis and complex queries.
• Top tier: the client level, which contains the tools used for high-level data analysis and reporting.
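The three tiers can be illustrated with a small sketch. This is only a toy model under stated assumptions: an in-memory SQLite database stands in for the database server, and the `sales` table, the `sales_cube` function and the `report` function are hypothetical names invented for the example, not part of any real warehouse product.

```python
import sqlite3

# Bottom tier: the database server (an in-memory SQLite database stands in for it).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", 2022, 100.0),
        ("North", 2023, 150.0),
        ("South", 2022, 80.0),
        ("South", 2023, 120.0),
    ],
)


def sales_cube(connection):
    """Middle tier: reshape raw rows into a (region, year) -> total structure."""
    rows = connection.execute(
        "SELECT region, year, SUM(amount) FROM sales GROUP BY region, year"
    )
    return {(region, year): total for region, year, total in rows}


def report(cube):
    """Top tier: a client-side report built from the aggregated structure."""
    return [
        f"{region} {year}: {total:.2f}"
        for (region, year), total in sorted(cube.items())
    ]


cube = sales_cube(conn)
lines = report(cube)
print("\n".join(lines))
```

The point of the split is that the client never touches raw rows: it only sees the structure the middle tier has prepared for analysis.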
Introducing a corporate data warehouse brings the following benefits:
1. A storage system built on a single, consistent set of reference data;
2. Comprehensive business analysis;
3. Analysis based on historical data;
4. The ability to combine and analyze information previously held in separate information systems;
5. The ability to analyze and cross-reference different types of data;
6. A basis for more accurate calculation of the cost of services.
However, companies are increasingly moving to cloud storage instead of traditional on-premises systems. Cloud data storage (referred to in this article as a Data Lake) is a cloud computing model in which data is stored over the Internet with a cloud provider that offers data storage as a service and manages it. It differs from traditional storage in several ways: cloud storage is faster and cheaper to set up and scale, and it can run complex analytical queries much faster thanks to massively parallel processing. There is also no need to buy physical equipment.
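The idea behind massively parallel processing can be sketched in a few lines: the data is split into partitions, each partition is aggregated independently, and the partial results are merged. The sketch below is a simplification under stated assumptions: threads on one machine stand in for the separate nodes of a real cluster, and `partition`, `partial_sum`, `merge` and `parallel_totals` are hypothetical helper names, not the API of any cloud vendor.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows into n_parts slices (round-robin)."""
    return [rows[i::n_parts] for i in range(n_parts)]


def partial_sum(rows):
    """Work done independently on one partition: total amount per region."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals


def merge(partials):
    """Combine the per-partition results into the final answer."""
    result = {}
    for part in partials:
        for region, amount in part.items():
            result[region] = result.get(region, 0) + amount
    return result


def parallel_totals(rows, n_workers=4):
    """Scatter the rows to workers, then gather and merge their partial sums."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, partition(rows, n_workers)))
    return merge(partials)


rows = [("North", 10), ("South", 5), ("North", 7), ("East", 3), ("South", 1)]
totals = parallel_totals(rows)
print(totals)
```

Because each partition is processed without reference to the others, adding workers (or, in a real cluster, nodes) shortens the aggregation step, while the merge step stays small.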
Although cloud storage is a big step forward from the traditional architecture, users still face a number of problems when setting it up:
• Loading data into cloud storage is not trivial: large-scale data pipelines require an ETL process to be set up, tested and supported. This work is usually done with third-party tools;
• Inserts and deletions must be done carefully to avoid degrading the performance of update queries;
• Semi-structured data is difficult to handle: it has to be normalized into a relational format, which requires automating large data flows;
• As a rule, nested structures are not supported in cloud data stores;
• Cluster optimization: to achieve optimal performance, the configuration must be reviewed constantly and tuned further when necessary;
• Query optimization: user queries may not follow best practices and can therefore take much longer to complete;
• Although storage vendors provide many options for backing up data, the process requires monitoring and close attention.
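To make the points about ETL and semi-structured data more concrete, here is a minimal sketch of normalizing nested records into flat relational tables before loading them. Everything in it is an assumption made for illustration: the sample JSON, the `orders` and `order_items` tables, and the `flatten` and `load` helpers are hypothetical, and an in-memory SQLite database stands in for the target store.

```python
import json
import sqlite3

# Extract: semi-structured input with a nested object and a nested list.
raw = """
[
  {"id": 1, "customer": {"name": "Acme", "country": "DE"},
   "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
  {"id": 2, "customer": {"name": "Globex", "country": "US"},
   "items": [{"sku": "A1", "qty": 5}]}
]
"""


def flatten(orders):
    """Transform: split nested orders into two flat, relational row sets."""
    order_rows, item_rows = [], []
    for order in orders:
        customer = order.get("customer", {})
        order_rows.append(
            (order["id"], customer.get("name"), customer.get("country"))
        )
        for item in order.get("items", []):
            item_rows.append((order["id"], item["sku"], item["qty"]))
    return order_rows, item_rows


def load(conn, order_rows, item_rows):
    """Load: create the target tables and insert all rows in one transaction."""
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_name TEXT, country TEXT)"
    )
    conn.execute(
        "CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", order_rows)
        conn.executemany("INSERT INTO order_items VALUES (?, ?, ?)", item_rows)


conn = sqlite3.connect(":memory:")
order_rows, item_rows = flatten(json.loads(raw))
load(conn, order_rows, item_rows)

qty_by_sku = dict(
    conn.execute("SELECT sku, SUM(qty) FROM order_items GROUP BY sku")
)
print(qty_by_sku)
```

The nested `items` list becomes its own table keyed by `order_id`, which is the usual way to represent a nested structure when the target store does not support one. Loading in batches inside a single transaction is also one simple way to keep inserts from becoming a performance problem.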