Data Warehouse

Data Warehouse

Pinterest LinkedIn Tumblr

A data warehouse is a domain-specific, integrated, immutable and historical dataset organized for decision support purposes. Data warehouses allow more efficient, faster and better data provision for their analytical processing systems than conventional DBMS.

  • Domain-specific means that the data in the store is aggregated according to the domains it describes, rather than the applications that use it.
  • Integration means that the warehouse must support the joint storage of data of different nature, formats and types, reflecting different aspects of the domain , and not separate business functions. The data is contained within the storage in its uniform internal format.
  • Immutability means that the data in the warehouse can only be added, and cannot be deleted or changed. If any changes are nevertheless necessary, the entire repository is “reloaded”. The need for such an approach is explained by the fact that during the industrial operation of the storage together with analytical platforms , the same request to it, executed at any time, must ensure the provision of the same data. Obviously, if changes were allowed in the repository, then two identical queries executed with a certain interval during which the data could change will form two different data sets, the analysis of which can lead to incorrect conclusions and conclusions, which is unacceptable.
  • Historical support indicates that data is stored based on the date and time of its appearance, for which each record is assigned a special time stamp, which allows you to retrieve data in chronological order and analyze time sequences.

Warehouses can use a relational model, when the data in them is normalized, or a multidimensional model , using so-called dimensions . In normalized storages, data is contained in third normal form tables. The advantage of normalized CDs is that it is easy to develop and manage. The disadvantage is the need to denormalize data on the fly when retrieving it from multiple tables when performing complex analytical queries.

When generating large samples, this leads to significant delays in obtaining data, and if the storage and analytical platform are integrated into the enterprise information system, then the load on the entire system increases, which can complicate the work of many users. This problem can be partially solved by using a data model based on dimensions in the warehouse. There are two types of multidimensional data models, star and snowflake . All data loaded into the storage must be defined as a dimension, attribute, or fact .

In addition to the actual data describing the company’s business processes , the warehouse contains metadata – service data that describes the structure of the warehouse, containing information about the data belonging to a particular type or kind (dimension, attribute or fact). With the help of metadata, a semantic layer is formed that provides visual means of managing data and metadata. Metadata in repositories is divided into technical (ensure the operation of the repository itself) and business metadata (describe the data structure within a given business model).

In industrial operation, OLTP systems are the main sources of data for warehouses . In addition, sources can be any files in the enterprise information system that contain structured information, the analysis of which is expected to provide useful knowledge. Such files can be of various types and formats – spreadsheets (Excel), desktop DBMS (Access), delimited text (TXT, CSV files), accounting system files (1C: Enterprise, Parus), etc. Therefore, it is very important for data warehouses to have developed tools for loading and integrating data from various types and formats.

The author of the concept of data warehouses in the form in which it exists today is considered Bill Inmon , who coined the term in the 1970s. Ralph Kimball , who is also the author of the multidimensional data model, made a great contribution to the development of the theory of data warehouses and the practice of their use in the field of business analysis and decision support.

Feature image credit