BASIC DEFINITIONS
Data Warehousing :
A DWH (data warehouse) is a repository of integrated information, specifically structured for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated, which makes it much easier and more efficient to run queries over data that originally came from different sources.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process."
Subject-oriented – a DW is organized around major subjects; excludes data that is not useful in the decision support process.
Integrated – a DW is constructed by integrating numerous data sources (relational DB, flat files, legacy systems). A DW provides mechanisms for cleaning and standardizing the data.
Time-variant – data is stored to provide information from a historical perspective. Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.
Nonvolatile – a DW is physically separated from the operational environment. Due to this separation it does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations: initial loading of data and access of data.
Data Warehouse is an architecture constructed by integrating data from multiple heterogeneous sources to support structured and/or ad hoc queries, analytical reporting and decision making.
Data Warehousing is a process of constructing and using data warehouses.
Data Warehouse is
- A Multi-Subject Information Store
- Typically 100’s of Gigabytes to Terabytes
Data Mart : A data mart is a collection of subject areas organized for decision support and designed to suit the needs of a given department, e.g. sales or marketing. Data mart data is much less granular than the warehouse data.
Data Mart is
- A Single Subject Data Warehouse
- Often Departmental or Line of Business Oriented
- Typically Less Than 100 Gigabytes
Differences between DWH & Data Mart : A DWH is used on an enterprise level, while data marts are used on a business division / department level. Data warehouses are arranged around the corporate subject areas found in the corporate data model. Data warehouses contain more detailed information, while most data marts contain more summarized or aggregated data.
OLTP : OLTP is Online Transaction Processing. It uses a standard, normalized database structure. OLTP is designed for transactions, which means that inserts, updates and deletes must be fast.
OLAP : OLAP is Online Analytical Processing. It works on read-only, historical, aggregated data.
Difference between OLTP and OLAP:
Fact Table :
It contains the quantitative measures about the business.
Fact tables that contain aggregated facts are often called summary tables.
Dimension Table :
It contains descriptive data about the facts (business).
Aggregate tables :
Conformed dimensions :
A conformed dimension is a dimension table shared by multiple fact tables. Conformed dimensions connect separate star schemas into an enterprise star schema.
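As a rough illustration of a shared dimension, the sketch below uses Python's built-in sqlite3 module. The tables dim_customer, fact_sales and fact_returns and all their columns are hypothetical names invented for this example; the point is only that both fact tables reference the same dimension, so their measures can be reported side by side.

```python
import sqlite3

# Hypothetical tables: two fact tables share one customer dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE fact_sales   (customer_key INTEGER, sales_amount REAL);
CREATE TABLE fact_returns (customer_key INTEGER, return_amount REAL);
INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO fact_sales   VALUES (1, 100.0), (1, 50.0), (2, 70.0);
INSERT INTO fact_returns VALUES (1, 20.0);
""")

def sales_and_returns(connection):
    """Aggregate each fact table separately, then line the results up
    on the shared (conformed) customer dimension."""
    return connection.execute("""
        SELECT c.customer_name,
               (SELECT COALESCE(SUM(s.sales_amount), 0.0)
                  FROM fact_sales AS s
                 WHERE s.customer_key = c.customer_key),
               (SELECT COALESCE(SUM(r.return_amount), 0.0)
                  FROM fact_returns AS r
                 WHERE r.customer_key = c.customer_key)
        FROM dim_customer AS c
        ORDER BY c.customer_name
    """).fetchall()

report = sales_and_returns(conn)
```

Each fact table is summed on its own before being matched to the dimension, which avoids double counting when one customer has several rows in both fact tables.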
Schema :
A schema is a collection of database objects, including tables, views, indexes, and synonyms. There are a variety of ways of arranging schema objects in the schema models designed for data warehousing. Most data warehouses use a dimensional model.
Star Schema :
Star Schema is a set of tables comprising a single, central fact table surrounded by de-normalized dimensions. A star schema implements dimensional data structures with de-normalized dimensions.
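A minimal sketch of a star schema, using Python's built-in sqlite3 module. All table and column names here (fact_sales, dim_product, dim_date and so on) are assumptions made up for the illustration. Note that the product category is stored directly in the product dimension, which is what makes the dimension de-normalized.

```python
import sqlite3

# Hypothetical star schema: one central fact table, two de-normalized dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT,
    category_name TEXT              -- category kept inline (de-normalized)
);
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,
    calendar_date TEXT,
    year          INTEGER
);
CREATE TABLE fact_sales (
    product_key   INTEGER REFERENCES dim_product(product_key),
    date_key      INTEGER REFERENCES dim_date(date_key),
    sales_amount  REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (20240101, '2024-01-01', 2024), (20240102, '2024-01-02', 2024);
INSERT INTO fact_sales VALUES (1, 20240101, 1000.0), (1, 20240102, 500.0), (2, 20240101, 200.0);
""")

def sales_by_category(connection):
    """One join from the fact table to the dimension is enough."""
    rows = connection.execute("""
        SELECT p.category_name, SUM(f.sales_amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category_name
        ORDER BY p.category_name
    """).fetchall()
    return dict(rows)

totals = sales_by_category(conn)
```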
Snow Flake Schema:
Snow Flake Schema is a set of tables comprising a single, central fact table surrounded by normalized dimension hierarchies. A snowflake schema implements dimensional data structures with fully normalized dimensions.
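For comparison, here is a small sketch of a snowflaked dimension, again with Python's sqlite3 module and hypothetical table names. The category is moved out of the product dimension into its own table, so a query by category needs one more join than it would with a de-normalized dimension.

```python
import sqlite3

# Hypothetical snowflake schema: the product dimension is normalized
# into product -> category.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT,
    category_key  INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (
    product_key   INTEGER REFERENCES dim_product(product_key),
    sales_amount  REAL
);
INSERT INTO dim_category VALUES (10, 'Electronics'), (20, 'Furniture');
INSERT INTO dim_product VALUES (1, 'Laptop', 10), (2, 'Phone', 10), (3, 'Desk', 20);
INSERT INTO fact_sales VALUES (1, 1000.0), (2, 300.0), (3, 200.0);
""")

def sales_by_category(connection):
    """Fact -> product -> category: two joins along the hierarchy."""
    rows = connection.execute("""
        SELECT c.category_name, SUM(f.sales_amount)
        FROM fact_sales AS f
        JOIN dim_product  AS p ON p.product_key  = f.product_key
        JOIN dim_category AS c ON c.category_key = p.category_key
        GROUP BY c.category_name
        ORDER BY c.category_name
    """).fetchall()
    return dict(rows)

totals = sales_by_category(conn)
```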
Queries :
The DWH serves 2 types of queries:
- Fixed queries
The end-user access tools are capable of automatically generating the database query that answers any question posed by the user.
- Canned Queries:
Kimball (Bottom up) vs Inmon (Top down) approaches :
Bottom up:
According to Ralph Kimball, when you plan to design analytical solutions for an enterprise, you should start by building data marts. Once you have 3 or 4 such data marts, you will have an enterprise-wide data warehouse built up automatically, without time and effort spent exclusively on building the EDWH, because the time required to build a data mart is less than for an EDWH.
Top down:
According to Bill Inmon, you should build an enterprise-wide data warehouse first, and all the data marts will be subsets of the EDWH. In his view, independent data marts cannot make up an enterprise data warehouse under any circumstance; they will remain isolated pieces of information (stovepipes).
ER Diagram : The ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram, which is used to visually represent data objects.
ETL : Extraction, Transformation & Loading. ETL tools in the market include, e.g., Informatica, Ascential DataStage, Acta, Oracle Warehouse Builder (OWB), etc.
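The three steps can be sketched in plain Python, independent of any of the tools named above. Everything in this example is hypothetical: the CSV text stands in for a source system, the cleaning rules are arbitrary, and dim_customer is an invented target table in an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract with inconsistent spacing and casing.
RAW_CSV = "customer_id,name,country\n1, alice ,us\n2,Bob,US\n"

def extract(text):
    """Extraction: read source records as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transformation: clean and standardize each record."""
    return [
        (int(row["customer_id"]), row["name"].strip().title(), row["country"].strip().upper())
        for row in rows
    ]

def load(connection, rows):
    """Loading: write the cleaned records into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    connection.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
    connection.commit()

conn = sqlite3.connect(":memory:")
cleaned = transform(extract(RAW_CSV))
load(conn, cleaned)
loaded = conn.execute("SELECT * FROM dim_customer ORDER BY customer_id").fetchall()
```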
Staging Area :
It is the work place where raw data is brought in, cleaned, combined, archived and exported to one or more data marts. The purpose of data staging area is to get data ready for loading into a presentation layer.
Slowly Changing Dimensions :
Dimensions are said to be slowly changing dimensions when their attributes remain almost constant over time and need only occasional, minor changes.
E.g. marital status.
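The text above does not prescribe how such a change should be stored. One common option, usually called a Type 2 slowly changing dimension, keeps history by closing the current row and adding a new one. The sketch below assumes that approach; the row layout (customer_key, customer_id, valid_from, valid_to) and the function name are hypothetical.

```python
def apply_change(dimension_rows, customer_id, new_status, change_date):
    """Expire the current row for the customer and append a new version.

    Assumes a history-preserving (Type 2 style) dimension where the
    current row is the one whose valid_to is None.
    """
    current = next(
        (row for row in dimension_rows
         if row["customer_id"] == customer_id and row["valid_to"] is None),
        None,
    )
    if current is not None and current["marital_status"] == new_status:
        return dimension_rows          # nothing changed, keep the row as is
    if current is not None:
        current["valid_to"] = change_date   # close the old version
    next_key = max((row["customer_key"] for row in dimension_rows), default=0) + 1
    dimension_rows.append({
        "customer_key": next_key,      # new surrogate key for the new version
        "customer_id": customer_id,    # natural key stays the same
        "marital_status": new_status,
        "valid_from": change_date,
        "valid_to": None,
    })
    return dimension_rows

customers = [{
    "customer_key": 1, "customer_id": "C1", "marital_status": "Single",
    "valid_from": "2020-01-01", "valid_to": None,
}]
apply_change(customers, "C1", "Married", "2024-06-01")
```

After the call, the dimension holds two rows for customer C1: the expired "Single" row and the current "Married" row, so facts recorded before the change still point at the old status.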
Bitmap indexes and B-tree indexes are the indexing mechanisms used in a typical data warehouse.
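To give a feel for why bitmap indexes suit low-cardinality warehouse columns, here is a toy sketch in Python that uses integers as bit vectors. It only illustrates the idea and is not how any particular database implements its indexes; the column values and function names are made up.

```python
def build_bitmap_index(values):
    """Map each distinct value to an integer whose bit i is set
    when row i holds that value."""
    index = {}
    for row_number, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_number)
    return index

def matching_rows(bitmap):
    """Return the row numbers whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical low-cardinality columns of a four-row table.
marital_status = ["Single", "Married", "Single", "Married"]
gender = ["F", "F", "M", "M"]

status_index = build_bitmap_index(marital_status)
gender_index = build_bitmap_index(gender)

# A filter on two columns becomes a bitwise AND of two bitmaps.
married_and_female = matching_rows(status_index["Married"] & gender_index["F"])
```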
OLAP, MOLAP, ROLAP, DOLAP, HOLAP :
OLAP:
Online Analytical Processing. OLAP tools in the market include, e.g., Business Objects, Brio, Cognos, MicroStrategy, Alphablox, Crystal Reports, etc.
ROLAP:
Relational OLAP: the users see cubes, but under the hood there are pure relational tables. MicroStrategy is a ROLAP product.
MOLAP:
Multidimensional OLAP: the users see cubes, and under the hood there is one big cube. Oracle Express used to be a MOLAP product.
DOLAP:
Desktop OLAP: the users see many cubes, and under the hood there are many small cubes, e.g. Cognos PowerPlay.
HOLAP:
Hybrid OLAP: combines MOLAP and ROLAP, e.g. Essbase.
Types of Facts:
- Additive
  - Able to add the facts along all the dimensions
  - Discrete numerical measures, e.g. retail sales in $
- Nonadditive
  - Numeric measures that cannot be added across any dimensions
  - Intensity measure averaged across all dimensions, e.g. room temperature
  - Textual facts - AVOID THEM
- Semi Additive
  - Snapshot, taken at a point in time
  - Measures of intensity
  - Not additive along the time dimension, e.g. account balance, inventory balance
  - Added and divided by the number of time periods to get a time-average
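The semi-additive case is the one that most often causes mistakes, so here is a small Python sketch of it. The snapshot data and function names are hypothetical. Balances can be summed across accounts for a single day, but across days they are added and then divided by the number of periods.

```python
from collections import defaultdict

# Hypothetical daily balance snapshots: (day, account, balance).
snapshots = [
    ("2024-01-01", "A", 100.0),
    ("2024-01-01", "B", 50.0),
    ("2024-01-02", "A", 120.0),
    ("2024-01-02", "B", 30.0),
]

def total_balance_per_day(rows):
    """Adding across the account dimension is valid for a balance."""
    totals = defaultdict(float)
    for day, _account, balance in rows:
        totals[day] += balance
    return dict(totals)

def time_average_balance(rows):
    """Across time, add the daily totals and divide by the number of days."""
    per_day = total_balance_per_day(rows)
    return sum(per_day.values()) / len(per_day)

daily_totals = total_balance_per_day(snapshots)
average_total = time_average_balance(snapshots)
```

Summing all four balances would give 300.0, a figure that matches no real state of the accounts; the time-average of 150.0 is the meaningful number.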
Attributes :
A field represented by a column within an object (entity). An object may be a table, view or report. An attribute is also associated with an SGML(HTML) tag used to further define the usage.
Business Activity Monitoring (BAM) :
BAM is a business solution that is supported by an advanced technical infrastructure that enables rapid insight into new business strategies, the reduction of operating cost by real-time identification of issues and improved process performance.
Business Intelligence (BI) :
Business intelligence is an environment in which business users receive data that is reliable, consistent, understandable, easily manipulated and timely. With this data, business users are able to conduct analyses that yield an overall understanding of where the business has been, where it is now and where it will be in the near future. Business intelligence serves two main purposes. It monitors the financial and operational health of the organization (reports, alerts, alarms, analysis tools, key performance indicators and dashboards). It also regulates the operation of the organization, providing two-way integration with operational systems and information feedback analysis.
Data Integration :
Pulling together and reconciling, for analytic purposes, dispersed data that organizations have maintained in multiple, heterogeneous systems. Data needs to be accessed and extracted, moved and loaded, validated and cleaned, and standardized and transformed.
Data Mapping :
The process of assigning a source data element to a target data element.
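In its simplest form a mapping is just a lookup from source element names to target element names. The sketch below assumes a one-to-one mapping with no transformation; the field names are invented for the example.

```python
# Hypothetical mapping: source element name -> target element name.
FIELD_MAP = {
    "cust_nm": "customer_name",
    "dob": "birth_date",
    "cntry_cd": "country_code",
}

def map_record(source_record, field_map):
    """Copy each mapped source element into its target element.
    Source elements without a mapping are left out of the result."""
    return {
        target: source_record[source]
        for source, target in field_map.items()
        if source in source_record
    }

source_row = {"cust_nm": "Alice", "dob": "1990-05-17", "cntry_cd": "US", "legacy_flag": "Y"}
target_row = map_record(source_row, FIELD_MAP)
```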
Data Mining :
A technique using software tools geared for the user who typically does not know exactly what they are searching for, but is looking for particular patterns or trends. Data mining is the process of sifting through large amounts of data to produce data content relationships. It can predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. This is also known as data surfing.
Data Modeling :
A method used to define and analyze data requirements needed to support the business functions of an enterprise. These data requirements are recorded as a conceptual data model with associated data definitions. Data modeling defines the relationships between data elements and structures.
Drill Down:
A method of exploring detailed data that was used in creating a summary level of data. Drill down levels depend on the granularity of the data in the data warehouse.
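A rough Python sketch of the idea, assuming the detail rows carry an ISO date string so that year, month and day are simply prefixes of that string. The sample data and function names are hypothetical. The finest level you can drill to is the day, because that is the granularity of the stored rows.

```python
from collections import defaultdict

# Hypothetical detail rows at day granularity: (date, amount).
sales = [
    ("2024-01-15", 100.0),
    ("2024-01-20", 50.0),
    ("2024-02-03", 70.0),
    ("2023-12-31", 10.0),
]

# Number of leading characters of an ISO date that identify each level.
LEVEL_PREFIX = {"year": 4, "month": 7, "day": 10}

def summarize(rows, level):
    """Total the amounts at the requested level of the time hierarchy."""
    totals = defaultdict(float)
    for date, amount in rows:
        totals[date[:LEVEL_PREFIX[level]]] += amount
    return dict(totals)

def drill_down(rows, parent_value, child_level):
    """Show the finer-grained rows that make up one summary value."""
    children = [row for row in rows if row[0].startswith(parent_value)]
    return summarize(children, child_level)

by_year = summarize(sales, "year")
months_of_2024 = drill_down(sales, "2024", "month")
```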
Meta Data:
Meta data is data that expresses the context or relativity of data. Examples of meta data include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions and process/method descriptions. The repository environment encompasses all corporate meta data resources: database catalogs, data dictionaries and navigation services. Meta data includes name, length, valid values and description of a data element. Meta data is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems.
Normalization:
The process of reducing a complex data structure into its simplest, most stable structure. In general, the process entails the removal of redundant attributes, keys, and relationships from a conceptual data model.
Surrogate Key:
A surrogate key is a single-part, artificially established identifier for an entity. Surrogate key assignment is a special case of derived data - one where the primary key is derived. A common way of deriving surrogate key values is to assign integer values sequentially.
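Sequential assignment can be sketched in a few lines of Python. The class name and the sample natural keys are hypothetical; a real warehouse would normally let the database or the ETL tool generate the sequence.

```python
from itertools import count

class SurrogateKeyGenerator:
    """Assign sequential integer surrogate keys to natural keys."""

    def __init__(self, start=1):
        self._sequence = count(start)
        self._assigned = {}

    def key_for(self, natural_key):
        # Reuse the existing surrogate key if this natural key was seen before.
        if natural_key not in self._assigned:
            self._assigned[natural_key] = next(self._sequence)
        return self._assigned[natural_key]

generator = SurrogateKeyGenerator()
first = generator.key_for("CUST-A")
second = generator.key_for("CUST-B")
repeat = generator.key_for("CUST-A")
```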
MOLAP, ROLAP, and HOLAP
In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages:
- Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.
- Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
- Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
- Requires additional investment: Cube technology is often proprietary and does not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
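The "WHERE" clause idea can be sketched with Python's built-in sqlite3 module. The sales table, its columns and the total_sales helper are all hypothetical; a real ROLAP tool generates far more elaborate SQL, but the principle of turning each slice into a filter is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, year INTEGER, product TEXT, amount REAL);
INSERT INTO sales VALUES
    ('East', 2023, 'Laptop', 100.0),
    ('East', 2024, 'Laptop', 150.0),
    ('West', 2024, 'Laptop', 200.0),
    ('West', 2024, 'Desk',    50.0);
""")

# Only these dimension columns may be used as slice filters.
ALLOWED_COLUMNS = {"region", "year", "product"}

def total_sales(connection, **filters):
    """Each slice or dice adds one 'column = ?' condition to the WHERE clause."""
    unknown = set(filters) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unknown dimension(s): {sorted(unknown)}")
    sql = "SELECT COALESCE(SUM(amount), 0.0) FROM sales"
    if filters:
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in filters)
    return connection.execute(sql, tuple(filters.values())).fetchone()[0]

everything = total_sales(conn)                           # no WHERE clause
slice_2024 = total_sales(conn, year=2024)                # one condition
dice_west_2024 = total_sales(conn, year=2024, region="West")  # two conditions
```

Column names are checked against a fixed list and values are passed as query parameters, so user input never ends up directly in the SQL text.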
Advantages:
- Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
- Can leverage functionalities inherent in the relational database: Relational databases often already come with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
- Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
- Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.
HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.
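A toy Python sketch of the drill-through idea. A pre-aggregated dictionary stands in for the cube and a list of rows stands in for the relational table; both, along with the function names, are assumptions made for the illustration and do not reflect how any HOLAP product is built.

```python
from collections import defaultdict

# Hypothetical detail rows, as they might sit in a relational table.
detail_rows = [
    {"region": "East", "product": "Laptop", "amount": 100.0},
    {"region": "East", "product": "Desk", "amount": 40.0},
    {"region": "West", "product": "Laptop", "amount": 200.0},
]

def build_summary(rows):
    """Pre-aggregate totals per region (stands in for the cube)."""
    summary = defaultdict(float)
    for row in rows:
        summary[row["region"]] += row["amount"]
    return dict(summary)

def drill_through(rows, region):
    """Fetch the underlying detail rows behind one summary cell."""
    return [row for row in rows if row["region"] == region]

# Summary questions are answered from the pre-built totals ...
summary_by_region = build_summary(detail_rows)
east_total = summary_by_region["East"]

# ... and detail questions go back to the relational rows.
east_detail = drill_through(detail_rows, "East")
```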