Data Mining Techniques

In my last article an overview of data mining has been provided. In this article we will see the different techniques or algorithm we use in data mining process.

data_base_knowledge_base_system

We all know that necessity is the mother of invention. Need of people changes with the time and to feed our hunger and need, lots of technology are also evolving around the world. Also we can do same work in no of ways and it all depends on the requirement, available time frame and our budget. Lots of data mining techniques and algorithms are available in market from variety of technology vendors in market. Described below are most common data mining techniques-

1. Association:

  • This is one of the best known data mining technique available in market these days.
  • Using this technique a pattern is discovered based on the relationship between items in same transaction of group.
  • This technique is also known as relation technique because of above reason.
  • We use association in Market Basket Analysis to identify a set of products a customer purchase frequently together.
  • Retailer are using this technique to research customer’s buying habits.
  • Based on historical sales data retailers might find out that customers always buys snakes/chips when they buys beers therefore they can put beer and chips/snakes together to save customer time and increase their sales.
  • Another example can be the recommended/suggested videos of similar nature when you browse youtube videos. Or when you visit any e-commerce website they website recommend you few more products which people buys together mostly like cellphone cover and external memory card while browsing cell phone etc.

Continue reading

OLAP & Types (ROLAP, MOLAP, HOLAP)

In my last article I tried to explain about OLAP and its benefit in modern era.

In this article I am trying to brief about OLAP types and their importance.

OLAP Types

OLAP systems have been traditionally categorized using the following taxonomy.

  1. MOLAP – Multidimensional
  2. ROLAP – Relational
  3. HOLAP – Hybrid

Multidimensional – MOLAP:

  • MOLAP (multi-dimensional online analytical processing) is the classic form of OLAP and is sometimes referred to as just OLAP.
  • MOLAP stores this data in optimized multi-dimensional array storage, rather than in a relational database.
  • Some MOLAP tools require the pre-computation and storage of derived data, such as consolidations – the operation known as processing.
  • Such MOLAP tools generally utilize a pre-calculated data set referred to as a data cube.
  • The data cube contains all the possible answers to a given range of questions.
  • As a result, they have a very fast response to queries.
  • On the other hand, updating can take a long time depending on the degree of pre-computation.
  • Pre-computation can also lead to what is known as data explosion.
  • The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP.
  • Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries during interactive sessions, not in batch jobs that run overnight.
  • And because OLAP is also analytic, the queries are complex.
  • The multidimensional data model is designed to solve complex queries in real time.
  • The multidimensional data model is composed of logical cubes, measures, dimensions, hierarchies, levels, and attributes.
  • The simplicity of the model is inherent because it defines objects that represent real-world business entities.
  • Analysts know which business measures they are interested in examining, which dimensions and attributes make the data meaningful, and how the dimensions of their business are organized into levels and hierarchies.
  • Other MOLAP tools, particularly those that implement the functional database model do not pre-compute derived data but make all calculations on demand other than those that were previously requested and stored in a cache.

Advantages of MOLAP

  • Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations due to optimized storage, multidimensional indexing and caching.
  • Smaller on-disk size of data compared to data stored in relational database due to compression techniques.
  • Effective data extraction achieved through the pre-structuring of aggregated data i.e. all calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
  • Automated computation of higher level aggregates of the data.
  • It is very compact for low dimension data sets.
  • Array models provide natural indexing.

Disadvantages of MOLAP

  • This is usually remedied by doing only incremental processing, i.e., processing only the data which have changed (usually new data) instead of reprocessing the entire data set.
  • Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
  • Some MOLAP methodologies introduce data redundancy.
  • Requires additional investment: Cube technologies are often proprietary and do not already exists in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

Relational-ROLAP

  • ROLAP works directly with relational databases.
  • The base data and the dimension tables are stored as relational tables and new tables are created to hold the aggregated information.
  • It depends on a specialized schema design. This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP’s slicing and dicing functionality.
  • In essence, each action of slicing and dicing is equivalent to adding a “WHERE” clause in the SQL statement.
  • ROLAP tools do not use pre-calculated data cubes but instead pose the query to the standard relational database and its tables in order to bring back the data required to answer the question.
  • ROLAP tools feature the ability to ask any question because the methodology does not limit to the contents of a cube.
  • ROLAP also has the ability to drill down to the lowest level of detail in the database.

Advantage of ROLAP

  • Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
  • Can leverage functionalities inherent in the relational database: Often, relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantage of ROLAP

  • Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
  • Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL).
  • ROLAP technologies are therefore traditionally limited by what SQL can do.
  • ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.

Hybrid – HOLAP

  • HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP.
  • For summary-type information, HOLAP leverages cube technology for faster performance.
  • When detail information is needed, HOLAP can drill through from the cube into the underlying relational data.
  • There is no clear agreement across the industry as to what constitutes “Hybrid OLAP”, except that a database will divide data between relational and specialized storage.
  • Sometimes HOLAP database will use relational tables to hold the larger quantities of detailed data, and use specialized storage for at least some aspects of the smaller quantities of more-aggregate or less-detailed data.
  • HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the capabilities of both approaches.
  • HOLAP tools can utilize both pre-calculated cubes and relational data sources.

Comparison

Each type has certain benefits, although there is disagreement about the specifics of the benefits between providers.

  • Some MOLAP implementations are prone to database explosion, a phenomenon causing vast amounts of storage space to be used by MOLAP databases when certain common conditions are met: high number of dimensions, pre-calculated results and sparse multidimensional data.
  • MOLAP generally delivers better performance due to specialized indexing and storage optimizations. MOLAP also needs less storage space compared to ROLAP because the specialized storage typically includes compression techniques.[16]
  • ROLAP is generally more scalable.[16] However, large volume pre-processing is difficult to implement efficiently so it is frequently skipped. ROLAP query performance can therefore suffer tremendously.
  • Since ROLAP relies more on the database to perform calculations, it has more limitations in the specialized functions it can use.
  • HOLAP encompasses a range of solutions that attempt to mix the best of ROLAP and MOLAP. It can generally pre-process swiftly, scale well, and offer good function support.

Dimensional Data Modeling

Dimension & Cube

Overview & Purpose

  1. A Dimensional Model is a database structure mechanism that is optimized for online queries (OLAP) and Data Warehousing tools.
  2. Dimensional Models are designed for reading, summarized and analyzing numeric/measures information.
  3. The central attraction of the dimensional model of a business is its simplicity that simplicity is the fundamental key that allows users to understand databases, and allows software to navigate databases efficiently.
  4. Advantage of storing data in such a fashion that it is easier to retrieve the information from the data once the data is stored in database.
  5. This is the reason why dimensional modeling is used mostly in data warehouses built for reporting.
  6. Relational Models are optimized for adding and maintaining data using real-time operational systems.
  7. It is comprises of Fact and Dimension tables.
  8. In Dimensional Data Modeling hierarchies for dimensions are stored in the dimensions itself.
  9. Dimensional Data Modeling is used for calculating summarized data.
  10. Performance of Dimensional Data Modeling can be increased when materialized view are used.
  11. Materialized view contains pre calculated aggregated values from combination of dimension and fact tables.

Dimensional Model Benefits 

  1. Understand Ability – Compared to the normalized model, the dimensional model is easier to understand and more intuitive. In this model data/information is grouped into coherent business categories or dimensions which makes it easier to read and interpret. Simplicity also allows software to navigate databases efficiently. In normalized models, data is divided into many discrete entities and even a simple business process might result in dozens of tables joined together in a complex way.
  2. Faster Data Retrieval: Data stored in Multidimensional model produces very fast result as data is already stored in optimized way for retrieval in terms of OLAP Cubes.
  3. Query performance – Dimensional models are more DE normalized and optimized for data querying, while normalized models seek to eliminate data redundancies and are optimized for transaction loading and updating. The predictable framework of a dimensional model allows the database to make strong assumptions about the data which may have a positive impact on performance. Each dimension is an equivalent entry point into the fact table, and this symmetrical structure allows effective handling of complex queries. Query optimization for star joined databases is simple, predictable, and controllable.
  4. Extensibility – Dimensional models are scalable and easily accommodate unexpected new data. Existing tables can be changed in place either by simply adding new data rows into the table or executing SQL alter table commands. No queries or applications that sit on top of the data warehouse need to be reprogrammed to accommodate changes. Old queries and applications continue to run without yielding different results. But in normalized models each modification should be considered carefully, because of the complex dependencies between database tables.

Dimensional Model Approach

  1. Identify the dimensions
  2. Identify the measures
  3. Identify the attributes or properties of dimensions
  4. Identify the granularity of the measures
  5. History Preservation (Optional) -Slowly Changing Dimensions (SCD)

Multidimensional Model & OLAP

Multidimensional Model

Multidimensional Model:

  • Multidimensional structure is defined as the variation of the relational model that uses multidimensional structures to organize data and express the relationships between data.
  • The ability to analyze metrics in different dimensions such as time, geography, gender, product, etc.
  • The structure is broken into cubes and the cubes are able to store and access data within the confines of each cube.
  • Each cell within a multidimensional structure contains aggregated data related to elements along each of its dimensions.
  • Even when data is manipulated it remains easy to access and continues to constitute a compact database format.
  • The data still remains interrelated. Multidimensional structure is quite popular for analytical databases that use online analytical processing (OLAP) applications.
  • Analytical databases use these databases because of their ability to deliver answers to complex business queries swiftly.
  • Data can be viewed from different angles, which gives a broader perspective of a problem unlike other models.

Aggregations

  • It has been claimed that for complex queries OLAP cubes can produce an answer in around 0.1% of the time required for the same query on OLTP relational data.
  • The most important mechanism in OLAP which allows it to achieve such performance is the use of aggregations.
  • Aggregations are built from the fact table by changing the granularity on specific dimensions and aggregating up data along these dimensions.
  • The number of possible aggregations is determined by every possible combination of dimension granularities.
  • The combination of all possible aggregations and the base data contains the answers to every query which can be answered from the data.
  • Because usually there are many aggregations that can be calculated, often only a predetermined number are fully calculated; the remainders are solved on demand.
  • The problem of deciding which aggregations (views) to calculate is known as the view selection problem.
  • View selection can be constrained by the total size of the selected set of aggregations, the time to update them from changes in the base data, or both.
  • The objective of view selection is typically to minimize the average time to answer OLAP queries, although some studies also minimize the update time.
  • View selection is NP-Complete.
  • Many approaches to the problem have been explored, including greedy algorithms, randomized search, genetic algorithms and a search algorithm.

 Online Analytical Processing (OLAP)

  • OLAP is an approach to answer all multi-dimensional analytical (MDA) queries.
  • The first attempt to provide a definition to OLAP was by Dr. Codd, who proposed 12 rules for OLAP.
  • The OLAP Report has proposed the FASMI test, Fast Analysis of SharedMultidimensional I
  • OLAP is part of business intelligence, which also encompasses relational database, report writing and data mining.
  • OLAP enable users to analyze multidimensional (MDX) data interactively from multiple business perspectives.
  • Databases configured for OLAP use a multidimensional data model, allowing for complex analytical and ad hoc queries with a rapid execution time.
  • They borrow aspects of navigational databases, hierarchical databases and relational databases.

OLAP consists of three basic analytical operations:

  1. Consolidation (Roll-up): Aggregation of data accumulated & computed in one or more dimensions.
  2. Drill-Down
  3. Slicing and Dicing

For example:

  • All sales offices are rolled up to the sales department or sales division to anticipate sales trends. By contrast, the drill-down is a technique that allows users to navigate through the details.
  • For instance, users can view the sales by individual products that make up a region’s sales.
  • Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the OLAP cube and view (dicing) the slices from different viewpoints.
  • These viewpoints are sometimes called dimensions (such as looking at the same sales by salesperson or by date or by customer or by product or by region, etc.)

Overview of OLAP Systems

  • At the core of any OLAP system there is an OLAP Cube (also called a multidimensional cube or a hypercube).
  • It consists of numeric facts called measures that are categorized by dimensions.
  • The measures are placed at the intersections of the hypercube, which is spanned by the dimensions as a vector space.
  • The usual interface to manipulate an OLAP cube is a matrix interface, like Pivot tables in a spreadsheet program, which performs projection operations along the dimensions, such as aggregation or averaging.
  • The cube metadata is typically created from a star schema or snowflake schema or fact constellation of tables in a relational database.
  • Measures are derived from the records in the fact table and dimensions are derived from the dimension tables.
  • Each measure can be thought of as having a set of labels, or meta-data associated with it.
  • A dimension is what describes these labels; it provides information about the measure

Types

OLAP systems have been traditionally categorized using the following taxonomy.

  1. MOLAP – Multidimensional
  2. ROLAP – Relational
  3. HOLAP – Hybrid

Data Warehouse Architecture

We all know that for building something from scratch we need an architect that will create a blue print or an architecture of system so that we can execute and implement that blue print to bring into life.

Similarly to create a data warehouse for an organization we need to think of from the prospective of data warehouse architect. There are several phases or steps the operational data need to undergo before being a part of data warehouse. It totally depends how your client want to analyse their data.

Different data warehousing systems have different structures. Some may have an ODS (Operational Data Store), while some may have multiple data marts. Some may have a small number of data sources, while some may have dozens of data sources. In view of this, it is far more reasonable to present the different layers of a data warehouse architecture rather than discussing the specifics of any one system. The picture in header shows the relationships among the different components of the data warehouse architecture.

DataWarehouseArchitecture

In general, all data warehouse systems have the following 9 layers:

  1. Data Source Layer
  2. Data Extraction Layer
  3. Staging Area
  4. ETL Layer
  5. Data Storage Layer
  6. Data Logic Layer
  7. Data Presentation Layer
  8. Metadata Layer
  9. System Operations Layer

Each component is discussed individually below:

1. Data Source Layer

This represents the different data sources that feed data into the data warehouse. The data source can be of any format — plain text file, relational database, other types of database, Excel file, etc., can all act as a data source.

Data Source Types:-

  • Operations — such as sales data, HR data, product data, inventory data, marketing data, systems data.
  • Web server logs with user browsing data.
  • Internal market research data.
  • Third-party data, such as census data, demographics data, or survey data.

All these data sources together form the Data Source Layer.

2. Data Extraction Layer

Data gets pulled from the data source into the data warehouse system. There is likely some minimal data cleansing, but there is unlikely any major data transformation.

3. Staging Area

This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes it easier for subsequent data processing / integration.

4. ETL Layer

This is where data gains its “intelligence”, as logic is applied to transform the data from a transnational nature to an analytical nature. This layer is also where data cleansing happens. The ETL design phase is often the most time-consuming phase in a data warehousing project, and an ETL tool is often used in this layer.

5. Data Storage Layer

This is where the transformed and cleansed data sit. Based on scope and functionality, 3 types of entities can be found here: data warehouse, data mart, and operational data store (ODS). In any given system, you may have just one of the three, two of the three, or all three types.

6. Data Logic Layer

This is where business rules are stored. Business rules stored here do not affect the underlying data transformation rules, but do affect what the report looks like.

7. Data Presentation Layer

This refers to the information that reaches the users. This can be in a form of a tabular / graphical report in a browser, an emailed report that gets automatically generated and sent everyday, or an alert that warns users of exceptions, among others. Usually an OLAP tool and/or reporting tool is used in this layer.

8. Metadata Layer

This is where information about the data stored in the data warehouse system is stored. A logical data model would be an example of something that’s in the metadata layer. A metadata tool is often used to manage metadata.

9. System Operations Layer

This layer includes information on how the data warehouse system operates, such as ETL job status, system performance, and user access history.

Hope this article provides you some idea about the life cycle of data before it is an integral part of data warehouse. Looking forward for feedback and suggestion.

%d bloggers like this: