Data Mining Methodology

Data Mining: CRISP-DM

Data mining can be defined as the process of extracting valid, authentic, and actionable information from large databases, using techniques from machine learning, artificial intelligence (AI), and statistics to derive the patterns and trends that exist in the data. These patterns and trends can be collected together and defined as a mining model.

Almost every industry these days, including manufacturing, marketing, chemicals, and aerospace, takes advantage of this technique to increase business efficiency. The need therefore grew dramatically for a standard data mining process that is easy, comprehensive, reliable, and uniform across industries.

Data Mining Methodology

As a result, the Cross-Industry Standard Process for Data Mining (CRISP-DM) was conceived in 1996 and published as version 1.0 in 2000, defining a uniform, standard data mining process in six phases:

1. Business understanding – defining the problem
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment


Data Mining Architecture

Data Mining: 

Data mining is described as the process of discovering or extracting interesting and meaningful knowledge from large volumes of data stored in multiple data sources such as databases, file systems, and data warehouses. The knowledge extracted through data mining brings many benefits to business strategy, scientific and medical research, governments, and individuals. Data warehouse systems are designed to provide the analytical reports that help business users make decisions (DSS – decision support systems).


Data Mining Techniques

My last article provided an overview of data mining. In this article we will look at the different techniques, or algorithms, used in the data mining process.


We all know that necessity is the mother of invention. People's needs change with time, and to meet those needs many technologies keep evolving around the world. The same work can also be done in a number of ways, depending on the requirement, the available time frame, and the budget. Many data mining techniques and algorithms are available from a variety of technology vendors. The most common data mining techniques are described below.

1. Association:

  • This is one of the best-known data mining techniques available today.
  • Using this technique, a pattern is discovered based on the relationships between items in the same transaction or group.
  • For this reason it is also known as the relation technique.
  • Association is used in market basket analysis to identify sets of products a customer frequently purchases together.
  • Retailers use this technique to research customers' buying habits.
  • Based on historical sales data, a retailer might find that customers who buy beer usually also buy snacks/chips, and can therefore place beer and chips together to save customers time and increase sales.
  • Another example is the recommended/suggested videos of a similar nature when you browse YouTube, or the way an e-commerce website recommends products that are often bought together, such as a phone cover and a memory card when you browse for a cell phone.
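The support/confidence arithmetic behind such association rules can be sketched in a few lines of Python. The transaction data below is invented purely for illustration:

```python
# Hypothetical transaction data: each set is one customer's basket.
transactions = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"beer", "chips", "diapers"},
    {"milk", "chips"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`, i.e. P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# The rule {beer} -> {chips}: beer appears in 3 of the 5 baskets,
# and all 3 of those baskets also contain chips.
print(support({"beer"}, transactions))                # 0.6
print(confidence({"beer"}, {"chips"}, transactions))  # 1.0
```

Real algorithms such as Apriori add pruning so that only frequent itemsets are examined, but the rule metrics they report are exactly these two ratios.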


Data Mining Overview


We are all very much aware that in this modern era we are surrounded by technology and technical equipment (cell phones, laptops, computers, e-commerce portals, satellites, ATMs, banks, biometric machines, and so on), and each of these collects huge amounts of data at an enormous rate. Many relational database systems/servers have been designed to store such massive amounts of data. To put data into these database servers, OLTP systems have been designed to run business processes smoothly, keeping track of each individual transaction occurring in the system, such as a bank transaction, an ATM withdrawal, or an e-commerce purchase.

An OLTP system is very good at running business processes smoothly and efficiently, as it is designed to keep track of each and every transaction as a single unit of work. We also know that the historical data an organization holds is worth more than gold, and its value keeps growing with time. We can analyse this historical data to extract valuable insights.


In my last article I explained OLAP and its benefits in the modern era.

In this article I briefly describe the types of OLAP and their importance.

OLAP Types

OLAP systems have been traditionally categorized using the following taxonomy.

  1. MOLAP – Multidimensional
  2. ROLAP – Relational
  3. HOLAP – Hybrid

Multidimensional – MOLAP:

  • MOLAP (multi-dimensional online analytical processing) is the classic form of OLAP and is sometimes referred to as just OLAP.
  • MOLAP stores data in optimized multi-dimensional array storage rather than in a relational database.
  • Some MOLAP tools require the pre-computation and storage of derived data, such as consolidations – the operation known as processing.
  • Such MOLAP tools generally utilize a pre-calculated data set referred to as a data cube.
  • The data cube contains all the possible answers to a given range of questions.
  • As a result, they have a very fast response to queries.
  • On the other hand, updating can take a long time depending on the degree of pre-computation.
  • Pre-computation can also lead to what is known as data explosion.
  • The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP.
  • Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries during interactive sessions, not in batch jobs that run overnight.
  • And because OLAP is also analytic, the queries are complex.
  • The multidimensional data model is designed to solve complex queries in real time.
  • The multidimensional data model is composed of logical cubes, measures, dimensions, hierarchies, levels, and attributes.
  • The simplicity of the model is inherent because it defines objects that represent real-world business entities.
  • Analysts know which business measures they are interested in examining, which dimensions and attributes make the data meaningful, and how the dimensions of their business are organized into levels and hierarchies.
  • Other MOLAP tools, particularly those that implement the functional database model, do not pre-compute derived data; instead they make all calculations on demand, apart from those previously requested and stored in a cache.
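As a rough sketch of the MOLAP idea, pre-computing every consolidation of a tiny two-dimensional cube turns aggregate queries into plain lookups. The product/region figures below are invented, and a dictionary stands in for the optimized array storage:

```python
ALL = "*"  # wildcard meaning "aggregated over this dimension"

# Base cells of a hypothetical 2-D sales cube: (product, region) -> amount.
base = {
    ("chips", "east"): 10, ("chips", "west"): 7,
    ("beer",  "east"): 4,  ("beer",  "west"): 9,
}

# "Processing": pre-compute every consolidation once, so that aggregate
# queries become dictionary lookups instead of scans.  With D dimensions
# each base cell feeds 2**D - 1 aggregate cells, which is why heavy
# pre-computation can cause the "data explosion" mentioned above.
cube = dict(base)
for (prod, region), value in base.items():
    for key in [(prod, ALL), (ALL, region), (ALL, ALL)]:
        cube[key] = cube.get(key, 0) + value

print(cube[("chips", ALL)])  # 17 -> total chips sales, answered instantly
print(cube[(ALL, "east")])   # 14 -> total east-region sales
print(cube[(ALL, ALL)])      # 30 -> grand total
```

Queries are fast because the work was paid for up front at processing time, which is exactly the MOLAP trade-off: quick answers, slow (and storage-hungry) updates.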

Advantages of MOLAP

  • Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations due to optimized storage, multidimensional indexing and caching.
  • Smaller on-disk size of data compared to data stored in relational database due to compression techniques.
  • Effective data extraction achieved through the pre-structuring of aggregated data i.e. all calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
  • Automated computation of higher level aggregates of the data.
  • It is very compact for low dimension data sets.
  • Array models provide natural indexing.

Disadvantages of MOLAP

  • Processing (building or refreshing the cube) can take a long time, particularly for large data volumes. This is usually remedied by incremental processing, i.e., processing only the data that has changed (usually new data) instead of reprocessing the entire data set.
  • Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
  • Some MOLAP methodologies introduce data redundancy.
  • Requires additional investment: cube technologies are often proprietary and may not already exist in the organization. To adopt MOLAP technology, chances are that additional investment in human and capital resources will be needed.

Relational – ROLAP:

  • ROLAP works directly with relational databases.
  • The base data and the dimension tables are stored as relational tables and new tables are created to hold the aggregated information.
  • It depends on a specialized schema design. This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP’s slicing and dicing functionality.
  • In essence, each action of slicing and dicing is equivalent to adding a “WHERE” clause in the SQL statement.
  • ROLAP tools do not use pre-calculated data cubes but instead pose the query to the standard relational database and its tables in order to bring back the data required to answer the question.
  • ROLAP tools feature the ability to ask any question, because the methodology is not limited to the contents of a cube.
  • ROLAP also has the ability to drill down to the lowest level of detail in the database.
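A minimal ROLAP-style sketch, using Python's built-in sqlite3 module and an invented `sales` fact table, shows how a slice really is just a WHERE clause, with the aggregate computed at query time rather than read from a cube:

```python
import sqlite3

# Hypothetical fact table in a relational database (in-memory for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("chips", "east", 10), ("chips", "west", 7),
     ("beer", "east", 4), ("beer", "west", 9)],
)

# Slicing on the region dimension is just a WHERE clause; the SUM is
# evaluated against the base rows every time the query runs.
row = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = ?", ("east",)
).fetchone()
print(row[0])  # 14.0
```

Because each question is answered by scanning or indexing the relational tables, there is no processing delay and no cube-size limit, but response time depends directly on the size of the underlying data.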

Advantage of ROLAP

  • Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
  • Can leverage functionalities inherent in the relational database: Often, relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantage of ROLAP

  • Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
  • Limited by SQL functionality: because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL does not fit all needs (for example, complex calculations are difficult to express in SQL), ROLAP technologies are traditionally limited by what SQL can do.
  • ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.

Hybrid – HOLAP

  • HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP.
  • For summary-type information, HOLAP leverages cube technology for faster performance.
  • When detail information is needed, HOLAP can drill through from the cube into the underlying relational data.
  • There is no clear agreement across the industry as to what constitutes “Hybrid OLAP”, except that a database will divide data between relational and specialized storage.
  • Sometimes a HOLAP database will use relational tables to hold the larger quantities of detailed data, and specialized storage for at least some aspects of the smaller quantities of more-aggregated or less-detailed data.
  • HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the capabilities of both approaches.
  • HOLAP tools can utilize both pre-calculated cubes and relational data sources.
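The hybrid split can be sketched in a few lines: a hypothetical `sales` table holds the detail rows (the ROLAP side), while a plain dict stands in for pre-computed summary storage (the MOLAP side), and the drill-through only happens when the summary cannot answer:

```python
import sqlite3

# Detail rows live in the relational store (the ROLAP side).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("chips", "east", 10), ("chips", "west", 7),
                  ("beer", "east", 4), ("beer", "west", 9)])

# Summaries are pre-computed into specialized storage (the MOLAP side);
# here a dict of per-region totals stands in for the cube.
summary = {region: total
           for region, total in conn.execute(
               "SELECT region, SUM(amount) FROM sales GROUP BY region")}

def query(region, product=None):
    """Answer from the summary when possible; otherwise drill
    through to the detail rows in the relational database."""
    if product is None:
        return summary[region]  # fast hit against pre-computed storage
    return conn.execute(
        "SELECT SUM(amount) FROM sales WHERE region=? AND product=?",
        (region, product)).fetchone()[0]

print(query("east"))           # 14.0 (from the pre-computed summary)
print(query("east", "beer"))   # 4.0  (drill-through to detail rows)
```

Summary questions get MOLAP-like speed, while detail questions keep ROLAP's reach into the base data, which is the whole point of the hybrid approach.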


Each type has certain benefits, although there is disagreement about the specifics of the benefits between providers.

  • Some MOLAP implementations are prone to database explosion, a phenomenon causing vast amounts of storage space to be used by MOLAP databases when certain common conditions are met: high number of dimensions, pre-calculated results and sparse multidimensional data.
  • MOLAP generally delivers better performance due to specialized indexing and storage optimizations. MOLAP also needs less storage space compared to ROLAP because the specialized storage typically includes compression techniques.[16]
  • ROLAP is generally more scalable.[16] However, large volume pre-processing is difficult to implement efficiently so it is frequently skipped. ROLAP query performance can therefore suffer tremendously.
  • Since ROLAP relies more on the database to perform calculations, it has more limitations in the specialized functions it can use.
  • HOLAP encompasses a range of solutions that attempt to mix the best of ROLAP and MOLAP. It can generally pre-process swiftly, scale well, and offer good function support.