Data mining can be defined as the process of extracting valid, authentic, and actionable information from large databases. It uses techniques from machine learning, artificial intelligence (AI), and statistics to derive the patterns and trends that exist in the data. These patterns and trends can be collected together and defined as a mining model.
Almost all industries today, including manufacturing, marketing, chemical, and aerospace, take advantage of data mining to increase their business efficiency. This has sharply increased the need for a standard data mining process that is easy, comprehensive, reliable, and uniform across the industry.
Data Mining Methodology:
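As a result, the Cross Industry Standard Process for Data Mining (CRISP-DM) was developed in the late 1990s as a uniform, standard process for data mining. CRISP-DM itself defines six phases (business understanding, data understanding, data preparation, modeling, evaluation, and deployment). This post walks through the process in the 7 steps below, treating access to the deployed model as a separate final step.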
1. Defining the problem – Understanding Business Process
- First, understand the business objectives clearly and find out what the business needs.
- This involves defining the expectations from data mining, including the business analysts' requirements, marketing strategies, portfolio, forecasting, and decision support requirements.
- Next, assess the current situation by identifying the available resources, assumptions, constraints, and other important factors that need to be considered.
- After assessing the as-is situation, define the to-be situation (the business goal).
- Finally, establish a detailed data mining plan to achieve both the business goals and the data mining goals.
2. Data Understanding/Exploring Data:
- The data understanding phase starts with initial data collection from the available data sources, to get familiar with the available data set.
- It may include data integration and data loading so that the collection succeeds.
- Next, explore the data by tackling the data mining questions, which can be addressed using querying, reporting, and visualization.
- Finally, examine data quality by checking for completeness and missing values.
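The data quality check in the last bullet could be sketched in Python roughly as follows. This is a minimal illustration, not a prescribed method: the sample records, the column names, and the rule that `None` or an empty string counts as "missing" are all assumptions made for the example.

```python
from __future__ import annotations

from typing import Any

# Hypothetical sample records; None or "" marks a missing value.
records = [
    {"customer_id": 1, "region": "North", "salary": 52000},
    {"customer_id": 2, "region": "", "salary": 61000},
    {"customer_id": 3, "region": "South", "salary": None},
    {"customer_id": 4, "region": "East", "salary": 48000},
]


def is_missing(value: Any) -> bool:
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def completeness_report(rows: list[dict[str, Any]]) -> dict[str, float]:
    """Return the fraction of non-missing values for each column."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for column in columns:
        present = sum(1 for row in rows if not is_missing(row.get(column)))
        report[column] = present / len(rows) if rows else 0.0
    return report


report = completeness_report(records)
for column, ratio in report.items():
    print(f"{column}: {ratio:.0%} complete")
```

A report like this makes it easy to spot columns that need cleaning or imputation before the data preparation step.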
3. Data Preparation:
- This is typically the most time-consuming step of a data mining project. Estimates vary, but it is often said to take well over half of the total project time, with figures as high as 90% sometimes quoted.
- Once the data sources have been identified and the source data has been examined, the data needs to be selected, cleaned, constructed, and formatted into the target data set format.
- The result of the data preparation step is the final data set, which is then analyzed by the data mining algorithms.
- This step mainly involves ETL (Extraction, Transformation, and Loading) tasks.
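As a rough illustration of the select, clean, construct, and format tasks described above, here is a small Python sketch. The raw rows, the column names, and the cleaning rules (drop rows without a salary, default a missing bonus to 0) are hypothetical choices for the example; a real project would take them from the business requirements.

```python
# Hypothetical raw extract: every value arrives as text, some are blank.
raw_rows = [
    {"id": "1", "name": " Alice ", "salary": "52000", "bonus": "3000", "notes": "vip"},
    {"id": "2", "name": "Bob", "salary": "", "bonus": "1500", "notes": ""},
    {"id": "3", "name": "carol", "salary": "48000", "bonus": "", "notes": "new"},
]

# Select: only these columns are relevant to the (assumed) mining goal.
SELECTED_COLUMNS = ("id", "name", "salary", "bonus")


def prepare(rows):
    """Select, clean, construct, and format raw rows into a target data set."""
    prepared = []
    for row in rows:
        selected = {column: row.get(column, "") for column in SELECTED_COLUMNS}

        # Clean: skip rows with no salary (assumed rule for this example).
        if selected["salary"].strip() == "":
            continue
        name = selected["name"].strip().title()
        bonus_text = selected["bonus"].strip()

        # Format: convert text into typed values.
        salary = float(selected["salary"])
        bonus = float(bonus_text) if bonus_text else 0.0

        # Construct: derive a new attribute from existing ones.
        prepared.append({
            "id": int(selected["id"]),
            "name": name,
            "salary": salary,
            "bonus": bonus,
            "total_pay": salary + bonus,
        })
    return prepared


prepared = prepare(raw_rows)
for row in prepared:
    print(row)
```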
4. Building the Model – Data Modeling
- A model typically contains input columns, an identifying column, and a predictable column.
- The data types of the columns are defined in a mining structure, which the data mining algorithms use to process the data.
- The output of a data mining model provides analyzed and forecast data that business analysts can readily use.
- A data mining model applies a mining model algorithm to the data that is represented by a mining structure.
- Model parameters and boundary values can be set on the data mining algorithm, and usage parameters can be set on the mining model columns.
- You can define columns to be input columns, key columns, or predictable columns.
The following basic terms describe the column types and are useful for understanding the rest of the sections.
a) Continuous Column: This column contains numeric measurements on a continuous scale, such as product cost, salary, or account balance. Date values such as shipping date or invoice date can also be treated as continuous. These values typically have no fixed upper bound.
b) Discrete Column: This column contains a finite set of unrelated values, such as gender, location, or telephone area codes. The values do not need to be numeric, and they typically have no fractional component. (Age, by contrast, is usually handled as a continuous or discretized column.)
c) Discretized Column: This is a continuous column converted to a discrete one, for example by grouping salaries into predefined bands.
d) Key: The column which uniquely identifies the row, similar to the primary key. This is sometimes called the Case attribute.
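To make the discretized column type concrete, the sketch below groups a continuous salary value into predefined bands in Python. The band edges and labels are invented for the example; in practice they would come from the business or from the mining tool's own discretization method.

```python
import bisect

# Hypothetical band edges: each edge is the lower bound of the next band.
BAND_EDGES = [30000, 60000, 100000]
BAND_LABELS = ["low", "medium", "high", "very high"]


def discretize(salary):
    """Map a continuous salary to one of the predefined bands.

    Values below 30000 are "low", 30000 up to (not including) 60000 are
    "medium", 60000 up to (not including) 100000 are "high", and anything
    from 100000 upward is "very high".
    """
    return BAND_LABELS[bisect.bisect_right(BAND_EDGES, salary)]


salaries = [25000, 30000, 59999.99, 75000, 150000]
bands = [discretize(salary) for salary in salaries]
print(list(zip(salaries, bands)))
```

After this conversion the salary column has only four possible values, so an algorithm that expects discrete inputs can use it.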
5. Evaluating and validating the model:
Well, you have created a data mining model, and now it’s time to validate it.
- A variety of techniques exist for this, typically applying different models to the same data set and then comparing their performance to choose the best one.
- The data analysis life cycle represents the maturity model of the analysis: operational, trend, ad-hoc, and predictive.
- Operational analysis covers business transaction reports (closing bank balances, who was admitted to the hospital today, how many support calls were closed today, etc.).
- As the name suggests, trend analysis examines how historical data grows or changes over a period of time.
- Ad-hoc analysis is business context analysis (product sales by region). It can also be used to find a root cause, such as a sudden decrease in sales of a product due to floods or another natural calamity.
- Predictive analysis predicts patterns for the future (also called forecasting).
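The idea of applying different models to the same data set and comparing their performance can be sketched in Python as below. This is only an illustration under stated assumptions: the data points are made up, the two candidate models (a mean baseline and a simple least-squares line) and the mean absolute error metric are example choices, and a single hold-out split stands in for a fuller validation scheme.

```python
from statistics import mean

# Hypothetical (period, sales) observations; the last two are held out for testing.
data = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8), (5, 11.1), (6, 12.9), (7, 15.2), (8, 16.8)]
train, test = data[:6], data[6:]


def fit_mean_model(rows):
    """Baseline: always predict the average of the training targets."""
    average = mean(y for _, y in rows)
    return lambda x: average


def fit_linear_model(rows):
    """Simple least-squares line: y = intercept + slope * x."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in rows) / sum((x - x_bar) ** 2 for x in xs)
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def mean_absolute_error(model, rows):
    """Average absolute difference between predictions and actual values."""
    return mean(abs(model(x) - y) for x, y in rows)


# Train every candidate on the same training rows, score on the same test rows.
candidates = {
    "mean baseline": fit_mean_model(train),
    "linear regression": fit_linear_model(train),
}
scores = {name: mean_absolute_error(model, test) for name, model in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: MAE = {score:.2f}")
print("best model:", best)
```

Because both candidates are trained and scored on identical rows, the difference in error reflects the models themselves and not the data they happened to see.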
6. Deploying and updating the models
- Once the right data mining model is chosen and trained with the source data, you deploy it on the server so that DMX (Data Mining Extensions) queries and APIs can access it.
- These queries act directly on the deployed models and feed the result to the user interface (UI) or portal for customer interaction and decision-making.
- The processed models may need to be updated as business requirements change.
- A change in business requirements may require you to consider choosing a different mining model.
- Input data may change or you may have new values you need to predict.
- Data mining models can be trained (processed) and deployed by using SQL Server Integration Services (SSIS) tasks and transformations.
- This is helpful when you want the model to respond to source data changes on a near-real-time basis.
- Also, mining results can be accessed by using SSIS tasks and transformations for feeding OLAP or relational structures.
- This is useful when end users access and analyze the data using existing OLAP or relational reports.
7. Accessing the model:
- Once the model is built and deployed, the next step is to access the mined information from the front-end interface for further analysis.
- The query language DMX is typically used for accessing data mining models.
- DMX is similar to the MDX query language for OLAP queries and to the SQL query language for relational queries.
The following tasks can be performed using the DMX query language:
- Creating mining structures and mining models
- Processing mining structures and mining models
- Deleting or dropping mining structures or mining models
- Copying mining models
- Browsing mining models
- Predicting against mining models
I hope this post gives an overview of the data mining process. I look forward to your feedback and recommendations.