What Are Data Mining and Data Analytics?
Data mining is the process of discovering hidden patterns in data, where (a) patterns refer to inherent relationships and/or dependencies in the data, and (b) the data is large-scale and typically stored in a database environment. Data mining is also frequently referred to as knowledge discovery in databases (KDD).
Data analytics is the process of transforming raw data into knowledge and insight for making better decisions.
Related Fields
Data mining and analytics are closely related to the areas of databases, artificial intelligence (AI), statistics, and information retrieval. However, there are considerable differences between data mining and these fields.
Databases: this field focuses on data storage and access technology, while data mining focuses on data analysis and knowledge discovery.
Artificial Intelligence: there are overlaps between AI and data mining techniques, including those concerning machine learning. However, AI techniques are not necessarily data-oriented (e.g., expert systems).
Statistics: classical statistics typically assumes data is scarce; it focuses on numeric data and a parametric approach (e.g., “assume the data follows a normal distribution”). Conversely, data mining assumes data is abundant; it deals with various data types and focuses on efficient algorithms for large-scale data.
Information retrieval (IR): this field concerns finding materials (e.g., documents) of an unstructured nature (e.g., text) that satisfy an information need; it is closely related to text and web mining. A typical example of an IR technique is a search engine.
Terminology and Business Applications
Example dataset
Customer ID   Age   Income   Year-of-Education   Purchase-Amount   Favorite
1             35    62,000   10                  429               YES
We define a few terms:
Attribute (aka a variable, field, or column):
- Numeric attribute (aka a continuous or real attribute): mathematical operations (e.g., addition, multiplication) can be applied to the values of this type of attribute.
- Categorical attribute (aka a nominal attribute): mathematical operations cannot be applied to the values of such an attribute, even if the values appear in a numeric format (e.g., social security number, credit card number).
Record (aka an observation, instance, or row).
Dataset (aka a relation or table): a set of data with attributes in columns and records in rows.
For the above table, we say:
- This is a Dataset.
- We have 6 attributes (variables, fields, or columns): "Customer ID", "Age", "Income", "Year-of-Education", "Purchase-Amount", and "Favorite".
- "Age", "Income", "Year-of-Education", and "Purchase-Amount" are Numeric attributes.
- "Favorite" is a Categorical attribute.
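To make the terminology concrete, here is a minimal Python sketch that stores the example row as a record and labels each attribute with its type. The variable names are only illustrative, and treating "Customer ID" as categorical is an assumption (the text does not classify it, but as an identifier it behaves like the social security number example).

```python
# The single example record from the table above, stored as a dict
# (attribute name -> value).
record = {
    "Customer ID": 1,
    "Age": 35,
    "Income": 62000,
    "Year-of-Education": 10,
    "Purchase-Amount": 429,
    "Favorite": "YES",
}

# A dataset is a collection of records (rows); here it holds just one row.
dataset = [record]

# Attribute types. "Customer ID" is marked categorical as an assumption:
# it looks numeric, but arithmetic on an identifier is meaningless.
attribute_types = {
    "Customer ID": "categorical",
    "Age": "numeric",
    "Income": "numeric",
    "Year-of-Education": "numeric",
    "Purchase-Amount": "numeric",
    "Favorite": "categorical",
}

numeric = [name for name, kind in attribute_types.items() if kind == "numeric"]
categorical = [name for name, kind in attribute_types.items() if kind == "categorical"]

# Mathematical operations only make sense on numeric attributes.
mean_age = sum(row["Age"] for row in dataset) / len(dataset)

print(len(attribute_types), "attributes")
print("numeric:", numeric)
print("categorical:", categorical)
```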
Now let's look at some possible business applications:
Database marketing
Credit evaluation
Fraud detection
Market basket analysis
Market segmentation
Web usage mining and personalization
Data Mining Tasks
Supervised Learning - Where there is a predefined attribute whose values are to be predicted: Classification, Prediction (of numeric values)
Unsupervised Learning - Where there is no predefined attribute for prediction: Clustering (or Cluster Analysis)
Classification
Classification is the process of assigning data records into one of several predefined groups, referred to as classes. Classification involves building a model, called a classifier, which can be a mathematical function, a set of rules, or other representations.
From the above table, if we want to use "Age", "Income", and "Year-of-Education" to predict "Favorite", then this task is Classification, since the outcomes are categorical ("YES" or "NO").
More Examples of Classification:
Fraud detection (true or false)
Security trading decision (buy, sell, or hold)
Medical diagnosis (presence or absence of a disease)
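As a rough illustration of a classifier for the "Favorite" example, the sketch below uses a 1-nearest-neighbor rule, which is just one of many possible techniques and is not prescribed by this article. Only the first training row comes from the example table; the other rows, and the names `training`, `scale`, and `classify`, are invented for illustration.

```python
import math

# Training records: (Age, Income, Year-of-Education) -> Favorite.
# Only the first row is from the example table; the rest are hypothetical.
training = [
    ((35, 62000, 10), "YES"),
    ((42, 58000, 12), "YES"),
    ((23, 21000, 16), "NO"),
    ((51, 30000, 9), "NO"),
]

# Largest value of each attribute, used to bring attributes to a similar
# scale so that Income does not dominate the distance.
maxima = tuple(max(features[i] for features, _ in training) for i in range(3))


def scale(features):
    """Divide each attribute value by the largest value seen in training."""
    return tuple(value / top for value, top in zip(features, maxima))


def classify(features):
    """1-nearest-neighbor: return the class of the closest training record."""
    query = scale(features)
    _, label = min(
        training,
        key=lambda item: math.dist(query, scale(item[0])),
    )
    return label


print(classify((38, 60000, 11)))
```

The "model" here is simply the stored training records plus a distance rule. A decision tree or a set of if-then rules would be an equally valid classifier for the same task.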
Prediction
The prediction of numeric values helps us discover the relationship between one set of variables (called independent or input variables) and another set of variables (called dependent or output variables) in the data. Once these relationships are discovered, the past or current values of independent variables can be used to predict the future values of dependent variables.
Prediction vs. Classification:
Prediction - The values of the attribute to be predicted (dependent variable) are numeric
Classification - The values of the attribute to be predicted (class attribute) are categorical
From the above table, if we want to use "Age", "Income", and "Year-of-Education" to predict "Purchase-Amount", then this task is Prediction, since the outcomes are numeric values.
More Examples of Prediction:
Sales volume / revenue prediction
Stock price prediction
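A simple way to see numeric prediction in action is a least-squares line with one input variable. The sketch below predicts "Purchase-Amount" from "Income" only. The data points are hypothetical (only the 62,000 / 429 pair comes from the example table) and were chosen to lie on a straight line so the result is easy to check. The function names `fit_line` and `predict` are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Independent variable (Income) and dependent variable (Purchase-Amount).
# Hypothetical values, except the pair (62000, 429) from the example table.
incomes = [40000, 50000, 62000, 80000]
purchases = [319, 369, 429, 519]

slope, intercept = fit_line(incomes, purchases)


def predict(income):
    """Use the fitted relationship to predict a purchase amount."""
    return slope * income + intercept


print(round(predict(70000), 2))
```

Real data will not fall exactly on a line, and a practical model would usually use several input variables (e.g., "Age" and "Year-of-Education" as well).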
Clustering
Clustering is the process of grouping data records into a number of groups, called clusters, such that records within the same cluster are more similar than those belonging to different clusters. This process differs from classification in that clusters are formed as a result of analysis, instead of being predefined.
From the above table, if we want to use "Age" to group customers, then this task is Clustering. For example, the analysis might produce 3 groups of customers: Young (Age 20-30), Mid-Age (Age 31-60), and Senior (Age above 60).
More Examples of Clustering:
Market segmentation
Grouping of library books by field
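The sketch below groups customers by "Age" using k-means on one attribute, a common clustering technique that this article does not specifically name. The ages are hypothetical, the function name `kmeans_1d` is illustrative, and the deterministic choice of starting centroids is a simplification (it assumes k is at least 2).

```python
def kmeans_1d(values, k, iterations=20):
    """Basic k-means on a single numeric attribute (assumes k >= 2)."""
    ordered = sorted(values)
    # Deterministic start: pick k values spread evenly through the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customer ages.
ages = [22, 25, 27, 35, 41, 44, 48, 63, 67, 71]
centroids, clusters = kmeans_1d(ages, k=3)

for centroid, cluster in zip(centroids, clusters):
    print(round(centroid, 1), cluster)
```

Note that the group boundaries are not given in advance: they emerge from the data, which is the key difference from classification.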
Data Mining Process
1. Problem Identification:
- Define the purpose of the data-mining project and nature of the problem (classification, prediction, clustering).
2. Data Preparation:
- Data collection: retrieving, merging, and/or dividing data
- Data cleaning: correcting errors, handling missing data, resolving inconsistencies
- Data reduction: sampling (in rows), feature (attribute) selection (in columns)
- Data transformation: standardizing data, reforming data, conversion between numeric and categorical data
3. Model Formulation and Pattern Exploration:
- Select appropriate data-mining techniques and tools, then use the selected techniques and tools to build models and explore the patterns/relationships hidden in the data
4. Verification and Modification:
- Test if the models built are valid; modify the models if necessary
- Compare different candidate models
5. Interpretation and Implementation:
- Interpret the results of data mining in an intuitive manner
- Implement (aka deploy) the model into related applications
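Parts of the Data Preparation step can be sketched in a few lines of Python. The example below covers only data cleaning (filling a missing value with the mean) and data transformation (standardizing "Age", and converting "Age" from numeric to categorical). The records are hypothetical, mean imputation is just one possible way to handle missing data, and the age groups reuse the Young / Mid-Age / Senior ranges from the clustering example.

```python
import statistics

# Hypothetical raw records; None marks a missing value.
raw = [
    {"Age": 35, "Income": 62000},
    {"Age": 23, "Income": None},
    {"Age": 51, "Income": 30000},
    {"Age": 64, "Income": 46000},
]

# Data cleaning: replace a missing Income with the mean of the known values.
known_incomes = [r["Income"] for r in raw if r["Income"] is not None]
mean_income = statistics.mean(known_incomes)
cleaned = [
    {**r, "Income": r["Income"] if r["Income"] is not None else mean_income}
    for r in raw
]

# Data transformation (standardizing): convert Age to z-scores,
# i.e. (value - mean) / standard deviation.
ages = [r["Age"] for r in cleaned]
mean_age = statistics.mean(ages)
std_age = statistics.pstdev(ages)


def age_group(age):
    """Data transformation (numeric -> categorical) using fixed age ranges."""
    if age <= 30:
        return "Young"
    if age <= 60:
        return "Mid-Age"
    return "Senior"


for r in cleaned:
    r["Age_z"] = (r["Age"] - mean_age) / std_age
    r["Age_group"] = age_group(r["Age"])

for r in cleaned:
    print(r)
```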