
Case Study Assignment Help In Data Mining

Requirement Details

An organization wants to examine the associations among employee experience, skills, traits, etc. in order to better manage its human resources.


As a data scientist, you are required to recognize patterns in the available data and evaluate the efficacy of the methods used to obtain them. Your activities should include preparing the dataset for analysis, investigating relationships in the data set with visualization, identifying frequent patterns, formulating association rules, and evaluating the quality of those rules.


Demonstrate the KDD process with the following activities:

  • Problem statement

  • Perform exploratory data analysis

  • Preprocess the data

  • Propose parameters such as support, confidence, etc.

  • Discover frequent patterns

  • Iterate previous steps by varying parameters

  • Formulate association rules (see the sketch after this list)

  • Compare association rules

  • Briefly explain the importance of the discovered rules
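
Since the core of the assignment is frequent-pattern mining, here is a minimal sketch of the support/confidence workflow, assuming the mlxtend package is installed and the attributes have been one-hot encoded into boolean columns (the data, column names, and thresholds below are illustrative, not taken from the assignment dataset):

#A minimal association-rule sketch (illustrative data and thresholds)
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

#Hypothetical one-hot encoded employee attributes
df = pd.DataFrame({
    'SQL_Server':  [1, 1, 0, 1, 0],
    'PHP_mySQL':   [1, 0, 0, 1, 1],
    'Team_leader': [0, 1, 0, 1, 0],
}).astype(bool)

#Frequent itemsets at a proposed minimum support
itemsets = apriori(df, min_support=0.4, use_colnames=True)

#Rules filtered by a proposed minimum confidence; lift helps compare rules
rules = association_rules(itemsets, metric='confidence', min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

Re-running these two calls with different min_support and min_threshold values is one way to carry out the "iterate previous steps by varying parameters" activity.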


Following are some points to take note of while doing the assignment:

  • The data in some of the rows in the data set may be noisy

  • Some of the attributes have a large number of values – you can consider merging them into 2 or 3 values to simplify the solution (see the binning sketch after this list)

  • State all your assumptions clearly

  • Provide clear explanations to support your stand
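
For the note on merging many-valued attributes, a simple approach is binning with pandas. A minimal sketch, where the values, bin edges, and labels are all illustrative assumptions:

#Merge a many-valued attribute into three levels via binning
import pandas as pd

#Hypothetical many-valued attribute (e.g. an age column)
ages = pd.Series([23, 35, 41, 52, 29, 47])
age_group = pd.cut(ages, bins=[0, 30, 45, 100],
                   labels=['young', 'mid', 'senior'])
print(age_group.value_counts())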


Solution:

Import Libraries

#Import libraries
import pandas as pd
import numpy as np
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import mean_absolute_error, confusion_matrix
import sklearn.metrics as metrics

Read Data

#Read Dataset
df_train = pd.read_csv("Employee_skills_traits.csv")
df_train

Output:

[DataFrame preview of the raw dataset]

Replace spaces in column names

#Replace spaces in column names with '_'
df_train.columns = df_train.columns.str.replace(' ', '_')

Now check the columns

#Show all dataset columns
df_train.columns

Output:

Index(['ID', 'Employment_period_', 'Time_in_current_department_', 'Gender_',
       'Team_leader_', 'Age_', 'Member_of_professional_organizations_',
       '.Net_', 'SQL_Server_', 'HTML_CSS_Java_Script_', 'PHP_mySQL_',
       'Fast_working', 'Awards', 'Communicative_'],
      dtype='object')

Check the Data After Renaming the Attributes

df_train


Here we see that all the column names have been renamed with the spaces removed (we do this step because spaces cause issues when referencing columns).



Checking the Dataset for Null Values

#Check for null values by visualizing them in a heatmap
check_null_value = df_train.isnull()
sns.heatmap(check_null_value, yticklabels=False, cbar=False, cmap='viridis')

Output:

[Heatmap of df_train.isnull(); no null cells are visible]

As per the above heatmap, we can say that the dataset has no null values.
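
As a quick numeric cross-check of the heatmap (a standard pandas call, not shown in the original output):

#Count missing values per column; all zeros confirm the heatmap reading
print(df_train.isnull().sum())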



Checking the Shape of the Dataset

#The shape of the dataset
df_train.shape

Output:

(998, 14)

The dataset has 998 rows and 14 columns.



Checking the Data Types

#Info of the dataset
df_train.info()

Output:

[df_train.info() listing: column names, non-null counts, and dtypes for the 14 columns]

Summary of the dataset

#Summary of the dataset
df_train.describe()

Output:

[df_train.describe() summary statistics table]


Finding the Features and Target Variable and Splitting the Dataset


#Divide the dataset into features and the target attribute "Awards"
x = df_train.drop(['Awards'], axis=1)
target = df_train.Awards
#Split the dataset with a 25 percent test sample
X_train, X_test, y_train, y_test = train_test_split(x, target, test_size=0.25, random_state=0)

Use the K-means Clustering Algorithm and Fit the Model

#Apply K-means clustering and compute the mean absolute error
kmeans = KMeans(n_clusters=2)
kmeans.fit(X_train)  #K-means is unsupervised, so no labels are passed
y_pred = kmeans.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
print('Validation MAE: %0.3f' % test_mae)

Output:

Validation MAE: 0.548

Accuracy

#Find the accuracy of the model
score = metrics.accuracy_score(y_test, y_pred)
print('Accuracy: %0.2f' % score)

Output:

Accuracy: 0.45
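
One caveat worth noting: K-means assigns arbitrary cluster IDs, so cluster 0 does not necessarily correspond to Awards = 0, and an accuracy below 0.5 on a binary target often just means the mapping is flipped. A minimal sketch of checking both mappings (assuming the labels are 0/1 as here):

#Cluster IDs are arbitrary: score the raw and the flipped mapping,
#then keep whichever agrees better with the true labels
acc_raw = metrics.accuracy_score(y_test, y_pred)
acc_flipped = metrics.accuracy_score(y_test, 1 - y_pred)
print('Best alignment accuracy: %0.2f' % max(acc_raw, acc_flipped))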


Fine-Tune the Model to Increase the Accuracy

#Fine-tune K-means to reduce the MAE and increase accuracy
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=5, max_iter=100, tol=0.0001)
kmeans.fit(X_train)  #labels are not used during clustering
y_pred = kmeans.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
print('Validation MAE: %0.3f' % test_mae)

Output:

Validation MAE: 0.452

Accuracy

#Find the accuracy after tuning the parameters
score = metrics.accuracy_score(y_test, y_pred)
print('Accuracy: %0.2f' % score)

Here we can see that the model accuracy increases when we tune the model using the extra K-means parameters.
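
Beyond init and n_init, a common way to iterate over the clustering parameters is the elbow method: fit K-means for several values of k and plot the inertia. A minimal sketch (the range of k is an assumption):

#Elbow method: record inertia for a range of cluster counts
inertias = []
k_values = range(1, 8)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow method for choosing k')
plt.show()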



Plot the Confusion Matrix

#Evaluation of Model - Confusion Matrix Plot

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()


# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['0','1'],
                      title='Confusion matrix, without normalization')

Output:

[Confusion matrix plot, without normalization]

Find Recall And Precision

#Find Recall and Precision
cnf_matrix = confusion_matrix(y_test, y_pred)
recall = np.diag(cnf_matrix) / np.sum(cnf_matrix, axis = 1)
precision = np.diag(cnf_matrix) / np.sum(cnf_matrix, axis = 0)

Confusion Matrix:

cnf_matrix

Output:

array([[63, 61],
       [52, 74]], dtype=int64)

Precision:

#Precision Score
precision

Output:

array([0.55, 0.55])

Recall Score:

#Recall Score
recall

Output:

array([0.51, 0.59])
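
The same per-class precision and recall (plus F1) can be obtained in one call with scikit-learn's built-in report:

#Per-class precision, recall, and F1 in a single table
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=['0', '1']))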


Exploratory Data Analysis

#EDA (Exploratory Data Analysis)
#Plot the correlation heatmap (numeric_only avoids errors on categorical columns in newer pandas)
fig = plt.figure(figsize=(10, 5))
sns.heatmap(df_train.corr(numeric_only=True))

Output:

[Correlation heatmap of the numeric attributes]

Box Plot

#Box plot between ID and Employment_period_
#This shows how the two attributes (ID and Employment_period_) relate to each other
sns.boxplot(x=df_train['Employment_period_'], y=df_train['ID'])

Output:

[Box plot of ID vs Employment_period_]


# visualize frequency distribution of `Gender` variable
f, ax = plt.subplots(figsize=(9, 7))
ax = sns.countplot(x="Gender_", data=df_train, palette="Set1")
ax.set_title("Frequency distribution of Gender variable")
ax.set_xticklabels(df_train.Gender_.value_counts().index, rotation=30)
plt.show()

Output:

[Count plot: frequency distribution of the Gender_ variable]

Bar Chart

#Bar plot (we take only the first 18 records of the dataset so the plot stays readable)
df_train_plot = df_train.iloc[:18]
df_train_plot.plot(x= "ID", y= "Employment_period_", kind="bar")

Output:

[Bar chart of Employment_period_ for the first 18 IDs]

If you need any other help related to machine learning, send your requirement details to realcode4you@gmail.com and get instant help.
