In this project, I worked on a time series dataset consisting of event logs with information such as date and time, event type, cluster, duration, and total users. The goal was to predict the next event time, the number of users, and the next 100 cluster names using deep learning techniques, particularly LSTM networks.
Data Processing
Loading Data: I loaded the dataset from a CSV file using Pandas, specifying column names for better organization.
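A minimal sketch of this loading step is shown here. It assumes the CSV has no header row and uses an in-memory stand-in for the real file. Only `Date_and_Time`, `EventType`, and `Cluster` appear in the code later in this post; `Duration` and `TotalUsers` are hypothetical names for the other two columns.

```python
import io
import pandas as pd

# Column names: Date_and_Time, EventType and Cluster come from the post's code;
# Duration and TotalUsers are assumed labels for the remaining two columns.
COLUMNS = ["Date_and_Time", "EventType", "Cluster", "Duration", "TotalUsers"]

# Tiny in-memory stand-in for the real CSV file (assumed to have no header row).
sample_csv = io.StringIO(
    "2023-01-01 00:00:00,login,A,12,5\n"
    "2023-01-01 00:05:00,logout,B,7,4\n"
)

# names= assigns the column labels; header=None tells pandas the first row is data.
mydata = pd.read_csv(sample_csv, header=None, names=COLUMNS)
print(mydata.shape)
```

With a real file, the `io.StringIO` object would be replaced by the path to the CSV.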
Exploratory Data Analysis: I explored the dataset by displaying the first few rows, checking its shape, and plotting the distributions of clusters and event types as bar charts.
Preprocessing: Categorical variables (event type and cluster) were encoded using LabelEncoder. The date and time values were converted to Unix timestamps for numerical processing. I defined a sequence length of 10 to create input sequences for the LSTM model.
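The timestamp conversion and the windowing can be sketched roughly as follows. The helper names `to_unix_seconds` and `make_sequences` are hypothetical, and the sketch assumes a single feature per time step with the next value as the target; the original pipeline may differ.

```python
import numpy as np
import pandas as pd

def to_unix_seconds(series):
    """Convert a datetime-like Series to integer seconds since the Unix epoch."""
    dt = pd.to_datetime(series)
    return (dt - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

def make_sequences(values, seq_len=10):
    """Slide a window of seq_len steps over values; the target is the step after each window."""
    # float64 keeps second-level precision for large Unix timestamps.
    values = np.asarray(values, dtype=np.float64)
    X = np.stack([values[i:i + seq_len] for i in range(len(values) - seq_len)])
    y = values[seq_len:]
    # Add a trailing feature axis so X has shape (samples, seq_len, 1), as an LSTM expects.
    return X[..., np.newaxis], y

# Fifteen event times one minute apart (illustrative data only).
times = pd.Series(pd.date_range("2023-01-01", periods=15, freq="min"))
ts = to_unix_seconds(times)
X, y = make_sequences(ts, seq_len=10)
print(X.shape, y.shape)
```

Raw Unix timestamps are large numbers, so they are usually scaled before being fed to a network; the post does not say whether scaling was applied.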
Model Training
Sequential LSTM Model: I built a sequential LSTM model architecture for predicting the next event time and the number of users. The model consists of two LSTM layers with dropout regularization to prevent overfitting and a dense output layer for regression tasks.
Compilation and Training: The models were compiled using the Adam optimizer and mean squared error (MSE) loss function. They were trained for 10 epochs with a batch size of 64.
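As a rough sketch of the architecture and compilation described above, assuming one input feature per time step and the unit count and dropout rate from the hyperparameter list (`build_regression_model` is a hypothetical helper name):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout

def build_regression_model(seq_len=10, n_features=1, units=64, dropout=0.2):
    """Two stacked LSTM layers with dropout, then a single-unit dense output for regression."""
    model = Sequential([
        Input(shape=(seq_len, n_features)),
        LSTM(units, return_sequences=True),  # pass the full sequence to the second LSTM
        Dropout(dropout),
        LSTM(units),                         # keep only the final hidden state
        Dropout(dropout),
        Dense(1),                            # one regression target
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_regression_model()
# Training as described in the post (X_train and y_train are placeholders):
# model.fit(X_train, y_train, epochs=10, batch_size=64)
```

One such model would be trained per target, for example one for the next event time and one for the number of users.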
Model Evaluation: After training, I evaluated the models' performance using metrics such as R2 score, MSE, and MAE. The models showed reasonable performance in predicting the next event time and the number of users.
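The three regression metrics can be computed with scikit-learn as in this small sketch. The arrays are made-up values for illustration, not results from the project, and `regression_report` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def regression_report(y_true, y_pred):
    """Collect the R2 score, MSE and MAE for one set of predictions."""
    return {
        "r2": r2_score(y_true, y_pred),
        "mse": mean_squared_error(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
    }

# Illustrative values only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 5.0])
metrics = regression_report(y_true, y_pred)
print(metrics)
```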
Model for Cluster Prediction
Sequence Creation: Sequences with a length of 100 were created for predicting the next 100 cluster names.
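One plausible way to build such sequences is to pair each window of 100 encoded cluster ids with the 100 ids that follow it. This is a sketch under that assumption; `make_cluster_windows` is a hypothetical helper and the original windowing may differ.

```python
import numpy as np

def make_cluster_windows(cluster_ids, seq_len=100):
    """Pair each window of seq_len cluster ids with the seq_len ids that follow it."""
    ids = np.asarray(cluster_ids)
    n = len(ids) - 2 * seq_len + 1  # number of complete (input, target) pairs
    X = np.stack([ids[i:i + seq_len] for i in range(n)])
    y = np.stack([ids[i + seq_len:i + 2 * seq_len] for i in range(n)])
    return X, y

# Random stand-in for the label-encoded Cluster column (5 classes assumed).
ids = np.random.default_rng(0).integers(0, 5, size=300)
X, y = make_cluster_windows(ids, seq_len=100)
print(X.shape, y.shape)
```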
Model Architecture: A TimeDistributed dense layer was added on top of the LSTM layers so that the model outputs a cluster prediction at each of the 100 time steps.
Compilation and Training: The model was compiled with categorical cross-entropy loss and trained for 10 epochs.
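A sketch of what this cluster model could look like in Keras follows. The number of classes, the single input feature, and the name `build_cluster_model` are assumptions; the unit count and dropout rate come from the hyperparameter list.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout, TimeDistributed

def build_cluster_model(seq_len=100, n_classes=5, units=64, dropout=0.2):
    """LSTM layers that return full sequences, with a softmax classifier applied at every step."""
    model = Sequential([
        Input(shape=(seq_len, 1)),
        LSTM(units, return_sequences=True),
        Dropout(dropout),
        LSTM(units, return_sequences=True),  # keep all time steps for the per-step output
        Dropout(dropout),
        TimeDistributed(Dense(n_classes, activation="softmax")),
    ])
    # Categorical cross-entropy expects one-hot targets of shape (samples, seq_len, n_classes).
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_cluster_model()
# Training as described in the post (X_train and y_train_onehot are placeholders):
# model.fit(X_train, y_train_onehot, epochs=10, batch_size=64)
```

Integer cluster ids can be turned into the required one-hot targets with `tensorflow.keras.utils.to_categorical`.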
Model Evaluation: The model was evaluated on the test data, and metrics such as loss and accuracy were computed. A classification report was also generated to assess the model's per-class performance.
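Because the model predicts a class at every time step, the predictions have to be flattened before scikit-learn can score them. The sketch below uses random stand-in arrays in place of real model output, so the numbers it prints are meaningless; it only illustrates the mechanics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

rng = np.random.default_rng(0)

# Stand-ins: per-step class probabilities as a model would return them,
# shape (samples, time steps, classes), and the true integer cluster ids.
probs = rng.random((4, 100, 3))
y_true = rng.integers(0, 3, size=(4, 100))

# Pick the most likely class at each step, then flatten both arrays to 1-D.
y_pred = probs.argmax(axis=-1)
acc = accuracy_score(y_true.ravel(), y_pred.ravel())
report = classification_report(y_true.ravel(), y_pred.ravel(), zero_division=0)
print(acc)
print(report)
```

To show cluster names instead of integer ids in the report, the fitted encoder's `inverse_transform` can be applied to both flattened arrays first.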
Hyperparameters:
· LSTM units: 64
· Dropout rate: 0.2
· Optimizer: Adam
· Loss function: Categorical cross-entropy (cluster model); MSE (regression models)
· Number of epochs: 10
· Batch size: 64
Overall, the LSTM models demonstrated promising results in predicting various aspects of the time series dataset. Further experimentation and fine-tuning of hyperparameters could potentially enhance the models' performance.
Implementation
#Importing Modules
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, accuracy_score, precision_score, recall_score, f1_score, classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, TimeDistributed
#Load the Data
#Explore Data
mydata.head()
output:
print("[$] Rows of dataset >> ",mydata.shape[0])
print("[$] Columns of dataset >> ",mydata.shape[1])
output:
[$] Rows of dataset >>  40000
[$] Columns of dataset >>  5
mydata['Cluster'].value_counts().plot(kind='bar')
output: (bar chart of Cluster value counts)
mydata['EventType'].value_counts().plot(kind='bar')
output: (bar chart of EventType value counts)
#Preprocess Data
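# Encode categorical variables
# Use one encoder per column so each mapping can be inverted later
event_encoder = LabelEncoder()
cluster_encoder = LabelEncoder()
mydata["EventType"] = event_encoder.fit_transform(mydata["EventType"])
mydata["Cluster"] = cluster_encoder.fit_transform(mydata["Cluster"])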
# Convert Date_and_Time to datetime object
mydata["Date_and_Time"] = pd.to_datetime(mydata["Date_and_Time"])
#Data Building
#Model Building
...
...
#Model Prediction
#Model Evaluation