1. Summary of Problem Statement, Data, and Findings
Problem Statement:
Industries worldwide face significant challenges in preventing accidents in manufacturing plants. These accidents sometimes result in severe injuries or fatalities. Despite advancements in safety measures, the frequency and severity of these incidents remain concerning. This project aims to design a machine learning (ML) and deep learning (DL)-based chatbot utility to assist safety professionals in identifying safety risks by analyzing accident descriptions in real-time. By leveraging historical accident data, the chatbot will highlight potential risks based on incident details.
Data Description:
The dataset comprises accident records from 12 manufacturing plants located in three different countries. Each record represents an individual accident occurrence, and the dataset includes the following columns:
➢ Data: Timestamp of the accident
➢ Countries: Country where the accident occurred (anonymized)
➢ Local: City where the plant is located (anonymized)
➢ Industry Sector: The sector of the manufacturing plant
➢ Accident Level: Severity of the accident, ranging from I (least severe) to VI (most severe)
➢ Potential Accident Level: The severity the accident could have reached, based on other contributing factors
➢ Genre: Gender of the affected individual (Male/Female)
➢ Employee or Third Party: Indicates whether the injured person was an employee or a third-party individual
➢ Critical Risk: Describes the primary risk factor involved in the accident
➢ Description: Detailed narrative of the accident
Key Findings:
➢ The dataset provides valuable insights into accident patterns across different sectors and countries.
➢ Features such as "Accident Level" and "Critical Risk" may be significant in predicting the severity and potential risks in future accidents.
➢ Textual descriptions of accidents offer rich data for Natural Language Processing (NLP) techniques to identify safety concerns.
We now load the dataset into a DataFrame (df) to explore how many rows and columns it has and to preview how the data looks.
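A minimal sketch of this loading and inspection step, assuming pandas; the CSV file name used here is illustrative:

```python
import pandas as pd

# Load the accident records (file name is illustrative)
df = pd.read_csv("industrial_safety_and_health_database.csv")

print(df.shape)   # number of rows and columns
print(df.head())  # preview of the first few records
df.info()         # column dtypes and non-null counts

# Parse the accident timestamps if they were read in as plain strings
df["Data"] = pd.to_datetime(df["Data"])
```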
The dataset consists of 425 entries and 11 columns, with no missing or null values. The columns include various attributes related to industrial accidents, such as the timestamp of the incident ("Data"), the country and location where the accident occurred ("Countries" and "Local"), the industry sector, accident severity levels ("Accident Level" and "Potential Accident Level"), the gender of the affected individual ("Genre"), whether the person involved was an employee or third party ("Employee or Third Party"), the critical risk factor ("Critical Risk"), and a detailed description of the incident ("Description"). The data types include integer values, categorical strings, and a datetime object for the accident timestamps. The absence of null values indicates that the dataset is clean and ready for further analysis and modeling.
2. Summary of the Approach to EDA and Pre-processing
➔ Data Pre-Processing
➢ Dropping the 'Unnamed: 0' column: The column 'Unnamed: 0' is identified as an index column and is not needed for analysis. Therefore, it is removed from the dataset using the drop() function with axis=1 to drop a column and inplace=True to modify the dataset directly.
➢ Converting categorical columns: All categorical columns in the dataset, such as 'Countries', 'Local', 'Industry Sector', 'Accident Level', 'Potential Accident Level', 'Genre', 'Employee or Third Party', and 'Critical Risk', are converted to the category data type using astype('category'). This step optimizes memory usage and helps in better handling of categorical variables for analysis and modeling.
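A minimal sketch of these two pre-processing steps, using the column names from the data description and the DataFrame df loaded above:

```python
# Drop the redundant index column carried over from the CSV export
df.drop("Unnamed: 0", axis=1, inplace=True)

# Convert the categorical columns to the memory-efficient 'category' dtype
categorical_cols = ["Countries", "Local", "Industry Sector", "Accident Level",
                    "Potential Accident Level", "Genre",
                    "Employee or Third Party", "Critical Risk"]
df[categorical_cols] = df[categorical_cols].astype("category")
```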
Exploratory Data Analysis (EDA):
1. Count of Countries: The dataset has three country categories, with Country_01 having the highest count of incidents, followed by Country_02. Country_03 has significantly fewer incidents. This indicates that Country_01 is the most frequent location for incidents in the data.
2. Count of Local: The "Local" feature displays a wide distribution across 12 local categories, with certain local codes like Loc04 and Loc09 having a higher frequency of incidents. This variation suggests that incidents are not evenly distributed across locations, possibly indicating areas with higher risk.
3. Count of Industry Sector: The incidents are primarily concentrated in two industry sectors: Mining and Metals, with Mining reporting the most incidents. This may imply that mining operations have a higher propensity for accidents, which could be valuable for targeted risk management.
4. Count of Accident Level: Accident severity levels vary, with I being the most common level of accident severity, followed by lower counts for levels II, III, IV, and V. The high count of level I incidents might indicate a larger number of less severe incidents in the dataset.
5. Count of Potential Accident Level: The "Potential Accident Level" feature shows that level IV has the highest frequency, followed by levels III, II, and I, with level VI having the least count. This distribution can help in prioritizing preventive measures based on the potential risk severity.
6. Count of Genre: The "Genre" feature indicates a predominantly male workforce in the dataset, with significantly fewer incidents involving females. This may reflect the demographics of the industry or the workforce distribution within these sectors.
7. Count of Employee or Third Party: This chart shows that incidents are almost equally split between Employee and Third Party, with Third Party (Remote) having fewer cases. This insight can be useful in distinguishing between internal and external risks when planning safety measures.
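The count charts summarized above can be produced with a short loop; this sketch assumes seaborn and matplotlib, which the report does not name explicitly:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One bar chart per categorical feature discussed above
for col in ["Countries", "Local", "Industry Sector", "Accident Level",
            "Potential Accident Level", "Genre", "Employee or Third Party"]:
    plt.figure(figsize=(8, 4))
    sns.countplot(data=df, x=col, order=df[col].value_counts().index)
    plt.title(f"Count of {col}")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()
```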
Monthly accident trends:
1. Peak in Early Months: The number of accidents is highest in the early part of the year, with a peak in February (around 60 incidents). This could indicate heightened risks or increased activity in the early months, possibly due to production cycles or weather-related factors.
2. Gradual Decline: Following the peak in February, the accident count gradually decreases over the next few months, though it remains relatively high until June. This suggests a period of consistently higher risk in the first half of the year.
3. Lowest Incidents in Late Months: Accident frequency drops significantly from July onwards, reaching the lowest point in November. This decline could be due to various factors, such as seasonal slowdowns, increased safety measures, or fewer work hours towards the end of the year.
4. Slight Increase in December: There's a minor rise in accident occurrences in December, which might be related to year-end rush or holiday-related adjustments in operations. This analysis of monthly accident trends can guide targeted interventions and resource allocation, especially in the first half of the year when accidents are more frequent.
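The monthly trend above relies on extracting the month from the "Data" timestamp; a minimal sketch, assuming the column has already been parsed as a datetime:

```python
import matplotlib.pyplot as plt

# Derive the month name from the 'Data' timestamp and count accidents per month
df["Month"] = df["Data"].dt.month_name()
month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

monthly_counts = df["Month"].value_counts().reindex(month_order)
monthly_counts.plot(kind="bar", figsize=(10, 4), title="Accidents per month")
plt.ylabel("Number of accidents")
plt.tight_layout()
plt.show()
```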
Accidents by industry sector and country:
1. Mining Sector: This sector has the highest number of incidents overall, with most occurring in Country_01, followed by Country_02. This suggests that Country_01 is particularly affected by mining-related accidents, making it a critical area for safety measures.
2. Metals Sector: Incidents in the metals sector are more balanced between Country_01 and Country_02, with Country_02 reporting slightly more incidents. This balance indicates that both countries face similar levels of risk in the metals industry.
3. Others Sector: Incidents in the "Others" category are mainly concentrated in Country_03, with only a few incidents reported in the other two countries. This may imply that Country_03 has a more diverse industrial base, with different accident risks compared to the mining- and metals-focused Country_01 and Country_02.
Preprocessing (NLP Preprocessing Techniques):
Since a significant portion of the data is textual ("Description"), NLP techniques were employed to preprocess and clean the textual data:
➢ Tokenization: Splitting the descriptions into individual words or tokens.
➢ Stopword Removal: Removing common words that do not contribute much to the meaning of the text.
➢ Lemmatization: Reducing words to their base form to standardize the dataset.
➢ Text Vectorization: Converting textual data into numerical form using methods such as TF-IDF or word embeddings for model compatibility.
We implemented a text preprocessing function and applied it to the Description column of the accident dataset. The preprocessing steps aim to clean and standardize the text for further analysis or model training.
1. Lowercasing: The text is converted to lowercase to ensure uniformity and eliminate case sensitivity in subsequent analysis.
2. Removing Punctuation and Special Characters: Non-alphanumeric characters (such as punctuation) are removed using a regular expression to focus only on meaningful words.
3. Stopword Removal: Common English words (such as "the," "and," etc.) that do not carry significant meaning are filtered out using a predefined list from the stopwords library.
4. Lemmatization: Each word in the text is lemmatized using the WordNetLemmatizer, reducing words to their base form (e.g., "running" becomes "run").
This preprocessing pipeline helps clean and transform the textual data into a more suitable format for machine learning or text analysis tasks.
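A minimal sketch of this preprocessing function, assuming NLTK for the stopword list and lemmatizer; the output column name Cleaned_Description is illustrative:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")   # one-time downloads of the NLTK resources
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    text = text.lower()                               # 1. lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)          # 2. drop punctuation/special chars
    tokens = [lemmatizer.lemmatize(word)              # 4. lemmatize each token
              for word in text.split()
              if word not in stop_words]              # 3. remove stopwords
    return " ".join(tokens)

# Output column name is illustrative
df["Cleaned_Description"] = df["Description"].apply(preprocess_text)
```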
We converted the cleaned text data from the Description column into a numerical format using the TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer. Here's a summary of the steps involved:
1. TF-IDF Vectorization: The code uses the TfidfVectorizer from sklearn to transform the text data into a matrix of numerical features. This method calculates the importance of each word in the context of the entire dataset by considering both the term frequency (how often a word appears in a document) and the inverse document frequency (how common or rare the word is across all documents). Words that appear frequently across many documents are given lower importance, while rare words in a specific document are given higher importance.
2. Limiting Features: The max_features=500 parameter limits the number of features (i.e., unique words or terms) to 500. This helps in reducing the dimensionality of the text data and focusing on the most relevant words for the analysis, improving computational efficiency.
3. Transformation: The fit_transform() method applies the transformation to the Description column, converting the text data into a sparse matrix X, where each row represents a document (accident description), and each column represents a word's TF-IDF score. This transformation enables the use of machine learning algorithms, which require numerical input, to analyze the text data.
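A minimal sketch of the vectorization step, reusing the cleaned text column from the previous sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep the 500 most relevant terms as TF-IDF features
tfidf = TfidfVectorizer(max_features=500)
X = tfidf.fit_transform(df["Cleaned_Description"])  # sparse matrix: one row per description

print(X.shape)  # (number of accident descriptions, 500)
```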
We applied Label Encoding to several categorical columns in the dataset, converting each unique category into a corresponding numeric label. The columns encoded include Accident Level, Potential Accident Level, Genre, Employee or Third Party, Countries, Local, and Industry Sector.
This transformation makes the categorical data compatible with machine learning algorithms that require numerical input. After preprocessing, the cleaned data was saved in an appropriate format (CSV) for further model training.
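A minimal sketch of the label-encoding and export step; the output file name is illustrative:

```python
from sklearn.preprocessing import LabelEncoder

# Convert each categorical column into integer labels
label_cols = ["Accident Level", "Potential Accident Level", "Genre",
              "Employee or Third Party", "Countries", "Local", "Industry Sector"]
for col in label_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

# Persist the preprocessed data for the modelling step (file name is illustrative)
df.to_csv("preprocessed_accidents.csv", index=False)
```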
3. Deciding Models and Model Building
In this section, we experimented with different classification models and sampling techniques to determine the most effective model for predicting the "Accident Level" in the dataset.
1. Data Preparation: We dropped unnecessary columns such as 'Description' and 'Data' from the features and split the dataset into training and testing sets.
2. Model Selection: We chose three classification models: Logistic Regression, Random Forest, and Support Vector Machine (SVM).
3. Sampling Strategies: To address class imbalance, we tested three sampling methods: Original, Undersampling, and Oversampling.
4. Model Evaluation: We trained each model with all three sampling strategies and evaluated its performance using accuracy, precision, recall, and F1-score.
- Original Data:
➢ Logistic Regression showed an accuracy of 0.84, but it had issues with recall for some classes.
➢ Random Forest achieved perfect accuracy (1.00) across all classes.
➢ Support Vector Machine performed the worst, with an accuracy of 0.80.
- Undersampled Data:
➢ Logistic Regression's accuracy dropped to 0.60, showing better performance for some classes but still struggling with class imbalance.
➢ Random Forest remained strong with an accuracy of 0.96.
➢ SVM performed poorly again, with an accuracy of 0.51.
- Oversampled Data:
➢ Logistic Regression improved slightly to 0.74 but still had a few issues with some classes.
➢ Random Forest remained perfect with 1.00 accuracy.
➢ SVM performed very poorly with only 0.13 accuracy.
Best Model: Random Forest consistently performed the best across all sampling strategies, achieving perfect accuracy (1.00) in the original and oversampled datasets and a high score in the undersampled dataset.
Conclusion: The Random Forest model proved to be the most reliable for this classification task, regardless of the sampling strategy used. While Logistic Regression performed reasonably well in some scenarios, its performance suffered with class imbalance. The Support Vector Machine (SVM) showed poor results across all sampling strategies and is not suitable for this problem in its current form.
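A minimal sketch of the train/evaluate loop described in this section. It assumes imbalanced-learn for the under- and oversampling (the report does not name the resampling implementation) and drops any remaining non-numeric helper columns for simplicity:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

# Target and features: drop the target, the raw text, the timestamp, and
# any helper columns created earlier; keep only the numeric (encoded) features
y = df["Accident Level"]
X_features = df.drop(columns=["Accident Level", "Description", "Data",
                              "Cleaned_Description", "Month"], errors="ignore")
X_features = X_features.select_dtypes(include="number")

X_train, X_test, y_train, y_test = train_test_split(
    X_features, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
}
samplers = {
    "Original": None,
    "Undersampling": RandomUnderSampler(random_state=42),
    "Oversampling": RandomOverSampler(random_state=42),
}

# Train and evaluate every model under every sampling strategy
for s_name, sampler in samplers.items():
    if sampler is None:
        X_res, y_res = X_train, y_train
    else:
        X_res, y_res = sampler.fit_resample(X_train, y_train)
    for m_name, model in models.items():
        model.fit(X_res, y_res)
        print(f"--- {m_name} | {s_name} data ---")
        print(classification_report(y_test, model.predict(X_test), zero_division=0))
```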
4. How to Improve Model Performance?
1. Hyperparameter Tuning: Use techniques like Grid Search or Randomized Search to fine-tune model parameters, or consider advanced methods like Bayesian Optimization for better efficiency.
2. Ensemble Learning: Combine models using Bagging (e.g., Random Forest) or Boosting (e.g., XGBoost, LightGBM) to improve predictive accuracy.
3. Feature Engineering: Enhance model performance by creating new features, performing feature selection, or applying dimensionality reduction (e.g., PCA) to remove noisy or redundant features.
4. Cross-Validation: Implement Stratified K-fold Cross-Validation to ensure a balanced representation of classes, helping to avoid overfitting and providing a more reliable evaluation.
5. Class Weight Adjustment: Adjust class weights to handle imbalanced datasets more effectively, especially in models like Random Forest and Logistic Regression.
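As an example of points 1, 4, and 5 above, the following sketch tunes a class-weighted Random Forest with Grid Search and Stratified K-fold Cross-Validation, reusing the training split from Section 3; the parameter grid values are illustrative:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# Stratified 5-fold CV keeps the class proportions in every fold;
# class_weight="balanced" compensates for the imbalanced accident levels
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 3, 5],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid, cv=cv, scoring="f1_macro", n_jobs=-1)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best macro F1:", search.best_score_)
```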
5. Conclusion and Next Steps
Conclusion: After applying SMOTE to handle class imbalance, the training data has a more balanced class distribution and the model's performance has improved. While oversampling has addressed the imbalanced data, further steps can be taken to enhance the model's robustness and accuracy: hyperparameter tuning, ensemble methods, and feature engineering remain important avenues to explore for better performance.
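A minimal sketch of the SMOTE step, assuming the imbalanced-learn implementation and the training split from Section 3:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# Synthesize minority-class samples in the training split only, so the test
# set stays untouched; lower k_neighbors if the rarest class has few samples
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

print("Class counts before SMOTE:", Counter(y_train))
print("Class counts after SMOTE: ", Counter(y_train_bal))
```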