1. Understanding the significance of Case Study ‘Submission I’
A case study is an excellent method of learning because it involves the practical aspects of a subject. A good case study should comprise an appropriate introduction to the case, followed by identification of the problem and a concrete analysis of it.
2. Meaning of a Case Submission
A case study is a research strategy and an empirical inquiry that investigates a phenomenon within its real-life context. It gives learners the opportunity to demonstrate independence and originality, to plan and organize the case analysis, and to put into practice some of the legal aspects they have been taught throughout the program.
3. Technical specifications of the Submissions
Font: Times New Roman, font size 12, spacing 1.5
Margin: Left 35 mm, Right 20 mm, Top 35 mm, Bottom 20 mm
First (preliminary) page should have the following information:
i) Top: The title in block and capital (uppercase).
ii) Centre: Full name of the student in capital letters and registration number.
iii) Full name of the Guide (If any) in block letters, with the designation.
iv) Bottom: Name of the Institute (i.e. SCDL), in block and capital with the academic year.
4. How to prepare a case study
Introduction
Provide a context for the case and describe any similar cases previously reported. Here you can state the problem in brief.
Case presentation
Several sentences describe the history and results of any examinations performed. The working diagnosis and management of the case are described.
i. Introductory sentence
ii. Describe the essential nature of the problem.
iii. Further development of solution of the problem.
iv. Summarize the results of examination.
v. Explain the importance of various variables/features in the dataset
Management and Outcome
Simply describe the course of the complaint. Where possible, refer to any outcome measures that you used to demonstrate the result objectively.
Describe as specifically as possible the solution provided, including the nature of the formula.
If possible, refer to objective measures of the solution's progress.
Describe the final outcome of the solution.
5. The entire content can be submitted in a Word or PDF file as an attachment.
Some examples include solving common business problems in Marketing, Sales, Customer Clustering, Banking, Real Estate, Insurance, Travel and many more. Please choose a dataset for doing the submission in 2 parts:
1) The first part will consist of EDA
The use case and dataset for both these submissions should be the same. Also, please keep in mind that the solution set for semester 2 includes submitting the first part and for semester 4 you have to submit the ML application use case and code.
2) Data Science at Flipkart[Recommendation engine for products]
3) Case Study of Data Science at Facebook
4) Customer Analytics at Flipkart.Com [Clustering users based on their interests till date]
Please note that a plagiarism check is mandatory before submitting the final soft copy of the Submission I Report to SCDL for evaluation. In simple terms, plagiarism is copy-paste. Students can use the Free and Open Source Software (FOSS) available online to check for it (e.g. http://plagiarisma.net/). As per the UGC guidelines, the similarity index should be less than 10%. The plagiarism report should be attached to Submission I; without it, the case study report will not be evaluated.
Case Study Topic
MOVIE SUCCESS RATE PREDICTION
1. Problem Statement
Our data analysis task is to evaluate how well the features of a movie relate to its success. Nowadays a great deal of information is shared through the internet and social media, including entertainment, news, and business content. Many movies are produced and released every month all over the world. Some of them succeed and some do not. Many people watch movies in multiplexes or on online streaming portals such as Amazon Prime and Disney+ Hotstar. The success rate of a movie is important because a huge amount of money is invested in making it. The director invests a huge amount of money to make the movie better so that it can earn that money back. For this reason production teams and the movie industry as a whole are always concerned about the success rate of their movies. In this work we use an unsupervised machine learning technique to predict the movie success rate. The data comes from Kaggle. It is free, public, and easy for any researcher to use, and it is sourced from the Internet Movie Database. The dataset contains both numeric and categorical variables such as rating, director, actor/actress, budget, genre, title, running time, MPAA rating, and awards. We also use one supervised machine learning algorithm, logistic regression. Finally, we evaluate the model using precision, recall, and accuracy.
2. Introduction
A large number of movies are released every year, and they are a popular source of entertainment. Movie prediction here means estimating the success rate of a movie. The film industry, whether Hollywood or Bollywood, has grown over the last 15 years and now earns money through both online and offline channels. Different kinds of movies are released every year. Some of them, such as motivational and historical films, influence people and inspire them to succeed. The main objective of every movie maker is to earn a profit and to make the movie popular with viewers. Predicting the success of a movie is very difficult for the film industry. The success of a movie depends on many features such as songs, story, title, genre, rating, actors, and graphics. “Fight Club” was a very famous movie, but it did not earn a huge profit. According to some movie analysts it earned a gain of approximately 25 percent in its first two weeks, which is low relative to the investment in the movie. Sometimes people are confused about which movie to watch. To handle this situation, machine learning techniques can be used to identify good movies. Machine learning predictions could also help investors and movie producers choose new movies, actors, and actresses wisely for future investment. Nowadays many websites and sources contain information about movies and other entertainment programs. The Internet Movie Database, launched in October 1990, is a large collection of movie data. It includes data on TV programs, Hollywood movies, Bollywood movies, and games. The Internet Movie Database offers several features associated with movies, including ratings, production crew, reviews, images, runtime, quotes, and videos. It has more than 6 million titles and 10 million personalities in its database.
3. Introduction to Dataset
The dataset “movie_success_rate.csv” contains the information needed to analyze movies and predict their success rate. Interest in this kind of data has grown over nearly a decade because of people's interest in the film industry. The dataset has 839 rows and 33 columns. Below we describe the features that are used to predict the movie success rate.
Dataset: (https://www.kaggle.com/datasets)
...
...
movie_success_rate.csv
Shape of dataset: (839, 33)
The shape shows that the dataset has 839 rows and 33 feature columns. As the table above shows, the columns have different data types. Some are of float data type (Rank, Actors, Year, Runtime, Rating, Votes, Revenue, Metascore, Action, Adventure, Aniimation, Biography, Comedy, Crime, Drama, Family, Fantasy, History, Horror, Music, Musical, Mystery, Romance, Sci-Fi, Sport, Thriller, War, Western, Success) and some are of object data type (Title, Genre, Description, Director).
4. Data Pre-processing & Summary
4.1 Data Pre-Processing:
Real-world data generally contains issues such as noise, missing values, and inconsistent formats, so it cannot be used directly by machine learning algorithms. Pre-processing is the process of cleaning the data and making it suitable for an ML model, which improves the model's efficiency and accuracy.
➢ Data pre-processing is the first main step. It enhances the quality of the data so that meaningful insights can be extracted from it. In machine learning, data pre-processing refers to preparing (cleaning and organizing) the raw data to make it suitable for building and training models.
➢ Make a list of the data (feature) columns and the target. Check for null values and, if any exist, fill them.
In this work we used the following pre-processing steps:
Checking the null values: The code below checks for null values in the dataset. We did not find any missing values in the movie_success_rate dataset.
# Check for null values and visualize them as a heat map
import seaborn as sns
check_null_value = df.isnull()
sns.heatmap(check_null_value, yticklabels=False, cbar=False, cmap='viridis')
Output:
The heat map above clearly shows that the dataset has no missing or NaN values.
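A numeric check can complement the heat map. The following is a minimal sketch, assuming pandas is installed. The small DataFrame `toy` and its values are made up for illustration and are not rows from movie_success_rate.csv.

```python
import pandas as pd

# Toy data invented for illustration; not rows from the real dataset.
toy = pd.DataFrame({
    "Rating": [7.1, None, 6.4, 8.0],
    "Votes": [1200, 530, None, None],
    "Title": ["A", "B", "C", "D"],
})

# Number of missing values in each column.
null_counts = toy.isnull().sum()
print(null_counts)

# True when the frame has at least one missing value anywhere.
has_missing = bool(toy.isnull().values.any())
print("Any missing values:", has_missing)
```

On the real dataset the same two calls on df would be expected to report zero for every column, in line with the heat map.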
4.2 Summary Statistics
Summary statistics summarize the data at hand through numbers such as the mean and standard deviation, which makes the data easier to understand. They give statistical information about all numeric columns. Here we use the predefined pandas function describe() to show the summary of the dataset.
# summary of the dataset
df.describe()
output:
Count: the number of non-null values in each column.
Mean: the arithmetic mean of each numeric column.
Std: the standard deviation of each numeric column.
Min: the minimum value of each numeric column.
Max: the maximum value of each numeric column.
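To make these five numbers concrete, the sketch below computes them by hand for a single column using only the Python standard library. The list `ratings` is toy data invented for illustration, not values from the dataset.

```python
import statistics

# Toy ratings invented for illustration; not values from the real dataset.
ratings = [6.0, 7.0, 8.0, 9.0]

summary = {
    "count": len(ratings),
    "mean": statistics.mean(ratings),
    # Sample standard deviation (divides by n - 1), which is what
    # pandas' describe() reports by default.
    "std": statistics.stdev(ratings),
    "min": min(ratings),
    "max": max(ratings),
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

describe() also reports the 25%, 50%, and 75% quantiles, which are left out here to keep the sketch short.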
5. Feature Selection
This is the next step after pre-processing the dataset. In machine learning, feature selection is the process of reducing the number of input variables when developing a predictive model.
It is used to reduce the computational cost of modelling and, in some cases, to improve the performance of the model. We keep the variables that are useful and remove the features that do not help the prediction.
Feature selection often works by evaluating the relationship between each input variable and the target variable using statistics, and then selecting the input variables that have the strongest relationship with the target. This can make the model faster and more efficient, and may give more accurate results. Features can be selected either manually or by an algorithm. In this task we selected the features and the target manually.
See the code below:
# select the feature and target variables and leave out all unnecessary variables
x = df[['Year',
'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
'Metascore', 'Action', 'Adventure', 'Aniimation', 'Biography', 'Comedy',
'Crime', 'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music',
'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller', 'War',
'Western']]
y = df['Success']
In the code above, x holds the feature variables and y is the target variable, both chosen manually from the movie_success_rate dataset.
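The features in this report were selected manually. As an illustration of the statistical alternative mentioned above, the sketch below ranks features by the absolute Pearson correlation with the target. The function `pearson`, the toy columns, and the `success` labels are hypothetical and invented for this example. This is not the method used in the report.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

# Toy columns invented for illustration; not values from the real dataset.
features = {
    "Rating": [5.0, 6.0, 7.0, 8.0, 9.0],
    "Runtime": [120, 95, 130, 100, 110],
    "Constant": [1, 1, 1, 1, 1],
}
success = [0, 0, 0, 1, 1]

# Rank features by the absolute strength of their correlation with the target.
scores = {name: abs(pearson(col, success)) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Pearson correlation only captures linear relationships, so a low score does not prove that a feature is useless.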
6. Exploratory Data Analysis (EDA)
In this step we use visualizations to show and understand the features and their relationships with one another.
It is divided into two categories:
– Univariate Analysis: histogram, distribution (distplot, boxplot, violin)
– Multivariate Analysis: scatter plot, pair plot, etc
# visualize frequency distribution of `Rating` variable
f, ax = plt.subplots(figsize=(12, 14))
ax = sns.countplot(x="Rating", data=df, palette="Set1")
ax.set_title("Frequency distribution of Rating variable")
# rotate the existing tick labels; relabeling with value_counts() would put them in the wrong order
ax.tick_params(axis="x", labelrotation=60)
plt.show()
Output:
Figure 6.1 visualizes the frequency of the “Rating” variable: countplot counts how many movies have each rating and draws the counts as bars. As read from the figure, the minimum frequency is for the 7.0 rating and the maximum frequency is for the 5.2 rating.
# visualize the mean Votes for each Year
f, ax = plt.subplots(figsize=(12, 14))
ax = sns.barplot(x="Year",y = "Votes", data=df, palette="Set1")
ax.set_title("Bar Plot To Visualize Votes In Each Year")
# rotate the existing tick labels; relabeling with value_counts() would put them in the wrong order
ax.tick_params(axis="x", labelrotation=60)
plt.show()
output:
Figure 6.2 shows the votes for each year. Votes are also an important indicator of the success of a movie. People vote for each movie according to how successful they consider it. Many social sites count the votes of each person. Movies with a good rating tend to receive more votes, and movies with a low rating tend to receive fewer votes. As read from figure 6.2, the minimum votes are in 2006 and the maximum votes are in 2011.
# Box Plot
f, ax = plt.subplots(figsize=(12, 14))
ax = sns.boxplot(x="Year", y = "Votes", data=df, palette="Set1")
ax.set_title("Box Plot For Votes In Year")
ax.set_xticklabels(labels=["2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016"], rotation=60)
plt.show()
Output:
Figure 6.3 uses a box plot to show the distribution of votes per year. A box plot shows the median, the quartiles, and the outliers of each year's votes, which a bar plot does not.
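The quantities that a box plot draws can also be computed directly. The sketch below uses the Python standard library and the common 1.5 × IQR convention for outliers. The list `votes` is toy data invented for illustration, not values from the dataset.

```python
import statistics

# Toy vote counts for a single year, invented for illustration.
votes = [100, 200, 300, 400, 500, 2000]

# Quartiles with linear interpolation between data points.
q1, median, q3 = statistics.quantiles(votes, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in votes if v < lower_fence or v > upper_fence]

print("Q1, median, Q3:", q1, median, q3)
print("Outliers:", outliers)
```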
7. Process Pipeline for Model
Movie Rate Prediction Using K-means
Here we use the movie_success_rate dataset, which is a collection of movie data. The dataset is public and can easily be downloaded from Kaggle. For unsupervised machine learning we selected K-means clustering to predict movie success.
Below is the flowchart of the program that we use for the k-means clustering example.
Figure 3: screenshot
As the figure shows, we first read the movie success rate dataset, taken from the Kaggle website, which is public and accessible to anyone. Next we load the model from sklearn. The next step is to pre-process the data: remove NaN values, show the dataset summary, and so on. After this we split the dataset into train and test data, fit k-means clustering (an unsupervised learning algorithm), and finally obtain the accuracy of the model.
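The split step of this pipeline can be sketched with the standard library alone. The function name `train_test_split` mirrors the sklearn helper of the same name, but this is a simplified, hypothetical stand-in. The 80/20 ratio and the seed are assumptions, since the report does not state the values it used.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy row indices standing in for the rows of the dataset.
rows = list(range(10))
train_rows, test_rows = train_test_split(rows, test_fraction=0.2, seed=42)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, so a reported accuracy can be reproduced on the same test rows.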
8. Unsupervised Learning Algorithms
In unsupervised learning algorithms, the training data is unlabelled.
Unsupervised machine learning is typically tasked with finding relationships and correlations within data.
– Used mostly for pattern detection and descriptive modelling
Some of the most important unsupervised learning algorithms include:
– Clustering
– Visualization and dimensionality reduction
– Association rule learning
8.1 Clustering
Clustering is a common EDA technique used to get an intuition about the structure of the data. It can be used to identify subgroups in the data such that data points in the same cluster (subgroup) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that the data points in each cluster are as similar as possible according to a similarity measure.
Clustering can be done on the basis of features, where we try to find subgroups of samples based on their features, or on the basis of samples, where we try to find subgroups of features based on the samples. Here we cover clustering (k-means) based on features. It is used in market segmentation, document clustering, image segmentation, image compression, etc.
We implemented clustering for prediction on the dataset, as required. We applied K-means clustering to the given dataset and computed the accuracy, the validation mean squared error, and the precision and recall. After this we fine-tuned the model to improve its accuracy on the dataset.
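Cluster ids produced by k-means are arbitrary, so they have to be matched to the class labels before an accuracy can be computed. The report does not say how this matching was done. The sketch below shows one common approach, a majority vote per cluster, as an assumption. The function `map_clusters_to_labels` and the toy lists are invented for illustration.

```python
from collections import Counter

def map_clusters_to_labels(cluster_ids, true_labels):
    """Map each cluster id to the most common true label among its members."""
    members = {}
    for cid, label in zip(cluster_ids, true_labels):
        members.setdefault(cid, []).append(label)
    return {cid: Counter(labels).most_common(1)[0][0]
            for cid, labels in members.items()}

# Toy cluster assignments and labels invented for illustration.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 0]
true_labels = [1, 1, 0, 0, 0, 0, 1, 1]

mapping = map_clusters_to_labels(cluster_ids, true_labels)
predicted = [mapping[cid] for cid in cluster_ids]
accuracy = sum(p == t for p, t in zip(predicted, true_labels)) / len(true_labels)
print(mapping, accuracy)
```

In this toy run cluster 0 is mapped to class 1 and cluster 1 to class 0, which shows why the raw cluster ids cannot be compared with the labels directly.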
The various types of clustering are:
Hierarchical clustering
Partitioning clustering
Hierarchical clustering is further subdivided into:
Agglomerative clustering
Divisive clustering
Partitioning clustering is further subdivided into:
K-Means clustering
Fuzzy C-Means clustering
8.2 K-means Clustering
The k-means clustering algorithm is one of the simplest and most important unsupervised learning algorithms for solving the clustering problem. The procedure classifies a given data set into a certain number of clusters, k, fixed a priori. The main idea is to define k centers, one for each cluster. A good choice is to place them as far away from each other as possible. Next, each point in the data set is associated with the nearest center. When no point is pending, the first step is complete and an early grouping is done. At this point we re-calculate k new centroids as the centers of the clusters resulting from the previous step. The assignment and update steps are repeated until the centers no longer move, which gives the final result.
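The two alternating steps described above can be sketched in plain Python. This is a minimal illustration, not the sklearn implementation used in the report. The function `kmeans`, the toy points, and the hand-picked starting centers are assumptions made for reproducibility; real implementations usually choose the starting centers randomly.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Alternate the assignment step and the centroid update step."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(sum(dim) / len(members)
                                         for dim in zip(*members)))
            else:
                new_centers.append(center)  # keep an empty cluster's center
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, clusters

# Toy 2-D points invented for illustration: two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = kmeans(points, centers=[(1.0, 1.0), (9.0, 8.0)])
print(centers)
```

With these starting centers the loop converges after the second pass, because the assignment no longer changes.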
Applications of k-means clustering:
market segmentation
document clustering
image segmentation
image compression
9. Evaluating metrics
Accuracy: it compares the predicted class (positive or negative) with the real one. Accuracy is the most intuitive performance measure. It describes how close the predictions are to the correct values. It is defined as the ratio of correctly predicted observations to the total number of observations. It can be calculated using the following expression:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Here TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives.
Precision: this is the ratio between the True Positives and the sum of the True Positives and False Positives. It tells us how accurate we are when we say that an observation is positive. It is calculated as the number of correct positive predictions divided by the total number of positive predictions. It is defined as:
Precision = TP/ (TP + FP)
Recall: this is the ratio between the True Positives and the sum of the True Positives and False Negatives. Recall, or true positive rate, is calculated as the number of correct positive predictions divided by the total number of actual positives.
Recall = TP / (TP + FN)
F1-score: this is the harmonic mean of the precision and the recall. It summarizes both in a single number. It can be defined as:
F1 Score = 2 x (Precision x Recall ) /(Precision + Recall )
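The four formulas above can be checked with a short sketch. The function `classification_metrics` and the toy label lists are invented for illustration and are not the results of this report.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels invented for illustration: 3 TP, 3 TN, 1 FP, 1 FN.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```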
10. Experiments and results
Here we add screenshots of the Python script that shows our results with the scores:
Two classes are used here: 0 and 1.
11. Conclusions
In this report we propose a data analysis approach to predict the success of movies. We measured the performance of an unsupervised machine learning technique (k-means) at predicting the success of movies in the sense of popularity. We obtained a good success rate by selecting features from the given dataset, together with a low mean absolute error. The accuracy we found is 84.52, which is good for the given dataset. Since accuracy depends on the dataset, such predictions could be made more accurate in the future by using more features and a larger dataset.
12. Future research
Because the success rate depends on the dataset features, future algorithms may produce better and more accurate results. Such predictions can be made more accurate by using more features and a larger dataset.