By Wiley Battles, Grace Brasfield, Robert Kullberg
Abstract
Our project takes a data mining approach to predict which factors will result in a film being highly rated by viewers. The predictors that we planned to assess are from IMDb datasets. The predictors are genres, actors, and directors. We created bins for the terrible, poor, average, and excellent rating levels. For analysis, four different models were utilized: logistic regression, naive Bayes, random forest, and bagging. The results show that films have too many factors to accurately predict success based on the previous performance of other films. Each of the models used was very similar in accuracy. The coefficients for the predictors found in the models were largely unmeaningful. The four bins were our target variable, with a more specific focus on the excellent bin. The model performed proved inaccurate overall, as it demonstrated an insufficient ability to classify positive cases. This was due to an imbalance in the data. Most film ratings fell into the average bin, with very few landing in excellent.