Taste or Waste?
Authors: Christopher Shang, In Lorthongpanich
Introduction
As society moves toward physical fitness, understanding how different factors influence the popularity of dishes can uncover how to promote healthier ones. The question we try to answer is how nutrition and effort affect the popularity of recipes. We used the "Recipes and Ratings" dataset, which contains 234,429 rows after merging the `recipes` and `interactions` datasets on recipe ID. The relevant columns are: `name`, `id`, `minutes`, `n_steps`, `n_ingredients`, `rating`, `nutrition`, and `tags`. `name` is the name of the recipe; names are not all unique, which is why `id` is important for separating recipes during analysis. `minutes` is how long the recipe takes to make, while `n_steps` and `n_ingredients` give the required number of steps and ingredients. `rating` is the rating an individual reviewer left on a recipe, so each recipe can have multiple ratings. `nutrition` and `tags` are columns of lists: `nutrition` holds the calories, total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), and carbohydrates (PDV), while `tags` holds many shared identifiers such as the meal of the day and other groupings.
Data Cleaning and Exploratory Data Analysis
Data Cleaning
Before cleaning, we graphed the distributions of a few important columns, such as ratings, to look for outliers and interesting skews. We found some extreme outliers in calories and minutes, caused by recipes that took months to distill or that were scaled for very large groups. Because of this, we filtered the dataset to recipes that take under 10,000 minutes and contain under 2,000 calories, thresholds we deemed reasonable after re-examining the distributions. Additionally, we expanded `nutrition` and `tags`: we turned each element of the nutrition list into its own column and reduced `tags` to a categorical `meal` column (breakfast, lunch, dinner, dessert) so we could one-hot encode it later.
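A minimal sketch of those two expansions, assuming the list columns are stored as strings (hence the `ast.literal_eval` parsing) and approximating the meal tags by substring match:

```python
import ast
import pandas as pd

NUTRITION_COLS = ['calories', 'total fat (PDV)', 'sugar (PDV)', 'sodium (PDV)',
                  'protein (PDV)', 'saturated fat (PDV)', 'carbohydrates (PDV)']

def expand_columns(df: pd.DataFrame) -> pd.DataFrame:
    # `nutrition` holds a stringified list of seven values; parse it and
    # spread the values into their own columns.
    parsed = df['nutrition'].apply(ast.literal_eval)
    df[NUTRITION_COLS] = pd.DataFrame(
        parsed.tolist(), index=df.index, columns=NUTRITION_COLS)

    def extract_meal(tags: str):
        # Substring match on the raw tag string; approximate, since the
        # exact tag names vary (e.g. 'desserts' vs. 'dessert').
        for meal in ('breakfast', 'lunch', 'dinner', 'dessert'):
            if meal in tags:
                return meal
        return pd.NA

    df['meal'] = df['tags'].apply(extract_meal)
    return df
```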
We replaced all ratings of 0 with `np.nan`, because a rating of 0 likely represents missing or invalid data rather than an actual user evaluation; genuine ratings range from 1 to 5. We also dropped all rows with NaN ratings: they provide no information when predicting ratings, and the distribution of recipes with missing ratings matches the distribution of the overall dataset, so dropping them introduces no significant skew. Finally, we dropped the columns irrelevant to our prediction, including `id`, `nutrition`, `steps`, `description`, `ingredients`, and `review`, to keep the table cleaner.
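The remaining steps, sketched with the thresholds above (the `avg_rating` column in the table below is a per-recipe mean we compute before dropping `id`):

```python
import numpy as np

# Drop the extreme outliers found during EDA.
df = df[(df['minutes'] < 10_000) & (df['calories'] < 2_000)]

# A rating of 0 encodes a missing evaluation, not a real score.
df['rating'] = df['rating'].replace(0, np.nan)

# Per-recipe average rating, computed while `id` is still available.
df['avg_rating'] = df.groupby('id')['rating'].transform('mean')

# Drop unrated rows, then the columns we no longer need.
df = df.dropna(subset=['rating'])
df = df.drop(columns=['id', 'nutrition', 'steps', 'description',
                      'ingredients', 'review'])
```

The head of the cleaned table: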
name | meal | n_ingredients | n_steps | minutes | calories | total fat (PDV) | sugar (PDV) | sodium (PDV) | protein (PDV) | saturated fat (PDV) | carbohydrates (PDV) | rating | avg_rating |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 brownies in the world best ever | lunch | 9 | 10 | 40 | 138.4 | 10 | 50 | 3 | 3 | 19 | 6 | 4 | 4 |
millionaire pound cake | dinner | 7 | 7 | 120 | 878.3 | 63 | 326 | 13 | 20 | 123 | 39 | 5 | 5 |
5 tacos | dinner | 9 | 5 | 20 | 249.4 | 26 | 4 | 6 | 39 | 39 | 0 | 4 | 4 |
blepandekager danish apple pancakes | breakfast | 10 | 10 | 50 | 358.2 | 30 | 62 | 14 | 19 | 54 | 12 | 5 | 5 |
bbq spray recipe it really works | breakfast | 3 | 5 | 5 | 47.2 | 0 | 2 | 0 | 0 | 0 | 0 | 5 | 4.75 |
Univariate Analysis
We first investigated the overall distribution of the `rating` column that we are trying to predict and noticed an overwhelming majority of 5s. This will matter when we consider how our model is trained, since the fives may be oversampled, and assessing accuracy will be tricky with so little representation from the other ratings.
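The imbalance is easy to confirm on the cleaned frame (a quick sketch):

```python
# Share of each rating; the 5s dominate by a wide margin.
print(df['rating'].value_counts(normalize=True).sort_index())
```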
We also plotted the distribution of `meal`: the most common recipes are for dinner, followed by dessert, lunch, and breakfast. Helpfully, there is a good amount of data for every meal, and these mutually exclusive categories should improve our rating predictions by adding information about what time of day the food is eaten.
Bivariate Analysis
We plotted `minutes` against `rating` to look for interesting insights, as intuition suggests some people may prefer shorter, easier recipes over longer, harder ones. However, the plots show minimal difference between how long a recipe takes and the average rating it receives, so it will be interesting to see whether this feature contributes much to the final model. There is some slight variation, but since fives make up such a large share of the dataset, it makes sense that most averages are high.
A similar trend holds between `calories` and average rating (`avg_rating`), where the average ratings look nearly uniform across calorie values. This poses the same challenge as we try to predict ratings.
Finally, we plotted the number of ingredients (`n_ingredients`) against average rating (`avg_rating`) and found the same flat trend as in the other two relationships.
Interesting Aggregates
This grouped table shows the relationship between the number of ingredients (`n_ingredients`) and the average nutritional values of recipes with that many ingredients. Interestingly, the number of ingredients appears positively correlated with calories and with many other nutritional metrics such as fat, sugar, and carbs. Protein stands out, following a more quadratic relationship. This also suggests protein is the only nutrient not linearly tied to calories, which matches general food knowledge: fat, sugar, and carbs contribute directly to calories, while protein is more separate.
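The table can be reproduced with a simple groupby over the nutrition columns (using the `NUTRITION_COLS` list from the cleaning sketch); sorting by mean calories gives the two views below:

```python
# Mean nutritional values per ingredient count, sorted by mean calories.
by_ingredients = (
    df.groupby('n_ingredients')[NUTRITION_COLS]
      .mean()
      .sort_values('calories')
)
print(by_ingredients.head(7))  # lowest-calorie groups
print(by_ingredients.tail(7))  # highest-calorie groups
```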
The seven lowest-calorie groups (sorted by mean calories, ascending):
n_ingredients | calories | total fat (PDV) | sugar (PDV) | sodium (PDV) | protein (PDV) | saturated fat (PDV) | carbohydrates (PDV) |
---|---|---|---|---|---|---|---|
3 | 248.927 | 16.4457 | 65.4795 | 21.4057 | 13.9062 | 21.7418 | 9.22445 |
2 | 262.96 | 17.1529 | 67.6815 | 45.1688 | 14.8981 | 21.3057 | 10.1115 |
4 | 278.519 | 20.0597 | 61.0194 | 17.6704 | 18.2671 | 27.1001 | 9.34015 |
5 | 294.87 | 21.6144 | 59.1062 | 20.4802 | 20.3656 | 28.4102 | 9.88809 |
6 | 314.897 | 23.1842 | 54.5716 | 21.6814 | 23.1432 | 29.7676 | 10.2697 |
33 | 338.2 | 25 | 18 | 16 | 8 | 12 | 14 |
7 | 345.104 | 26.7115 | 49.8763 | 25.7251 | 27.6638 | 33.3561 | 10.4786 |
The seven highest-calorie groups:
n_ingredients | calories | total fat (PDV) | sugar (PDV) | sodium (PDV) | protein (PDV) | saturated fat (PDV) | carbohydrates (PDV) |
---|---|---|---|---|---|---|---|
26 | 629.591 | 51.0909 | 38.2727 | 50 | 59.8182 | 74.9091 | 17.6364 |
25 | 645.219 | 39.9375 | 54.375 | 70.6875 | 76.875 | 44.75 | 21.1875 |
30 | 666.383 | 58.5 | 39.5 | 42.5 | 60 | 66.8333 | 16.8333 |
27 | 719.288 | 51.375 | 183.875 | 48.5 | 50.5 | 69 | 27.375 |
29 | 827.4 | 55.5 | 121.333 | 44.6667 | 83 | 64 | 29.1667 |
28 | 842.614 | 48.8571 | 133.714 | 591.429 | 93 | 42.4286 | 34.1429 |
31 | 1184.37 | 114.667 | 346 | 181.667 | 38 | 83 | 37 |
Imputation Strategies
Only the `rating` column contained missing values, and we decided not to impute them because the distribution of average recipe ratings for rows with a missing rating matched the distribution of ratings overall. This tells us that simply dropping those rows will not distort our dataset, so we can safely do so and avoid added complexity.
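A sketch of that check, run before the NaN rows were dropped in the cleaning step:

```python
# Compare average recipe ratings for reviews with and without a rating;
# near-identical summaries justify dropping the missing rows.
missing = df['rating'].isna()
print(df.loc[missing, 'avg_rating'].describe())
print(df.loc[~missing, 'avg_rating'].describe())
```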
Framing a Prediction Problem
Our prediction task is a multiclass classification problem: we aim to predict the rating of a recipe (1-5) from its characteristics, such as nutritional content, preparation time, and meal type. We chose `rating` as the response variable because it reflects recipe popularity and aligns with our goal of understanding how factors like nutrition and effort influence user preferences. Ratings are categorical integers, which makes classification more appropriate than regression, as it ensures predictions align with the actual classes.
To evaluate the model, we use the F1-score, which balances precision and recall and so suits our imbalanced dataset, where `rating = 5` dominates. Accuracy is not ideal here because a model could score well simply by predicting the majority class while performing poorly on the minority classes. Our features include nutritional data (e.g., `calories`, `protein (PDV)`), `n_ingredients`, `minutes`, and `meal` type, all of which are available at the time of prediction.
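The weighted F1 averages the per-class F1 scores, weighting each by its class support. A toy example with hypothetical labels:

```python
from sklearn.metrics import f1_score

# Per-class F1 scores are averaged, weighted by each class's support.
y_true = [5, 5, 5, 5, 5, 4, 4, 3, 1]
y_pred = [5, 5, 5, 5, 4, 4, 5, 3, 5]
print(f1_score(y_true, y_pred, average='weighted'))
```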
Baseline Model
Our baseline model predicts the rating of a recipe (1-5) from five features: `calories`, `minutes`, `protein (PDV)`, `n_ingredients`, and `meal`. Four of these (`calories`, `minutes`, `protein (PDV)`, and `n_ingredients`) are quantitative, while `meal` is nominal. For preprocessing, we one-hot encoded the `meal` column into binary indicators and left the numerical features unchanged. All steps, including feature transformation and model training, were implemented in a single sklearn pipeline.
The baseline model uses a Logistic Regression classifier wrapped in `OneVsRestClassifier`, with `class_weight='balanced'` to address the imbalance in the dataset, where `rating = 5` dominates. On the test set, the model achieved an accuracy of 22% and a weighted F1-score of 0.29. While the model demonstrates some ability to predict the majority class (`rating = 5`), its performance on the minority classes is poor, as evidenced by the low macro-average F1-score of 0.11. This indicates that the current model struggles to generalize to unseen data and to handle the imbalanced dataset effectively.
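A minimal sketch of the baseline pipeline; constructor arguments beyond those described above (e.g., `max_iter`) are assumptions, and `X_train`/`y_train` are placeholders:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot encode `meal`; the four quantitative columns pass through as-is.
preprocess = ColumnTransformer(
    transformers=[('meal', OneHotEncoder(handle_unknown='ignore'), ['meal'])],
    remainder='passthrough',
)

baseline = Pipeline([
    ('preprocess', preprocess),
    ('clf', OneVsRestClassifier(
        LogisticRegression(class_weight='balanced', max_iter=1000))),
])

features = ['calories', 'minutes', 'protein (PDV)', 'n_ingredients', 'meal']
# baseline.fit(X_train[features], y_train)
```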
Classification Report for Baseline Model
class | precision | recall | f1-score | support |
---|---|---|---|---|
1.0 | 0.01 | 0.50 | 0.03 | 252 |
2.0 | 0.00 | 0.00 | 0.00 | 193 |
3.0 | 0.00 | 0.00 | 0.00 | 560 |
4.0 | 0.16 | 0.29 | 0.21 | 3193 |
5.0 | 0.77 | 0.21 | 0.33 | 14500 |
accuracy | | | 0.22 | 18698 |
macro avg | 0.19 | 0.20 | 0.11 | 18698 |
weighted avg | 0.63 | 0.22 | 0.29 | 18698 |
Final Model
Our final model improves on the baseline through feature engineering and a hyperparameter search. We engineered new features with `PolynomialFeatures` on `calories` and `protein (PDV)` to capture non-linear relationships that may exist between these features and user ratings. Additionally, we applied transformations such as `log1p` and `sqrt` to the `calories` feature using a `FunctionTransformer`, which reduces the skewness of its distribution. These new features and transformations, combined with a `StandardScaler` for the `minutes` column and a `OneHotEncoder` for the `meal` column, were integrated into a single pipeline.
The model in this pipeline was again a Logistic Regression classifier wrapped in a `OneVsRestClassifier` to handle multiclass classification. To optimize performance, we tuned hyperparameters with `GridSearchCV`, testing polynomial degrees (2-4), the two transformations (`log1p` and `sqrt`), and regularization strengths (`C`) for the Logistic Regression classifier. The best-performing model used a polynomial degree of 2, the square-root transformation for `calories`, and `C = 0.1`. These parameters achieved a balance between capturing the complex relationships between recipes and ratings and preventing overfitting.
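A sketch of the final pipeline and search space; grid values beyond those stated, along with arguments like `max_iter` and `cv`, are assumptions, and `X_train`/`y_train` are placeholders:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (FunctionTransformer, OneHotEncoder,
                                   PolynomialFeatures, StandardScaler)

preprocess = ColumnTransformer([
    # Polynomial terms for the two nutrition features.
    ('poly', PolynomialFeatures(degree=2, include_bias=False),
     ['calories', 'protein (PDV)']),
    # Skew-reducing transform on calories (log1p or sqrt, chosen by the grid).
    ('skew', FunctionTransformer(np.sqrt), ['calories']),
    ('scale', StandardScaler(), ['minutes']),
    ('meal', OneHotEncoder(handle_unknown='ignore'), ['meal']),
], remainder='passthrough')

model = Pipeline([
    ('preprocess', preprocess),
    ('clf', OneVsRestClassifier(
        LogisticRegression(class_weight='balanced', max_iter=1000))),
])

param_grid = {
    'preprocess__poly__degree': [2, 3, 4],
    'preprocess__skew__func': [np.log1p, np.sqrt],
    'clf__estimator__C': [0.01, 0.1, 1, 10],
}
search = GridSearchCV(model, param_grid, scoring='f1_weighted', cv=5)
# search.fit(X_train, y_train)  # best found: degree=2, sqrt, C=0.1
```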
The final model showed modest improvements over the baseline. The baseline achieved an accuracy of 22% and a weighted F1-score of 0.29; the final model achieved an accuracy of 29% and a weighted F1-score of 0.36. The macro-average F1-score also improved slightly (from 0.11 to 0.15), indicating better performance across all classes, including the minority ones. We attribute this improvement to the additional feature engineering, which allowed the model to better represent the data, and to the hyperparameter tuning, which optimized the model for the task.
Classification Report for Final Model
class | precision | recall | f1-score | support |
---|---|---|---|---|
1.0 | 0.02 | 0.26 | 0.03 | 252 |
2.0 | 0.01 | 0.21 | 0.03 | 193 |
3.0 | 0.04 | 0.04 | 0.04 | 560 |
4.0 | 0.19 | 0.37 | 0.26 | 3193 |
5.0 | 0.79 | 0.28 | 0.41 | 14500 |
accuracy | | | 0.29 | 18698 |
macro avg | 0.21 | 0.23 | 0.15 | 18698 |
weighted avg | 0.65 | 0.29 | 0.36 | 18698 |
Conclusion
Our exploration of the Recipes and Ratings dataset shows a very low correlation between the nutritional value of a food and the ratings it receives on a recipe website. When predicting ratings from a variety of factors, including calories, number of ingredients, and protein, our final model achieved a weighted F1-score of only 0.36. Looking back at the distribution of ratings and its relationship to these features, this is unsurprising: the original data is heavily skewed toward fives, and the features showed limited correlation with ratings. While we did not succeed in building a highly accurate model for predicting ratings from nutrition, the data still hints that nutritious and less nutritious recipes enjoy similar popularity, a good sign for the future of the health and fitness industry. As a next step, it would be interesting to train on a more evenly distributed sample of ratings to see whether that outperforms our approach of reweighting classes. It would also be interesting to find a metric for taste and see whether that factor affects ratings more than the nutritional content of the food.