
Taste or Waste?

Authors: Christopher Shang, In Lorthongpanich

Introduction

As society moves toward physical fitness, understanding how different factors influence the popularity of dishes can reveal how to promote healthier ones. The question we try to answer is how nutrition and effort affect the popularity of recipes. We used the “Recipes and Ratings” dataset, which contains 234,429 rows after merging the recipes and interactions datasets on recipe_id. The relevant columns are: name, id, minutes, n_steps, n_ingredients, rating, nutrition, and tags. name is the name of the recipe; names are not all unique, which is why id is important for separating recipes during analysis. minutes is how long the recipe takes to make, and n_steps and n_ingredients are the required numbers of steps and ingredients. rating is the rating that each individual reviewer left on a recipe, so each recipe can have multiple ratings. nutrition and tags are columns of lists: nutrition contains calories, total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), and carbohydrates (PDV), while tags contains many shared identifiers, such as the meal of the day and other groupings.
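As a point of reference, the merge could look something like the sketch below; the file names and the how='left' choice are our assumptions, since the write-up only specifies the key column.

```python
import pandas as pd

# Merge every interaction (rating) onto its recipe.
recipes = pd.read_csv('RAW_recipes.csv')
interactions = pd.read_csv('RAW_interactions.csv')
df = recipes.merge(interactions, left_on='id', right_on='recipe_id', how='left')
```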

Data Cleaning and Exploratory Data Analysis

Data Cleaning

Before cleaning, we graphed the distributions of a few important columns, such as rating, to look for outliers and interesting skews. We found some extreme outliers in calories and minutes, caused by recipes that take months to distill or that serve a very large number of people. Because of this, we filtered the dataset to only include recipes that take under 10,000 minutes and contain under 2,000 calories, which we deemed reasonable after re-examining the distributions. Additionally, we expanded nutrition and tags: we turned each element of the nutrition list into its own column, and we reduced tags to a categorical column containing only the type of meal a recipe is (breakfast, lunch, dinner, dessert) so that we could one-hot encode it later. A minimal sketch of these steps follows this paragraph.
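The sketch below shows roughly how this expansion and filtering could be done with pandas. The column names follow the dataset description above, but the assumption that nutrition and tags are stored as stringified lists (hence ast.literal_eval) and the helper extract_meal are ours, not taken from the original code.

```python
import ast
import numpy as np
import pandas as pd

nutrition_cols = ['calories', 'total fat (PDV)', 'sugar (PDV)', 'sodium (PDV)',
                  'protein (PDV)', 'saturated fat (PDV)', 'carbohydrates (PDV)']

# Expand the nutrition list into one column per nutrient.
df[nutrition_cols] = pd.DataFrame(
    df['nutrition'].apply(ast.literal_eval).tolist(), index=df.index)

# Reduce tags to a single categorical meal column.
MEALS = ['breakfast', 'lunch', 'dinner', 'dessert']

def extract_meal(tags):
    tags = ast.literal_eval(tags)
    return next((meal for meal in MEALS if meal in tags), np.nan)

df['meal'] = df['tags'].apply(extract_meal)

# Drop the extreme outliers identified from the distributions.
df = df[(df['minutes'] < 10_000) & (df['calories'] < 2_000)]
```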

We replaced all ratings of 0 with np.nan, because a rating of 0 likely represents missing or invalid data rather than an actual user evaluation; ratings otherwise range from 1 to 5. We also dropped all rows with NaN ratings, since they provide no additional information when we are trying to predict ratings, and the distribution of rows with missing ratings matches the distribution of the overall dataset, so dropping them introduces no significant skew. Finally, we dropped the remaining columns that are irrelevant to our prediction, including id, nutrition, steps, description, ingredients, and review, to make the table cleaner to look at. A short sketch of this step appears below, followed by a few rows of the cleaned table.
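Continuing the sketch above, this step could look like the following; the keep list and the avg_rating computation (the per-recipe mean rating) are illustrative assumptions, not an exact copy of the original code.

```python
# Treat ratings of 0 as missing, then compute each recipe's average rating.
df['rating'] = df['rating'].replace(0, np.nan)
df['avg_rating'] = df.groupby('id')['rating'].transform('mean')

# Drop rows without a rating, then keep only the columns used in the analysis.
df = df.dropna(subset=['rating'])
keep = ['name', 'meal', 'n_ingredients', 'n_steps', 'minutes',
        *nutrition_cols, 'rating', 'avg_rating']
df = df[keep]
```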

| name | meal | n_ingredients | n_steps | minutes | calories | total fat (PDV) | sugar (PDV) | sodium (PDV) | protein (PDV) | saturated fat (PDV) | carbohydrates (PDV) | rating | avg_rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 brownies in the world best ever | lunch | 9 | 10 | 40 | 138.4 | 10 | 50 | 3 | 3 | 19 | 6 | 4 | 4 |
| millionaire pound cake | dinner | 7 | 7 | 120 | 878.3 | 63 | 326 | 13 | 20 | 123 | 39 | 5 | 5 |
| 5 tacos | dinner | 9 | 5 | 20 | 249.4 | 26 | 4 | 6 | 39 | 39 | 0 | 4 | 4 |
| blepandekager danish apple pancakes | breakfast | 10 | 10 | 50 | 358.2 | 30 | 62 | 14 | 19 | 54 | 12 | 5 | 5 |
| bbq spray recipe it really works | breakfast | 3 | 5 | 5 | 47.2 | 0 | 2 | 0 | 0 | 0 | 0 | 5 | 4.75 |

Univariate Analysis

We first investigated the overall distribution of the rating column that we are trying to predict and noticed that an overwhelming majority of ratings are 5s. This will be important when we consider how our model is trained, since we may effectively be oversampling fives, and assessing accuracy will be difficult with so little representation from the other ratings.

We also plotted the distribution of meal, and the trend shows that the most common type of recipe is dinner, followed by dessert, lunch, and breakfast, respectively. Helpfully, there is a substantial amount of data for every meal, and these mutually exclusive categories should help us predict ratings, since they add a new factor: the time of day at which the food is eaten.

Bivariate Analysis

We plotted minutes against rating to look for interesting insights, as intuition suggests that some people may prefer shorter, easier recipes to longer, harder ones. However, the graphs show minimal difference between how long recipes take and the average ratings they receive, so it will be interesting to see whether this feature plays a meaningful role in the final model. There is some slight variation, but since fives make up such a huge share of the dataset, it makes sense that most averages are high.

A similar trend exists in the relationship between calories and average rating (avg_rating), where average ratings look nearly uniform across calorie levels. This presents a similar problem to investigate as we try to predict ratings.

Finally, we plotted the number of ingredients (n_ingredients) against the average rating (avg_rating), which shows a similar trend to the other two relationships.
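As an illustration, one of these bivariate views could be produced along the following lines (a sketch using pandas and matplotlib; the original plots may have been made with a different plotting library):

```python
import matplotlib.pyplot as plt

# Mean average rating for each ingredient count.
by_ingredients = df.groupby('n_ingredients')['avg_rating'].mean()

by_ingredients.plot(marker='o')
plt.xlabel('n_ingredients')
plt.ylabel('mean avg_rating')
plt.title('Average rating by number of ingredients')
plt.show()
```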

Interesting Aggregates

This grouped table shows the relationship between the number of ingredients (n_ingredients) and the average nutritional values of the recipes with that many ingredients (see the sketch after the tables below). Interestingly, the number of ingredients appears positively correlated with calories and many of the other nutritional metrics such as fat, sugar, and carbs. One that stands out is protein, which follows a more quadratic relationship. This also suggests that protein is likely the only nutrient that is roughly independent of calories, which makes sense from general food knowledge, since fat, sugar, and carbs contribute directly to calories while protein is more separate.

Top 7 rows sorted by calories:

| n_ingredients | calories | total fat (PDV) | sugar (PDV) | sodium (PDV) | protein (PDV) | saturated fat (PDV) | carbohydrates (PDV) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 248.927 | 16.4457 | 65.4795 | 21.4057 | 13.9062 | 21.7418 | 9.22445 |
| 2 | 262.96 | 17.1529 | 67.6815 | 45.1688 | 14.8981 | 21.3057 | 10.1115 |
| 4 | 278.519 | 20.0597 | 61.0194 | 17.6704 | 18.2671 | 27.1001 | 9.34015 |
| 5 | 294.87 | 21.6144 | 59.1062 | 20.4802 | 20.3656 | 28.4102 | 9.88809 |
| 6 | 314.897 | 23.1842 | 54.5716 | 21.6814 | 23.1432 | 29.7676 | 10.2697 |
| 33 | 338.2 | 25 | 18 | 16 | 8 | 12 | 14 |
| 7 | 345.104 | 26.7115 | 49.8763 | 25.7251 | 27.6638 | 33.3561 | 10.4786 |

Bottom 7 rows sorted by calories:

| n_ingredients | calories | total fat (PDV) | sugar (PDV) | sodium (PDV) | protein (PDV) | saturated fat (PDV) | carbohydrates (PDV) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 26 | 629.591 | 51.0909 | 38.2727 | 50 | 59.8182 | 74.9091 | 17.6364 |
| 25 | 645.219 | 39.9375 | 54.375 | 70.6875 | 76.875 | 44.75 | 21.1875 |
| 30 | 666.383 | 58.5 | 39.5 | 42.5 | 60 | 66.8333 | 16.8333 |
| 27 | 719.288 | 51.375 | 183.875 | 48.5 | 50.5 | 69 | 27.375 |
| 29 | 827.4 | 55.5 | 121.333 | 44.6667 | 83 | 64 | 29.1667 |
| 28 | 842.614 | 48.8571 | 133.714 | 591.429 | 93 | 42.4286 | 34.1429 |
| 31 | 1184.37 | 114.667 | 346 | 181.667 | 38 | 83 | 37 |
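The aggregate shown above could be computed with a groupby along these lines (a sketch; nutrition_cols is the list of nutrition columns defined in the cleaning step):

```python
# Average nutrition per ingredient count, sorted so low-calorie groups come first.
nutrition_by_ingredients = (
    df.groupby('n_ingredients')[nutrition_cols]
      .mean()
      .sort_values('calories')
)
print(nutrition_by_ingredients.head(7))   # lowest-calorie groups
print(nutrition_by_ingredients.tail(7))   # highest-calorie groups
```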

Imputation Strategies

Only the rating column contained missing values, and we decided not to impute them because the distribution of average recipe ratings for rows with a missing rating matched the distribution of ratings in the overall dataset. This tells us that simply dropping those rows does not change the distribution of our dataset, so we can safely do so and avoid any added complexity.
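This check could be reproduced with a comparison along the following lines (a sketch; df_raw stands in for the merged data before rows with missing ratings were dropped):

```python
# Compare rows whose rating is missing against the full dataset.
missing = df_raw.loc[df_raw['rating'].isna(), 'avg_rating']
overall = df_raw['avg_rating']

print(missing.describe())
print(overall.describe())
```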

Framing a Prediction Problem

Our prediction task is a multiclass classification problem, where we aim to predict the rating of a recipe (1-5) based on its characteristics, such as nutritional content, preparation time, and meal type. We chose rating as the response variable because it reflects recipe popularity and aligns with our goal of understanding how factors like nutrition and effort influence user preferences. The ratings are categorical integers, making classification more appropriate than regression, as it ensures predictions align with the actual classes.

To evaluate the model, we use the F1-score, which balances precision and recall, making it suitable for our imbalanced dataset where rating = 5 dominates. Accuracy is not ideal in our case because a model could score well simply by predicting the majority class, without reflecting performance on the minority classes. Our features include nutritional data (e.g., calories, protein (PDV)), n_ingredients, minutes, and meal type, as they are all available at the time of prediction.
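The metrics reported below can be computed with scikit-learn along these lines (a sketch; y_test and y_pred stand in for the held-out labels and the model's predictions):

```python
from sklearn.metrics import classification_report, f1_score

# Weighted F1 accounts for class sizes; macro F1 treats all five rating classes equally.
print(classification_report(y_test, y_pred))
print('weighted F1:', f1_score(y_test, y_pred, average='weighted'))
print('macro F1:   ', f1_score(y_test, y_pred, average='macro'))
```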

Baseline Model

Our baseline model predicts the rating of a recipe (1-5) using five features: calories, minutes, protein (PDV), n_ingredients, and meal. Among these features, four (calories, minutes, protein (PDV), and n_ingredients) are quantitative, while meal is nominal. For preprocessing, we applied one-hot encoding to the meal column to transform it into binary indicators, while leaving the numerical features unchanged. All steps, including feature transformation and model training, were implemented in a single sklearn pipeline.

The baseline model uses a Logistic Regression classifier wrapped in OneVsRestClassifier, with class_weight='balanced' to address the imbalance in the dataset, where rating = 5 dominates. On the test set, the model achieved an accuracy of 22%, with a weighted F1-score of 0.29. While the model demonstrates some ability to predict the majority class (rating = 5), its performance on minority classes is poor, as evidenced by the low macro-average F1-score of 0.11. This indicates that the current model struggles to generalize to unseen data and handle the imbalanced dataset effectively.
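A minimal sketch of a pipeline like the one described, assuming the cleaned column names from above (max_iter and handle_unknown are our additions to keep the sketch runnable, not part of the original description):

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric = ['calories', 'minutes', 'protein (PDV)', 'n_ingredients']

preprocess = ColumnTransformer([
    ('meal', OneHotEncoder(handle_unknown='ignore'), ['meal']),
    ('num', 'passthrough', numeric),
])

baseline = Pipeline([
    ('preprocess', preprocess),
    ('clf', OneVsRestClassifier(
        LogisticRegression(class_weight='balanced', max_iter=1000))),
])
# baseline.fit(X_train, y_train); evaluate with classification_report as above.
```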

Classification Report for Baseline Model

          precision    recall  f1-score   support

     1.0       0.01      0.50      0.03       252
     2.0       0.00      0.00      0.00       193
     3.0       0.00      0.00      0.00       560
     4.0       0.16      0.29      0.21      3193
     5.0       0.77      0.21      0.33     14500

accuracy                           0.22     18698
macro avg      0.19      0.20      0.11     18698
weighted avg   0.63      0.22      0.29     18698

Final Model

Our final model improves upon the baseline by incorporating feature engineering and a hyperparameter search. We engineered two new features using PolynomialFeatures for calories and protein (PDV) to capture non-linear relationships that may exist between these features and user ratings. Additionally, we applied transformations such as log1p and sqrt to the calories feature using a FunctionTransformer, which helps reduce the skewness in its distribution, making it more interpretable. These new features and transformations, combined with StandardScaler for the minutes column and OneHotEncoder for the meal column, were integrated into a single pipeline.

The model used for this pipeline was a Logistic Regression classifier wrapped in a OneVsRestClassifier to handle multiclass classification. To optimize the model’s performance, we performed hyperparameter tuning using GridSearchCV, testing polynomial degrees (2–4), different transformations (log1p and sqrt), and regularization strengths (C) for the Logistic Regression classifier. The best-performing model used a polynomial degree of 2, a square root transformation (sqrt) for calories, and a C value of 0.1 for regularization. These parameters balanced capturing the more complex relationships between recipe features and ratings with preventing overfitting.
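A condensed sketch of what such a pipeline and grid search could look like, assuming the cleaned column names from above. For brevity, the transformation here is applied to both calories and protein (PDV), whereas the write-up applies it to calories only; the exact step names and grid values are illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (FunctionTransformer, OneHotEncoder,
                                   PolynomialFeatures, StandardScaler)

# Transform, then expand calories and protein (PDV) with polynomial terms.
cal_prot = Pipeline([
    ('transform', FunctionTransformer(np.sqrt)),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
])

preprocess = ColumnTransformer([
    ('cal_prot', cal_prot, ['calories', 'protein (PDV)']),
    ('scale', StandardScaler(), ['minutes']),
    ('meal', OneHotEncoder(handle_unknown='ignore'), ['meal']),
    ('rest', 'passthrough', ['n_ingredients']),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('clf', OneVsRestClassifier(
        LogisticRegression(class_weight='balanced', max_iter=1000))),
])

param_grid = {
    'preprocess__cal_prot__transform__func': [np.log1p, np.sqrt],
    'preprocess__cal_prot__poly__degree': [2, 3, 4],
    'clf__estimator__C': [0.01, 0.1, 1],
}

search = GridSearchCV(model, param_grid, scoring='f1_weighted', cv=5)
# search.fit(X_train, y_train); best_params_ corresponded to degree=2, sqrt, C=0.1.
```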

The final model showed modest improvements over the baseline. The baseline model achieved an accuracy of 22% and a weighted F1-score of 0.29, while the final model achieved an accuracy of 29% and a weighted F1-score of 0.36. The macro-average F1-score also improved slightly, indicating better performance across all classes, including the minority ones. We attribute this improvement to the additional feature engineering, which allowed the model to better represent the data, and to the hyperparameter tuning, which optimized the model for the given task.

Classification Report for Final Model

          precision    recall  f1-score   support

     1.0       0.02      0.26      0.03       252
     2.0       0.01      0.21      0.03       193
     3.0       0.04      0.04      0.04       560
     4.0       0.19      0.37      0.26      3193
     5.0       0.79      0.28      0.41     14500

accuracy                           0.29     18698
macro avg      0.21      0.23      0.15     18698
weighted avg   0.65      0.29      0.36     18698

Conclusion

From our exploration of the Recipes and Ratings dataset, it can be seen that there is very low correlation between the nutritional value of a food and the ratings it receives on a recipe website. When predicting ratings from a variety of factors, including calories, number of ingredients, and protein, our final model only achieved a weighted F1-score of 0.36. Looking back at the distribution of ratings and its relationship with these features, it is not surprising that ratings are hard to predict accurately, since the original data is heavily skewed and the correlations were limited. While we were unsuccessful in creating a highly accurate model for predicting ratings from nutrition, the data still hints that nutritious and less nutritious recipes enjoy similar popularity, which is a good sign for the future of the health and fitness industry. For the next steps of this research, it would be interesting to build predictions from a more evenly distributed set of ratings to see whether that makes a difference compared with how we handled the uneven distribution (using class weights). Additionally, finding a metric for taste would be interesting, to see how that factor might affect ratings more than the nutritional content of the food.