Wine Recommender
This project was carried out as part of the TechLabs “Digital Shaper Program” in Aachen (Winter Term 2020).
Background
More than 20 billion bottles of wine are produced annually [1]. Each of these bottles originates from one of the world’s 10,000 different grapevine varieties [2]. We love wine, and who can blame us? A flavourful red wine over dinner with friends or a refreshing white wine in a restaurant at a sunny holiday location: some moments in life are just better with wine. But how do you decide which wine to pick from the countless options out there? This is where wine professionals, more commonly referred to as ‘sommeliers’, come into play [3]. Sommeliers are experts in wine tasting. They can describe the taste of a wine in vast detail, tell you the origin of the grapevine, and suggest a matching wine for any type of meal you can imagine. Unfortunately, however, there are just 269 master sommeliers in the world [4]. Most of us will therefore never get to experience the phenomenal advice of a sommelier first-hand. What if we found a modern way to make the master sommelier’s wine wisdom available to everyone? That is exactly what our Wine Classifier sets out to do.
Dataset
We used a dataset from Kaggle.com consisting of nearly 130,000 detailed wine descriptions [5]. These descriptions were complemented by information about the wine’s origin (i.e., country, designation, province, and region). The dataset also included the price and the number of points that the taster awarded the specific wine. The dataset was constructed by scraping the Wine Enthusiast website [6]. Wine Enthusiast is a magazine and website specializing in wines, spirits, food, and travel. Founded in 1988 by Adam Strum, the magazine is run from its headquarters in New York and has a distribution of more than 500,000 readers [7].
Methodology
Missing values
The dataset had missing values in most of its variables. We first filled the missing values of the categorical variables with the word ‘unknown’. For the missing price values, we initially imputed the mean. However, since almost 9,000 rows were missing a price, the impact of this imputation was too large. We therefore decided to drop all rows without a price instead.
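A minimal pandas sketch of this cleaning step; the file and column names follow the Kaggle dataset, but the exact code is our reconstruction:

```python
import pandas as pd

# Load the Kaggle wine-reviews dataset (file name as distributed on Kaggle).
df = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)

# Fill missing values in the text/categorical columns with 'unknown'.
categorical_cols = df.select_dtypes(include="object").columns
df[categorical_cols] = df[categorical_cols].fillna("unknown")

# Roughly 9,000 rows lack a price; drop them rather than imputing the mean.
df = df.dropna(subset=["price"])
```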
Exploratory data analysis
With our newly cleaned dataset, we examined the distributions of, and potential relationships within and between, the variables. We computed descriptive statistics such as the mean, standard deviation, and correlations, and supplemented these with plots such as boxplots, histograms, and scatter plots. Visual inspection of the scatter plots initially suggested no relationship between a wine’s price and its points. However, after we grouped the wines into 5 classes according to their points (a technique called ‘binning’), a positive correlation between price and points became evident.
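The binning step could look as follows in pandas; equal-width bins are an illustrative assumption, since the write-up does not specify the bin edges:

```python
# Group wines into 5 point classes ('binning'); equal-width bins assumed.
df["points_class"] = pd.cut(df["points"], bins=5, labels=False)

# Compare the raw correlation with the per-class price summary: the
# positive price-points relationship is much clearer after binning.
print(df[["price", "points"]].corr())
print(df.groupby("points_class")["price"].mean())
```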
Outliers
The boxplot of price showed that most wines cost less than $1,100. Only 14 rows had a price above $1,100, with a maximum of $3,300. To prevent these extreme values from distorting our prediction results, we treated them as outliers and dropped these 14 rows.
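In code, this amounts to a simple filter; the $1,100 cut-off is taken from the boxplot inspection described above:

```python
# Drop the 14 wines priced above $1,100 (maximum in the data: $3,300).
df = df[df["price"] <= 1100]
```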
Recoding categorical features
Because the dataset had several categorical features, we had to decide how to recode their values for use in predictive models. We tried two distinct approaches: label encoding and one-hot encoding. The problem, however, was that the categorical variables had a large number of unique values; one of them had almost 38,000. One-hot encoding that variable would produce 38,000 dummy variables. We solved this by calculating the frequency of each value within every categorical variable and keeping only the 10 most frequent values per variable. The resulting variables, each with 10 unique values, were then recoded for use in our models. While we initially tried both label encoding and one-hot encoding, we ultimately stuck with one-hot encoding, since it avoids imposing an artificial order on the values.
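A sketch of this frequency-based reduction followed by one-hot encoding. The write-up does not say what happened to rows outside the top 10, so bucketing them into an ‘other’ category is our assumption:

```python
def keep_top_10(series: pd.Series) -> pd.Series:
    """Replace all but the 10 most frequent values with 'other' (assumption)."""
    top = series.value_counts().nlargest(10).index
    return series.where(series.isin(top), "other")

# Column names follow the Kaggle dataset.
features = ["country", "province", "region_1", "taster_name", "variety", "winery"]
reduced = df[features].apply(keep_top_10)

# One-hot encode: each feature now yields at most 11 dummy columns
# instead of up to 38,000.
X = pd.get_dummies(reduced)
```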
NLP
The dataset contained 129,971 distinct wine descriptions. An example reads: “Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.”

We wanted to use the wine descriptions in our predictive models. However, they consist of the unstructured data that constitutes natural language, which machine learning models, with their numerical inputs and outputs, cannot yet comprehend directly. We needed a way to convert these natural language descriptions into numerical values digestible by our models. Natural Language Processing (NLP) is the field of Artificial Intelligence that gives machines the ability to read, understand, and derive meaning from human language [8]; it is the overarching term for making natural language digestible for machine learning models.

Our NLP approach converted each distinct wine description into numerical values in several steps. First, we removed any leading or trailing whitespace, converted all descriptions to lowercase, and removed punctuation marks (using Python’s string module). The descriptions were then split into tokens, with each word in a description characterized as a token. We used the NLTK module to remove stop words (e.g., ‘the’, ‘he’, ‘have’) from each description.

Next, the tokens (i.e., words) in each description were normalized. Simply put, token normalization gives similar tokens the same value (e.g., drinking == drink == drank). We had the choice between two normalization techniques: stemming and lemmatization. Stemming simply removes or stems the last few characters of a word, often producing incorrect meanings and spellings, whereas lemmatization considers the context and converts the word to its meaningful base form, called the lemma. We chose lemmatization because it returns actual, meaningful words as tokens.

Finally, the sklearn module was used to convert the descriptions into term frequency-inverse document frequency (TF-IDF) vectors. TF-IDF is a statistical measure of how relevant a word is to a document within a collection of documents. It is computed by multiplying two metrics: how often a word appears in a document (term frequency) and the inverse of the share of documents in which the word appears (inverse document frequency). The descriptions were now ready to be used in our predictive models.
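The whole pipeline could be sketched as follows with the modules named above (string, NLTK, sklearn); the exact calls and settings are our reconstruction, not the project’s verbatim code:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(description: str) -> str:
    # Strip whitespace, lowercase, and remove punctuation.
    text = description.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize on whitespace, drop stop words, and lemmatize each token.
    tokens = [
        lemmatizer.lemmatize(token)
        for token in text.split()
        if token not in stop_words
    ]
    return " ".join(tokens)

# Convert all descriptions into a TF-IDF matrix (one row per wine).
corpus = df["description"].apply(preprocess)
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
```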
Models and results
Classification
We trained a Naive Bayes classifier on the TF-IDF vectors of the wine descriptions to classify the 10 most frequent wine countries, wine varieties, and wine provinces. Classifying the top 10 provinces worked best, yielding the highest accuracy, recall, and precision scores.
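The write-up does not name the Naive Bayes variant; MultinomialNB is a common choice for TF-IDF features, so the following sketch assumes it and reuses the tfidf matrix from the NLP step:

```python
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Keep only wines from the 10 most frequent provinces.
top_provinces = df["province"].value_counts().nlargest(10).index
mask = df["province"].isin(top_provinces).to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    tfidf[mask], df["province"][mask], test_size=0.3, random_state=42
)

clf = MultinomialNB().fit(X_train, y_train)
# Accuracy, precision, and recall per province.
print(classification_report(y_test, clf.predict(X_test)))
```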
Regression
We used K-Nearest Neighbors (KNN) and Random Forest regressors to predict wine prices from countries, provinces, regions, tasters, varieties, and wineries. First, we split the data into training data (70%) and test data (30%) to guard against overfitting. Next, we tuned hyperparameters with 5-fold cross-validation: the number of neighbors for the KNN regressor and the number of trees in the forest for the Random Forest regressor, determining how many neighbors and decision trees to pick for an optimal prediction result. Cross-validation is a resampling procedure that yields a less biased performance estimate: 5-fold cross-validation splits the training data into 5 groups, holds one group out as a validation set while fitting on the remaining four, and repeats this 5 times so that each group serves as the validation set exactly once. Finally, we trained the KNN and Random Forest models with the selected hyperparameters on the training data and assessed their performance by the R-squared score on the test data. Both models performed similarly, with R-squared scores of around 0.6. The hyperparameter had a critical impact on the KNN regressor, but affected the Random Forest regressor only slightly.
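A condensed sketch of this tuning-and-evaluation loop, reusing the one-hot features X from the encoding step; the candidate hyperparameter grids are illustrative, not our exact values:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# 70/30 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, df["price"], test_size=0.3, random_state=42
)

# 5-fold cross-validated grid search over each model's hyperparameter.
models = {
    "KNN": GridSearchCV(
        KNeighborsRegressor(), {"n_neighbors": list(range(1, 31))}, cv=5
    ),
    "Random Forest": GridSearchCV(
        RandomForestRegressor(random_state=42), {"n_estimators": [50, 100, 200]}, cv=5
    ),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    # For regressors, .score() returns R-squared on the test data.
    print(name, model.best_params_, model.score(X_test, y_test))
```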
Conclusion and Outlook
Our wine classifier successfully classified a wine’s province based on a taster’s description, and managed to predict a wine’s price from several other variables. Of course, there is always room for improvement. For example, the current TF-IDF vectors had a very high number of dimensions, which made them impractical to use in our regression models. Future projects could reduce these dimensions by applying Principal Component Analysis (PCA) or a similar dimensionality reduction method. We also had some trouble with group communication at the start, and did not use a version control platform (e.g., Git) during our work.
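As a pointer for such follow-up work: sklearn’s PCA implementation does not accept the sparse matrices a TF-IDF vectorizer produces, so TruncatedSVD (latent semantic analysis) is the usual drop-in for this case. The number of components below is an arbitrary illustrative choice:

```python
from sklearn.decomposition import TruncatedSVD

# Reduce the sparse TF-IDF matrix to 100 dense dimensions.
svd = TruncatedSVD(n_components=100, random_state=42)
tfidf_reduced = svd.fit_transform(tfidf)
print(tfidf_reduced.shape)  # (n_wines, 100)
```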
Group
This project was carried out by TechLabs Digital Shaper group 9 in the winter semester 2020. The project group consisted of Cheng-En Tsai, Xin Yang, and Koen van Kan. We would like to thank the Aachen TechLabs management team for providing us with an interesting semester and keeping us motivated. Special thanks to our mentors Hannah Hechenrieder and Simon Lyra for all the guidance and support throughout these past months.
References
[1] https://www.oiv.int/public/medias/6782/oiv-2019-statistical-report-on-world-vitiviniculture.pdf
[2] https://www.oiv.int/public/medias/5888/en-distribution-of-the-worlds-grapevine-varieties.pdf
[3] https://www.winecountry.com/blog/what-is-a-sommelier/
[4] https://www.mastersommeliers.org/about
[5] https://www.kaggle.com/zynicide/wine-reviews
[6] https://www.winemag.com/?s=&drink_type=wine
[7] https://www.wine-searcher.com/critics-17-wine+enthusiast
[8] https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1
TechLabs Aachen e.V. reserves the right not to be responsible for the topicality, correctness, completeness or quality of the information provided. All references are made to the best of the authors’ knowledge and belief. If, contrary to expectation, a violation of copyright law should occur, please contact journey.ac@techlabs.org so that the corresponding item can be removed.