Starbucks Capstone Challenge

Dong Zhang
Nov 30, 2020

This is the capstone project of the Udacity Data Science Nanodegree program.

The dataset contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one, get one free). Some users might not receive any offer during certain weeks. The dataset includes three JSON files:

  • portfolio.json — containing offer ids and meta data about each offer (duration, type, etc.)
  • profile.json — demographic data for each customer
  • transcript.json — records for transactions, offers received, offers viewed, and offers completed
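
As a minimal sketch, the three files can be loaded with pandas, assuming they are line-delimited JSON as in the Udacity workspace (the paths are illustrative):

    import pandas as pd

    # Load the three datasets into DataFrames
    portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
    profile = pd.read_json('data/profile.json', orient='records', lines=True)
    transcript = pd.read_json('data/transcript.json', orient='records', lines=True)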

Problem Statement

The primary problem I would like to look into is how likely a customer who receives an offer is to complete it. To understand this question, I built a supervised machine learning classification model based on the offer portfolio and demographic information.

Before working on the modeling, more questions need to be asked to better understand the data. Here, I list five questions I explored:

  • What is the relation between the offer view/completion rate and the offer difficulty level?
  • What is the relation between the offer view/completion rate and the offer duration?
  • Is there any correlation between offer types and the offer completion rate?
  • For individual customers, what are the relations between customer age, income, membership duration, and the offer completion rate?
  • Is customer gender important for the offer completion rate?

The goal of this project can be achieved by the following steps:

  • Data loading and cleaning
  • Exploratory data analysis
  • Data visualization
  • Data analytics to answer the above questions
  • Data preprocessing for machine learning modeling
  • Machine learning model implementation
  • Model improvement using training dataset
  • Model evaluation and validation using testing dataset
  • Final conclusions, discussion including reflection and improvement

The model metrics are given in the section on machine learning modeling.

Data Loading and Cleaning

1. portfolio

The screenshot above shows the original portfolio data for offers. The fields in the table include:

  • id (string) — offer id
  • offer_type (string) — type of offer, i.e., BOGO, discount, or informational
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time for the offer to be open, in days
  • channels (list of strings) — channels the offer is sent through (email, mobile, social, web)

To join with other tables, I renamed the field ‘id’ to ‘offer_id’ and used one-hot encoding to convert the channels list into four columns.
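
A minimal sketch of this cleaning step in pandas (the exact code in my notebook may differ slightly):

    # Rename 'id' so portfolio can be joined with the transcript data
    portfolio = portfolio.rename(columns={'id': 'offer_id'})

    # One-hot encode the channels list into four indicator columns
    for channel in ['email', 'mobile', 'social', 'web']:
        portfolio[channel] = portfolio['channels'].apply(lambda ch_list: int(channel in ch_list))
    portfolio = portfolio.drop(columns='channels')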

The new portfolio dataset looks as follows:

2. profile

This is the demographic data for customers. There are 17,000 data points/rows in the profile dataset; the screenshot below shows the first 10 rows:

Each row corresponds to one customer id and includes the following fields:

  • age (int) — age of the customer
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income.

Note that there are missing values in the ‘gender’ and ‘income’ columns: 12.7% of the gender and income values are missing.

The distributions of the customer age, gender and income are:

Distribution of customer age (left), gender (middle) and income (right).

The age distribution shows an outlier value of 118. There are 2,175 customers/data points with age = 118, and I found that these data points are all associated with missing gender and income.
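
This pattern is easy to verify with a quick check (a sketch; the printed values reflect the counts reported above):

    # All age-118 rows also have missing gender and income
    age_outliers = profile[profile['age'] == 118]
    print(len(age_outliers))                      # 2175
    print(age_outliers['gender'].isnull().all())  # True
    print(age_outliers['income'].isnull().all())  # True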

For the gender distribution, we can see that there are more males than females. Since I have no further information about the gender ‘O’, I kept this third gender.

For income distribution, I further segmented the data by gender (male and female):

Income distribution segmented by gender (male and female)

Interestingly, we can see that men are more concentrated at the low-income end than women; statistically, the men in this dataset have lower incomes than the women.

3. transcript

The transcript is a large dataset with 306,534 data points and four columns; the screenshot below shows the first 10 rows:

The definitions of the fields:

  • event (str) — record description (i.e., transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours since start of test. The data begins at time t=0
  • value — (dict of strings) — either an offer id or transaction amount depending on the record

For the following analysis and modeling, ‘person’ is renamed as ‘customer_id’.

All 17,000 customer ids have records in the transcript dataset. Of the 306,534 total event entries, there are 76,277 offers received, 57,725 offers viewed, 138,953 transactions, and 33,579 offers completed.

The value column of the transcript table contains three different types of entries: offer_id, amount (for transactions), and reward. Therefore, I split the transcript value into three columns: offer_id, amount, and reward.

Next, I split the transcript table into three tables based on the three offer events (see the sketch after this list):

  • The transcript_received table is for events ‘offer received’
  • The transcript_viewed table is for events ‘offer viewed’
  • The transcript_completed table is for events ‘offer completed’.
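
A sketch of both splitting steps, assuming the value dicts use the keys ‘offer id’/‘offer_id’, ‘amount’, and ‘reward’:

    # Expand the 'value' dict into three separate columns
    transcript['offer_id'] = transcript['value'].apply(
        lambda v: v.get('offer id', v.get('offer_id')))
    transcript['amount'] = transcript['value'].apply(lambda v: v.get('amount'))
    transcript['reward'] = transcript['value'].apply(lambda v: v.get('reward'))

    # Split the table into one DataFrame per offer event
    transcript_received = transcript[transcript['event'] == 'offer received']
    transcript_viewed = transcript[transcript['event'] == 'offer viewed']
    transcript_completed = transcript[transcript['event'] == 'offer completed']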

Note that a single customer can receive multiple offers, and even a single offer id may be received multiple times by the same customer. The situation is similar for the viewed and completed records.

Exploratory Data Analysis and Data Visualization

First I looked at the transcript data associated with the event ‘offer received’. We know that a customer may receive an offer multiple times, so I treated each unique combination of customer_id and offer_id as one entry/record, grouped all records by this combination, and counted how many times each customer received each offer. The head of the results is:
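
A sketch of the counting step, using the renamed columns from above:

    # Count how many times each customer received each offer
    received_count = (transcript_received
                      .groupby(['customer_id', 'offer_id'])
                      .size()
                      .reset_index(name='received_count'))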

For the ‘offer received’ events, there are 63,288 records with distinct combinations of customer_id and offer_id in total.

Here ‘received_count’ shows how many times a customer received each offer. Of its values, 51,570 data points are one, 10,523 are two, 1,124 are three, 66 are four, and 5 are five. That means 66 (customer_id, offer_id) pairs repeat four times, 5 pairs repeat five times, and so on, but most customers received a given offer just once.

For the ‘offer viewed’ events, a customer could sometimes view an offer multiple times. The head of the results is below, where ‘viewed_count’ records how many times a customer viewed the offer:

There are 49,135 data points for viewed offers. In total, 63,288 offers were sent to 16,994 customers, which means 49,135/63,288 = 77.6% of them were viewed: 65.3% of offers were viewed exactly once, while 22.4% were never viewed.

Next I counted how many offers were eventually completed. The following screenshot shows the first 10 records of offer completion. It looks like some customers completed the same offer multiple times:

There are 28,996 completed-offer records, which means the completion rate is 28,996/63,288 = 45.5%. Pretty good result!

To better understand the offer viewed and completed records, I segmented the data by different features and visualized the data analytics.

Question 1:

What is the relation between the offer view/completion rate and the difficulty level?

Counts of offers viewed and not viewed, segmented by difficulty level
Counts of offers completed, viewed but not completed, and neither completed nor viewed, segmented by difficulty level
Offer view and completion rates segmented by difficulty level

Note that in the figure legends, “not viewed” means offers neither viewed nor completed; some offers were completed without being viewed.

The average difficulty level of received offers that were viewed is 7.23, while the average difficulty level of offers not viewed is 9.38, higher than the viewed average. Difficulty level 7 shows the highest view rate, while difficulty level 20 shows the lowest. Interestingly, no offers with difficulty level zero were completed. Finally, looking at the view and completion rates together, difficulty level 7 gives both the highest view and completion rates, while difficulty level 20 shows a higher completion rate than view rate.

Question 2:

What is the relation between the offer view/completion rate and the duration?

Counts of offers completed, viewed but not completed, and neither completed nor viewed, segmented by duration values
Offer view and completion rates segmented by duration values

There are five values of offer duration, in days: 3, 4, 5, 7, and 10. No offer with a duration of 3 or 4 was completed. Among the remaining three durations, offers with a duration of 5 have the highest view rate but the lowest completion rate.

Question 3:

What is the relation between offer types and the offer completion rate?

Offer view and completion rates segmented by offer type

The results show that informational offers were viewed but never completed. The completion rate for discount offers is higher than for BOGO offers.

Question 4:

For individual customers, what are the relations between customer age, income, membership duration, and the offer completion rate?

Distribution of age segmented by completed/not completed offers
Distribution of income segmented by completed/not completed offers
Distribution of member days segmented by completed/not completed offers

The above three figures show interesting results. Younger people are less likely to complete offers. Higher-income people have a higher completion rate, as do people with longer memberships. Here the membership days (member days) are defined as the days since the first day of membership.

Question 5:

What is the relation between the completion rate and gender?

Completion ratios segmented by gender

Here the completed offers are divided into three parts by gender: female, male, and ‘O’; the not-completed offers are divided the same way. We can see that about 55% of completed offers belong to men, but 72.9% of not-completed offers belong to men. Men are more likely to drop offers than to complete them.

Machine Learning Modeling

In order to predict how likely a customer is to complete an offer, I built a supervised classification machine learning model. The model focuses on transcripts with the events ‘offer received’ and ‘offer completed’, and ignores the other two events, ‘offer viewed’ and ‘transaction’.

Data Preprocessing

The first step is to aggregate all the needed data from transcript, portfolio, and profile. Here is a sample of the collected data,

with features of:

received_count, difficulty, duration, offer_type, reward, email, mobile, social, web, age, gender, income, member_year, member_days

where ‘member_year’ is the year in which the customer became a member, and ‘member_days’ is the number of days since the first day of membership.

‘event_completed’ is the label, with binary values 0 and 1. I dropped the null values of gender, so the data includes 55,222 data points, with 27,280 not completed (event_completed = 0) and 27,942 completed (event_completed = 1).

Among the fields, ‘offer_type’ and ‘gender’ are categorical, so I used LabelEncoder to convert them to numbers. Then I split the data into 80% training and 20% testing, with the following shapes:

  • Training Features Shape: (44177, 14)
  • Training Labels Shape: (44177,)
  • Testing Features Shape: (11045, 14)
  • Testing Labels Shape: (11045,)

I rescaled the features to values between 0 and 1 as another step of data preprocessing.
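
A sketch of these preprocessing steps, assuming the aggregated table is a DataFrame named data (the random seed is illustrative):

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder, MinMaxScaler

    # Encode the categorical fields as integers
    for col in ['offer_type', 'gender']:
        data[col] = LabelEncoder().fit_transform(data[col])

    # 80/20 train/test split
    X = data.drop(columns='event_completed')
    y = data['event_completed']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Rescale all features to the range [0, 1]
    scaler = MinMaxScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)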

Model Metrics

Supervised classification models were built. The main metrics for model performance are

- Precision
- Recall
- Weighted F1 score
- ROC AUC score

Metric Justification

This problem is a supervised classification machine learning problem. The binary labels are fairly balanced, so I first used the confusion matrix, in particular precision and recall, to measure the classification accuracy. Note that for balanced labels, the precision and recall of both label values need to be examined.

To simplify the metrics, and since I found the precision and recall values for this problem to be similar, I used the weighted F1 score, which combines precision and recall, to measure the model accuracy. ROC-AUC curves also help to evaluate the performance of various classifiers, so I used the ROC-AUC score as another metric of classifier accuracy.
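
All four metrics are available in scikit-learn; a sketch for any fitted classifier named clf:

    from sklearn.metrics import (classification_report, confusion_matrix,
                                 f1_score, roc_auc_score)

    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred))         # per-class precision/recall
    print(f1_score(y_test, y_pred, average='weighted'))  # weighted F1 score
    print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
    print(confusion_matrix(y_test, y_pred))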

Implementation and Improvement

The baseline models are given by logistic regression and a random forest classifier without parameter fine-tuning. The logistic regression algorithm is straightforward and easy to understand, so in many cases it is the first choice for a supervised classification problem, setting up a “bottom line” for model accuracy. Random forest is another widely used classification algorithm and sets up a second “bottom line”.

I ran the baseline models on the testing data and found that logistic regression gives a weighted F1 score of about 0.72, while random forest gives a higher F1 score of 0.76.
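
A sketch of the baseline run with default parameters (max_iter is raised only so logistic regression converges on the scaled features):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score

    for clf in [LogisticRegression(max_iter=1000), RandomForestClassifier()]:
        clf.fit(X_train, y_train)
        score = f1_score(y_test, clf.predict(X_test), average='weighted')
        print(type(clf).__name__, round(score, 2))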

Then I tried the following algorithms and evaluated the trained models on the testing data using the ROC-AUC score:

AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier, LogisticRegression, PassiveAggressiveClassifier, RidgeClassifierCV, SGDClassifier, Bernoulli Naive Bayes, Gaussian Naive Bayes, KNeighborsClassifier, Linear Discriminant Analysis, Quadratic Discriminant Analysis, XGBoost Classifier, LightGBM Classifier

Among the above algorithms, the XGBoost classifier gives a weighted F1 score of 0.80 and a ROC-AUC score of 0.796, the top scores, followed by the LightGBM classifier with a weighted F1 score of 0.79 and a ROC-AUC score of 0.787.
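
The comparison loop looks roughly like this (a sketch with a subset of the classifiers; xgboost and lightgbm are separate packages):

    from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from xgboost import XGBClassifier
    from lightgbm import LGBMClassifier

    classifiers = [AdaBoostClassifier(), GradientBoostingClassifier(),
                   XGBClassifier(), LGBMClassifier()]
    for clf in classifiers:
        clf.fit(X_train, y_train)
        auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
        print(type(clf).__name__, round(auc, 3))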

Therefore, I chose the XGBoost classifier for this problem. Later, I also combined XGBoost and LightGBM to evaluate the feature importance for the model.

For model improvement, I grid-searched the parameters of XGBoost, such as n_estimators and learning_rate, averaging over 5-fold cross-validation.
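
A sketch of the tuning step; the parameter grid shown is illustrative rather than the exact one I searched:

    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    param_grid = {'n_estimators': [100, 200, 500],
                  'learning_rate': [0.01, 0.1, 0.3]}
    search = GridSearchCV(XGBClassifier(), param_grid,
                          scoring='f1_weighted', cv=5)  # 5-fold cross-validation
    search.fit(X_train, y_train)
    best_model = search.best_estimator_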

Model Evaluation and Justification

Applying the model to the testing dataset, the model accuracy increases from the baseline 0.72 to 0.80. The detailed precision and recall scores for the testing dataset are as follows:

And the confusion matrix for testing dataset is

Confusion matrix for the testing dataset

where the rows of the matrix are the model's predictions and the columns are the true values from the testing labels.

The model accuracy (F1 score) is ~0.8, while the overall offer completion rate is only 45.5%. The predictor therefore has a much higher probability of targeting customers who will complete offers, and can help Starbucks significantly increase the offer completion rate.

I also analyzed the feature importance by averaging the XGBoost and LightGBM importances:

As a result, ‘difficulty’ shows a dominant importance for the classification, followed by ‘income’, ‘age’, and the offer information. Also note that ‘reward’ does not impact the model. This figure is useful for feature selection.
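
A sketch of the averaging, assuming fitted models named xgb_model and lgbm_model (both expose feature_importances_):

    # Normalize each model's importances to sum to one, then average
    xgb_imp = xgb_model.feature_importances_ / xgb_model.feature_importances_.sum()
    lgbm_imp = lgbm_model.feature_importances_ / lgbm_model.feature_importances_.sum()
    avg_imp = (xgb_imp + lgbm_imp) / 2

    # Print features from most to least important
    for name, imp in sorted(zip(X.columns, avg_imp), key=lambda t: -t[1]):
        print(f'{name}: {imp:.3f}')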

Conclusions and Discussion

In this project I looked into the simulated dataset from Starbucks that mimics customer behavior.

I delivered a series of data analytics results and built a model to predict whether an offer will be completed by an individual customer.

Data Analytic Results

The data shows that no customer completed a received offer if:

- The offer difficulty level is 0.
- The offer duration is less than 5 days.
- The offer type is informational.

I also found that younger people, people with lower incomes, relatively new members, and men are less likely to complete offers. To increase revenue, Starbucks should focus more on offers with the features associated with higher completion rates.

Machine Learning Predictor

The supervised classification machine learning model gives a predictor of whether an individual customer will complete an offer. The difficulty level, income, age, channels, and offer type are among the most important features for the model. As summarized above, the model's F1 score of ~0.8, against an overall completion rate of 45.5%, means the predictor can target the customers likely to complete offers and help Starbucks significantly increase the completion rate.

Reflection

The most difficult part of this project was splitting the transcript dataset and defining the problem well. Do we want a prediction model based on individual customers, or one based on both customers and offers? I chose the latter, combining transcript entries for the same customer and offer into one record.

I enjoyed the data analytics part, discovering how customer behavior depends on offer information such as difficulty level, duration, and offer type. Overall, the whole project was really interesting.

Improvement

For future improvement, I will use a wider range of grid search to fine-tune the model parameters and improve the model accuracy. To do this, I would need a GPU to accelerate the computation.

Another direction is to use the data analytics results to build a better model. For example, since informational offers never get completed, I can add a rule before the classifier, sketched here (offer_type, model, and features are placeholder names):

    if offer_type == 'informational':
        prediction = 0  # never completed, per the analysis
    else:
        prediction = model.predict(features)

This rule-based shortcut, grounded in the analytics results, can also improve the model accuracy.

The code can be found on my GitHub.

Find me on LinkedIn
