You have just begun working for a baby products company called BabiesRWe (in no way related to another company with a similar name). The company has a database of 1 million existing customers who are registered with an account on its website.
Just prior to your arrival, the company sent an advertisement via email to a random sample of its existing customers for a premium bottle warmer. Now, the company wants to send out another advertisement to another sample of existing customers for the same product. However, instead of choosing randomly, they want to target the customers most likely to respond based on the results of the first round of advertisements. You have been given a data set by your manager (baby_data.csv) with the goal of creating a classification model to send targeted ads.
Target variable:
- purchased: whether the customer used the discount offer
Attributes:
- repeat_customer: whether the customer has previously purchased a product from BabiesRWe
- total_spent: the total amount of money the customer has spent on BabiesRWe products
- children: how many children the customer has
- adults: how many adults live in the customer’s household
You also have the following information about the product and advertisement:
- Bottle warmer price: $40
- Bottle warmer cost: $10
- Advertisement cost: $0.50
- A) Create a cost/benefit matrix for this situation using the information above (I recommend just using the create table function in Word).
- B) Evaluate both models (a decision tree and a logistic regression) on the test set and report the precision, recall, and ROC AUC for each model at a 50% probability threshold (already done; see the outputs below). Then explain what each of these measures means in this context (this part still needs to be answered).
- Decision tree (Positive class = Yes):
https://bigml.com/shared/evaluation/28lE5IzJE8yVI3KnETdXssMblgu
(ROC Curve)
- Precision = 65.06%
- Recall = 37.09%
- ROC AUC = 0.0
- Logistic regression (Positive class = Yes):
https://bigml.com/shared/evaluation/6rkTUlqy5MtKPamaIGD5Qzx2BxT
(ROC Curve)
- Precision = 59.18%
- Recall = 30.80%
- ROC AUC = 0.0
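As a reminder of how precision and recall are derived from a confusion matrix, here is a minimal Python sketch. The function name `precision_recall` and the counts are hypothetical and chosen only for illustration; they are not taken from the BigML evaluations linked above. Here a true positive is a customer the model flagged who did purchase, a false positive is a flagged customer who did not purchase, and a false negative is a purchaser the model did not flag.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for the positive class (purchased = Yes)."""
    # Precision: of the customers the model flagged, what share purchased?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the customers who purchased, what share did the model flag?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for illustration only (not the BigML results).
tp, fp, fn = 60, 40, 90
precision, recall = precision_recall(tp, fp, fn)
print(f"precision = {precision:.2%}, recall = {recall:.2%}")
```

In this setting, precision speaks to how many of the ads sent would reach actual buyers, while recall speaks to how many of the potential buyers would receive an ad at all.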
- C) Explain what the probability threshold means in this context and discuss the relationship between precision and recall that you see in each model as you vary the probability threshold.
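One way to build intuition for the threshold is to sweep it over a small set of scored customers and recompute precision and recall each time. The sketch below is only an illustration: the probabilities, labels, and the helper `precision_recall_at` are made up, and real model output would come from the BigML evaluations.

```python
def precision_recall_at(threshold, probs, labels):
    """Classify as 'Yes' when the predicted probability is >= threshold."""
    predicted = [p >= threshold for p in probs]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and not y)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical predicted purchase probabilities and true outcomes (1 = purchased).
probs = [0.95, 0.85, 0.70, 0.65, 0.55, 0.45, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

for threshold in (0.5, 0.6, 0.8):
    p, r = precision_recall_at(threshold, probs, labels)
    print(f"threshold = {threshold:.1f}: precision = {p:.2f}, recall = {r:.2f}")
```

Raising the threshold can only shrink the set of customers flagged, so recall never increases as the threshold goes up. Precision often rises with the threshold, but that is not guaranteed on every data set.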
- E) Suppose that you are given a fixed budget of $50,000 to email targeted ads for the bottle warmer and that you decide to set your models to a modestly conservative 60% probability threshold.
Using the confusion matrices (the tables at the links above) from the BigML output and the cost/benefit matrix from Part A, what is the expected profit per targeted advertisement sent under the decision tree and under logistic regression, subject to your budget constraint?
Based on your calculations, which model yields a greater expected profit, and would you recommend BabiesRWe send targeted ads?
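The arithmetic for this part can be sketched in Python as follows. This is a rough illustration under stated assumptions, not a model answer: it assumes a customer who receives an ad and buys yields $40 - $10 - $0.50, a customer who receives an ad and does not buy costs $0.50, and a customer who receives no ad contributes $0. The confusion-matrix counts and the function name `expected_profit_per_ad` are hypothetical; the real counts should be read from the BigML confusion matrices at the 60% threshold.

```python
PRICE = 40.00    # bottle warmer price
COST = 10.00     # bottle warmer cost
AD_COST = 0.50   # cost of one emailed advertisement
BUDGET = 50_000  # fixed advertising budget

# Assumed cost/benefit values for customers who are sent an ad.
BENEFIT_TP = PRICE - COST - AD_COST  # ad sent, customer purchased
BENEFIT_FP = -AD_COST                # ad sent, customer did not purchase


def expected_profit_per_ad(tp, fp):
    """Average profit per ad sent, given counts of targeted customers."""
    return (tp * BENEFIT_TP + fp * BENEFIT_FP) / (tp + fp)


# The budget caps the number of ads that can be sent.
ads_affordable = int(BUDGET / AD_COST)

# Hypothetical counts at a 60% threshold (not the BigML results).
tp, fp = 70, 30
per_ad = expected_profit_per_ad(tp, fp)
total = per_ad * ads_affordable
print(f"ads affordable = {ads_affordable}")
print(f"expected profit per ad = ${per_ad:.2f}, total = ${total:,.2f}")
```

Repeating the calculation with each model's own true-positive and false-positive counts gives the two per-ad figures to compare. Note that the total assumes each model can actually flag as many customers as the budget allows.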