ISYE 6740 Homework 3
Total 100 points.
1. Basic optimization. (30 points.)
Consider a simpli_x000c_ed logistic regression problem. Given m training samples (xi; yi), i = 1; : : : ;m.
The data xi 2 R (note that we only have one feature for each sample), and yi 2 f0; 1g. To _x000c_t a
logistic regression model for classi_x000c_cation, we solve the following optimization problem, where 2 R
is a parameter we aim to _x000c_nd:
`( ); (1)
where the log-likelhood function
`( ) =
f???? log(1 + expf???? xig) + (yi ???? 1) xig :
(a) (10 points) Show step-by-step mathematical derivation for the gradient of the cost function `( )
in (1) and write a pseudo-code for performing gradient descent to _x000c_nd the optimizer . This is
essentially what the training procedure does. (pseudo-code means you will write down the steps
of the algorithm, not necessarily any speci_x000c_c programming language.)
(b) (10 points) Present a stochastic gradient descent algorithm to solve the training of logistic
regression problem (1).
(c) (10 points) We will show that the training problem in basic logistic regression problem
is concave. Derive the Hessian matrix of `( ) and based on this, show the training problem (1)
is concave (note that in this case, since we only have one feature, the Hessian matrix is just a
scalar). Explain why the problem can be solved e_x000e_ciently and gradient descent will achieve a
unique global optimizer, as we discussed in class.
2. Comparing Bayes, logistic, and KNN classi_x000c_ers. (30 points)
In lectures we learn three di_x000b_erent classi_x000c_ers. This question is to implement and compare them. We are
suggest use Scikit-learn, which is a commonly-used and powerful Python library with various machine
learning tools. But you can also use other similar library in other languages of your choice to perform
Part One (Divorce classi_x000c_cation/prediction). (20 points)
This dataset is about participants who completed the personal information form and a divorce predic-
The data is a modi_x000c_ed version of the publicly available at https://archive.ics.uci.edu/ml/datasets/
Divorce+Predictors+data+set (by injecting noise so you will not replicate the results on uci web-
site). There are 170 participants and 54 attributes (or predictor variables) that are all real-valued. The
dataset marriage.csv. The last column of the CSV _x000c_le is label y (1 means divorce”, 0 means no
divorce”). Each column is for one feature (predictor variable), and each row is a sample (participant).
A detailed explanation for each feature (predictor variable) can be found at the website link above.
Our goal is to build a classi_x000c_er using training data, such that given a test sample, we can classify (or
essentially predict) whether its label is 0 (no divorce”) or 1 (divorce”).
Build three classi_x000c_ers using (Naive Bayes, Logistic Regression, KNN). Use the _x000c_rst 80% data for
training and the remaining 20% for testing. If you use scikit-learn you can use train test split to split
Remark: Please note that, here, for Naive Bayes, this means that we have to estimate the variance for
each individual feature from training data. When estimating the variance, if the variance is zero to
close to zero (meaning that there is very little variability in the feature), you can set the variance to
be a small number, e.g., _x000f_ = 10????3. We do not want to have include zero or nearly variance in Naive
Bayes. This tip holds for both Part One and Part Two of this question.
(a) (10 points) Report testing accuracy for each of the three classi_x000c_ers. Comment on their perfor-
mance: which performs the best and make a guess why they perform the best in this setting.
(b) (10 points) Use the _x000c_rst two features to train three new classi_x000c_ers. Plot the data points and
decision boundary of each classi_x000c_er. Comment on the di_x000b_erence between the decision boundary
for the three classi_x000c_ers. Please clearly represent the data points with di_x000b_erent labels using di_x000b_erent
Part Two (Handwritten digits classi_x000c_cation). (10 points) Repeat the above using the MNIST
Data in our Homework 2. Here, give digit” 6 label y = 1, and give digit” 2 label y = 0. All the
pixels in each image will be the feature (predictor variables) for that sample (i.e., image). Our goal
is to build classi_x000c_er to such that given a new test sample, we can tell is it a 2 or a 6. Using the _x000c_rst
80% of the samples for training and remaining 20% for testing. Report the classi_x000c_cation accuracy on
testing data, for each of the three classi_x000c_ers. Comment on their performance: which performs the best
and make a guess why they perform the best in this setting.
3. Naive Bayes for spam _x000c_ltering. (40 points)
In this problem we will use the Naive Bayes algorithm to _x000c_t a spam _x000c_lter by hand. This will en-
hance your understanding to Bayes classi_x000c_er and build intuition. This question does not involve any
programming but only derivation and hand calculation.
Spam _x000c_lters are used in all email services to classify received emails as Spam” or Not Spam”. A
simple approach involves maintaining a vocabulary of words that commonly occur in Spam” emails
and classifying an email as Spam” if the number of words from the dictionary that are present in the
email is over a certain threshold. We are given the vocabulary consists of 15 words
V = fsecret, o_x000b_er, low, price, valued, customer, today, dollar, million, sports, is, for, play, healthy, pizzag:
We will use Vi to represent the ith word in V . As our training dataset, we are also given 3 example
• million dollar o_x000b_er
• secret o_x000b_er today
• secret is secret
and 4 example non-spam messages
• low price for valued customer
• play secret sports today
• sports is healthy
• low price pizza
Recall that the Naive Bayes classi_x000c_er assumes the probability of an input depends on its input feature.
The feature for each sample is de_x000c_ned as x(i) = [x(i)
1 ; x(i)
2 ; : : : ; x(i)
d ]T , i = 1; : : : ;m and the class of the
ith sample is y(i). In our case the length of the input vector is d = 15, which is equal to the number
of words in the vocabulary V . Each entry x(i)
j is equal to the number of times word Vj occurs in the
(a) (5 points) Calculate class prior P(y = 0) and P(y = 1) from the training data, where y = 0
corresponds to spam messages, and y = 1 corresponds to non-spam messages. Note that these
class prior essentially corresponds to the frequency of each class in the training sample.
(b) (10 points) Write down the feature vectors for each spam and non-spam messages.
(c) (15 points) In the Naive Bayes model, assuming the keywords are independent of each other (this
is a simpli_x000c_cation), the likelihood of a sentence with its feature vector x given a class c is given
P(xjy = c) =
c;k; c = f0; 1g
where 0 c;k 1 is the probability of word k appearing in class c, which satis_x000c_es
c;k = 1; 8c:
Given this, the complete log-likelihood function for our training data is given by
`( 1;1; : : : ; 1;d; 2;1; : : : ; 2;d) =
k log y(i);k
(In this example, m = 7.) Calculate the maximum likelihood estimates of 0;1, 0;7, 1;1, 1;15 by
maximizing the log-likelihood function above. (Hint: We are solving a constrained maximization
problem. To do this, remember, you need to introduce two Lagrangian multiplier because you
have two constraints.)
(d) (10 points) Given a test message today is secret”, using the Naive Bayes classier that you have
trained in Part (a)-(c), to calculate the posterior and decide whether it is spam or not spam.