Logistic Regression from Scratch


Logistic regression is a supervised classification algorithm that uses the logistic function to model a dependent variable with a discrete set of possible outcomes.

Logistic regression can be binary, helping us, for example, to predict whether a given email is spam or not, or it can model more than two possible discrete outcomes, in which case it is known as multinomial logistic regression.

Logistic regression can be viewed as an extension of linear regression, adapted for classification tasks. Both calculate a weighted sum of the input variables plus a bias term. However, whereas linear regression outputs this sum directly, logistic regression passes the sum through the logistic (sigmoid) function:

$$\hat{p} = \sigma(\mathbf{w}^{T}\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^{T}\mathbf{x}}}$$

where $\mathbf{w}$ denotes the weight vector (including the bias term) and $\mathbf{x}$ the input features (with a leading 1 for the bias).

Using this logistic value $\hat{p}$, we can predict the class by comparing it with the decision threshold (DT):

$$\hat{y} = \begin{cases} 1 & \text{if } \hat{p} \geq \text{DT} \\ 0 & \text{if } \hat{p} < \text{DT} \end{cases}$$
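
As a quick illustration (a minimal sketch with hypothetical weights and input, separate from the full implementation later in this post), the prediction rule amounts to a few lines of NumPy:

import numpy as np

def sigmoid(s):
    return 1/(1+np.exp(-s))

w = np.array([-0.5, 1.2, 0.8])  # hypothetical weights, bias first
x = np.array([1.0, 0.3, -0.1])  # input with a leading 1 for the bias term

p = sigmoid(np.dot(w, x))       # predicted probability of class 1
y_hat = 1 if p >= 0.5 else 0    # default decision threshold DT = 0.5
print(p, y_hat)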

Selecting the decision threshold for logistic regression

For many classification problems, the decision threshold is left at the default value of 0.5. It is prudent, however, to consider the relative importance/cost of false positives and false negatives in your classification problem before settling on a value.

Varying the decision threshold has opposing effects on two important metrics that we typically use for evaluating a classifier:

  • precision, TP/(TP + FP), which tells us how many of the positive identifications were correct
  • recall, TP/(TP + FN), which tells us how many of the actual positives were correctly identified

Increasing the decision threshold leads to a decrease in the number of false positives, and an increase in false negatives, thus leading to higher precision and lower recall. Conversely, decreasing the decision threshold results in decreased precision and increased recall. There is a general tradeoff between precision and recall, with different classification problems requiring different combinations of both.
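
One way to visualize this tradeoff is to plot precision and recall against the decision threshold, for example with scikit-learn's precision_recall_curve (a minimal sketch on toy data; in practice the labels and scores would come from your own classifier):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# toy data: true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# precision tends to rise and recall to fall as the threshold increases
plt.plot(thresholds, precision[:-1], label='precision')
plt.plot(thresholds, recall[:-1], label='recall')
plt.xlabel('decision threshold')
plt.legend()
plt.show()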

If our classifier is used for predicting, for example, life-threatening diseases, our objective is to reduce false negatives as much as possible (high recall), with an increase in false positives (low precision) being part of the tradeoff. This is acceptable because discharging a sick patient leaves us with fewer options to correct the mistake than wrongly identifying a healthy patient as sick does. A similar case where high recall is desirable and low precision is tolerable is fraud detection.

On the other hand, let us consider building a classification system for identifying potential new mine sites in a large mining area. In this case, we want to reduce the number of false positives (high precision), as initial exploration/verification costs can be extremely high. If we wrongly dismiss a promising site, there are still other candidate sites available in the considered region, so low recall is tolerable.

A helpful method for deciding which decision threshold to select is to examine the ROC curve for the classification problem, plotting the true positive rate, TPR = TP/(TP + FN), against the false positive rate, FPR = FP/(FP + TN), for different threshold settings.
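
With scikit-learn, the ROC curve can be obtained directly from the true labels and the predicted probabilities (again a minimal sketch on the same kind of toy data as above):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55])

# one (FPR, TPR) point per threshold setting
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

plt.plot(fpr, tpr, marker='o')
plt.plot([0, 1], [0, 1], linestyle='--')  # random-guess baseline
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve (AUC = %.2f)' % roc_auc_score(y_true, y_scores))
plt.show()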

How to train a logistic regression model

Intuitively, we would like our loss function to have a form that is small when the model predicts high probabilities for positive instances ($y_i = 1$) and low probabilities for negative instances ($y_i = 0$), and large otherwise.

The log loss function for logistic regression meets these criteria:

$$J(\mathbf{w}) = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y_i \log\hat{p}_i + (1 - y_i)\log\left(1 - \hat{p}_i\right)\right]$$

where $m$ is the number of training instances and $\hat{p}_i = \sigma(\mathbf{w}^{T}\mathbf{x}_i)$ is the predicted probability for instance $i$.

The log loss function is convex, and we most often minimize it with a gradient descent algorithm.
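
Concretely, starting from some initial weights, gradient descent repeatedly updates each weight in the direction of the negative gradient of the loss, scaled by a learning rate $\eta$:

$$w_j := w_j - \eta \, \frac{\partial J(\mathbf{w})}{\partial w_j}$$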

As we will implement logistic regression from scratch using the gradient descent method, we also need to derive the partial derivatives of the cost function:

$$\frac{\partial J(\mathbf{w})}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\left(\sigma(\mathbf{w}^{T}\mathbf{x}_i) - y_i\right) x_{ij}$$

The partial derivative with respect to the j-th weight can thus be obtained by computing, for each instance, the product of the prediction error and the j-th feature value, and then averaging over all training instances.
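
In vectorized NumPy form, this averaged gradient can be sketched as follows (a minimal illustration on a toy feature matrix; the full implementation later in this post folds the 1/m factor into the learning rate):

import numpy as np

def sigmoid(s):
    return 1/(1+np.exp(-s))

def log_loss_gradient(X, y, w):
    # average of (prediction error) * (feature value) over all m instances
    m = len(y)
    errors = sigmoid(X @ w) - y   # prediction errors, shape (m,)
    return X.T @ errors / m       # one entry per weight

# toy check on random data with a leading column of ones for the bias
rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])
y = np.array([0, 1, 0, 1, 1])
print(log_loss_gradient(X, y, np.zeros(X.shape[1])))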

Regularization of logistic regression

Logistic regression can be prone to overfitting, especially in classification problems with a large number of features. One of the most common remedies is to apply L2 regularization, which adds a penalty term to the loss function:

$$J_{\text{reg}}(\mathbf{w}) = J(\mathbf{w}) + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^{2}$$

where $\lambda$ is the regularization parameter. This also leads to a slight modification of the formula for the partial derivatives of the loss function:

$$\frac{\partial J_{\text{reg}}(\mathbf{w})}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\left(\sigma(\mathbf{w}^{T}\mathbf{x}_i) - y_i\right) x_{ij} + \frac{\lambda}{m}\, w_j$$
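
Building on the gradient sketch above, the regularized gradient could be written as follows (again only a sketch; by convention the bias weight, here the first entry, is usually left unregularized). The from-scratch implementation in the next section keeps things simple and does not include this regularization term.

import numpy as np

def sigmoid(s):
    return 1/(1+np.exp(-s))

def regularized_gradient(X, y, w, lam):
    # gradient of the log loss plus the L2 penalty term (lambda/m) * w_j
    m = len(y)
    grad = X.T @ (sigmoid(X @ w) - y) / m
    penalty = (lam / m) * w
    penalty[0] = 0.0              # do not regularize the bias term
    return grad + penalty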

Logistic regression from scratch (in Python)

We will now demonstrate how to implement logistic regression from scratch in Python. First, we generate a data set by sampling two classes from multivariate normal distributions. We will use two features and binary class labels (denoted as 1 and 0), and we will add a column of ones to account for the bias term.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
%matplotlib inline
 
np.random.seed(1)
number_of_points = 8000
means = [[-1,-1],[0.1,2.8]]
cov = -0.3
covariances = [[[1,cov],[cov,1]],[[1,cov],[cov,1]]] 
a = np.random.multivariate_normal(means[0],covariances[0],number_of_points) # class 0 samples
b = np.random.multivariate_normal(means[1],covariances[1],number_of_points) # class 1 samples
X = np.vstack((a,b))
X = np.hstack((np.ones((X.shape[0],1)),X)) # adding column of ones (biases)
y = np.array([i//number_of_points for i in range(2*number_of_points)]) # labels: 0 for a, 1 for b
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10)
 
plt.figure(figsize=(10,16))
plt.subplot(2, 1, 1)
plt.scatter(a[:,0],a[:,1],c='black',alpha=0.5,label='class 0')
plt.scatter(b[:,0],b[:,1],c='yellow',alpha=0.5,label='class 1')
plt.legend()
plt.xlabel('x1')
plt.ylabel('x2')

Scatter plot of the generated data set (class 0 in black, class 1 in yellow)

We next define our custom Logistic Regression class:

class LogisticRegressionCustom():
    
    def __init__(self, l_rate=1e-5, n_iterations=50000):
        self.l_rate = l_rate
        self.n_iterations = n_iterations
 
    def initial_weights(self, X):
        self.weights = np.zeros(X.shape[1])
 
    def sigmoid(self, s):
        return 1/(1+np.exp(-s))    
 
    def binary_cross_entropy(self, X, y):
        # average log loss over all instances
        p = self.sigmoid(np.dot(X,self.weights))
        return -(1/len(y))*(y*np.log(p)+(1-y)*np.log(1-p)).sum()  
    
    def gradient(self, X, y):
        # descent direction for the log loss (the 1/m factor is absorbed into the learning rate)
        return np.dot(X.T, (y-self.sigmoid(np.dot(X,self.weights))))    
 
    def fit(self, X, y):
        self.initial_weights(X)  
        for i in range(self.n_iterations):
            # step along the descent direction returned by gradient()
            self.weights = self.weights+self.l_rate*self.gradient(X,y)
            if i % 10000 == 0:
                # note: the loss is monitored on the globally defined held-out set (X_test, y_test)
                print("Loss after %d steps is: %.10f " % (i,self.binary_cross_entropy(X_test,y_test)))
        print("Final loss after %d steps is: %.10f " % (i,self.binary_cross_entropy(X_test,y_test)))
        print("Final weights: ", self.weights)
        return self    
 
    def predict(self, X):        
        y_predict = []
        for t in X:
            # default decision threshold of 0.5
            y_predict.append(1 if self.sigmoid(np.dot(self.weights,t)) > 0.5 else 0)
        return y_predict    
    
    def predict_proba(self, X):        
        y_predict = []
        for t in X:
            y_predict.append(self.sigmoid(np.dot(self.weights,t)))
        return y_predict

We then use it to train our model on the previously generated data:

lr = LogisticRegressionCustom()
lr.fit(X_train,y_train)

obtaining the following weights for our logistic regression model:

[-2.92024022 2.48637602 4.53860122]

with a final loss of 0.0353669480 (evaluated on the test set) after 50,000 iterations.
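
We can also check the overall accuracy of the trained model on the test set (a short sketch using scikit-learn's accuracy_score; the exact value depends on the generated data):

from sklearn.metrics import accuracy_score

print('Test accuracy: %.4f' % accuracy_score(y_test, lr.predict(X_test)))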

We can visually inspect the misclassified points in the test data set using:

def colors(s):
    if s == 0:
        return 'black'
    elif s == 1:
        return 'yellow'
    else: 
        return 'coral'  # label 2 marks misclassified points
    
plt.figure(figsize=(10,16))
y_predict = lr.predict(X_test)
classify=[]
for i,p in enumerate(y_predict):
    if p == y_test[i]:
        classify.append(p)  # correctly classified: keep the class label
    else:
        classify.append(2)  # misclassified: assign label 2
        
plt.subplot(2, 1, 2) 
plt.xlabel('x1')
plt.ylabel('x2')
for i,x in enumerate(X_test): 
    plt.scatter(x[1],x[2],alpha=0.5,c=colors(classify[i]))  

giving us the following scatter plot:

Test set predictions with misclassified points highlighted in coral

Misclassified data points (shown in coral in the image above) predictably lie either near the boundary between the two classes or inside the region of the opposite class.

Further insight can be gained by generating a contour plot of the predicted probabilities using:

import seaborn as sns
sns.set(style="white")
N=1000
x_values=np.linspace(-5.0, 5.0, N)
y_values=np.linspace(-5.0, 5.0, N)
x_grid, y_grid=np.meshgrid(x_values, y_values)
grid_2d=np.c_[x_grid.ravel(), y_grid.ravel()]
new=[[1]+i for i in grid_2d.tolist()] # prepend 1 to each grid point for the bias term
classes=np.array(lr.predict_proba(new)).reshape(x_grid.shape) # predicted P(y = 1) on the grid
 
fig,ax=plt.subplots(figsize=(20, 8))
cs=ax.contourf(x_grid, y_grid, classes, 25, cmap="cividis", vmin=0, vmax=1)
cbar=fig.colorbar(cs)
cbar.set_label("P(y = 1)")
cbar.set_ticks([0, 0.25, 0.5, 0.75, 1])
 
ax.scatter(X_test[:,1],X_test[:,2],c=y_test[:],s=40,cmap="cividis",vmin=-.25,vmax=1.25,
           edgecolor="white",linewidth=1)
 
ax.set(aspect=0.7, xlim=(-5, 5), ylim=(-5, 5), xlabel="x1", ylabel="x2")

resulting in this contour map for our classification problem:

Contour map of the predicted probability P(y = 1) over the feature space, with the test points overlaid

Comparison of logistic regression from scratch with scikit-learn implementation

We can also assess the results obtained with our logistic regression from scratch by comparing them with those obtained with the LogisticRegression class from the scikit-learn library. We will set the regularization parameter C to a high value to minimize regularization in scikit-learn, as our from-scratch implementation does not include a regularization term either.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
model = LogisticRegression(C=1e5, solver='saga', fit_intercept=False)
# a high C ensures very weak regularization for a fair comparison with our method;
# fit_intercept=False because we already added a column of ones for the bias term
model = model.fit(X_train, y_train)
 
y_pred_proba = model.predict_proba(X_test)
print('*** Final set of weights and logloss (with scikit-learn) ***')
print('Weights: ', model.coef_[0])
print('Logloss: ', log_loss(y_test, y_pred_proba))

A comparison of the weights and losses for both approaches shows negligible differences between the two implementations:

Method                             | Weights vector                       | Loss
Logistic regression from scratch   | [-2.9202402 2.48637602 4.53860122]   | 0.0353669480
Logistic regression (scikit-learn) | [-2.9204342 2.48644559 4.53881492]   | 0.0353667283

Written by:

Samo Plibersek, Ph.D.