Ever wondered how logistic regression really works?

You’re not alone. It’s confusing.

This post drills down into the mechanics of logistic regression. We’ll go through the mathematical aspects and also construct our own logistic regression function in R. Exciting!

What is logistic regression?

In linear regression the dependent variable Y takes continuous values, like house prices for an area or the income of an individual.

Logistic regression differs in that it is used for modelling categorical data. Typically you model data that falls into two categories, but using logistic regression with more than two categories is certainly possible.

We will assume for this post that our response variable Y falls into exactly two categories. Here are some examples where you might want to use logistic regression:

  • If a plant lives/dies
  • If someone was accepted/rejected for a home loan
  • If a student graduates/drops out of university
  • Owning/not owning a TV/laptop/segway/aircraft

We represent one of the categories by Y = 1 and the other by Y = 0. For example a plant that lives could be represented by Y=1, and a dead plant by Y=0.

Mathematical representation

Let us have n data points, or training samples. Each of these training samples comprises measurements of m independent variables, denoted X_1, X_2, \cdots, X_m. The realised values of these variables are given by X_1 = x_1, X_2 = x_2, \cdots, X_m = x_m.

Each training sample also has a value of the dependent variable Y associated with it, and we let y be the realised value of Y.  We have that Y = y where either y =1 or y=0.

The ith training sample will consist of the data points x_{i1}, x_{i2}, \cdots, x_{im} and the observed response value y_i. Because our data is all realised values (we know the actual values of the variables) we use lowercase notation to refer to it.

We refer to individual column vectors like

    \begin{align*} \mathbf{x}_{*1} &= \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix} \end{align*}

and row vectors like

    \[  \mathbf{x}_{1 *} = \left[ x_{11}, x_{12}, \cdots, x_{1m}  \right]  \]

We will find it useful to define the matrix \mathbf{X}, which is known as the design matrix. Let

    \[ \mathbf{X} = \left[  \mathbf{x}_{* 0}, \mathbf{x}_{* 1},\mathbf{x}_{*2}, \cdots,\mathbf{x}_{*m} \right] \]

where \mathbf{x}_{*0} is a vector of 1’s that allows us to estimate the intercept term.

In addition, let p_i = p_i (\mathbf{x}_{i*}) = P(Y = 1 | X = \mathbf{x}_{i*}), the probability of the response being 1 given the data \mathbf{x}_{i*}. You can then collect these into the vector of probabilities \mathbf{p}, an n \times 1 column vector whose ith element p_i is the probability that y_i = 1.

The logistic regression model can then be written as

    \[  \log \left( \frac{\mathbf{p}}{1-\mathbf{p}} \right)  =  \mathbf{X} \cdot \boldsymbol{\beta}  \]

where \boldsymbol{\beta} = \left[  \beta_0, \beta_1, \cdots, \beta_m  \right]^T is the column vector of coefficients. Solving for \mathbf{p} gives

    \[ \mathbf{p}  = \frac{\exp( \mathbf{X} \cdot \boldsymbol{\beta})}{1+\exp(\mathbf{X} \cdot \boldsymbol{\beta})} = \frac{1}{1+\exp \left(-(\mathbf{X} \cdot \boldsymbol{\beta}) \right) }  \]
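This formula maps any real-valued \mathbf{X} \cdot \boldsymbol{\beta} to a probability between 0 and 1. A quick sketch of it in R (the names calc_p, X_demo and beta_demo are illustrative, not from the original listing):

```r
# Compute the vector of probabilities p from a design matrix X and
# a coefficient vector beta, using the logistic (sigmoid) function
calc_p <- function(X, beta) {
  as.vector(1 / (1 + exp(-(X %*% beta))))  # elementwise sigmoid of X . beta
}

# Two observations: an intercept column of 1's plus one variable
X_demo    <- matrix(c(1, 1, 0, 2), nrow = 2)
beta_demo <- c(0, 1)       # beta_0 = 0, beta_1 = 1
calc_p(X_demo, beta_demo)  # first element is exactly 0.5, since eta_1 = 0
```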

Likelihood function

Looking at our data, for each vector of features \mathbf{x}_{i*} we have an observed class y_i. This lets us write down a likelihood for the data.

This is how we’ll find the \beta coefficients –  by maximising the likelihood function with respect to \beta.

I’ll only provide the key findings below, skipping over some of the mathematical steps. Feel free to try the derivations yourself and see if you can get to the same answer.

We have that  y_i = 1 with probability p_i, and y_i = 0 with probability 1-p_i. From this we can formulate the likelihood function

    \[  L = \prod_{i=1}^n    p_i^{y_i} (1-p_i)^{(1-y_i)}   \]

and hence the log likelihood

    \[ \log(L) = \ell =  \sum_{i=1}^n \left[ y_i \log(p_i)  + (1- y_i) \log (1-p_i) \right] \]

which can be rearranged to get

    \[ \ell =  \sum_{i=1}^n   \left[ y_i \boldsymbol{\beta}^T  \mathbf{x}_{i*}  - \log \left(1 + \exp( \boldsymbol{\beta}^T \mathbf{x}_{i*} ) \right)    \right] \]

by substituting in the definition of p_i.
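This rearranged form is also convenient to compute. A sketch in R (the function name is my own):

```r
# Log likelihood of logistic regression in the rearranged form:
# sum_i [ y_i * (beta^T x_i) - log(1 + exp(beta^T x_i)) ]
log_likelihood <- function(X, y, beta) {
  eta <- as.vector(X %*% beta)  # linear predictor for each training sample
  sum(y * eta - log(1 + exp(eta)))
}

# With beta = 0 every p_i is 0.5, so the log likelihood is n * log(0.5)
log_likelihood(matrix(1, nrow = 2, ncol = 1), c(0, 1), 0)  # -2 * log(2)
```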

We can take the derivative of the log likelihood with respect to \boldsymbol{\beta} and find where it equals zero: in other words, a point where the gradient vanishes. This is the standard way of maximising the log likelihood, and it ends up being

    \[   \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^n \mathbf{x}_{i*}(y_i - p_i) = 0  \]


which is not just one equation, but a system of m+1 of them — one per coefficient!

In fact, matrix notation is the easiest way that we can represent this. We can rewrite it as

    \[  \frac{\partial \ell}{\partial \beta}  = \mathbf{X}^T (\mathbf{y}-\mathbf{p} )  \]

which is a lot easier to work with in R.

To solve this we’re going to use an iterative algorithm called the Newton-Raphson method. In order to be able to implement the method, it turns out we’re going to require the second derivative of the log likelihood. Let’s just grab that now:

    \[  \frac{\partial ^2 \ell}{\partial \beta^2} = - \sum_{i=1}^n \mathbf{x}_{i*} \mathbf{x}_{i*}^T p_i (1-p_i) \]

Again we can rewrite this in matrix form, but this time we’re going to define a new matrix \mathbf{W} to help us out. Let \mathbf{W} be an n \times n diagonal matrix where the ith diagonal element of \mathbf{W} is equal to p_i (1-p_i). Another way of saying this is W_{ii} = p_i (1-p_i) where 1 \leq i \leq n.

Rewriting gives

    \[  \frac{\partial ^2 \ell}{\partial \beta^2} = - \mathbf{X}^T \mathbf{WX} \]

The Newton-Raphson Method

Unlike linear regression, there is no explicit closed-form solution for \boldsymbol{\beta}, so we turn to iterative, non-exact methods instead. One of these is the Newton-Raphson method.

The Newton-Raphson method is an iterative method that can be used to calculate maximum likelihood estimators. Luckily for us, \boldsymbol{\beta} is a maximum likelihood estimator!

I won’t go into the method in depth, but there are plenty of good explanations of Newton-Raphson available elsewhere. The important thing to remember is that it’s an iterative method where each iteration (ideally) moves \boldsymbol{\beta} closer to its best value. “Best value” here means the \boldsymbol{\beta} with the highest likelihood, so on every iteration we should find that the likelihood increases.

The good news for us is that we’ve already got everything we need to perform the Newton-Raphson method. We’re going to have to clear up the notation a little, however.

Define \boldsymbol{\beta}_{new} as the \boldsymbol{\beta} vector that we get after performing a Newton update, and \boldsymbol{\beta}_{old} as \boldsymbol{\beta} from the previous iteration.

A Newton update is then:

    \[  \boldsymbol{\beta}_{new}  =\boldsymbol{\beta}_{old} - \left( \frac{\partial ^2 \ell}{\partial \beta^2}  \right) ^{-1}\frac{\partial \ell}{\partial \beta }   \]

Substituting in the values that we got earlier gives

    \[ \boldsymbol{\beta}_{new}  =\boldsymbol{\beta}_{old} + \left( \mathbf{X}^T \mathbf{WX}  \right) ^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p} ) \]

In each Newton update (or iteration), our vector of probabilities \mathbf{p} will update as we recalculate it using an updated \boldsymbol{\beta}. \mathbf{W} will also change with each iteration as a consequence of its direct reliance on \mathbf{p}.

Applying the Newton-Raphson method

The algorithm can be split into two sections: setup for the iterative loop, and the iterative loop itself.

For the setup:

  1. Set \boldsymbol{\beta} to some initial value – a guess for the coefficients of the logistic regression. If you already have a rough idea of what your coefficients are likely to be, you can achieve quicker convergence by setting the initial guess close to those values. This’ll lead to fewer iterations, but the difference shouldn’t be too significant.
  2. Set a threshold value \epsilon . The threshold value is used to check if the logistic regression algorithm is converging or not. The idea is that the amount that \boldsymbol{\beta} changes by should get smaller and smaller each iteration and eventually drop below our threshold value. If it never drops below the threshold value, then we can say that the algorithm isn’t converging and terminate it.
  3. Set an iterations counter. We’ll use this counter to track the number of iterations, and if we find that \boldsymbol{\beta} still isn’t converging after some predetermined number of iterations (i.e. | \boldsymbol{\beta}_{old} -\boldsymbol{\beta}_{new} | is still greater than \epsilon ) then we’ll end the algorithm.

The iterative part then goes like this:

  1.  Calculate \mathbf{p} using \mathbf{p}  = \frac{\exp(\mathbf{X} \cdot \boldsymbol{\beta})}{1+\exp( \mathbf{X} \cdot \boldsymbol{\beta})}
  2. Calculate \mathbf{W} using the updated \mathbf{p}. Remember that \mathbf{W} is a diagonal matrix where W_{ii} = p_i (1-p_i).
  3. Update \boldsymbol{\beta} by using \boldsymbol{\beta}_{new}  =\boldsymbol{\beta}_{old} + \left( \mathbf{X}^T \mathbf{WX}  \right) ^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p} ), where \boldsymbol{\beta}_{old} is the \boldsymbol{\beta} value from the previous iteration.
  4. See how much \boldsymbol{\beta} has changed by in the current iteration. If it’s less than the threshold value then end the loop and return \boldsymbol{\beta}.
  5. See if the number of iterations we’ve done is higher than some predetermined number (like 100).  If so then end the algorithm and display an error message along the lines of “the function failed to converge”.

In R

We’ll first generate some sample data. We’ll have 30 sample points and three independent variables, which we’ll imaginatively name x_1, x_2 and x_3. Our dependent variable is denoted by y.

Next we construct our design matrix \mathbf{X}, which contains our data for x_1, x_2 and x_3 as well as a vector x_0 as a column of 1’s. Remember that we need x_0 in order to estimate the intercept \beta_0.
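For example, data along these lines could be generated as follows (the seed and the “true” coefficients are my own illustrative choices):

```r
set.seed(42)  # for reproducibility
n <- 30       # number of sample points

# Three independent variables
x_1 <- rnorm(n)
x_2 <- rnorm(n)
x_3 <- rnorm(n)

# Simulate y from a known underlying model so there is something to recover
eta <- 0.5 + x_1 - 2 * x_2 + 1.5 * x_3
y   <- rbinom(n, size = 1, prob = 1 / (1 + exp(-eta)))

# Design matrix: a column of 1's (x_0) followed by the three variables
X <- cbind(x_0 = 1, x_1, x_2, x_3)
```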

Here’s the algorithm. The code pretty much follows the steps described above.
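A sketch along those lines (the default threshold, iteration cap, and starting value of zero are illustrative choices):

```r
manual_logistic_regression <- function(X, y, threshold = 1e-10, max_iter = 100) {
  # calc_p is defined inside this function, so it is only
  # visible from within manual_logistic_regression()
  calc_p <- function(X, beta) {
    as.vector(1 / (1 + exp(-(X %*% beta))))
  }

  beta <- rep(0, ncol(X))  # setup step 1: initial guess for the coefficients
  diff <- Inf              # change in beta; starts above the threshold
  iter <- 0                # setup step 3: iterations counter

  while (diff > threshold) {
    p        <- calc_p(X, beta)        # step 1: update p
    W        <- diag(p * (1 - p))      # step 2: diagonal weight matrix
    beta_new <- beta +
      as.vector(solve(t(X) %*% W %*% X) %*% t(X) %*% (y - p))  # step 3
    diff     <- sum(abs(beta_new - beta))  # step 4: how much beta changed
    beta     <- beta_new
    iter     <- iter + 1
    if (iter > max_iter) stop("the function failed to converge")  # step 5
  }
  beta
}
```

Note that solve() computes a matrix inverse here; for larger problems you’d solve the linear system directly, but the inverse keeps the code close to the formula.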

One thing to note is that I defined calc_p()  (the function to calculate \mathbf{p}) inside the manual_logistic_regression()  function. This means that calc_p()  is only accessible from within manual_logistic_regression() ; other functions won’t be able to call it. Keep this in mind if you want to extend this code or test it.

We can go ahead and see how well our version compares against R’s implementation of logistic regression. We usually run logistic regression by calling the glm()  function whilst setting the family argument to “binomial”.

Let’s compare the two:
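A self-contained sketch of that comparison (the data here is illustrative, and I use 100 points rather than 30 so the estimates are stable):

```r
set.seed(42)
n <- 100
x_1 <- rnorm(n); x_2 <- rnorm(n); x_3 <- rnorm(n)
y <- rbinom(n, 1, 1 / (1 + exp(-(0.5 + x_1 - x_2 + 0.5 * x_3))))
X <- cbind(1, x_1, x_2, x_3)

# Manual Newton-Raphson fit, as derived above
beta <- rep(0, ncol(X))
for (i in 1:100) {
  p <- as.vector(1 / (1 + exp(-(X %*% beta))))
  beta_new <- beta +
    as.vector(solve(t(X) %*% diag(p * (1 - p)) %*% X) %*% t(X) %*% (y - p))
  converged <- sum(abs(beta_new - beta)) < 1e-10
  beta <- beta_new
  if (converged) break
}

# R's built-in logistic regression
r_beta <- unname(coef(glm(y ~ x_1 + x_2 + x_3, family = "binomial")))

round(cbind(manual = beta, glm = r_beta), 6)  # the two columns should match
```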

And that’s all there is to it.
