The logistic regression framework is used when we are interested in a dependence structure between a dependent variable (response) and a set of one or more independent (explanatory) variables (Tranmer and Elliot, 2008).
Linear regression is considered appropriate when a continuous response variable is modelled with multiple independent variables. However, variables are not always specified on an interval scale; they may instead be categorical, such as college major (Math, Physics, etc.), blood type (AB, A, B, etc.), pregnancy status (Pregnant, Not Pregnant) and more. For instance, if we had to predict college major from a set of explanatory variables such as age, sex, ethnic group and so forth, linear least squares regression would fail badly. The reason is that most of the assumptions of the linear regression model would not be met (e.g. least squares regression estimates assume symmetric error distributions and a response variable defined on the real line) (Montgomery et al., 2012, page 171). Instead, we use a logistic regression model.
Considering the problem of trying to predict a binary variable $Y \in \{0, 1\}$, we write the logistic regression model as
\[
\log\left(\frac{\pi(X)}{1 - \pi(X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \tag{1}
\]
where $X = (X_1, X_2, \dots, X_p)$ are $p$ predictors measured on $n$ observations and $X_j = (x_{1j}, x_{2j}, \dots, x_{nj})^T$, $j = 1, 2, \dots, p$. $X$ can be represented as an $n \times p$ matrix commonly known as the design matrix. It is also represented as $X = (x_1^T, x_2^T, \dots, x_n^T)^T$ with $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T$ (James et al., 2013). The left-hand side of equation 1 is known as the log of odds or logit function, and it measures the odds of observing an outcome relative to the absence of that outcome. Accordingly, the fraction $\frac{\pi(X)}{1 - \pi(X)}$ is called the odds, and its formula
\[
\frac{\pi(X)}{1 - \pi(X)} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p} \tag{2}
\]
can be derived from equation 1.
Similarly, equation 2 is transformed into the logistic function shown below,
\[
\pi(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p}} \tag{3}
\]
with $\pi(X) = E(Y \mid X)$.
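The step from equation 2 to equation 3 is a short rearrangement; writing $\eta = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$ for the linear predictor,
\[
\frac{\pi(X)}{1 - \pi(X)} = e^{\eta}
\;\Longrightarrow\;
\pi(X) = e^{\eta}\,\bigl(1 - \pi(X)\bigr)
\;\Longrightarrow\;
\pi(X)\,\bigl(1 + e^{\eta}\bigr) = e^{\eta}
\;\Longrightarrow\;
\pi(X) = \frac{e^{\eta}}{1 + e^{\eta}}.
\]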
Paying closer attention to equation 1, a one-unit increase in a predictor $X_j$, while keeping the rest fixed, increases the log of odds by $\beta_j$; in contrast to linear regression, the coefficient acts on the log-odds scale rather than on the response itself. In addition, from equation 2 it is clear that there is no linear relationship between $\pi(X)$ and $X$: a unit change in $X$ does not correspond to a change of $\beta$ in $\pi(X)$ (Walker and Duncan, 1967).
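As a quick numerical illustration of this interpretation (with made-up coefficient values, chosen only for this sketch), the snippet below evaluates equations 1 and 3 for a single predictor: each unit increase in $X_1$ adds $\beta_1$ to the log odds and multiplies the odds by $e^{\beta_1}$, while the corresponding change in $\pi(X)$ depends on where on the curve we start.

```python
import numpy as np

# Hypothetical coefficients for a model with a single predictor X1
beta0, beta1 = -1.5, 0.7

def log_odds(x1):
    # Equation 1: the linear predictor on the log-odds scale
    return beta0 + beta1 * x1

def prob(x1):
    # Equation 3: the logistic function mapping the log odds into (0, 1)
    eta = log_odds(x1)
    return np.exp(eta) / (1.0 + np.exp(eta))

for x1 in (0.0, 1.0, 2.0):
    print(f"x1={x1:.0f}  log-odds={log_odds(x1):+.2f}  "
          f"odds={np.exp(log_odds(x1)):.3f}  pi={prob(x1):.3f}")

# A one-unit increase in X1 always adds beta1 to the log odds and
# multiplies the odds by exp(beta1) ~= 2.01, but the change in pi
# itself varies along the curve (the relationship is nonlinear).
print("odds ratio per unit of X1:", round(float(np.exp(beta1)), 3))
```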
Logistic regression is a statistical method, belonging to the class of models known as Generalized Linear Models (GLM), that is commonly used for regression analysis when a binary (dichotomous) response variable is examined. Generalized Linear Models consist of three parts (McCullagh and Nelder, 1989); a short fitting sketch follows the list.
• Random Component: the probability distribution of the response variable $Y$; e.g. $Y$ follows a binomial distribution in binary logistic regression.
• Systematic Component: the covariates $X_1, X_2, \dots, X_p$ produce a linear predictor given by
\[
\eta = \sum_{j=1}^{p} X_j \beta_j
\]
• Link Function: the link between the random and the systematic component, specified by a function $g(\cdot)$, so that $\eta_i = g(\mu_i)$; e.g. $\eta = \mathrm{logit}(\pi)$ for logistic regression.
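To make the three components concrete, here is a minimal fitting sketch, assuming the statsmodels library and simulated data (neither appears in the original text): the binomial family plays the role of the random component, the design matrix supplies the linear predictor $\eta = X\beta$, and the logit link ties the two together.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated data: two predictors and a binary response generated
# from a known logistic model with coefficients (-0.5, 1.2, -0.8).
n = 500
X = rng.normal(size=(n, 2))
X_design = sm.add_constant(X)                  # design matrix with an intercept column
eta = X_design @ np.array([-0.5, 1.2, -0.8])   # systematic component
p = 1.0 / (1.0 + np.exp(-eta))                 # inverse of the logit link
y = rng.binomial(1, p)                         # random component: binomial response

# GLM with a binomial random component; the logit link is the default
model = sm.GLM(y, X_design, family=sm.families.Binomial())
result = model.fit()
print(result.params)   # estimates of beta_0, beta_1, beta_2
```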
In contrast to continuous variables, proportions and probabilities require different handling. The latter are obliged to lie within the range of 0 to 1, whereas a continuous variable can be unbounded, ranging from minus infinity to infinity (Tranmer and Elliot, 2008). The above-mentioned logit link serves this exact purpose: its inverse, the logistic function of equation 3, maps the unbounded outcome of the systematic component into the admissible predicted probability range of 0 to 1. In comparison to linear regression's straight-line output, the fitted logistic curve is always S-shaped.
Figure 3: Difference between the linear regression’s straight line and the S-shaped curve of
logistic regression (James et al., 2013).
On the left of figure 3 we have the output of a linear regression model fitted to randomly generated data, and on the right-hand side the S-shaped logistic function. We see that by using the logit link we avoid the modelling problem of negative predicted probabilities and of predicted probabilities exceeding 1 (James et al., 2013, page 131).
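A small sketch in the same spirit as figure 3 (simulated data, not the data behind the figure itself, and using the scikit-learn library as an assumption of this example) makes the point numerically: a straight-line fit to a binary outcome can produce predicted "probabilities" below 0 or above 1, while the logistic fit always stays inside the unit interval.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)

# Simulated binary outcome driven by a single predictor
x = rng.uniform(-4, 4, size=300).reshape(-1, 1)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x.ravel())))
y = rng.binomial(1, p_true)

# Straight-line fit (least squares) vs logistic fit
linear = LinearRegression().fit(x, y)
logistic = LogisticRegression().fit(x, y)

grid = np.linspace(-4, 4, 5).reshape(-1, 1)
print("linear predictions:  ", np.round(linear.predict(grid), 3))
print("logistic predictions:", np.round(logistic.predict_proba(grid)[:, 1], 3))
# The linear predictions fall below 0 and above 1 at the extremes,
# whereas the logistic predictions always lie strictly between 0 and 1.
```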
Another difference that distinguishes logistic regression from linear regression is the distribution of the error term $\varepsilon$. This term denotes the deviation of an observation from the conditional mean,
\[
y = E(Y \mid X) + \varepsilon
\]
with $y$ being an observation of the response variable and $E(Y \mid X)$ the conditional mean of the response $Y$ given the data $X$. For a continuous response variable, given that $\varepsilon$ follows a normal distribution with mean 0 and constant variance, the conditional distribution of the response is also normal, with mean $E(Y \mid X)$ and the same constant variance. However, this is not the case when the response variable is dichotomous. We express an outcome of the response variable as
\[
y(X) = \pi(X) + \varepsilon
\]
In this case, the error term $\varepsilon$ can only take the value $1 - \pi(X)$, with probability $\pi(X)$, or the value $-\pi(X)$, with probability $1 - \pi(X)$. Hence, the error $\varepsilon$ has mean 0 and variance $\pi(X)(1 - \pi(X))$, and the conditional distribution of the response variable is binomial with probability $\pi(X)$ (Hosmer Jr et al., 2013, page 7). Finally, logistic regression uses maximum likelihood estimation (MLE) to estimate the values of the coefficients, in contrast to the least squares estimates (LSE) of linear regression. Maximum likelihood estimation derives from the distribution of the dependent variable. Since every observation $y_i$ represents a binomial count from the $i$-th population, the joint probability function of $Y$ is expressed as
\[
f(Y \mid \beta) = \prod_{i=1}^{n} \frac{n_i!}{y_i!\,(n_i - y_i)!}\; \pi_i^{y_i} (1 - \pi_i)^{n_i - y_i} \tag{4}
\]
where every $\pi_i$ is related to the coefficients as displayed in equation 1 and $i = 1, 2, \dots, n$ indexes the observations. Similarly, equation 4 can be written as
\[
L(\beta \mid Y) = \prod_{i=1}^{n} \left(\frac{\pi_i}{1 - \pi_i}\right)^{y_i} (1 - \pi_i)^{n_i} \tag{5}
\]
which corresponds to a simplified version of the likelihood, since the terms $\frac{n_i!}{y_i!\,(n_i - y_i)!}$ do not contain any $\pi_i$ values and can therefore be omitted during the maximisation of equation 5 (Czepiel, 2002). It can be shown that maximising equation 5 is equivalent to minimising the negative log-likelihood $-\ell(\beta)$, where
\[
\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \left( \sum_{k=0}^{p} x_{ik} \beta_k \right) - n_i \log\!\left( 1 + \exp\!\left( \sum_{k=0}^{p} x_{ik} \beta_k \right) \right) \right] \tag{6}
\]
with $x_{i0} = 1$ accounting for the intercept, and can be derived from equations 5 and 2.
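As a sketch of how equation 6 is used in practice, the code below implements the negative log-likelihood for the ungrouped case $n_i = 1$ and minimises it numerically with scipy; the data are simulated purely for illustration, and the routine recovers coefficients close to those used to generate them.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(2)

# Simulated data, ungrouped case (n_i = 1 for every observation)
n, p = 400, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # x_i0 = 1 (intercept)
beta_true = np.array([0.3, -1.0, 0.6])
y = rng.binomial(1, expit(X @ beta_true))

def neg_log_likelihood(beta):
    # Negative of equation 6 with n_i = 1:
    #   -sum_i [ y_i * (x_i' beta) - log(1 + exp(x_i' beta)) ]
    eta = X @ beta
    return -np.sum(y * eta - np.log1p(np.exp(eta)))

# Maximum likelihood estimates via numerical minimisation
result = minimize(neg_log_likelihood, x0=np.zeros(p + 1), method="BFGS")
print("MLE coefficients: ", np.round(result.x, 3))
print("true coefficients:", beta_true)
```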
This method is widely preferred due to its statistical properties. The intuition behind MLE is to find the parameter values that maximise the ‘agreement’ between the data and the model's predictions, and thereby estimate the coefficients $\beta_0, \beta_1, \dots, \beta_p$ (Czepiel, 2002).
Note that several distributions have been proposed over time, as stated in McCullagh and Nelder (1989, page 31). However, the logistic distribution is most commonly selected thanks to its simple form and flexibility.