Logistic regression Quantitative Statistical Methods Dr. Szilágyi Roland
Connection Analysis: choosing a method by the measurement scale of the variables
- Independent variable (x) qualitative, dependent (y) qualitative: crosstabs
- Independent variable (x) qualitative, dependent (y) quantitative: ANOVA
- Independent variable (x) quantitative, dependent (y) qualitative: discriminant analysis, logistic regression
- Independent variable (x) quantitative, dependent (y) quantitative: correlation and regression analysis
Logistic regression Logistic regression is a multivariate method that helps to predict the classification of cases into groups on the basis of independent variables. The analysis identifies those independent variables (x) that cause a significant difference between the categories of the dependent variable. Two variants: binary (the dependent variable has two categories) and multinomial (the dependent variable has more than two categories).
Logistic regression in practice Market research: modelling choices (buy or not buy), segmentation reliability. Enterprise analysis (default, non-default), etc.
Stages of Analysis
1. General Purpose
2. Assumptions
3. Estimation of Function Coefficients
4. Interpretation of Results
5. Validity Tests
General purposes
- To create a logistic regression function that best separates the categories of the dependent variable as a linear combination of the independent variables.
- To determine whether there is a significant difference among the groups according to the independent variables.
- To determine which independent variables explain most of the differences among the groups.
- Based on the experience obtained from a known classification, to predict the group membership of new cases by analyzing their independent variables.
- To measure the accuracy of the classification.
The Assumptions for Logistic Regression 1. Measure of variables The dependent variable should be coded with m (at least 2) discrete values (e.g. 1 = good student, 2 = bad student; or 1 = prominent student, 2 = average, 3 = bad student). The independent variables can be measured on any scale.
The Assumptions for Logistic Regression 2. Independence Not only the explanatory variables but also all cases must be independent. Therefore panel, longitudinal, or pre-test data cannot be used for logistic regression analysis.
The Assumptions for Logistic Regression 3. Sample size As a general rule, the larger the sample size, the more significant the model. The ratio of the number of observations to the number of variables is also important: the results generalize better if we have at least 60 observations.
The Assumptions for Logistic Regression 4. Multivariate normal distribution In the case of normal distribution, the estimation of the parameters is easier, because the parameters can be defined from the density or distribution function. Normality can be checked with histograms of the frequency distributions or with hypothesis tests.
The Assumptions for Logistic Regression 5. Multicollinearity The independent variables should be correlated with the dependent variable; however, there should be no correlation among the independent variables themselves, because multicollinearity can bias the results of the analysis.
Binary Logistic Regression The logistic function is useful because its input can be any linear combination of the independent variables (X_i), whereas its output always takes values between zero and one and hence is interpretable as a probability. The logistic function is defined as follows:

$$P(Y = 1 \mid X) = F(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}$$

Note that F(x) is interpreted as the probability of the dependent variable equaling a "success" or "case" rather than a failure or non-case.
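As a minimal illustration (not from the slides; the coefficient values below are made up), this Python sketch evaluates the logistic function for a given coefficient vector:

```python
import numpy as np

def logistic(x, beta):
    """Estimated P(Y = 1 | x) for coefficients beta = (b0, b1, ..., bp)."""
    z = beta[0] + np.dot(beta[1:], x)   # linear combination b0 + b1*x1 + ... + bp*xp
    return 1.0 / (1.0 + np.exp(-z))     # output is always strictly between 0 and 1

# two predictors with illustrative, made-up coefficients
print(logistic(np.array([10.0, 2.5]), np.array([-2.0, 0.12, 0.5])))  # ~0.61
```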
Binary Logistic Regression We can now define the inverse of the logistic function, the logit (log odds):

$$F^{-1}(x) = \ln\frac{F(x)}{1 - F(x)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

and, after exponentiating,

$$\text{odds} = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}$$
Binary Logistic Regression The odds of the dependent variable equaling a case (given some linear combination of the predictors x_i) is equivalent to the exponential function of the linear regression expression:

$$P = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}, \qquad \text{odds} = \frac{P}{1 - P} = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}$$
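A quick numeric check of the probability-odds-logit round trip (a sketch, with an arbitrary probability of 0.8):

```python
import numpy as np

p = 0.8                                        # P(Y = 1 | x)
odds = p / (1 - p)                             # odds = P / (1 - P) = 4.0
logit = np.log(odds)                           # log odds = b0 + b1*x1 + ... + bp*xp
print(odds, logit, 1 / (1 + np.exp(-logit)))   # the logistic function recovers p = 0.8
```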
Maximum Likelihood Method The maximum likelihood method finds the set of coefficients (β), called the maximum likelihood estimates, at which the log-likelihood function attains its maximum:

$$\ln L = \sum_{i=1}^{n} \left[ y_i \left(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\right) - \ln\left(1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}}\right) \right] \rightarrow \max$$

Source: Hajdu Ottó: Többváltozós statisztikai számítások; KSH, Budapest, 2003.
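A minimal sketch of the estimation step, assuming simulated toy data rather than the lecture's dataset (SciPy's general-purpose optimizer stands in here for SPSS's internal algorithm):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(beta, X, y):
    """-ln L(beta); X carries a leading column of ones for the intercept."""
    z = X @ beta
    # ln L = sum_i [ y_i * z_i - ln(1 + e^{z_i}) ]
    return -np.sum(y * z - np.logaddexp(0.0, z))

rng = np.random.default_rng(0)                      # made-up data for the sketch
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (rng.random(200) < 1 / (1 + np.exp(-(X @ np.array([-1.0, 2.0]))))).astype(float)

res = minimize(neg_log_likelihood, x0=np.zeros(2), args=(X, y), method="BFGS")
print(res.x)  # maximum likelihood estimates, close to the true (-1, 2)
```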
Tests of Model Fit The Binary Logistic Regression procedure reports the Hosmer-Lemeshow goodness-of-fit statistic, which helps you determine whether the model adequately describes the data. H0: the model fits; H1: the model does not fit. The Hosmer-Lemeshow test partitions the cases into subgroups (deciles of fitted risk values). Models for which the expected and observed event rates in the subgroups are similar (judged by a chi-square statistic) are called fitted (well calibrated).
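A sketch of how the statistic is formed, assuming the fitted probabilities (`p_hat`, a hypothetical name) are already available; SPSS computes this internally, and its group boundaries may differ slightly:

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, g=10):
    """Compare observed vs. expected events within g groups (deciles of fitted risk)."""
    order = np.argsort(p_hat)
    hl = 0.0
    for idx in np.array_split(order, g):              # decile subgroups
        obs1, exp1 = y[idx].sum(), p_hat[idx].sum()   # observed / expected events
        obs0, exp0 = len(idx) - obs1, len(idx) - exp1
        hl += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return hl, chi2.sf(hl, df=g - 2)                  # large p-value: no lack of fit
```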
Testing of parameters (β)

$$H_0: \beta_i = 0 \qquad H_1: \beta_i \neq 0$$

$$\text{Wald}_i = \left(\frac{b_i}{s(b_i)}\right)^2$$
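For instance, a sketch using one coefficient from the worked example later in this deck (small differences from the printed Wald value come from rounding B and S.E.):

```python
from scipy.stats import chi2

b, se = -0.247, 0.034             # B and S.E. for "Years with current employer" (step 4)
wald = (b / se) ** 2              # ~52.8; chi-square with 1 df under H0: beta_i = 0
print(wald, chi2.sf(wald, df=1))  # p < 0.001, so H0 is rejected
```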
Choosing the Right Model
- Based on the residual sum of squares (linear regression)
- Based on the likelihood ratio (compare the likelihood of the model with the likelihood of a baseline (minimal) model)
- Based on the proportion of good predictions
Pseudo R² Cox and Snell's R² is based on the log-likelihood of the model compared to the log-likelihood of a baseline model. However, with categorical outcomes it has a theoretical maximum value of less than 1, even for a "perfect" model. Nagelkerke's R² is an adjusted version of the Cox & Snell R-square that rescales the statistic to cover the full range from 0 to 1.
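Both statistics can be derived directly from the log-likelihoods, as in this sketch; the -2LL inputs are approximate figures chosen to match step 4 of the worked example below:

```python
import numpy as np

def pseudo_r2(ll_model, ll_null, n):
    """Cox & Snell and Nagelkerke R^2 from model and baseline log-likelihoods."""
    cox_snell = 1 - np.exp(2 * (ll_null - ll_model) / n)
    max_cs = 1 - np.exp(2 * ll_null / n)      # Cox & Snell's theoretical maximum (< 1)
    return cox_snell, cox_snell / max_cs      # Nagelkerke rescales onto [0, 1]

# -2LL_null ~ 559.4 and -2LL_model ~ 394.2 on n = 499 cases (approximate)
print(pseudo_r2(-394.2 / 2, -559.4 / 2, 499))  # roughly (0.28, 0.42), cf. step 4 below
```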
Example If you are a loan officer at a bank, you want to identify characteristics that are indicative of people who are likely to default on loans, and use those characteristics to identify good and bad credit risks. Variables:
- Age in years
- Level of education
- Years with current employer
- Years at current address
- Household income in thousands
- Debt to income ratio (x100)
- Credit card debt in thousands
- Other debt in thousands
- Previously defaulted
Outputs
Classification Table (Step 0)

                                  Predicted
                       Selected Cases            Unselected Cases
Observed               No    Yes   % Correct     No    Yes   % Correct
Previously      No     375   0     100,0         142   0     100,0
defaulted       Yes    124   0     ,0            59    0     ,0
Overall Percentage                 75,2                      70,6

a. Constant is included in the model.
b. The cut value is ,500
Source: Help - IBM SPSS Statistics
Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1      3,292        8    ,915
2      11,866       8    ,157
3      9,447        8    ,306
4      4,027        8    ,855

Source: Help - IBM SPSS Statistics
Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      498,012             ,116                   ,172
2      447,130             ,201                   ,299
3      411,553             ,257                   ,381
4      394,172             ,281                   ,417

Source: Help - IBM SPSS Statistics
Classification Table

                                        Predicted
                             Selected Cases            Unselected Cases
Observed                     No    Yes   % Correct     No    Yes   % Correct
Step 1  Previously     No    361   14    96,3          137   5     96,5
        defaulted      Yes   100   24    19,4          45    14    23,7
        Overall Percentage               77,2                      75,1
Step 2  Previously     No    351   24    93,6          136   6     95,8
        defaulted      Yes   80    44    35,5          36    23    39,0
        Overall Percentage               79,2                      79,1
Step 3  Previously     No    348   27    92,8          135   7     95,1
        defaulted      Yes   72    52    41,9          28    31    52,5
        Overall Percentage               80,2                      82,6
Step 4  Previously     No    352   23    93,9          130   12    91,5
        defaulted      Yes   67    57    46,0          27    32    54,2
        Overall Percentage               82,0                      80,6
Classification table (confusion matrix)

                       Predicted: no (0)                  Predicted: yes (1)
Observed: no (0)       true negative (TN)                 false positive (FP), Type I error
Observed: yes (1)      false negative (FN), Type II error true positive (TP)

negative predictive value = TN / (TN + FN)
positive predictive value (precision) = TP / (FP + TP)
specificity = TN / (TN + FP)
sensitivity = TP / (FN + TP)
accuracy = (TP + TN) / (TN + FP + FN + TP)
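The rates above follow directly from the four cell counts; a small sketch, fed with the step 4 selected-cases counts from the classification table:

```python
def classification_metrics(tn, fp, fn, tp):
    """All five rates from the confusion-matrix cell counts."""
    return {
        "accuracy":    (tp + tn) / (tn + fp + fn + tp),
        "sensitivity": tp / (fn + tp),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision":   tp / (fp + tp),   # positive predictive value
        "npv":         tn / (tn + fn),   # negative predictive value
    }

# step 4, selected cases: TN = 352, FP = 23, FN = 67, TP = 57
print(classification_metrics(352, 23, 67, 57))  # accuracy ~0.82; sensitivity ~0.46
```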
Variables in the Equation

                                          B        S.E.   Wald      df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                                         Lower    Upper
Step 1  Debt to income ratio (x100)       ,121     ,017   52,676    1    ,000   1,129    1,092    1,166
        Constant                          -2,476   ,230   116,315   1    ,000   ,084
Step 2  Years with current employer       -,140    ,023   38,518    1    ,000   ,869     ,831     ,909
        Debt to income ratio (x100)       ,134     ,018   54,659    1    ,000   1,143    1,103    1,185
        Constant                          -1,621   ,259   39,038    1    ,000   ,198
Step 3  Years with current employer       -,244    ,033   54,676    1    ,000   ,783     ,734     ,836
        Debt to income ratio (x100)       ,069     ,022   9,809     1    ,002   1,072    1,026    1,119
        Credit card debt in thousands     ,506     ,101   25,127    1    ,000   1,658    1,361    2,021
        Constant                          -1,058   ,280   14,249    1    ,000   ,347
Step 4  Years with current employer       -,247    ,034   51,826    1    ,000   ,781     ,731     ,836
        Years at current address          -,089    ,023   15,091    1    ,000   ,915     ,875     ,957
        Debt to income ratio (x100)       ,072     ,023   10,040    1    ,002   1,074    1,028    1,123
        Credit card debt in thousands     ,602     ,111   29,606    1    ,000   1,826    1,470    2,269
        Constant                          -,605    ,301   4,034     1    ,045   ,546
Meaning of coefficients The meaning of a logistic regression coefficient is not as straightforward as that of a linear regression coefficient. While B is convenient for testing the usefulness of predictors, Exp(B) is easier to interpret: it represents the ratio change in the odds of the event of interest for a one-unit change in the predictor (X_i), ceteris paribus (all other things being equal). Source: Help - IBM SPSS Statistics
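A worked illustration with the step 4 coefficient for credit card debt from the table above:

```python
import numpy as np

b = 0.602         # B for "Credit card debt in thousands" (step 4)
print(np.exp(b))  # Exp(B) ~ 1.826: each additional thousand of credit card debt
                  # multiplies the odds of default by about 1.83, all other
                  # predictors held constant
```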
Thank you for your attention! email: strolsz@uni-miskolc.hu