Miskolci Egyetem Gazdaságtudományi Kar Üzleti Információgazdálkodási és Módszertani Intézet. Correlation & Regression

Correlation & Regression

Types of dependence
- association: between nominal data
- mixed: between a nominal and a ratio variable
- correlation: among ratio data

Correlation describes the strength of a relationship: the degree to which one variable is linearly related to another.
Regression shows us how to determine the nature of a relationship between two or more variables.
- X (or $X_1, X_2, \dots, X_p$): known variable(s) / independent variable(s) / predictor(s)
- Y: unknown variable / dependent variable
- causal relationship: X causes Y to change

Correlation Measures
1. Covariance
2. Coefficient of correlation
3. Coefficient of determination
4. Coefficient of rank correlation

1. Covariance
A measure of the joint variation of the two variables: the average of the products of the deviations of the observations from their sample means.

$$C = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{\sum d_x d_y}{n - 1}$$

- It ranges from $-\infty$ to $+\infty$.
- $C = 0$ when X and Y are uncorrelated.
- Its sign shows the direction of the correlation.
- It doesn't measure the degree of the relationship!

2. Linear correlation coefficient (Pearson correlation)

$$r = \frac{C}{s_x s_y} = \frac{\sum d_x d_y}{\sqrt{\sum d_x^2 \sum d_y^2}}$$

A measure of how closely related two data series are.
- Its sign shows the direction of the correlation.
- Its absolute value measures the strength of the correlation.
- $0 < |r| < 1$: statistical dependence
- $r = 0$: X and Y are uncorrelated
- $r = -1$: perfect negative linear relationship
- $r = 1$: perfect positive linear relationship
It can be used only in the case of a linear relationship!

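3. Coefficient of determination $r^2$
The square of the sample correlation coefficient between the dependent and independent variables.
- It varies from 0 to 1 and is usually expressed as a percentage (%).
- It shows what percentage of the variance of the dependent variable is explained by the independent variable.

$$r^2 = \frac{S_{\hat{y}}}{S_y} = 1 - \frac{S_e}{S_y}$$

where $S_{\hat{y}}$ is the regression sum of squares, $S_e$ the residual sum of squares and $S_y$ the total sum of squares.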
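The three measures defined so far can be computed from raw paired data. The following is a minimal Python sketch; the function name `correlation_measures` and the five data pairs are hypothetical and serve only as an illustration.

```python
import math

def correlation_measures(x, y):
    """Return (covariance, Pearson r, r squared) for paired samples x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Deviations from the sample means (d_x and d_y in the slides).
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    cov = sum_dxdy / (n - 1)
    r = sum_dxdy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return cov, r, r * r

# Hypothetical illustration data, not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
cov, r, r2 = correlation_measures(x, y)
print(f"C = {cov:.3f}, r = {r:.4f}, r^2 = {r2:.1%}")
```

The covariance depends on the units of X and Y, while r and r squared do not, which is why only the latter two describe the strength of the relationship.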

[Figure: scatter plot "No relationship"; x-axis: number of storks, y-axis: number of births.]

[Figure: scatter plot "Independence" (no correlation), with the fitted regression line and its R-Sq.]

[Figure: scatter plot "Positive correlation", with the fitted regression line and its R-Sq.]

[Figure: scatter plot "Negative correlation", with the fitted regression line and its R-Sq.]

[Figure: scatter plot "Curvilinear relation" (nonlinear correlation), with a fitted quadratic curve and its R-Sq.]

Scatter diagrams
[Figure: four scatter plots arranged by shape (linear, curvilinear) and direction (direct relationship, positive slope; inverse relationship, negative slope).
- linear, direct: sales vs. advertising (in $)
- linear, inverse: selling price vs. age of a house (years)
- curvilinear, direct: wastage vs. production (number of products per day)
- curvilinear, inverse: selling price vs. age of a car (years)]

Example A firm administers a test to sales trainees before they go into the field. The management of the firm is interested in determining the relationship between the test scores and the sales made by the trainees at the end of one year in the field. The following data were collected for 45 sales personnel who have been in the field one year. Calculate different correlation measures!

X: test score (independent variable); Y: number of units sold (dependent variable); $d_x = x_i - \bar{x}$, $d_y = y_i - \bar{y}$

Salesperson   Test score x_i   Units sold y_i    d_x    d_y    d_x d_y
K. A.               25              188           +9    +22       +198
L. Z.               16              157            0     -9          0
B. E.               30              165          +14     -1        -14
G. P.                5              124          -11    -42       +462
S. G.               10              158           -6     -8        +48
J. T.               24              224           +8    +58       +464
V. P.               17              169           +1     +3         +3
T. L.                6              114          -10    -52       +520
...
Total              716            7,464            0      0    8,894.5

Number of observed pairs: $n = 45$
$\bar{x} = 16$, $\bar{y} = 166$, $s_x = 8.26$, $s_y = 30.99$

$$C = \frac{\sum d_x d_y}{n - 1} = \frac{8894.5}{45 - 1} = 202.15$$

Positive correlation.

$$r = \frac{C}{s_x s_y} = \frac{202.15}{8.26 \cdot 30.99} = 0.7897 \qquad r^2 = 0.6236 = 62.36\,\%$$

There is a strong, positive relationship between test scores and the number of units sold. The test scores explain 62.36 percent of the variation in the number of units sold.
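The arithmetic of this example can be checked in a few lines. This is only a sketch that reuses the totals quoted in the example (sum of d_x d_y = 8894.5, n = 45, r = 0.7897); the variable names are hypothetical.

```python
# Totals taken from the worked example.
sum_dxdy = 8894.5   # sum of the products of deviations
n = 45              # number of observed pairs
r = 0.7897          # correlation coefficient reported in the example

covariance = sum_dxdy / (n - 1)   # C = sum(d_x * d_y) / (n - 1)
r_squared = r ** 2                # coefficient of determination

print(f"C = {covariance:.2f}")
print(f"r^2 = {r_squared:.2%}")
```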

4. Coefficient of rank correlation (Spearman correlation)

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}, \qquad -1 \le \rho \le 1$$

A measure of the relationship between two ordinal variables, where n = number of paired observations and $d_i$ = difference between the ranks for each pair of observations.
- perfect correlation: $\rho = 1$
- perfect inverse correlation: $\rho = -1$
- in case of independence: $\rho = 0$

Example
Ten students were ranked by their mathematical and musical ability:

Student             A    B    C    D    E    F    G    H    I    J   Total
Mathematics (x_i)   1    2    3    4    5    6    7    8    9   10     -
Music (y_i)         3    4    1    2    5    7   10    6    8    9     -
d_i = x_i - y_i    -2   -2    2    2    0   -1   -3    2    1    1     0
d_i^2               4    4    4    4    0    1    9    4    1    1    32

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \cdot 32}{10(10^2 - 1)} = 0.806$$

Strong relationship.
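The rank correlation of this example can be reproduced with a short Python sketch. The function name `spearman_rho` is hypothetical, and the formula as written assumes that there are no tied ranks.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman rank correlation for untied ranks: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Ranks of students A..J from the example.
mathematics = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
music = [3, 4, 1, 2, 5, 7, 10, 6, 8, 9]

rho = spearman_rho(mathematics, music)
print(f"rho = {rho:.3f}")
```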

Simple Linear Regression Model
We model the relationship between two variables, X and Y, as a straight line. The model contains two parameters: an intercept parameter and a slope parameter.

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where:
- y: dependent or response variable (the variable we wish to explain or predict)
- x: independent or predictor variable
- $\varepsilon$: random error component
- $\beta_0$: y-intercept of the line, i.e. the point at which the line intercepts the y-axis
- $\beta_1$: slope of the line

[Figure: the line $E(y) = \beta_0 + \beta_1 x$, with the y-intercept $\beta_0$ and the slope $\beta_1$ marked.]

Assumptions of the Linear Regression Model
Assumptions for the error term:
- normally distributed;
- expected value = 0 ($E(\varepsilon) = 0$);
- the variance is the same for all observations (homoscedasticity);
- uncorrelated across observations (there isn't any autocorrelation).
Assumptions for the independent variables: not random, etc.

[Figure: observed points scattered around the fitted line $\hat{y}_i = b_0 + b_1 x_i$; the line is the deterministic component and the vertical distance of a point from the line is the random error.]

y = deterministic component + random error

We always assume that the mean value of the random error equals 0, so the mean value of y equals the deterministic component.
It is possible to find many lines for which the sum of the errors is equal to 0, but there is one (and only one) line for which the SSE (sum of squares of the errors) is a minimum: the least squares line, or regression line.

The method of least squares gives us the best linear unbiased estimators of the regression parameters $\beta_0$ and $\beta_1$.
The least-squares estimators: $b_0$ estimates $\beta_0$, $b_1$ estimates $\beta_1$.
The regression line ($\hat{y}$: "y caret" or "y hat"):

$$\hat{y} = b_0 + b_1 x$$

The normal equations (with one x):

$$\sum y = n b_0 + b_1 \sum x$$
$$\sum xy = b_0 \sum x + b_1 \sum x^2$$
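Solving the two normal equations for b_0 and b_1 gives closed-form expressions that are easy to code. The sketch below assumes a single independent variable; the function name `least_squares_line` and the data are hypothetical.

```python
def least_squares_line(x, y):
    """Solve the two normal equations of simple linear regression for (b0, b1)."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    # Eliminating b0 from the normal equations yields the slope.
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # The first normal equation then gives the intercept.
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1

# Hypothetical illustration data, not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(x, y)
print(f"y_hat = {b0:.2f} + {b1:.2f} x")
```

With these data the fitted line passes through the point of means (3, 4), as every least squares line does.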

Interpretation
- $b_0$: when x = 0, $\hat{y} = b_0$
- $b_1$: for every one-unit increase in x we expect y to change by $b_1$ units

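Elasticity
Elasticity relates the percentage change in y to the percentage change in x:

$$E(y, x) = \frac{\%\ \text{change in } y}{\%\ \text{change in } x} = \frac{b_1 x}{b_0 + b_1 x}$$

Elasticity at the mean:

$$E(y, x) = b_1 \frac{\bar{x}}{\bar{y}}$$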
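As an illustration, the two elasticity formulas can be written as small Python functions. The function names and the numeric values (b_0 = 2.2, b_1 = 0.6, means 3 and 4) are hypothetical.

```python
def elasticity(b0, b1, x):
    """Elasticity of y with respect to x at a given x on the line y_hat = b0 + b1*x."""
    return b1 * x / (b0 + b1 * x)

def elasticity_at_mean(b1, mean_x, mean_y):
    """Elasticity evaluated at the sample means."""
    return b1 * mean_x / mean_y

# Hypothetical fitted line and sample means.
b0, b1 = 2.2, 0.6
e_point = elasticity(b0, b1, x=3)
e_mean = elasticity_at_mean(b1, mean_x=3, mean_y=4)
print(f"E at x = 3: {e_point:.2f}, E at the mean: {e_mean:.2f}")
```

Because the least squares line passes through the point of means, the two values agree when x equals the mean of x.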

Estimation in Regression
Regression estimation is a technique used to replace missing values in data.
We need to know:
1. the estimated parameter value;
2. the hypothesized value of the parameter;
3. the confidence interval around the estimated parameter.
The number of degrees of freedom equals the number of observations minus the number of parameters estimated: df = n - 2.

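Estimation in Regression

Parameter   Estimated value   Standard error
$\beta_0$   $b_0$             $s_{b_0} = s_e \sqrt{\frac{\sum x_i^2}{n \sum (x_i - \bar{x})^2}}$
$\beta_1$   $b_1$             $s_{b_1} = \frac{s_e}{\sqrt{\sum (x_i - \bar{x})^2}}$
$Y_0$       $\hat{y}_0$       in case of average Y values: $s_{\hat{y}_0} = s_e \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
$Y_0$       $\hat{y}_0$       in case of discrete Y values: $s_{\hat{y}_0} = s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$

Confidence intervals (df = n - 2):
$b_0 \pm t \cdot s_{b_0}$, $b_1 \pm t \cdot s_{b_1}$, $\hat{y}_0 \pm t \cdot s_{\hat{y}_0}$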
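The standard errors in the table can be computed directly from the data. The following Python sketch uses hypothetical data and a hypothetical function name; the critical value 3.182 is the t table value for df = 3 and a two-sided 95 % interval.

```python
import math

def regression_standard_errors(x, y, x0):
    """Standard errors for a simple linear regression fitted by least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    b0 = mean_y - b1 * mean_x
    # Residual standard error s_e with n - 2 degrees of freedom.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    s_b0 = s_e * math.sqrt(sum(xi ** 2 for xi in x) / (n * sxx))
    s_b1 = s_e / math.sqrt(sxx)
    spread = 1 / n + (x0 - mean_x) ** 2 / sxx
    s_mean = s_e * math.sqrt(spread)        # average Y value at x0
    s_single = s_e * math.sqrt(1 + spread)  # discrete (individual) Y value at x0
    return {"b0": b0, "b1": b1, "s_e": s_e, "s_b0": s_b0, "s_b1": s_b1,
            "s_mean": s_mean, "s_single": s_single}

# Hypothetical illustration data, not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
se = regression_standard_errors(x, y, x0=3)

t_crit = 3.182  # t table value, df = 5 - 2 = 3, two-sided 95 % interval
ci_b1 = (se["b1"] - t_crit * se["s_b1"], se["b1"] + t_crit * se["s_b1"])
print(f"b1 = {se['b1']:.2f}, 95 % interval: ({ci_b1[0]:.3f}, {ci_b1[1]:.3f})")
```

The interval for an individual Y value is always wider than the one for the average Y value, because of the extra 1 under the square root.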

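Analysis of Variance in Regression Analysis

Source       Sum of squares                                  Df      Mean sum of squares        F
Regression   $S_{\hat{y}} = \sum (\hat{y}_i - \bar{y})^2$    1       $S_{\hat{y}} / 1$          $F = \frac{S_{\hat{y}}}{S_e / (n - 2)}$
Residual     $S_e = \sum (y_i - \hat{y}_i)^2$                n - 2   $s_e^2 = S_e / (n - 2)$
Total        $S_y = \sum (y_i - \bar{y})^2$                  n - 1   $S_y / (n - 1)$

$$S_y = S_{\hat{y}} + S_e: \qquad \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$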

Model testing
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$ (linear model)
Test statistic:

$$F = \frac{S_{\hat{y}}}{S_e / (n - 2)} = \frac{S_{\hat{y}}}{s_e^2}$$

The F-statistic tests whether all the slope coefficients in a linear regression are equal to 0. It measures how well the regression equation explains the variation in the dependent variable.

$$\Pr\left(F < F_{1-\alpha}(1;\, n - 2) \mid H_0\right) = 1 - \alpha$$

$H_0$ is rejected at significance level $\alpha$ if $F > F_{1-\alpha}(1;\, n - 2)$.
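A minimal Python sketch of the variance decomposition and the F statistic for a single independent variable follows. The function name and the data are hypothetical; the critical value 10.13 is the F table value for (1; 3) degrees of freedom at the 5 % level.

```python
def anova_simple_regression(x, y):
    """Return (S_reg, S_err, S_tot, F) for a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * xi for xi in x]
    s_reg = sum((f - mean_y) ** 2 for f in fitted)         # explained by the regression
    s_err = sum((yi - f) ** 2 for yi, f in zip(y, fitted))  # residual
    s_tot = sum((yi - mean_y) ** 2 for yi in y)             # total
    f_stat = s_reg / (s_err / (n - 2))
    return s_reg, s_err, s_tot, f_stat

# Hypothetical illustration data, not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
s_reg, s_err, s_tot, f_stat = anova_simple_regression(x, y)

f_crit = 10.13  # F table value for (1; 3) degrees of freedom, alpha = 0.05
reject_h0 = f_stat > f_crit
print(f"S_reg = {s_reg:.1f}, S_err = {s_err:.1f}, S_tot = {s_tot:.1f}, F = {f_stat:.2f}")
```

Here F stays below the critical value, so with only five hypothetical observations the null hypothesis of a zero slope would not be rejected.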

Parameter testing
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
Test statistic:

$$t = \frac{b_1}{s(b_1)}$$

where:
- $b_1$ is the least squares estimate of the regression slope
- $s(b_1)$ is the standard error of $b_1$

$H_0$ is rejected at significance level $\alpha$ if $|t| > t_{1-\alpha/2}$ (df = n - 2).
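The slope test can be sketched in Python as follows. The function name and the data are hypothetical, and 3.182 is the t table value for df = 3 at the two-sided 5 % level. In a regression with one independent variable, the square of this t statistic equals the F statistic of the model test.

```python
import math

def slope_t_statistic(x, y):
    """t statistic for H0: beta_1 = 0 in a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)  # standard error of b1
    return b1 / s_b1

# Hypothetical illustration data, not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t_stat = slope_t_statistic(x, y)

t_crit = 3.182  # t table value, df = 3, alpha = 0.05 (two-sided)
reject_h0 = abs(t_stat) > t_crit
print(f"t = {t_stat:.3f}, t^2 = {t_stat ** 2:.2f}, reject H0: {reject_h0}")
```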

Seminar

Exercise 1 (Book: p185 e44)
In a bar, the waiters believe that there is a relationship between the amount of cola consumed and the average daily temperature. To test this, a sample of 20 days was drawn, and the amount of consumption and the temperature on these days were examined:

Day   The amount of consumption (l)   The maximum daily temperature (°C)
 1.              520                              25
 2.              534                              26
 3.              610                              28
 4.              780                              32
 5.              708                              27
 6.              639                              25
 7.              486                              23
 8.              423                              20
 9.              452                              22
10.              597                              29
11.              640                              30
12.              657                              31
13.              678                              30
14.              620                              27
15.              635                              28
16.              610                              26
17.              585                              25
18.              627                              27
19.              608                              26
20.              720                              30

Results (x: temperature, y: consumption):
$\sum y = 12{,}129$; $\sum x = 537$; $\sum xy = 330{,}159$; $\sum x^2 = 14{,}597$; $\sum y^2 = 7{,}505{,}555$;
$\sum d_x^2 = 179$; $\sum d_y^2 = 149{,}923$; $\sum d_x d_y = 4495$

Determine the relationship between the temperature and the consumption in case of a linear and a curvilinear relationship.

Exercise 2 (p188 e48)
The export and import of Hungary with European countries are the following:

Country          Export (X)   Import (Y)
Austria              406          418
Belgium               87           93
Czech Republic        60           95
France               134          172
Holland              100          102
Poland                95           67
Great-Britain        119          136
Germany              219          291
Italy                181          363
Russia                41           68
Switzerland           27           49
Sweden                49           75
Slovakia              54           21
Slovenia              47           53
Ukraine            1,329        1,068

Results:
$\sum d_x d_y = 1{,}195{,}957$; $\sum x = 2{,}948$; $\sum y = 3{,}071$; $\sum x^2 = 2{,}084{,}046$; $\sum y^2 = 1{,}628{,}345$

Characterize the trade with European countries.

Exercise 3 (p188 e48)
The table shows the inflation rate (x) and the unemployment rate (y) of Germany between 1972 and 1997.

Year   Inflation rate (%)   Unemployment rate (%)
1972          5.5                   1.1
1973          6.9                   1.2
1974          7.0                   2.6
...           ...                   ...
1996          1.5                  11.5
1997          1.8                   9.8

Results: x 9.4; y171.8 d 94.54; d 195.44 x y xy 51.9

Determine the relationship between the unemployment rate and the inflation rate.

Thanks for your attention!