Descriptive Statistics Petra Petrovics
DESCRIPTIVE STATISTICS Definition: Descriptive statistics is concerned only with collecting and describing data Methods: - statistical tables and graphs - descriptive measures Descriptive measure a single number that provides information about a set of data
Definition of a Population I. Central Tendency - mean - mode - median II. Percentiles, Quartiles III. Dispersion IV. Shape calculation location
File / Open / Data Employee data.sav SPSS: Analyze / Descriptive Statistics / Frequencies
I. Central Tendency x Mode (Mo) The value that occurs most frequently. Median (Me) Statistic which has an equal number of variates above and below it.
I.1. Means Arithmetic mean (average) Geometric mean the ratio of any two consecutive numbers is constant e.g. compound interest rate Harmonic mean units of measurement differ between the numerator and denominator e.g. miles per hour Quadratic mean e.g. the form of standard deviation
Arithmetic Mean Typically referred to as mean. The most common measure of central tendency. It is the only common measure in which all the values play an equal role. Symbol: x, called X-bar Raw Data Expressions: x x1 x2... x n n i1 n n x i Frequency Distribution Expressions: n i1 x f i f i x i
Properties of Mean x i d i =x i - x 100-100 150-50 210 +10 240 +40 300 +100 Σ 1000 0 x 200 1. The sum of the differences from the mean is 0. 2. n i=1 n i=1 x - a i 2 i x - x = 0 is minimal, if a= x
Properties of Mean 2. x i x i +50 x i 1,1=y Z=x+y 100 150 110 210 150 200 165 315 210 260 231 441 240 290 264 504 300 350 330 630 Σ 1000 1250 1100 2100 200 250 220 420 3. If you add a constant a to every x i, the mean will be a+ x 4. If you multiply every x i by a constant b, the mean will be b* x 5. x 1, x 2,..., x n y 1, y 2,..., y n x y x 1 + y 1 ;...; x n + y n x y
Advantages of Mean Easy calculation, easy understanding, Always exists, The mean uses every value in the data and hence is a good representative of the data. The irony in this is that most of the times this value never appears in the raw data. Repeated samples drawn from the same population tend to have similar means. Isn t necesarry to know the values of every single observations, the summary could be enough.
Disadvantages of Mean It is sensitive to extreme values/outliers, especially when the sample size is small. Therefore, it is not an appropriate measure of central tendency for skewed distribution. Mean cannot be calculated for nominal or nonnominal ordinal data. Even though mean can be calculated for numerical ordinal data, many times it does not give a meaningful value, e.g. stage of cancer.
Weighted Means x i : observed values f i : weights The value of weighted mean depends on: absolute values of observations, ratois of the weights, weight could be f i /n=g i also.
I.2. Median Statistic which has an equal number of variates above and below it n 1 Raw Data Expressions: ranked value 2 Independent from extreme values Just from data in order The middle term can be calculated for qualitative ordinal data, n f ' me1 Me me 2 h f me me= lower boundary of the median class n = total number of variates in the frequency distribution f me-1 = cumulative frequency of the class below the median class f me = frequency of the median class h = class interval
I.3. Mode The value that occurs most frequently Typical value Mo mo k 1 k mo = the lower class boundary of the mode s class k 1 = the difference between the frequencies of the mode s class and the previous class k 2 = the difference between the frequencies of the mode s class and the next class h = class interval k 1 2 h
II. Percentiles and quartiles The P th percentile of a group of members is that value below which lie P% (P percent) of the numbers in the group. Q 1 (lower quartile): The first quartile is the 25th percentile. It is that point below which lie ¼ of the data. Q 2 (middle quartile): median Q 3 (upper quartile): it is that below which lie 75 percent of the data.
III. Measures of Dispersion 1. Range 2. Interquartile Range 3. Population and Sample Standard Deviation 4. Population and Sample Variance 5. Coefficient of Variation
III.1. Range The range of a set of observations is the difference between the largest observation and the smallest observation. R X X max III.2. IQR min Interquartile range: difference between the first and third quartiles. IQR Q Q 3 1
III.3. Standard Deviation The standard deviation is a measure of dispersion around the mean. A low standard deviation indicates that the data points tend to be very close to the mean, whereas high standard deviation indicates that the data are spread out over a large range of values. In a normal distribution, 68% of cases fall within one standard deviation of the mean and 95% of cases fall within 2 standard deviations.
Properties of Standard Deviation 0, if x=constant 0 x N 1 2 2 2 x q x
Properties of Standard Deviation d i 2 x i d i =x i - x y i = x i +50 d i =y i - 100-100 10 000 150-100 150-50 2 500 200-50 210 +10 100 260 +10 240 +40 1 600 290 +40 300 +100 10 000 350 +100 Σ 1 000 0 24 200 1 250 0 x 200 y 250 σ 2 =4 840 σ 2 =4 840 σ=69.6 σ=69.6 If you add a constant a to every x i, the standard deviation will be the same. y
Properties of Standard Deviation x i d i =x i - 2 y i = x i 1.1 d i =y i - 100-100 10 000 110-110 12 100 150-50 2 500 165-55 3 025 210 +10 100 231 +11 121 240 +40 1 600 264 +44 1 936 300 +100 10 000 330 +110 12 100 Σ 1000 0 24 200 1 100 29 282 x x = 200 = 220 d i σ 2 =4 840 σ 2 =5 856.4 σ=69.6 y y 2 d i σ=76.52 If you multiply every x i by a constant b, the standard deviation will be b*σ
III.4. Variance Variance of a set of observations: the average squared deviation of the data points from their mean. Population variance: Sample variance: S n n 2 2 ( X i X ) fi ( X i X ) 2 i1 i1 n n i1 n n 2 2 ( X i X ) fi ( X i X ) 2 i1 i1 n n 1 III.5. Coefficient of Variation The measure of dispersion around the mean in %. s V V X X i1 f i 1 f i
IV. Measures of Shape Skewness is a measure of the degree of asymmetry of a frequency distribution. Kurtosis is a measure of the flatness (versus peakedness) of a frequency distribution.
IV.1.Kurtosis The measure of the extent to which observations cluster around the central point. Positive cluster more and have longer tails Negative cluster less and have shorter tails For a normal distribution, the value of the kurtosis statistic is zero.
IV.2. Skewness A X Mo F ( Q Me) ( Me Q ) 3 1 ( Q Me) ( Me Q ) 3 1 Skewed to the left (long right tail) Symmetry Me Mo X A>0 A<0 Mo Me X Skewed to the right X Me Mo
Statistics Current Salary N Mean Median Mode Std. Deviation Variance Skewness Std. Error of Skewness Kurtosis Std. Error of Kurtosis Range Minimum Maximum Sum Percentiles Valid Missing 5 10 25 50 75 90 95 474 0 $34,419.57 $28,875.00 $30,750 $17,075.661 291578214,453 2,125,112 5,378,224 $119,250 $15,750 $135,000 $16,314,875 $19,200.00 $21,000.00 $24,000.00 $28,875.00 $37,162.50 $59,700.00 $70,218.75
Box Plot The box plot is a set of five summary measures of the distributions of the data: - the median of the data - the lower quartile - the upper quartile - the smallest observation - the largest observation + asymetry
Gazdaságtudományi Kar Gazdaságelméleti és Módszertani Intézet Box&Whiskers Source: Aczel [1996]
Gazdaságtudományi Kar Gazdaságelméleti és Módszertani Intézet Elements of Box Plot Source: Aczel [1996]
Gazdaságtudományi Kar Gazdaságelméleti és Módszertani Intézet Source: Aczel [1996]
Graphs / Boxplot Boxplot of the current salary categorized by the employment category
Box Plot The highest salary The least standard deviation Q 3 Me Q 1
Thanks for your attention! strolsz@uni-miskolc.hu