Introduction to Statistics Petra Petrovics
Statistics
Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. The word is used in three senses:
- a practical activity: analyzing data
- the set of data that results from statistical activity
- a method: analyzing data and drawing conclusions
Hungarian Central Statistical Office (HCSO)
- Independent administrative organization
- Operates under the direct supervision of the government
- Main tasks: designing and conducting surveys; recording, processing and storing data; analyzing and disseminating data; protecting individual data
Statistics
Descriptive statistics
- The study of how data can be summarized effectively to describe the important aspects of large data sets
- Turns data into information
- Data collection and analysis
Statistical inference
- Used when tentative conclusions about a population are drawn on the basis of a sample
Statistical Population
- All members of a specified group (N)
- The set of entities about which statistical inferences are to be drawn, often based on a random sample taken from the population
- Types: discrete population; continuous population (interval)
Statistical Variables
A statistical variable is a characteristic of a unit. Variables are classified in two ways:
(1) quantitative, qualitative, temporal, geographical
(2) common, differential
Quantitative vs. Qualitative
Quantitative data measure either how much or how many of something, i.e. a set of observations where any single observation is a number that represents an amount or a count.
Qualitative data provide labels, or names, for categories of like items, i.e. a set of observations where any single observation is a word or code that represents a class or category (also called a categorical variable).
Types of Quantitative Variables
Continuous variables theoretically have an infinite number of gradations between any two measurements, for example the body weight of individuals or the milk yield of cows or buffaloes. Most variables in biology are of the continuous type.
Discrete variables do not have continuous gradations; there is a definite gap between two measurements, i.e. they cannot be measured in fractions. Examples are the number of eggs laid by hens and the number of children in a family.
Scales of Measurement
From weakest to strongest:
- nominal scale
- ordinal scale
- interval scale
- ratio scale
1. Nominal Scale
- Numbers are labels of groups or classes: simple codes assigned to objects
- Used for qualitative data, e.g. professional classification, geographic classification
- e.g. blonde: 1, brown: 2, red: 3, black: 4 (a person with red hair does not possess more "hairness" than a person with blonde hair)
- e.g. female: 1, male: 2
2. Ordinal Scale
- Data elements may be ordered according to their relative size or quality
- The numbers assigned to objects or events represent the rank order (1st, 2nd, 3rd etc.)
- e.g. top lists of companies
3. Interval Scale
- Distances between any two observations are meaningful
- The "zero point" is arbitrary
- Negative values can be used
- Ratios between numbers on the scale are not meaningful, so operations such as multiplication and division cannot be carried out directly
- e.g. temperature on the Celsius scale
4. Ratio Scale
- Strongest scale of measurement
- Distances between observations and also the ratios of distances have a meaning
- Contains a meaningful zero
- e.g. mass, length, time; a salary of $50,000 is twice as large as a salary of $25,000
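The difference between the interval and the ratio scale can be checked numerically. The sketch below is only an illustration: the two temperatures and the helper names `celsius_to_kelvin` and `usd_to_cents` are assumptions chosen for the example, not part of the slides. It shows that a ratio of Celsius values changes when the zero point moves, while a ratio of salaries survives a change of unit.

```python
def celsius_to_kelvin(celsius):
    """Shift the arbitrary Celsius zero point to absolute zero."""
    return celsius + 273.15


def usd_to_cents(usd):
    """Change the unit of a ratio-scale quantity (the zero stays at zero)."""
    return usd * 100


# Interval scale: illustrative temperatures (assumed values).
warm_c, cool_c = 20.0, 10.0
ratio_celsius = warm_c / cool_c                                        # 2.0
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)   # about 1.035

# Differences, unlike ratios, are preserved on an interval scale.
diff_celsius = warm_c - cool_c
diff_kelvin = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratio scale: the salary example from the slide.
high_salary, low_salary = 50_000, 25_000
ratio_usd = high_salary / low_salary
ratio_cents = usd_to_cents(high_salary) / usd_to_cents(low_salary)

print(ratio_celsius, round(ratio_kelvin, 3))   # the two ratios disagree
print(diff_celsius, round(diff_kelvin, 6))     # the two differences agree
print(ratio_usd, ratio_cents)                  # the two ratios agree
```

"Twice as warm" therefore has no fixed meaning in Celsius, whereas "twice the salary" holds in any currency unit.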
Statistical Rows & Columns Classes Frequencies
Data Set
1. Mass of numerical data: discrete values
E.g.: 11.8, 3.6, 16.6, 13.5, 3.6, 8.3, 8.9, 9.1, 7.7, 2.3, 12.1, 6.1, 10.2, 8.0, 11.4, 6.8, 9.6, 19.5, 15.3, 12.3, 8.5, 15.9, 18.7, 11.7, 6.2, 11.2, 10.4, 7.2, 5.5, 14.5
2. Frequency distribution: a method of organizing and presenting data
- by score value
- by interval of score values (classes)
- a statistical table records the number of observations in each class
Frequency Table
Approximate class width: $h \approx \frac{\text{largest value} - \text{smallest value}}{\text{number of classes}} = \frac{20 - 2}{6} = 3$
Number of class intervals: $2^k > N$, where $k$ is the number of classes and $N$ is the number of observations

Frequency table with score values
Score value    Frequency
2.3            1
3.6            2
...            ...
19.5           1
Total          30

Frequency table with class intervals
Class limits   Class width   Frequency
2-5            3             3
5-8            3             6
8-11           3             8
11-14          3             7
14-17          3             4
17-20          3             2
Total                        30
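The grouping of the 30 values into classes can be reproduced with a short sketch. Two conventions here are assumptions, since the slide does not state them: the class limits start at the smallest value rounded down, and each class includes its lower limit but excludes its upper limit. The variable names are illustrative.

```python
import math

data = [11.8, 3.6, 16.6, 13.5, 3.6, 8.3, 8.9, 9.1, 7.7, 2.3,
        12.1, 6.1, 10.2, 8.0, 11.4, 6.8, 9.6, 19.5, 15.3, 12.3,
        8.5, 15.9, 18.7, 11.7, 6.2, 11.2, 10.4, 7.2, 5.5, 14.5]
n = len(data)

# Rule of thumb 2**k > N: find the smallest k that satisfies it.
k_min = 1
while 2 ** k_min <= n:
    k_min += 1

# The slide works with 6 classes, which also satisfies 2**k > N.
num_classes = 6

# Approximate class width, rounded up to a convenient whole number.
approx_width = (max(data) - min(data)) / num_classes
width = math.ceil(approx_width)

# Class limits starting at the smallest value rounded down (assumption).
lower = math.floor(min(data))
edges = [lower + i * width for i in range(num_classes + 1)]

# Count the observations with lower limit <= x < upper limit (assumption).
freq = [sum(1 for x in data if lo <= x < hi)
        for lo, hi in zip(edges, edges[1:])]

for lo, hi, f in zip(edges, edges[1:], freq):
    print(f"{lo}-{hi}: {f}")
print("Total:", sum(freq))
```

With the lower-limit-inclusive convention the value 8.0 falls into the class 8-11, which matches the frequencies in the table.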
Types of Statistical Rows
- Classifying: the main population and its partial populations; same measures
- Comparative: same types of data; generally the data cannot be added
- Descriptive: different types and measures of data
By variable: qualitative, quantitative, temporal, geographical
Descriptive Rows
Name                          Data
Territory (thousand km²)      93.0
Population (million people)   10.04
GDP (billion euro)            105.8
CPI (%)                       106.1
Comparative Rows: Temporal (Time Series)
Point-of-date data (discrete population): the values cannot be summed
Year   Hungarian population (thousand persons)
1960   9 961
1970   10 322
1980   10 709
1990   10 709
Period data (continuous population): the values can be summed
Year   Number of marriages
2002   46 008
2003   46 398
2004   43 791
2005   44 234
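The distinction between the two kinds of time series can be sketched in a few lines. The figures are the ones on the slide; the variable names are illustrative. Adding the marriage counts gives a meaningful four-year total, while for the population figures only the change between two dates is meaningful, not their sum.

```python
# Period data: each value covers a whole year, so the values can be added.
marriages = {2002: 46008, 2003: 46398, 2004: 43791, 2005: 44234}
total_marriages = sum(marriages.values())

# Point-of-date data (thousand persons): adding the values has no meaning,
# but the difference between two dates does.
population = {1960: 9961, 1970: 10322, 1980: 10709, 1990: 10709}
years = sorted(population)
changes = {later: population[later] - population[earlier]
           for earlier, later in zip(years, years[1:])}

print("Marriages 2002-2005:", total_marriages)
print("Population change by decade (thousand persons):", changes)
```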
Comparative Rows
Geographical
Country    GDP (%), 2001-2005
Hungary    4.2
Romania    5.7
Slovakia   4.6
Slovenia   3.4
Qualitative
Hungary, 2005
Sex     Expected lifetime (years)
Men     68.6
Women   76.9
Source: Statistical Yearbook 2005
Classifying Rows
Can be qualitative (by product), temporal (by period) or geographical (by country).
Product / Period / Country   Turnover (Th HUF)
A                            3 880
B                            4 020
C                            3 000
Total                        10 900
Types of Quantitative Rows
E.g.: Water consumption in village X
Water consumption (m³)   Number of houses f   f'   g (%)   g' (%)   S (m³)   Z (%)
-15                      5                    5    10      10       50       3
15-25                    17                   22   34      44       340      24
25-35                    15                   37   30      74       450      32
35-45                    8                    45   16      90       320      23
45-                      5                    50   10      100      250      18
Total                    50                   -    100     -        1410     100
f: frequency; g: relative frequency; f': cumulative frequency; g': cumulative relative frequency
Frequency (f): the number of times a value of the data occurs.
Cumulative frequency (f'): the sum of the frequencies for all values that are less than or equal to the given value.
$f'_k = \sum_{i=1}^{k} f_i = N$
Water consumption (m³)   Number of houses f   f'
-15                      5                    5
15-25                    17                   22 = 5+17
25-35                    15                   37 = 5+17+15
35-45                    8                    45 = 5+17+15+8
45-                      5                    50 = 5+17+15+8+5
Total                    50                   -
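The running sums in the water-consumption example can be computed as follows. This is a minimal sketch using `itertools.accumulate` from the Python standard library; the variable names are illustrative.

```python
from itertools import accumulate

# Number of houses per water-consumption class, from the example.
frequencies = [5, 17, 15, 8, 5]

# Each cumulative frequency is the sum of all frequencies up to that class.
cumulative = list(accumulate(frequencies))
total = sum(frequencies)

print(cumulative)   # running sums, class by class
print(total)        # N, equal to the last cumulative frequency
```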
Relative frequency (g): the ratio of the number of times a value of the data occurs to the number of all outcomes.
$g_i = \frac{f_i}{N} \cdot 100\ (\%)$, $i = 1, \dots, k$
Cumulative relative frequency (g'): applies to a set of observations ordered from smallest to largest. It is the sum of the relative frequencies for all values that are less than or equal to the given value.
Water consumption (m³)   g (%)   g' (%)
-15                      10      10
15-25                    34      44
25-35                    30      74
35-45                    16      90
45-                      10      100
Total                    100     -
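A minimal sketch of the same calculation, assuming the frequencies of the water-consumption example; the variable names are illustrative.

```python
from itertools import accumulate

frequencies = [5, 17, 15, 8, 5]   # number of houses per class
n = sum(frequencies)              # N = 50

# Relative frequency in percent: g_i = f_i / N * 100.
relative = [100 * f / n for f in frequencies]

# Cumulative relative frequency: running sum of the relative frequencies.
cumulative_relative = list(accumulate(relative))

for g, g_cum in zip(relative, cumulative_relative):
    print(f"{g:5.1f} %   {g_cum:5.1f} %")
```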
Sum of Values (S)
$S_i = f_i x_i$, where $x_i$ is the discrete value or the middle of the class: $x_i = \frac{X_{i,\text{lower}} + X_{i,\text{upper}}}{2}$
$\sum_{i=1}^{k} S_i = f_1 x_1 + \dots + f_k x_k = \sum_{i=1}^{k} f_i x_i$
Water consumption (m³)   Number of houses (f)   x_i * f_i   S (m³)
-15                      5                      10*5        50
15-25                    17                     20*17       340
25-35                    15                     30*15       450
35-45                    8                      40*8        320
45-                      5                      50*5        250
Total                    50                                 1410
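The sum of values can be sketched as below. The outer class limits 5 and 55 are assumptions: the slide leaves the first and last class open, and these limits are simply the ones that reproduce the midpoints 10 and 50 used in the table.

```python
# Class limits in m³; the outer limits 5 and 55 are assumed (see above).
class_limits = [(5, 15), (15, 25), (25, 35), (35, 45), (45, 55)]
frequencies = [5, 17, 15, 8, 5]

# Middle of each class: (lower limit + upper limit) / 2.
midpoints = [(lower + upper) / 2 for lower, upper in class_limits]

# Sum of values per class: S_i = f_i * x_i.
sums_of_values = [f * x for f, x in zip(frequencies, midpoints)]
total_sum = sum(sums_of_values)

print(midpoints)
print(sums_of_values)
print(total_sum)
```

Because the midpoint stands in for every house in a class, the total is an estimate of the village's consumption rather than an exact figure.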
Relative Sum of Values (Z)
$Z_i = \frac{S_i}{S} \cdot 100\ (\%)$, $i = 1, \dots, k$
e.g. $Z_3 = \frac{450}{1410} \cdot 100 \approx 32$
$0 \le Z_i \le 1$ and $\sum_{i=1}^{k} Z_i = 1\ (100\%)$
Water consumption (m³)   S (m³)     Z (%)
-15                      50         3
15-25                    340        24
25-35                    450        32
35-45                    320        23
45-                      250        18
Total                    S = 1410   100
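A minimal sketch of the relative sum of values, using the S column of the water-consumption example; the variable names are illustrative. Note that the first class works out to about 3.5 %, which ordinary rounding takes to 4, whereas the table lists 3; the code does not try to reproduce that rounding choice.

```python
sums_of_values = [50, 340, 450, 320, 250]   # S_i in m³
total_sum = sum(sums_of_values)             # S = 1410

# Share of each class in the total: Z_i = S_i / S * 100.
relative_sums = [100 * s / total_sum for s in sums_of_values]
rounded = [round(z) for z in relative_sums]

for s, z in zip(sums_of_values, relative_sums):
    print(f"{s:5d} m³   {z:5.1f} %")
print("Sum of shares:", round(sum(relative_sums), 6))
```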
Types of Statistical Tables
- Simple table: descriptive/comparative row × descriptive/comparative row
- Classifying table: classifying row × descriptive/comparative row
- Combined table: classifying row × classifying row
Statistical Table
A statistical table is a set of data arranged in rows and columns.
It is important to include a title, a source and the units of measurement.
Signs: if we do not know the data: if there is not any data: 0
Data about Hungary (2008)
Name                          Data
Territory (thousand km²)      93.0
Population (million people)   10.04
GDP (billion euro)            105.8
CPI (%)                       106.1
Source: HCSO (KSH)
Parts of the table: title (first line), measurements (units in parentheses), source (last line). The table has 1 dimension.
Territory price index in Hungary (2008)
Territory               Employed people (thousand)   Unemployed people (thousand)
Central Hungary         1245.5                       80.5
Central Transdanubia    441.5                        40.5
Western Transdanubia    408.2                        36.5
Southern Transdanubia   337.4                        42.3
Northern Hungary        397.6                        69.9
Northern Great Plain    492.1                        78.7
Southern Great Plain    474.8                        53.3
Total                   3797.1                       401.7
Source: HCSO (KSH)
The table has 2 dimensions.
Graphs
The Graphic Presentation of Data
A graph makes the important characteristics of the data visible.
Principles: clear, homogeneous, purpose-oriented, simple, reconstructable, scaled
The Graphic Presentation of Time Series I
[Figure: Number of Accidents in Hungary. Source: HCSO]
Line chart: connects a series of data points with a line
The Graphic Presentation of Time Series II
[Figure: Natural Gas Consumption]
Area chart: represents cumulated totals as numbers or percentages (stacked) over time; emphasizes a change in values
The Graphic Presentation of Time Series III
[Figure: Change in the Number of Employments. Source: Statistical Yearbook, 2005]
Bar chart (stacked): used for time periods (x-axis: interval)
The Graphic Presentation of Quantitative Rows
[Figure: Population pyramid. Source: Statistical Yearbook, 2005]
The Graphic Presentation of Quantitative Rows
[Figure: based on Word95.sav]
Scatter plot: distribution of data points along one or two dimensions
The Graphic Presentation of Frequency Distribution
Histogram: a bar chart of data grouped into a frequency distribution; shows the number of points that fall within various numeric ranges
The Graphic Presentation of Frequency Distribution
Frequency polygon
- Connects data points with straight lines or higher-order curves
- x-axis: midpoint of each interval
- y-axis: absolute frequency
Cumulative frequency distribution
- Tends to flatten out
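The coordinates of both graphs can be derived from a frequency table. The sketch below uses the class-interval table from the Data Set example (classes 2-5 to 17-20) and only computes the points, leaving the drawing to whatever plotting tool is at hand. Plotting the cumulative frequency against the upper class limit is an assumption; the slide does not say which x-value it uses.

```python
from itertools import accumulate

# Class limits and frequencies from the class-interval frequency table.
edges = [2, 5, 8, 11, 14, 17, 20]
frequencies = [3, 6, 8, 7, 4, 2]

# Frequency polygon: x = midpoint of each interval, y = absolute frequency.
midpoints = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]
polygon_points = list(zip(midpoints, frequencies))

# Cumulative distribution: x = upper class limit (assumed), y = running total.
cumulative_points = list(zip(edges[1:], accumulate(frequencies)))

print(polygon_points)
print(cumulative_points)
```

The last two cumulative steps add only 4 and 2 observations, which is the flattening the slide refers to.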
The Graphic Presentation According to Qualitative Variables
Pie chart
- Shows proportional relationships at a point in time
- Shows percentage values as slices of a pie
- Compares parts of a whole at a given point in time
The Graphic Presentation According to Territory Variables
Cartogram: a map showing quantitative information
Pictogram
Thanks for your attention!