Cluster analysis in SPSS
Cluster Analysis
Cluster analysis is a classification method. It aims to show that the data contain groups in which the within-group distance is minimal, because cases are more similar to the other members of their own group than to members of other groups. The between-group distance, in contrast, is large, so the resulting clusters are distinct, independent and homogeneous. The aim is to identify groups and explore the structure of the data.

Cluster analysis in practice
- Market segmentation
  1. Definition of the relevant market
  2. Definition of segmentation bases/variables
  3. Segmentation (factor analysis, cluster analysis)
  4. Characterization of the consumers in each group
- Market structure analysis (substitutability of competing brands)
- Identification of new product opportunities
- Test market selection
- Data reduction

Stages of cluster analysis
1. General Purpose
2. Main Cluster Method
3. Variable Selection
4. Examination of the conditions of cluster analysis
5. Similarity and Distance Measures
6. Further Cluster Methods
7. Number of Clusters
8. Validity Tests
9. Name and Characterization of Clusters

Exercise
Consumers of a company producing dried soup (soup powder) were surveyed. Variables:
- Name: string
- Cooking: how often the respondent cooks, on a scale from 1 to 7
- Domesticated: how domestic the respondent is, on a scale from 1 to 7
- Gender: 1: male, 2: female
- Dwelling place: 1: Budapest, 2: county town, 3: other

1. General Purpose
- Aim of the analysis: grouping the soup powder customers based on statistical criteria.
- Observations:
  - Population: e.g. soup powder customers in Hungary
  - Determine the sample size and the sample design
  - In this case: n = 16 persons (not representative)

2. Main Cluster Method
Miskolci Egyetem Gazdaságtudományi Kar
Hierarchical method
- Preferred if: we do not know in advance how many clusters we want to create
- Disadvantage: sensitive to outliers
Non-hierarchical method
- Preferred if:
  - the number of sampling units is high
  - the result should be less dependent on outliers
  - the result should be less dependent on the measure of distance
  - the result should be less dependent on whether an irrelevant variable has been included in the analysis
- Disadvantages:
  - the number of clusters must be predetermined
  - the cluster centers must be selected
  - the result depends on the sequence of observations
Combined use:
1. Hierarchical method: ideal number of clusters
2. Filtering outliers
3. Non-hierarchical classification

3. Variable Selection
- Check the strength of correlation between the variables (multicollinearity)
- Analyze / Regression / Linear

4. Examination of the conditions of cluster analysis I.
Is the sample representative?
- Here it is NOT, so we cannot draw conclusions about the population.
Managing outliers
- An outlier is either an abnormal observation that is not typical of the population,
- or an observation from a group that is underrepresented in the sample relative to its size in the population.
- Analyze / Classify / Hierarchical Cluster / Method: Nearest neighbour

4. Examination of the conditions of cluster analysis II.
Scales
- Data measured on similar scales are comparable.
- Recommended: the same unit of measurement (reason: a variable with a larger deviation has a bigger influence).
- E.g. cooking and the domestic aspect are measured on different intervals, or income is compared with cooking, etc.
- If the scales differ: standardization!
  $z_i = \frac{x_i - \bar{x}}{\sigma_x}$
  The standardized data are comparable: mean 0, standard deviation 1.
- Standardize if:
  - the relative importance of the responses compared to each other is relevant,
  - we are looking for similar profiles,
  - we are not concerned with the respondent's response-style effect.

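The standardization above can be sketched in a few lines of Python. This is only an illustration, not SPSS output: the function name `standardize` is made up for this example, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption. The cooking and income values are those of the 16 respondents in the exercise.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(x - m) / s for x in values]

# Cooking (scale 1-7) and income of the 16 respondents, in case order
cooking = [1, 2, 5, 2, 4, 2, 2, 3, 2, 6, 3, 6, 7, 6, 5, 1]
income = [3000, 1500, 2000, 1000, 7000, 8000, 7000, 1500,
          5000, 1000, 2000, 4000, 8000, 1000, 3000, 6000]

z_cooking = standardize(cooking)
z_income = standardize(income)

# After standardization both variables have mean 0 and deviation 1,
# so income no longer dominates the distance calculation.
print(round(mean(z_cooking), 6), round(stdev(z_cooking), 6))
print(round(mean(z_income), 6), round(stdev(z_income), 6))
```

Without this step the income differences (thousands) would swamp the differences on the 1 to 7 scales in any distance measure.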
Analyze / Classify / Hierarchical Cluster / Method
5. Determination of the measure of similarity and distance
Binary variables
- Measures of distance: Euclidean distance, Squared Euclidean distance, Variance
- Measures of similarity: Russell and Rao, Simple matching, Jaccard, Yule
Metric variables
- Measures of distance: Euclidean distance, Squared Euclidean distance, City block, Chebychev
- Measure of similarity: Pearson correlation
Analyze / Classify / Hierarchical Cluster / Method

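The distance measures for metric variables listed above can be written out as a minimal Python sketch. The function names are chosen for this example only and are not SPSS identifiers. The two cases compared are Géza (cooking 7, domesticated 1) and Vera (cooking 1, domesticated 6) from the exercise data.

```python
import math

def squared_euclidean(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(a, b))

def city_block(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebychev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

geza = (7, 1)  # (cooking, domesticated)
vera = (1, 6)

print("Squared Euclidean:", squared_euclidean(geza, vera))
print("Euclidean:", euclidean(geza, vera))
print("City block:", city_block(geza, vera))
print("Chebychev:", chebychev(geza, vera))
```

The measures rank pairs of cases differently, which is one reason the validity check later repeats the analysis with a different measure of distance.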
6. Cluster Methods
- Hierarchical
  - Agglomerative
    - Linkage methods: Single, Complete, Average
    - Variance methods: Ward
    - Centroid methods
  - Divisive
- Non-hierarchical

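Output: Agglomeration Schedule
- Stage: the steps of merging.
- Cluster Combined: the cases or clusters merged in that step, e.g. Rita and Vera.
- Coefficients: the distance on which the merging of the clusters was based. A sudden large increase signals too big a step.
- Stage Cluster First Appears: the step in which each merged cluster first appeared.
- Next Stage: the step in which the new, combined cluster appears next (the cluster carries the lower case number).
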
Vertical icicle plot
- With a large number of items it is difficult to handle.
- Géza ~ outlier
- We start the interpretation from the bottom: where is the longest bar between the names? Between Vera and Rita, so they form the first cluster.

Dendrogram
- Clusters are merged based on the minimum distance.
- Handling of outliers: Géza ~ outlier. Abnormal? Should he be excluded?

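How the nearest neighbour (single linkage) method exposes an outlier can be sketched in plain Python. This is a simplified illustration, not the SPSS procedure: it assumes squared Euclidean distance on the unstandardized cooking and domesticated scores, and ties between equal distances are broken by case order, which may differ from SPSS. The function names are made up for this example.

```python
from itertools import combinations

# (cooking, domesticated) for the 16 respondents of the exercise
cases = {
    "Béla": (1, 3), "Jenő": (2, 3), "Bea": (5, 5), "Marci": (2, 4),
    "Ubul": (4, 4), "Zsuzsa": (2, 7), "Rita": (2, 6), "Zoli": (3, 4),
    "Dávid": (2, 2), "Robi": (6, 5), "Kriszti": (3, 3), "Zsófi": (6, 6),
    "Géza": (7, 1), "Éva": (6, 7), "Dóra": (5, 7), "Vera": (1, 6),
}

def sq_dist(a, b):
    """Squared Euclidean distance between two cases."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def single_linkage(points):
    """Merge the two closest clusters until one remains.

    The distance between two clusters is the distance between
    their two nearest members (nearest neighbour rule).
    Returns the merge schedule as (distance, cluster, cluster).
    """
    clusters = [[name] for name in points]
    schedule = []
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = min(sq_dist(points[a], points[b])
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        schedule.append((d, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return schedule

schedule = single_linkage(cases)
for stage, (d, left, right) in enumerate(schedule, start=1):
    print(stage, d, left, "+", right)
```

Under these assumptions Géza stays alone until the very last stage and joins the other 15 cases only at a squared distance of 17, while every earlier merge happens at a distance of 4 or less. A case that merges that late and that far away is an outlier candidate.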
Analyze / Classify / Hierarchical Cluster / Method: Ward
Conditions:
- Metric variables
- No outliers
- No correlation between the variables

7. Determine the number of clusters
a. Researcher experience
b. Distances
c. Scree plot
d. Relative measure of clusters

b) Distance (dendrogram)
- Cut where the value of the coefficient increases suddenly.
- But: try to keep the number of clusters around 5.
- Here: 2 or 3 clusters.

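The "sudden increase" rule can be sketched as a small Python function. The function name and the coefficient values below are hypothetical and serve only to illustrate the logic; they are not the SPSS output of the exercise. The idea: with n cases there are n - 1 merging stages, and if the largest jump in the coefficient happens after stage s, stopping at stage s leaves n - s clusters.

```python
def suggest_cluster_count(coefficients):
    """Suggest a number of clusters from agglomeration coefficients.

    coefficients[k] is the coefficient of stage k + 1.
    The merging is stopped just before the largest jump.
    """
    n_cases = len(coefficients) + 1
    jumps = [later - earlier
             for earlier, later in zip(coefficients, coefficients[1:])]
    # 0-based position of the largest jump; the jump at position k
    # lies between stage k + 1 and stage k + 2
    biggest = max(range(len(jumps)), key=jumps.__getitem__)
    last_stage_kept = biggest + 1
    return n_cases - last_stage_kept

# Hypothetical coefficients for 8 cases (7 stages), illustration only
hypothetical = [1.0, 1.5, 2.0, 2.6, 3.1, 9.0, 11.0]
print(suggest_cluster_count(hypothetical))
```

In this made-up series the coefficient jumps between stage 5 and stage 6, so the sketch stops after stage 5 and suggests 3 clusters. The rule is only a guide and should be weighed against researcher experience and the scree plot.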
c) Scree plot
Create a line graph.
3 clusters (n-1) cases
Graphs / Scatter/Dot
9. Explanation, characterization of clusters
- Cluster centroids and standard deviations
- Quantitative (cooking, domesticated) + qualitative (cluster) variables: mixed dependence
- Analyze / Compare Means / Means

Demographic analysis (gender (nem), residency (lakhely))
- Qualitative + qualitative (cluster) variables: association
- Analyze / Descriptive Statistics / Crosstabs

Quantitative (income) + qualitative (cluster) variables
- Mixed dependence (ANOVA)
- Analyze / Compare Means / Means

9. Characterization of clusters, labeling

                          1. cluster          2. cluster            3. cluster
Variables involved in the cluster analysis
  Cooking a lot           No                  Yes                   No
  Domesticated            No                  Yes                   No
Variables involved only in the characterization
  Gender                  Predominantly men   Predominantly women   Women
  Residency               ?                   Big cities            County towns
  Income                  Low (3000)          Low (2200)            High (7667)
Labels                    Careless            Housewives            Businesswomen

Graphs / Pie
8. Verification of the validity of cluster analysis
- Different measure of distance
- Different method of cluster analysis
- Leave out variables
- Divide the sample into 2 parts
- Change the order of cases
- Non-hierarchical cluster analysis

Non hierarchical cluster analysis in the SPSS
Hierarchical and non-hierarchical methods in SPSS
Hierarchical method
- Benefits:
  - Helps to determine the number of clusters
  - Changing the number of clusters does not change the contents of the clusters formed earlier
  - Many measures of distance
  - Standardization of variables
  - Dendrogram
- Disadvantages:
  - Sensitive to outliers
  - Finding the ideal combination takes long
  - Nominal and metric variables cannot be combined
Non-hierarchical method
- Benefits of K-Means:
  - Suitable when the number of sample units is high
  - Less dependent on outliers
  - Less dependent on the measure of distance
  - Less dependent on whether an irrelevant variable has been included in the analysis
- Benefits of Two Step:
  - Fastest
  - Nominal and metric variables can be combined
  - Suggests the ideal number of clusters
  - Filters the outliers
  - Standardizes by default
- Disadvantages:
  - The number of clusters must be predetermined
  - The cluster centers must be selected
  - The result depends on the sequence of observations
  - Changing the number of clusters changes the contents of the clusters

No.  Name     Cooking  Domesticated  Gender  Place  Income
1    Béla     1        3             1       3      3000
2    Jenő     2        3             1       1      1500
3    Bea      5        5             2       2      2000
4    Marci    2        4             1       3      1000
5    Ubul     4        4             1       1      7000
6    Zsuzsa   2        7             2       1      8000
7    Rita     2        6             2       2      7000
8    Zoli     3        4             1       3      1500
9    Dávid    2        2             1       1      5000
10   Robi     6        5             1       3      1000
11   Kriszti  3        3             2       3      2000
12   Zsófi    6        6             2       2      4000
13   Géza     7        1             1       2      8000
14   Éva      6        7             2       1      1000
15   Dóra     5        7             2       1      3000
16   Vera     1        6             2       2      6000
Source: textbook (TK), p. 286 (Sajtos-Mitev)

8. Verification of the validity of cluster analysis: K-Means method
- Analyze / Classify / K-Means Cluster
- Determination of the initial cluster centers

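The logic of the K-Means method can be sketched in plain Python on the cooking and domesticated scores of the exercise. This is a minimal illustration, not the SPSS algorithm: the function names are made up, the data are not standardized, and the three initial centers (the cases Dávid, Zsuzsa and Éva) are an arbitrary assumption. Different initial centers or a different case order can lead to a different solution.

```python
# (cooking, domesticated) for the 16 respondents, in case order
names = ["Béla", "Jenő", "Bea", "Marci", "Ubul", "Zsuzsa", "Rita", "Zoli",
         "Dávid", "Robi", "Kriszti", "Zsófi", "Géza", "Éva", "Dóra", "Vera"]
points = [(1, 3), (2, 3), (5, 5), (2, 4), (4, 4), (2, 7), (2, 6), (3, 4),
          (2, 2), (6, 5), (3, 3), (6, 6), (7, 1), (6, 7), (5, 7), (1, 6)]

def sq_dist(a, b):
    """Squared Euclidean distance between a case and a center."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, centers, max_iter=100):
    """Alternate between assigning cases and updating centers."""
    centers = list(centers)
    for _ in range(max_iter):
        # 1. Assign every case to its nearest cluster center
        labels = [min(range(len(centers)),
                      key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        # 2. Move every center to the mean of its members
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        if new_centers == centers:  # nothing moved: converged
            break
        centers = new_centers
    return labels, centers

# Assumed initial centers: the cases Dávid, Zsuzsa and Éva
initial_centers = [(2, 2), (2, 7), (6, 7)]
labels, centers = k_means(points, initial_centers)

for k, center in enumerate(centers):
    members = [n for n, lab in zip(names, labels) if lab == k]
    print(k + 1, center, members)
```

Comparing the memberships from such a run with the hierarchical solution is the validity check described on this slide.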
Output: 3 clusters, 3 cluster centers
8. Comparison of the hierarchical method and the non-hierarchical method (K-Means)
If the two methods give the same solution, the result is reliable.
Exercise
Classification of consumers based on shopping attitudes. Evaluate the statements on a scale from 1 to 7:
- V1: Shopping is fun.
- V2: Shopping is bad for the wallet.
- V3: I often combine shopping with visiting a restaurant.
- V4: When shopping, I try to get the best deal.
- V5: I don't care about shopping.
- V6: A lot of money can be saved by comparing prices.
Source: Malhotra [2005]: Marketingkutatás, p. 703.

Number  V1  V2  V3  V4  V5  V6
1       6   4   7   3   2   3
2       2   3   1   4   5   4
3       7   2   6   4   1   3
4       4   6   4   5   3   6
5       1   3   2   2   6   4
6       6   4   6   3   3   4
7       5   3   6   3   3   4
8       7   3   7   4   1   4
9       2   4   3   3   6   3
10      3   5   3   6   4   6
11      1   3   2   3   5   3
12      5   4   5   4   2   4
13      2   2   1   5   4   4
14      4   6   4   6   4   7
15      6   5   4   2   1   4
16      3   5   4   6   4   7
17      4   4   7   2   2   5
18      3   7   2   6   4   3
19      4   6   3   7   2   7
20      2   3   2   4   7   2
Miskolci Egyetem Gazdaságtudományi Kar, Üzleti Információgazdálkodási és Módszertani Intézet

Output
The clusters:
1. Entertainment-loving, interested customers
2. Apathetic customers
3. Careful customers