This is an Open Data Science course on analysing and presenting data with the R software.
This year I’m here as a course assistant, and I will be making some modifications to the following chapters as we go.
Here is the link to my Github repository, where you can see how all the magic is done: https://github.com/macabset/IODS-project
“This week I have completed the DataCamp exercises, peer reviewed RStudio exercise 1 and, surprisingly enough, learned for the first time to create regression models in RStudio exercise 2. This has been an excellent opportunity to learn and interpret regression models many-sidedly and thoroughly.” - January 2017
The first part of this chapter is the data wrangling exercise, which doesn’t show in my course diary. If you are interested, go and check my GitHub: https://github.com/macabset/IODS-project/blob/master/data/create_learning2014.R
getwd()
## [1] "C:/Users/murmeli/Documents/murmeli/2017/GitHub/IODS-project"
lrn2014 <- read.csv("learning2014.csv")
head(lrn2014)
## gender age attitude deep stra surf points
## 1 F 53 37 3.750 3.375 2.583333 25
## 2 M 55 31 2.875 2.750 3.166667 12
## 3 F 49 25 3.875 3.625 2.250000 24
## 4 M 53 35 3.500 3.125 2.250000 10
## 5 M 49 37 3.750 3.625 2.833333 22
## 6 F 38 38 4.875 3.625 2.416667 21
str(lrn2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.75 2.88 3.88 3.5 3.75 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(lrn2014)
## [1] 166 7
Here we can see the learning2014 dataset, which has 7 variables and 166 observations. This questionnaire concerning students’ approaches to learning was collected from the course “Introduction to Social Statistics, fall 2014” (Finnish). The dataset has been modified so that the questions are combined into summary variables (gender, age, attitude, deep, stra, surf and points), and observations with 0 in the variable points have been removed.
summary(lrn2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.625 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.500 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.875 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.796 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.250 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.875 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
library(ggplot2)
In our summary we can see that all variables except gender are numeric: age, attitude and points take integer values, while deep, stra and surf are averaged scale scores.
A <- ggplot(lrn2014, aes(x= attitude, y=points, col=gender))
B <- A +geom_point()
C <- B + geom_smooth(method = "lm")
Figure1 <- C + ggtitle("Figure 1. The relationship between attitude and points")
Figure1
Figure 1 shows how attitude towards statistics correlates with exam points. In this case, as in many others, motivation partly explains success.
D <- ggplot(lrn2014, aes(x=stra, y=points, col=gender))
D1 <- D + geom_point()
D2 <- D1 + geom_smooth(method="lm")
Figure2 <- D2 + ggtitle("Figure 2. The relationship between strategic learning and points")
Figure2
If we look at Figure 2, there seems to be a very weak linear relationship between strategic learning and exam points. A student who used strategic learning methods in general might have gained slightly better exam points.
library(GGally)
ggpairs(lrn2014, lower= list(combo = wrap("facethist", bins = 20)))
As we can see in our plot matrix, the variables attitude (0.437), stra (0.146) and surf (-0.144) have the strongest correlations (in absolute value) with points. Let’s try those as explanatory variables in our regression model.
Regression model 1
my_reg <- lm(points ~ attitude + stra + surf, data = lrn2014 )
summary(my_reg)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Attitude seems to be the only variable that is statistically significant in explaining points. This result is consistent with Figure 2, where the correlation between strategic learning and points was barely noticeable.
Let’s first remove surf and look at our model again.
Regression model 2
my_reg2 <- lm(points ~ attitude + stra, data = lrn2014)
summary(my_reg2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = lrn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
Strategic learning still has too high a p-value, so we cannot rule out that its apparent effect is due to chance.
Let’s try a regression model with attitude as the only explanatory variable.
Regression model 3
my_reg3 <- lm(points ~ attitude, data=lrn2014)
summary(my_reg3)
##
## Call:
## lm(formula = points ~ attitude, data = lrn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
It seems that attitude towards statistics is our key explanatory variable for exam points. The more positive the attitude towards statistics in general, the better the exam outcome.
However, we should consider our model as a whole, so let’s first check our R-squared values.
What I didn’t try last time was the significance of gender. I’m not sure how it behaves, since it is not a continuous variable…
Regression model 4
my_reg4 <- lm(points ~ attitude + gender, data=lrn2014)
summary(my_reg4)
##
## Call:
## lm(formula = points ~ attitude + gender, data = lrn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1724 -3.2092 0.3703 4.0197 10.6162
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.5098 1.8502 6.221 4.01e-09 ***
## attitude 0.3618 0.0595 6.081 8.21e-09 ***
## genderM -0.4835 0.9157 -0.528 0.598
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.331 on 163 degrees of freedom
## Multiple R-squared: 0.1919, Adjusted R-squared: 0.182
## F-statistic: 19.36 on 2 and 163 DF, p-value: 2.863e-08
According to the results, there is almost a 50/50 chance that the negative association between points and gender appears only by chance. If it were more significant, it would mean that women score better in the test.
Problems with gender as a factor in a regression model:
* the gender variable isn’t a scale: “being less of a man gets you better points” isn’t a very reasonable interpretation
* how would gender alone explain anything? It would be more meaningful to compare attitude between genders and perhaps look at the reasons behind the differences (if there are any)
For statisticians it might be obvious that gender doesn’t work as an explanatory variable here, but testing with the model itself and interpreting the results may expose the problems more clearly than just stating the fact. And it’s fun to try!
R-squared in our regression model tells us how well the model fits the observations. To simplify, the bigger the R-squared, the better our model seems to fit the observations.
Our three regression models show that multiple R-squared drops each time we remove a variable. That happens because multiple R-squared increases every time a variable is added, so the drop in itself isn’t a bad thing. What we also see is that the 2nd model has the highest adjusted R-squared. This suggests that even though it has more variables than the 3rd one, it is a better model in quality, and not just a better fit because we add or lose variables.
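The difference between multiple and adjusted R-squared can be sketched with R’s built-in mtcars data (an illustration only, not the course data; the chosen predictors are arbitrary):

```r
# Multiple R-squared never decreases when a predictor is added to a nested
# model, while adjusted R-squared penalizes predictors that add little.
m_small <- lm(mpg ~ wt + hp, data = mtcars)
m_big   <- lm(mpg ~ wt + hp + qsec, data = mtcars)
c(small = summary(m_small)$r.squared,     big = summary(m_big)$r.squared)
c(small = summary(m_small)$adj.r.squared, big = summary(m_big)$adj.r.squared)
```

Comparing the two rows shows why adjusted R-squared is the fairer statistic for model comparison.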
I conclude that we should still consider bringing strategic learning back into our multiple regression model, even though its p-value is above the magical 0.05 line. Now we can check whether regression model 2 is really viable by examining the residuals of our model.
Let’s check the Residuals vs Fitted, Normal Q-Q and Residuals vs Leverage plots for our regression model 2.
par(mfrow = c(2, 2))
plot(my_reg2, which = c(1, 2, 5))
In plot 1 we see that the values are reasonably well scattered throughout the area, which means the errors form no pattern, as is expected from a good model.
In plot 2 the observations line up along the line, so the errors seem to be normally distributed. This is consistent with our assumption, although the model fits less well at both ends of the range of values.
In plot 3 we see that no single observation has too much weight on the model: leverage stays under 0.05 across the whole plot.
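The leverage values behind the Residuals vs Leverage plot can also be extracted numerically with hatvalues(); a quick sketch on R’s built-in mtcars data (not the course data):

```r
# Each observation's leverage is its diagonal entry of the hat matrix.
# For a linear model with p coefficients and n observations, the
# leverages always average to p/n; values far above that flag
# potentially influential points.
m   <- lm(mpg ~ wt, data = mtcars)
lev <- hatvalues(m)
max(lev)    # largest single-observation leverage
mean(lev)   # equals p/n, here 2/32
```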
par(mfrow = c(2, 2))
plot(my_reg3, which = c(1, 2, 5))
The diagnostic plots are quite similar to model 2’s. The biggest difference is in plot 3, Residuals vs Leverage: the points are packed at the left end, but all values stay under 0.05. This means no single observation weighs too much, that is, none is so different from the others that it would give a false idea of the model.
Nevertheless, we concluded in the R-squared comparison that model 2 is better than model 3, so let’s stick to that, although these diagnostics alone would not decide between them.
In our regression model 2, exam points are explained with a linear model using attitude and strategic learning as explanatory variables. Considering the R-squared values and the residual checks, the 2nd regression model indeed seems to explain points best. That means the more positive your attitude towards statistics and the more strategically you wish to learn, the better your outcome in the statistics exam.
This week we’re learning logistic regression, which uses a linear model to predict a binary variable. How exciting! And very, very difficult, since this is an entirely new method for me.
The data wrangling part can be seen on my GitHub: https://github.com/macabset/IODS-project/blob/master/data/create_alc.R
getwd()
## [1] "C:/Users/murmeli/Documents/murmeli/2017/GitHub/IODS-project"
alc2016 <- read.csv("data/create_alc.csv",sep=",", header= T)
head(alc2016)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason nursery internet guardian traveltime studytime failures
## 1 course yes no mother 2 2 0
## 2 course no yes father 1 2 0
## 3 other yes yes mother 1 2 2
## 4 home yes yes mother 1 3 0
## 5 home yes no father 1 2 0
## 6 reputation yes yes mother 1 2 0
## schoolsup famsup paid activities higher romantic famrel freetime goout
## 1 yes no no no yes no 4 3 4
## 2 no yes no no yes no 5 3 3
## 3 yes no yes no yes no 4 3 2
## 4 no yes yes yes yes yes 3 2 2
## 5 no yes yes no yes no 4 3 2
## 6 no yes yes yes yes no 5 4 2
## Dalc Walc health absences G1 G2 G3 alc_use high_use
## 1 1 1 3 5 2 8 8 1.0 FALSE
## 2 1 1 3 3 7 8 8 1.0 FALSE
## 3 2 3 3 8 10 10 11 2.5 TRUE
## 4 1 1 5 1 14 14 14 1.0 FALSE
## 5 1 2 5 2 8 12 12 1.5 FALSE
## 6 1 2 5 8 14 14 14 1.5 FALSE
dim(alc2016)
## [1] 382 35
str(alc2016)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
This is part of the Student Alcohol Consumption Data Set, where we have combined secondary school students’ performance on a math course and a Portuguese language course. The data were collected in 2008 but released in 2016. In our combined dataset we have 382 observations and 35 variables. Most of our variables are discrete numeric values describing the level of agreement or disagreement on a scale. 13 of them are two-level factors, of which eight are yes/no questions.
I have created the variable high_use, our only logical vector, which indicates high alcohol consumption. It was created by averaging weekday and weekend consumption and flagging the high end of that average as high use. The variables used here, but not used for joining the datasets, have been combined by averaging.
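A minimal sketch of how alc_use and high_use were presumably constructed (the toy data frame below is made up; the threshold of 2 follows the course convention):

```r
# made-up example rows; Dalc/Walc are weekday/weekend consumption (scale 1-5)
demo <- data.frame(Dalc = c(1, 2, 4), Walc = c(1, 3, 5))
demo$alc_use  <- (demo$Dalc + demo$Walc) / 2   # average of weekday and weekend use
demo$high_use <- demo$alc_use > 2              # TRUE when the average exceeds 2
demo
```

The full derivation, including the join of the two course datasets, is in the wrangling script linked above.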
The point of this exercise is to find variables connected to (high) alcohol consumption. Based on common-sense foreknowledge I will form hypotheses about the data and its results.
My personal interest is to include gender (sex) as a baseline variable whenever discussing differences in behaviour. Are boys any different from girls in alcohol consumption? I believe that the more you care about your studies (studytime), the less you drink. Similarly, I think that the more you care about your health (health), the less you use alcohol. Lastly, I’ll attach family situation (famrel) to alcohol use: my hypothesis is that the worse things are at home, the more you escape into alcohol.
Sex is a binary factor with options male/female, studytime is a 4-level variable based on a subjective evaluation of study time, and health and famrel are Likert-scale variables based on self-evaluation.
The following figures help explain high use of alcohol in light of our assumptions.
1. Studytime
library(ggplot2)
g1 <- ggplot(alc2016, aes(x = high_use, y = studytime, col=sex))
g2 <- g1 + geom_boxplot() + ggtitle("Figure 1.Studytime by high use of alcohol and gender")
g2
Looking at Figure 1, males and females seem to have very different patterns in how high alcohol use relates to study time. For boys, whether they used a lot of alcohol appears to make no difference to study time, whereas for girls, high users of alcohol seem to study hardly at all and non-high users study a lot more than boys who don’t drink.
2. Family relationships
h1 <- ggplot(alc2016, aes( x= high_use, y= famrel, col= sex))
h2 <- h1 + geom_boxplot()+ ylab("family relations")
h3 <- h2 + ggtitle("Figure 2. Family relationship by alcohol consumption and gender")
h3
Figure 2 supports our assumption that high alcohol consumption goes together with worse family relations. We assume that family relations influence high alcohol use and not the other way around.
3. Health
library(GGally)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
i1 <- ggally_facetbar(alc2016, ggplot2::aes(x= health, y= high_use, fill =sex))
i1 + ggtitle("Figure 3. High use of alcohol by health and gender")
Figure 3 illustrates how many observations fall into high and non-high alcohol use at every level of health (1-5). All in all there are fewer observations in the high-use group than in the non-high-use group. Among students who reported high alcohol use, a surprisingly high proportion rate their health “very good”. In general, more people report good health among those who don’t drink a lot, but if you do drink a lot, you are still more likely to declare good health than poor health. This was rather surprising, so let’s explore the result in our logistic regression.
Generally, in a simple logistic regression model the target variable is binary: in this case, high use of alcohol or not. So our model predicts the probability of being a high alcohol user, given a person’s levels of health, famrel, sex and studytime.
m1 <- glm(high_use ~ health + famrel + sex + studytime, data = alc2016, family = "binomial")
summary(m1)
##
## Call:
## glm(formula = high_use ~ health + famrel + sex + studytime, family = "binomial",
## data = alc2016)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4630 -0.8632 -0.6525 1.1891 2.1629
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.64999 0.65100 0.998 0.31806
## health 0.06094 0.08539 0.714 0.47543
## famrel -0.30716 0.12677 -2.423 0.01540 *
## sexM 0.70691 0.24605 2.873 0.00407 **
## studytime -0.46050 0.15662 -2.940 0.00328 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 434.61 on 377 degrees of freedom
## AIC: 444.61
##
## Number of Fisher Scoring iterations: 4
coef(m1)
## (Intercept) health famrel sexM studytime
## 0.64999130 0.06094206 -0.30715569 0.70690830 -0.46049830
It looks like health indeed plays no crucial role in high alcohol use. So let’s fit our model again, but without health as a coefficient.
m2 <- glm(high_use ~ famrel + sex + studytime, data = alc2016, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ famrel + sex + studytime, family = "binomial",
## data = alc2016)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.530 -0.854 -0.666 1.218 2.157
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.8395 0.5955 1.410 0.15861
## famrel -0.2982 0.1259 -2.368 0.01786 *
## sexM 0.7257 0.2444 2.970 0.00298 **
## studytime -0.4679 0.1563 -2.994 0.00275 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 435.12 on 378 degrees of freedom
## AIC: 443.12
##
## Number of Fisher Scoring iterations: 4
coef(m2)
## (Intercept) famrel sexM studytime
## 0.8394792 -0.2981605 0.7256730 -0.4678942
The coefficients famrel, sexM and studytime are on the log-odds scale of high_use. Every one-unit increase in a predictor changes the log-odds of being a high alcohol user by the coefficient’s value: upwards for sexM, downwards for famrel and studytime. For example, the negative coefficient of studytime means that, all other things being equal, the more time a student spends studying, the less likely high alcohol use becomes. In our logistic model 2, all coefficients are statistically significant at the 0.05 level, so we can use all of them in the last phase.
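The link between the linear predictor and the probability can be sketched with R’s built-in inverse logit, plogis(). The coefficient values below are rounded from the coef(m2) output above; the covariate values for the example student are made up:

```r
# log-odds for a hypothetical male student with famrel = 4 and studytime = 2,
# using rounded model 2 coefficients (intercept, famrel, sexM, studytime)
eta <- 0.84 + (-0.30) * 4 + 0.73 * 1 + (-0.47) * 2
p   <- plogis(eta)   # inverse logit: 1 / (1 + exp(-eta))
p                    # roughly 0.36
```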
Confidence intervals for the coefficients as odds ratios
OR <- coef(m2) %>% exp
CI <- exp(confint(m2))
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 2.3151609 0.7211129 7.5048006
## famrel 0.7421822 0.5789910 0.9501382
## sexM 2.0661211 1.2835027 3.3512991
## studytime 0.6263198 0.4569282 0.8446258
With odds ratios, we compare the odds of high alcohol use across the properties famrel, sex and studytime against the odds of non-high use with the same properties. If the odds ratio (OR) is higher than 1, the property is positively associated with high alcohol use.
Gender seems to be the only factor positively associated with high alcohol consumption. sexM also has the widest confidence interval: there is a 95% probability that the OR for sexM lies between about 1.3 and 3.4. The wider the interval, the less precise our point estimate of 2.066. The odds of high alcohol consumption for males are about twice as large as for females.
An OR under one means the probability of high alcohol consumption increases as the property decreases; take famrel, for example. In plain English: the lower the famrel, the higher the alcohol consumption. This is as predicted in the hypothesis.
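The odds-ratio reading can be sketched directly from a coefficient (values rounded from the coef(m2) output above):

```r
# a log-odds coefficient of 0.726 for sexM corresponds to an odds ratio of
exp(0.726)    # about 2.07: odds of high use roughly double for males
# and an OR below 1 comes from a negative coefficient, e.g. famrel:
exp(-0.298)   # about 0.74: each extra famrel point lowers the odds of high use
```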
Let’s see how well our model predicts. For every observation, we compute the predicted probability of being a high alcohol consumer, using our model 2 as the predictor. Then, to make the prediction binary, we use a 0.5 cut-off: if our model predicts a probability of high consumption greater than 0.5, it states high_use as TRUE. Then we compare the prediction to the actual status of the observation: is it true that the student is a high consumer when our model predicts so?
probabilities <- predict(m2, type = "response")
alc2016 <- mutate(alc2016, probability = probabilities)
alc2016 <- mutate(alc2016, prediction = probability > 0.5)
table(high_use = alc2016$high_use, prediction = alc2016$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 255 13
## TRUE 102 12
Picturing the prediction
library(dplyr); library(ggplot2)
g <- ggplot(alc2016, aes(x = probability, y = high_use, col=prediction))
g + geom_point()
table(high_use = alc2016$high_use, prediction = alc2016$prediction) %>% prop.table()%>%addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66753927 0.03403141 0.70157068
## TRUE 0.26701571 0.03141361 0.29842932
## Sum 0.93455497 0.06544503 1.00000000
When the model predicts FALSE for high alcohol use, about 67% of all observations really are FALSE, but about 27% of the time the real data give TRUE instead. Then again, when the model predicts TRUE, the real data are FALSE about as often as TRUE.
How well did we do in our prediction?
Testing error
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc2016$high_use, prob = alc2016$probability)
## [1] 0.3010471
Here we have calculated the loss of our model. The 30% figure is close to our earlier observation that 27% of observations are high users the model predicts as false. As most of the students are not high users of alcohol, the model got pretty close to the real error. We have a 30% “penalty”, which means that 30% of our predictions are classified incorrectly compared to the actual data.
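The same ~30% training error can be recovered by hand from the confusion matrix above:

```r
# counts copied from the cross-tabulation of high_use vs. prediction
conf <- matrix(c(255, 102, 13, 12), nrow = 2,
               dimnames = list(high_use   = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))
wrong <- conf["FALSE", "TRUE"] + conf["TRUE", "FALSE"]  # misclassified cases
wrong / sum(conf)                                        # (13 + 102) / 382
```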
Considering our original hypotheses, I’d say they have some value. Self-evaluated health was discarded in the first phase of our logistic regression: it seems to have no statistically significant role in predicting high alcohol use. Our regression model 2 supports my original hypotheses: combining family relationships, gender and study time, our model predicts high alcohol use correctly 70% of the time. Individually the coefficients work as I theorized: men are more likely to be high consumers than women, the worse your family relations are the more likely you are a high consumer, and the less time you spend on your studies the more likely you are a high user of alcohol.
What if our model never guessed right? As you recall, we got a 30% error, which means that 70% of all cases are classified correctly. If the model never guessed right, the training error would be 100%: every time it predicts that a person is a high user, he/she would not be one. Then again, if our model were perfect, there would be no training error at all, and every predicted high user would actually be a high user in our data.
Altogether, our model had a slightly higher error than in the DataCamp exercises (0.26, with the predictors failures, absences, sex). Is that a good or a bad thing? Well, I’d suppose that we never get results in the human sciences as good as in medical diagnoses or text-based research: people are complicated. 70% is not bad, but I think that with our 35 variables we could do a bit better than that, as the comparison with the DataCamp predictors shows.
***
This week we’ll divide the data with two methods, using R’s own Boston dataset.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
This week we will use the Boston dataset, one of the datasets shipped with R (in the MASS package). Above are shown the structure and dimensions of the dataset. It describes housing values in the suburbs of Boston, USA: 14 variables for each of the 506 observations, where each observation concerns one town or suburb.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The summary of our 14 variables shows that all but two of them are numeric and continuous. There is a binary variable chas, which equals 1 if the tract is situated by the Charles river, and there is also rad, which gives an index of accessibility to radial highways.
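The binary chas variable can be tabulated directly (MASS ships with standard R distributions, so this runs as-is):

```r
library(MASS)
data("Boston")
table(Boston$chas)   # 0 = tract not on the Charles river, 1 = riverside tract
```

Only a small minority of tracts border the river, which matches the mean of 0.069 in the summary above.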
library(dplyr); library(MASS); library(corrplot)
cor_matrix<-cor(Boston) %>% round(digits=2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)
Above, the correlations between all pairs of variables are visualized. Only the upper triangle is needed, since the matrix is symmetric. The bigger the dot and the stronger the colour, the stronger the correlation.
The strongest positive correlation is between tax and rad (corr. 0.91): the better the access to radial highways, the bigger the property tax rate.
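That single entry of the correlation matrix can be checked directly from the data:

```r
library(MASS)
data("Boston")
round(cor(Boston$tax, Boston$rad), 2)   # 0.91
```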
Scaling the data
We use the scale() function to standardize our data. When the variables are on a common scale, we can calculate meaningful distances between observations. We also want the scaled data as a data frame instead of a matrix.
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled<- as.data.frame(boston_scaled)
Because we have scaled our data, every variable now has mean 0 and standard deviation 1 (the most extreme value, in crim, lies about 10 standard deviations above the mean). Now we can relate all the variables to each other and see the real distances between observations. This will help us in the clustering part.
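As a minimal sketch of what scale() does under the hood: it is simply subtracting the mean and dividing by the standard deviation, column by column. A toy vector illustrates this:

```r
# Standardization by hand on a toy vector
x <- c(2, 4, 6, 8, 10)
standardized <- (x - mean(x)) / sd(x)   # subtract mean, divide by sd

# scale() gives the same result (as a one-column matrix)
all.equal(as.numeric(scale(x)), standardized)  # TRUE
mean(standardized)  # 0
sd(standardized)    # 1
```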
Next we are going to divide the data with a classification method. Classification requires that the groups are known in advance, and we will use the crime rate for this.
First we need to prepare our chosen variable by creating rate categories: from the scaled Boston dataset we take the continuous variable crim and make it categorical.
boston_scaled <- scale(Boston)
boston_scaled<- as.data.frame(boston_scaled)
scaled_crim <- boston_scaled$crim
summary(scaled_crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
crim_quant <- quantile(scaled_crim)
crime <- cut(scaled_crim, breaks = crim_quant, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
After scaling we can add a new variable crime to our dataset. The crime variable has the categories low, med_low, med_high and high, based on the quantiles of crim: for example, the low category holds the 127 observations that fall below the 25% quantile. Since the old crim isn't necessary anymore, we drop it from our scaled Boston data.
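The quantile-based categorization above can be sketched on hypothetical toy data (100 simulated values instead of the Boston crim variable):

```r
# Toy illustration of cut() with quantile breaks
set.seed(1)
v <- rnorm(100)
q <- quantile(v)  # min, 25%, 50%, 75%, max

# Four labeled categories, one per quartile
cats <- cut(v, breaks = q, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(cats)  # roughly a quarter of the observations per category
```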
Now we're ready to divide the data into classes with Linear Discriminant Analysis (LDA). LDA finds linear combinations of features that separate the classes. It is also a dimension reduction technique: instead of single variables explaining the target, we get dimensions of combined variables. In this case, we will find which dimensions predict crime rates in Boston.
Before we can build a classification model, we have to be able to test how well it works: we create a train set for fitting the model and a test set for checking its predictions. The dataset is split so that 80% of the observations go to the train set and 20% to the test set.
We also remove the crime variable from the test set and save it separately, so we can use it as the target for validation.
n <- nrow(boston_scaled)
random <- sample(n, size = n * 0.8)
# train set
train <- boston_scaled[random,]
# test set
test <- boston_scaled[-random,]
#save crime from test
correct_classes <- test$crime
# remove crime from test data
test <- dplyr::select(test, -crime)
In our LDA, crime is the target variable, explained by all the other variables. Now we're ready to run the LDA on the train set: it creates a model that predicts the crime category from the remaining variables.
# linear discriminant analysis
lda.fit <- lda(formula= crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2376238 0.2475248 0.2500000 0.2648515
##
## Group means:
## zn indus chas nox rm
## low 1.02082910 -0.9803018 -0.14929469 -0.8912708 0.50638203
## med_low -0.07176073 -0.3055205 0.04263895 -0.5829083 -0.14328484
## med_high -0.39214634 0.2380667 0.19544522 0.4337793 0.03577751
## high -0.48724019 1.0170108 -0.08835242 1.0802262 -0.38301232
## age dis rad tax ptratio
## low -0.9330567 0.8585870 -0.6995397 -0.7159505 -0.4923682
## med_low -0.3540783 0.3742573 -0.5534930 -0.4744780 -0.1046371
## med_high 0.3931837 -0.3941871 -0.3883071 -0.2722879 -0.2483720
## high 0.8133280 -0.8398076 1.6392096 1.5148289 0.7820356
## black lstat medv
## low 0.38799337 -0.81094105 0.59048319
## med_low 0.31615216 -0.13074572 0.01198134
## med_high 0.05618209 0.08691973 0.10796184
## high -0.88333898 0.84855082 -0.70004794
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11544673 0.72875022 -0.87981346
## indus 0.07643029 -0.63017300 0.30589945
## chas -0.08487136 -0.05040852 0.15272384
## nox 0.38289048 -0.48594368 -1.47411895
## rm -0.08257263 -0.07870965 -0.19206268
## age 0.28430841 -0.24976576 -0.06661798
## dis -0.08428909 -0.25646450 0.21947636
## rad 3.06212855 0.68737388 0.13023495
## tax -0.13791239 0.35956062 0.49993352
## ptratio 0.12803540 0.04455064 -0.36699510
## black -0.12429773 -0.01125710 0.10642584
## lstat 0.19605136 -0.27899847 0.31061888
## medv 0.16728241 -0.36483249 -0.25459217
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9417 0.0436 0.0147
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
We want to find out which combination of variables predicts the crime rate in Boston. The red arrows in the middle show how the original variables contribute to the discriminants. (In our earlier interpretation of the variables, tax and rad had a strong correlation; the rad arrow also plays a very strong role in our LDA.)
Now let's use our test set to see whether our model really predicts the crime rates correctly.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 13 15 3 0
## med_low 2 19 5 0
## med_high 0 8 17 0
## high 0 0 0 20
Here is a cross tabulation of the crime categories our model predicted against the real categories from the test set.
When the real value was low, the model predicted low about half of the time, but almost as often med_low, plus two instances of med_high. The situation is much the same with med_low: about half of the predictions are right and half are not. Med_high fares somewhat better, and high crime rates are predicted nearly perfectly. So the high end of the crime rate is definitely predicted better than the low end.
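For a single summary number, the overall accuracy can be computed from the cross tabulation. A sketch; the counts below are copied from the run above and will differ between runs, since the train/test split is random:

```r
# Confusion matrix from the run above (random split, so values vary)
conf <- matrix(c(13, 15,  3,  0,
                  2, 19,  5,  0,
                  0,  8, 17,  0,
                  0,  0,  0, 20),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# Correct predictions sit on the diagonal
sum(diag(conf)) / sum(conf)  # overall accuracy, here about 0.68
```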
Now we scale the data again, this time for a clustering method. The difference from LDA is that the classes are not defined beforehand: instead of using crime as the basis for the classes, we let k-means find them.
Again, we need scaling to make the distances comparable.
data("Boston")
boston_scaled2 <- scale(Boston)
summary(boston_scaled2)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
class(boston_scaled2)
## [1] "matrix"
boston_scaled2<- as.data.frame(boston_scaled2)
With clustering we do not know the proper number of classes beforehand, so we need to find it out using a distance measure. This time I am using the standard Euclidean distance.
dist_eu <- dist(boston_scaled2)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
With the k-means method, observations are grouped into a chosen number of clusters by assigning each observation to the nearest cluster mean. To find the optimal number of clusters, we plot the total within-cluster sum of squares (TWCSS) for 1 to 10 clusters: the curve shows a sharp drop at 2 clusters, which therefore seems to be the optimal number for our data.
library(ggplot2); library(GGally)
set.seed(123)
# euclidean distance matrix
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results
plot(1:k_max, twcss, type='b')
# k-means clustering
km <-kmeans(dist_eu, centers = 2)
boston_scaled2$km <- as.factor(km$cluster)
# plot the Boston dataset with clusters
ggpairs(boston_scaled2, ggplot2::aes(colour=km))
## Warning in cor(x, y, method = method, use = use): the standard deviation is
## zero
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In this exercise we used two different methods to divide our data into groups based on distances. With the first method (classification with LDA), we divided the data into 4 classes of crime rate. It was quite accurate at the high end of crime but did not predict the low and med_low classes very well.
With the second method we clustered the data with no prior assumption about the classes and let k-means find them. K-means found two clusters to be optimal, which we can see in the pairs plot. The only problem is that without something like LDA we do not know what separates the two clusters, so we have no good way of saying whether the 2 clusters given by k-means are meaningful for the data or not.
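When true labels do exist, one way to sanity-check clusters is to cross-tabulate them against those labels. A sketch using the built-in iris data as a stand-in, since the Boston data has no such ground truth:

```r
# k-means on the scaled iris measurements, then compare clusters to species
set.seed(123)
km_iris <- kmeans(scale(iris[, 1:4]), centers = 2)

# Rows: cluster assignments; columns: the known species labels
table(cluster = km_iris$cluster, species = iris$Species)
```

A table like this shows at a glance whether the data-driven clusters line up with a known grouping; for Boston we could do the same against the crime categories, if we chose to treat them as ground truth.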
This week we aim to reduce dimensions in order to see the relations between the variables in the data more clearly.
#Packages we need in this workout
library(GGally)
library(dplyr)
library(corrplot)
library(FactoMineR)
library(ggplot2)
library(tidyr)
The first part of our dimensionality reduction concerns continuous variables. Dimensionality reduction in general aims to find the most essential dimensions that explain the data. In Principal Component Analysis (PCA), we try to find the most essential components (usually PC1 and PC2) that best explain the data.
This dataset originates from the United Nations Development Programme (http://hdr.undp.org/en/content/human-development-index-hdi). Its goal is to compare nations through human capabilities, with more sophisticated methods than just looking at GNP.
Variables
I have created a subset called “human” with 8 variables and 155 observations. Variables modified from original dataset are Country, sex_edu2, Lab_ratio and GNI. For further information about modifications made, you can go and check my github repository: https://github.com/macabset/IODS-project/blob/master/data/create_human.R
Underneath are the names of the 8 variables and their description.
Country: used as rownames, regions not included
Observations
In addition, I have excluded observations with missing values from the data set.
getwd()
## [1] "C:/Users/murmeli/Documents/murmeli/2017/GitHub/IODS-project"
human <- read.csv("data/human.csv", sep=",", header= T, row.names = 1)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Sex_edu2 : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Lab_ratio : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu_expect : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life_expect : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 166 135 156 139 140 137 127 154 134 117 ...
## $ Mom_death : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Young_birth : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Present_parl: num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155 8
head(human)
## Sex_edu2 Lab_ratio Edu_expect Life_expect GNI Mom_death
## Norway 1.0072389 0.8908297 17.5 81.6 166 4
## Australia 0.9968288 0.8189415 20.2 82.4 135 6
## Switzerland 0.9834369 0.8251001 15.8 83.0 156 6
## Denmark 0.9886128 0.8840361 18.7 80.2 139 5
## Netherlands 0.9690608 0.8286119 17.9 81.6 140 6
## Germany 0.9927835 0.8072289 16.5 80.9 137 7
## Young_birth Present_parl
## Norway 7.8 39.6
## Australia 12.1 30.5
## Switzerland 1.9 28.5
## Denmark 5.1 38.0
## Netherlands 6.2 36.9
## Germany 3.8 36.9
Summaries
summary(human)
## Sex_edu2 Lab_ratio Edu_expect Life_expect
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mom_death Young_birth Present_parl
## Min. : 2.00 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 53.50 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 99.00 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 98.73 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.:143.50 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :194.00 Max. :1100.0 Max. :204.80 Max. :57.50
All our variables are numeric and continuous, which is what we want for PCA. But before we jump into the PCA, let's look at what we can say at this stage.
cor(human)
## Sex_edu2 Lab_ratio Edu_expect Life_expect
## Sex_edu2 1.000000000 0.009564039 0.59325156 0.5760299
## Lab_ratio 0.009564039 1.000000000 0.04732183 -0.1400125
## Edu_expect 0.593251562 0.047321827 1.00000000 0.7894392
## Life_expect 0.576029853 -0.140012504 0.78943917 1.0000000
## GNI 0.177158634 -0.012970053 0.12134933 0.1928623
## Mom_death -0.660931770 0.240461075 -0.73570257 -0.8571684
## Young_birth -0.529418415 0.120158862 -0.70356489 -0.7291774
## Present_parl 0.078635285 0.250232608 0.20608156 0.1700863
## GNI Mom_death Young_birth Present_parl
## Sex_edu2 0.177158634 -0.6609318 -0.5294184 0.078635285
## Lab_ratio -0.012970053 0.2404611 0.1201589 0.250232608
## Edu_expect 0.121349332 -0.7357026 -0.7035649 0.206081561
## Life_expect 0.192862316 -0.8571684 -0.7291774 0.170086312
## GNI 1.000000000 -0.1001664 -0.1334416 0.002672262
## Mom_death -0.100166367 1.0000000 0.7586615 -0.089439999
## Young_birth -0.133441591 0.7586615 1.0000000 -0.070878096
## Present_parl 0.002672262 -0.0894400 -0.0708781 1.000000000
ggpairs(human, title = "Table 1. Human data correlation matrix for paired variables")
cor(human) %>% corrplot( method="circle", type="upper", title= "Graph 1. Correlations between variables in human data")
Our correlation plots show that the strongest correlation is between Mom_death and Life_expect (-0.86), followed closely by Mom_death and Edu_expect, Young_birth and Life_expect, and Young_birth and Edu_expect.
From these preliminary results we can already assume that maternal mortality and adolescent births go hand in hand with lower education and life expectancy. All in all, this suggests that if we want human capacity to be used well, we need to take care of mothers and allow family planning at a young age.
In Principal Component Analysis (PCA), the variables are reduced to a few principal components that describe the data. With this method we can picture our 8-dimensional dataset in 2D. At the same time the original variables are kept, and their relations to the principal components can be evaluated.
PCA
pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 214.3202186 54.4858892 26.3814123 11.4791149 4.0668732 1.6067062
## [7] 0.1905111 0.1586732
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4
## Sex_edu2 -0.0007468424 4.707363e-04 -0.0001689810 -0.0004252438
## Lab_ratio 0.0002210928 5.783875e-05 -0.0007563442 -0.0047679411
## Edu_expect -0.0098109042 2.441895e-03 -0.0223269972 -0.0369246053
## Life_expect -0.0334485095 1.573401e-02 -0.0303860782 -0.0780112042
## GNI -0.0278166075 9.979413e-01 0.0557767706 -0.0002740466
## Mom_death 0.9879825869 3.620190e-02 -0.1474236772 -0.0068326906
## Young_birth 0.1479114018 -5.046353e-02 0.9867754042 -0.0069374879
## Present_parl -0.0048164618 -1.494485e-03 -0.0026652825 -0.9962093264
## PC5 PC6 PC7 PC8
## Sex_edu2 0.0002964017 -0.0254740423 6.859097e-01 7.272399e-01
## Lab_ratio -0.0034895150 -0.0320024527 7.265303e-01 -6.863629e-01
## Edu_expect -0.1947276781 -0.9790080609 -4.054956e-02 3.992920e-03
## Life_expect -0.9759219723 0.1979058277 5.461806e-03 2.081465e-03
## GNI 0.0150268685 -0.0014692010 -3.631899e-04 -3.767644e-04
## Mom_death -0.0282554922 -0.0005722709 1.213189e-04 8.299757e-04
## Young_birth -0.0393039689 -0.0160315111 -4.359137e-05 -9.463203e-05
## Present_parl 0.0841200882 0.0210694313 -2.695182e-03 2.658636e-03
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col= c("grey40", "deeppink2"), main = "Graph 2. PCA with human data, unscaled")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
First I have run the PCA and drawn the biplot without scaling the data. In the biplot, the observations are in grey and the correlation arrows in pink. Where the observations and arrows land in the plot is determined by the first two principal components, PC1 and PC2. The arrows connect the original variables to the principal components.
Here the variables are not measured in the same units, so we have to scale them to give each one the same influence. Now let's do it properly with standardized variables.
Scaled PCA
human.std <- scale(human)
pca_human2 <- prcomp(human.std)
biplot(pca_human2, choices= 1:2, cex = c(0.8, 1), col= c("grey40", "deeppink2"), main="Graph 3. PCA with human data, scaled")
If we compare the scaled and the unscaled version, we clearly see that the correlation arrows look quite different. The scaled version is better because it makes the variables comparable.
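A minimal toy demonstration of why the two biplots differ so much: when one variable has a much larger variance than the others, the unscaled PCA is dominated by it almost completely (as GNI dominates our unscaled plot).

```r
# Two independent toy variables with very different variances
set.seed(42)
toy <- data.frame(a = rnorm(100, sd = 1), b = rnorm(100, sd = 100))

unscaled <- prcomp(toy)
scaled   <- prcomp(toy, scale. = TRUE)

# Share of total variance captured by PC1:
unscaled$sdev[1]^2 / sum(unscaled$sdev^2)  # close to 1: b alone drives PC1
scaled$sdev[1]^2 / sum(scaled$sdev^2)      # close to 0.5: both vars matter
```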
Next, we modify the plot so that the percentage of variance captured by each principal component is shown. In PCA, the more variance a component captures, the more important it is.
Scaled and modified PCA
s <- summary(pca_human2)
pca_pr <- round(100*s$importance[2, ], digits = 1)
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], main="Graph 4. PCA with human data, scaled, %")
And underneath is a graph from the same scaled PCA, with the correlation arrows drawn in a separate graph from the observations, since they are a bit hard to see in the original picture.
Scaled PCA with correlation arrows
#Draw a pca with only arrows
es.pca = PCA(human, scale.unit= TRUE, ncp=5)
Together the two principal components capture 64% of the total variance of the dataset.
A high positive correlation can be seen between the ratio of women to men in second-level education and the expected years of education. It is not very surprising that the expected education level also predicts the ratio of women to men in education. It is also quite understandable that gross national income is a good predictor of the life expectancy of a country's citizens: the wealthier the country, the longer its citizens live. All of these are strongly correlated with PC1, which explains almost half of the variation, so these variables seem to be the most important for understanding the wellbeing of a nation.
The more mothers die, the more young people give birth to children. These, on the other hand, are negatively correlated with PC1: the more mothers die and the more young people give birth, the worse the state of the nation.
The correlation between the ratio of women to men in the labour force and the share of women in parliament is visible but not as strong as the previous connections. Nevertheless, the more women in the labour market, the better their representation in parliament. These are positively correlated with PC2: the better women are represented in the labour market and in parliament, the better the state of the nation.
In this section we need different data to work with Multiple Correspondence Analysis (MCA). MCA is a dimensionality reduction technique for discrete data: it is designed for qualitative (categorical) variables, which is why we cannot use our human data set as it is now.
Our goal is to find out whether groups are close to or far from each other. The closer two categories are, the more often the same people have answered both. Also, the larger a category's coordinate on the x or y axis, the stronger its association with the corresponding dimension.
The next part of reducing dimensions uses the "tea" dataset from the FactoMineR package. It is a questionnaire about how people drink tea and about their drinking habits. For more information, see the website: http://factominer.free.fr/classical-methods/multiple-correspondence-analysis.html.
The data set has 300 observations and 36 variables. Most variables are factors with two or more levels, such as "sugar" vs "No.sugar"; age is an exception, appearing both as a continuous variable (age) and as a categorical one (age_Q).
#ggplot2, dplyr, tidyr and FactoMineR available
data(tea)
my_tea <- data.frame(tea)
dim(my_tea)
## [1] 300 36
str(my_tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(my_tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
To find out which might be the most interesting variables, let's make some pictures!
Picturing the tea data
Bar plot
gather(my_tea) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
For the sake of simplicity, I'll pick a subset my_tea1 with 8 variables: breakfast, sex, price, healthy, spirituality, tearoom, Tea and friends.
#Columns for our subset my_tea1
keep_columns <- c("breakfast", "sex", "price", "healthy", "spirituality", "friends", "Tea", "tearoom")
my_tea1 <- dplyr::select(my_tea, one_of(keep_columns))
#Picturing with bars
gather(my_tea1) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
The idea is to find out whether these groups have anything in common with each other. MCA examines the links between variables based on the variance of the observations. Does spirituality occur together with expensive tea? Does gender have anything to do with whether tea is assumed to be healthy or not? Now we're ready to run the MCA on our my_tea1.
MCA graph 1
mca <- MCA(my_tea1, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = my_tea1, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.193 0.165 0.150 0.147 0.139 0.128
## % of var. 11.885 10.164 9.218 9.031 8.544 7.859
## Cumulative % of var. 11.885 22.050 31.268 40.299 48.843 56.702
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.121 0.115 0.110 0.102 0.096 0.087
## % of var. 7.429 7.054 6.784 6.297 5.928 5.371
## Cumulative % of var. 64.131 71.185 77.970 84.267 90.195 95.565
## Dim.13
## Variance 0.072
## % of var. 4.435
## Cumulative % of var. 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.767 1.015 0.144 | 0.347 0.242 0.029 | 0.713
## 2 | 0.261 0.117 0.057 | 0.285 0.164 0.069 | -0.327
## 3 | -0.400 0.276 0.233 | -0.236 0.113 0.081 | 0.005
## 4 | 0.057 0.006 0.003 | -0.199 0.080 0.034 | 0.209
## 5 | -0.005 0.000 0.000 | 0.067 0.009 0.003 | 0.484
## 6 | 0.641 0.709 0.171 | -0.080 0.013 0.003 | 0.234
## 7 | -0.151 0.039 0.028 | 0.019 0.001 0.000 | 0.020
## 8 | 0.262 0.118 0.059 | 0.161 0.052 0.022 | -0.149
## 9 | 0.533 0.491 0.180 | 0.465 0.437 0.137 | 0.375
## 10 | 0.503 0.436 0.118 | 1.130 2.577 0.594 | -0.477
## ctr cos2
## 1 1.132 0.125 |
## 2 0.238 0.090 |
## 3 0.000 0.000 |
## 4 0.097 0.037 |
## 5 0.521 0.164 |
## 6 0.122 0.023 |
## 7 0.001 0.001 |
## 8 0.050 0.019 |
## 9 0.313 0.089 |
## 10 0.505 0.106 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## breakfast | -0.002 0.000 0.000 -0.030 | 0.211 1.613
## Not.breakfast | 0.002 0.000 0.000 0.030 | -0.194 1.489
## F | -0.357 4.897 0.186 -7.458 | -0.173 1.347
## M | 0.521 7.144 0.186 7.458 | 0.253 1.965
## p_branded | 0.289 1.708 0.039 3.398 | -0.960 22.087
## p_cheap | 0.678 0.694 0.011 1.812 | 0.329 0.191
## p_private label | 0.723 2.369 0.039 3.431 | 0.274 0.398
## p_unknown | 0.192 0.095 0.002 0.677 | -0.040 0.005
## p_upscale | 0.563 3.621 0.068 4.507 | 1.183 18.709
## p_variable | -0.710 12.168 0.300 -9.471 | 0.187 0.986
## cos2 v.test Dim.3 ctr cos2 v.test
## breakfast 0.041 3.500 | -0.286 3.276 0.075 -4.751 |
## Not.breakfast 0.041 -3.500 | 0.264 3.024 0.075 4.751 |
## F 0.044 -3.618 | -0.242 2.905 0.086 -5.059 |
## M 0.044 3.618 | 0.353 4.238 0.086 5.059 |
## p_branded 0.427 -11.300 | -0.341 3.069 0.054 -4.012 |
## p_cheap 0.003 0.880 | 1.264 3.110 0.038 3.378 |
## p_private label 0.006 1.301 | 0.150 0.131 0.002 0.709 |
## p_unknown 0.000 -0.142 | 2.649 23.430 0.292 9.351 |
## p_upscale 0.300 9.475 | -0.267 1.049 0.015 -2.137 |
## p_variable 0.021 2.493 | 0.024 0.019 0.000 0.326 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.000 0.041 0.075 |
## sex | 0.186 0.044 0.086 |
## price | 0.319 0.560 0.369 |
## healthy | 0.010 0.044 0.413 |
## spirituality | 0.082 0.019 0.000 |
## friends | 0.412 0.000 0.000 |
## Tea | 0.272 0.340 0.162 |
## tearoom | 0.264 0.273 0.092 |
# visualize the MCA; individual points are hidden so only the variable categories show
plot(mca, invisible = c("ind"), habillage = "quali")
The first two dimensions retain 22% of the variance of the data. With so little variance retained, it is not surprising that the squared correlation (eta2) between either of the first two dimensions and our variables is at most 0.41 (between dim 1 and friends, in the categorical variables table). This can also be seen in the graph: most of the groups sit near the centre, showing no strong relation to either dimension. There seem to be no strong links between the variables and the dimensions.
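The retained variances can also be read programmatically from the model object instead of from summary(). A small sketch, assuming the column names as FactoMineR stores them in the eigenvalue table:

```r
# Eigenvalues and (cumulative) percentages of variance per dimension
mca$eig

# Cumulative percentage of variance of the first two dimensions (~22 %)
mca$eig[2, "cumulative percentage of variance"]
```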
Nevertheless, we can make some simple interpretations from our data.
Regarding dimensions 1 and 2, it is hard to see clearly separated groups away from the centre. Near the centre we can see from the picture that men drink cheap and private-label tea, while women drink tea with friends and associate it with health and spirituality.
Also, healthy and Not.breakfast seem to be the groups closest to each other, but at the same time they sit almost at the zero point of both dimensions, showing little variance between observations. Maybe those who skip breakfast think tea is so healthy they don't even need one.
Somehow black-tea drinkers are associated with high tea prices, although black tea is usually not the most expensive sort on the market (at least in Finland).
MCA graph 2
res.mca <- MCA(my_tea1)
plotellipses(res.mca,keepvar=3)
With our additional graph 2 and its third plot, we see that price, Tea and tearoom correlate most strongly with dimension 2, and friends with dimension 1. The same can be seen from the categories in the first MCA graph: drinking tea in tearooms, higher prices and black tea are the ones most positively connected to dimension 2, while branded tea is the one most negatively correlated with it. Green tea and not drinking with friends are most positively correlated with dimension 1, while drinking in a tearoom and price in general are negatively correlated with it.
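These readings can be double-checked numerically: the MCA result object stores the category coordinates and the variable-dimension squared correlations (eta2), so the claims about dimensions 1 and 2 do not have to rest on the plots alone. A small sketch, assuming res.mca from the chunk above:

```r
# Category coordinates on the first two dimensions
round(res.mca$var$coord[, 1:2], 2)

# Squared correlations (eta2) between variables and dimensions
round(res.mca$var$eta2[, 1:2], 2)
```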
Groups that are close to each other and show the highest correlations with the dimensions should be considered the most influential ones. Thus, although interesting, the separation between males and females is not as trustworthy as the link between high prices and black tea.

***