About the project

This is an Open Data Science course on analysing and presenting data with the R software.

This year I’m here as a course assistant, and I will be making some modifications to the following chapters as we go.

Here is the link to my Github repository, where you can see how all the magic is done: https://github.com/macabset/IODS-project


Regression analysis - why attitude matters

“This week I have completed the DataCamp exercises, peer reviewed RStudio exercise 1 and, surprisingly enough, learned for the first time to create regression models in RStudio exercise 2. This has been an excellent opportunity to learn to build and interpret regression models thoroughly and from many angles.” - January 2017

Part 1

Part 1 is the data wrangling exercise, which does not appear in this course diary. If you are interested, you can check it in my GitHub: https://github.com/macabset/IODS-project/blob/master/data/create_learning2014.R

getwd()
## [1] "C:/Users/murmeli/Documents/murmeli/2017/GitHub/IODS-project"
lrn2014 <- read.csv("learning2014.csv")

head(lrn2014)
##   gender age attitude  deep  stra     surf points
## 1      F  53       37 3.750 3.375 2.583333     25
## 2      M  55       31 2.875 2.750 3.166667     12
## 3      F  49       25 3.875 3.625 2.250000     24
## 4      M  53       35 3.500 3.125 2.250000     10
## 5      M  49       37 3.750 3.625 2.833333     22
## 6      F  38       38 4.875 3.625 2.416667     21

Part 2: Analysis (max 15p)

2.1: Exploring dataset

str(lrn2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.75 2.88 3.88 3.5 3.75 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(lrn2014)
## [1] 166   7

Here we can see the learning2014 dataset, which has 7 variables and 166 observations. This questionnaire concerning students’ approaches to learning was collected on the course “Introduction to Social Statistics, fall 2014” (taught in Finnish). The dataset has been modified so that groups of questions have been combined into single variables (gender, age, attitude, deep, stra, surf and points), and observations with 0 in the variable points have been removed.

2.2 Describing dataset

Summary

summary(lrn2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.625   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.500   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.875   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.796   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.250   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.875   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00
library(ggplot2)

In the summary we can see that all variables except gender are numeric; gender is a two-level factor (F/M).

Picturing our dataset

A <- ggplot(lrn2014, aes(x= attitude, y=points, col=gender))
B <- A +geom_point()
C <- B + geom_smooth(method = "lm")
Figure1 <- C + ggtitle("Figure 1. The relationship between attitude and points")
Figure1

Figure 1 shows how attitude towards statistics correlates with exam points. In this case, as in many others, motivation partly explains success.

EXTRA: Relating strategic learning to points.

D <- ggplot(lrn2014, aes(x=stra, y=points, col=gender))
D1 <- D + geom_point()
D2 <- D1 + geom_smooth(method="lm")
Figure2 <- D2 + ggtitle("Figure 2. The relationship between strategic learning and points")
Figure2

If we look at Figure 2, there seems to be a very weak linear relationship between strategic learning and exam points: students who used a strategic learning approach may have gained slightly better exam points.

2.3 Regression model

Choosing explaining variables

library(GGally)
ggpairs(lrn2014, lower= list(combo = wrap("facethist", bins = 20)))

As we can see in the plot matrix, the variables attitude (0.437), stra (0.146) and surf (0.144) have the strongest correlations (in absolute value) with points. Let’s try those as explanatory variables in our regression model.
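The same ranking can be read off programmatically instead of eyeballing the plot matrix. A minimal sketch, using R’s built-in mtcars data as a stand-in so the snippet runs on its own; with the course data the outcome column would be points and the data frame lrn2014:

```r
# Rank candidate predictors by their absolute correlation with the
# outcome column (mtcars and "mpg" stand in for lrn2014 and "points")
cors <- sort(abs(cor(mtcars)[, "mpg"]), decreasing = TRUE)
head(cors)   # the outcome itself comes first, with correlation 1
```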

Explaining points with attitude, strategic learning and surface learning

Regression model 1

my_reg <- lm(points ~ attitude + stra + surf, data = lrn2014 )
summary(my_reg)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Attitude seems to be the only statistically significant explanatory variable for points. This result is compatible with Figure 2, where the correlation between strategic learning and points was barely noticeable.

Let’s first remove surf and look at the model again.

Regression model 2

my_reg2 <- lm(points ~ attitude + stra, data = lrn2014)
summary(my_reg2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = lrn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Strategic learning still has too high a p-value, so we cannot rule out that its apparent effect is due to chance.

Let’s try the regression model with attitude as the only explanatory variable.

Regression model 3

my_reg3 <- lm(points ~ attitude, data=lrn2014)
summary(my_reg3)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

It seems that attitude towards statistics is our key explanatory variable for exam points: the more positive the attitude towards statistics in general, the better the exam outcome.

However, we should consider the model as a whole, so let’s first check the R-squared values.

How about gender?

What I didn’t try last time was the significance of gender. I’m not sure how it behaves, since it is not a continuous variable…

Regression model 4

my_reg4 <- lm(points ~ attitude + gender, data=lrn2014)
summary(my_reg4)
## 
## Call:
## lm(formula = points ~ attitude + gender, data = lrn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1724  -3.2092   0.3703   4.0197  10.6162 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.5098     1.8502   6.221 4.01e-09 ***
## attitude      0.3618     0.0595   6.081 8.21e-09 ***
## genderM      -0.4835     0.9157  -0.528    0.598    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.331 on 163 degrees of freedom
## Multiple R-squared:  0.1919, Adjusted R-squared:  0.182 
## F-statistic: 19.36 on 2 and 163 DF,  p-value: 2.863e-08

According to the results, the coefficient for gender is negative, but with a p-value of about 0.6 it could easily have arisen by chance. If it were significant, it would suggest that women score slightly better in the exam.

Problems with gender as a factor in a regression model:

* the gender variable isn’t a scale: “being less of a man gets you better points” isn’t a very reasonable interpretation
* how would gender alone explain anything? It would be meaningful if we compared attitudes between genders and perhaps the reasons behind any differences

For statisticians it might be obvious that gender doesn’t work as an explanatory variable, but testing with the model itself and interpreting the results may expose the problems more clearly than just stating the fact. And it’s fun to try!

2.4 Interpreting R-squared

R-squared tells us how well our regression model fits the observations: it is the proportion of variance in the outcome that the model explains. To simplify, the bigger the R-squared, the better the model seems to fit.

Our three regression models show that the multiple R-squared decreases each time we drop a variable. That is expected, because multiple R-squared can only increase when a variable is added, so the drop in itself isn’t a bad sign. What we also see is that model 2 has the highest adjusted R-squared. This suggests that even though it has more variables than model 3, it is a better model in quality, and not just a better fit because of the number of variables.

I conclude that we should still consider keeping strategic learning in our multiple regression model, even though its p-value is above the magical line of 0.05. Now we can check whether regression model 2 is really viable by examining its residuals.
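The adjusted R-squared comparison above can also be done programmatically. A minimal sketch, with mtcars standing in for lrn2014 so it is self-contained; with the course data the models would be my_reg, my_reg2 and my_reg3:

```r
# Fit three nested models and extract adjusted R-squared from each
# (mtcars variables stand in for points ~ attitude + stra + surf)
models <- list(
  three_vars = lm(mpg ~ wt + hp + qsec, data = mtcars),
  two_vars   = lm(mpg ~ wt + hp,        data = mtcars),
  one_var    = lm(mpg ~ wt,             data = mtcars)
)
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
adj_r2   # the model with the highest value wins this comparison
```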

2.5 Errors in our model

Let’s check Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage with our regression model 2.

par(mfrow = c(2, 2))               # show the diagnostic plots in one 2x2 grid
plot(my_reg2, which = c(1, 2, 5))  # Residuals vs Fitted, Normal Q-Q, Residuals vs Leverage

In plot 1 we see that the values are reasonably well scattered throughout the area, which means the errors show no pattern, as is expected from a good model.

In plot 2 we see that our observations line up along the line, so the errors seem to be normally distributed. This is consistent with our assumption, although the fit is weaker at both ends of the range of values.

In plot 3 we see that no single observation has too much weight on the model; the leverage stays under 0.05 across the whole plot.

Comparing with model 3

par(mfrow = c(2, 2))               # same 2x2 grid of diagnostics for model 3
plot(my_reg3, which = c(1, 2, 5))

The diagnostic plots are quite similar to those of model 2. The biggest difference is in plot 3, Residuals vs Leverage: the points are packed at the left end, but all values are still under 0.05. This means that no single observation weighs too much, that is, none is different enough from the other values to give a false idea of the model.

Nevertheless, we concluded in the R-squared comparison that model 2 is better than model 3, so let’s stick to that, although these diagnostics alone would not support that choice.

2.6 To conclude

In our regression model 2, with explanatory variables attitude and strategic learning, exam points are explained with a linear model. Considering the R-squared values and checking the residuals, it indeed seems that the second regression model explains points best. That means the more positive your attitude towards statistics and the more strategically you study, the better your outcome in the statistics exam.
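To make the conclusion concrete, model 2 can be used to predict points by hand. A sketch for a hypothetical student (attitude 35, stra 3.5; the student is made up, the coefficient estimates are copied from the model 2 summary above):

```r
# Predict exam points from model 2's fitted equation:
# points = intercept + b_attitude * attitude + b_stra * stra
intercept  <- 8.97290
b_attitude <- 0.34658
b_stra     <- 0.91365
points_hat <- intercept + b_attitude * 35 + b_stra * 3.5
round(points_hat, 1)   # roughly 24 points
```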


Logistic regression

This week we’re learning logistic regression, which uses a linear model of the log-odds to predict a binary outcome. How exciting! And very, very difficult, since this is an entirely new method for me.

Part 1

The data wrangling part can be seen in my GitHub: https://github.com/macabset/IODS-project/blob/master/data/create_alc.R

getwd()
## [1] "C:/Users/murmeli/Documents/murmeli/2017/GitHub/IODS-project"
alc2016 <- read.csv("data/create_alc.csv",sep=",", header= T)
head(alc2016)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4     GP   F  15       U     GT3       T    4    2   health services
## 5     GP   F  16       U     GT3       T    3    3    other    other
## 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason nursery internet guardian traveltime studytime failures
## 1     course     yes       no   mother          2         2        0
## 2     course      no      yes   father          1         2        0
## 3      other     yes      yes   mother          1         2        2
## 4       home     yes      yes   mother          1         3        0
## 5       home     yes       no   father          1         2        0
## 6 reputation     yes      yes   mother          1         2        0
##   schoolsup famsup paid activities higher romantic famrel freetime goout
## 1       yes     no   no         no    yes       no      4        3     4
## 2        no    yes   no         no    yes       no      5        3     3
## 3       yes     no  yes         no    yes       no      4        3     2
## 4        no    yes  yes        yes    yes      yes      3        2     2
## 5        no    yes  yes         no    yes       no      4        3     2
## 6        no    yes  yes        yes    yes       no      5        4     2
##   Dalc Walc health absences G1 G2 G3 alc_use high_use
## 1    1    1      3        5  2  8  8     1.0    FALSE
## 2    1    1      3        3  7  8  8     1.0    FALSE
## 3    2    3      3        8 10 10 11     2.5     TRUE
## 4    1    1      5        1 14 14 14     1.0    FALSE
## 5    1    2      5        2  8 12 12     1.5    FALSE
## 6    1    2      5        8 14 14 14     1.5    FALSE

Part 2

2.1 Our data

dim(alc2016)
## [1] 382  35
str(alc2016)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

This is part of the Student Alcohol Consumption Data Set, where we have combined secondary school students’ performance on a math course and a Portuguese language course. The data was collected in 2008 but released in 2016. In our combined dataset we have 382 observations and 35 variables. Most of the variables are discrete numeric values, describing a level of agreement or disagreement on a scale. Thirteen of them are two-level factors, of which eight are yes/no questions.

I have created the variable high_use, our only logical vector, which indicates high alcohol consumption. It was created by averaging weekday and weekend consumption into alc_use and flagging the high end of that average as high use. The variables present in both datasets but not used for joining them have been combined by averaging.
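A minimal sketch of how the derived columns are built in the wrangling script, assuming the cutoff alc_use > 2 used there; the toy vectors mirror the first rows shown in the head() output above:

```r
# Toy vectors mirroring the first rows of the data
Dalc <- c(1, 1, 2, 1)            # workday alcohol use
Walc <- c(1, 1, 3, 1)            # weekend alcohol use
alc_use  <- (Dalc + Walc) / 2    # average of workday and weekend use
high_use <- alc_use > 2          # flag the high end of the average
high_use   # matches the high_use column in the head() output above
```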

2.2 Hypothesis - Four variables explaining alcohol consumption

The point of this exercise is to find variables connected to (high) alcohol consumption. Using common-sense foreknowledge, I will make hypotheses about the data and its results.

My personal interest is to include gender (sex) as a basic variable when discussing differences in any behaviour. Do boys differ from girls in alcohol consumption? I believe that the more you care about your studies (studytime), the less you drink. Similarly, I think that the more you care about your health (health), the less you use alcohol. Lastly, I’ll attach the family situation (famrel) to the use of alcohol: my hypothesis is that the worse things are at home, the more you escape into alcohol.

Sex is a binary factor with the options male/female, studytime is a 4-level variable based on a subjective evaluation of study time, and health and famrel are Likert-scale variables based on self-evaluation.

2.3 Exploring our data

The following figures help to examine high alcohol use in light of our assumptions.

1. Studytime
library(ggplot2)
g1 <- ggplot(alc2016, aes(x = high_use, y = studytime, col=sex))
g2 <- g1 + geom_boxplot() + ggtitle("Figure 1.Studytime by high use of alcohol and gender")
g2

Looking at Figure 1, males and females seem to show very different patterns in how high alcohol use relates to study time. For boys, high alcohol use seems to make no difference to study time, whereas for girls, high users seem to study hardly at all while girls who are not high users study considerably more.

2. Family relationship

h1 <- ggplot(alc2016, aes( x= high_use, y= famrel, col= sex))
h2 <- h1 + geom_boxplot()+ ylab("family relations")
h3 <- h2 + ggtitle("Figure 2. Family relationship by alcohol consumption and gender")
h3

Our Figure 2 supports the assumption that high alcohol consumption goes together with worse family relations. We assume here that family relations influence high alcohol use and not the other way around.

3. Health

library(GGally)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
i1 <- ggally_facetbar(alc2016, ggplot2::aes(x= health, y= high_use, fill =sex))
i1 + ggtitle("Figure 3. High use of alcohol by health and gender")

Our Figure 3 illustrates how many observations fall into high use and not high use of alcohol at every level of health (1-5). All in all, there are fewer observations in the high-use group than in the not-high-use group above it. When students reported their alcohol use as high, a surprisingly large proportion of them rated their health “very good”. In general, more students report good health among those not drinking a lot, but even those who drink a lot report good health more often than poor health. This was rather surprising, so let’s explore the result in our logistic regression.

2.4 Logistic regression model

Generally, in a simple logistic regression model the target variable is binary: in this case, high use of alcohol or not. Our model therefore predicts the probability that a person is a high alcohol user, given their health, famrel, sex and studytime.
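The mapping from the linear part of the model to a probability is the logistic function, which R provides as plogis(). A toy illustration with made-up coefficient values (not the fitted model):

```r
# A linear predictor on the log-odds scale, mapped to a probability.
# The intercept and the three predictor effects below are made up.
eta <- 0.8 - 0.3 * 4 + 0.7 * 1 - 0.5 * 2  # intercept plus three made-up terms
p <- plogis(eta)                          # equivalent to 1 / (1 + exp(-eta))
p   # a probability strictly between 0 and 1
```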

m1 <- glm(high_use ~ health + famrel + sex + studytime, data = alc2016, family = "binomial")
summary(m1)
## 
## Call:
## glm(formula = high_use ~ health + famrel + sex + studytime, family = "binomial", 
##     data = alc2016)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4630  -0.8632  -0.6525   1.1891   2.1629  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.64999    0.65100   0.998  0.31806   
## health       0.06094    0.08539   0.714  0.47543   
## famrel      -0.30716    0.12677  -2.423  0.01540 * 
## sexM         0.70691    0.24605   2.873  0.00407 **
## studytime   -0.46050    0.15662  -2.940  0.00328 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 434.61  on 377  degrees of freedom
## AIC: 444.61
## 
## Number of Fisher Scoring iterations: 4
coef(m1)
## (Intercept)      health      famrel        sexM   studytime 
##  0.64999130  0.06094206 -0.30715569  0.70690830 -0.46049830

It does indeed look like health plays no crucial role in high alcohol use, so let’s fit the model again without health as a predictor.

m2 <- glm(high_use ~ famrel + sex + studytime, data = alc2016, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = high_use ~ famrel + sex + studytime, family = "binomial", 
##     data = alc2016)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.530  -0.854  -0.666   1.218   2.157  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.8395     0.5955   1.410  0.15861   
## famrel       -0.2982     0.1259  -2.368  0.01786 * 
## sexM          0.7257     0.2444   2.970  0.00298 **
## studytime    -0.4679     0.1563  -2.994  0.00275 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 435.12  on 378  degrees of freedom
## AIC: 443.12
## 
## Number of Fisher Scoring iterations: 4
coef(m2)
## (Intercept)      famrel        sexM   studytime 
##   0.8394792  -0.2981605   0.7256730  -0.4678942

The predictors here are famrel, sexM and studytime. Each coefficient tells how the log-odds of being a high alcohol user change when that predictor increases by one unit, holding the others fixed. For example, the negative coefficient of studytime means that, all other things being equal, the more time a student spends studying, the lower the odds of high alcohol use; famrel works the same way, while the positive coefficient of sexM means that males have higher odds than females. In our logistic model 2 all the coefficients are statistically significant, so we can use all of them in the last phase.

Confidence intervals for the coefficients as odds ratio

OR <- coef(m2) %>% exp
CI <- exp(confint(m2))
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 2.3151609 0.7211129 7.5048006
## famrel      0.7421822 0.5789910 0.9501382
## sexM        2.0661211 1.2835027 3.3512991
## studytime   0.6263198 0.4569282 0.8446258

With odds ratios we compare the odds of high alcohol use across the properties famrel, sex and studytime. If the odds ratio (OR) of a property is higher than 1, that property is positively associated with high alcohol use.

Gender seems to be the only factor positively associated with high alcohol consumption. sexM also has the widest confidence interval: with 95% confidence the odds ratio of sexM lies between about 1.28 and 3.35. The wider the interval, the less precise our point estimate of 2.066 is. The odds of high alcohol consumption for males are about twice those for females.

An OR below 1 means the probability of high alcohol consumption grows as the property decreases: in plain English, the lower the famrel value, the higher the alcohol consumption. This is as predicted in the hypothesis.
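The odds ratios in the table above are just the exponentiated log-odds coefficients. A quick check, with the estimates copied from the model 2 summary:

```r
# Odds ratios are exponentiated log-odds coefficients
# (values copied from the summary of model 2 above)
b  <- c(famrel = -0.2982, sexM = 0.7257, studytime = -0.4679)
OR <- exp(b)
round(OR, 3)   # famrel and studytime fall below 1, sexM above 1
```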

2.5 Predictive power of the model

Let’s see how well our model predicts. For every observation we compute, using model 2, the probability of being a high alcohol consumer. Then, to make the prediction binary, we use the cutoff 0.5: if the model gives a probability above 0.5, the prediction sets high_use to TRUE. Finally we compare the predictions to the actual values: is a student really a high consumer when the model predicts so?

probabilities <- predict(m2, type = "response")

alc2016 <- mutate(alc2016, probability = probabilities)

alc2016 <- mutate(alc2016, prediction = probability > 0.5)

table(high_use = alc2016$high_use, prediction = alc2016$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   255   13
##    TRUE    102   12

Picturing the prediction

library(dplyr); library(ggplot2)

g <- ggplot(alc2016, aes(x = probability, y = high_use, col=prediction))

g + geom_point()

table(high_use = alc2016$high_use, prediction = alc2016$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66753927 0.03403141 0.70157068
##    TRUE  0.26701571 0.03141361 0.29842932
##    Sum   0.93455497 0.06544503 1.00000000

When the model predicts FALSE for high alcohol use, it is right most of the time: 67% of all observations are predicted FALSE and really are FALSE, while 27% are predicted FALSE although the real value is TRUE. When the prediction says TRUE, on the other hand, the real value is FALSE about as often as it is TRUE.
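The same table can be condensed into a few standard rates. A sketch using the counts printed in the cross-tabulation above (the metric names are standard terms, not from the course material):

```r
# Confusion-matrix rates from the counts printed above
tab <- matrix(c(255, 102, 13, 12), nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))
accuracy    <- sum(diag(tab)) / sum(tab)                    # share of correct predictions
sensitivity <- tab["TRUE", "TRUE"] / sum(tab["TRUE", ])     # high users actually caught
specificity <- tab["FALSE", "FALSE"] / sum(tab["FALSE", ])  # non-users correctly cleared
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 2)
```

Note how low the sensitivity is: the model clears non-users well but catches only a small share of the actual high users, which the accuracy figure alone hides.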

How well did we do in our prediction?

Testing error

# Proportion of observations where the predicted probability is on the
# wrong side of the 0.5 cutoff (the logical class is coerced to 0/1)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc2016$high_use, prob = alc2016$probability)
## [1] 0.3010471

Here we have calculated the loss of our model. The 30% is close to our earlier figure of 27% of wrong guesses when the model predicts high use to be false; since most students are not high users, those cases dominate the total error. We have a 30% “penalty”, meaning that 30% of our predictions are classified incorrectly compared to the actual data.

2.6 Conclusion - how good is our model?

Considering our original hypotheses, I’d say they have some value. Self-evaluated health got discarded in the first phase of our logistic regression: it has no statistically significant role in predicting high alcohol use. Regression model 2 supports my original hypotheses: combining family relationships, gender and study time, the model predicts high alcohol use correctly about 70% of the time. Individually the coefficients work as I theorized: men are more likely to be high consumers than women, the worse your family relations the more likely you are a high consumer, and the less time you spend on your studies the more likely you are a high user of alcohol.

What if we guessed that the model never predicts right? As you recall, we got a 30% training error, which means 70% of all predictions match the actual data. If the model never guessed right, the training error would be 100%: every prediction would be the opposite of the truth. If the model were perfect, there would be no training error at all, and every student predicted to be a high user would really be a high user in our data.

Altogether, our model had a slightly higher error than in the DataCamp exercises (0.26, with the predictors failures, absences and sex). Is that good or bad? Well, I’d suppose we never get results in the human sciences as clean as in medical diagnoses or text-based research: people are complicated. 70% is not bad, but I think that with our 35 variables we could do a bit better, as the comparison with the DataCamp predictors shows.

Clustering and classification

This week we’ll divide the data with two methods, using R’s own Boston dataset.

1. Our data

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

This week we will use the Boston dataset, one of the datasets that ships with R (in the MASS package). Above are shown the structure and dimensions of the data, and the summary follows below. The dataset concerns housing values in the suburbs of Boston, USA: it has 506 observations, one per area, described by 14 variables.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The summary of our 14 variables shows that all but two are numeric and continuous. The exceptions are chas, a binary variable that equals 1 if the tract bounds the Charles River, and rad, an index of accessibility to radial highways.

1.1 Visualizing the correlation

library(dplyr); library(MASS); library(corrplot)
cor_matrix<-cor(Boston) %>% round(digits=2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)

Above, the correlations between all pairs of variables are visualized. Only the upper triangle is needed, since the correlation matrix is symmetric. The bigger the dot and the stronger the colour, the stronger the correlation.

  • The strongest negative correlations are between:
    1. medv and lstat: the smaller the share of lower-status inhabitants, the higher the value of owner-occupied homes
    2. age and dis: the larger the share of old housing stock (units built before 1940), the shorter the distance to the Boston employment centres
    3. nox and dis: the closer the tract is to the employment centres, the higher the nitrogen oxide concentration
    4. indus and dis: the closer to the employment centres, the larger the proportion of non-retail business acres

The strongest positive correlation is between tax and rad (0.91): the better the access to radial highways, the higher the full-value property tax rate.
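To double-check these readings, the strongest pairwise correlations can also be extracted programmatically rather than read off the plot; a small sketch using the MASS package:

```r
# Sketch: list the five strongest pairwise correlations in Boston
library(MASS)

cm <- cor(Boston)
cm[upper.tri(cm, diag = TRUE)] <- NA          # keep each pair only once
pairs_df <- na.omit(as.data.frame(as.table(cm)))
pairs_df <- pairs_df[order(-abs(pairs_df$Freq)), ]
head(pairs_df, 5)   # rad-tax (0.91) and nox-dis (-0.77) top the list
```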

1.2 Adjusting the data

Scaling the data

We use the scale() function to standardize our data: each column's mean is subtracted and the result is divided by the column's standard deviation. When variables share a common scale, distances between observations become comparable. We also want the scaled data as a data frame instead of a matrix.

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled<- as.data.frame(boston_scaled)

Because we have scaled our data, every variable now has mean 0 and standard deviation 1. Now all variables can be related to each other and the distances between observations are meaningful, which will help us in the clustering part.
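A quick sanity check (a sketch, not part of the analysis) confirms what standardization does:

```r
library(MASS)

boston_scaled <- as.data.frame(scale(Boston))
round(colMeans(boston_scaled), 10)   # numerically zero for every column
apply(boston_scaled, 2, sd)          # exactly 1 for every column
```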

2. Classification

We are going to divide the data with a classification method. Classification requires that the groups are known in advance, and we will use a categorized version of the crime rate for this.

First we need to adjust our chosen variable and create categories for crim: from the scaled Boston dataset we take the continuous variable crim and cut it into a categorical one.

boston_scaled <- scale(Boston)
boston_scaled<- as.data.frame(boston_scaled)
scaled_crim <- boston_scaled$crim
summary(scaled_crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
crim_quant <- quantile(scaled_crim)
crime <- cut(scaled_crim, breaks = crim_quant, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
boston_scaled <- dplyr::select(boston_scaled, -crim)

boston_scaled <- data.frame(boston_scaled, crime)

After scaling we can add the new variable crime to our dataset. It has the categories low, med_low, med_high and high, based on the quartiles of the scaled crim. For example, the 127 observations in the low category are those below the first quartile (25th percentile). Since the old crim is no longer needed, we drop it from the scaled Boston data.

2.1 Linear Discriminant Analysis

Now we are about to divide our data into classes with Linear Discriminant Analysis (LDA). LDA finds the linear combinations of features that best separate the classes. It is also a dimension reduction technique: instead of individual variables explaining the target, we get a small number of discriminant dimensions built from combined variables. In this case, we will find which dimensions predict crime rates in Boston.

Before we can build a model for classification, we need a way to test how well it works: we split the data into a train set, used to fit the model, and a test set, used to evaluate its predictions. The dataset is divided so that 80% goes to the train set and 20% to the test set.

We also remove the crime variable from the test set and save it separately, so we can later compare it against the model's predictions.

n <- nrow(boston_scaled)
# choose 80% of the row indices at random
# (a set.seed() call here would make the split reproducible)
random <- sample(n, size = n * 0.8)
# train set
train <- boston_scaled[random,]
# test set 
test <- boston_scaled[-random,]
#save crime from test
correct_classes <- test$crime
# remove crime from test data
test <- dplyr::select(test, -crime)

2.2 Creating the model

In our LDA, crime is the target variable and all the other variables are predictors. We fit the model on the train set; it should learn to predict the crime category from the other variables.

# linear discriminant analysis
lda.fit <- lda(formula= crime ~ ., data = train)

lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2376238 0.2475248 0.2500000 0.2648515 
## 
## Group means:
##                   zn      indus        chas        nox          rm
## low       1.02082910 -0.9803018 -0.14929469 -0.8912708  0.50638203
## med_low  -0.07176073 -0.3055205  0.04263895 -0.5829083 -0.14328484
## med_high -0.39214634  0.2380667  0.19544522  0.4337793  0.03577751
## high     -0.48724019  1.0170108 -0.08835242  1.0802262 -0.38301232
##                 age        dis        rad        tax    ptratio
## low      -0.9330567  0.8585870 -0.6995397 -0.7159505 -0.4923682
## med_low  -0.3540783  0.3742573 -0.5534930 -0.4744780 -0.1046371
## med_high  0.3931837 -0.3941871 -0.3883071 -0.2722879 -0.2483720
## high      0.8133280 -0.8398076  1.6392096  1.5148289  0.7820356
##                black       lstat        medv
## low       0.38799337 -0.81094105  0.59048319
## med_low   0.31615216 -0.13074572  0.01198134
## med_high  0.05618209  0.08691973  0.10796184
## high     -0.88333898  0.84855082 -0.70004794
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.11544673  0.72875022 -0.87981346
## indus    0.07643029 -0.63017300  0.30589945
## chas    -0.08487136 -0.05040852  0.15272384
## nox      0.38289048 -0.48594368 -1.47411895
## rm      -0.08257263 -0.07870965 -0.19206268
## age      0.28430841 -0.24976576 -0.06661798
## dis     -0.08428909 -0.25646450  0.21947636
## rad      3.06212855  0.68737388  0.13023495
## tax     -0.13791239  0.35956062  0.49993352
## ptratio  0.12803540  0.04455064 -0.36699510
## black   -0.12429773 -0.01125710  0.10642584
## lstat    0.19605136 -0.27899847  0.31061888
## medv     0.16728241 -0.36483249 -0.25459217
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9417 0.0436 0.0147
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)

We want to find the combination of variables that predicts the crime rate in Boston. The red arrows in the middle show how the original variables contribute to the discriminants. (In our earlier interpretation of the variables, tax and rad were strongly correlated; the rad arrow also plays a very strong role in the LDA.)

Now let's use our test set to see whether our model really predicts the crime rates correctly.

2.3 Predicting data

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13      15        3    0
##   med_low    2      19        5    0
##   med_high   0       8       17    0
##   high       0       0        0   20

Here is a cross tabulation of the crime categories our model predicted against the real categories from the test set.

When the real value was low, the model predicted low in 13 of 31 cases, but almost as often med_low (15 times) and three times med_high. For med_low and med_high the model is right roughly seven times out of ten, and high is predicted perfectly (20 of 20). So the high end of the crime rate is clearly better predicted than the low end.
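One summary number for the table above is the overall accuracy, the share of the diagonal in the total. A sketch, with the counts copied by hand from this particular run (a re-run with a different random split would give different counts):

```r
# Confusion table from the run above, entered manually
conf <- matrix(c(13, 15,  3,  0,
                  2, 19,  5,  0,
                  0,  8, 17,  0,
                  0,  0,  0, 20),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

sum(diag(conf)) / sum(conf)   # 69/102, about 0.68 of predictions correct
```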

3. Clustering

Now we scale our data again to use it for a clustering method. The difference from LDA is that the classes are not defined beforehand: we do not use crime as the basis for the classes, but let k-means discover groups in the data.

Again, we need scaling to make distances comparable.

data("Boston")
boston_scaled2 <- scale(Boston)
summary(boston_scaled2)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
class(boston_scaled2)
## [1] "matrix"
boston_scaled2<- as.data.frame(boston_scaled2)

3.1 Distances

With clustering we do not know the proper number of classes beforehand, so we need a measure of dissimilarity between observations. This time I am using the standard Euclidean distance.

dist_eu <- dist(boston_scaled2)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

3.2 K-means

With the k-means method, observations are assigned to a chosen number of clusters by grouping each observation with the nearest cluster mean. To find a good number of clusters, we plot the total within-cluster sum of squares (TWCSS) for 1 to 10 clusters: a sharp drop in the curve suggests the optimal count. Here the TWCSS drops sharply at 2 clusters, which seems to be the optimal number for our data.

library(ggplot2); library(GGally)
set.seed(123)

# euclidean distance matrix

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})

# visualize the results
plot(1:k_max, twcss, type='b')

# k-means clustering
km <-kmeans(dist_eu, centers = 2)

boston_scaled2$km <- as.factor(km$cluster)

# plot the Boston dataset with clusters
ggpairs(boston_scaled2, ggplot2::aes(colour=km))
## Warning in cor(x, y, method = method, use = use): the standard deviation is
## zero
## (warning repeated 13 times in total)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## (message repeated 14 times in total)

4. Conclusions

In this exercise we used two different methods to divide our data into groups. With the first method (classification), we divided the data into four classes of crime rate. The model was quite accurate at the high end of crime but did not predict the low and med_low classes nearly as well.

In the second method we clustered with no prior assumption about the classes and used k-means to find them. K-means suggested that two clusters separate the data best, which we can see in our pairs plot. The problem is that, without something like LDA, we do not know what separates the two clusters, so we have no good way of saying whether the two classes given by k-means are meaningful for the data or not.
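One way around that problem is to compare the variable means per cluster; a sketch (here clustering the scaled data directly rather than the distance matrix, with a fixed seed):

```r
library(MASS)

set.seed(123)
bs <- as.data.frame(scale(Boston))
km2 <- kmeans(bs, centers = 2)

# Per-cluster means of the standardized variables: the variables with the
# largest differences between the two rows drive the separation
round(aggregate(bs, by = list(cluster = km2$cluster), FUN = mean), 2)
```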


Dimensionality reduction

This week we aim to reduce dimensions in order to see the relations between the variables in the data more clearly.

1. Principal component analysis

#Packages we need in this workout
library(GGally)
library(dplyr)
library(corrplot)
library(FactoMineR)
library(ggplot2)
library(tidyr)

The first part of our dimensionality reduction concerns continuous variables. Dimensionality reduction in general aims to find the most essential dimensions that explain the data. In Principal Component Analysis (PCA), we look for the principal components (usually PC1 and PC2) that explain most of the variation in the data.

1.1 Our data

This dataset originates from the United Nations Development Programme (http://hdr.undp.org/en/content/human-development-index-hdi). Its goal is to compare nations through human capabilities, with more sophisticated methods than just looking at GNP.

Variables

I have created a subset called “human” with 8 variables and 155 observations. Variables modified from the original dataset are Country, Sex_edu2, Lab_ratio and GNI. For further information about the modifications made, you can check my GitHub repository: https://github.com/macabset/IODS-project/blob/master/data/create_human.R

Underneath are the names of the 8 variables and their description.

Country: used as rownames, regions not included

  • Sex_edu2: ratio of females to males in at least secondary education
  • Lab_ratio: ratio of female labour force to male labour force
  • Edu_expect: Expected years of education
  • Life_expect: Life expectancy at birth
  • GNI: Gross National Income per capita
  • Mom_death: Maternal mortality ratio
  • Young_birth: Adolescent birth rate
  • Present_parl: Percentage of female representatives in parliament

Observations

In addition, I have excluded observations with missing values from the dataset.

getwd()
## [1] "C:/Users/murmeli/Documents/murmeli/2017/GitHub/IODS-project"
human <- read.csv("data/human.csv", sep=",", header= T, row.names = 1)
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Sex_edu2    : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Lab_ratio   : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu_expect  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life_expect : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI         : int  166 135 156 139 140 137 127 154 134 117 ...
##  $ Mom_death   : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Young_birth : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Present_parl: num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   8
head(human)
##              Sex_edu2 Lab_ratio Edu_expect Life_expect GNI Mom_death
## Norway      1.0072389 0.8908297       17.5        81.6 166         4
## Australia   0.9968288 0.8189415       20.2        82.4 135         6
## Switzerland 0.9834369 0.8251001       15.8        83.0 156         6
## Denmark     0.9886128 0.8840361       18.7        80.2 139         5
## Netherlands 0.9690608 0.8286119       17.9        81.6 140         6
## Germany     0.9927835 0.8072289       16.5        80.9 137         7
##             Young_birth Present_parl
## Norway              7.8         39.6
## Australia          12.1         30.5
## Switzerland         1.9         28.5
## Denmark             5.1         38.0
## Netherlands         6.2         36.9
## Germany             3.8         36.9

Summaries

summary(human)
##     Sex_edu2        Lab_ratio        Edu_expect     Life_expect   
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI           Mom_death       Young_birth      Present_parl  
##  Min.   :  2.00   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.: 53.50   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 99.00   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 98.73   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.:143.50   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :194.00   Max.   :1100.0   Max.   :204.80   Max.   :57.50

All our variables are numeric and continuous, which is what we want for PCA. But before we jump into the PCA, let's look at what we can say at this stage.

1.2 Picturing human

cor(human)
##                  Sex_edu2    Lab_ratio  Edu_expect Life_expect
## Sex_edu2      1.000000000  0.009564039  0.59325156   0.5760299
## Lab_ratio     0.009564039  1.000000000  0.04732183  -0.1400125
## Edu_expect    0.593251562  0.047321827  1.00000000   0.7894392
## Life_expect   0.576029853 -0.140012504  0.78943917   1.0000000
## GNI           0.177158634 -0.012970053  0.12134933   0.1928623
## Mom_death    -0.660931770  0.240461075 -0.73570257  -0.8571684
## Young_birth  -0.529418415  0.120158862 -0.70356489  -0.7291774
## Present_parl  0.078635285  0.250232608  0.20608156   0.1700863
##                       GNI  Mom_death Young_birth Present_parl
## Sex_edu2      0.177158634 -0.6609318  -0.5294184  0.078635285
## Lab_ratio    -0.012970053  0.2404611   0.1201589  0.250232608
## Edu_expect    0.121349332 -0.7357026  -0.7035649  0.206081561
## Life_expect   0.192862316 -0.8571684  -0.7291774  0.170086312
## GNI           1.000000000 -0.1001664  -0.1334416  0.002672262
## Mom_death    -0.100166367  1.0000000   0.7586615 -0.089439999
## Young_birth  -0.133441591  0.7586615   1.0000000 -0.070878096
## Present_parl  0.002672262 -0.0894400  -0.0708781  1.000000000
ggpairs(human, title = "Table 1. Human data correlation matrix for paired variables")

cor(human) %>% corrplot( method="circle", type="upper", title= "Graph 1. Correlations between variables in human data")

Our correlation plots show that the strongest correlation is between Mom_death and Life_expect (-0.86), followed closely by Mom_death and Edu_expect, Young_birth and Life_expect, and Young_birth and Edu_expect.

  • From these preliminary results we can assume that

    1. the higher the maternal mortality ratio, the lower the life expectancy at birth
    2. the higher the maternal mortality ratio, the lower the expected years of education
    3. the more adolescent births there are, the lower the life expectancy at birth
    4. the more adolescent births there are, the lower the expected years of education.

All in all, this suggests that if we want our human capacity to be used well, we need to take care of mothers and allow family planning at a young age.

Principal component analysis

In Principal Component Analysis (PCA), the variables are transformed into principal components, of which the first two usually describe most of the data. With this method we can picture our 8-dimensional dataset in 2D. At the same time the original variables are retained, and their relations to the principal components can be evaluated.

PCA

pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 214.3202186  54.4858892  26.3814123  11.4791149   4.0668732   1.6067062
## [7]   0.1905111   0.1586732
## 
## Rotation (n x k) = (8 x 8):
##                        PC1           PC2           PC3           PC4
## Sex_edu2     -0.0007468424  4.707363e-04 -0.0001689810 -0.0004252438
## Lab_ratio     0.0002210928  5.783875e-05 -0.0007563442 -0.0047679411
## Edu_expect   -0.0098109042  2.441895e-03 -0.0223269972 -0.0369246053
## Life_expect  -0.0334485095  1.573401e-02 -0.0303860782 -0.0780112042
## GNI          -0.0278166075  9.979413e-01  0.0557767706 -0.0002740466
## Mom_death     0.9879825869  3.620190e-02 -0.1474236772 -0.0068326906
## Young_birth   0.1479114018 -5.046353e-02  0.9867754042 -0.0069374879
## Present_parl -0.0048164618 -1.494485e-03 -0.0026652825 -0.9962093264
##                        PC5           PC6           PC7           PC8
## Sex_edu2      0.0002964017 -0.0254740423  6.859097e-01  7.272399e-01
## Lab_ratio    -0.0034895150 -0.0320024527  7.265303e-01 -6.863629e-01
## Edu_expect   -0.1947276781 -0.9790080609 -4.054956e-02  3.992920e-03
## Life_expect  -0.9759219723  0.1979058277  5.461806e-03  2.081465e-03
## GNI           0.0150268685 -0.0014692010 -3.631899e-04 -3.767644e-04
## Mom_death    -0.0282554922 -0.0005722709  1.213189e-04  8.299757e-04
## Young_birth  -0.0393039689 -0.0160315111 -4.359137e-05 -9.463203e-05
## Present_parl  0.0841200882  0.0210694313 -2.695182e-03  2.658636e-03
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col= c("grey40", "deeppink2"), main = "Graph 2. PCA with human data, unscaled") 
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

First I have run the PCA and drawn the biplot without scaling the data. In the biplot, observations are shown in grey and the variables' correlation arrows in pink. Where the observations and arrows sit in the plot is determined by the principal components PC1 and PC2; the arrows connect the original variables to the principal components.

Here the variables are not measured in the same units, so we have to scale them to give each one the same influence. Let's now do it properly with standardized variables.

Scaled PCA

human.std <- scale(human)
pca_human2 <- prcomp(human.std)
biplot(pca_human2, choices= 1:2, cex = c(0.8, 1), col= c("grey40", "deeppink2"), main="Graph 3. PCA with human data, scaled")

If we compare the scaled and unscaled versions, we clearly see that the correlation arrows look quite different. The scaled version is better, because there the variables are comparable.

Next, we modify the plot so that the percentage of variance explained by each principal component is visible. In PCA, the more variance a component explains, the more important it is.

Scaled and modified PCA

s <- summary(pca_human2)
pca_pr <- round(100*s$importance[2, ], digits = 1)

# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], main="Graph 4. PCA with human data, scaled, %")

And underneath is a graph from the same scaled PCA, with the correlation arrows drawn in a separate graph from the observations, since they are a bit hard to see in the original picture.

Scaled PCA with correlation arrows

#Draw a pca with only arrows

es.pca <- PCA(human, scale.unit = TRUE, ncp = 5)

Together, the first two principal components explain 64% of all the variance in the dataset.
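The percentages come from summary(prcomp(...)); as a self-contained sketch, the same call on R's built-in USArrests data (it works identically on pca_human2 above):

```r
# Sketch: per-component and cumulative variance shares from prcomp
pca <- prcomp(USArrests, scale. = TRUE)
s <- summary(pca)
round(100 * s$importance["Proportion of Variance", ], 1)  # % per component
round(100 * s$importance["Cumulative Proportion", ], 1)   # running total
```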

A strong positive correlation can be seen between the female-to-male ratio in secondary education and the expected years of education; it is not surprising that the expected education level also predicts the ratio between men and women in education. It is also quite understandable that Gross National Income is a good predictor of life expectancy: the wealthier the country, the longer its citizens live. All of these are strongly correlated with PC1, which explains almost half of the variation, so these variables seem to be the most important for understanding the wellbeing of a nation.

The more mothers die, the more adolescents give birth. Both are negatively correlated with PC1: the higher the maternal mortality and the adolescent birth rate, the worse the state of the nation.

The correlation between the female-to-male labour ratio and women's presence in parliament is visible but not as strong as the previous connections. Nevertheless, the more women in the labour market, the better their representation in parliament. Both correlate positively with PC2, so the better women are represented in the labour market and in parliament, the better the state of the nation.

2. Multiple Correspondence Analysis

In this section we need different data to work with Multiple Correspondence Analysis (MCA). MCA is a dimensionality reduction technique for discrete variables: it is designed for qualitative (categorical) variables, which is why we cannot use our human dataset as it is now.

Our goal is to find out whether groups are close to or far from each other: the closer two categories lie, the more the same people have answered in both. Also, the larger a category's coordinate on the x or y axis, the stronger its association with the corresponding dimension.

Data - Tea time!

The next part of reducing dimensions uses R's own dataset "tea" (from FactoMineR). It's a questionnaire about how people drink tea and about their drinking habits. For more information, you can check the website: http://factominer.free.fr/classical-methods/multiple-correspondence-analysis.html.

The dataset has 300 observations and 36 variables. The variables are factors with varying numbers of levels, many of them binary such as "sugar" vs. "No.sugar". The exception is age, which appears both as a continuous variable (age) and as a categorical one (age_Q).
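The categorical age_Q can be thought of as a binned version of the continuous age. A hedged base-R sketch with cut() (the break points below are my reconstruction of the 15-24, 25-34, … bands visible in str(my_tea), not FactoMineR's actual code):

```r
# Hypothetical ages to bin (the first ten age values shown in str(my_tea))
age <- c(39, 45, 47, 23, 48, 21, 37, 36, 40, 37)

# Bands matching the age_Q levels seen in str(my_tea)
age_Q <- cut(age,
             breaks = c(15, 25, 35, 45, 60, Inf),
             labels = c("15-24", "25-34", "35-44", "45-59", "+60"),
             right  = FALSE)   # left-closed intervals: 45 falls in "45-59"
table(age_Q)
```

The resulting bands agree with the age_Q codes printed in the str() output below (for example, the respondent aged 45 lands in "45-59").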

# ggplot2, dplyr, tidyr and FactoMineR need to be loaded
library(ggplot2)
library(dplyr)
library(tidyr)
library(FactoMineR)
data(tea)
my_tea <- data.frame(tea)
dim(my_tea)
## [1] 300  36
str(my_tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(my_tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 

To find out which might be the most interesting variables, let's draw some plots!

Picturing the tea data

Bar plot

gather(my_tea) %>% ggplot(aes(value))  + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

For the sake of simplicity, I'll pick a sub-dataset my_tea1 with 8 variables: breakfast, sex, price, healthy, spirituality, tearoom, Tea and friends.

#Columns for our subset my_tea1
keep_columns <- c("breakfast", "sex", "price", "healthy", "spirituality", "friends", "Tea", "tearoom")
my_tea1 <- dplyr::select(my_tea, one_of(keep_columns))
#Picturing with bars
gather(my_tea1) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

Multiple Correspondence Analysis

The idea is to find out whether these groups have anything in common with each other. MCA examines the links between variables based on the variance of the observations. Does spirituality occur together with expensive tea? Does gender have anything to do with whether tea is assumed to be healthy or not? Now we're ready to run MCA on my_tea1.
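Before running the full MCA, a single pairwise link can be checked with an ordinary cross-tabulation; MCA generalizes this kind of check to all the categorical variables at once. A toy sketch with made-up counts (not the actual tea data):

```r
# Made-up 2x2 counts for illustration only (not from the tea data)
tab <- matrix(c(60, 40,
                30, 70),
              nrow = 2, byrow = TRUE,
              dimnames = list(sex     = c("F", "M"),
                              healthy = c("healthy", "Not.healthy")))

# Chi-squared test of independence between the two categorical variables
chisq.test(tab)
```

A small p-value here would suggest the two variables are linked; MCA goes further by placing all categories in a common low-dimensional map.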

MCA graph 1

mca <- MCA(my_tea1, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = my_tea1, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.193   0.165   0.150   0.147   0.139   0.128
## % of var.             11.885  10.164   9.218   9.031   8.544   7.859
## Cumulative % of var.  11.885  22.050  31.268  40.299  48.843  56.702
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.121   0.115   0.110   0.102   0.096   0.087
## % of var.              7.429   7.054   6.784   6.297   5.928   5.371
## Cumulative % of var.  64.131  71.185  77.970  84.267  90.195  95.565
##                       Dim.13
## Variance               0.072
## % of var.              4.435
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1               |  0.767  1.015  0.144 |  0.347  0.242  0.029 |  0.713
## 2               |  0.261  0.117  0.057 |  0.285  0.164  0.069 | -0.327
## 3               | -0.400  0.276  0.233 | -0.236  0.113  0.081 |  0.005
## 4               |  0.057  0.006  0.003 | -0.199  0.080  0.034 |  0.209
## 5               | -0.005  0.000  0.000 |  0.067  0.009  0.003 |  0.484
## 6               |  0.641  0.709  0.171 | -0.080  0.013  0.003 |  0.234
## 7               | -0.151  0.039  0.028 |  0.019  0.001  0.000 |  0.020
## 8               |  0.262  0.118  0.059 |  0.161  0.052  0.022 | -0.149
## 9               |  0.533  0.491  0.180 |  0.465  0.437  0.137 |  0.375
## 10              |  0.503  0.436  0.118 |  1.130  2.577  0.594 | -0.477
##                    ctr   cos2  
## 1                1.132  0.125 |
## 2                0.238  0.090 |
## 3                0.000  0.000 |
## 4                0.097  0.037 |
## 5                0.521  0.164 |
## 6                0.122  0.023 |
## 7                0.001  0.001 |
## 8                0.050  0.019 |
## 9                0.313  0.089 |
## 10               0.505  0.106 |
## 
## Categories (the 10 first)
##                     Dim.1     ctr    cos2  v.test     Dim.2     ctr
## breakfast       |  -0.002   0.000   0.000  -0.030 |   0.211   1.613
## Not.breakfast   |   0.002   0.000   0.000   0.030 |  -0.194   1.489
## F               |  -0.357   4.897   0.186  -7.458 |  -0.173   1.347
## M               |   0.521   7.144   0.186   7.458 |   0.253   1.965
## p_branded       |   0.289   1.708   0.039   3.398 |  -0.960  22.087
## p_cheap         |   0.678   0.694   0.011   1.812 |   0.329   0.191
## p_private label |   0.723   2.369   0.039   3.431 |   0.274   0.398
## p_unknown       |   0.192   0.095   0.002   0.677 |  -0.040   0.005
## p_upscale       |   0.563   3.621   0.068   4.507 |   1.183  18.709
## p_variable      |  -0.710  12.168   0.300  -9.471 |   0.187   0.986
##                    cos2  v.test     Dim.3     ctr    cos2  v.test  
## breakfast         0.041   3.500 |  -0.286   3.276   0.075  -4.751 |
## Not.breakfast     0.041  -3.500 |   0.264   3.024   0.075   4.751 |
## F                 0.044  -3.618 |  -0.242   2.905   0.086  -5.059 |
## M                 0.044   3.618 |   0.353   4.238   0.086   5.059 |
## p_branded         0.427 -11.300 |  -0.341   3.069   0.054  -4.012 |
## p_cheap           0.003   0.880 |   1.264   3.110   0.038   3.378 |
## p_private label   0.006   1.301 |   0.150   0.131   0.002   0.709 |
## p_unknown         0.000  -0.142 |   2.649  23.430   0.292   9.351 |
## p_upscale         0.300   9.475 |  -0.267   1.049   0.015  -2.137 |
## p_variable        0.021   2.493 |   0.024   0.019   0.000   0.326 |
## 
## Categorical variables (eta2)
##                   Dim.1 Dim.2 Dim.3  
## breakfast       | 0.000 0.041 0.075 |
## sex             | 0.186 0.044 0.086 |
## price           | 0.319 0.560 0.369 |
## healthy         | 0.010 0.044 0.413 |
## spirituality    | 0.082 0.019 0.000 |
## friends         | 0.412 0.000 0.000 |
## Tea             | 0.272 0.340 0.162 |
## tearoom         | 0.264 0.273 0.092 |
# visualize MCA, individual points are invisible showing only variables.
plot(mca, invisible=c("ind"),  habillage="quali")

The first two dimensions retain 22% of the variance of the data. With so little variance retained it is not surprising that the squared correlation (eta2) between the first two dimensions and our variables is at most 0.56 (between Dim.2 and price in the categorical variables table). This can also be seen in the graph: most of the groups sit near the centre, showing no strong relation to either dimension. There seem to be no strong links between variables and dimensions.
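The percentages in the eigenvalue table can be reproduced by hand: in MCA the total inertia equals K/Q - 1, where K is the total number of categories (21 across our 8 variables) and Q the number of variables, and each dimension's share is its eigenvalue divided by that total. A base-R check using the eigenvalues printed in the summary above:

```r
# Eigenvalues copied from the summary(mca) output above
eig <- c(0.193, 0.165, 0.150, 0.147, 0.139, 0.128, 0.121,
         0.115, 0.110, 0.102, 0.096, 0.087, 0.072)

# Total inertia for 8 variables with 21 categories in total: 21/8 - 1 = 1.625
total <- 21 / 8 - 1
pct <- 100 * eig / total
cumsum(pct)[2]  # cumulative % for the first two dimensions, about 22
```

The number of dimensions also checks out: K - Q = 21 - 8 = 13, matching Dim.1 through Dim.13 in the summary.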

Nevertheless, we can make some simple interpretations from our data.

Regarding dimensions 1 and 2, it's hard to see clearly separated groups away from the centre. Near the centre we can see from the picture that men drink cheap and private-label tea, while women drink tea with friends and associate it with health and spirituality.

Also, healthy and Not.breakfast seem to be the groups closest to each other, but at the same time they sit almost at the origin of both dimensions, showing little variance between observations. Maybe those who skip breakfast think tea is so healthy they don't even need one.

Somehow black tea drinkers are associated with high tea prices, although black tea is usually not the most expensive sort on the market (at least in Finland).

MCA graph 2

res.mca <- MCA(my_tea1)

plotellipses(res.mca, keepvar = 3)

With our additional graph 2 and its third plot we can see that price, Tea and tearoom correlate most strongly with dimension 2, and friends with dimension 1. The same can be read from the category positions in the first MCA graph: drinking tea in tearooms, higher prices and black tea are most positively connected to dimension 2, while branded tea is most negatively correlated with it. Green tea and not drinking with friends are most positively correlated with dimension 1, whereas drinking in a tearoom and price in general are negatively correlated with dimension 1.

Groups that are close together and show the highest correlation with the dimensions should be considered the most influential ones. Thus, although interesting, the separation between males and females is not as trustworthy as the link between high prices and black tea.

***