# Introduction ¶

• We will briefly introduce in this lesson the SciPy library, the scientific toolbox for Python.
• We will get a brief overview of the stats subpackage and use it to perform many statistical calculations, including calculations of probabilities, probability distributions, and confidence intervals.
• We will also perform hypothesis testing on a real-world dataset and state conclusions that go beyond the sample we have.

In this lesson we will show how to use the SciPy library to perform

• Statistical calculations
• Probability calculations
• Hypothesis testing and statistical tests

## Introduction to SciPy ¶

SciPy

• is a tool for doing scientific computing in Python.
• is part of a Python-based ecosystem of open-source software for math, science, and engineering.
• contains various toolboxes dedicated to common issues in scientific computing.

Let us first import the libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import sqrt, arange
from scipy import stats
%matplotlib inline


In this lesson we use a dataset containing information about Portuguese students from two public schools. This is a real-world dataset collected to study alcohol consumption in young people and its effects on students' academic performance. The dataset was built from two sources: school reports and questionnaires.

Attributes:

• 1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
• 2 sex - student's sex (binary: 'F' - female or 'M' - male)
• 3 age - student's age (numeric: from 15 to 22)
• 4 address - student's home address type (binary: 'U' - urban or 'R' - rural)
• 5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
• 6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
• 7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
• 8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
• 9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
• 10 Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
• 11 reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
• 12 guardian - student's guardian (nominal: 'mother', 'father' or 'other')
• 13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
• 14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
• 15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
• 16 schoolsup - extra educational support (binary: yes or no)
• 17 famsup - family educational support (binary: yes or no)
• 18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
• 19 activities - extra-curricular activities (binary: yes or no)
• 20 nursery - attended nursery school (binary: yes or no)
• 21 higher - wants to take higher education (binary: yes or no)
• 22 internet - Internet access at home (binary: yes or no)
• 23 romantic - with a romantic relationship (binary: yes or no)
• 24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
• 25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
• 26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
• 27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
• 28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
• 29 health - current health status (numeric: from 1 - very bad to 5 - very good)
• 30 absences - number of school absences (numeric: from 0 to 93)

These grades are related to the course subject:

• 31 G1 - first period grade (numeric: from 0 to 20)
• 32 G2 - second period grade (numeric: from 0 to 20)
• 33 G3 - final grade (numeric: from 0 to 20, output target)

## Importing the data into Python ¶

In [2]:
student = pd.read_csv("/Users/dhafermalouche/Documents/Teaching_2018_2019/Teaching_python/student.csv", sep=";")


A preview of the data

In [3]:
student.head()

Out[3]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4 3 4 1 1 3 4 0 11 11
1 GP F 17 U GT3 T 1 1 at_home other ... 5 3 3 1 1 3 2 9 11 11
2 GP F 15 U LE3 T 1 1 at_home other ... 4 3 2 2 3 3 6 12 13 12
3 GP F 15 U GT3 T 4 2 health services ... 3 2 2 1 1 5 0 14 14 14
4 GP F 16 U GT3 T 3 3 other other ... 4 3 2 1 2 5 0 11 13 13

5 rows × 33 columns

# Confidence Intervals ¶

G3 is the variable that gives the final grade (numeric: from 0 to 20, output target). We will show how to compute the confidence interval of the mean.

The sample size

In [4]:
sample_size = student.shape[0]
sample_size

Out[4]:
649

Because our sample size is much greater than 30, the Central Limit Theorem applies: the sampling distribution of the mean is approximately normal, so we can compute a confidence interval for the mean using the normal distribution.

To get the confidence interval for the mean we need three numbers:

1. Sample mean
2. Standard error
3. Confidence level

Formula for the standard error, where $s$ is the sample standard deviation and $n$ is the sample size:

$$SE = \frac{s}{\sqrt n}$$

In [5]:
sample_mean_grade = student['G3'].mean()
sample_mean_grade

Out[5]:
11.906009244992296
In [6]:
std_error_grades = student['G3'].std()/sqrt(sample_size)
std_error_grades

Out[6]:
0.126814350307949
In [7]:
stats.norm.interval(0.95, loc=sample_mean_grade, scale=std_error_grades)

Out[7]:
(11.65745768566587, 12.154560804318722)
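As a sanity check, `stats.norm.interval(0.95, loc=mean, scale=se)` is just mean ± z·SE, with z the 0.975 normal quantile. A minimal sketch on synthetic data (the sample below is illustrative, not the student dataset):

```python
import numpy as np
from scipy import stats

# Illustrative synthetic sample (not the student data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=12, scale=3, size=200)

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))  # SE = s / sqrt(n)

# Manual interval: mean +/- z_{0.975} * SE
z = stats.norm.ppf(0.975)
manual = (mean - z * se, mean + z * se)

# Same interval via scipy
auto = stats.norm.interval(0.95, loc=mean, scale=se)
```

The two intervals coincide, which makes explicit what `stats.norm.interval` computes under the hood.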

Let us compare the confidence intervals of G3 between girls and boys

Girls data

In [8]:
SF = student.loc[(student.sex == 'F')]

In [9]:
SF.head()

Out[9]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4 3 4 1 1 3 4 0 11 11
1 GP F 17 U GT3 T 1 1 at_home other ... 5 3 3 1 1 3 2 9 11 11
2 GP F 15 U LE3 T 1 1 at_home other ... 4 3 2 2 3 3 6 12 13 12
3 GP F 15 U GT3 T 4 2 health services ... 3 2 2 1 1 5 0 14 14 14
4 GP F 16 U GT3 T 3 3 other other ... 4 3 2 1 2 5 0 11 13 13

5 rows × 33 columns

Boys data

In [10]:
SM = student.loc[(student.sex == 'M')]

In [11]:
SM.head()

Out[11]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
5 GP M 16 U LE3 T 4 3 services other ... 5 4 2 1 2 5 6 12 12 13
6 GP M 16 U LE3 T 2 2 other other ... 4 4 4 1 1 3 0 13 12 13
8 GP M 15 U LE3 A 3 2 services other ... 4 2 2 1 1 1 0 15 16 17
9 GP M 15 U GT3 T 3 4 other other ... 5 5 1 1 1 5 0 12 12 13
12 GP M 15 U LE3 T 4 4 health services ... 4 3 3 1 3 5 0 12 13 12

5 rows × 33 columns

In [12]:
sample_mean_grade_F = SF['G3'].mean()
sample_mean_grade_F

Out[12]:
12.2532637075718
In [13]:
sample_mean_grade_M = SM['G3'].mean()
sample_mean_grade_M

Out[13]:
11.406015037593985
In [14]:
sample_size_F = SF.shape[0]
sample_size_F

Out[14]:
383
In [15]:
sample_size_M = SM.shape[0]
sample_size_M

Out[15]:
266
In [16]:
std_error_grades_F = SF['G3'].std()/sqrt(sample_size_F)
std_error_grades_F

Out[16]:
0.1596364743648972
In [17]:
std_error_grades_M = SM['G3'].std()/sqrt(sample_size_M)
std_error_grades_M

Out[17]:
0.20360457170514298

Girls Confidence Interval

In [18]:
stats.norm.interval(0.95, loc=sample_mean_grade_F, scale=std_error_grades_F)

Out[18]:
(11.94038196719765, 12.566145447945951)

Boys Confidence Interval

In [19]:
stats.norm.interval(0.95, loc=sample_mean_grade_M, scale=std_error_grades_M)

Out[19]:
(11.006957409964201, 11.805072665223769)

Let us now show this graphically

In [20]:
fig, axes = plt.subplots(1,2, figsize = (14,4))
sns.boxplot(x='sex', y='G3', data=student, ax=axes[0])
sns.pointplot(x='sex', y='G3', data=student, ax=axes[1]);

/Users/dhafermalouche/anaconda3/lib/python3.6/site-packages/seaborn/categorical.py:462: FutureWarning: remove_na is deprecated and is a private function. Do not use.
box_data = remove_na(group_data)
/Users/dhafermalouche/anaconda3/lib/python3.6/site-packages/seaborn/categorical.py:1460: FutureWarning: remove_na is deprecated and is a private function. Do not use.
stat_data = remove_na(group_data)


Let us try to answer the question: Does alcohol consumption affect academic performance?

We can draw the same graph as above, comparing the final grades across the five categories of workday drinkers, using the Dalc variable

In [21]:
fig, axes = plt.subplots(1,2, figsize = (14,4))
sns.boxplot(x='Dalc', y='G3', data=student, ax=axes[0])
sns.pointplot(x='Dalc', y='G3', data=student, ax=axes[1]);



# Hypothesis testing ¶

Some statistical tests from scipy.stats:

• kurtosistest(a[, axis, nan_policy]) Tests whether a dataset has normal kurtosis
• normaltest(a[, axis, nan_policy]) Tests whether a sample differs from a normal distribution.
• skewtest(a[, axis, nan_policy]) Tests whether the skew is different from the normal distribution.
• pearsonr(x, y) Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.
• ttest_1samp(a, popmean[, axis, nan_policy]) Calculates the T-test for the mean of ONE group of scores.
• ttest_ind(a, b[, axis, equal_var, nan_policy]) Calculates the T-test for the means of two independent samples of scores.
• ttest_ind_from_stats(mean1, std1, nobs1, ...) T-test for means of two independent samples from descriptive statistics.
• ttest_rel(a, b[, axis, nan_policy]) Calculates the T-test on TWO RELATED samples of scores, a and b.
• kstest(rvs, cdf[, args, N, alternative, mode]) Perform the Kolmogorov-Smirnov test for goodness of fit.
• chisquare(f_obs[, f_exp, ddof, axis]) Calculates a one-way chi square test.
• ansari(x, y) Perform the Ansari-Bradley test for equal scale parameters
• bartlett(*args) Perform Bartlett’s test for equal variances
• levene(*args, **kwds) Perform Levene test for equal variances.
• shapiro(x[, a, reta]) Perform the Shapiro-Wilk test for normality.
• anderson(x[, dist]) Anderson-Darling test for data coming from a particular distribution
• anderson_ksamp(samples[, midrank]) The Anderson-Darling test for k-samples.

## Normality test ¶

Let's check normality using the kurtosis test, whose null hypothesis is that the sample comes from a distribution with the kurtosis of a normal distribution. We will use a significance level of 5%.

In [22]:
stats.kurtosistest(a=student['G3'])

Out[22]:
KurtosistestResult(statistic=6.754184473055633, pvalue=1.436408963570327e-11)

The p-value is far below 0.05, so we reject the null hypothesis: G3 does not have the kurtosis of a normal distribution.

## Correlation test ¶

Let's perform the correlation test, whose null hypothesis is that the population correlation coefficient is equal to zero. We will use a significance level of 5%.

In [23]:
stats.pearsonr(student['G1'],student['G2'])

Out[23]:
(0.8649816303085822, 6.373794780789756e-196)

pearsonr returns the correlation coefficient and the p-value: here r ≈ 0.86 with a p-value essentially zero, so we reject the null hypothesis of no correlation between G1 and G2.

## Test of Equal variances ¶

Let's perform Bartlett's test, whose null hypothesis is that the variances are equal. We will use a significance level of 5%.

In [24]:
student.groupby('sex')['G3'].var()

Out[24]:
sex
F     9.760297
M    11.026983
Name: G3, dtype: float64
In [25]:
grades_girls = student['G3'][student['sex']=='F']
grades_boys = student['G3'][student['sex']=='M']
stats.bartlett(grades_girls, grades_boys)

Out[25]:
BartlettResult(statistic=1.1708204423803006, pvalue=0.2792327258200368)

According to the test we cannot reject the null hypothesis of equal variances, so we will assume that the two groups are samples from populations with the same variance. This information will be useful in our next test.
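Bartlett's test is sensitive to departures from normality; scipy.stats.levene tests the same null hypothesis and is more robust. A minimal sketch on toy data (the values below are illustrative only):

```python
from scipy import stats

# Two small illustrative samples with similar spread
a = [12, 14, 11, 13, 15, 12]
b = [10, 9, 11, 10, 12, 8]

# H0: the two populations have equal variances
stat, p = stats.levene(a, b)
```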

Let's again draw the graph showing the confidence intervals of the mean final grade for boys and girls

In [26]:
fig, axes = plt.subplots(1,2, figsize = (14,4))
sns.boxplot(x='sex', y='G3', data=student, ax=axes[0])
sns.pointplot(x='sex', y='G3', data=student, ax=axes[1]);



## Comparing means ¶

The visualizations suggest a difference between the mean final grades of the two groups. Now we will perform a formal statistical test of whether the population means differ.

Null Hypothesis: for both groups (Boys and Girls) the population means of the final grade are equal.

Alternative Hypothesis: The population means of the final grades are different.

A common test to apply in these cases is the two-sample t-test, which is used to determine whether two population means are equal.

The assumptions of this test are:

1. Independent samples: we will assume that the data-collection method ensured that the answers given by the students are independent.
2. Large enough sample size, or observations coming from a normally distributed population: this assumption matters when working with small samples (fewer than 30 observations); since the smaller group (the boys) has 266 observations, we can say that we have a "large enough" sample.
3. Equal variances.

In addition, this test has two versions: one assuming equal variances and the other assuming unequal variances; stats.ttest_ind switches between them through its equal_var argument.

In [27]:
stats.ttest_ind(grades_girls, grades_boys, equal_var=True)

Out[27]:
Ttest_indResult(statistic=3.310937693029702, pvalue=0.000981528706137396)

Since we got such a low p-value, we can reject the null hypothesis of equal means for the two groups at the 5% significance level.

There is therefore a statistically significant difference between the grades of the two groups, and the sample means suggest that girls have higher grades than boys.
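Had Bartlett's test rejected equal variances, we would instead pass equal_var=False to stats.ttest_ind, which runs Welch's t-test. A minimal sketch on made-up values:

```python
from scipy import stats

# Two illustrative groups with clearly different means
a = [15, 14, 16, 15, 17, 14]
b = [10, 9, 11, 10, 12, 8]

# Welch's t-test: does not assume equal population variances
stat, p = stats.ttest_ind(a, b, equal_var=False)
# here the difference is large relative to the spread, so p is small
```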

Do male teenagers drink more than female teenagers?

Let's first create a categorical variable Dalcc from the variable Dalc by considering three categories of drinkers:

• Low when the score is 1 or 2
• Med when the score is equal to 3
• High when the score is 4 or 5

In [28]:
Dalc_cat = {
1: 'Low',
2: 'Low',
3: 'Med',
4: 'High',
5: 'High'
}

In [29]:
student['Dalcc'] = student['Dalc'].map(Dalc_cat)
student['Dalcc'].head()

Out[29]:
0    Low
1    Low
2    Low
3    Low
4    Low
Name: Dalcc, dtype: object
In [30]:
student['Dalcc'].value_counts()

Out[30]:
Low     572
Med      43
High     34
Name: Dalcc, dtype: int64

Make it an ordered categorical variable

In [31]:
from pandas.api.types import CategoricalDtype
student['Dalcc'] = student['Dalcc'].astype(CategoricalDtype(categories=['Low', 'Med', 'High'],
                                                            ordered=True))

In [32]:
student['Dalcc'].value_counts(sort=False)

Out[32]:
Low     572
Med      43
High     34
Name: Dalcc, dtype: int64
In [33]:
gender_Dalc_table = pd.crosstab(student['Dalcc'], student['sex'])
gender_Dalc_table

Out[33]:
sex F M
Dalcc
Low 363 209
Med 11 32
High 9 25
In [34]:
fig, axes = plt.subplots(1,2, figsize = (14,4))
student['sex'].value_counts().plot(kind='bar', ax=axes[0], title='Gender')
student['Dalcc'].value_counts().plot(kind='bar', ax=axes[1], title='Alcohol Consumption Level');


Now the proportions from the cross-table, first conditioning on the alcohol-consumption level (row percentages)

In [35]:
100*(gender_Dalc_table.T/gender_Dalc_table.apply(sum, axis=1)).T

Out[35]:
sex F M
Dalcc
Low 63.461538 36.538462
Med 25.581395 74.418605
High 26.470588 73.529412

And conditioning on the sex variable

In [36]:
100*(gender_Dalc_table/gender_Dalc_table.apply(sum, axis=0)).T

Out[36]:
Dalcc Low Med High
sex
F 94.778068 2.872063 2.349869
M 78.571429 12.030075 9.398496
In [37]:
fig, axes = plt.subplots(1,2, figsize = (14,4))
gender_Dalc_table.plot(kind='bar', stacked=True, ax=axes[0]);
(100*(gender_Dalc_table.T/gender_Dalc_table.apply(sum, axis=1)).T).plot(kind='bar', stacked=True, ax=axes[1]);

In [38]:
fig, axes = plt.subplots(1,2, figsize = (14,4))
(gender_Dalc_table.T).plot(kind='bar', stacked=True, ax=axes[0]);
(100*(gender_Dalc_table/gender_Dalc_table.apply(sum, axis=0)).T).plot(kind='bar', stacked=True, ax=axes[1]);


Chi-square test of independence of variables in a contingency table.

This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table.

In [39]:
chi_stat, p_value, dof, expected = stats.chi2_contingency(gender_Dalc_table)

In [40]:
chi_stat

Out[40]:
39.435980582823575
In [41]:
p_value

Out[41]:
2.732660141395978e-09
In [42]:
dof

Out[42]:
2

With 2 degrees of freedom and a p-value of about 2.7e-09, we reject the hypothesis of independence: the alcohol-consumption level and sex are associated.
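stats.chi2_contingency also returns the table of expected counts under independence, which is worth inspecting next to the observed counts. Rebuilding the contingency table above by hand:

```python
import numpy as np
from scipy import stats

# Same counts as gender_Dalc_table (rows: Low, Med, High; columns: F, M)
table = np.array([[363, 209],
                  [11, 32],
                  [9, 25]])

chi2, p, dof, expected = stats.chi2_contingency(table)
# expected[i, j] = row_total[i] * col_total[j] / grand_total
```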

# One-Way Anova ¶

The one-way analysis of variance (ANOVA) is used to determine whether there are any statistically significant differences between the means of independent groups.

The Tukey Test (or Tukey procedure), also called Tukey’s Honest Significant Difference test, is a post-hoc test based on the studentized range distribution.

An ANOVA test can tell you whether your results are significant overall, but it will not tell you exactly where the differences lie. After you have run an ANOVA and found significant results, you can run Tukey's HSD to find out which specific group means differ from each other. The test compares all possible pairs of means.

As an example we will compare the final grades across the alcohol-consumption groups on the whole sample, and then do the same on the girls' and boys' sub-samples.
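This two-step workflow can be sketched on toy data: f_oneway for the overall test, then pairwise_tukeyhsd for the pairwise comparisons (the group names below are made up for illustration):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Three toy groups; the third is clearly shifted upward
g1 = [10, 11, 12, 11, 10]
g2 = [10, 12, 11, 10, 11]
g3 = [15, 16, 14, 15, 16]

# Step 1: overall one-way ANOVA
f_stat, p = stats.f_oneway(g1, g2, g3)

# Step 2: since the overall test is significant, locate the differing pairs
values = np.array(g1 + g2 + g3)
labels = ['g1'] * 5 + ['g2'] * 5 + ['g3'] * 5
print(pairwise_tukeyhsd(values, labels))
```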

Let us first import the libraries

In [43]:
import pandas as pd
import scipy.stats as stats
import researchpy as rp
import statsmodels.api as sm
from statsmodels.formula.api import ols

/Users/dhafermalouche/anaconda3/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
from pandas.core import datetools

In [44]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison


We will now compute the confidence interval of the mean of G3, and the confidence intervals by group

In [45]:
rp.summary_cont(student['G3'])



Out[45]:
Variable N Mean SD SE 95% Conf. Interval
0 G3 649.0 11.906009 3.230656 0.126814 11.656993 12.155026
In [46]:
rp.summary_cont(student['G3'].groupby(student['Dalcc']))



Out[46]:
N Mean SD SE 95% Conf. Interval
Dalcc
Low 572 12.101399 3.172263 0.132639 11.841199 12.361599
Med 43 11.139535 2.252844 0.343555 10.458197 11.820873
High 34 9.588235 4.171459 0.715399 8.164966 11.011505

A first method to perform the ANOVA is scipy's f_oneway:

In [47]:
stats.f_oneway(student['G3'][student['Dalcc'] == 'High'],
               student['G3'][student['Dalcc'] == 'Med'],
               student['G3'][student['Dalcc'] == 'Low'])

Out[47]:
F_onewayResult(statistic=11.358219553765995, pvalue=1.4188530997129723e-05)

The p-value is far below 0.05, so we reject the null hypothesis that the three consumption groups have equal mean final grades.

Second method

The second method fits an OLS regression with statsmodels. We first need to create a new variable with type object:

In [48]:
student['Dalcc'].head()

Out[48]:
0    Low
1    Low
2    Low
3    Low
4    Low
Name: Dalcc, dtype: category
Categories (3, object): [Low < Med < High]
In [49]:
student['Dalcc2'] = student['Dalcc'].astype(dtype='object')

In [50]:
student['Dalcc2'].head()

Out[50]:
0    Low
1    Low
2    Low
3    Low
4    Low
Name: Dalcc2, dtype: object
In [51]:
results = ols('G3 ~ C(Dalcc2)', data=student).fit()
results.summary()

Out[51]:
OLS Regression Results
==============================================================================
Dep. Variable:                     G3   R-squared:                       0.034
Model:                            OLS   Adj. R-squared:                  0.031
Method:                 Least Squares   F-statistic:                     11.36
Date:                Sun, 04 Nov 2018   Prob (F-statistic):           1.42e-05
Time:                        20:01:02   Log-Likelihood:                -1670.2
No. Observations:                 649   AIC:                             3346.
Df Residuals:                     646   BIC:                             3360.
Df Model:                           2
Covariance Type:            nonrobust
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            9.5882      0.545     17.580      0.000       8.517      10.659
C(Dalcc2)[T.Low]     2.5132      0.561      4.477      0.000       1.411       3.616
C(Dalcc2)[T.Med]     1.5513      0.730      2.126      0.034       0.118       2.984
==============================================================================
Omnibus:                      109.962   Durbin-Watson:                   1.608
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              269.472
Skew:                          -0.884   Prob(JB):                     3.05e-59
Kurtosis:                       5.615   Cond. No.                         10.8
==============================================================================

Let us now compare the one-way ANOVA results between the girls' and boys' sub-samples. To do so we will fit the same model on each sub-sample separately, then compare the coefficients to measure how the impact of alcohol consumption on grades differs between girls and boys.

Girls sub-sample

In [52]:
student_girls=student[student['sex']=='F']

In [53]:
student_girls.head()

Out[53]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... goout Dalc Walc health absences G1 G2 G3 Dalcc Dalcc2
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4 1 1 3 4 0 11 11 Low Low
1 GP F 17 U GT3 T 1 1 at_home other ... 3 1 1 3 2 9 11 11 Low Low
2 GP F 15 U LE3 T 1 1 at_home other ... 2 2 3 3 6 12 13 12 Low Low
3 GP F 15 U GT3 T 4 2 health services ... 2 1 1 5 0 14 14 14 Low Low
4 GP F 16 U GT3 T 3 3 other other ... 2 1 2 5 0 11 13 13 Low Low

5 rows × 35 columns

Let's then perform the one-way ANOVA

In [54]:
results_girls = ols('G3 ~ C(Dalcc2)', data=student_girls).fit()
results_girls.summary()

Out[54]:
OLS Regression Results
==============================================================================
Dep. Variable:                     G3   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     1.496
Date:                Sun, 04 Nov 2018   Prob (F-statistic):              0.225
Time:                        20:01:07   Log-Likelihood:                -977.75
No. Observations:                 383   AIC:                             1961.
Df Residuals:                     380   BIC:                             1973.
Df Model:                           2
Covariance Type:            nonrobust
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           10.5556      1.040     10.149      0.000       8.511      12.600
C(Dalcc2)[T.Low]     1.7530      1.053      1.665      0.097      -0.317       3.823
C(Dalcc2)[T.Med]     1.2626      1.402      0.900      0.369      -1.495       4.020
==============================================================================
Omnibus:                       62.566   Durbin-Watson:                   1.544
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              144.827
Skew:                          -0.829   Prob(JB):                     3.56e-32
Kurtosis:                       5.516   Cond. No.                         16.5
==============================================================================

Boys sub-sample

In [55]:
student_boys=student[student['sex']=='M']

In [56]:
student_boys.head()

Out[56]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... goout Dalc Walc health absences G1 G2 G3 Dalcc Dalcc2
5 GP M 16 U LE3 T 4 3 services other ... 2 1 2 5 6 12 12 13 Low Low
6 GP M 16 U LE3 T 2 2 other other ... 4 1 1 3 0 13 12 13 Low Low
8 GP M 15 U LE3 A 3 2 services other ... 2 1 1 1 0 15 16 17 Low Low
9 GP M 15 U GT3 T 3 4 other other ... 1 1 1 5 0 12 12 13 Low Low
12 GP M 15 U LE3 T 4 4 health services ... 3 1 3 5 0 12 13 12 Low Low

5 rows × 35 columns

Let's then perform the one-way ANOVA

In [57]:
results_boys = ols('G3 ~ C(Dalcc2)', data=student_boys).fit()
results_boys.summary()

Out[57]:
OLS Regression Results
==============================================================================
Dep. Variable:                     G3   R-squared:                       0.051
Model:                            OLS   Adj. R-squared:                  0.044
Method:                 Least Squares   F-statistic:                     7.057
Date:                Sun, 04 Nov 2018   Prob (F-statistic):            0.00103
Time:                        20:01:10   Log-Likelihood:                -689.23
No. Observations:                 266   AIC:                             1384.
Df Residuals:                     263   BIC:                             1395.
Df Model:                           2
Covariance Type:            nonrobust
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            9.2400      0.649     14.227      0.000       7.961      10.519
C(Dalcc2)[T.Low]     2.5016      0.687      3.640      0.000       1.148       3.855
C(Dalcc2)[T.Med]     1.6662      0.867      1.922      0.056      -0.040       3.373
==============================================================================
Omnibus:                       54.017   Durbin-Watson:                   1.640
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              124.841
Skew:                          -0.967   Prob(JB):                     7.78e-28
Kurtosis:                       5.743   Cond. No.                         7.75
==============================================================================

Comparison of the coefficients

We will draw, in a single figure, a plot with the confidence intervals of the coefficients of the boys' and girls' models. We first create a dataframe containing the estimates from both models.

In [58]:
x = ['Boys', 'Girls']
coef_df = pd.DataFrame()
for i, mod in enumerate([results_boys, results_girls]):
    err_series = mod.params - mod.conf_int()[0]
    coef_df = coef_df.append(pd.DataFrame({'coef': mod.params.values[1:],
                                           'err': err_series.values[1:],
                                           'varname': err_series.index.values[1:],
                                           'model': x[i]}))
coef_df

Out[58]:
coef err varname model
0 2.501627 1.353127 C(Dalcc2)[T.Low] Boys
1 1.666250 1.706736 C(Dalcc2)[T.Med] Boys
0 1.752984 2.070134 C(Dalcc2)[T.Low] Girls
1 1.262626 2.757395 C(Dalcc2)[T.Med] Girls
In [59]:
## markers to use, one per model
marker_list = 'so'
width = 0.05
## 2 dummy coefficients in total (Low and Med)
base_x = arange(2) - 0.2
base_x

Out[59]:
array([-0.2,  0.8])
In [60]:
from matplotlib import pyplot as plt
from matplotlib.lines import Line2D
%matplotlib inline

fig, ax = plt.subplots(figsize=(8, 5))
for i, mod in enumerate(coef_df.model.unique()):
    mod_df = coef_df[coef_df.model == mod]
    mod_df = mod_df.set_index('varname').reindex(coef_df['varname'].unique())
    ## offset x positions
    X = base_x + width*i
    ax.bar(X, mod_df['coef'],
           color='none', yerr=mod_df['err'])
    ## remove axis labels
    ax.set_ylabel('')
    ax.set_xlabel('')
    ax.scatter(x=X,
               marker=marker_list[i], s=120,
               y=mod_df['coef'], color='black')
ax.axhline(y=0, linestyle='--', color='black', linewidth=4)
ax.xaxis.set_ticks_position('none')
_ = ax.set_xticklabels(['', '', 'Low', '', '', '', 'Med'],
                       rotation=0, fontsize=16)

## finally, build a customized legend
legend_elements = [Line2D([0], [0], marker=m,
                          label=x[i],
                          color='k',
                          markersize=10)
                   for i, m in enumerate(marker_list)]
_ = ax.legend(handles=legend_elements, loc=2,
              prop={'size': 15}, labelspacing=1.2)


# Tukey’s HSD Post-hoc comparison ¶

The overall model was significant; now we test which groups differ. Deciding which groups to compare should be theory-driven. There are a few different techniques, each with its own way of controlling the familywise error rate. We will show here how to perform one of them, Tukey's HSD:

In [61]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison

mc_girls = MultiComparison(student_girls['G3'], student_girls['Dalcc2'])
mc_results_girls = mc_girls.tukeyhsd()
print(mc_results_girls)

Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
High   Low    1.753   -0.7244 4.2304 False
High   Med    1.2626  -2.0372 4.5625 False
Low    Med   -0.4904  -2.7373 1.7565 False
--------------------------------------------

In [62]:
mc_boys = MultiComparison(student_boys['G3'], student_boys['Dalcc2'])
mc_results_boys = mc_boys.tukeyhsd()
print(mc_results_boys)