Introduction

  • In this lesson we will briefly introduce the Scientific Library for Python (SciPy), the scientific toolbox for Python.
  • We will get a brief overview of the stats subpackage and use it to perform statistical calculations, including probabilities, probability distributions, and confidence intervals.
  • We will also perform hypothesis testing on a real-world dataset, allowing us to state conclusions that go beyond the sample at hand.

In this lesson we will show how to use the SciPy library for

  • Statistics
  • Probabilities
  • Hypothesis testing and statistical tests

Introduction to SciPy

SciPy

  • is a tool for scientific computing in Python.
  • is a Python-based ecosystem of open-source software for math, science, and engineering.
  • contains various toolboxes dedicated to common issues in scientific computing.

Let us first import the libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import sqrt, arange
from scipy import stats
%matplotlib inline

In this lesson we use a dataset containing information about Portuguese students from two public schools. This is a real-world dataset collected to study alcohol consumption among young people and its effects on students' academic performance. The dataset was built from two sources: school reports and questionnaires.

Attributes:

  • 1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
  • 2 sex - student's sex (binary: 'F' - female or 'M' - male)
  • 3 age - student's age (numeric: from 15 to 22)
  • 4 address - student's home address type (binary: 'U' - urban or 'R' - rural)
  • 5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
  • 6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
  • 7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
  • 8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
  • 9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
  • 10 Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
  • 11 reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
  • 12 guardian - student's guardian (nominal: 'mother', 'father' or 'other')
  • 13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
  • 14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  • 15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
  • 16 schoolsup - extra educational support (binary: yes or no)
  • 17 famsup - family educational support (binary: yes or no)
  • 18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
  • 19 activities - extra-curricular activities (binary: yes or no)
  • 20 nursery - attended nursery school (binary: yes or no)
  • 21 higher - wants to take higher education (binary: yes or no)
  • 22 internet - Internet access at home (binary: yes or no)
  • 23 romantic - with a romantic relationship (binary: yes or no)
  • 24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  • 25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
  • 26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
  • 27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
  • 28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
  • 29 health - current health status (numeric: from 1 - very bad to 5 - very good)
  • 30 absences - number of school absences (numeric: from 0 to 93)

These grades are related to the course subject:

  • 31 G1 - first period grade (numeric: from 0 to 20)
  • 32 G2 - second period grade (numeric: from 0 to 20)
  • 33 G3 - final grade (numeric: from 0 to 20, output target)

Importing the data into Python

In [2]:
student = pd.read_csv("/Users/dhafermalouche/Documents/Teaching_2018_2019/Teaching_python/student.csv", sep=";")

A preview of the data

In [3]:
student.head()
Out[3]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4 3 4 1 1 3 4 0 11 11
1 GP F 17 U GT3 T 1 1 at_home other ... 5 3 3 1 1 3 2 9 11 11
2 GP F 15 U LE3 T 1 1 at_home other ... 4 3 2 2 3 3 6 12 13 12
3 GP F 15 U GT3 T 4 2 health services ... 3 2 2 1 1 5 0 14 14 14
4 GP F 16 U GT3 T 3 3 other other ... 4 3 2 1 2 5 0 11 13 13

5 rows × 33 columns

Confidence Intervals

G3 is the variable that gives the final grade (numeric: from 0 to 20, output target). We will show how to compute the confidence interval of the mean.

The sample size

In [4]:
sample_size = student.shape[0]
sample_size
Out[4]:
649

Because the sample size is much greater than 30, the Central Limit Theorem applies: we can calculate a confidence interval for the mean using the normal distribution.

To get the confidence interval for the mean we need three numbers:

  1. Sample mean
  2. Standard error
  3. Confidence level

Formula for the standard error:

$$ SE = \frac{s}{\sqrt{n}} $$

In [5]:
sample_mean_grade = student['G3'].mean()
sample_mean_grade
Out[5]:
11.906009244992296
In [6]:
std_error_grades = student['G3'].std()/sqrt(sample_size)
std_error_grades
Out[6]:
0.126814350307949
In [7]:
stats.norm.interval(0.95, loc=sample_mean_grade, scale=std_error_grades)
Out[7]:
(11.65745768566587, 12.154560804318722)
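
As a sanity check, here is a minimal sketch (assuming the variables from the cells above are still in scope) that computes the same 95% interval by hand as mean ± z * SE:

z = stats.norm.ppf(0.975)  # two-sided 95% critical value, about 1.96
(sample_mean_grade - z * std_error_grades,
 sample_mean_grade + z * std_error_grades)  # matches stats.norm.interval above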

Let us compare the confidence intervals of G3 between girls and boys

Girls data

In [8]:
SF = student.loc[(student.sex == 'F')]
In [9]:
SF.head()
Out[9]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4 3 4 1 1 3 4 0 11 11
1 GP F 17 U GT3 T 1 1 at_home other ... 5 3 3 1 1 3 2 9 11 11
2 GP F 15 U LE3 T 1 1 at_home other ... 4 3 2 2 3 3 6 12 13 12
3 GP F 15 U GT3 T 4 2 health services ... 3 2 2 1 1 5 0 14 14 14
4 GP F 16 U GT3 T 3 3 other other ... 4 3 2 1 2 5 0 11 13 13

5 rows × 33 columns

Boys data

In [10]:
SM = student.loc[(student.sex == 'M')]
In [11]:
SM.head()
Out[11]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
5 GP M 16 U LE3 T 4 3 services other ... 5 4 2 1 2 5 6 12 12 13
6 GP M 16 U LE3 T 2 2 other other ... 4 4 4 1 1 3 0 13 12 13
8 GP M 15 U LE3 A 3 2 services other ... 4 2 2 1 1 1 0 15 16 17
9 GP M 15 U GT3 T 3 4 other other ... 5 5 1 1 1 5 0 12 12 13
12 GP M 15 U LE3 T 4 4 health services ... 4 3 3 1 3 5 0 12 13 12

5 rows × 33 columns

In [12]:
sample_mean_grade_F = SF['G3'].mean()
sample_mean_grade_F
Out[12]:
12.2532637075718
In [13]:
sample_mean_grade_M = SM['G3'].mean()
sample_mean_grade_M
Out[13]:
11.406015037593985
In [14]:
sample_size_F = SF.shape[0]
sample_size_F 
Out[14]:
383
In [15]:
sample_size_M = SM.shape[0]
sample_size_M
Out[15]:
266
In [16]:
std_error_grades_F = SF['G3'].std()/sqrt(sample_size_F)
std_error_grades_F
Out[16]:
0.1596364743648972
In [17]:
std_error_grades_M = SM['G3'].std()/sqrt(sample_size_M)
std_error_grades_M
Out[17]:
0.20360457170514298

Girls Confidence Interval

In [18]:
stats.norm.interval(0.95, loc=sample_mean_grade_F, scale=std_error_grades_F)
Out[18]:
(11.94038196719765, 12.566145447945951)

Boys Confidence Interval

In [19]:
stats.norm.interval(0.95, loc=sample_mean_grade_M, scale=std_error_grades_M)
Out[19]:
(11.006957409964201, 11.805072665223769)
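
Since we repeated the same computation for each group, a small helper sketch can wrap it up (the function name mean_ci is hypothetical, not part of SciPy or pandas):

def mean_ci(series, level=0.95):
    # normal-approximation confidence interval for the mean of a pandas Series
    se = series.std() / sqrt(series.shape[0])
    return stats.norm.interval(level, loc=series.mean(), scale=se)

mean_ci(SF['G3']), mean_ci(SM['G3'])  # reproduces the two intervals above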

Let us now show this graphically.

In [20]:
fig, axes = plt.subplots(1,2, figsize = (14,4))
sns.boxplot(x='sex', y='G3', data=student, ax=axes[0])
sns.pointplot(x='sex', y='G3', data=student, ax=axes[1]);

Let us now try to answer the question: does alcohol consumption affect academic performance?

We can draw the same graph as above, comparing the final grades across the five levels of workday alcohol consumption, using the Dalc variable.

In [21]:
fig, axes = plt.subplots(1,2, figsize = (14,4))
sns.boxplot(x='Dalc', y='G3', data=student, ax=axes[0])
sns.pointplot(x='Dalc', y='G3', data=student, ax=axes[1]);

Hypothesis testing

Some statistical tests from scipy.stats

  • kurtosistest(a[, axis, nan_policy]) Tests whether a dataset has normal kurtosis
  • normaltest(a[, axis, nan_policy]) Tests whether a sample differs from a normal distribution.
  • skewtest(a[, axis, nan_policy]) Tests whether the skew is different from the normal distribution.
  • pearsonr(x, y) Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.
  • ttest_1samp(a, popmean[, axis, nan_policy]) Calculates the T-test for the mean of ONE group of scores.
  • ttest_ind(a, b[, axis, equal_var, nan_policy]) Calculates the T-test for the means of two independent samples of scores.
  • ttest_ind_from_stats(mean1, std1, nobs1, ...) T-test for means of two independent samples from descriptive statistics.
  • ttest_rel(a, b[, axis, nan_policy]) Calculates the T-test on TWO RELATED samples of scores, a and b.
  • kstest(rvs, cdf[, args, N, alternative, mode]) Perform the Kolmogorov-Smirnov test for goodness of fit.
  • chisquare(f_obs[, f_exp, ddof, axis]) Calculates a one-way chi-square test.
  • ansari(x, y) Perform the Ansari-Bradley test for equal scale parameters
  • bartlett(*args) Perform Bartlett’s test for equal variances
  • levene(*args, **kwds) Perform Levene test for equal variances.
  • shapiro(x[, a, reta]) Perform the Shapiro-Wilk test for normality.
  • anderson(x[, dist]) Anderson-Darling test for data coming from a particular distribution
  • anderson_ksamp(samples[, midrank]) The Anderson-Darling test for k-samples.

Normality test

Let's check normality at a significance level of 5%. Here we use the kurtosis test, whose Null Hypothesis is that the sample comes from a distribution with the kurtosis of a normal distribution.

In [22]:
stats.kurtosistest(a=student['G3'])
Out[22]:
KurtosistestResult(statistic=6.754184473055633, pvalue=1.436408963570327e-11)
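
The p-value is far below 5%, so we reject the null hypothesis: the kurtosis of G3 is not that of a normal distribution. As a complementary check, here is a minimal sketch using two other normality tests from the list above:

stats.normaltest(student['G3'])  # combines the skew and kurtosis tests
stats.shapiro(student['G3'])     # Shapiro-Wilk test; returns (statistic, p-value)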

Correlation test

Let's perform the correlation test whose Null Hypothesis is that the correlation coefficient is equal to zero. We will use a significance level of 5%.

In [23]:
stats.pearsonr(student['G1'],student['G2'])
Out[23]:
(0.8649816303085822, 6.373794780789756e-196)
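
The p-value is essentially zero, so we reject the null hypothesis: G1 and G2 are strongly positively correlated. A minimal sketch unpacking the result to make the decision explicit:

r, p = stats.pearsonr(student['G1'], student['G2'])
print("r = %.3f, reject H0 of zero correlation at 5%%: %s" % (r, p < 0.05))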

Test of Equal variances

Let's perform Bartlett's test, whose Null Hypothesis is that the variances are equal. We will use a significance level of 5%.

In [24]:
student.groupby('sex')['G3'].var()
Out[24]:
sex
F     9.760297
M    11.026983
Name: G3, dtype: float64
In [25]:
grades_girls = student['G3'][student['sex']=='F']
grades_boys = student['G3'][student['sex']=='M']
stats.bartlett(grades_girls, grades_boys)
Out[25]:
BartlettResult(statistic=1.1708204423803006, pvalue=0.2792327258200368)

According to the test we cannot reject the Null hypothesis of equal variances, so we will assume that the two groups are samples from populations with the same variance. This information will be useful in our next test.
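
Bartlett's test is sensitive to departures from normality. As a robustness check, here is a minimal sketch using Levene's test (listed above), which makes weaker distributional assumptions:

# same null hypothesis of equal variances, more robust to non-normal data
stats.levene(grades_girls, grades_boys)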

Let's again draw a graph showing the confidence intervals of the mean final grade for boys and girls

In [26]:
fig, axes = plt.subplots(1,2, figsize = (14,4))
sns.boxplot(x='sex', y='G3', data=student, ax=axes[0])
sns.pointplot(x='sex', y='G3', data=student, ax=axes[1]);

Comparing means

The visualizations suggest there is a difference between the mean final grades of the two groups. Now we will perform a formal statistical test of whether the two population means differ.

Null Hypothesis: for both groups (Boys and Girls) the population means of the final grade are equal.

Alternative Hypothesis: The population means of the final grades are different.

A common test to apply in these cases is the two-sample t-test, which is used to determine whether two population means are equal.

The assumptions of this test are:

  1. Independent samples: we will assume that the method for collecting the data assured that the answers given by the students are independent.
  2. Large enough sample size or observations come from a normally-distributed population: this assumption matters only when working with small samples (fewer than 30 observations); since the smaller group has 266 observations, we can say that we have a "large enough" sample.
  3. Variances are equal: supported by Bartlett's test above.

In addition, this test has two versions: one assuming equal variances and the other (Welch's t-test) assuming unequal variances; a sketch of the latter follows the test below.

In [27]:
stats.ttest_ind(grades_girls, grades_boys, equal_var=True)
Out[27]:
Ttest_indResult(statistic=3.310937693029702, pvalue=0.000981528706137396)

Since we got such a low p-value, we can reject the Null hypothesis of equal means for the two groups at the 5% level of significance.

There is therefore a statistically significant difference between the grades of the two analyzed groups, and the sample means suggest that girls have higher grades than boys.
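
For reference, had Bartlett's test rejected equal variances, we would use Welch's version of the test; a minimal sketch:

# Welch's t-test: does not assume equal population variances
stats.ttest_ind(grades_girls, grades_boys, equal_var=False)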

Do male teenagers drink more than female teenagers?

Let's first create a categorical variable Dalcc from the variable Dalc by considering three categories of drinker:

  • Low when the score is 1 or 2
  • Med when the score is equal to 3
  • High when the score is 4 or 5

In [28]:
Dalc_cat = {
    1: 'Low',
    2: 'Low',
    3: 'Med',
    4: 'High',
    5: 'High'
}
In [29]:
student['Dalcc'] = student['Dalc'].map(Dalc_cat)
student['Dalcc'].head()
Out[29]:
0    Low
1    Low
2    Low
3    Low
4    Low
Name: Dalcc, dtype: object
In [30]:
student['Dalcc'].value_counts()
Out[30]:
Low     572
Med      43
High     34
Name: Dalcc, dtype: int64

Make it an ordered categorical variable

In [31]:
from pandas.api.types import CategoricalDtype
student['Dalcc'] = student['Dalcc'].astype(CategoricalDtype(categories=['Low', 'Med', 'High'],
                                                            ordered=True))
In [32]:
student['Dalcc'].value_counts(sort=False)
Out[32]:
Low     572
Med      43
High     34
Name: Dalcc, dtype: int64
In [33]:
gender_Dalc_table = pd.crosstab(student['Dalcc'], student['sex'])
gender_Dalc_table
Out[33]:
sex F M
Dalcc
Low 363 209
Med 11 32
High 9 25
In [34]:
fig, axes = plt.subplots(1,2, figsize = (14,4))
student['sex'].value_counts().plot(kind='bar', ax=axes[0], title='Gender')
student['Dalcc'].value_counts().plot(kind='bar', ax=axes[1], title='Alcohol Consumption Level');

Let us also express the cross-table as percentages, and then plot it.

Given the alcohol-consumption level (row percentages):

In [35]:
100*(gender_Dalc_table.T/gender_Dalc_table.apply(sum, axis=1)).T
Out[35]:
sex F M
Dalcc
Low 63.461538 36.538462
Med 25.581395 74.418605
High 26.470588 73.529412

And given sex (column percentages):

In [36]:
100*(gender_Dalc_table/gender_Dalc_table.apply(sum, axis=0)).T
Out[36]:
Dalcc Low Med High
sex
F 94.778068 2.872063 2.349869
M 78.571429 12.030075 9.398496
In [37]:
fig, axes = plt.subplots(1,2, figsize = (14,4))
gender_Dalc_table.plot(kind='bar', stacked=True, ax=axes[0]);
(100*(gender_Dalc_table.T/gender_Dalc_table.apply(sum, axis=1)).T).plot(kind='bar', stacked=True, ax=axes[1]);
In [38]:
fig, axes = plt.subplots(1,2, figsize = (14,4))
(gender_Dalc_table.T).plot(kind='bar', stacked=True, ax=axes[0]);
(100*(gender_Dalc_table/gender_Dalc_table.apply(sum, axis=0)).T).plot(kind='bar', stacked=True, ax=axes[1]);

Chi-square test of independence of variables in a contingency table

The function stats.chi2_contingency computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table.

In [39]:
chi_stat, p_value, dof, expected = stats.chi2_contingency(gender_Dalc_table)
In [40]:
chi_stat
Out[40]:
39.435980582823575
In [41]:
 p_value
Out[41]:
2.732660141395978e-09
In [42]:
dof
Out[42]:
2
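
The p-value is far below 5%, so we reject the null hypothesis of independence: alcohol-consumption level and sex are associated in this sample. chi2_contingency also returned the expected frequencies under independence; a minimal sketch (assuming the objects above are still in scope) to inspect them:

# expected counts under independence, aligned with the observed table
pd.DataFrame(expected,
             index=gender_Dalc_table.index,
             columns=gender_Dalc_table.columns).round(1)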

One-Way Anova

The one-way analysis of variance (ANOVA) is used to determine whether there are any statistically significant differences between the means of independent groups.

The Tukey Test (or Tukey procedure), also called Tukey’s Honest Significant Difference test, is a post-hoc test based on the studentized range distribution.

An ANOVA test can tell you whether your results are significant overall, but it will not tell you exactly where the differences lie. After you have run an ANOVA and found significant results, you can run Tukey's HSD to find out which specific groups' means differ from each other. The test compares all possible pairs of means.

As an example we will compare the final grades between the alcohol-consumption groups on the whole sample, and then do the same on the girls and boys sub-samples.

Let us first import the libraries

In [43]:
import pandas as pd
import scipy.stats as stats
import researchpy as rp
import statsmodels.api as sm
from statsmodels.formula.api import ols
In [44]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison

We will now compute the confidence interval of the mean of G3, overall and by alcohol-consumption group

In [45]:
rp.summary_cont(student['G3'])

Out[45]:
Variable N Mean SD SE 95% Conf. Interval
0 G3 649.0 11.906009 3.230656 0.126814 11.656993 12.155026
In [46]:
rp.summary_cont(student['G3'].groupby(student['Dalcc']))

Out[46]:
N Mean SD SE 95% Conf. Interval
Dalcc
Low 572 12.101399 3.172263 0.132639 11.841199 12.361599
Med 43 11.139535 2.252844 0.343555 10.458197 11.820873
High 34 9.588235 4.171459 0.715399 8.164966 11.011505

A first method to perform the ANOVA is stats.f_oneway:

In [47]:
stats.f_oneway(student['G3'][student['Dalcc'] == 'High'], 
             student['G3'][student['Dalcc'] == 'Med'],
             student['G3'][student['Dalcc'] == 'Low'])
Out[47]:
F_onewayResult(statistic=11.358219553765995, pvalue=1.4188530997129723e-05)

The p-value is far below 5%, so we reject the null hypothesis of equal mean final grades across the three consumption groups.

Second method

For the second method we fit an OLS regression with statsmodels. We first need to create a new version of the variable with type object:

In [48]:
student['Dalcc'].head()
Out[48]:
0    Low
1    Low
2    Low
3    Low
4    Low
Name: Dalcc, dtype: category
Categories (3, object): [Low < Med < High]
In [49]:
student['Dalcc2'] = student['Dalcc'].astype(dtype='object')
In [50]:
student['Dalcc2'].head()
Out[50]:
0    Low
1    Low
2    Low
3    Low
4    Low
Name: Dalcc2, dtype: object
In [51]:
results = ols('G3 ~ C(Dalcc2)', data=student).fit()
results.summary()
Out[51]:
OLS Regression Results
Dep. Variable: G3 R-squared: 0.034
Model: OLS Adj. R-squared: 0.031
Method: Least Squares F-statistic: 11.36
Date: Sun, 04 Nov 2018 Prob (F-statistic): 1.42e-05
Time: 20:01:02 Log-Likelihood: -1670.2
No. Observations: 649 AIC: 3346.
Df Residuals: 646 BIC: 3360.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 9.5882 0.545 17.580 0.000 8.517 10.659
C(Dalcc2)[T.Low] 2.5132 0.561 4.477 0.000 1.411 3.616
C(Dalcc2)[T.Med] 1.5513 0.730 2.126 0.034 0.118 2.984
Omnibus: 109.962 Durbin-Watson: 1.608
Prob(Omnibus): 0.000 Jarque-Bera (JB): 269.472
Skew: -0.884 Prob(JB): 3.05e-59
Kurtosis: 5.615 Cond. No. 10.8
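
To read this fit as a classical ANOVA table (sums of squares, F statistic, p-value), statsmodels can decompose the fitted model; a minimal sketch assuming the results object from the cell above:

import statsmodels.api as sm
# Type-2 ANOVA table for the fitted OLS model; the F statistic and p-value
# match the stats.f_oneway output above.
sm.stats.anova_lm(results, typ=2)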

Let us now compare the one-way ANOVA results between the girls and boys sub-samples. To do so, we will consider these two sub-samples and perform a one-way ANOVA on each. We will then compare the coefficients to measure the difference in the impact of alcohol consumption between girls and boys.

Girls sub-sample

In [52]:
student_girls = student[student['sex'] == 'F']
In [53]:
student_girls.head()
Out[53]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... goout Dalc Walc health absences G1 G2 G3 Dalcc Dalcc2
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4 1 1 3 4 0 11 11 Low Low
1 GP F 17 U GT3 T 1 1 at_home other ... 3 1 1 3 2 9 11 11 Low Low
2 GP F 15 U LE3 T 1 1 at_home other ... 2 2 3 3 6 12 13 12 Low Low
3 GP F 15 U GT3 T 4 2 health services ... 2 1 1 5 0 14 14 14 Low Low
4 GP F 16 U GT3 T 3 3 other other ... 2 1 2 5 0 11 13 13 Low Low

5 rows × 35 columns

Let's then perform the one-way ANOVA

In [54]:
results_girls = ols('G3 ~ C(Dalcc2)', data=student_girls).fit()
results_girls.summary()
Out[54]:
OLS Regression Results
Dep. Variable: G3 R-squared: 0.008
Model: OLS Adj. R-squared: 0.003
Method: Least Squares F-statistic: 1.496
Date: Sun, 04 Nov 2018 Prob (F-statistic): 0.225
Time: 20:01:07 Log-Likelihood: -977.75
No. Observations: 383 AIC: 1961.
Df Residuals: 380 BIC: 1973.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 10.5556 1.040 10.149 0.000 8.511 12.600
C(Dalcc2)[T.Low] 1.7530 1.053 1.665 0.097 -0.317 3.823
C(Dalcc2)[T.Med] 1.2626 1.402 0.900 0.369 -1.495 4.020
Omnibus: 62.566 Durbin-Watson: 1.544
Prob(Omnibus): 0.000 Jarque-Bera (JB): 144.827
Skew: -0.829 Prob(JB): 3.56e-32
Kurtosis: 5.516 Cond. No. 16.5

Boys sub-sample

In [55]:
student_boys = student[student['sex'] == 'M']
In [56]:
student_boys.head()
Out[56]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... goout Dalc Walc health absences G1 G2 G3 Dalcc Dalcc2
5 GP M 16 U LE3 T 4 3 services other ... 2 1 2 5 6 12 12 13 Low Low
6 GP M 16 U LE3 T 2 2 other other ... 4 1 1 3 0 13 12 13 Low Low
8 GP M 15 U LE3 A 3 2 services other ... 2 1 1 1 0 15 16 17 Low Low
9 GP M 15 U GT3 T 3 4 other other ... 1 1 1 5 0 12 12 13 Low Low
12 GP M 15 U LE3 T 4 4 health services ... 3 1 3 5 0 12 13 12 Low Low

5 rows × 35 columns

Let's then perform the one-way ANOVA

In [57]:
results_boys = ols('G3 ~ C(Dalcc2)', data=student_boys).fit()
results_boys.summary()
Out[57]:
OLS Regression Results
Dep. Variable: G3 R-squared: 0.051
Model: OLS Adj. R-squared: 0.044
Method: Least Squares F-statistic: 7.057
Date: Sun, 04 Nov 2018 Prob (F-statistic): 0.00103
Time: 20:01:10 Log-Likelihood: -689.23
No. Observations: 266 AIC: 1384.
Df Residuals: 263 BIC: 1395.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 9.2400 0.649 14.227 0.000 7.961 10.519
C(Dalcc2)[T.Low] 2.5016 0.687 3.640 0.000 1.148 3.855
C(Dalcc2)[T.Med] 1.6662 0.867 1.922 0.056 -0.040 3.373
Omnibus: 54.017 Durbin-Watson: 1.640
Prob(Omnibus): 0.000 Jarque-Bera (JB): 124.841
Skew: -0.967 Prob(JB): 7.78e-28
Kurtosis: 5.743 Cond. No. 7.75

Comparison of the coefficients

We will draw in a single figure the confidence intervals of the coefficients from the boys and girls models. We first create a dataframe containing the estimates from both models.

In [58]:
x=['Boys','Girls']
coef_df = pd.DataFrame()
for i, mod in enumerate([results_boys, results_girls]):
    err_series = mod.params - mod.conf_int()[0]
    coef_df = coef_df.append(pd.DataFrame({'coef': mod.params.values[1:],
                                           'err': err_series.values[1:],
                                           'varname': err_series.index.values[1:],
                                           'model':  x[i]
                                          })
                            )
coef_df
Out[58]:
coef err varname model
0 2.501627 1.353127 C(Dalcc2)[T.Low] Boys
1 1.666250 1.706736 C(Dalcc2)[T.Med] Boys
0 1.752984 2.070134 C(Dalcc2)[T.Low] Girls
1 1.262626 2.757395 C(Dalcc2)[T.Med] Girls
In [59]:
## markers to use, one per model
marker_list = 'so'
width = 0.05
## 2 coefficients per model
base_x = arange(2) - 0.2
base_x
Out[59]:
array([-0.2,  0.8])
In [60]:
from matplotlib import pyplot as plt
from matplotlib.lines import Line2D
%matplotlib inline

fig, ax = plt.subplots(figsize=(8, 5))
for i, mod in enumerate(coef_df.model.unique()):
    mod_df = coef_df[coef_df.model == mod]
    mod_df = mod_df.set_index('varname').reindex(coef_df['varname'].unique())
    ## offset x positions
    X = base_x + width*i
    ax.bar(X, mod_df['coef'],  
           color='none',yerr=mod_df['err'])
    ## remove axis labels
    ax.set_ylabel('')
    ax.set_xlabel('')
    ax.scatter(x=X, 
               marker=marker_list[i], s=120, 
               y=mod_df['coef'], color='black')
    ax.axhline(y=0, linestyle='--', color='black', linewidth=4)
    ax.xaxis.set_ticks_position('none')
    _ = ax.set_xticklabels(['','','Low','','','','Med'], 
                           rotation=0, fontsize=16)


       
## finally, build customized legend
legend_elements = [Line2D([0], [0], marker=m,
                          label=x[i],
                          color = 'k',
                          markersize=10)
                   for i, m in enumerate(marker_list)
                  ]
_ = ax.legend(handles=legend_elements, loc=2, 
              prop={'size': 15}, labelspacing=1.2)

Tukey’s HSD Post-hoc comparison

The overall model was significant; now we test which groups differ. Deciding which groups to compare should be theory driven. There are a few different techniques that can be used, each with its own way of controlling the familywise error rate. We will show here how to perform one of them:

In [61]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison

mc_girls = MultiComparison(student_girls['G3'], student_girls['Dalcc2'])
mc_results_girls = mc_girls.tukeyhsd()
print(mc_results_girls)
Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 High   Low    1.753   -0.7244 4.2304 False 
 High   Med    1.2626  -2.0372 4.5625 False 
 Low    Med   -0.4904  -2.7373 1.7565 False 
--------------------------------------------
In [62]:
mc_boys = MultiComparison(student_boys['G3'], student_boys['Dalcc2'])
mc_results_boys = mc_boys.tukeyhsd()
print(mc_results_boys)