SciPy is the scientific toolbox for Python. In this lesson we will show how to use the SciPy library to perform statistical analyses: confidence intervals, hypothesis tests, and one-way ANOVA.
Let us first import the libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import sqrt, arange
from scipy import stats
%matplotlib inline
In this lesson we use a dataset containing information about Portuguese students from two public schools. This is a real-world dataset that was collected to study alcohol consumption in young people and its effects on students' academic performance. The dataset was built from two sources: school reports and questionnaires.
Attributes: among others, the dataset contains the grades G1, G2 and G3, which are related to the course subject.
student = pd.read_csv("/Users/dhafermalouche/Documents/Teaching_2018_2019/Teaching_python/student.csv", sep=";")
A preview of the data
student.head()
G3
is the variable that gives the final grade (numeric: from 0 to 20, output target). We will show how to compute the confidence interval of the mean.
The sample size
sample_size = student.shape[0]
sample_size
Because we have a sample size that is much greater than 30, we can use the Central Limit Theorem to calculate confidence intervals. According to this theorem we can calculate a confidence interval for the mean using the normal distribution.
To get the confidence interval for the mean we need three numbers: the sample mean, the standard error of the mean, and the confidence level (which determines the critical value of the normal distribution).
Formula for the standard error, where $s$ is the sample standard deviation and $n$ the sample size:
$$ SE = \frac{s}{\sqrt n} $$
sample_mean_grade = student['G3'].mean()
sample_mean_grade
std_error_grades = student['G3'].std()/sqrt(sample_size)
std_error_grades
stats.norm.interval(0.95, loc=sample_mean_grade, scale=std_error_grades)
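The interval returned above can be checked by hand: it is just mean ± z·SE, where z is the 0.975 quantile of the normal distribution. A minimal sketch with made-up summary numbers (not the student data):

```python
from math import sqrt
from scipy import stats

# Hypothetical summary statistics, for illustration only
mean, sd, n = 10.4, 4.6, 395
se = sd / sqrt(n)

# 95% interval by hand: mean +/- z * SE, with z the 0.975 normal quantile
z = stats.norm.ppf(0.975)
manual = (mean - z * se, mean + z * se)

# Same interval via stats.norm.interval
direct = stats.norm.interval(0.95, loc=mean, scale=se)
print(manual, direct)
```

Both calls should agree to floating-point precision.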
Let us compare the confidence intervals of G3 between girls and boys.
Girls data
SF = student.loc[(student.sex == 'F')]
SF.head()
Boys data
SM = student.loc[(student.sex == 'M')]
SM.head()
sample_mean_grade_F =SF['G3'].mean()
sample_mean_grade_F
sample_mean_grade_M =SM['G3'].mean()
sample_mean_grade_M
sample_size_F = SF.shape[0]
sample_size_F
sample_size_M = SM.shape[0]
sample_size_M
std_error_grades_F = SF['G3'].std()/sqrt(sample_size_F)
std_error_grades_F
std_error_grades_M = SM['G3'].std()/sqrt(sample_size_M)
std_error_grades_M
Girls Confidence Interval
stats.norm.interval(0.95, loc=sample_mean_grade_F, scale=std_error_grades_F)
Boys Confidence Interval
stats.norm.interval(0.95, loc=sample_mean_grade_M, scale=std_error_grades_M)
Let us show this graphically.
fig, axes = plt.subplots(1,2, figsize = (14,4))
sns.boxplot(x='sex', y='G3', data=student, ax=axes[0])
sns.pointplot(x='sex', y='G3', data=student, ax=axes[1]);
Let us try to answer the question: does alcohol consumption affect academic performance?
We can draw the same graph as above, comparing the final grades across the five categories of workday alcohol consumption, using the Dalc variable.
fig, axes = plt.subplots(1,2, figsize = (14,4))
sns.boxplot(x='Dalc', y='G3', data=student, ax=axes[0])
sns.pointplot(x='Dalc', y='G3', data=student, ax=axes[1]);
Some statistical tests from scipy.stats
Let's perform a normality test based on kurtosis (kurtosistest), whose null hypothesis is that the kurtosis of the data matches that of a normal distribution. We will use a significance level of 5.0%.
stats.kurtosistest(a=student['G3'])
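Note that kurtosistest only looks at kurtosis; scipy.stats also provides broader normality tests such as normaltest (D'Agostino-Pearson, combining skewness and kurtosis) and shapiro. A sketch on synthetic data, not the student grades:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=3, size=400)  # synthetic, genuinely normal sample

# D'Agostino-Pearson omnibus test: combines skewness and kurtosis
stat_nt, p_nt = stats.normaltest(x)

# Shapiro-Wilk test: another common normality test
stat_sw, p_sw = stats.shapiro(x)

print(p_nt, p_sw)
```

With truly normal data we would usually not reject the null hypothesis at the 5% level.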
Let's perform the Correlation test whose Null Hypothesis is that the correlation coefficient is equal to zero. We will use a significance level of 5.0%
stats.pearsonr(student['G1'],student['G2'])
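Pearson's test targets linear association; when the relationship may be monotone but nonlinear, Spearman's rank correlation is a useful companion. A sketch on synthetic data (not G1 and G2):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = np.exp(x) + rng.normal(scale=0.05, size=200)  # monotone but nonlinear link

# Spearman correlates the ranks rather than the raw values
rho, p = stats.spearmanr(x, y)
print(rho, p)
```

Because the link is monotone, the rank correlation is close to 1 even though the relationship is far from linear.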
Let's perform Bartlett's test, whose null hypothesis is that the variances are equal. We will use a significance level of 5.0%.
student.groupby('sex')['G3'].var()
grades_girls = student['G3'][student['sex']=='F']
grades_boys = student['G3'][student['sex']=='M']
stats.bartlett(grades_girls, grades_boys)
According to the test we cannot reject the null hypothesis of equal variances, so we will assume that the two groups are samples from populations with the same variance. This information will be useful in our next test.
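Bartlett's test is known to be sensitive to departures from normality; Levene's test is a more robust alternative with the same null hypothesis. A sketch on synthetic groups with equal spread (not the student data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
g1 = rng.normal(loc=10, scale=3, size=200)
g2 = rng.normal(loc=12, scale=3, size=180)  # different means, same variance

# Levene's test: null hypothesis of equal variances, robust to non-normality
stat, p = stats.levene(g1, g2)
print(stat, p)
```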
Let's again make a graph showing the confidence intervals of the mean final grade for boys and girls.
fig, axes = plt.subplots(1,2, figsize = (14,4))
sns.boxplot(x='sex', y='G3', data=student, ax=axes[0])
sns.pointplot(x='sex', y='G3', data=student, ax=axes[1]);
The visualizations suggest a difference between the mean final grades of the two groups. Now we will perform a formal statistical test of whether girls' and boys' final scores differ.
Null Hypothesis: for both groups (Boys and Girls) the population means of the final grade are equal.
Alternative Hypothesis: The population means of the final grades are different.
A common test to apply in these cases is the two-sample t-test, which is used to determine whether two population means are equal.
The assumptions of this test are: the two samples are independent, the data in each group are approximately normally distributed (or the samples are large), and, for the standard version, the two populations have equal variances.
In addition, this test has two versions: one assuming equal variances and the other assuming unequal variances (Welch's t-test).
stats.ttest_ind(grades_girls, grades_boys, equal_var=True)
Since we got such a low p-value we can reject the Null hypothesis of equal means for the two groups at a level of significance of 5%.
There is then a statistically significant difference between the grades in the two analyzed groups. The results suggest that girls have higher grades than boys.
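Had Bartlett's test rejected equal variances, we would pass equal_var=False to obtain Welch's t-test instead. A sketch on synthetic data with unequal spreads (not the student data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(loc=10, scale=2, size=150)
b = rng.normal(loc=10, scale=5, size=120)  # same mean, different variance

# Welch's version does not pool the variances of the two samples
t_stat, p = stats.ttest_ind(a, b, equal_var=False)
print(t_stat, p)
```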
Do male teenagers drink more than female teenagers?
Let's first create a categorical variable Dalcc from the variable Dalc (workday alcohol consumption) by grouping its five levels into three categories of drinker: Low (1-2), Med (3) and High (4-5). The same could be done for weekend consumption (Walc).
Dalc_cat = {
1: 'Low',
2: 'Low',
3: 'Med',
4: 'High',
5: 'High'
}
student['Dalcc'] = student['Dalc'].map(Dalc_cat)
student['Dalcc'].head()
student['Dalcc'].value_counts()
Make it an ordered categorical variable:
from pandas.api.types import CategoricalDtype

student['Dalcc'] = student['Dalcc'].astype(CategoricalDtype(categories=['Low', 'Med', 'High'],
                                                            ordered=True))
student['Dalcc'].value_counts(sort=False)
gender_Dalc_table = pd.crosstab(student['Dalcc'], student['sex'])
gender_Dalc_table
fig, axes = plt.subplots(1,2, figsize = (14,4))
student['sex'].value_counts().plot(kind='bar', ax=axes[0], title='Gender')
student['Dalcc'].value_counts().plot(kind='bar', ax=axes[1], title='Alcohol Consumption Level');
Let us also look at the percentages in the cross-table. First, given the alcohol consumption level (each row sums to 100):
100*(gender_Dalc_table.T/gender_Dalc_table.apply(sum, axis=1)).T
And given the sex variable:
100*(gender_Dalc_table/gender_Dalc_table.apply(sum, axis=0)).T
fig, axes = plt.subplots(1,2, figsize = (14,4))
gender_Dalc_table.plot(kind='bar', stacked=True, ax=axes[0]);
(100*(gender_Dalc_table.T/gender_Dalc_table.apply(sum, axis=1)).T).plot(kind='bar', stacked=True, ax=axes[1]);
fig, axes = plt.subplots(1,2, figsize = (14,4))
(gender_Dalc_table.T).plot(kind='bar', stacked=True, ax=axes[0]);
(100*(gender_Dalc_table/gender_Dalc_table.apply(sum, axis=0)).T).plot(kind='bar', stacked=True, ax=axes[1]);
Chi-square test of independence of variables in a contingency table.
This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table.
chi_stat, p_value, dof, expected = stats.chi2_contingency(gender_Dalc_table)
chi_stat
p_value
dof
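The chi-square statistic grows with the sample size; a common effect-size companion is Cramér's V, which rescales it to [0, 1]. A sketch with a hypothetical contingency table (not the table computed above):

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 contingency table (e.g. gender x consumption level)
table = np.array([[120, 40, 15],
                  [100, 60, 45]])

chi2, p, dof, expected = stats.chi2_contingency(table)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
n = table.sum()
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))
print(chi2, p, cramers_v)
```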
The one-way analysis of variance (ANOVA) is used to determine whether there are any statistically significant differences between the means of independent groups.
The Tukey Test (or Tukey procedure), also called Tukey’s Honest Significant Difference test, is a post-hoc test based on the studentized range distribution.
An ANOVA test can tell you if your results are significant overall, but it will not tell you exactly where those differences lie. After you have run an ANOVA and found significant results, you can run Tukey's HSD to find out which specific groups' means (compared with each other) are different. The test compares all possible pairs of means.
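Before applying it to the student data, here is how the one-way ANOVA F statistic is computed, checked against scipy.stats.f_oneway (a sketch on synthetic groups):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
groups = [rng.normal(loc=m, scale=2.0, size=50) for m in (10, 10.5, 12)]

# F = (between-group mean square) / (within-group mean square)
all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()
k, n = len(groups), all_obs.size

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

# scipy computes the same statistic and its p-value
f_scipy, p = stats.f_oneway(*groups)
print(f_manual, f_scipy, p)
```

The manual value and scipy's statistic agree to floating-point precision.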
As an example we will compare the final grades across the alcohol consumption groups on the whole sample, and we will do the same on the girls and boys sub-samples.
Let us first import the libraries
import pandas as pd
import scipy.stats as stats
import researchpy as rp
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison
We will now compute the confidence interval of the variable G3
and the confidence intervals by groups
rp.summary_cont(student['G3'])
rp.summary_cont(student['G3'].groupby(student['Dalcc']))
This is a first method to perform an ANOVA
stats.f_oneway(student['G3'][student['Dalcc'] == 'High'],
student['G3'][student['Dalcc'] == 'Med'],
student['G3'][student['Dalcc'] == 'Low'])
Second method
We need to create a new variable of type object (plain strings), which works smoothly with the statsmodels formula interface.
student['Dalcc'].head()
student['Dalcc2'] = student['Dalcc'].astype(dtype='object')
student['Dalcc2'].head()
results = ols('G3 ~ C(Dalcc2)', data=student).fit()
results.summary()
Let us now compare the one-way ANOVA results between the girls and boys sub-samples. To do so we will consider the two sub-samples and perform a one-way ANOVA on each. We will then compare the coefficients to measure how the impact of alcohol consumption differs between girls and boys.
Girls sub-sample
student_girls=student[student['sex']=='F']
student_girls.head()
Let's then perform the one-way ANOVA
results_girls = ols('G3 ~ C(Dalcc2)', data=student_girls).fit()
results_girls.summary()
Boys sub-sample
student_boys=student[student['sex']=='M']
student_boys.head()
Let's then perform the one-way ANOVA
results_boys = ols('G3 ~ C(Dalcc2)', data=student_boys).fit()
results_boys.summary()
Comparison of the coefficients
We will draw, in the same figure, a plot with the confidence intervals of the coefficients of the boys and girls models. We will first create a dataframe containing the estimates from both models.
x=['Boys','Girls']
coef_df = pd.DataFrame()
for i, mod in enumerate([results_boys, results_girls]):
err_series = mod.params - mod.conf_int()[0]
    coef_df = pd.concat([coef_df,
                         pd.DataFrame({'coef': mod.params.values[1:],
                                       'err': err_series.values[1:],
                                       'varname': err_series.index.values[1:],
                                       'model': x[i]})],
                        ignore_index=True)
coef_df
## markers to use
marker_list = 'so'
width = 0.05
## 2 non-intercept coefficients per model
import numpy as np
base_x = np.arange(2) - 0.2
base_x
from matplotlib import pyplot as plt
from matplotlib.lines import Line2D
%matplotlib inline
fig, ax = plt.subplots(figsize=(8, 5))
for i, mod in enumerate(coef_df.model.unique()):
mod_df = coef_df[coef_df.model == mod]
mod_df = mod_df.set_index('varname').reindex(coef_df['varname'].unique())
## offset x positions
X = base_x + width*i
ax.bar(X, mod_df['coef'],
color='none',yerr=mod_df['err'])
## remove axis labels
ax.set_ylabel('')
ax.set_xlabel('')
ax.scatter(x=X,
marker=marker_list[i], s=120,
y=mod_df['coef'], color='black')
ax.axhline(y=0, linestyle='--', color='black', linewidth=4)
ax.xaxis.set_ticks_position('none')
ax.set_xticks(base_x + width / 2)
_ = ax.set_xticklabels(['Low', 'Med'], rotation=0, fontsize=16)
## finally, build customized legend
legend_elements = [Line2D([0], [0], marker=m,
label=x[i],
color = 'k',
markersize=10)
for i, m in enumerate(marker_list)
]
_ = ax.legend(handles=legend_elements, loc=2,
prop={'size': 15}, labelspacing=1.2)
The overall model was significant; now we test which groups differ. Deciding which groups to compare should be theory driven. There are a few different techniques that can be used, each with its own way of controlling the familywise error rate. We will show here how to perform one of them: Tukey's HSD.
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison
mc_girls = MultiComparison(student_girls['G3'], student_girls['Dalcc2'])
mc_results_girls = mc_girls.tukeyhsd()
print(mc_results_girls)
mc_boys = MultiComparison(student_boys['G3'], student_boys['Dalcc2'])
mc_results_boys = mc_boys.tukeyhsd()
print(mc_results_boys)