From the Pandas documentation:
pandas is a data analysis library providing fast, flexible, and expressive data structures designed to work with relational or table-like data (SQL table or Excel spreadsheet). It is a fundamental high-level building block for doing practical, real world data analysis in Python.
pandas is well suited for:
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.
Here are just a few of the things that pandas does well:
- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
- Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining data sets
- Flexible reshaping and pivoting of data sets
- Hierarchical labeling of axes (possible to have multiple labels per tick)
- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast [HDF5 format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format)
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
import pandas as pd
The Series data structure in Pandas is a one-dimensional labeled array.
temperature = [34, 56, 15, -9, -121, -5, 39]
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
# create series
series_from_list = pd.Series(temperature, index=days)
series_from_list
The series should contains homogeneous types
temperature = [34, 56, 'a', -9, -121, -5, 39]
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
# create series
series_from_list = pd.Series(temperature, index=days)
series_from_list
my_dict = {'Mon': 33, 'Tue': 19, 'Wed': 15, 'Thu': 89, 'Fri': 11, 'Sat': -5, 'Sun': 9}
series_from_dict = pd.Series(my_dict)
series_from_dict
import numpy as np
help(np.linspace)
my_array = np.linspace(0,10,15)
my_array
series_from_ndarray = pd.Series(my_array)
series_from_ndarray
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
You can create a DataFrame from:
Reading the data
Sample data: HR Employee Attrition and Performance You can get it from here:
https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/
file_xlsx = "/Users/dhafermalouche/Documents/Teaching_2018_2019/Teaching_python/WA_Fn-UseC_-HR-Employee-Attrition.xlsx"
Importing the xlsx
file by considering the variable EmployeeNumber
as an Index variable
data = pd.read_excel(io=file_xlsx, sheetname=0, index_col='EmployeeNumber')
data.head()
Column names
data.columns
Preview on one variable
data['Attrition'].head()
Extracting several columns
data[['Age', 'Gender','YearsAtCompany']].head()
Adding a column
data['AgeInMonths'] = 12*data['Age']
data['AgeInMonths'].head()
Deleting a column
del data['AgeInMonths']
data.columns
Extract observations from a variable
data['BusinessTravel'][10:15]
Extract the rows 10 to 15
data[10:15]
Select by labels
selected_EmployeeNumbers = [15, 94, 337, 1120]
data['YearsAtCompany'].loc[selected_EmployeeNumbers]
data.loc[selected_EmployeeNumbers]
Extracting a single value
data.loc[94,'YearsAtCompany']
Let us answer to some questions from the data
How many employees are there by department in the dataset?
data['Department'].value_counts()
data['Department'].value_counts().plot(kind='barh', title='Department')
or a pie chart
data['Department'].value_counts().plot(kind='pie', title='Department')
What is the overall attrition rate?
data['Attrition'].value_counts()
We can compute the proportions
data['Attrition'].value_counts(normalize=True)
What is the average hourly rate?
data['HourlyRate'].mean()
What's the overall statisfaction of the Employees
data['JobSatisfaction'].head()
Let us change the levels of the variable
JobSatisfaction_cat = {
1: 'Low',
2: 'Medium',
3: 'High',
4: 'Very High'
}
data['JobSatisfaction'] = data['JobSatisfaction'].map(JobSatisfaction_cat)
data['JobSatisfaction'].head()
data['JobSatisfaction'].value_counts()
100*data['JobSatisfaction'].value_counts(normalize=True)
data['JobSatisfaction'].value_counts(normalize=True).plot(kind='pie', title='Department')
Transfor the JobStatisfaction
variable into an ordered variable
data['JobSatisfaction'] = data['JobSatisfaction'].astype(dtype='category',
categories=['Low', 'Medium', 'High', 'Very High'],
ordered=True)
data['JobSatisfaction'].head()
data['JobSatisfaction'].value_counts().plot(kind='barh', title='Department')
data['JobSatisfaction'].value_counts(sort=False).plot(kind='barh', title='Department')
Here the list of the employees with Low level of JobSatisfaction?
data['JobSatisfaction'] == 'Low'
We can get their index
data.loc[data['JobSatisfaction'] == 'Low'].index
Give us the list of employees with low statistfaction and low job involement
data['JobInvolvement'].head()
data.loc[(data['JobSatisfaction'] == 'Low') & (data['JobInvolvement'] == 'Low')].index.values
We will compare the employees with Low and Very High JobSatisfaction across the following variables: Age, Department, DistanceFromHome, HourlyRate, MonthlyIncome and YearsAtCompany.
We will create a new DataFrame containing only the employees with "Low" and "Very High" JobSatisfaction levels.
subset_of_interest = data.loc[(data['JobSatisfaction'] == "Low") | (data['JobSatisfaction'] == "Very High")]
subset_of_interest.shape
subset_of_interest['JobSatisfaction'].value_counts()
Let's then remove the categories or levels that we won't use
subset_of_interest['JobSatisfaction'].cat.remove_unused_categories(inplace=True)
subset_of_interest['JobSatisfaction'].value_counts()
Let's split our new DataFrame into groups.
grouped = subset_of_interest.groupby('JobSatisfaction')
grouped.groups
The Low
statisfaction group
grouped.get_group('Low').head()
and the Very High
satisfaction group
grouped.get_group('Very High').head()
The average of the Age of each group
grouped['Age']
grouped['Age'].mean()
grouped['Age'].describe()
grouped['Age'].describe().unstack()
Let's compare the densities
grouped['Age'].plot(kind='density', title='Age')
By Department
grouped['Department'].value_counts().unstack()
We can normalize it
grouped['Department'].value_counts(normalize=True).unstack()
grouped['Department'].value_counts().unstack().plot(kind="barh")
grouped['Department'].value_counts(normalize=True).unstack().plot(kind="barh")
We can compare it with the whole sample
data['Department'].value_counts(normalize=True,sort=False).plot(kind="barh")
But the colors and the order don't match with the other bar chart. We need to reorder the Department
variable
data['Department'] = data['Department'].astype(dtype='category',
categories=['Human Resources', 'Research & Development', 'Sales'],
ordered=True)
data['Department'].value_counts(normalize=True,sort=False).plot(kind="barh")
DistanceFromHome
grouped['DistanceFromHome'].describe().unstack()
grouped['DistanceFromHome'].plot(kind='density', title='Distance From Home',legend=True)
Hourly Rate
grouped['HourlyRate'].describe()
grouped['HourlyRate'].plot(kind='density', title='Hourly Rate',legend=True)
Monthly Income
grouped['MonthlyIncome'].describe()
grouped['HourlyRate'].plot(kind='density', title='Hourly Rate',legend=True)
Project 1
Can you write the same but with R
?
Project 2
Can you do the same with other dataset collected from Surveys like the Afrobarometer or the World Value Survey?