pandas is a data analysis library providing fast, flexible, and expressive data structures designed to work with relational or table-like data (such as an SQL table or an Excel spreadsheet). It is a fundamental high-level building block for doing practical, real-world data analysis in Python. pandas is well suited for:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data
- Arbitrary matrix data (homogeneously or heterogeneously typed) with row and column labels
- Any other form of observational / statistical data set; the data need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment alongside many other third-party libraries.
We start by importing pandas:
import pandas as pd
The Series data structure in pandas is a one-dimensional labeled array.
Creating a pandas Series:
From a list
temperature = [34, 56, 15, -9, -121, -5, 39]
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
# create series
series_from_list = pd.Series(temperature, index=days)
series_from_list
A Series should ideally contain homogeneous types; if the values are mixed, pandas falls back to the generic object dtype:
temperature = [34, 56, 'a', -9, -121, -5, 39]
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
We create the Series the same way:
series_from_list = pd.Series(temperature, index=days)
series_from_list
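Checking the dtype confirms what happened; a minimal sketch with the same mixed values:

```python
import pandas as pd

# A list mixing ints and a string
mixed = pd.Series([34, 56, 'a'], index=['Mon', 'Tue', 'Wed'])

# pandas cannot pick a numeric dtype, so it falls back to object
print(mixed.dtype)
```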
From a dictionary
my_dict = {'Mon': 33, 'Tue': 19, 'Wed': 15, 'Thu': 89, 'Fri': 11, 'Sat': -5, 'Sun': 9}
my_dict
series_from_dict = pd.Series(my_dict)
series_from_dict
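When an index is also passed alongside a dict, the keys are matched against the index labels; labels missing from the dict come out as NaN. A small illustrative sketch:

```python
import math
import pandas as pd

my_dict = {'Mon': 33, 'Tue': 19, 'Wed': 15}

# 'Sun' has no entry in the dict, so its value becomes NaN
s = pd.Series(my_dict, index=['Tue', 'Mon', 'Sun'])
print(s)
```

Note that the presence of NaN forces the dtype to float64.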
From a numpy array
import numpy as np
my_array = np.linspace(0,10,15)
my_array
series_from_ndarray = pd.Series(my_array)
series_from_ndarray
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
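The "dict of Series objects" view can be made concrete with a toy example (the column names and employee IDs here are made up):

```python
import pandas as pd

# Two Series sharing the same index (hypothetical employee IDs)
ages = pd.Series({'emp1': 41, 'emp2': 49, 'emp3': 37})
depts = pd.Series({'emp1': 'Sales', 'emp2': 'R&D', 'emp3': 'Sales'})

# Each Series becomes one column of the DataFrame
df = pd.DataFrame({'Age': ages, 'Department': depts})
print(df)
```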
One common way to create a DataFrame is by reading data from a file.
Sample data: HR Employee Attrition and Performance. You can get it from here and add it to your working directory:
https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/
We import the xlsx file, using the EmployeeNumber variable as the index:
data = pd.read_excel(io="WA_Fn-UseC_-HR-Employee-Attrition.xlsx", sheet_name=0, index_col='EmployeeNumber')
data.head()
data.columns
data['Attrition'].head()
data[['Age', 'Gender','YearsAtCompany']].head()
data['AgeInMonths'] = 12*data['Age']
data['AgeInMonths'].head()
del data['AgeInMonths']
data.columns
data['BusinessTravel'][10:15]
data[10:15]
selected_EmployeeNumbers = [15, 94, 337, 1120]
data['YearsAtCompany'].loc[selected_EmployeeNumbers]
data.loc[selected_EmployeeNumbers]
data.loc[94,'YearsAtCompany']
data['Department'].value_counts()
data['Department'].value_counts().plot(kind='barh', title='Department')
data['Department'].value_counts().plot(kind='pie', title='Department')
data['Attrition'].value_counts()
data['Attrition'].value_counts(normalize=True)
data['HourlyRate'].mean()
What's the overall satisfaction of the employees?
data['JobSatisfaction'].head()
Let us recode the levels of the JobSatisfaction variable:
JobSatisfaction_cat = {
1: 'Low',
2: 'Medium',
3: 'High',
4: 'Very High'
}
data['JobSatisfaction'] = data['JobSatisfaction'].map(JobSatisfaction_cat)
data['JobSatisfaction'].head()
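Note that .map replaces each value via the dictionary, and anything not found in the mapping becomes NaN, so this recoding cell should be run only once. A minimal sketch with made-up values:

```python
import pandas as pd

codes = pd.Series([1, 4, 2, 4])
mapping = {1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'}

labels = codes.map(mapping)
print(labels.tolist())

# Mapping the already-recoded labels again: the strings are not
# keys of the mapping, so every value becomes NaN
again = labels.map(mapping)
```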
data['JobSatisfaction'].value_counts()
100*data['JobSatisfaction'].value_counts(normalize=True)
data['JobSatisfaction'].value_counts(normalize=True).plot(kind='pie', title='Job Satisfaction')
data['JobSatisfaction'] = data['JobSatisfaction'].astype(
    pd.CategoricalDtype(categories=['Low', 'Medium', 'High', 'Very High'],
                        ordered=True))
data['JobSatisfaction'].head()
data['JobSatisfaction'].value_counts().plot(kind='barh', title='Job Satisfaction')
data['JobSatisfaction'].value_counts(sort=False).plot(kind='barh', title='Job Satisfaction')
data['JobSatisfaction'] == 'Low'
data.loc[data['JobSatisfaction'] == 'Low'].index
data['JobInvolvement'].head()
subset_of_interest = data.loc[(data['JobSatisfaction'] == "Low") | (data['JobSatisfaction'] == "Very High")]
subset_of_interest.shape
subset_of_interest['JobSatisfaction'].value_counts()
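An equivalent way to write this two-condition filter is with isin; a small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'JobSatisfaction': ['Low', 'High', 'Very High', 'Medium', 'Low']})

# Same result as (== 'Low') | (== 'Very High'), but easier to extend
subset = df.loc[df['JobSatisfaction'].isin(['Low', 'Very High'])]
print(subset)
```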
Let's then remove the categories (levels) that we won't use:
subset_of_interest['JobSatisfaction'] = subset_of_interest['JobSatisfaction'].cat.remove_unused_categories()
grouped = subset_of_interest.groupby('JobSatisfaction')
grouped.groups
The Low satisfaction group
grouped.get_group('Low').head()
and the Very High satisfaction group
grouped.get_group('Very High').head()
The average age in each group
grouped['Age']
grouped['Age'].mean()
grouped['Age'].describe()
grouped['Age'].describe().unstack()
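unstack pivots the innermost index level into columns. On a Series with a two-level index (the counts below are made up) it works like this:

```python
import pandas as pd

counts = pd.Series(
    [3, 1, 2],
    index=pd.MultiIndex.from_tuples(
        [('Low', 'Sales'), ('Low', 'HR'), ('Very High', 'Sales')],
        names=['JobSatisfaction', 'Department'],
    ),
)

# The inner level ('Department') becomes the columns;
# missing combinations are filled with NaN
table = counts.unstack()
print(table)
```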
Comparing densities
grouped['Age'].plot(kind='density', title='Age')
By Department
grouped['Department'].value_counts().unstack()
We can normalize it
grouped['Department'].value_counts(normalize=True).unstack()
grouped['Department'].value_counts().unstack().plot(kind="barh")
grouped['Department'].value_counts(normalize=True).unstack().plot(kind="barh")
We can compare it with the whole sample
data['Department'].value_counts(normalize=True,sort=False).plot(kind="barh")
But the colors and the order don't match those in the other bar chart. We need to reorder the Department variable:
data['Department'] = data['Department'].astype(
    pd.CategoricalDtype(categories=['Human Resources', 'Research & Development', 'Sales'],
                        ordered=True))
data['Department'].value_counts(normalize=True,sort=False).plot(kind="barh")
grouped['DistanceFromHome'].describe().unstack()
grouped['DistanceFromHome'].plot(kind='density', title='Distance From Home',legend=True)
grouped['HourlyRate'].describe()
grouped['HourlyRate'].plot(kind='density', title='Hourly Rate',legend=True)
grouped['MonthlyIncome'].describe()
grouped['MonthlyIncome'].plot(kind='density', title='Monthly Income', legend=True)