pandas is a data analysis library providing fast, flexible, and expressive data structures designed to work with relational or table-like data (such as an SQL table or an Excel spreadsheet). It is a fundamental high-level building block for doing practical, real-world data analysis in Python. pandas is well suited for:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data
- Arbitrary matrix data (homogeneously or heterogeneously typed) with row and column labels
- Any other form of observational / statistical data set; the data need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment alongside many other third-party libraries.
We start by importing pandas:
import pandas as pd
The Series data structure in pandas is a one-dimensional labeled array.
Creating a pandas Series:
From a list
temperature = [34, 56, 15, -9, -121, -5, 39]
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
# create series
series_from_list = pd.Series(temperature, index=days)
series_from_list
A Series should ideally contain homogeneous types; if the values are mixed, pandas falls back to the generic object dtype:
temperature = [34, 56, 'a', -9, -121, -5, 39]
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
We create the Series the same way:
series_from_list = pd.Series(temperature, index=days)
series_from_list
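Checking the dtype confirms what happened; a minimal sketch with the same mixed values:

```python
import pandas as pd

# A list mixing ints and a string
mixed = pd.Series([34, 56, 'a'], index=['Mon', 'Tue', 'Wed'])

# pandas cannot pick a numeric dtype, so it falls back to object
print(mixed.dtype)
```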
From a dictionary
my_dict = {'Mon': 33, 'Tue': 19, 'Wed': 15, 'Thu': 89, 'Fri': 11, 'Sat': -5, 'Sun': 9}
my_dict
series_from_dict = pd.Series(my_dict)
series_from_dict
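When an index is also passed alongside a dict, the keys are matched against the index labels; labels missing from the dict come out as NaN. A small illustrative sketch:

```python
import math
import pandas as pd

my_dict = {'Mon': 33, 'Tue': 19, 'Wed': 15}

# 'Sun' has no entry in the dict, so its value becomes NaN
s = pd.Series(my_dict, index=['Tue', 'Mon', 'Sun'])
print(s)
```

Note that the presence of NaN forces the dtype to float64.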
From a numpy array
import numpy as np
my_array = np.linspace(0,10,15)
my_array
series_from_ndarray = pd.Series(my_array)
series_from_ndarray
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
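The "dict of Series objects" view can be made concrete with a toy example (the column names and employee IDs here are made up):

```python
import pandas as pd

# Two Series sharing the same index (hypothetical employee IDs)
ages = pd.Series({'emp1': 41, 'emp2': 49, 'emp3': 37})
depts = pd.Series({'emp1': 'Sales', 'emp2': 'R&D', 'emp3': 'Sales'})

# Each Series becomes one column of the DataFrame
df = pd.DataFrame({'Age': ages, 'Department': depts})
print(df)
```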
One common way to create a DataFrame is by reading data from a file.
Sample data: HR Employee Attrition and Performance. You can get it from here and add it to your working directory:
https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/
We import the xlsx file, using the EmployeeNumber variable as the index:
data = pd.read_excel(io="WA_Fn-UseC_-HR-Employee-Attrition.xlsx", sheet_name=0, index_col='EmployeeNumber')
data.head()
data.columns
data['Attrition'].head()
data[['Age', 'Gender','YearsAtCompany']].head()
data['AgeInMonths'] = 12*data['Age']
data['AgeInMonths'].head()
del data['AgeInMonths']
data.columns
data['BusinessTravel'][10:15]
data[10:15]
selected_EmployeeNumbers = [15, 94, 337, 1120]
data['YearsAtCompany'].loc[selected_EmployeeNumbers]
data.loc[selected_EmployeeNumbers]
data.loc[94,'YearsAtCompany']
data['Department'].value_counts()
data['Department'].value_counts().plot(kind='barh', title='Department')
data['Department'].value_counts().plot(kind='pie', title='Department')
data['Attrition'].value_counts()
data['Attrition'].value_counts(normalize=True)
data['HourlyRate'].mean()
What's the overall satisfaction of the employees?
data['JobSatisfaction'].head()
Let us recode the levels of the JobSatisfaction variable:
JobSatisfaction_cat = {
1: 'Low',
2: 'Medium',
3: 'High',
4: 'Very High'
}
data['JobSatisfaction'] = data['JobSatisfaction'].map(JobSatisfaction_cat)
data['JobSatisfaction'].head()
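Note that .map replaces each value via the dictionary, and anything not found in the mapping becomes NaN, so this recoding cell should be run only once. A minimal sketch with made-up values:

```python
import pandas as pd

codes = pd.Series([1, 4, 2, 4])
mapping = {1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'}

labels = codes.map(mapping)
print(labels.tolist())

# Mapping the already-recoded labels again: the strings are not
# keys of the mapping, so every value becomes NaN
again = labels.map(mapping)
```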
data['JobSatisfaction'].value_counts()
100*data['JobSatisfaction'].value_counts(normalize=True)
data['JobSatisfaction'].value_counts(normalize=True).plot(kind='pie', title='Job Satisfaction')
data['JobSatisfaction'] = data['JobSatisfaction'].astype(
    pd.CategoricalDtype(categories=['Low', 'Medium', 'High', 'Very High'],
                        ordered=True))
data['JobSatisfaction'].head()
data['JobSatisfaction'].value_counts().plot(kind='barh', title='Job Satisfaction')
data['JobSatisfaction'].value_counts(sort=False).plot(kind='barh', title='Job Satisfaction')
data['JobSatisfaction'] == 'Low'
data.loc[data['JobSatisfaction'] == 'Low'].index
data['JobInvolvement'].head()
subset_of_interest = data.loc[(data['JobSatisfaction'] == "Low") | (data['JobSatisfaction'] == "Very High")]
subset_of_interest.shape
subset_of_interest['JobSatisfaction'].value_counts()
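An equivalent way to write this two-condition filter is with isin; a small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'JobSatisfaction': ['Low', 'High', 'Very High', 'Medium', 'Low']})

# Same result as (== 'Low') | (== 'Very High'), but easier to extend
subset = df.loc[df['JobSatisfaction'].isin(['Low', 'Very High'])]
print(subset)
```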
Let's then remove the categories (levels) that we won't use:
subset_of_interest['JobSatisfaction'] = subset_of_interest['JobSatisfaction'].cat.remove_unused_categories()
grouped = subset_of_interest.groupby('JobSatisfaction')
grouped.groups
The Low satisfaction group
grouped.get_group('Low').head()
and the Very High satisfaction group
grouped.get_group('Very High').head()
The average age in each group
grouped['Age']
grouped['Age'].mean()
grouped['Age'].describe()
grouped['Age'].describe().unstack()
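unstack pivots the innermost index level into columns. On a Series with a two-level index (the counts below are made up) it works like this:

```python
import pandas as pd

counts = pd.Series(
    [3, 1, 2],
    index=pd.MultiIndex.from_tuples(
        [('Low', 'Sales'), ('Low', 'HR'), ('Very High', 'Sales')],
        names=['JobSatisfaction', 'Department'],
    ),
)

# The inner level ('Department') becomes the columns;
# missing combinations are filled with NaN
table = counts.unstack()
print(table)
```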
Comparing densities
grouped['Age'].plot(kind='density', title='Age')
By Department
grouped['Department'].value_counts().unstack()
We can normalize it
grouped['Department'].value_counts(normalize=True).unstack()
grouped['Department'].value_counts().unstack().plot(kind="barh")
grouped['Department'].value_counts(normalize=True).unstack().plot(kind="barh")
We can compare it with the whole sample
data['Department'].value_counts(normalize=True,sort=False).plot(kind="barh")
But the colors and the order don't match those in the other bar chart. We need to reorder the Department variable:
data['Department'] = data['Department'].astype(
    pd.CategoricalDtype(categories=['Human Resources', 'Research & Development', 'Sales'],
                        ordered=True))
data['Department'].value_counts(normalize=True,sort=False).plot(kind="barh")
grouped['DistanceFromHome'].describe().unstack()
grouped['DistanceFromHome'].plot(kind='density', title='Distance From Home',legend=True)
grouped['HourlyRate'].describe()
grouped['HourlyRate'].plot(kind='density', title='Hourly Rate',legend=True)
grouped['MonthlyIncome'].describe()
grouped['MonthlyIncome'].plot(kind='density', title='Monthly Income', legend=True)