The Pandas Library

Description from the Pandas documentation:

pandas is a data analysis library providing fast, flexible, and expressive data structures designed to work with relational or table-like data (SQL table or Excel spreadsheet). It is a fundamental high-level building block for doing practical, real world data analysis in Python. pandas is well suited for:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets; the data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

Here are just a few of the things that pandas does well:

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

Series and DataFrames

In [1]:
import pandas as pd

The Pandas Series

The Series data structure in Pandas is a one-dimensional labeled array.

  • Data in the array can be of any type (integers, strings, floating point numbers, Python objects, etc.).
  • Data within the array is stored with a single dtype; mixing types makes pandas fall back to the generic object dtype (shown below)
  • Pandas Series objects always have an index: this gives them both ndarray-like and dict-like properties.

Creating a Pandas Series:

  • Creation from a list
  • Creation from a dictionary
  • Creation from an ndarray
  • Creation from an external source, like a file

From a list

In [2]:
temperature = [34, 56, 15, -9, -121, -5, 39]
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

# create series 
series_from_list = pd.Series(temperature, index=days)
series_from_list
Out[2]:
Mon     34
Tue     56
Wed     15
Thu     -9
Fri   -121
Sat     -5
Sun     39
dtype: int64
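
As noted above, the index gives a Series both dict-like and ndarray-like behaviour. A short sketch using the series_from_list object just created:

series_from_list['Mon']     # dict-like access by label -> 34
'Mon' in series_from_list   # dict-like membership test on the index -> True
series_from_list[1:4]       # ndarray-like positional slicing (Tue, Wed, Thu)
series_from_list * 2        # ndarray-like vectorized arithmetic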

The Series should contain homogeneous types; otherwise the dtype falls back to object:

In [3]:
temperature = [34, 56, 'a', -9, -121, -5, 39]
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

We create the Series:

In [4]:
series_from_list = pd.Series(temperature, index=days)
series_from_list
Out[4]:
Mon      34
Tue      56
Wed       a
Thu      -9
Fri    -121
Sat      -5
Sun      39
dtype: object
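
When a numeric Series ends up with the object dtype like this, pd.to_numeric with errors='coerce' is one way to push it back to a numeric dtype; non-numeric entries such as 'a' become NaN (a small sketch, not part of the original notebook):

pd.to_numeric(series_from_list, errors='coerce')   # 'a' -> NaN, dtype becomes float64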

From a dictionary

In [5]:
my_dict = {'Mon': 33, 'Tue': 19, 'Wed': 15, 'Thu': 89, 'Fri': 11, 'Sat': -5, 'Sun': 9}
my_dict
Out[5]:
{'Mon': 33, 'Tue': 19, 'Wed': 15, 'Thu': 89, 'Fri': 11, 'Sat': -5, 'Sun': 9}
In [6]:
series_from_dict = pd.Series(my_dict)
series_from_dict
Out[6]:
Mon    33
Tue    19
Wed    15
Thu    89
Fri    11
Sat    -5
Sun     9
dtype: int64
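
If an explicit index is passed along with the dictionary, the values are aligned on the keys and any label missing from the dictionary is filled with NaN; a small sketch with a made-up 'Hol' label:

pd.Series(my_dict, index=['Mon', 'Tue', 'Hol'])   # Mon 33.0, Tue 19.0, Hol NaN (dtype float64)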

From a NumPy array

In [7]:
import numpy as np
In [8]:
my_array = np.linspace(0,10,15)
my_array
Out[8]:
array([ 0.        ,  0.71428571,  1.42857143,  2.14285714,  2.85714286,
        3.57142857,  4.28571429,  5.        ,  5.71428571,  6.42857143,
        7.14285714,  7.85714286,  8.57142857,  9.28571429, 10.        ])
In [9]:
series_from_ndarray = pd.Series(my_array)
series_from_ndarray
Out[9]:
0      0.000000
1      0.714286
2      1.428571
3      2.142857
4      2.857143
5      3.571429
6      4.285714
7      5.000000
8      5.714286
9      6.428571
10     7.142857
11     7.857143
12     8.571429
13     9.285714
14    10.000000
dtype: float64
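
Series also illustrate the automatic data alignment mentioned in the feature list above: arithmetic between two Series is matched on index labels, and labels present in only one operand produce NaN. A minimal sketch with two small Series:

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
s1 + s2   # a NaN, b 12.0, c 23.0, d NaN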

Pandas DataFrames

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, an SQL table, or a dict of Series objects. You can create a DataFrame from:

  • Dict of 1D ndarrays, lists, dicts, or Series (see the sketch after this list)
  • 2-D numpy.ndarray
  • Text, CSV, or Excel files, or databases
  • Many other ways
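
A minimal sketch of the first case, building a small DataFrame from a dict of lists (hypothetical toy data, unrelated to the HR file used below):

df = pd.DataFrame({'city': ['Paris', 'Lyon', 'Nice'],
                   'population': [2.1, 0.5, 0.3]},   # made-up figures, millions of inhabitants
                  index=['a', 'b', 'c'])
df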

Reading the data.

Sample data: HR Employee Attrition and Performance. You can download it from the link below and add it to your working directory:

https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/

Import the xlsx file, using the EmployeeNumber variable as the index:

In [10]:
data = pd.read_excel(io="WA_Fn-UseC_-HR-Employee-Attrition.xlsx", sheet_name=0, index_col='EmployeeNumber')
In [11]:
data.head()
Out[11]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EnvironmentSatisfaction ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
EmployeeNumber
1 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 2 ... 1 80 0 8 0 1 6 4 0 5
2 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 3 ... 4 80 1 10 3 3 10 7 1 7
4 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 4 ... 2 80 0 7 3 3 0 0 0 0
5 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 4 ... 3 80 0 8 3 3 8 7 3 0
7 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 1 ... 4 80 1 6 3 3 2 2 2 2

5 rows × 34 columns

In [12]:
data.columns
Out[12]:
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
       'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18',
       'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
In [13]:
data['Attrition'].head()
Out[13]:
EmployeeNumber
1    Yes
2     No
4    Yes
5     No
7     No
Name: Attrition, dtype: object

Data Manipulation

In [14]:
data[['Age', 'Gender','YearsAtCompany']].head()
Out[14]:
Age Gender YearsAtCompany
EmployeeNumber
1 41 Female 6
2 49 Male 10
4 37 Male 0
5 33 Female 8
7 27 Male 2
In [15]:
data['AgeInMonths'] = 12*data['Age']
data['AgeInMonths'].head()
Out[15]:
EmployeeNumber
1    492
2    588
4    444
5    396
7    324
Name: AgeInMonths, dtype: int64
In [16]:
del data['AgeInMonths']
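
An equivalent, non-destructive alternative to del is DataFrame.drop, which returns a new object instead of modifying data in place; instead of the del above we could have written (a sketch):

data = data.drop(columns=['AgeInMonths'])   # alternative to the del above; returns a new DataFrame
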
In [17]:
data.columns
Out[17]:
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
       'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18',
       'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
In [18]:
data['BusinessTravel'][10:15]
Out[18]:
EmployeeNumber
14    Travel_Rarely
15    Travel_Rarely
16    Travel_Rarely
18    Travel_Rarely
19    Travel_Rarely
Name: BusinessTravel, dtype: object
In [19]:
data[10:15]
Out[19]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EnvironmentSatisfaction ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
EmployeeNumber
14 35 No Travel_Rarely 809 Research & Development 16 3 Medical 1 1 ... 3 80 1 6 5 3 5 4 0 3
15 29 No Travel_Rarely 153 Research & Development 15 2 Life Sciences 1 4 ... 4 80 0 10 3 3 9 5 0 8
16 31 No Travel_Rarely 670 Research & Development 26 1 Life Sciences 1 1 ... 4 80 1 5 1 2 5 2 4 3
18 34 No Travel_Rarely 1346 Research & Development 19 2 Medical 1 2 ... 3 80 1 3 2 3 2 2 1 2
19 28 Yes Travel_Rarely 103 Research & Development 24 3 Life Sciences 1 3 ... 2 80 0 6 4 3 4 2 0 3

5 rows × 34 columns

In [20]:
selected_EmployeeNumbers = [15, 94, 337, 1120]
In [21]:
data['YearsAtCompany'].loc[selected_EmployeeNumbers]
Out[21]:
EmployeeNumber
15      9
94      5
337     2
1120    7
Name: YearsAtCompany, dtype: int64
In [22]:
data.loc[selected_EmployeeNumbers]
Out[22]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EnvironmentSatisfaction ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
EmployeeNumber
15 29 No Travel_Rarely 153 Research & Development 15 2 Life Sciences 1 4 ... 4 80 0 10 3 3 9 5 0 8
94 29 No Travel_Rarely 1328 Research & Development 2 3 Life Sciences 1 3 ... 4 80 1 6 3 3 5 4 0 4
337 31 No Travel_Frequently 1327 Research & Development 3 4 Medical 1 2 ... 1 80 1 9 3 3 2 2 2 2
1120 29 No Travel_Rarely 1107 Research & Development 28 4 Life Sciences 1 3 ... 1 80 1 11 1 3 7 5 1 7

4 rows × 34 columns

In [23]:
data.loc[94,'YearsAtCompany']
Out[23]:
5
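
.loc selects by index label; its positional counterpart is .iloc, which addresses rows and columns by integer position regardless of the labels. A brief sketch on the same data:

data.iloc[0, 0]     # first row, first column: the Age of EmployeeNumber 1 (41)
data.iloc[10:15]    # rows 11-15 by position, the same rows as data[10:15] above
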
In [24]:
data['Department'].value_counts()
Out[24]:
Research & Development    961
Sales                     446
Human Resources            63
Name: Department, dtype: int64
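
The bar and pie charts below use pandas' plotting interface, which is built on top of matplotlib. In a Jupyter notebook the figures render inline (possibly after the %matplotlib inline magic); in a plain script you would typically also need something like this sketch:

import matplotlib.pyplot as plt

data['Department'].value_counts().plot(kind='barh', title='Department')
plt.show()   # render the figure outside of a notebook
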
In [25]:
data['Department'].value_counts().plot(kind='barh', title='Department')
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x11e83cc18>
In [26]:
data['Department'].value_counts().plot(kind='pie', title='Department')
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x1210450f0>
In [27]:
data['Attrition'].value_counts()
Out[27]:
No     1233
Yes     237
Name: Attrition, dtype: int64
In [28]:
data['Attrition'].value_counts(normalize=True)
Out[28]:
No     0.838776
Yes    0.161224
Name: Attrition, dtype: float64
In [29]:
data['HourlyRate'].mean()
Out[29]:
65.89115646258503

What's the overall satisfaction of the employees?

In [30]:
data['JobSatisfaction'].head()
Out[30]:
EmployeeNumber
1    4
2    2
4    3
5    3
7    2
Name: JobSatisfaction, dtype: int64

Let's relabel the levels of the JobSatisfaction variable:

In [31]:
JobSatisfaction_cat = {
    1: 'Low',
    2: 'Medium',
    3: 'High',
    4: 'Very High'
}
In [32]:
data['JobSatisfaction'] = data['JobSatisfaction'].map(JobSatisfaction_cat)
data['JobSatisfaction'].head()
Out[32]:
EmployeeNumber
1    Very High
2       Medium
4         High
5         High
7       Medium
Name: JobSatisfaction, dtype: object
In [33]:
data['JobSatisfaction'].value_counts()
Out[33]:
Very High    459
High         442
Low          289
Medium       280
Name: JobSatisfaction, dtype: int64
In [34]:
100*data['JobSatisfaction'].value_counts(normalize=True)
Out[34]:
Very High    31.224490
High         30.068027
Low          19.659864
Medium       19.047619
Name: JobSatisfaction, dtype: float64
In [35]:
data['JobSatisfaction'].value_counts(normalize=True).plot(kind='pie', title='Department')
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x1211472b0>
In [36]:
data['JobSatisfaction'] = data['JobSatisfaction'].astype(
    pd.CategoricalDtype(categories=['Low', 'Medium', 'High', 'Very High'],
                        ordered=True))
In [37]:
data['JobSatisfaction'].head()
Out[37]:
EmployeeNumber
1    Very High
2       Medium
4         High
5         High
7       Medium
Name: JobSatisfaction, dtype: category
Categories (4, object): [Low < Medium < High < Very High]
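
Because the categorical is now ordered, comparison operators respect the Low < Medium < High < Very High ordering. A small sketch counting the employees above 'Medium' (442 High + 459 Very High = 901, per the value_counts above):

(data['JobSatisfaction'] > 'Medium').sum()   # 901 employees with High or Very High satisfaction
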
In [38]:
data['JobSatisfaction'].value_counts().plot(kind='barh', title='Department')
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x1211889b0>
In [39]:
data['JobSatisfaction'].value_counts(sort=False).plot(kind='barh', title='Department')
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x121235358>
In [40]:
data['JobSatisfaction'] == 'Low'
Out[40]:
EmployeeNumber
1       False
2       False
4       False
5       False
7       False
8       False
10       True
11      False
12      False
13      False
14      False
15      False
16      False
18      False
19      False
20       True
21      False
22      False
23      False
24      False
26      False
27       True
28      False
30      False
31       True
32      False
33       True
35      False
36      False
38       True
        ...  
2025    False
2026    False
2027    False
2031    False
2032    False
2034    False
2035    False
2036    False
2037    False
2038     True
2040    False
2041    False
2044    False
2045    False
2046    False
2048    False
2049    False
2051    False
2052    False
2053    False
2054     True
2055     True
2056    False
2057     True
2060    False
2061    False
2062     True
2064    False
2065    False
2068    False
Name: JobSatisfaction, Length: 1470, dtype: bool
In [41]:
data.loc[data['JobSatisfaction'] == 'Low'].index
Out[41]:
Int64Index([  10,   20,   27,   31,   33,   38,   51,   52,   54,   68,
            ...
            1975, 1980, 1998, 2021, 2023, 2038, 2054, 2055, 2057, 2062],
           dtype='int64', name='EmployeeNumber', length=289)
In [42]:
data['JobInvolvement'].head()
Out[42]:
EmployeeNumber
1    3
2    2
4    2
5    3
7    3
Name: JobInvolvement, dtype: int64
In [43]:
subset_of_interest = data.loc[(data['JobSatisfaction'] == "Low") | (data['JobSatisfaction'] == "Very High")]
subset_of_interest.shape
Out[43]:
(748, 34)
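
The same subset can be written a bit more compactly with Series.isin, which is equivalent to the boolean expression above (a sketch):

subset_of_interest = data.loc[data['JobSatisfaction'].isin(['Low', 'Very High'])]
subset_of_interest.shape   # (748, 34), as before
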
In [44]:
subset_of_interest['JobSatisfaction'].value_counts()
Out[44]:
Very High    459
Low          289
High           0
Medium         0
Name: JobSatisfaction, dtype: int64

Let's remove the categories (levels) that we won't use:

In [45]:
subset_of_interest['JobSatisfaction'].cat.remove_unused_categories(inplace=True)
In [46]:
grouped = subset_of_interest.groupby('JobSatisfaction')
In [47]:
grouped.groups
Out[47]:
{'Low': Int64Index([  10,   20,   27,   31,   33,   38,   51,   52,   54,   68,
             ...
             1975, 1980, 1998, 2021, 2023, 2038, 2054, 2055, 2057, 2062],
            dtype='int64', name='EmployeeNumber', length=289),
 'Very High': Int64Index([   1,    8,   18,   22,   23,   24,   30,   36,   39,   40,
             ...
             2022, 2024, 2027, 2036, 2040, 2041, 2045, 2052, 2056, 2061],
            dtype='int64', name='EmployeeNumber', length=459)}

The Low satisfaction group

In [48]:
grouped.get_group('Low').head()
Out[48]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EnvironmentSatisfaction ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
EmployeeNumber
10 59 No Travel_Rarely 1324 Research & Development 3 3 Medical 1 3 ... 1 80 3 12 3 2 1 0 0 0
20 29 No Travel_Rarely 1389 Research & Development 21 4 Life Sciences 1 2 ... 3 80 1 10 1 3 10 9 8 8
27 36 Yes Travel_Rarely 1218 Sales 9 4 Life Sciences 1 3 ... 2 80 0 10 4 3 5 3 0 3
31 34 Yes Travel_Rarely 699 Research & Development 6 1 Medical 1 2 ... 3 80 0 8 2 3 4 2 1 3
33 32 Yes Travel_Frequently 1125 Research & Development 16 1 Life Sciences 1 2 ... 2 80 0 10 5 3 10 2 6 7

5 rows × 34 columns

and the Very High satisfaction group

In [49]:
grouped.get_group('Very High').head()
Out[49]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EnvironmentSatisfaction ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
EmployeeNumber
1 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 2 ... 1 80 0 8 0 1 6 4 0 5
8 32 No Travel_Frequently 1005 Research & Development 2 2 Life Sciences 1 4 ... 3 80 0 8 2 2 7 7 3 6
18 34 No Travel_Rarely 1346 Research & Development 19 2 Medical 1 2 ... 3 80 1 3 2 3 2 2 1 2
22 22 No Non-Travel 1123 Research & Development 16 2 Medical 1 4 ... 2 80 2 1 2 2 1 0 0 0
23 53 No Travel_Rarely 1219 Sales 2 4 Life Sciences 1 1 ... 3 80 0 31 3 3 25 8 3 7

5 rows × 34 columns

The average Age of each group

In [50]:
grouped['Age']
Out[50]:
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x1212e59b0>
In [51]:
grouped['Age'].mean()
Out[51]:
JobSatisfaction
Low          36.916955
Very High    36.795207
Name: Age, dtype: float64
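
A grouped column also accepts several aggregations at once through .agg, the general form of the split-apply-combine pattern mentioned in the feature list; a sketch:

grouped['Age'].agg(['mean', 'median', 'std'])   # one row per JobSatisfaction group
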
In [52]:
grouped['Age'].describe()
Out[52]:
count mean std min 25% 50% 75% max
JobSatisfaction
Low 289.0 36.916955 9.245496 19.0 30.0 36.0 42.0 60.0
Very High 459.0 36.795207 9.125609 18.0 30.0 35.0 43.0 60.0
In [53]:
grouped['Age'].describe().unstack()
Out[53]:
       JobSatisfaction
count  Low                289.000000
       Very High          459.000000
mean   Low                 36.916955
       Very High           36.795207
std    Low                  9.245496
       Very High            9.125609
min    Low                 19.000000
       Very High           18.000000
25%    Low                 30.000000
       Very High           30.000000
50%    Low                 36.000000
       Very High           35.000000
75%    Low                 42.000000
       Very High           43.000000
max    Low                 60.000000
       Very High           60.000000
dtype: float64

Comparing densities

In [54]:
grouped['Age'].plot(kind='density', title='Age')
Out[54]:
JobSatisfaction
Low          AxesSubplot(0.125,0.125;0.775x0.755)
Very High    AxesSubplot(0.125,0.125;0.775x0.755)
Name: Age, dtype: object

By Department

In [55]:
grouped['Department'].value_counts().unstack()
Out[55]:
Department       Human Resources  Research & Development  Sales
JobSatisfaction
Low                           11                     192     86
Very High                     17                     295    147

We can normalize it

In [56]:
grouped['Department'].value_counts(normalize=True).unstack()
Out[56]:
Department       Human Resources  Research & Development     Sales
JobSatisfaction
Low                     0.038062                0.664360  0.297578
Very High               0.037037                0.642702  0.320261
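
The same normalized table can also be produced with pd.crosstab, which cross-tabulates two columns directly; normalize='index' gives row proportions (a sketch):

pd.crosstab(subset_of_interest['JobSatisfaction'],
            subset_of_interest['Department'],
            normalize='index')
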
In [57]:
grouped['Department'].value_counts().unstack().plot(kind="barh")
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a22e99cf8>
In [58]:
grouped['Department'].value_counts(normalize=True).unstack().plot(kind="barh")
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a22f84470>

We can compare it with the whole sample

In [59]:
data['Department'].value_counts(normalize=True,sort=False).plot(kind="barh")
Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x121026e10>

But the colors and the order don't match the other bar chart; we need to reorder the Department variable:

In [60]:
data['Department'] = data['Department'].astype(
    pd.CategoricalDtype(categories=['Human Resources', 'Research & Development', 'Sales'],
                        ordered=True))
In [61]:
data['Department'].value_counts(normalize=True,sort=False).plot(kind="barh")
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x121026828>
In [62]:
grouped['DistanceFromHome'].describe().unstack()
Out[62]:
       JobSatisfaction
count  Low                289.000000
       Very High          459.000000
mean   Low                  9.190311
       Very High            9.030501
std    Low                  8.045127
       Very High            8.257004
min    Low                  1.000000
       Very High            1.000000
25%    Low                  2.000000
       Very High            2.000000
50%    Low                  7.000000
       Very High            7.000000
75%    Low                 14.000000
       Very High           14.000000
max    Low                 29.000000
       Very High           29.000000
dtype: float64
In [63]:
grouped['DistanceFromHome'].plot(kind='density', title='Distance From Home',legend=True)
Out[63]:
JobSatisfaction
Low          AxesSubplot(0.125,0.125;0.775x0.755)
Very High    AxesSubplot(0.125,0.125;0.775x0.755)
Name: DistanceFromHome, dtype: object
In [64]:
grouped['HourlyRate'].describe()
Out[64]:
count mean std min 25% 50% 75% max
JobSatisfaction
Low 289.0 68.636678 20.439515 30.0 52.0 72.0 86.0 100.0
Very High 459.0 64.681917 20.647571 30.0 47.0 64.0 82.5 100.0
In [65]:
grouped['HourlyRate'].plot(kind='density', title='Hourly Rate',legend=True)
Out[65]:
JobSatisfaction
Low          AxesSubplot(0.125,0.125;0.775x0.755)
Very High    AxesSubplot(0.125,0.125;0.775x0.755)
Name: HourlyRate, dtype: object
In [66]:
grouped['MonthlyIncome'].describe()
Out[66]:
count mean std min 25% 50% 75% max
JobSatisfaction
Low 289.0 6561.570934 4645.170134 1091.0 3072.0 4968.0 8564.0 19943.0
Very High 459.0 6472.732026 4573.906428 1051.0 2927.5 5126.0 7908.0 19845.0
In [67]:
grouped['HourlyRate'].plot(kind='density', title='Hourly Rate',legend=True)
Out[67]:
JobSatisfaction
Low          AxesSubplot(0.125,0.125;0.775x0.755)
Very High    AxesSubplot(0.125,0.125;0.775x0.755)
Name: HourlyRate, dtype: object