The Pandas Library

From the Pandas documentation:

pandas is a data analysis library providing fast, flexible, and expressive data structures designed to work with relational or table-like data (SQL table or Excel spreadsheet). It is a fundamental high-level building block for doing practical, real world data analysis in Python.

pandas is well suited for:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

Here are just a few of the things that pandas does well:

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast [HDF5 format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format)
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

Series and DataFrames

In [1]:
import pandas as pd

The Pandas Series

The Series data structure in Pandas is a one-dimensional labeled array.

  • Data in the array can be of any type (integers, strings, floating-point numbers, Python objects, etc.).
  • Data within a single Series is stored with one dtype; when types are mixed, pandas falls back to the generic object dtype.
  • Pandas Series objects always have an index: this gives them both ndarray-like and dict-like properties.
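The ndarray-like and dict-like behaviors mentioned above can be sketched with a toy Series:

```python
import pandas as pd

s = pd.Series([34, 56, 15], index=['Mon', 'Tue', 'Wed'])

# ndarray-like: positional access and vectorized arithmetic
print(s.iloc[0])         # 34
print((s * 2).tolist())  # [68, 112, 30]

# dict-like: label lookup and membership tests
print(s['Tue'])          # 56
print('Wed' in s)        # True
```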

Creating a Pandas Series

  • Creation from a list
  • Creation from a dictionary
  • Creation from an ndarray
  • From an external source like a file

From a list

In [2]:
temperature = [34, 56, 15, -9, -121, -5, 39]
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

# create series 
series_from_list = pd.Series(temperature, index=days)
series_from_list
Out[2]:
Mon     34
Tue     56
Wed     15
Thu     -9
Fri   -121
Sat     -5
Sun     39
dtype: int64

A Series should contain homogeneous types. When the values are mixed, pandas falls back to the generic object dtype:

In [3]:
temperature = [34, 56, 'a', -9, -121, -5, 39]
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

# create series 
series_from_list = pd.Series(temperature, index=days)
series_from_list
Out[3]:
Mon      34
Tue      56
Wed       a
Thu      -9
Fri    -121
Sat      -5
Sun      39
dtype: object

From a dictionary

In [4]:
my_dict = {'Mon': 33, 'Tue': 19, 'Wed': 15, 'Thu': 89, 'Fri': 11, 'Sat': -5, 'Sun': 9}
series_from_dict = pd.Series(my_dict)
series_from_dict
Out[4]:
Mon    33
Tue    19
Wed    15
Thu    89
Fri    11
Sat    -5
Sun     9
dtype: int64
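When creating a Series from a dictionary, an explicit index can also be passed: entries are selected and ordered by the index labels, and labels missing from the dictionary become NaN. A small sketch (the 'Hol' label is hypothetical):

```python
import pandas as pd

my_dict = {'Mon': 33, 'Tue': 19, 'Wed': 15}
# the index selects and orders the entries; 'Hol' is not in the dict,
# so its value becomes NaN (and the dtype switches to float64 to hold it)
s = pd.Series(my_dict, index=['Wed', 'Mon', 'Hol'])
print(s)
```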

From a NumPy array

In [5]:
import numpy as np
help(np.linspace)
Help on function linspace in module numpy.core.function_base:

linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)
    Return evenly spaced numbers over a specified interval.
    
    Returns `num` evenly spaced samples, calculated over the
    interval [`start`, `stop`].
    
    The endpoint of the interval can optionally be excluded.
    
    Parameters
    ----------
    start : scalar
        The starting value of the sequence.
    stop : scalar
        The end value of the sequence, unless `endpoint` is set to False.
        In that case, the sequence consists of all but the last of ``num + 1``
        evenly spaced samples, so that `stop` is excluded.  Note that the step
        size changes when `endpoint` is False.
    num : int, optional
        Number of samples to generate. Default is 50. Must be non-negative.
    endpoint : bool, optional
        If True, `stop` is the last sample. Otherwise, it is not included.
        Default is True.
    retstep : bool, optional
        If True, return (`samples`, `step`), where `step` is the spacing
        between samples.
    dtype : dtype, optional
        The type of the output array.  If `dtype` is not given, infer the data
        type from the other input arguments.
    
        .. versionadded:: 1.9.0
    
    Returns
    -------
    samples : ndarray
        There are `num` equally spaced samples in the closed interval
        ``[start, stop]`` or the half-open interval ``[start, stop)``
        (depending on whether `endpoint` is True or False).
    step : float, optional
        Only returned if `retstep` is True
    
        Size of spacing between samples.
    
    
    See Also
    --------
    arange : Similar to `linspace`, but uses a step size (instead of the
             number of samples).
    logspace : Samples uniformly distributed in log space.
    
    Examples
    --------
    >>> np.linspace(2.0, 3.0, num=5)
    array([ 2.  ,  2.25,  2.5 ,  2.75,  3.  ])
    >>> np.linspace(2.0, 3.0, num=5, endpoint=False)
    array([ 2. ,  2.2,  2.4,  2.6,  2.8])
    >>> np.linspace(2.0, 3.0, num=5, retstep=True)
    (array([ 2.  ,  2.25,  2.5 ,  2.75,  3.  ]), 0.25)
    
    Graphical illustration:
    
    >>> import matplotlib.pyplot as plt
    >>> N = 8
    >>> y = np.zeros(N)
    >>> x1 = np.linspace(0, 10, N, endpoint=True)
    >>> x2 = np.linspace(0, 10, N, endpoint=False)
    >>> plt.plot(x1, y, 'o')
    [<matplotlib.lines.Line2D object at 0x...>]
    >>> plt.plot(x2, y + 0.5, 'o')
    [<matplotlib.lines.Line2D object at 0x...>]
    >>> plt.ylim([-0.5, 1])
    (-0.5, 1)
    >>> plt.show()

In [6]:
my_array = np.linspace(0,10,15)
my_array
Out[6]:
array([ 0.        ,  0.71428571,  1.42857143,  2.14285714,  2.85714286,
        3.57142857,  4.28571429,  5.        ,  5.71428571,  6.42857143,
        7.14285714,  7.85714286,  8.57142857,  9.28571429, 10.        ])
In [7]:
series_from_ndarray = pd.Series(my_array)
series_from_ndarray
Out[7]:
0      0.000000
1      0.714286
2      1.428571
3      2.142857
4      2.857143
5      3.571429
6      4.285714
7      5.000000
8      5.714286
9      6.428571
10     7.142857
11     7.857143
12     8.571429
13     9.285714
14    10.000000
dtype: float64

Pandas DataFrames

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

You can create a DataFrame from:

  • Dict of 1D ndarrays, lists, dicts, or Series
  • 2-D numpy.ndarray
  • From text, CSV, Excel files or databases
  • Many other ways
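For instance, a DataFrame can be built from a dict of equal-length lists (toy values, not the HR dataset used below):

```python
import pandas as pd

df = pd.DataFrame({'Age': [41, 49, 37],
                   'Gender': ['Female', 'Male', 'Male']},
                  index=[1, 2, 4])  # the index provides the row labels
print(df.shape)          # (3, 2)
print(list(df.columns))  # ['Age', 'Gender']
```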

Reading the data

Sample data: HR Employee Attrition and Performance. You can get it from here:

https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/

In [8]:
file_xlsx = "/Users/dhafermalouche/Documents/Teaching_2018_2019/Teaching_python/WA_Fn-UseC_-HR-Employee-Attrition.xlsx"

Importing the xlsx file, using the EmployeeNumber variable as the index. (In recent pandas versions the keyword is sheet_name; the deprecated sheetname used here triggers a FutureWarning.)

In [9]:
data = pd.read_excel(io=file_xlsx, sheetname=0, index_col='EmployeeNumber')
/Users/dhafermalouche/anaconda3/lib/python3.6/site-packages/pandas/io/excel.py:329: FutureWarning: The `sheetname` keyword is deprecated, use `sheet_name` instead
  **kwds)
In [10]:
data.head()
Out[10]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EnvironmentSatisfaction ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
EmployeeNumber
1 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 2 ... 1 80 0 8 0 1 6 4 0 5
2 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 3 ... 4 80 1 10 3 3 10 7 1 7
4 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 4 ... 2 80 0 7 3 3 0 0 0 0
5 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 4 ... 3 80 0 8 3 3 8 7 3 0
7 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 1 ... 4 80 1 6 3 3 2 2 2 2

5 rows × 34 columns

Column names

In [11]:
data.columns
Out[11]:
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
       'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18',
       'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

Preview of one variable

In [12]:
data['Attrition'].head()
Out[12]:
EmployeeNumber
1    Yes
2     No
4    Yes
5     No
7     No
Name: Attrition, dtype: object

Extracting several columns

In [13]:
data[['Age', 'Gender','YearsAtCompany']].head()
Out[13]:
Age Gender YearsAtCompany
EmployeeNumber
1 41 Female 6
2 49 Male 10
4 37 Male 0
5 33 Female 8
7 27 Male 2

Adding a column

In [14]:
data['AgeInMonths'] = 12*data['Age']
data['AgeInMonths'].head()
Out[14]:
EmployeeNumber
1    492
2    588
4    444
5    396
7    324
Name: AgeInMonths, dtype: int64

Deleting a column

In [15]:
del data['AgeInMonths']
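An alternative to `del`, sketched here on a toy DataFrame, is `drop(columns=...)`, which returns a new DataFrame instead of mutating the original:

```python
import pandas as pd

df = pd.DataFrame({'Age': [41, 49], 'AgeInMonths': [492, 588]})
# drop(columns=...) leaves df untouched and returns a copy without the column
df2 = df.drop(columns=['AgeInMonths'])
print(list(df2.columns))  # ['Age']
print(list(df.columns))   # ['Age', 'AgeInMonths'] (original unchanged)
```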
In [16]:
data.columns
Out[16]:
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
       'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18',
       'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

Extract observations from a variable

In [17]:
data['BusinessTravel'][10:15]
Out[17]:
EmployeeNumber
14    Travel_Rarely
15    Travel_Rarely
16    Travel_Rarely
18    Travel_Rarely
19    Travel_Rarely
Name: BusinessTravel, dtype: object

Extract rows 10 to 14 (positional slicing: the endpoint is excluded)

In [18]:
data[10:15]
Out[18]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EnvironmentSatisfaction ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
EmployeeNumber
14 35 No Travel_Rarely 809 Research & Development 16 3 Medical 1 1 ... 3 80 1 6 5 3 5 4 0 3
15 29 No Travel_Rarely 153 Research & Development 15 2 Life Sciences 1 4 ... 4 80 0 10 3 3 9 5 0 8
16 31 No Travel_Rarely 670 Research & Development 26 1 Life Sciences 1 1 ... 4 80 1 5 1 2 5 2 4 3
18 34 No Travel_Rarely 1346 Research & Development 19 2 Medical 1 2 ... 3 80 1 3 2 3 2 2 1 2
19 28 Yes Travel_Rarely 103 Research & Development 24 3 Life Sciences 1 3 ... 2 80 0 6 4 3 4 2 0 3

5 rows × 34 columns

Select by labels

In [19]:
selected_EmployeeNumbers = [15, 94, 337, 1120]
In [20]:
data['YearsAtCompany'].loc[selected_EmployeeNumbers]
Out[20]:
EmployeeNumber
15      9
94      5
337     2
1120    7
Name: YearsAtCompany, dtype: int64
In [21]:
data.loc[selected_EmployeeNumbers]
Out[21]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EnvironmentSatisfaction ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
EmployeeNumber
15 29 No Travel_Rarely 153 Research & Development 15 2 Life Sciences 1 4 ... 4 80 0 10 3 3 9 5 0 8
94 29 No Travel_Rarely 1328 Research & Development 2 3 Life Sciences 1 3 ... 4 80 1 6 3 3 5 4 0 4
337 31 No Travel_Frequently 1327 Research & Development 3 4 Medical 1 2 ... 1 80 1 9 3 3 2 2 2 2
1120 29 No Travel_Rarely 1107 Research & Development 28 4 Life Sciences 1 3 ... 1 80 1 11 1 3 7 5 1 7

4 rows × 34 columns

Extracting a single value

In [22]:
data.loc[94,'YearsAtCompany']
Out[22]:
5
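For a single scalar, pandas also provides the faster `.at` accessor, equivalent to `.loc` for one cell (sketched on a toy frame):

```python
import pandas as pd

df = pd.DataFrame({'YearsAtCompany': [9, 5]}, index=[15, 94])
# .at is a scalar-only accessor; same semantics as .loc[94, 'YearsAtCompany']
print(df.at[94, 'YearsAtCompany'])  # 5
```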

Let us answer some questions using the data

How many employees are there by department in the dataset?

In [23]:
data['Department'].value_counts()
Out[23]:
Research & Development    961
Sales                     446
Human Resources            63
Name: Department, dtype: int64
In [24]:
data['Department'].value_counts().plot(kind='barh', title='Department')
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b565358>

Or as a pie chart:

In [25]:
data['Department'].value_counts().plot(kind='pie', title='Department')
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x112c14b00>

What is the overall attrition rate?

In [26]:
data['Attrition'].value_counts()
Out[26]:
No     1233
Yes     237
Name: Attrition, dtype: int64

We can compute the proportions

In [27]:
data['Attrition'].value_counts(normalize=True)
Out[27]:
No     0.838776
Yes    0.161224
Name: Attrition, dtype: float64

What is the average hourly rate?

In [28]:
data['HourlyRate'].mean()
Out[28]:
65.89115646258503

What is the overall satisfaction of the employees?

In [29]:
data['JobSatisfaction'].head()
Out[29]:
EmployeeNumber
1    4
2    2
4    3
5    3
7    2
Name: JobSatisfaction, dtype: int64

Let us change the levels of the variable

In [30]:
JobSatisfaction_cat = {
    1: 'Low',
    2: 'Medium',
    3: 'High',
    4: 'Very High'
}
In [31]:
data['JobSatisfaction'] = data['JobSatisfaction'].map(JobSatisfaction_cat)
data['JobSatisfaction'].head()
Out[31]:
EmployeeNumber
1    Very High
2       Medium
4         High
5         High
7       Medium
Name: JobSatisfaction, dtype: object
In [32]:
data['JobSatisfaction'].value_counts()
Out[32]:
Very High    459
High         442
Low          289
Medium       280
Name: JobSatisfaction, dtype: int64
In [33]:
100*data['JobSatisfaction'].value_counts(normalize=True)
Out[33]:
Very High    31.224490
High         30.068027
Low          19.659864
Medium       19.047619
Name: JobSatisfaction, dtype: float64
In [34]:
data['JobSatisfaction'].value_counts(normalize=True).plot(kind='pie', title='Job Satisfaction')
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x10ab111d0>

Transform the JobSatisfaction variable into an ordered categorical variable

In [35]:
data['JobSatisfaction'] = data['JobSatisfaction'].astype(dtype='category', 
                               categories=['Low', 'Medium', 'High', 'Very High'],
                               ordered=True)
/Users/dhafermalouche/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: FutureWarning: specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead
  This is separate from the ipykernel package so we can avoid doing imports until
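As the FutureWarning suggests, newer pandas versions expect an explicit CategoricalDtype instead of passing categories to astype. A minimal sketch of the non-deprecated form, on toy values:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(categories=['Low', 'Medium', 'High', 'Very High'],
                            ordered=True)
s = pd.Series(['Very High', 'Medium', 'High']).astype(cat_type)
print(s.cat.ordered)  # True
print(s.min())        # 'Medium' -- the smallest category present
```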
In [36]:
data['JobSatisfaction'].head()
Out[36]:
EmployeeNumber
1    Very High
2       Medium
4         High
5         High
7       Medium
Name: JobSatisfaction, dtype: category
Categories (4, object): [Low < Medium < High < Very High]
In [37]:
data['JobSatisfaction'].value_counts().plot(kind='barh', title='Job Satisfaction')
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a9bb8d0>
In [38]:
data['JobSatisfaction'].value_counts(sort=False).plot(kind='barh', title='Job Satisfaction')
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x10aa7d668>

Which employees have a Low level of JobSatisfaction?

In [39]:
data['JobSatisfaction'] == 'Low'
Out[39]:
EmployeeNumber
1       False
2       False
4       False
5       False
7       False
8       False
10       True
11      False
12      False
13      False
14      False
15      False
16      False
18      False
19      False
20       True
21      False
22      False
23      False
24      False
26      False
27       True
28      False
30      False
31       True
32      False
33       True
35      False
36      False
38       True
        ...  
2025    False
2026    False
2027    False
2031    False
2032    False
2034    False
2035    False
2036    False
2037    False
2038     True
2040    False
2041    False
2044    False
2045    False
2046    False
2048    False
2049    False
2051    False
2052    False
2053    False
2054     True
2055     True
2056    False
2057     True
2060    False
2061    False
2062     True
2064    False
2065    False
2068    False
Name: JobSatisfaction, Length: 1470, dtype: bool

We can get their index

In [40]:
data.loc[data['JobSatisfaction'] == 'Low'].index
Out[40]:
Int64Index([  10,   20,   27,   31,   33,   38,   51,   52,   54,   68,
            ...
            1975, 1980, 1998, 2021, 2023, 2038, 2054, 2055, 2057, 2062],
           dtype='int64', name='EmployeeNumber', length=289)

Now let us list the employees with low satisfaction and low job involvement. First, note that JobInvolvement is still integer-coded:

In [41]:
data['JobInvolvement'].head()
Out[41]:
EmployeeNumber
1    3
2    2
4    2
5    3
7    3
Name: JobInvolvement, dtype: int64
In [42]:
data.loc[(data['JobSatisfaction'] == 'Low') & (data['JobInvolvement'] == 'Low')].index.values
/Users/dhafermalouche/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py:1167: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  result = method(y)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-42-eab3bdba2345> in <module>()
----> 1 data.loc[(data['JobSatisfaction'] == 'Low') & (data['JobInvolvement'] == 'Low')].index.values

~/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
   1281 
   1282             with np.errstate(all='ignore'):
-> 1283                 res = na_op(values, other)
   1284             if is_scalar(res):
   1285                 raise TypeError('Could not compare {typ} type with Series'

~/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
   1167                     result = method(y)
   1168                 if result is NotImplemented:
-> 1169                     raise TypeError("invalid type comparison")
   1170             else:
   1171                 result = op(x, y)

TypeError: invalid type comparison
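The comparison fails because JobInvolvement holds integer codes (1-4), so `== 'Low'` compares integers with a string. One fix, sketched here on toy values rather than the HR dataset, is to map the codes to labels first, the same technique used above for JobSatisfaction:

```python
import pandas as pd

# toy stand-in for data['JobInvolvement'], indexed by EmployeeNumber
involvement = pd.Series([3, 1, 2, 1], index=[1, 2, 4, 5])
labels = {1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'}
involvement_cat = involvement.map(labels)

# now the string comparison is meaningful
low_mask = involvement_cat == 'Low'
print(list(involvement_cat[low_mask].index))  # [2, 5]
```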

We will compare the employees with Low and Very High JobSatisfaction across the following variables: Age, Department, DistanceFromHome, HourlyRate, MonthlyIncome and YearsAtCompany.

We will create a new DataFrame containing only the employees with "Low" and "Very High" JobSatisfaction levels.

In [43]:
subset_of_interest = data.loc[(data['JobSatisfaction'] == "Low") | (data['JobSatisfaction'] == "Very High")]
subset_of_interest.shape
Out[43]:
(748, 34)
In [44]:
subset_of_interest['JobSatisfaction'].value_counts()
Out[44]:
Very High    459
Low          289
High           0
Medium         0
Name: JobSatisfaction, dtype: int64

Let's then remove the categories or levels that we won't use

In [45]:
subset_of_interest['JobSatisfaction'].cat.remove_unused_categories(inplace=True)
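Note that `inplace=True` for categorical methods was deprecated and later removed; the assignment form works across pandas versions (toy sketch):

```python
import pandas as pd

s = pd.Series(pd.Categorical(['Low', 'Very High'],
                             categories=['Low', 'Medium', 'High', 'Very High']))
# remove_unused_categories() returns a new Series; reassign instead of inplace
s = s.cat.remove_unused_categories()
print(list(s.cat.categories))  # ['Low', 'Very High']
```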
In [46]:
subset_of_interest['JobSatisfaction'].value_counts()
Out[46]:
Very High    459
Low          289
Name: JobSatisfaction, dtype: int64

Let's split our new DataFrame into groups.

In [47]:
grouped = subset_of_interest.groupby('JobSatisfaction')
In [48]:
grouped.groups
Out[48]:
{'Low': Int64Index([  10,   20,   27,   31,   33,   38,   51,   52,   54,   68,
             ...
             1975, 1980, 1998, 2021, 2023, 2038, 2054, 2055, 2057, 2062],
            dtype='int64', name='EmployeeNumber', length=289),
 'Very High': Int64Index([   1,    8,   18,   22,   23,   24,   30,   36,   39,   40,
             ...
             2022, 2024, 2027, 2036, 2040, 2041, 2045, 2052, 2056, 2061],
            dtype='int64', name='EmployeeNumber', length=459)}

The Low satisfaction group

In [49]:
grouped.get_group('Low').head()
Out[49]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EnvironmentSatisfaction ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
EmployeeNumber
10 59 No Travel_Rarely 1324 Research & Development 3 3 Medical 1 3 ... 1 80 3 12 3 2 1 0 0 0
20 29 No Travel_Rarely 1389 Research & Development 21 4 Life Sciences 1 2 ... 3 80 1 10 1 3 10 9 8 8
27 36 Yes Travel_Rarely 1218 Sales 9 4 Life Sciences 1 3 ... 2 80 0 10 4 3 5 3 0 3
31 34 Yes Travel_Rarely 699 Research & Development 6 1 Medical 1 2 ... 3 80 0 8 2 3 4 2 1 3
33 32 Yes Travel_Frequently 1125 Research & Development 16 1 Life Sciences 1 2 ... 2 80 0 10 5 3 10 2 6 7

5 rows × 34 columns

and the Very High satisfaction group

In [50]:
grouped.get_group('Very High').head()
Out[50]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EnvironmentSatisfaction ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
EmployeeNumber
1 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 2 ... 1 80 0 8 0 1 6 4 0 5
8 32 No Travel_Frequently 1005 Research & Development 2 2 Life Sciences 1 4 ... 3 80 0 8 2 2 7 7 3 6
18 34 No Travel_Rarely 1346 Research & Development 19 2 Medical 1 2 ... 3 80 1 3 2 3 2 2 1 2
22 22 No Non-Travel 1123 Research & Development 16 2 Medical 1 4 ... 2 80 2 1 2 2 1 0 0 0
23 53 No Travel_Rarely 1219 Sales 2 4 Life Sciences 1 1 ... 3 80 0 31 3 3 25 8 3 7

5 rows × 34 columns

The average Age of each group

In [53]:
grouped['Age']
Out[53]:
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x1162a1ba8>
In [54]:
grouped['Age'].mean()
Out[54]:
JobSatisfaction
Low          36.916955
Very High    36.795207
Name: Age, dtype: float64
In [55]:
grouped['Age'].describe()
Out[55]:
count mean std min 25% 50% 75% max
JobSatisfaction
Low 289.0 36.916955 9.245496 19.0 30.0 36.0 42.0 60.0
Very High 459.0 36.795207 9.125609 18.0 30.0 35.0 43.0 60.0
In [56]:
grouped['Age'].describe().unstack()
Out[56]:
       JobSatisfaction
count  Low                289.000000
       Very High          459.000000
mean   Low                 36.916955
       Very High           36.795207
std    Low                  9.245496
       Very High            9.125609
min    Low                 19.000000
       Very High           18.000000
25%    Low                 30.000000
       Very High           30.000000
50%    Low                 36.000000
       Very High           35.000000
75%    Low                 42.000000
       Very High           43.000000
max    Low                 60.000000
       Very High           60.000000
dtype: float64

Let's compare the densities

In [57]:
grouped['Age'].plot(kind='density', title='Age')
Out[57]:
JobSatisfaction
Low          AxesSubplot(0.125,0.125;0.775x0.755)
Very High    AxesSubplot(0.125,0.125;0.775x0.755)
Name: Age, dtype: object

By Department

In [58]:
grouped['Department'].value_counts().unstack()
Out[58]:
Department Human Resources Research & Development Sales
JobSatisfaction
Low 11 192 86
Very High 17 295 147

We can normalize it

In [59]:
grouped['Department'].value_counts(normalize=True).unstack()
Out[59]:
Department Human Resources Research & Development Sales
JobSatisfaction
Low 0.038062 0.664360 0.297578
Very High 0.037037 0.642702 0.320261
In [60]:
grouped['Department'].value_counts().unstack().plot(kind="barh")
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x118084a58>
In [61]:
grouped['Department'].value_counts(normalize=True).unstack().plot(kind="barh")
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x118157c50>

We can compare it with the whole sample

In [62]:
data['Department'].value_counts(normalize=True,sort=False).plot(kind="barh")
Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x118237668>

But the colors and the order don't match the other bar chart: we need to reorder the Department variable.

In [63]:
data['Department'] = data['Department'].astype(dtype='category', 
                               categories=['Human Resources', 'Research & Development', 'Sales'],
                               ordered=True)
/Users/dhafermalouche/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: FutureWarning: specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead
  This is separate from the ipykernel package so we can avoid doing imports until
In [64]:
data['Department'].value_counts(normalize=True,sort=False).plot(kind="barh")
Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x1182fb8d0>

DistanceFromHome

In [65]:
grouped['DistanceFromHome'].describe().unstack()
Out[65]:
       JobSatisfaction
count  Low                289.000000
       Very High          459.000000
mean   Low                  9.190311
       Very High            9.030501
std    Low                  8.045127
       Very High            8.257004
min    Low                  1.000000
       Very High            1.000000
25%    Low                  2.000000
       Very High            2.000000
50%    Low                  7.000000
       Very High            7.000000
75%    Low                 14.000000
       Very High           14.000000
max    Low                 29.000000
       Very High           29.000000
dtype: float64
In [66]:
grouped['DistanceFromHome'].plot(kind='density', title='Distance From Home',legend=True)
Out[66]:
JobSatisfaction
Low          AxesSubplot(0.125,0.125;0.775x0.755)
Very High    AxesSubplot(0.125,0.125;0.775x0.755)
Name: DistanceFromHome, dtype: object

Hourly Rate

In [67]:
grouped['HourlyRate'].describe()
Out[67]:
count mean std min 25% 50% 75% max
JobSatisfaction
Low 289.0 68.636678 20.439515 30.0 52.0 72.0 86.0 100.0
Very High 459.0 64.681917 20.647571 30.0 47.0 64.0 82.5 100.0
In [68]:
grouped['HourlyRate'].plot(kind='density', title='Hourly Rate',legend=True)
Out[68]:
JobSatisfaction
Low          AxesSubplot(0.125,0.125;0.775x0.755)
Very High    AxesSubplot(0.125,0.125;0.775x0.755)
Name: HourlyRate, dtype: object

Monthly Income

In [69]:
grouped['MonthlyIncome'].describe()
Out[69]:
count mean std min 25% 50% 75% max
JobSatisfaction
Low 289.0 6561.570934 4645.170134 1091.0 3072.0 4968.0 8564.0 19943.0
Very High 459.0 6472.732026 4573.906428 1051.0 2927.5 5126.0 7908.0 19845.0
In [70]:
grouped['MonthlyIncome'].plot(kind='density', title='Monthly Income',legend=True)
Out[70]:
JobSatisfaction
Low          AxesSubplot(0.125,0.125;0.775x0.755)
Very High    AxesSubplot(0.125,0.125;0.775x0.755)
Name: MonthlyIncome, dtype: object

Projects

Project 1

Can you reproduce the same analysis in R?

Project 2

Can you do the same with another dataset collected from surveys, such as the Afrobarometer or the World Values Survey?