The Panda Libray ¶

From the Pandas documentation:

pandas is a data analysis library providing fast, flexible, and expressive data structures designed to work with relational or table-like data (SQL table or Excel spreadsheet). It is a fundamental high-level building block for doing practical, real world data analysis in Python.

pandas is well suited for:

Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
Ordered and unordered (not necessarily fixed-frequency) time series data.
Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

Here are just a few of the things that pandas does well:

Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data

Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects

Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations

Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data

Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects

Intelligent label-based slicing, fancy indexing, and subsetting of large data sets

Intuitive merging and joining data sets

Flexible reshaping and pivoting of data sets

Hierarchical labeling of axes (possible to have multiple labels per tick)

Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast [HDF5 format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format)

Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

Series and DataFrames ¶

import pandas as pd

The Panda Series ¶

The Series data structure in Pandas is a one-dimensional labeled array.

Data in the array can be of any type (integers, strings, floating point numbers, Python objects, etc.).
Data within the array is homogeneous
Pandas Series objects always have an index: this gives them both ndarray-like and dict-like properties.

Creating a Panda Serie ¶

Creation from a list
Creation from a dictionary
Creation from a ndarray
From an external source like a file

From a list¶

temperature = [34, 56, 15, -9, -121, -5, 39]
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

# create series 
series_from_list = pd.Series(temperature, index=days)
series_from_list

Mon     34
Tue     56
Wed     15
Thu     -9
Fri   -121
Sat     -5
Sun     39
dtype: int64

The series should contains homogeneous types

temperature = [34, 56, 'a', -9, -121, -5, 39]
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

# create series 
series_from_list = pd.Series(temperature, index=days)
series_from_list

Mon      34
Tue      56
Wed       a
Thu      -9
Fri    -121
Sat      -5
Sun      39
dtype: object

from a dictionary¶

my_dict = {'Mon': 33, 'Tue': 19, 'Wed': 15, 'Thu': 89, 'Fri': 11, 'Sat': -5, 'Sun': 9}
series_from_dict = pd.Series(my_dict)
series_from_dict

Mon    33
Tue    19
Wed    15
Thu    89
Fri    11
Sat    -5
Sun     9
dtype: int64

From a numpy array¶

import numpy as np
help(np.linspace)

Help on function linspace in module numpy.core.function_base:

linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)
    Return evenly spaced numbers over a specified interval.
    
    Returns `num` evenly spaced samples, calculated over the
    interval [`start`, `stop`].
    
    The endpoint of the interval can optionally be excluded.
    
    Parameters
    ----------
    start : scalar
        The starting value of the sequence.
    stop : scalar
        The end value of the sequence, unless `endpoint` is set to False.
        In that case, the sequence consists of all but the last of ``num + 1``
        evenly spaced samples, so that `stop` is excluded.  Note that the step
        size changes when `endpoint` is False.
    num : int, optional
        Number of samples to generate. Default is 50. Must be non-negative.
    endpoint : bool, optional
        If True, `stop` is the last sample. Otherwise, it is not included.
        Default is True.
    retstep : bool, optional
        If True, return (`samples`, `step`), where `step` is the spacing
        between samples.
    dtype : dtype, optional
        The type of the output array.  If `dtype` is not given, infer the data
        type from the other input arguments.
    
        .. versionadded:: 1.9.0
    
    Returns
    -------
    samples : ndarray
        There are `num` equally spaced samples in the closed interval
        ``[start, stop]`` or the half-open interval ``[start, stop)``
        (depending on whether `endpoint` is True or False).
    step : float, optional
        Only returned if `retstep` is True
    
        Size of spacing between samples.
    
    
    See Also
    --------
    arange : Similar to `linspace`, but uses a step size (instead of the
             number of samples).
    logspace : Samples uniformly distributed in log space.
    
    Examples
    --------
    >>> np.linspace(2.0, 3.0, num=5)
    array([ 2.  ,  2.25,  2.5 ,  2.75,  3.  ])
    >>> np.linspace(2.0, 3.0, num=5, endpoint=False)
    array([ 2. ,  2.2,  2.4,  2.6,  2.8])
    >>> np.linspace(2.0, 3.0, num=5, retstep=True)
    (array([ 2.  ,  2.25,  2.5 ,  2.75,  3.  ]), 0.25)
    
    Graphical illustration:
    
    >>> import matplotlib.pyplot as plt
    >>> N = 8
    >>> y = np.zeros(N)
    >>> x1 = np.linspace(0, 10, N, endpoint=True)
    >>> x2 = np.linspace(0, 10, N, endpoint=False)
    >>> plt.plot(x1, y, 'o')
    [<matplotlib.lines.Line2D object at 0x...>]
    >>> plt.plot(x2, y + 0.5, 'o')
    [<matplotlib.lines.Line2D object at 0x...>]
    >>> plt.ylim([-0.5, 1])
    (-0.5, 1)
    >>> plt.show()

my_array = np.linspace(0,10,15)
my_array

array([ 0.        ,  0.71428571,  1.42857143,  2.14285714,  2.85714286,
        3.57142857,  4.28571429,  5.        ,  5.71428571,  6.42857143,
        7.14285714,  7.85714286,  8.57142857,  9.28571429, 10.        ])

series_from_ndarray = pd.Series(my_array)
series_from_ndarray

0      0.000000
1      0.714286
2      1.428571
3      2.142857
4      2.857143
5      3.571429
6      4.285714
7      5.000000
8      5.714286
9      6.428571
10     7.142857
11     7.857143
12     8.571429
13     9.285714
14    10.000000
dtype: float64

Pandas DataFrames ¶

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

You can create a DataFrame from:

Dict of 1D ndarrays, lists, dicts, or Series
2-D numpy.ndarray
From text, CSV, Excel files or databases
Many other ways

Reading the data

Sample data: HR Employee Attrition and Performance You can get it from here:

https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/

file_xlsx = "/Users/dhafermalouche/Documents/Teaching_2018_2019/Teaching_python/WA_Fn-UseC_-HR-Employee-Attrition.xlsx"

Importing the xlsx file by considering the variable EmployeeNumber as an Index variable

data = pd.read_excel(io=file_xlsx, sheetname=0, index_col='EmployeeNumber')

/Users/dhafermalouche/anaconda3/lib/python3.6/site-packages/pandas/io/excel.py:329: FutureWarning: The `sheetname` keyword is deprecated, use `sheet_name` instead
  **kwds)

data.head()

Column names

data.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
       'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18',
       'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

Preview on one variable

data['Attrition'].head()

EmployeeNumber
1    Yes
2     No
4    Yes
5     No
7     No
Name: Attrition, dtype: object

Extracting several columns

data[['Age', 'Gender','YearsAtCompany']].head()

Adding a column

data['AgeInMonths'] = 12*data['Age']
data['AgeInMonths'].head()

EmployeeNumber
1    492
2    588
4    444
5    396
7    324
Name: AgeInMonths, dtype: int64

Deleting a column

del data['AgeInMonths']

data.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
       'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18',
       'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

Extract observations from a variable

data['BusinessTravel'][10:15]

EmployeeNumber
14    Travel_Rarely
15    Travel_Rarely
16    Travel_Rarely
18    Travel_Rarely
19    Travel_Rarely
Name: BusinessTravel, dtype: object

Extract the rows 10 to 15

data[10:15]

Select by labels

selected_EmployeeNumbers = [15, 94, 337, 1120]

data['YearsAtCompany'].loc[selected_EmployeeNumbers]

EmployeeNumber
15      9
94      5
337     2
1120    7
Name: YearsAtCompany, dtype: int64

data.loc[selected_EmployeeNumbers]

Extracting a single value

data.loc[94,'YearsAtCompany']

5

Let us answer to some questions from the data

How many employees are there by department in the dataset?

data['Department'].value_counts()

Research & Development    961
Sales                     446
Human Resources            63
Name: Department, dtype: int64

data['Department'].value_counts().plot(kind='barh', title='Department')

<matplotlib.axes._subplots.AxesSubplot at 0x10b565358>

or a pie chart

data['Department'].value_counts().plot(kind='pie', title='Department')

<matplotlib.axes._subplots.AxesSubplot at 0x112c14b00>

What is the overall attrition rate?

data['Attrition'].value_counts()

No     1233
Yes     237
Name: Attrition, dtype: int64

We can compute the proportions

data['Attrition'].value_counts(normalize=True)

No     0.838776
Yes    0.161224
Name: Attrition, dtype: float64

What is the average hourly rate?

data['HourlyRate'].mean()

65.89115646258503

What's the overall statisfaction of the Employees

data['JobSatisfaction'].head()

EmployeeNumber
1    4
2    2
4    3
5    3
7    2
Name: JobSatisfaction, dtype: int64

Let us change the levels of the variable

JobSatisfaction_cat = {
    1: 'Low',
    2: 'Medium',
    3: 'High',
    4: 'Very High'
}

data['JobSatisfaction'] = data['JobSatisfaction'].map(JobSatisfaction_cat)
data['JobSatisfaction'].head()

EmployeeNumber
1    Very High
2       Medium
4         High
5         High
7       Medium
Name: JobSatisfaction, dtype: object

data['JobSatisfaction'].value_counts()

Very High    459
High         442
Low          289
Medium       280
Name: JobSatisfaction, dtype: int64

100*data['JobSatisfaction'].value_counts(normalize=True)

Very High    31.224490
High         30.068027
Low          19.659864
Medium       19.047619
Name: JobSatisfaction, dtype: float64

data['JobSatisfaction'].value_counts(normalize=True).plot(kind='pie', title='Department')

<matplotlib.axes._subplots.AxesSubplot at 0x10ab111d0>

Transfor the JobStatisfaction variable into an ordered variable

data['JobSatisfaction'] = data['JobSatisfaction'].astype(dtype='category', 
                               categories=['Low', 'Medium', 'High', 'Very High'],
                               ordered=True)

/Users/dhafermalouche/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: FutureWarning: specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead
  This is separate from the ipykernel package so we can avoid doing imports until

data['JobSatisfaction'].head()

EmployeeNumber
1    Very High
2       Medium
4         High
5         High
7       Medium
Name: JobSatisfaction, dtype: category
Categories (4, object): [Low < Medium < High < Very High]

data['JobSatisfaction'].value_counts().plot(kind='barh', title='Department')

<matplotlib.axes._subplots.AxesSubplot at 0x10a9bb8d0>

data['JobSatisfaction'].value_counts(sort=False).plot(kind='barh', title='Department')

<matplotlib.axes._subplots.AxesSubplot at 0x10aa7d668>

Here the list of the employees with Low level of JobSatisfaction?

data['JobSatisfaction'] == 'Low'

EmployeeNumber
1       False
2       False
4       False
5       False
7       False
8       False
10       True
11      False
12      False
13      False
14      False
15      False
16      False
18      False
19      False
20       True
21      False
22      False
23      False
24      False
26      False
27       True
28      False
30      False
31       True
32      False
33       True
35      False
36      False
38       True
        ...  
2025    False
2026    False
2027    False
2031    False
2032    False
2034    False
2035    False
2036    False
2037    False
2038     True
2040    False
2041    False
2044    False
2045    False
2046    False
2048    False
2049    False
2051    False
2052    False
2053    False
2054     True
2055     True
2056    False
2057     True
2060    False
2061    False
2062     True
2064    False
2065    False
2068    False
Name: JobSatisfaction, Length: 1470, dtype: bool

We can get their index

data.loc[data['JobSatisfaction'] == 'Low'].index

Int64Index([  10,   20,   27,   31,   33,   38,   51,   52,   54,   68,
            ...
            1975, 1980, 1998, 2021, 2023, 2038, 2054, 2055, 2057, 2062],
           dtype='int64', name='EmployeeNumber', length=289)

Give us the list of employees with low statistfaction and low job involement

data['JobInvolvement'].head()

EmployeeNumber
1    3
2    2
4    2
5    3
7    3
Name: JobInvolvement, dtype: int64

data.loc[(data['JobSatisfaction'] == 'Low') & (data['JobInvolvement'] == 'Low')].index.values

/Users/dhafermalouche/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py:1167: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  result = method(y)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-42-eab3bdba2345> in <module>()
----> 1 data.loc[(data['JobSatisfaction'] == 'Low') & (data['JobInvolvement'] == 'Low')].index.values

~/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
   1281 
   1282             with np.errstate(all='ignore'):
-> 1283                 res = na_op(values, other)
   1284             if is_scalar(res):
   1285                 raise TypeError('Could not compare {typ} type with Series'

~/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
   1167                     result = method(y)
   1168                 if result is NotImplemented:
-> 1169                     raise TypeError("invalid type comparison")
   1170             else:
   1171                 result = op(x, y)

TypeError: invalid type comparison

We will compare the employees with Low and Very High JobSatisfaction across the following variables: Age, Department, DistanceFromHome, HourlyRate, MonthlyIncome and YearsAtCompany.

We will create a new DataFrame containing only the employees with "Low" and "Very High" JobSatisfaction levels.

subset_of_interest = data.loc[(data['JobSatisfaction'] == "Low") | (data['JobSatisfaction'] == "Very High")]
subset_of_interest.shape

(748, 34)

subset_of_interest['JobSatisfaction'].value_counts()

Very High    459
Low          289
High           0
Medium         0
Name: JobSatisfaction, dtype: int64

Let's then remove the categories or levels that we won't use

subset_of_interest['JobSatisfaction'].cat.remove_unused_categories(inplace=True)

subset_of_interest['JobSatisfaction'].value_counts()

Very High    459
Low          289
Name: JobSatisfaction, dtype: int64

Let's split our new DataFrame into groups.

grouped = subset_of_interest.groupby('JobSatisfaction')

grouped.groups

{'Low': Int64Index([  10,   20,   27,   31,   33,   38,   51,   52,   54,   68,
             ...
             1975, 1980, 1998, 2021, 2023, 2038, 2054, 2055, 2057, 2062],
            dtype='int64', name='EmployeeNumber', length=289),
 'Very High': Int64Index([   1,    8,   18,   22,   23,   24,   30,   36,   39,   40,
             ...
             2022, 2024, 2027, 2036, 2040, 2041, 2045, 2052, 2056, 2061],
            dtype='int64', name='EmployeeNumber', length=459)}

The Low statisfaction group

grouped.get_group('Low').head()

and the Very High satisfaction group

grouped.get_group('Very High').head()

The average of the Age of each group

grouped['Age']

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x1162a1ba8>

grouped['Age'].mean()

JobSatisfaction
Low          36.916955
Very High    36.795207
Name: Age, dtype: float64

grouped['Age'].describe()

grouped['Age'].describe().unstack()

       JobSatisfaction
count  Low                289.000000
       Very High          459.000000
mean   Low                 36.916955
       Very High           36.795207
std    Low                  9.245496
       Very High            9.125609
min    Low                 19.000000
       Very High           18.000000
25%    Low                 30.000000
       Very High           30.000000
50%    Low                 36.000000
       Very High           35.000000
75%    Low                 42.000000
       Very High           43.000000
max    Low                 60.000000
       Very High           60.000000
dtype: float64

Let's compare the densities

grouped['Age'].plot(kind='density', title='Age')

JobSatisfaction
Low          AxesSubplot(0.125,0.125;0.775x0.755)
Very High    AxesSubplot(0.125,0.125;0.775x0.755)
Name: Age, dtype: object

By Department

grouped['Department'].value_counts().unstack()

We can normalize it

grouped['Department'].value_counts(normalize=True).unstack()

grouped['Department'].value_counts().unstack().plot(kind="barh")

<matplotlib.axes._subplots.AxesSubplot at 0x118084a58>

grouped['Department'].value_counts(normalize=True).unstack().plot(kind="barh")

<matplotlib.axes._subplots.AxesSubplot at 0x118157c50>

We can compare it with the whole sample

data['Department'].value_counts(normalize=True,sort=False).plot(kind="barh")

<matplotlib.axes._subplots.AxesSubplot at 0x118237668>

But the colors and the order don't match with the other bar chart. We need to reorder the Department variable

data['Department'] = data['Department'].astype(dtype='category', 
                               categories=['Human Resources', 'Research & Development', 'Sales'],
                               ordered=True)

/Users/dhafermalouche/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: FutureWarning: specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead
  This is separate from the ipykernel package so we can avoid doing imports until

data['Department'].value_counts(normalize=True,sort=False).plot(kind="barh")

<matplotlib.axes._subplots.AxesSubplot at 0x1182fb8d0>

DistanceFromHome

grouped['DistanceFromHome'].describe().unstack()

       JobSatisfaction
count  Low                289.000000
       Very High          459.000000
mean   Low                  9.190311
       Very High            9.030501
std    Low                  8.045127
       Very High            8.257004
min    Low                  1.000000
       Very High            1.000000
25%    Low                  2.000000
       Very High            2.000000
50%    Low                  7.000000
       Very High            7.000000
75%    Low                 14.000000
       Very High           14.000000
max    Low                 29.000000
       Very High           29.000000
dtype: float64

grouped['DistanceFromHome'].plot(kind='density', title='Distance From Home',legend=True)

JobSatisfaction
Low          AxesSubplot(0.125,0.125;0.775x0.755)
Very High    AxesSubplot(0.125,0.125;0.775x0.755)
Name: DistanceFromHome, dtype: object

Hourly Rate

grouped['HourlyRate'].describe()

grouped['HourlyRate'].plot(kind='density', title='Hourly Rate',legend=True)

JobSatisfaction
Low          AxesSubplot(0.125,0.125;0.775x0.755)
Very High    AxesSubplot(0.125,0.125;0.775x0.755)
Name: HourlyRate, dtype: object

Monthly Income

grouped['MonthlyIncome'].describe()

grouped['HourlyRate'].plot(kind='density', title='Hourly Rate',legend=True)

JobSatisfaction
Low          AxesSubplot(0.125,0.125;0.775x0.755)
Very High    AxesSubplot(0.125,0.125;0.775x0.755)
Name: HourlyRate, dtype: object

Projects¶

Project 1

Can you write the same but with R?

Project 2

Can you do the same with other dataset collected from Surveys like the Afrobarometer or the World Value Survey?

	count	mean	std	min	25%	50%	75%	max
JobSatisfaction
Low	289.0	6561.570934	4645.170134	1091.0	3072.0	4968.0	8564.0	19943.0
Very High	459.0	6472.732026	4573.906428	1051.0	2927.5	5126.0	7908.0	19845.0

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EnvironmentSatisfaction	...	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
EmployeeNumber
1	41	Yes	Travel_Rarely	1102	Sales	1	2	Life Sciences	1	2	...	1	80	0	8	0	1	6	4	0	5
2	49	No	Travel_Frequently	279	Research & Development	8	1	Life Sciences	1	3	...	4	80	1	10	3	3	10	7	1	7
4	37	Yes	Travel_Rarely	1373	Research & Development	2	2	Other	1	4	...	2	80	0	7	3	3	0	0	0	0
5	33	No	Travel_Frequently	1392	Research & Development	3	4	Life Sciences	1	4	...	3	80	0	8	3	3	8	7	3	0
7	27	No	Travel_Rarely	591	Research & Development	2	1	Medical	1	1	...	4	80	1	6	3	3	2	2	2	2

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EnvironmentSatisfaction	...	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
EmployeeNumber
10	59	No	Travel_Rarely	1324	Research & Development	3	3	Medical	1	3	...	1	80	3	12	3	2	1	0	0	0
20	29	No	Travel_Rarely	1389	Research & Development	21	4	Life Sciences	1	2	...	3	80	1	10	1	3	10	9	8	8
27	36	Yes	Travel_Rarely	1218	Sales	9	4	Life Sciences	1	3	...	2	80	0	10	4	3	5	3	0	3
31	34	Yes	Travel_Rarely	699	Research & Development	6	1	Medical	1	2	...	3	80	0	8	2	3	4	2	1	3
33	32	Yes	Travel_Frequently	1125	Research & Development	16	1	Life Sciences	1	2	...	2	80	0	10	5	3	10	2	6	7

	count	mean	std	min	25%	50%	75%	max
JobSatisfaction
Low	289.0	36.916955	9.245496	19.0	30.0	36.0	42.0	60.0
Very High	459.0	36.795207	9.125609	18.0	30.0	35.0	43.0	60.0

Department	Human Resources	Research & Development	Sales
JobSatisfaction
Low	0.038062	0.664360	0.297578
Very High	0.037037	0.642702	0.320261

	count	mean	std	min	25%	50%	75%	max
JobSatisfaction
Low	289.0	68.636678	20.439515	30.0	52.0	72.0	86.0	100.0
Very High	459.0	64.681917	20.647571	30.0	47.0	64.0	82.5	100.0

	Age	Gender	YearsAtCompany
EmployeeNumber
1	41	Female	6
2	49	Male	10
4	37	Male	0
5	33	Female	8
7	27	Male	2