Regression models with Python

Simple linear regression model

Method 1

We will be using scikit-learn, in particular sklearn.model_selection and sklearn.linear_model. We will estimate a regression model on a training dataset and then use it to predict the target variable on a held-out test dataset.

Importing the libraries

In [1]:
import matplotlib.pyplot as plt
import pandas as pd

Importing the dataset

In [2]:
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
In [3]:
dataset
Out[3]:
Experience Salary
0 1.1 39343
1 1.3 46205
2 1.5 37731
3 2.0 43525
4 2.2 39891
5 2.9 56642
6 3.0 60150
7 3.2 54445
8 3.2 64445
9 3.7 57189
10 3.9 63218
11 4.0 55794
12 4.0 56957
13 4.1 57081
14 4.5 61111
15 4.9 67938
16 5.1 66029
17 5.3 83088
18 5.9 81363
19 6.0 93940
20 6.8 91738
21 7.1 98273
22 7.9 101302
23 8.2 113812
24 8.7 109431
25 9.0 105582
26 9.5 116969
27 9.6 112635
28 10.3 122391
29 10.5 121872
In [4]:
X
Out[4]:
array([[ 1.1],
       [ 1.3],
       [ 1.5],
       [ 2. ],
       [ 2.2],
       [ 2.9],
       [ 3. ],
       [ 3.2],
       [ 3.2],
       [ 3.7],
       [ 3.9],
       [ 4. ],
       [ 4. ],
       [ 4.1],
       [ 4.5],
       [ 4.9],
       [ 5.1],
       [ 5.3],
       [ 5.9],
       [ 6. ],
       [ 6.8],
       [ 7.1],
       [ 7.9],
       [ 8.2],
       [ 8.7],
       [ 9. ],
       [ 9.5],
       [ 9.6],
       [10.3],
       [10.5]])
In [5]:
y
Out[5]:
array([ 39343,  46205,  37731,  43525,  39891,  56642,  60150,  54445,
        64445,  57189,  63218,  55794,  56957,  57081,  61111,  67938,
        66029,  83088,  81363,  93940,  91738,  98273, 101302, 113812,
       109431, 105582, 116969, 112635, 122391, 121872])

Splitting the dataset into the Training set and Test set

With test_size = 1/3, 10 of the 30 observations are held out for testing, and random_state fixes the shuffle so that the split is reproducible.

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 5)
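
As a quick sanity check (our addition, not in the original notebook), the split leaves 20 training observations and 10 test observations:

X_train.shape, X_test.shape   # ((20, 1), (10, 1))
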
In [7]:
X_train
Out[7]:
array([[10.3],
       [ 1.1],
       [ 5.3],
       [ 2.9],
       [ 1.3],
       [ 9.6],
       [ 4. ],
       [ 6.8],
       [ 6. ],
       [ 8.7],
       [ 3.2],
       [ 2.2],
       [ 3.2],
       [ 3.7],
       [ 5.1],
       [ 7.9],
       [ 3. ],
       [ 4.9],
       [ 4.5],
       [ 2. ]])
In [8]:
X_test
Out[8]:
array([[ 4. ],
       [10.5],
       [ 8.2],
       [ 9. ],
       [ 5.9],
       [ 3.9],
       [ 1.5],
       [ 4.1],
       [ 9.5],
       [ 7.1]])

Fitting Simple Linear Regression to the Training set

In [9]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Out[9]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [10]:
help(regressor.fit)
Help on method fit in module sklearn.linear_model.base:

fit(X, y, sample_weight=None) method of sklearn.linear_model.base.LinearRegression instance
    Fit linear model.
    
    Parameters
    ----------
    X : numpy array or sparse matrix of shape [n_samples,n_features]
        Training data
    
    y : numpy array of shape [n_samples, n_targets]
        Target values. Will be cast to X's dtype if necessary
    
    sample_weight : numpy array of shape [n_samples]
        Individual weights for each sample
    
        .. versionadded:: 0.17
           parameter *sample_weight* support to LinearRegression.
    
    Returns
    -------
    self : returns an instance of self.

Predicting the Test set results

In [11]:
y_pred = regressor.predict(X_test)
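
Before plotting, we can quantify the test-set error; a small sketch (the choice of metrics is ours, not part of the original notebook):

from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

print('Test R^2 :', r2_score(y_test, y_pred))                      # goodness of fit on held-out data
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))   # typical error, in salary units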

Visualizing the Training set results

In [12]:
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

Visualizing the Test set results

In [13]:
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

The $R^2$

The score method returns the coefficient of determination $R^2$ of the model on the data it is given; here we compute it on the training set.

In [14]:
regressor.score(X_train,y_train)
Out[14]:
0.9494021955344463
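
The same method evaluated on the held-out data gives the test-set $R^2$ (our addition, not in the original notebook):

regressor.score(X_test, y_test)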

Since our model is $y = a + b\times x + \epsilon$, the estimate of the slope $b$ is

In [15]:
regressor.coef_
Out[15]:
array([9213.15275885])

and the estimate of the intercept $a$ is

In [16]:
regressor.intercept_
Out[16]:
27334.81404888486
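
To see how these two estimates are used, we can reproduce a prediction by hand; a minimal sketch, where the value of 5 years is our own arbitrary example:

import numpy as np

x_new = np.array([[5.0]])   # 5 years of experience (arbitrary example)
manual = regressor.intercept_ + regressor.coef_[0] * 5.0
assert np.isclose(manual, regressor.predict(x_new)[0])   # y = a + b*x matches predict()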

Method 2

We will now use the statsmodels library.

In [17]:
import statsmodels.api as sm

The model without a constant (intercept):

$y=b\times x + \epsilon$

In [18]:
model0 = sm.OLS(y, X).fit()
In [19]:
model0.summary()
Out[19]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.973
Model:                            OLS   Adj. R-squared:                  0.972
Method:                 Least Squares   F-statistic:                     1048.
Date:                Sun, 04 Nov 2018   Prob (F-statistic):           2.56e-24
Time:                        20:40:25   Log-Likelihood:                -327.28
No. Observations:                  30   AIC:                             656.6
Df Residuals:                      29   BIC:                             658.0
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1          1.325e+04    409.401     32.376      0.000    1.24e+04    1.41e+04
==============================================================================
Omnibus:                        0.610   Durbin-Watson:                   0.323
Prob(Omnibus):                  0.737   Jarque-Bera (JB):                0.671
Skew:                          -0.121   Prob(JB):                        0.715
Kurtosis:                       2.308   Cond. No.                         1.00
==============================================================================

Note that when the model has no intercept, statsmodels reports the uncentered $R^2$, so the 0.973 above is not comparable to the $R^2$ of the model with a constant below.

The model with constant

$y=a+b\times x + \epsilon$

In [20]:
X = sm.add_constant(X)   # prepend a column of ones for the intercept
X[:5]                    # X is a NumPy array, so slice it; .head() only works on DataFrames
Out[20]:
array([[1. , 1.1],
       [1. , 1.3],
       [1. , 1.5],
       [1. , 2. ],
       [1. , 2.2]])
In [21]:
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
In [22]:
model.summary()
Out[22]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.957
Model:                            OLS   Adj. R-squared:                  0.955
Method:                 Least Squares   F-statistic:                     622.5
Date:                Sun, 04 Nov 2018   Prob (F-statistic):           1.14e-20
Time:                        20:40:27   Log-Likelihood:                -301.44
No. Observations:                  30   AIC:                             606.9
Df Residuals:                      28   BIC:                             609.7
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.579e+04   2273.053     11.347      0.000    2.11e+04    3.04e+04
x1          9449.9623    378.755     24.950      0.000    8674.119    1.02e+04
==============================================================================
Omnibus:                        2.140   Durbin-Watson:                   1.648
Prob(Omnibus):                  0.343   Jarque-Bera (JB):                1.569
Skew:                           0.363   Prob(JB):                        0.456
Kurtosis:                       2.147   Cond. No.                         13.2
==============================================================================
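
Note that these estimates ($a \approx 25{,}790$ and $b \approx 9{,}450$) differ slightly from the scikit-learn ones above: statsmodels was fit on all 30 observations, while the scikit-learn model used only the 20 training observations.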

Multivariate Regression Model

Method 1

Loading libraries

In [23]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Importing data

In [24]:
dataset = pd.read_csv('startups.csv')
dataset.head()
Out[24]:
R&D Spend Administration Marketing Spend State Profit
0 165349.20 136897.80 471784.10 New York 192261.83
1 162597.70 151377.59 443898.53 California 191792.06
2 153441.51 101145.55 407934.54 Florida 191050.39
3 144372.41 118671.85 383199.62 New York 182901.99
4 142107.34 91391.77 366168.42 Florida 166187.94

Independent and dependent variables

In [25]:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
In [26]:
X
Out[26]:
array([[165349.2, 136897.8, 471784.1, 'New York'],
       [162597.7, 151377.59, 443898.53, 'California'],
       [153441.51, 101145.55, 407934.54, 'Florida'],
       [144372.41, 118671.85, 383199.62, 'New York'],
       [142107.34, 91391.77, 366168.42, 'Florida'],
       [131876.9, 99814.71, 362861.36, 'New York'],
       [134615.46, 147198.87, 127716.82, 'California'],
       [130298.13, 145530.06, 323876.68, 'Florida'],
       [120542.52, 148718.95, 311613.29, 'New York'],
       [123334.88, 108679.17, 304981.62, 'California'],
       [101913.08, 110594.11, 229160.95, 'Florida'],
       [100671.96, 91790.61, 249744.55, 'California'],
       [93863.75, 127320.38, 249839.44, 'Florida'],
       [91992.39, 135495.07, 252664.93, 'California'],
       [119943.24, 156547.42, 256512.92, 'Florida'],
       [114523.61, 122616.84, 261776.23, 'New York'],
       [78013.11, 121597.55, 264346.06, 'California'],
       [94657.16, 145077.58, 282574.31, 'New York'],
       [91749.16, 114175.79, 294919.57, 'Florida'],
       [86419.7, 153514.11, 0.0, 'New York'],
       [76253.86, 113867.3, 298664.47, 'California'],
       [78389.47, 153773.43, 299737.29, 'New York'],
       [73994.56, 122782.75, 303319.26, 'Florida'],
       [67532.53, 105751.03, 304768.73, 'Florida'],
       [77044.01, 99281.34, 140574.81, 'New York'],
       [64664.71, 139553.16, 137962.62, 'California'],
       [75328.87, 144135.98, 134050.07, 'Florida'],
       [72107.6, 127864.55, 353183.81, 'New York'],
       [66051.52, 182645.56, 118148.2, 'Florida'],
       [65605.48, 153032.06, 107138.38, 'New York'],
       [61994.48, 115641.28, 91131.24, 'Florida'],
       [61136.38, 152701.92, 88218.23, 'New York'],
       [63408.86, 129219.61, 46085.25, 'California'],
       [55493.95, 103057.49, 214634.81, 'Florida'],
       [46426.07, 157693.92, 210797.67, 'California'],
       [46014.02, 85047.44, 205517.64, 'New York'],
       [28663.76, 127056.21, 201126.82, 'Florida'],
       [44069.95, 51283.14, 197029.42, 'California'],
       [20229.59, 65947.93, 185265.1, 'New York'],
       [38558.51, 82982.09, 174999.3, 'California'],
       [28754.33, 118546.05, 172795.67, 'California'],
       [27892.92, 84710.77, 164470.71, 'Florida'],
       [23640.93, 96189.63, 148001.11, 'California'],
       [15505.73, 127382.3, 35534.17, 'New York'],
       [22177.74, 154806.14, 28334.72, 'California'],
       [1000.23, 124153.04, 1903.93, 'New York'],
       [1315.46, 115816.21, 297114.46, 'Florida'],
       [0.0, 135426.92, 0.0, 'California'],
       [542.05, 51743.15, 0.0, 'New York'],
       [0.0, 116983.8, 45173.06, 'California']], dtype=object)
In [27]:
y
Out[27]:
array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12,
       156122.51, 155752.6 , 152211.77, 149759.96, 146121.95, 144259.4 ,
       141585.52, 134307.35, 132602.65, 129917.04, 126992.93, 125370.37,
       124266.9 , 122776.86, 118474.03, 111313.02, 110352.25, 108733.99,
       108552.04, 107404.34, 105733.54, 105008.31, 103282.38, 101004.64,
        99937.59,  97483.56,  97427.84,  96778.92,  96712.8 ,  96479.51,
        90708.19,  89949.14,  81229.06,  81005.76,  78239.91,  77798.83,
        71498.49,  69758.98,  65200.33,  64926.08,  49490.75,  42559.73,
        35673.41,  14681.4 ])

Encoding categorical data

The State column is a string variable, so it must be encoded numerically before fitting a regression (its presence is also why X above has dtype=object). We first put the array back into a DataFrame so that pandas can build the dummy variables.

In [28]:
dX = pd.DataFrame(X,columns=dataset.columns[:4])
In [29]:
dummies = pd.get_dummies(dX.State)
dummies.head()
Out[29]:
California Florida New York
0 0 0 1
1 1 0 0
2 0 1 0
3 0 0 1
4 0 1 0
We keep only the first two dummy columns (California and Florida) and drop New York to avoid the dummy variable trap:

In [30]:
dummies1=pd.DataFrame(dummies.iloc[:, :-1].values,columns=['California','Florida'])
dummies1.head()
Out[30]:
California Florida
0 0 0
1 1 0
2 0 1
3 0 0
4 0 1
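
For reference, pandas can produce an equivalent design matrix in one step (a sketch; with drop_first=True the first category, California, is dropped instead of New York):

dX_alt = pd.get_dummies(dataset.drop(columns='Profit'), columns=['State'], drop_first=True)
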
We then take the three numeric columns from the original dataset:

In [31]:
X1 = dataset.iloc[:, :-2].values
X1
Out[31]:
array([[165349.2 , 136897.8 , 471784.1 ],
       [162597.7 , 151377.59, 443898.53],
       [153441.51, 101145.55, 407934.54],
       [144372.41, 118671.85, 383199.62],
       [142107.34,  91391.77, 366168.42],
       [131876.9 ,  99814.71, 362861.36],
       [134615.46, 147198.87, 127716.82],
       [130298.13, 145530.06, 323876.68],
       [120542.52, 148718.95, 311613.29],
       [123334.88, 108679.17, 304981.62],
       [101913.08, 110594.11, 229160.95],
       [100671.96,  91790.61, 249744.55],
       [ 93863.75, 127320.38, 249839.44],
       [ 91992.39, 135495.07, 252664.93],
       [119943.24, 156547.42, 256512.92],
       [114523.61, 122616.84, 261776.23],
       [ 78013.11, 121597.55, 264346.06],
       [ 94657.16, 145077.58, 282574.31],
       [ 91749.16, 114175.79, 294919.57],
       [ 86419.7 , 153514.11,      0.  ],
       [ 76253.86, 113867.3 , 298664.47],
       [ 78389.47, 153773.43, 299737.29],
       [ 73994.56, 122782.75, 303319.26],
       [ 67532.53, 105751.03, 304768.73],
       [ 77044.01,  99281.34, 140574.81],
       [ 64664.71, 139553.16, 137962.62],
       [ 75328.87, 144135.98, 134050.07],
       [ 72107.6 , 127864.55, 353183.81],
       [ 66051.52, 182645.56, 118148.2 ],
       [ 65605.48, 153032.06, 107138.38],
       [ 61994.48, 115641.28,  91131.24],
       [ 61136.38, 152701.92,  88218.23],
       [ 63408.86, 129219.61,  46085.25],
       [ 55493.95, 103057.49, 214634.81],
       [ 46426.07, 157693.92, 210797.67],
       [ 46014.02,  85047.44, 205517.64],
       [ 28663.76, 127056.21, 201126.82],
       [ 44069.95,  51283.14, 197029.42],
       [ 20229.59,  65947.93, 185265.1 ],
       [ 38558.51,  82982.09, 174999.3 ],
       [ 28754.33, 118546.05, 172795.67],
       [ 27892.92,  84710.77, 164470.71],
       [ 23640.93,  96189.63, 148001.11],
       [ 15505.73, 127382.3 ,  35534.17],
       [ 22177.74, 154806.14,  28334.72],
       [  1000.23, 124153.04,   1903.93],
       [  1315.46, 115816.21, 297114.46],
       [     0.  , 135426.92,      0.  ],
       [   542.05,  51743.15,      0.  ],
       [     0.  , 116983.8 ,  45173.06]])
In [32]:
dX = pd.DataFrame(X1, columns=dataset.columns[:3])
In [33]:
dX1 = dX.join(dummies1)   # append the two dummy columns to the numeric data
In [34]:
dX1
Out[34]:
R&D Spend Administration Marketing Spend California Florida
0 165349.20 136897.80 471784.10 0 0
1 162597.70 151377.59 443898.53 1 0
2 153441.51 101145.55 407934.54 0 1
3 144372.41 118671.85 383199.62 0 0
4 142107.34 91391.77 366168.42 0 1
5 131876.90 99814.71 362861.36 0 0
6 134615.46 147198.87 127716.82 1 0
7 130298.13 145530.06 323876.68 0 1
8 120542.52 148718.95 311613.29 0 0
9 123334.88 108679.17 304981.62 1 0
10 101913.08 110594.11 229160.95 0 1
11 100671.96 91790.61 249744.55 1 0
12 93863.75 127320.38 249839.44 0 1
13 91992.39 135495.07 252664.93 1 0
14 119943.24 156547.42 256512.92 0 1
15 114523.61 122616.84 261776.23 0 0
16 78013.11 121597.55 264346.06 1 0
17 94657.16 145077.58 282574.31 0 0
18 91749.16 114175.79 294919.57 0 1
19 86419.70 153514.11 0.00 0 0
20 76253.86 113867.30 298664.47 1 0
21 78389.47 153773.43 299737.29 0 0
22 73994.56 122782.75 303319.26 0 1
23 67532.53 105751.03 304768.73 0 1
24 77044.01 99281.34 140574.81 0 0
25 64664.71 139553.16 137962.62 1 0
26 75328.87 144135.98 134050.07 0 1
27 72107.60 127864.55 353183.81 0 0
28 66051.52 182645.56 118148.20 0 1
29 65605.48 153032.06 107138.38 0 0
30 61994.48 115641.28 91131.24 0 1
31 61136.38 152701.92 88218.23 0 0
32 63408.86 129219.61 46085.25 1 0
33 55493.95 103057.49 214634.81 0 1
34 46426.07 157693.92 210797.67 1 0
35 46014.02 85047.44 205517.64 0 0
36 28663.76 127056.21 201126.82 0 1
37 44069.95 51283.14 197029.42 1 0
38 20229.59 65947.93 185265.10 0 0
39 38558.51 82982.09 174999.30 1 0
40 28754.33 118546.05 172795.67 1 0
41 27892.92 84710.77 164470.71 0 1
42 23640.93 96189.63 148001.11 1 0
43 15505.73 127382.30 35534.17 0 0
44 22177.74 154806.14 28334.72 1 0
45 1000.23 124153.04 1903.93 0 0
46 1315.46 115816.21 297114.46 0 1
47 0.00 135426.92 0.00 1 0
48 542.05 51743.15 0.00 0 0
49 0.00 116983.80 45173.06 1 0
In [35]:
X2 = dX1.values
In [36]:
X2
Out[36]:
array([[1.6534920e+05, 1.3689780e+05, 4.7178410e+05, 0.0000000e+00,
        0.0000000e+00],
       [1.6259770e+05, 1.5137759e+05, 4.4389853e+05, 1.0000000e+00,
        0.0000000e+00],
       [1.5344151e+05, 1.0114555e+05, 4.0793454e+05, 0.0000000e+00,
        1.0000000e+00],
       [1.4437241e+05, 1.1867185e+05, 3.8319962e+05, 0.0000000e+00,
        0.0000000e+00],
       [1.4210734e+05, 9.1391770e+04, 3.6616842e+05, 0.0000000e+00,
        1.0000000e+00],
       [1.3187690e+05, 9.9814710e+04, 3.6286136e+05, 0.0000000e+00,
        0.0000000e+00],
       [1.3461546e+05, 1.4719887e+05, 1.2771682e+05, 1.0000000e+00,
        0.0000000e+00],
       [1.3029813e+05, 1.4553006e+05, 3.2387668e+05, 0.0000000e+00,
        1.0000000e+00],
       [1.2054252e+05, 1.4871895e+05, 3.1161329e+05, 0.0000000e+00,
        0.0000000e+00],
       [1.2333488e+05, 1.0867917e+05, 3.0498162e+05, 1.0000000e+00,
        0.0000000e+00],
       [1.0191308e+05, 1.1059411e+05, 2.2916095e+05, 0.0000000e+00,
        1.0000000e+00],
       [1.0067196e+05, 9.1790610e+04, 2.4974455e+05, 1.0000000e+00,
        0.0000000e+00],
       [9.3863750e+04, 1.2732038e+05, 2.4983944e+05, 0.0000000e+00,
        1.0000000e+00],
       [9.1992390e+04, 1.3549507e+05, 2.5266493e+05, 1.0000000e+00,
        0.0000000e+00],
       [1.1994324e+05, 1.5654742e+05, 2.5651292e+05, 0.0000000e+00,
        1.0000000e+00],
       [1.1452361e+05, 1.2261684e+05, 2.6177623e+05, 0.0000000e+00,
        0.0000000e+00],
       [7.8013110e+04, 1.2159755e+05, 2.6434606e+05, 1.0000000e+00,
        0.0000000e+00],
       [9.4657160e+04, 1.4507758e+05, 2.8257431e+05, 0.0000000e+00,
        0.0000000e+00],
       [9.1749160e+04, 1.1417579e+05, 2.9491957e+05, 0.0000000e+00,
        1.0000000e+00],
       [8.6419700e+04, 1.5351411e+05, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00],
       [7.6253860e+04, 1.1386730e+05, 2.9866447e+05, 1.0000000e+00,
        0.0000000e+00],
       [7.8389470e+04, 1.5377343e+05, 2.9973729e+05, 0.0000000e+00,
        0.0000000e+00],
       [7.3994560e+04, 1.2278275e+05, 3.0331926e+05, 0.0000000e+00,
        1.0000000e+00],
       [6.7532530e+04, 1.0575103e+05, 3.0476873e+05, 0.0000000e+00,
        1.0000000e+00],
       [7.7044010e+04, 9.9281340e+04, 1.4057481e+05, 0.0000000e+00,
        0.0000000e+00],
       [6.4664710e+04, 1.3955316e+05, 1.3796262e+05, 1.0000000e+00,
        0.0000000e+00],
       [7.5328870e+04, 1.4413598e+05, 1.3405007e+05, 0.0000000e+00,
        1.0000000e+00],
       [7.2107600e+04, 1.2786455e+05, 3.5318381e+05, 0.0000000e+00,
        0.0000000e+00],
       [6.6051520e+04, 1.8264556e+05, 1.1814820e+05, 0.0000000e+00,
        1.0000000e+00],
       [6.5605480e+04, 1.5303206e+05, 1.0713838e+05, 0.0000000e+00,
        0.0000000e+00],
       [6.1994480e+04, 1.1564128e+05, 9.1131240e+04, 0.0000000e+00,
        1.0000000e+00],
       [6.1136380e+04, 1.5270192e+05, 8.8218230e+04, 0.0000000e+00,
        0.0000000e+00],
       [6.3408860e+04, 1.2921961e+05, 4.6085250e+04, 1.0000000e+00,
        0.0000000e+00],
       [5.5493950e+04, 1.0305749e+05, 2.1463481e+05, 0.0000000e+00,
        1.0000000e+00],
       [4.6426070e+04, 1.5769392e+05, 2.1079767e+05, 1.0000000e+00,
        0.0000000e+00],
       [4.6014020e+04, 8.5047440e+04, 2.0551764e+05, 0.0000000e+00,
        0.0000000e+00],
       [2.8663760e+04, 1.2705621e+05, 2.0112682e+05, 0.0000000e+00,
        1.0000000e+00],
       [4.4069950e+04, 5.1283140e+04, 1.9702942e+05, 1.0000000e+00,
        0.0000000e+00],
       [2.0229590e+04, 6.5947930e+04, 1.8526510e+05, 0.0000000e+00,
        0.0000000e+00],
       [3.8558510e+04, 8.2982090e+04, 1.7499930e+05, 1.0000000e+00,
        0.0000000e+00],
       [2.8754330e+04, 1.1854605e+05, 1.7279567e+05, 1.0000000e+00,
        0.0000000e+00],
       [2.7892920e+04, 8.4710770e+04, 1.6447071e+05, 0.0000000e+00,
        1.0000000e+00],
       [2.3640930e+04, 9.6189630e+04, 1.4800111e+05, 1.0000000e+00,
        0.0000000e+00],
       [1.5505730e+04, 1.2738230e+05, 3.5534170e+04, 0.0000000e+00,
        0.0000000e+00],
       [2.2177740e+04, 1.5480614e+05, 2.8334720e+04, 1.0000000e+00,
        0.0000000e+00],
       [1.0002300e+03, 1.2415304e+05, 1.9039300e+03, 0.0000000e+00,
        0.0000000e+00],
       [1.3154600e+03, 1.1581621e+05, 2.9711446e+05, 0.0000000e+00,
        1.0000000e+00],
       [0.0000000e+00, 1.3542692e+05, 0.0000000e+00, 1.0000000e+00,
        0.0000000e+00],
       [5.4205000e+02, 5.1743150e+04, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00],
       [0.0000000e+00, 1.1698380e+05, 4.5173060e+04, 1.0000000e+00,
        0.0000000e+00]])

We now split the data: 80% training set, 20% test set.

In [37]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size = 0.2, random_state = 0)

We can then create a regressor, fit it on the training set, and use the fitted model on the test set:

In [38]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Out[38]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [39]:
regressor.coef_
Out[39]:
array([ 7.73467193e-01,  3.28845975e-02,  3.66100259e-02, -6.99369053e+02,
       -1.65865321e+03])
In [40]:
regressor.intercept_
Out[40]:
43253.53667068361
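
As in the simple model, a prediction is the intercept plus the dot product of the coefficients with a feature row; a quick sketch to verify this (our addition):

import numpy as np

manual = regressor.intercept_ + X_test[0] @ regressor.coef_
assert np.isclose(manual, regressor.predict(X_test[:1])[0])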

Predicting the Test set results

In [41]:
y_pred = regressor.predict(X_test)
y_pred 
Out[41]:
array([103015.20159776, 132582.27760831, 132447.73845184,  71976.09851266,
       178537.4822107 , 116161.24230157,  67851.69209689,  98791.73374679,
       113969.43533008, 167921.06569569])
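
To compare the predictions with the observed profits side by side (our addition, not in the original notebook):

pd.DataFrame({'actual': y_test, 'predicted': y_pred})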

Method 2

Let's start by importing the libraries; we reuse the startups dataset loaded earlier.

In [42]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

Defining dependent and independent variables

In [43]:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

Constructing the dummy variables

In [44]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])   # State -> 0 (California), 1 (Florida), 2 (New York)
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()    # the three dummy columns are placed first
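
Note that the categorical_features argument was deprecated in scikit-learn 0.20 and later removed. On a recent version, the same encoding (including the dummy drop of the next step) can be written with ColumnTransformer; a sketch under that assumption:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 3 (State), dropping the first category to
# avoid the dummy variable trap; pass the numeric columns through as-is.
ct = ColumnTransformer([('state', OneHotEncoder(drop='first'), [3])],
                       remainder='passthrough',
                       sparse_threshold=0)   # force a dense array
X_modern = ct.fit_transform(dataset.iloc[:, :-1].values)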

Deleting one dummy variable to avoid the dummy variable trap: with an intercept in the model, keeping all three dummies would make the design matrix perfectly collinear. Since the dummies occupy the first three columns, we drop the first one (California):

In [45]:
X = X[:, 1:]
In [46]:
X[:4]
Out[46]:
array([[0.0000000e+00, 1.0000000e+00, 1.6534920e+05, 1.3689780e+05,
        4.7178410e+05],
       [0.0000000e+00, 0.0000000e+00, 1.6259770e+05, 1.5137759e+05,
        4.4389853e+05],
       [1.0000000e+00, 0.0000000e+00, 1.5344151e+05, 1.0114555e+05,
        4.0793454e+05],
       [0.0000000e+00, 1.0000000e+00, 1.4437241e+05, 1.1867185e+05,
        3.8319962e+05]])

Constructing a DataFrame of the independent variables

In [47]:
dX2 = pd.DataFrame(X, columns=['Florida','New York','R&D Spend', 'Administration', 'Marketing Spend'])

We add a constant column of ones for the intercept

In [48]:
dX2 = sm.add_constant(dX2)

Fitting the regression model

In [49]:
model = sm.OLS(y, dX2).fit() ## sm.OLS(output, input)
In [50]:
model.summary()
Out[50]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.951
Model:                            OLS   Adj. R-squared:                  0.945
Method:                 Least Squares   F-statistic:                     169.9
Date:                Sun, 04 Nov 2018   Prob (F-statistic):           1.34e-27
Time:                        20:40:43   Log-Likelihood:                -525.38
No. Observations:                  50   AIC:                             1063.
Df Residuals:                      44   BIC:                             1074.
Df Model:                           5
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const            5.013e+04   6884.820      7.281      0.000    3.62e+04     6.4e+04
Florida           198.7888   3371.007      0.059      0.953   -6595.030    6992.607
New York          -41.8870   3256.039     -0.013      0.990   -6604.003    6520.229
R&D Spend           0.8060      0.046     17.369      0.000       0.712       0.900
Administration     -0.0270      0.052     -0.517      0.608      -0.132       0.078
Marketing Spend     0.0270      0.017      1.574      0.123      -0.008       0.062
===================================================================================
Omnibus:                       14.782   Durbin-Watson:                   1.283
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               21.266
Skew:                          -0.948   Prob(JB):                     2.41e-05
Kurtosis:                       5.572   Cond. No.                     1.45e+06
==============================================================================
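
Reading the p-values, only R&D Spend is clearly significant; the state dummies, Administration, and Marketing Spend are not at the 5% level. As a possible next step (our suggestion, not part of the original notebook), one could refit with the significant predictor alone:

model_reduced = sm.OLS(y, sm.add_constant(dataset['R&D Spend'])).fit()
model_reduced.summary()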