We will use sklearn.model_selection to split the data into a training set and a test set. We will estimate a regression model on the training set and use it to predict the response variable on the test set.
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
dataset
X
y
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 5)
X_train
X_test
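To make the split concrete, here is a minimal sketch of the idea behind a random train/test split, written with NumPy only. It is not scikit-learn's implementation; the helper name simple_split and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def simple_split(X, y, test_size, seed):
    """Shuffle the row indices and hold out a fraction of rows as the test set.

    Illustrative helper (hypothetical), not part of scikit-learn.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data (assumed): 30 observations, one feature
X_toy = np.arange(30, dtype=float).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel() + 1.0

X_tr, X_te, y_tr, y_te = simple_split(X_toy, y_toy, test_size=1/3, seed=5)
print(X_tr.shape, X_te.shape)
```

Each row ends up in exactly one of the two sets, and the rows of X and y stay aligned because both are indexed with the same shuffled indices.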
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
help(regressor.fit)
y_pred = regressor.predict(X_test)
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
The $R^2$ on the training set is
regressor.score(X_train,y_train)
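For a regressor, score returns the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below computes it by hand with NumPy. The numbers are made up for illustration and are not taken from Salary_Data.csv.

```python
import numpy as np

# Made-up observed values and model predictions (assumed for illustration)
y_obs = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.0])

ss_res = np.sum((y_obs - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)   # total sum of squares around the mean
r2 = 1.0 - ss_res / ss_tot
print(r2)
```

A value close to 1 means the predictions explain most of the variation of the response around its mean.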
Since our model is $y=a+b\times x + \epsilon$, the estimate of the slope $b$ is
regressor.coef_
and the estimate of the intercept $a$ is
regressor.intercept_
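For a single predictor, the least-squares estimates have a closed form: $\hat b = \sum (x_i-\bar x)(y_i-\bar y) / \sum (x_i-\bar x)^2$ and $\hat a = \bar y - \hat b\,\bar x$. The following is a minimal sketch on synthetic data (the data and the variable names are assumptions); on the same data, these are the values one would expect coef_ and intercept_ to report.

```python
import numpy as np

# Synthetic data (assumed): years of experience vs. salary, with noise
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=50)
y_syn = 25000 + 9000 * x + rng.normal(0, 3000, size=50)

# Closed-form least-squares estimates for y = a + b*x
b_hat = np.sum((x - x.mean()) * (y_syn - y_syn.mean())) / np.sum((x - x.mean()) ** 2)
a_hat = y_syn.mean() - b_hat * x.mean()
print(a_hat, b_hat)
```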
We will now use the statsmodels library.
import statsmodels.api as sm
The model without constant (intercept):
$y=b\times x + \epsilon$
model0 = sm.OLS(y, X).fit()
model0.summary()
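Without an intercept, the least-squares slope reduces to $\hat b = \sum x_i y_i / \sum x_i^2$. The sketch below checks this on synthetic data (assumed, not Salary_Data.csv) against a generic least-squares solve. Note also that statsmodels reports an uncentered $R^2$ when the model has no constant, so that value is not directly comparable with the $R^2$ of a model that includes one.

```python
import numpy as np

# Synthetic data (assumed) generated with a non-zero intercept
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=50)
y_syn = 25000 + 9000 * x + rng.normal(0, 3000, size=50)

# Regression through the origin: minimise sum((y - b*x)^2)
b_origin = np.sum(x * y_syn) / np.sum(x ** 2)

# Same estimate from a least-squares solve on a one-column design matrix
b_lstsq = np.linalg.lstsq(x.reshape(-1, 1), y_syn, rcond=None)[0][0]
print(b_origin, b_lstsq)
```

Because the data were generated with a positive intercept, forcing the line through the origin pushes the slope estimate above the true slope of 9000.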
The model with constant
$y=a+b\times x + \epsilon$
X = sm.add_constant(X)
X[:5]  # X is a NumPy array, so slice the first rows instead of calling .head()
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
model.summary()
Loading libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Importing data
dataset = pd.read_csv('startups.csv')
dataset.head()
Independent and dependent variables
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
X
y
Encoding categorical data
Building the dummy variables with pandas
dX = pd.DataFrame(X,columns=dataset.columns[:4])
dummies = pd.get_dummies(dX.State)
dummies.head()
dummies1=pd.DataFrame(dummies.iloc[:, :-1].values,columns=['California','Florida'])
dummies1.head()
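As an alternative sketch, pd.get_dummies can drop one level by itself through its drop_first parameter. Note that this drops the first level in sorted order, whereas the cell above drops the last column, so the retained dummies differ. The toy frame below is an assumption for illustration.

```python
import pandas as pd

# Toy frame (assumed), standing in for the State column of the dataset
df_toy = pd.DataFrame({'State': ['New York', 'California', 'Florida', 'California']})

# drop_first=True removes the first level (here 'California');
# dtype=float gives 0.0/1.0 columns instead of booleans
d = pd.get_dummies(df_toy['State'], drop_first=True, dtype=float)
print(d.columns.tolist())
```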
X1 = dataset.iloc[:, :-2].values
X1
dX = pd.DataFrame(X1, columns=dataset.columns[:3])
dX1=dX.join(dummies1)
dX1
X2 = dX1.values
X2
Splitting the data: 80% training set, 20% test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size = 0.2, random_state = 0)
We can then create a regressor, fit it on the training set, and use the fitted model on the test set:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
regressor.coef_
regressor.intercept_
Predicting the Test set results
y_pred = regressor.predict(X_test)
y_pred
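For a fitted LinearRegression, a prediction is the intercept plus the dot product of the features with the coefficients. The sketch below checks this on synthetic data (the data and the names X_syn, y_syn and reg are assumptions, not the startups dataset).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed): 40 observations, 3 numeric features
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(40, 3))
y_syn = 1.5 + X_syn @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=40)

# Fit on the first 30 rows, predict the last 10
reg = LinearRegression().fit(X_syn[:30], y_syn[:30])
manual = X_syn[30:] @ reg.coef_ + reg.intercept_
print(np.allclose(manual, reg.predict(X_syn[30:])))
```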
Let's start by importing the libraries and the data
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
dataset = pd.read_csv('startups.csv')
Defining dependent and independent variables
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
Constructing the dummy variables
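from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# OneHotEncoder no longer accepts categorical_features (removed in scikit-learn 0.22).
# ColumnTransformer one-hot encodes column 3 (State) and passes the other columns through;
# the dummy columns come first in the output, followed by the numeric columns.
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder = 'passthrough', sparse_threshold = 0)
X = ct.fit_transform(X).astype(float)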
Deleting one dummy variable to avoid the dummy variable trap
X = X[:, 1:]
X[:4]
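The reason for dropping one dummy: the dummies for all levels sum to one in every row, which is exactly the intercept column, so keeping all of them together with a constant makes the design matrix rank-deficient. A minimal sketch with NumPy, using a made-up State column (assumed values):

```python
import numpy as np

# Toy State column (assumed values)
states = np.array(['New York', 'California', 'Florida', 'New York', 'Florida', 'California'])
levels = np.unique(states)  # sorted: California, Florida, New York

# One 0/1 column per level
dummies_all = (states[:, None] == levels[None, :]).astype(float)
const = np.ones((len(states), 1))

full = np.hstack([const, dummies_all])            # intercept + all three dummies
reduced = np.hstack([const, dummies_all[:, 1:]])  # intercept + two dummies

# Compare the rank with the number of columns in each design matrix
print(np.linalg.matrix_rank(full), full.shape[1])
print(np.linalg.matrix_rank(reduced), reduced.shape[1])
```

The full matrix has more columns than its rank, so its coefficients are not uniquely determined; the reduced matrix has full column rank.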
Constructing the data of the independent variables
dX2 = pd.DataFrame(X, columns=['Florida','New York','R&D Spend', 'Administration', 'Marketing Spend'])
We add an intercept
dX2 = sm.add_constant(dX2)
Performing the regression model
model = sm.OLS(y, dX2).fit() ## sm.OLS(output, input)
model.summary()