Regression
Model Types
Simple Linear Regression
- Predict dependent variable based on a single independent variable
- Works well when the relationship between the variables is linear
- The fitted model is a single straight line
- Does not require Feature Scaling
Sample Code
from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=True)
lr.fit(X_train, y_train)
intercept = lr.intercept_
coefficient = lr.coef_
y_pred = lr.predict(X_test)
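The snippet above assumes that X_train, y_train and X_test already exist. A self-contained sketch on synthetic data might look like the following; the data, the true intercept and slope (2.0 and 3.0), and the split parameters are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): y = 2 + 3x plus a little Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))  # scikit-learn expects a 2D feature array
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

lr = LinearRegression(fit_intercept=True)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# The estimates should land close to the values used to generate the data
print("intercept:", lr.intercept_)
print("slope:", lr.coef_[0])
print("R^2 on test set:", r2_score(y_test, y_pred))
```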
Multiple Linear Regression
- Predict dependent variable based on multiple independent variables
- Works well when the relationship between the variables is linear
- The fitted model is a plane or hyperplane (a straight line only in the single-variable case)
- Does not require Feature Scaling
Remember
For ML models, the categorical variables need to be transformed to dummy variables (using one hot encoding, for example). However, care should be taken not to include all the resulting dummy variables for multiple regression models, as it will result in multicollinearity. This is also referred to as the Dummy Variable Trap.
In order to avoid this, we should always omit one dummy variable corresponding to each categorical variable.
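In Python, scikit-learn's LinearRegression does not remove a dummy variable for us. Its least-squares solver still returns usable predictions when all dummy variables are included, but the individual coefficients are then not uniquely determined. If the coefficients need to be interpreted, omit one dummy variable explicitly, for example with OneHotEncoder(drop='first').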
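As a small illustration of the Dummy Variable Trap described above, the sketch below one hot encodes a categorical column with pandas, once keeping every dummy and once dropping the first. The DataFrame and its column names ("state", "spend") are hypothetical.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column
df = pd.DataFrame({
    "state": ["New York", "California", "Florida", "California"],
    "spend": [10.0, 20.0, 15.0, 30.0],
})

# All dummies kept: the state columns always sum to 1 in every row,
# which duplicates the information carried by the intercept
all_dummies = pd.get_dummies(df, columns=["state"])

# drop_first=True omits one dummy per categorical variable
safe_dummies = pd.get_dummies(df, columns=["state"], drop_first=True)

print(list(all_dummies.columns))
print(list(safe_dummies.columns))
```

The dropped category becomes the baseline: the coefficients of the remaining dummies are read as differences from it.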
Tip
In Multiple Linear Regression, there is no need to apply Feature Scaling.
Each coefficient absorbs the scale of its own variable, so rescaling a feature changes its coefficient but not the predictions.
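The tip can be checked numerically. The sketch below fits the same model on raw and on standardized features and compares the predictions; the synthetic data and the feature ranges are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different scales (assumption for illustration)
X = np.column_stack([rng.uniform(0, 1, 100), rng.uniform(0, 10000, 100)])
y = 5.0 + 2.0 * X[:, 0] + 0.003 * X[:, 1] + rng.normal(0, 0.1, 100)

raw_model = LinearRegression().fit(X, y)

scaler = StandardScaler().fit(X)
scaled_model = LinearRegression().fit(scaler.transform(X), y)

pred_raw = raw_model.predict(X)
pred_scaled = scaled_model.predict(scaler.transform(X))

# Predictions agree; only the coefficients change with the feature scale
print("same predictions:", np.allclose(pred_raw, pred_scaled))
print("raw coefficients:", raw_model.coef_)
print("scaled coefficients:", scaled_model.coef_)
```

This holds for ordinary least squares. Regularized variants such as Ridge or Lasso penalize coefficient size, so they are sensitive to feature scale.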
Backward Elimination using statsmodels
Sample Code
import numpy as np
import statsmodels.api as sm
# Here X is the array after one hot encoding the independent variable
X = X[:, 1:] # Avoiding the Dummy Variable Trap
# statsmodels OLS does not add the constant (intercept) automatically.
# So we need to add it in the form b0*x0 where x0 is a column of 1s
X = np.append(arr = np.ones((X.shape[0], 1)), values = X, axis = 1)
# Start with all columns. After each fit, remove the predictor with the
# highest p-value above the significance level and fit again
X_optimal = X
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
print(regressor_OLS.summary())
Polynomial Regression
- Predict dependent variable based on single independent variable
- Works well when the relationship is non-linear
- The fitted model is a single curve rather than a straight line
- Does not require Feature Scaling
- Need to choose the right polynomial degree for a good bias/variance tradeoff
Sample Code
from sklearn.preprocessing import PolynomialFeatures
degree = 4  # polynomial degree, not the number of features
pf4 = PolynomialFeatures(degree=degree, include_bias=False)
X4 = pf4.fit_transform(X_train)
from sklearn.linear_model import LinearRegression
lr4 = LinearRegression()
lr4.fit(X4, y_train)  # targets must match the rows of X_train
print(lr4.intercept_)
print(lr4.coef_)
y_pred = lr4.predict(pf4.transform(X_test))
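One common way to choose the polynomial degree mentioned above is to compare cross-validated scores across several degrees. The sketch below does this on synthetic data; the cubic data-generating function, the range of degrees tried and the 5-fold setting are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): a cubic trend with noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(0, 1.0, size=200)

scores = {}
for degree in range(1, 9):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    # Mean R^2 over 5 folds: low degrees underfit (high bias),
    # very high degrees tend to overfit (high variance)
    scores[degree] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

best_degree = max(scores, key=scores.get)
for degree, score in scores.items():
    print(degree, round(score, 3))
print("best degree:", best_degree)
```

Putting PolynomialFeatures inside the pipeline means the feature expansion is applied to each training fold separately, the same way it would be applied to new data.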
Support Vector Regression
- Predict dependent variable based on one or more independent variables
- Works very well on non-linear problems; works with linear problems as well
- Less sensitive to outliers than least-squares regression
- With a non-linear kernel, the fitted model is a smooth curve
- Requires Feature Scaling
Sample Code
# Scale Data
from sklearn.preprocessing import StandardScaler
# Apply feature scaling to independent variables
# Also apply to dependent variables, if needed
sc_x = StandardScaler()
sc_y = StandardScaler()
X_train = sc_x.fit_transform(X_train)
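# StandardScaler expects a 2D array and SVR expects a 1D target
# (assuming y_train is a 1D NumPy array)
y_train = sc_y.fit_transform(y_train.reshape(-1, 1)).ravel()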
from sklearn.svm import SVR
r_svr = SVR(kernel = 'rbf') # Using radial basis function kernel
r_svr.fit(X_train,y_train)
y_pred_sc = r_svr.predict(sc_x.transform(X_test))
# Inverse Transform to get the predicted value in the original scale
y_pred = sc_y.inverse_transform(y_pred_sc.reshape(-1,1))
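As an alternative to scaling and inverse transforming by hand, scikit-learn can bundle these steps into one estimator. The sketch below uses a Pipeline for the features and TransformedTargetRegressor for the target; the synthetic sine data and its scale are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data (assumption): a noisy sine curve with a large target scale
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = 1000.0 * np.sin(X[:, 0]) + rng.normal(0, 50.0, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

r_svr = TransformedTargetRegressor(
    # The pipeline scales X before it reaches the SVR
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    # The transformer scales y for fitting and inverts it on predict
    transformer=StandardScaler(),
)
r_svr.fit(X_train, y_train)

# Predictions come back in the original units of y
y_pred = r_svr.predict(X_test)
print("R^2 on test set:", r_svr.score(X_test, y_test))
```

Because both scalers are fitted inside fit, the test data never influences the scaling parameters.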
Decision Tree Regression
- Predict dependent variable based on one or more independent variables
- More applicable to datasets with a high number of independent variables
- Works on both linear / nonlinear problems
- Does not require Feature Scaling
- Poor results if the dataset is too small
- Overfitting can easily occur
Sample Code
from sklearn.tree import DecisionTreeRegressor
r_dt = DecisionTreeRegressor(random_state = 0)
r_dt.fit(X_train,y_train)
y_pred = r_dt.predict(X_test)
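The overfitting noted above can be seen by comparing training and test scores. The sketch below contrasts an unrestricted tree with one limited by max_depth and min_samples_leaf; the synthetic data and the specific limits (4 and 5) are assumptions made for illustration, not recommended values.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default settings: the tree keeps splitting until its leaves are pure,
# so it also memorizes the noise in the training data
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# max_depth and min_samples_leaf limit how complex the tree can become
shallow_tree = DecisionTreeRegressor(
    max_depth=4, min_samples_leaf=5, random_state=0
).fit(X_train, y_train)

for name, tree in [("deep", deep_tree), ("shallow", shallow_tree)]:
    print(
        name,
        "train R^2:", round(tree.score(X_train, y_train), 3),
        "test R^2:", round(tree.score(X_test, y_test), 3),
    )
```

A large gap between the training and test scores is the usual sign of overfitting.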
Random Forest Regression
- Predict dependent variable based on one or more independent variables
- Is an ensemble method
- Creates multiple models (such as decision trees) using different random subsets of the data and of the independent variables
- Need to choose the number of trees
- Final prediction is based on an aggregated value (e.g. the average) of all the predictions
- Does not require Feature Scaling
- Difficult to interpret
- Less prone to overfitting than a single decision tree, though it can still overfit noisy data
Sample Code
from sklearn.ensemble import RandomForestRegressor
r_rf = RandomForestRegressor(n_estimators = 10, random_state = 0)
r_rf.fit(X_train,y_train)
y_pred = r_rf.predict(X_test)
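One way to choose the number of trees is to watch the out-of-bag (OOB) score as trees are added. Each tree is scored on the rows left out of its bootstrap sample, so no separate validation set is needed. The synthetic data and the candidate tree counts below are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): two informative features and one noise feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.3, size=500)

oob_scores = {}
for n_trees in [25, 50, 100, 200]:
    r_rf = RandomForestRegressor(
        n_estimators=n_trees, oob_score=True, random_state=0
    )
    r_rf.fit(X, y)
    # oob_score_ is the R^2 computed from out-of-bag predictions
    oob_scores[n_trees] = r_rf.oob_score_

for n_trees, score in oob_scores.items():
    print(n_trees, round(score, 3))
```

The score typically levels off after some number of trees; adding more beyond that point mainly costs training time.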
XGB Regression
Sample Code
from xgboost import XGBRegressor
r_xgb = XGBRegressor()
r_xgb.fit(X_train, y_train)
y_pred_xgb = r_xgb.predict(X_test)