Model Basics
Modeling Steps
- Data Preparation - Generate or Read data
- Separate the independent and dependent variables
- Handle missing data (Needed if there is missing data)
- Encode Categorical Data (Needed if there is categorical data)
- Split training and test data
- Feature Scaling (Needed only for some models)
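The steps above can be sketched end to end. The tiny in-memory dataset, column names, and split ratio below are made up for illustration; a real workflow would read the data from a file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset - last column is the dependent variable
dataset = pd.DataFrame({
    'Feature1': [1.0, 2.0, 3.0, 4.0],
    'Feature2': [10.0, 20.0, 30.0, 40.0],
    'Target':   [0, 1, 0, 1],
})

# Separate the independent and dependent variables
X = dataset.iloc[:, :-1].values  # independent variables
y = dataset.iloc[:, -1].values   # dependent variable

# Split training and test data (25% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
```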
Handle missing data
Common approach - Replace each missing value with the mean of all values in that column
Sample Code
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3]) # Include all numeric columns
X[:, 1:3] = imputer.transform(X[:, 1:3])
Encode Categorical Data
- One-hot encoding with ColumnTransformer (below) usually pushes the encoded columns to the front of the array
Encoding the categorical data when the categories are binary (e.g. male, female etc.) or the order matters (e.g. small, medium, large etc.)
Sample Code
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
Encoding the Categorical data when the data points are not related and the order does not matter (e.g. state names)
Sample Code
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
Split training and test data
Sample Code
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
Feature Scaling
- Used in some ML models (not all)
- Needed in models where there is an implicit relation between the dependent and independent variables (e.g. the Support Vector Regression model)
- Done in order to prevent some features from dominating the others
- Feature scaling is always applied to columns
- Should not be applied to encoded columns
- Will result in loss of interpretation (of the original categories) if applied
Remember
Feature scaling should always be done after splitting the training and test data. The scaler must be fitted on the training data only; the test data is then transformed with that already-fitted scaler, so no information leaks from the test set.
Standardization
- \(x' = (x - \mu) / \sigma\)
- This results in the features mostly taking values between -3 and +3 (for roughly normally distributed features)
- Works all the time, irrespective of the distribution of the features
Normalization
- \(x' = (x - \min) / (\max - \min)\)
- This results in all the features taking values between 0 and 1
- Recommended when the distributions of most of the features are normal
Sample Code
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# When some independent variables have been encoded
# Scale only the non-encoded columns
# Since the encoded columns sit at the front of the array, take everything from the index of the first non-encoded numerical column
# In the code below, 'n' is the number of resulting encoded columns after encoding
X_train[:, n:] = sc.fit_transform(X_train[:, n:])
X_test[:, n:] = sc.transform(X_test[:, n:])  # transform only - never fit on test data
# When no independent variables have been encoded
# Apply feature scaling to all independent variables
# Also apply to the dependent variable, if needed
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
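The snippet above covers standardization only; normalization can be sketched the same way with scikit-learn's MinMaxScaler. The arrays below are made-up data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and test features
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.5, 500.0]])

sc = MinMaxScaler()  # rescales each feature to the [0, 1] range
X_train = sc.fit_transform(X_train)  # fit on the training data only
X_test = sc.transform(X_test)        # reuse the training min/max for the test data
```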
Model Evaluation
Regression Models
Classification Models
Confusion Matrix
Displays a matrix of four categories based on the actual and predicted labels
- True positive : actual = 1, predicted = 1
- False positive : actual = 0, predicted = 1
- False negative : actual = 1, predicted = 0
- True negative : actual = 0, predicted = 0
Also see Type I and Type II Errors
| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | TN | FP |
| Actual Positive | FN | TP |
Sample Code
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)
Accuracy
Fraction of samples predicted correctly: \((TP + TN) / (TP + TN + FP + FN)\)
Sample Code
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
Recall
Fraction of positive events that are predicted correctly: \(TP / (TP + FN)\)
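Recall has no snippet above; by analogy with the other metrics, scikit-learn provides recall_score. The labels below are hypothetical.

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 3 actual positives, 2 of them predicted correctly
y_test = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
recall_score(y_test, y_pred)  # TP / (TP + FN) = 2 / 3
```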
Precision
Fraction of predicted positive events that are actually positive: \(TP / (TP + FP)\)
Sample Code
from sklearn.metrics import precision_score
precision_score(y_test, y_pred)
F1 Score
Harmonic mean of recall and precision: \(F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}\)
- The higher the score, the better the model
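The F1 score also has a direct scikit-learn function, f1_score. The labels below are hypothetical and chosen so precision and recall are equal.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: precision = 2/3 and recall = 2/3, so F1 = 2/3
y_test = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
f1_score(y_test, y_pred)  # 2 * P * R / (P + R)
```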
ROC Curve and ROC AUC Score
- Help with understanding the balance between the true positive rate and the false positive rate
- The area under curve metric helps to analyze the performance
- Inputs to these functions are the actual labels and the predicted probabilities (not the predicted labels)
- ROC stands for Receiver Operating Characteristic
- The roc curve function returns three lists:
- thresholds: all unique prediction probabilities in descending order
- fpr: the false positive rate (FP / (FP + TN)) for each threshold
- tpr: the true positive rate (TP / (TP + FN)) for each threshold
Sample Code
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_pred_prob)
Model Selection
Bias-Variance Tradeoff
Terminology
- Bias: Error in ML model due to incorrect assumptions
- Variance: Changes in the model resulting from using different parts of the training dataset for training
The ideal scenario is to have low bias and low variance
K-Fold Cross Validation
- Used to validate the model using various combinations of the training data
- Ensures that we don't just rely on a single validation using test data to determine how good or bad the model is
- Steps
- Divide the training data into \(k\) folds
- Iterate over the folds, using \(k-1\) folds for training and the remaining fold for testing
Sample Code
from sklearn.model_selection import cross_val_score
# Example for SVC classifier
accuracies = cross_val_score(estimator = c_svc, X = X_train, y = y_train, cv = 10) # 10 folds
accuracies
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
Grid Search
- Allows testing various hyperparameter combinations to find the best selection
- Uses cross validation to ensure a good model selection with the optimal hyperparameters
- Steps
- Choose the hyperparameters to test the model with
- Divide the training data into \(k\) folds
- For each hyperparameter combination, iterate over the folds, using \(k-1\) folds for training and the remaining fold for testing
Sample Code
# Example for SVC classifier
from sklearn.model_selection import GridSearchCV
# Specify hyperparameters to test
parameters = [{'C': [0.2, 0.4, 0.6, 0.8, 1], 'kernel': ['linear']},
{'C': [0.2, 0.4, 0.6, 0.8, 1], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]
grid_search = GridSearchCV(estimator = c_svc,
param_grid = parameters,
scoring = 'accuracy',
cv = 10,
n_jobs = -1) # all processors in the machine will be used
grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best Accuracy: {:.2f} %".format(best_accuracy*100))
print("Best Parameters:", best_parameters)