You may complete the following entirely in a Jupyter notebook. Use the LaTeX interpreter in a markdown cell where needed.
A linear model such as logistic regression or linear regression might overfit the data when there is a large number of features. One way to address this issue is to regularize or shrink the contribution of some of the features by reducing the magnitude of their corresponding weights or coefficients. By doing so, we can reduce the variance of the predicted values, and we can enhance the interpretability of the linear model by determining a smaller subset of the most important features. This is achieved by adding a penalty or regularization term to the cost function, which forces the learning algorithm to not only fit the data but also to keep the weights of the model as small as possible.
One way to regularize linear regression is to modify the cost function as follows:
\[\frac{1}{N} \sum_{j=1}^N \left( w_0 + \sum_{i=1}^m w_i x_i^{(j)} - y^{(j)} \right)^2 + \lambda \sum_{i=1}^m w_i^2\]
The penalty term is the squared L2 norm of the weight vector (the intercept w₀ is not penalized). It is the regularization term, or shrinkage penalty, and its coefficient λ is a hyperparameter that you can use to control how much you want to regularize the model. It is important to scale the data before performing regularization so that the weights are on comparable scales for features of comparable importance. Regularizing linear regression using the L2 norm is known as ridge regression.
Lasso regression instead uses the L1 norm of the weight vector. An important characteristic of lasso regression is that it can force some of the weights to be exactly zero when the hyperparameter λ is large enough. Ridge regression, on the other hand, shrinks all of the weights toward zero but does not set any of them exactly to zero.
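For comparison, written in the same notation as the ridge cost above, the lasso cost function replaces the squared penalty with the sum of the absolute values of the weights:

\[\frac{1}{N} \sum_{j=1}^N \left( w_0 + \sum_{i=1}^m w_i x_i^{(j)} - y^{(j)} \right)^2 + \lambda \sum_{i=1}^m |w_i|\]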
Split the data into training and testing sets using `train_test_split`.
Use the ridge regression implementation in scikit-learn and fit several ridge regression models to the training data.
Make sure to scale the training data before fitting each model.
Use a different λ (α in scikit-learn) for each of these models. You can try values in `np.arange(0, MAXIMUM, STEP)` or `np.linspace(0, MAXIMUM, STEPS)`; consider `np.logspace` if you want to try a few orders of magnitude or more of λ values, but note that its arguments differ. Choose your λ values so that you see a wide range of coefficient norms; make the range large enough that you can see asymptotic performance.
For each fitted model, extract its coefficients and compute their norm. How does the norm of the coefficients change as you vary the hyperparameter λ?
The ridge regression implementation differs from most others in that it requires you to request that it calculate an intercept (otherwise it assumes a 0 intercept!) and returns the intercept and weights instead of providing a predict method that applies them for you.
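A minimal sketch of this step, assuming the `Ridge` estimator class is used (it fits an intercept by default and provides `predict`; if you use the function-style interface described in the note above, request the intercept explicitly and apply the returned weights yourself). The arrays `X_train` and `y_train` are hypothetical placeholders for your own split:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Hypothetical arrays X_train, y_train from your own train_test_split call.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

lambdas = np.logspace(-2, 4, 13)   # example grid spanning several orders of magnitude
coef_norms = []
for lam in lambdas:
    model = Ridge(alpha=lam)       # alpha in scikit-learn plays the role of lambda
    model.fit(X_train_scaled, y_train)
    coef_norms.append(np.linalg.norm(model.coef_))  # L2 norm of the fitted weights
```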
Apply the following 3 model types:
For the ridge regression and lasso models, try different values for λ (or α in sklearn), and then choose a final model and evaluate it on the test set. (Optional: show train vs. test MSE vs. λ for both model types.)
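A sketch of the optional train-vs-test comparison for the lasso case, assuming hypothetical pre-scaled arrays `X_train_scaled`, `y_train`, `X_test_scaled`, `y_test` (the test set scaled with the scaler fitted on the training set):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

alphas = np.logspace(-3, 1, 9)     # example lambda grid; adjust to your data
train_mse, test_mse = [], []
for a in alphas:
    model = Lasso(alpha=a, max_iter=10_000)
    model.fit(X_train_scaled, y_train)
    train_mse.append(mean_squared_error(y_train, model.predict(X_train_scaled)))
    test_mse.append(mean_squared_error(y_test, model.predict(X_test_scaled)))
```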
Optional: To increase confidence that the model generalizes well, perform 5-fold cross-validation. Note: you can use `cross_validate` and specify the scoring as `neg_mean_squared_error` (or wrap `mean_squared_error` in a scorer with `make_scorer`).
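A sketch of this optional step, assuming hypothetical unscaled arrays `X` and `y`; using a pipeline keeps the scaling inside each fold so the validation folds are not used to fit the scaler:

```python
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Scale inside each CV fold, then fit ridge regression with an example alpha.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv_results = cross_validate(model, X, y, cv=5, scoring="neg_mean_squared_error")
mean_cv_mse = -cv_results["test_score"].mean()   # flip the sign back to an MSE
```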
Optional: In addition to plotting the norm of the weights overall (the regularization term), make a plot showing how each weight varies with λ.
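A sketch of the optional per-weight plot, again assuming hypothetical scaled training arrays `X_train_scaled` and `y_train`:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

lambdas = np.logspace(-2, 4, 25)
coefs = []
for lam in lambdas:
    model = Ridge(alpha=lam)
    model.fit(X_train_scaled, y_train)
    coefs.append(model.coef_)

coefs = np.array(coefs)               # shape: (n_lambdas, n_features)
for i in range(coefs.shape[1]):
    plt.plot(lambdas, coefs[:, i])    # one curve per weight
plt.xscale("log")
plt.xlabel("lambda (alpha in scikit-learn)")
plt.ylabel("weight value")
plt.show()
```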
Many other models, such as logistic regression, can also be extended by including a regularization term in the cost function. The regularization term can be based on the L2 norm or the L1 norm of the weight vector.
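For example, in scikit-learn's `LogisticRegression` the penalty type is set via `penalty` and the strength via `C`, where `C` is the inverse of the regularization strength (smaller `C` means stronger regularization):

```python
from sklearn.linear_model import LogisticRegression

# L2-regularized logistic regression (the default penalty).
clf_l2 = LogisticRegression(penalty="l2", C=1.0)

# L1-regularized logistic regression; requires a solver that supports L1, e.g. liblinear or saga.
clf_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")
```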
We have seen that the cost functions used in training machine learning models can be written as:
\[\sum_{j=1}^N \text{Loss}(y_{\text{pred}}^{(j)}, y_{\text{true}}^{(j)})\]
ML models used for classification typically assume that an equal number of samples is observed from each class and that all losses incurred due to misclassification are the same. However, this is not always the case: the data might be imbalanced, and the minority class is often the class of interest for us.
One way to mitigate class imbalance is to modify the cost function as follows:
\[v_1 \sum_{j \in \text{Class 1}} \text{Loss}(y_{\text{pred}}^{(j)}, y_{\text{true}}^{(j)}) + v_0 \sum_{j \in \text{Class 0}} \text{Loss}(y_{\text{pred}}^{(j)}, y_{\text{true}}^{(j)})\]
Here v₁ and v₀ are weights assigned to each class; these hyperparameters are chosen to draw the attention of the learning algorithm to the minority class. When this weighted version of the cost function is used, we refer to the modified algorithm as the class-weighted algorithm, or simply the weighted algorithm.
If class 1 is the minority class, how should v₁ be chosen with respect to v₀?
Load the creditcard.csv data.
After scaling the training set using `StandardScaler`, scale the testing set in the same way.
Train a class-weighted model (set `class_weight` to “balanced”).
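A minimal end-to-end sketch of these steps, assuming a class-weighted logistic regression is intended and that the target column in creditcard.csv is named "Class" (as in the common credit-card fraud dataset; adjust the column name to your file):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Assumes the label column is "Class"; change it if your file differs.
df = pd.read_csv("creditcard.csv")
X, y = df.drop(columns=["Class"]), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the scaler on the training set only, then apply the same transform to the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# class_weight="balanced" reweights each class inversely to its frequency in y_train.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train_scaled, y_train)
print(classification_report(y_test, clf.predict(X_test_scaled)))
```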