The Thief’s Almanac

a roundup of the ML-speedrunning course so far (linear regression):

## Jupyter Notebook ## 
1. Exploratory Data Analysis (EDA):

- gotten a grasp on EDA methods, e.g. check/drop duplicates, deal with NA values, check/deal with outliers (drop/cap/impute with mean)
- check distribution of target variable: if skewed (left/right), apply an appropriate transformation (e.g. log for right skew)
- univariate analysis/visualisation: trellis plots, scatterplots, using seaborn and plotly (for interactive, e.g. hovering)
- bivariate analysis: correlation matrix (any highly correlated features? maybe drop 1 for redundancy)
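
A rough sketch of the EDA checks above, using only pandas and numpy. The toy DataFrame, the column names (`area`, `area_sqft`, `rooms`, `price`), the 1.5 x IQR capping rule and the 0.9 correlation threshold are all assumptions made up for illustration, not something from the course. Plotting is left out to keep it short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "price" is the target, the rest are features.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "area": rng.normal(100, 20, n),
    "rooms": rng.integers(1, 6, n).astype(float),
})
df.loc[0, "rooms"] = np.nan                    # one missing value
df.loc[1, "area"] = 1000.0                     # one obvious outlier
df["area_sqft"] = df["area"] * 10.764          # deliberately redundant feature
df["price"] = np.exp(rng.normal(12, 0.8, n))   # right-skewed target
df = pd.concat([df, df.iloc[[2]]], ignore_index=True)  # one duplicate row

# 1) Duplicates and missing values
df = df.drop_duplicates().reset_index(drop=True)
df["rooms"] = df["rooms"].fillna(df["rooms"].mean())

# 2) Outliers: cap values outside k * IQR (one option among drop/cap/impute)
def cap_outliers(s, k=1.5):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

for col in ["area", "area_sqft"]:
    df[col] = cap_outliers(df[col])

# 3) Target distribution: compare skewness before and after a log transform
skew_before = df["price"].skew()
df["log_price"] = np.log1p(df["price"])
skew_after = df["log_price"].skew()

# 4) Correlation matrix: flag one feature from each highly correlated pair
corr = df[["area", "area_sqft", "rooms"]].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
df = df.drop(columns=to_drop)
```

Only the upper triangle of the correlation matrix is scanned so that each pair is considered once and just one feature of the pair ends up in `to_drop`.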

2. Data-splitting
3. Feature scaling (normalise or standardise)
- using StandardScaler() to standardise the numerical features that need it
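
A minimal sketch of steps 2 and 3, assuming a 60/20/20 train/validation/test split (the proportions and the synthetic data are my assumptions). The scaler is fitted on the training set only and then reused on the validation and test sets, so no information from those sets leaks into the scaling.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data: 100 rows, 3 features
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

# First carve out the test set (20%), then split the rest into train/val
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

# Learn mean and standard deviation from the training set only
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)
```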

4. Feature encoding pipeline (for categorical features)
- one-hot encoding (binary columns) or ordinal encoding
5. Assemble main pipeline (ColumnTransformer)
6. Build baseline model:
- fit X_train, y_train through main pipeline (i.e. training the baseline model)
- predict on X_val
- evaluate metrics: MAE, MSE, RMSE, R-squared

7. Build ridge regression model to penalise large coefficients (magnitude)
- fit X_train, y_train to ridge model's pipeline
- predict on X_val
- evaluate metrics: MAE, MSE, RMSE, R-squared
8. Compare coefficients
- useful to visualise the different coef_ using a seaborn barplot

9. Build Lasso regression model
- repeat steps 7 and 8 to compare Lasso vs. Baseline
10. Hyperparameter Tuning
- GridSearchCV (exhaustive/inefficient; for small search space) OR RandomizedSearchCV (random/efficient; for large search space)
- perform grid search with CV for ridge and lasso regression to find the best parameters
11. Evaluate final model on X_test using the best parameters.