ML Concepts

Pick 1 supervised learning algorithm that you are comfortable with and explain how it works. What are the important hyperparameters to tune, and what is the expected outcome?

  • Familiarity with common ML algorithms and their usage within applied data science use cases.
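Where a concrete reference helps, the sketch below uses a random forest as one possible answer; scikit-learn is assumed and the hyperparameter values are illustrative only, not a prescribed configuration.

```python
# Minimal sketch (assumes scikit-learn): a random forest classifier with the
# hyperparameters an interviewee would typically call out for tuning.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(
    n_estimators=300,      # more trees -> lower variance, higher training cost
    max_depth=8,           # limits tree complexity, guards against overfitting
    min_samples_leaf=5,    # larger leaves smooth the predictions
    max_features="sqrt",   # feature subsampling decorrelates the trees
    random_state=0,
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```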

What are the uses of train, validation, and test sets - and how would you recommend splitting these datasets? What is the role of these datasets when tuning model hyperparameters?

  • Is mindful of overfitting and common techniques to address overfitting. Understands the rationale behind these techniques.
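A minimal sketch of the three-way split and its role in tuning, assuming scikit-learn and a simple logistic regression (data and split ratios are illustrative): hyperparameters are chosen on the validation set, and the test set is touched only once at the end.

```python
# Minimal sketch (scikit-learn): three-way split, tune on validation, report on test.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, random_state=0)
# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:    # hyperparameter search uses the validation set only
    score = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_C, best_score = C, score

final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("test accuracy (evaluated once):", final.score(X_test, y_test))
```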

If your data have a time dimension, how would you split your datasets?

  • Understands that random splitting cannot be applied in this context; the test set should be defined on the most recent dates and the training set on prior dates (see the sketch below).
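A minimal sketch of a date-based split, assuming pandas and with hypothetical column names; scikit-learn's TimeSeriesSplit applies the same idea when cross-validation is needed.

```python
# Minimal sketch (pandas): split on a cutoff date instead of shuffling rows,
# so every test example is strictly later than every training example.
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=365, freq="D"),
    "feature": range(365),
})
df = df.sort_values("date")
cutoff = df["date"].iloc[int(len(df) * 0.8)]   # most recent ~20% becomes the test set
train, test = df[df["date"] < cutoff], df[df["date"] >= cutoff]
print(train["date"].max(), "<", test["date"].min())
```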

What metrics do you use to measure the performance of regression, classification, and clustering models?

  • For regression models, common metrics include Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared.
  • For classification models, common metrics include Accuracy, Precision, Recall, F1-score, and AUC-ROC.
  • For clustering, Silhouette Score, Davies-Bouldin index, and Distortion (sum of squared distances to cluster centers) are often used.
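Most of the metrics listed above are available in scikit-learn; a minimal sketch on toy arrays (values are illustrative only):

```python
# Minimal sketch (scikit-learn): the metrics listed above on tiny toy examples.
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error, r2_score,
                             accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score,
                             silhouette_score, davies_bouldin_score)

# Regression
y_true, y_pred = [3.0, 5.0, 2.5], [2.8, 5.3, 2.4]
print(mean_squared_error(y_true, y_pred), mean_absolute_error(y_true, y_pred),
      r2_score(y_true, y_pred))

# Classification (AUC-ROC needs predicted probabilities, not hard labels)
y_true_c, y_prob = [0, 1, 1, 0, 1], [0.2, 0.8, 0.6, 0.3, 0.9]
y_pred_c = [int(p >= 0.5) for p in y_prob]
print(accuracy_score(y_true_c, y_pred_c), precision_score(y_true_c, y_pred_c),
      recall_score(y_true_c, y_pred_c), f1_score(y_true_c, y_pred_c),
      roc_auc_score(y_true_c, y_prob))

# Clustering (no ground-truth labels required)
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
labels = [0, 0, 1, 1]
print(silhouette_score(X, labels), davies_bouldin_score(X, labels))
```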

What is the bias-variance tradeoff? What are some common ways to identify and reduce bias and variance? (Try to relate this to the algorithm that they discussed earlier)

  • Understands problem of overfitting and the appropriate hyperparameters to tune in common ML algorithms.
  • Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
  • Underfitting refers to a model that can neither model the training data nor generalize to new data.
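One way to make this concrete (a sketch assuming scikit-learn and a decision tree, complementing whichever algorithm the candidate chose): compare train and validation error as model complexity grows. A large gap signals high variance (overfitting); high error on both signals high bias (underfitting).

```python
# Minimal sketch (scikit-learn): diagnose over/underfitting by comparing train vs.
# validation error as model complexity (tree depth) increases.
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for depth in [1, 3, 6, 12, None]:   # shallow trees -> high bias, deep trees -> high variance
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth,
          mean_squared_error(y_train, tree.predict(X_train)),   # training error
          mean_squared_error(y_val, tree.predict(X_val)))       # validation error
```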

Suppose a classification model has an AUC score of 0.99. It feels too good to be true. How would you investigate further?

  • Can relate to the feeling of scepticism and suggests potential pitfalls that might falsely exaggerate model performance, e.g. target leakage, imbalanced datasets, etc.
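A minimal sketch of two quick checks a candidate might describe, with hypothetical DataFrame and column names: look at the class balance, and flag features that correlate almost perfectly with the label (a common symptom of target leakage).

```python
# Minimal sketch: sanity checks for a suspiciously high AUC.
# Assumes a pandas DataFrame `df` with a binary column "target" (hypothetical names).
import numpy as np
import pandas as pd

def sanity_checks(df: pd.DataFrame, target: str = "target") -> None:
    y = df[target]
    print("positive class rate:", y.mean())            # extreme imbalance warrants a closer look
    for col in df.drop(columns=[target]).select_dtypes("number"):
        corr = np.corrcoef(df[col], y)[0, 1]
        if abs(corr) > 0.95:                            # near-perfect correlation -> suspect leakage
            print(f"possible target leak: {col} (corr={corr:.2f})")
```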

Describe some common encoding techniques for categorical data. What about timestamp data?

  • Suggests alternative encodings, understands their pros and cons, and how the choice depends on the problem domain and model algorithm.
  • Is one-hot encoding (OHE) the only way? For neural networks, OHE is fine. For tree-based algorithms, ask why OHE may not be ideal (e.g. random forests and boosted trees subsample columns, so with huge cardinality OHE adds to the curse of dimensionality and the numerical features can be overwhelmed). A sketch of common options follows this list.
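A minimal sketch of the common options, assuming pandas and scikit-learn; column names and data are illustrative only.

```python
# Minimal sketch: one-hot vs. ordinal encoding for a categorical column,
# plus component and cyclical (sin/cos) features for a timestamp.
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "city": ["SG", "KL", "SG", "BKK"],
    "ts": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 13:00",
                          "2024-01-02 20:00", "2024-01-03 02:00"]),
})

ohe = OneHotEncoder().fit_transform(df[["city"]]).toarray()   # wide; fine for linear models / NNs
ordinal = OrdinalEncoder().fit_transform(df[["city"]])        # compact; often enough for trees

# Timestamps: expand into components, and encode cyclic parts with sin/cos
df["hour"] = df["ts"].dt.hour
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
df["dow"] = df["ts"].dt.dayofweek
print(ohe.shape, ordinal.ravel(), df[["hour_sin", "hour_cos", "dow"]].head(2))
```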

How does causal inference differ from classical ML and what are some common techniques for causal inference used by data scientists?

  • Classical ML is often concerned with prediction, but causal inference seeks to understand the cause-and-effect relationship between variables. It’s about asking ‘what if’ questions.
  • Common techniques for causal inference include randomized controlled trials, natural experiments, difference-in-differences, instrumental variables, regression discontinuity, and propensity score matching.
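As one concrete example, a difference-in-differences estimate reduces to a regression with an interaction term; the sketch below assumes statsmodels, and the data and column names are purely illustrative.

```python
# Minimal sketch (statsmodels): difference-in-differences via an interaction term.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "outcome": [10, 11, 12, 13, 10, 11, 15, 16],
    "treated": [0, 0, 1, 1, 0, 0, 1, 1],   # treatment-group indicator
    "post":    [0, 0, 0, 0, 1, 1, 1, 1],   # before/after the intervention
})
# The coefficient on treated:post is the DiD estimate of the treatment effect.
model = smf.ols("outcome ~ treated + post + treated:post", data=df).fit()
print(model.params["treated:post"])
```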

How would you handle missing or corrupted data in a dataset?

  • Approaches may include: deleting rows/columns, replacing with the mean/median/mode, predicting the missing values, or assigning a unique category to missing values (see the sketch below).
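A minimal sketch of a few of these approaches, assuming pandas and scikit-learn, with illustrative data and column names.

```python
# Minimal sketch: simple strategies for missing values.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31], "city": ["SG", None, "KL", "SG"]})

dropped = df.dropna()                                          # delete rows with missing values
df["age_imputed"] = SimpleImputer(strategy="median").fit_transform(df[["age"]])  # median imputation
df["city_filled"] = df["city"].fillna("MISSING")               # unique category for missing values
print(dropped.shape)
print(df)
```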

What are some of the parameters that should be considered when using {any ML algorithm}, and how do these parameters affect training?

  • If a candidate mentions a specific algorithm, check which parameters they consider when using it, e.g. learning rate, number of boosting rounds, tree depth, column subsampling, row subsampling, min child weight, etc. (a sketch follows this list).
  • Check whether the candidate has given thought to the model selection process; the considerations should weigh complexity vs. simplicity when it comes to deployment.
  • If a candidate chooses neural networks (especially for tabular problems), the reasoning should be solid; otherwise, ask why other algorithms that may give better results were not considered.
  • Probe the candidate’s knowledge of model interpretation methods, including model-agnostic methods such as SHAP and LIME; if mentioned, have the candidate explain how the values are calculated and check that they know SHAP is computed at the instance level.
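A minimal sketch touching both points, assuming the xgboost and shap packages are available: the boosting hyperparameters named above, plus instance-level SHAP attributions.

```python
# Minimal sketch (assumes xgboost and shap): typical boosting hyperparameters,
# then SHAP values computed per instance and per feature.
import shap
import xgboost
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = xgboost.XGBClassifier(
    n_estimators=200,        # number of boosting rounds
    learning_rate=0.05,      # smaller steps usually need more rounds
    max_depth=4,             # depth of each tree
    subsample=0.8,           # row subsampling per tree
    colsample_bytree=0.8,    # column subsampling per tree
    min_child_weight=5,      # minimum child weight per leaf
).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # one attribution per feature per instance
print(shap_values.shape)                 # (n_samples, n_features)
```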