Data Science Knowledge

Bias vs Variance

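Low bias, high variance (overfitting): the model is very sensitive to the training data, so it fits the training set well but performs poorly on new data.
High bias, low variance (underfitting): the model is less sensitive to the training data, so its predictions change little from one training set to another, but it is too simple to capture the underlying pattern and performs poorly on both training and new data.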
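Error = bias^2 + variance + irreducible error
The best model is the one that minimizes this total error.
This requires a compromise between bias and variance, since reducing one usually increases the other.
Solution: use cross-validation to estimate the error on unseen data and pick the model complexity that minimizes it.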
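The decomposition above can be checked numerically. The following is a minimal sketch, not part of the original notes: it assumes a toy estimator (a sample mean multiplied by a hypothetical `shrink` factor) and made-up values for `true_mean` and `noise_sd`. Shrinking toward zero adds bias but lowers variance.

```python
import random
import statistics


def simulate(shrink, n_samples=20, n_trials=2000,
             true_mean=2.0, noise_sd=1.0, seed=0):
    """Estimate bias^2, variance and mean squared error of a shrunken sample mean.

    All parameter names and default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        # Draw a fresh training set and compute the estimator on it.
        sample = [rng.gauss(true_mean, noise_sd) for _ in range(n_samples)]
        estimates.append(shrink * statistics.fmean(sample))

    bias_sq = (statistics.fmean(estimates) - true_mean) ** 2
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean([(e - true_mean) ** 2 for e in estimates])
    return bias_sq, variance, mse


if __name__ == "__main__":
    for shrink in (1.0, 0.5):
        bias_sq, variance, mse = simulate(shrink)
        # mse equals bias_sq + variance up to floating-point rounding.
        print(f"shrink={shrink}: bias^2={bias_sq:.4f} "
              f"variance={variance:.4f} mse={mse:.4f}")
```

The irreducible error is not simulated here. When predicting a new noisy observation it would add `noise_sd ** 2` on top of the two terms, whichever estimator is used.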

Precision is a good measure when the cost of a false positive is high.
Recall is the metric to use for selecting the best model when the cost of a false negative is high.
F1 score may be a better measure when we need a balance between precision and recall and the class distribution is uneven (a large number of actual negatives).
- F1: the harmonic mean of the precision and recall of a model. 1 is the best, 0 is the worst.
- You would use it in classification tests where true negatives don’t matter much.
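As a minimal sketch of these three metrics, assuming the confusion-matrix counts are already known (the function name and the example counts are illustrative, not from the notes):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    # Precision: of everything predicted positive, how much was actually positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
    print(precision_recall_f1(tp=8, fp=2, fn=8))
```

The true-negative count never appears in the function, which is why F1 suits problems where true negatives matter little.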

Regularization is an approach to address overfitting in ML.
An overfitted model fails to generalize its estimates to test data.
When the model being learned is low bias/high variance, or when we have only a small amount of data, the estimated model is prone to overfitting.

L2 Regularization: Prevents the weights from getting too large (as measured by the L2 norm). The larger the weights, the more complex the model and the higher the chance of overfitting.
L1 Regularization: Prevents the weights from getting too large (as measured by the L1 norm). L1 regularization also introduces sparsity in the weights: it forces more weights to be exactly zero rather than reducing the average magnitude of all weights.
Entropy: Used for models that output probabilities. Pushes the predicted probability distribution toward the uniform distribution, which discourages overconfident predictions.
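The difference between the two norms can be sketched on a single weight vector. This is a minimal illustration under stated assumptions: `lam` is a hypothetical regularization strength, and the two `*_shrink` functions are the per-weight minimizers of 0.5 * (w - v)^2 plus the respective penalty, chosen here only to show the effect. They are not a full training procedure.

```python
def l1_penalty(weights, lam):
    """lam * sum of absolute values (L1 norm)."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lam * sum of squares (squared L2 norm)."""
    return lam * sum(w * w for w in weights)


def l2_shrink(weights, lam):
    """Minimizer of 0.5*(w - v)^2 + lam*w^2: every weight is scaled down, none hits zero."""
    return [v / (1.0 + 2.0 * lam) for v in weights]


def l1_shrink(weights, lam):
    """Minimizer of 0.5*(w - v)^2 + lam*|w| (soft thresholding): small weights become 0."""
    shrunk = []
    for v in weights:
        if v > lam:
            shrunk.append(v - lam)
        elif v < -lam:
            shrunk.append(v + lam)
        else:
            shrunk.append(0.0)
    return shrunk


if __name__ == "__main__":
    weights = [3.0, -0.2, 0.05, 1.5]  # made-up example weights
    print(l2_shrink(weights, 0.5))  # all entries stay nonzero
    print(l1_shrink(weights, 0.5))  # entries smaller than lam in magnitude become 0.0
```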

Data augmentation: Create more data from the available data by randomly cropping, dilating, rotating, adding a small amount of noise, etc.
K-fold Cross-validation: Divide the data into k groups. Train on (k - 1) groups and test on the remaining group. Repeat for all k possible choices of the test group and average the results.
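The k-fold splitting step can be sketched as follows. This is a minimal version that assumes the data is addressed by index and is already shuffled. The function name is hypothetical, and real projects would more likely use a library splitter.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


if __name__ == "__main__":
    for train, test in k_fold_indices(10, 3):
        print("train:", train, "test:", test)
```

Each sample lands in the test group exactly once, so averaging the k test scores uses every data point for evaluation.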

Injecting noise: Add random noise to the weights while they are being learned. This pushes the model to be relatively insensitive to small variations in the weights, which acts as regularization.
Dropout: Generally used for neural networks. Units, together with their connections to adjacent layers, are randomly dropped according to a dropout ratio, and the remaining network is trained in the current iteration. In the next iteration, a different random set is dropped.
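A minimal sketch of dropout applied to one layer's activations. It assumes the common "inverted dropout" variant, which rescales the surviving activations so no change is needed at test time. That rescaling is not mentioned in the notes, and the function and parameter names are illustrative.

```python
import random


def dropout(activations, drop_ratio, rng, training=True):
    """Randomly zero activations with probability drop_ratio during training."""
    if not 0.0 <= drop_ratio < 1.0:
        raise ValueError("drop_ratio must be in [0, 1)")
    if not training or drop_ratio == 0.0:
        # At test time the full network is used unchanged.
        return list(activations)
    keep = 1.0 - drop_ratio
    # Assumption: inverted dropout, so kept values are scaled by 1/keep.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


if __name__ == "__main__":
    rng = random.Random(0)
    layer = [1.0, 2.0, 3.0, 4.0]
    # A different random mask is drawn on every call (every iteration).
    print(dropout(layer, 0.5, rng))
    print(dropout(layer, 0.5, rng))
```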

L2 regularization tends to spread error among all the terms, shrinking every weight a little rather than eliminating any.
L1 is sparser: many variables are assigned a weight of exactly zero, so each variable is effectively either kept or dropped.
L1 corresponds to placing a Laplace prior on the terms.
L2 corresponds to a Gaussian prior.
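The link to priors comes from maximum a posteriori (MAP) estimation. A short derivation sketch, where the symbols D (data), w (weights), sigma and b (prior scale parameters) and lambda (regularization strength) are notation introduced here for illustration:

```latex
\hat{w}_{\mathrm{MAP}}
  = \arg\max_{w} \big[ \log p(D \mid w) + \log p(w) \big]
  = \arg\min_{w} \big[ -\log p(D \mid w) - \log p(w) \big]

% Gaussian prior: p(w) \propto \exp\!\big(-\lVert w \rVert_2^2 / (2\sigma^2)\big)
-\log p(w) = \frac{1}{2\sigma^2} \lVert w \rVert_2^2 + \text{const}
  \quad\Rightarrow\quad \text{L2 penalty with } \lambda = \frac{1}{2\sigma^2}

% Laplace prior: p(w) \propto \exp\!\big(-\lVert w \rVert_1 / b\big)
-\log p(w) = \frac{1}{b} \lVert w \rVert_1 + \text{const}
  \quad\Rightarrow\quad \text{L1 penalty with } \lambda = \frac{1}{b}
```

A narrower prior (smaller sigma or b) therefore corresponds to stronger regularization.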

Type I error is a false positive: claiming something has happened when it hasn’t
- e.g. Telling a man he is pregnant.
Type II error is a false negative: claiming nothing is happening when in fact something is.
- e.g. Telling a pregnant woman she isn’t carrying a baby.
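As a small sketch, both error types can be counted from paired actual and predicted labels. The function name and the example labels are made up for illustration. `True` means the event actually happened or was predicted.

```python
def count_errors(actual, predicted):
    """Return (type_1, type_2): false positive and false negative counts."""
    # Type I: predicted positive although the actual label is negative.
    type_1 = sum(1 for a, p in zip(actual, predicted) if not a and p)
    # Type II: predicted negative although the actual label is positive.
    type_2 = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return type_1, type_2


if __name__ == "__main__":
    actual = [True, False, False, True, False]
    predicted = [True, True, False, False, False]
    print(count_errors(actual, predicted))  # → (1, 1)
```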

The name Support Vector Classifier comes from the fact that the observations on the edge and within the Soft Margin are called Support Vectors.

Naive Bayes is called "naive" because it makes an assumption: the conditional probability of the features given the class is calculated as the pure product of the individual probabilities of the components.
This implies the absolute independence of features (given the class), a condition probably never met in real life.
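A minimal sketch of the product rule described above, as used in a naive Bayes classifier. The class names, priors and per-feature probabilities are hypothetical numbers chosen only for illustration.

```python
def naive_bayes_posteriors(priors, likelihoods, features):
    """Score each class as prior * product of per-feature probabilities, then normalize."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in features:
            # Independence assumption: each feature contributes its own factor.
            score *= likelihoods[label][feature]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


if __name__ == "__main__":
    priors = {"spam": 0.4, "ham": 0.6}
    likelihoods = {
        "spam": {"free": 0.5, "meeting": 0.1},
        "ham": {"free": 0.1, "meeting": 0.4},
    }
    print(naive_bayes_posteriors(priors, likelihoods, ["free"]))
    print(naive_bayes_posteriors(priors, likelihoods, ["free", "meeting"]))
```

Practical implementations usually sum log-probabilities instead of multiplying, to avoid numerical underflow with many features.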

k-nearest neighbors: sort the neighbors of the given point by their distances in increasing order.
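The sorting step can be sketched as below. Euclidean distance and keeping only the first k points are assumptions made for this illustration, since the note does not say which distance is used or what happens after sorting.

```python
import math


def nearest_neighbors(points, query, k):
    """Return the k points closest to query, ordered by increasing distance."""
    # Assumption: Euclidean distance between coordinate tuples.
    by_distance = sorted(points, key=lambda p: math.dist(p, query))
    return by_distance[:k]


if __name__ == "__main__":
    points = [(0, 0), (3, 4), (1, 1), (5, 5)]
    print(nearest_neighbors(points, (0, 0), 2))  # → [(0, 0), (1, 1)]
```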

Ensemble methods: combine multiple weak models/learners into one predictive model to reduce bias, reduce variance, and/or improve accuracy.

Bagging: Trains N different weak models (usually of the same type, i.e. homogeneous) in parallel, each on a bootstrap sample of the input dataset (drawn at random with replacement, so the samples can overlap). In the test phase, each model is evaluated, and the label with the greatest number of predictions is selected as the prediction. Bagging methods reduce the variance of the prediction. Simple voting.
Boosting: Trains N different weak models (usually of the same type, i.e. homogeneous) on the complete dataset in sequential order. The data points misclassified by the previous weak model are given more weight so that they can be classified properly by the next weak learner. In the test phase, each model is evaluated and its prediction is weighted for voting according to the error that weak model achieved during training. Boosting methods decrease the bias of the prediction. Weighted voting.
Stacking: Trains N different weak models (usually of different types, i.e. heterogeneous) in parallel on one of two subsets of the dataset. Once the weak learners are trained, their predictions on the other subset are used to train a meta learner that combines them into a final prediction. In the test phase, each model predicts its label, and this set of labels is fed to the meta learner, which generates the final prediction. Focus on improving accuracy. Learned voting (meta-learning).
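The simple and weighted voting schemes can be sketched as follows, together with the resampling step typically used for bagging. This is a hedged illustration: the function names are hypothetical, the vote weights are arbitrary example numbers, and training the weak models themselves is left out.

```python
import random
from collections import Counter, defaultdict


def bootstrap_sample(data, rng):
    """Draw len(data) items at random with replacement (bagging-style resampling)."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Simple voting: the label predicted by the most models wins."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Weighted voting: each model's vote counts in proportion to its weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


if __name__ == "__main__":
    rng = random.Random(0)
    print(bootstrap_sample([1, 2, 3, 4, 5], rng))
    votes = ["a", "b", "a"]
    print(majority_vote(votes))                   # → a
    print(weighted_vote(votes, [0.2, 0.9, 0.3]))  # → b
```

The same three predictions give different answers under the two schemes, because one accurate model can outweigh two weaker ones. Stacking would replace the fixed weights with a trained meta learner.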