Machine Learning All-in-one

What is machine learning?

Arthur Samuel 1949 (1901 -1990): Machine Learning is a field of study that gives Computers the ability to learn without being explicitly programmed.
Tom Mitchell 1997 (1951 -): Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if it is performance on T, as measured by P, improves with experience E.

Full Cycle of a ML project

define project
define and collect data (Data Sampling) + Data preprocessing
train model: training, Error analysis & iterative improvement -> loop between 2 and 3 until your model is done
deploy in production (Machine Learning Systems Design): deploy, monitor, and maintain system -> back to 3 and/or 2 if needed

Types of Machine Learning

ML Algorithms Cheat Sheet

[[ML+Algorithms+Cheat+Sheet.pdf]]

Supervised Learning

Classification

linear
- Logistic Regression
- Support Vector Machine SVM: Support Vector X#Support Vector Machine SVM
non-linear
- Kernel SVM: Support Vector X#Kernels SVM
- K-Nearest Neighbor (k-NN)
- Naive Bayes
- Decision Tree Classification: Decision Tree & Random Forest#Decision Tree
- Random Forest Classification: Decision Tree & Random Forest#Random Forest
Pros and Cons
Multi-class vs. Multi-label Classification

Regression

Types
- Linear Regression
- Polynomial Regression
- Regularized regression
  - Lasso regression: Cost Functions#^ed4ab0
  - Ridge regression: Cost Functions#^aef45a
- Support Vector Regression (SVR) Support Vector X#^46e7f9
- Decision Tree Regression: Decision Tree & Random Forest#Decision Tree
- Random Forest Regression: Decision Tree & Random Forest#Random Forest
Pros and Cons

Unsupervised Learning

Note

The distinction between unsupervised versus self-supervised learning can be blurry sometimes. Roughly:

Unsupervised learning attempts to learn representations without labels by not using any targets of any sort during training, e.g. by using correlations in activity between units.
Self-supervised learning attempts to learn representations without labels by using the data itself to generate targets, e.g. generating targets using the next word in a sentence
Put another way, self-supervised learning looks a lot like supervised learning in code, but there is a big difference related to the following question: do you as a machine learning researcher have to actually ask someone to label the data or not.

Density estimation

Anomaly detection

Clustering

Types
- Centroid-based Clustering: K-Means Clustering
- Connectivity-based Clustering: Hierarchical Clustering
- Density-based Clustering: DBSCAN
- Graph-based Clustering: Affinity Propagation
- Distribution-based Clustering: Gaussian Mixture Model
- Compression-based Clustering: Spectral Clustering
Pros and Cons

Clustering Model	Pros	Cons
K-Means	interpretability; works well on even-sized and globular-shaped data	need to predefine the number of clusters; not appropriate for outliers; low computation efficiency
Hierarchical Clustering	no need to predefine the number of cluster; high computation efficiency; works well on high dimensional data	not appropriate for large data
DBSCAN	no need to predefine the number of cluster; can determine arbitrarily-shaped clusters; can detect outlier	unstable performance (sensitive to density units parameter)
Affinity Propagation	no need to predefine the number of cluster	low computation efficiency
Gaussian Mixture Model	highest computation efficiency; ensure clusters to follow Gaussian distributions	not appropriate when insufficient data in each cluster
Spectral Clustering	high computaion efficiency	need to predefine the number of clusters

Association Rule Learning

Apriori: Association Rule Learning#Apriori
Eclat: Association Rule Learning#Eclat

Dimensionality Reduction

Feature Selection
Feature Extraction: transform datasets with more than 3 features (high-dimensional) into typically a 2D or 3D
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Kernel PCA
- Quadratic Discriminant Analysis (QDA)
- T-Distributed Stochastic Neighbor Embedding (t-SNE, Dimensionality Reduction#^53a070)
- Uniform manifold approximation and projection (UMAP): similarly to t-SNE, but also for general non-linear dimension reduction
- Autoencoders

Outlier Detection

Ensemble Learning

Bagging: Ensemble Learning#Bagging (Bootstrap Aggregating)
Boosting: Ensemble Learning#Boosting

Reinforcement Learning

Decision making
- Q-Learning: Reinforcement Learning#Q-Learning
- R Learning
- TD Learning
Upper Confidence Bound: Reinforcement Learning#Upper Confidence Bound
Thompson Sampling: Reinforcement Learning#Thompson Sampling

Deep Learning

Important

deep learning = training large neural network
deep learning is most powerful in supervised learning
applications: Advertisement, Images vision, Audio to Text, Machine translation, Autonomous Driving
Practical Aspects in Deep Learning

Model Selection & Improving

Cross-Validation
tuning on parameters/Hyperparameter Tuning
- randomized search
- grid search
Ensemble Learning

ML Model into production

Machine Learning Systems Design

Best Practice: Be careful about common issues

sample bias

selection bias
response bias
real-business scenario: activity bias (social media content), societal bias (human generated content), selection bias (generated by the model itself to form a feedback loop)

data drift/shift: data distribution shifts

covariant drift: data distribution of independent variables shifts
label/prior drift:
- data distribution of target variables shifts (the output distribution changes but, for a given output, the input distribution stays the same)
concept/posterior drift:
- the definition of label itself changes based on a particular feature (the input distribution remains the same but the conditional distribution of the output given an input changes)
general data distribution shifts
- feature change
- label schema change

Endogeneity

= a situation where there is a correlation between the predictor variables and the error term in a statistical model.
e.g. Rich becomes richer, poor becomes poorer.

Machine Learning All-in-one

Full Cycle of a ML project

Types of Machine Learning

ML Algorithms Cheat Sheet

Supervised Learning

Classification

Regression

Unsupervised Learning

Density estimation

Anomaly detection

Clustering

Association Rule Learning

Dimensionality Reduction

Outlier Detection

Ensemble Learning

Reinforcement Learning

Deep Learning

Model Selection & Improving

ML Model into production

Best Practice: Be careful about common issues

sample bias

data drift/shift: data distribution shifts

Endogeneity

the difference between correlation and causation

multicollinearity

Underfitting vs. Overfitting

Advanced Practice in ML