Machine Learning All-in-one

What is machine learning?

  • Arthur Samuel, 1959 (1901-1990): Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.
  • Tom Mitchell, 1997 (1951-): Well-posed learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
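
Mitchell's definition can be made concrete with a toy learner. The sketch below is only an illustration under made-up assumptions: the task T (label an integer 1 if it is at least 7), the four labeled examples that form the experience E, and the 1-nearest-neighbor learner are all hypothetical choices, and P is plain accuracy on a fixed test set.

```python
# Task T: classify integers 0..9 as 1 if x >= 7, else 0 (hypothetical task).
# Performance P: accuracy on a fixed test set.
# Experience E: a growing list of labeled examples.

def predict(train, x):
    """1-nearest-neighbor: copy the label of the closest training example."""
    nearest = min(train, key=lambda example: abs(example[0] - x))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test examples whose predicted label matches the true label."""
    correct = sum(predict(train, x) == y for x, y in test)
    return correct / len(test)

test_set = [(x, int(x >= 7)) for x in range(10)]
experience = [(0, 0), (9, 1), (6, 0), (7, 1)]  # hand-picked illustrative examples

# Measure P after seeing the first 1, 2, 3, and 4 examples of E.
learning_curve = [
    accuracy(experience[:n], test_set) for n in range(1, len(experience) + 1)
]
print(learning_curve)  # [0.7, 0.8, 0.9, 1.0]
```

With these hand-picked examples P rises at every step. On real data the curve is usually noisier and need not increase monotonically.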

Full Cycle of a ML project

  1. Define the project
  2. Define and collect data (data sampling) + data preprocessing
  3. Train the model: training, error analysis & iterative improvement -> loop between 2 and 3 until the model is good enough
  4. Deploy in production (Machine Learning Systems Design): deploy, monitor, and maintain the system -> go back to 3 and/or 2 if needed
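
The loop between steps 2 and 3 can be sketched in a few lines. Everything here is a hypothetical stand-in: the tiny dataset, the "every 4th example goes to validation" split rule, and the one-parameter threshold model are assumptions chosen to keep the example short, not a recommended setup.

```python
# Step 2: collected data as (feature, label) pairs; (4, 1) is a suspicious label.
data = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 1), (5, 0),
        (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1)]

def split(examples, holdout_every=4):
    """Hold out every `holdout_every`-th example for validation."""
    train = [ex for i, ex in enumerate(examples) if i % holdout_every != 0]
    val = [ex for i, ex in enumerate(examples) if i % holdout_every == 0]
    return train, val

def mistakes(threshold, examples):
    """Examples misclassified by the rule: predict 1 if x >= threshold."""
    return [(x, y) for x, y in examples if int(x >= threshold) != y]

def train_threshold(train):
    """Step 3 (training): pick the threshold with the fewest training errors."""
    candidates = sorted({x for x, _ in train})
    return min(candidates, key=lambda t: len(mistakes(t, train)))

train, val = split(data)
threshold = train_threshold(train)

# Step 3 (error analysis): inspect what the model gets wrong on held-out data.
val_mistakes = mistakes(threshold, val)
print(threshold, val_mistakes)  # 6 [(4, 1)]
```

The single validation error points at a questionable label rather than a modeling problem, which is the kind of finding that sends you back to step 2 before retraining.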

Types of Machine Learning

Pasted image 20231001114343.png|600

ML Algorithms Cheat Sheet

[[ML+Algorithms+Cheat+Sheet.pdf]]

Supervised Learning

Classification

Regression

Unsupervised Learning

Note

The distinction between unsupervised versus self-supervised learning can be blurry sometimes. Roughly:

  • Unsupervised learning attempts to learn representations without labels by not using any targets of any sort during training, e.g. by using correlations in activity between units.
  • Self-supervised learning attempts to learn representations without labels by using the data itself to generate targets, e.g. using the next word in a sentence as the target.
  • Put another way, self-supervised learning looks a lot like supervised learning in code, but there is one big difference: do you, as a machine learning researcher, actually have to ask someone to label the data or not?
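
To illustrate the "data generates its own targets" idea, here is a minimal sketch that turns raw text into (input, target) pairs for next-word prediction. The function name `next_word_pairs` and the `context_size` parameter are made up for this note, and whitespace splitting stands in for a real tokenizer.

```python
def next_word_pairs(sentence, context_size=2):
    """Build (context, target) training pairs from unlabeled text.

    The target is simply the word that follows the context window,
    so no human labeling is needed.
    """
    tokens = sentence.split()
    return [
        (tuple(tokens[i - context_size:i]), tokens[i])
        for i in range(context_size, len(tokens))
    ]

pairs = next_word_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
# ('the', 'cat') -> sat
# ('cat', 'sat') -> on
# ('sat', 'on') -> the
# ('on', 'the') -> mat
```

From here on the pairs can be fed to any supervised training loop, which is why the two settings look so similar in code.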

Density estimation

Anomaly detection

Clustering

| Clustering Model | Pros | Cons |
|---|---|---|
| K-Means | interpretability; works well on even-sized and globular-shaped data | need to predefine the number of clusters; not appropriate for outliers; low computation efficiency |
| Hierarchical Clustering | no need to predefine the number of clusters; high computation efficiency; works well on high-dimensional data | not appropriate for large data |
| DBSCAN | no need to predefine the number of clusters; can determine arbitrarily-shaped clusters; can detect outliers | unstable performance (sensitive to the density parameters) |
| Affinity Propagation | no need to predefine the number of clusters | low computation efficiency |
| Gaussian Mixture Model | highest computation efficiency; ensures clusters follow Gaussian distributions | not appropriate when there is insufficient data in each cluster |
| Spectral Clustering | high computation efficiency | need to predefine the number of clusters |
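
As a reference point for the K-Means row, here is a minimal sketch of Lloyd's algorithm, the standard iterative procedure behind K-Means, in pure Python. The sample points and the initial centroids are hypothetical. Passing the initial centroids in explicitly makes the "number of clusters must be predefined" limitation visible in the signature.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centroids = kmeans(points, centroids=[(1, 1), (9, 8)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

On two compact, well-separated blobs like these the algorithm converges in two passes. Results on real data depend on the initialization, which is why library implementations typically run several restarts.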

Association Rule Learning

Dimensionality Reduction

Outlier Detection

Ensemble Learning

Reinforcement Learning

Deep Learning

Important

  • deep learning = training large neural networks
  • deep learning has so far been most powerful in supervised learning
  • applications: advertising, computer vision, speech-to-text, machine translation, autonomous driving
  • Practical Aspects in Deep Learning

Model Selection & Improving

ML Model into production

Best Practice: Be careful about common issues

sample bias

data drift/shift: the data distribution shifts between training and production
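
One crude way to watch for drift on a single numeric feature is to compare the production mean against the training mean, measured in training standard deviations. This is a minimal sketch, not a standard test: the function name `drift_score`, the sample values, and the 0.5 alert threshold are all assumptions made for illustration.

```python
import statistics

def drift_score(train_values, prod_values):
    """Shift of the production mean, in units of the training std deviation."""
    train_mean = statistics.mean(train_values)
    train_std = statistics.stdev(train_values)
    return abs(statistics.mean(prod_values) - train_mean) / train_std

ALERT_THRESHOLD = 0.5  # hypothetical cut-off; tune per feature

train_feature = [10, 12, 11, 13, 9, 11, 10, 12]
prod_stable = [11, 10, 12, 11]   # same mean as training
prod_drifted = [15, 16, 14, 15]  # mean has moved up by 4

for name, batch in [("stable", prod_stable), ("drifted", prod_drifted)]:
    score = drift_score(train_feature, batch)
    print(name, round(score, 2), "ALERT" if score > ALERT_THRESHOLD else "ok")
```

A mean comparison only catches shifts in location. Changes in spread or shape need a distribution-level comparison, such as a two-sample test or a binned divergence measure.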

Endogeneity

the difference between correlation and causation

multicollinearity

Underfitting vs. Overfitting

Advanced Practice in ML