Dimensionality Reduction
There are two types of Dimensionality Reduction techniques:
- Feature selection
- Feature extraction
Feature Selection
- Backward Elimination, Forward Selection, Bidirectional Elimination, Score Comparison
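As a quick illustration (not from the original notes), forward selection and backward elimination can be sketched with scikit-learn's SequentialFeatureSelector; the iris dataset and logistic-regression estimator are arbitrary placeholders:

```python
# Minimal sketch of forward selection and backward elimination.
# The dataset and estimator are placeholders, not prescribed by the notes.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

# Forward selection: start with no features, greedily add the most helpful one.
forward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="forward"
).fit(X, y)

# Backward elimination: start with all features, greedily drop the weakest one.
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="backward"
).fit(X, y)

print("forward keeps:", forward.get_support())
print("backward keeps:", backward.get_support())
```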
Feature Extraction
PCA vs. LDA
- PCA captures the variability; LDA captures the class separation
- PCA is unsupervised; LDA is supervised (it uses the dependent variable, i.e. the class labels)
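A minimal sketch of the supervised/unsupervised difference, assuming scikit-learn and the iris dataset as a stand-in example:

```python
# PCA vs. LDA in scikit-learn: note that only LDA receives the labels y.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA is unsupervised: it only looks for directions of maximum variance in X.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: it uses y to find directions that separate the classes.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # both reduced to (150, 2)
```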
How LDA works exactly
- it creates new axes to maximize the class separation, step by step (2-class example):
  - maximize the distance between the two class means
  - minimize the variation (which LDA calls "scatter") within each class
  - compute the ratio $d^2 / (s_1^2 + s_2^2)$ (squared distance between the means over the summed within-class scatter); LDA should maximize it
  - if there are more than 2 classes, e.g. 3 classes, the value to be maximized becomes $(d_1^2 + d_2^2 + d_3^2) / (s_1^2 + s_2^2 + s_3^2)$, where each $d_i$ is the distance from class mean $i$ to the central point of all the data
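To make the 2-class ratio above concrete, here is a toy numeric check (the 1-D projected values are made up for illustration; real LDA implementations typically define scatter as a sum of squared deviations rather than a variance, but the idea is the same):

```python
import numpy as np

# Made-up 1-D projections of two classes onto a candidate LDA axis.
class1 = np.array([1.0, 1.5, 2.0, 2.5])
class2 = np.array([5.0, 5.5, 6.0, 6.5])

d2 = (class1.mean() - class2.mean()) ** 2   # squared distance between class means
s1_sq = class1.var()                        # within-class "scatter" of class 1
s2_sq = class2.var()                        # within-class "scatter" of class 2

ratio = d2 / (s1_sq + s2_sq)                # the quantity LDA tries to maximize
print(f"d^2 = {d2:.2f}, s1^2 + s2^2 = {s1_sq + s2_sq:.3f}, ratio = {ratio:.1f}")
```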
t-SNE
- = t-Distributed Stochastic Neighbor Embedding
- t-SNE takes high-dimensional data and reduces it to a low-dimensional representation (typically 2-D) suitable for plotting
- Unlike PCA (which is linear), t-SNE can reduce dimensions with non-linear relationships (such as “Swiss Roll” non-linear distribution)
- it calculates a similarity measure based on the distance between points instead of trying to maximize variance.
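A minimal t-SNE sketch, assuming scikit-learn and using the 64-dimensional digits dataset as an arbitrary example:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# perplexity loosely controls how many neighbors each point's similarity
# measure pays attention to; 30 is a common default.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(X.shape, "->", X_2d.shape)  # (1797, 64) -> (1797, 2)
```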