Arguably, the main reason that deep nets became so powerful is self-supervision. In many domains, from images to text to DNA analysis, self-supervision alone was enough to generate practically infinite “labelled” data for training deep models. The idea is simple yet extremely powerful: just hide some parts of the (unlabelled) data and turn the hidden parts into the labels to predict.
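A minimal sketch of the idea (a toy example of my own, not from any particular paper): take an unlabelled token sequence, hide one token at a time, and use each hidden token as a label.

```python
def make_self_supervised_pairs(tokens, mask="<MASK>"):
    """For each position, hide that token and use it as the label."""
    pairs = []
    for i in range(len(tokens)):
        hidden = tokens[:i] + [mask] + tokens[i + 1:]  # input with one token hidden
        pairs.append((hidden, tokens[i]))              # (corrupted input, label)
    return pairs

sentence = ["deep", "nets", "love", "data"]
pairs = make_self_supervised_pairs(sentence)
# e.g., pairs[0] == (["<MASK>", "nets", "love", "data"], "deep")
```

No human annotation is needed: every unlabelled sequence yields as many training pairs as it has tokens.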
Here are some notes (mostly to myself) about self-supervision.
There are two standard ways to do self-supervision: auto-regression and de-noising. Auto-regression typically involves a causal model, i.e., we aim to predict the next word given the previous words, without using words “from the future” (e.g., if we predict the nth word of a sentence, we have access only to the first n-1 words). De-noising typically does not rely on a causal model. An example of de-noising would be taking an image, hiding or adding noise to some part of it, and then predicting the noise-free (i.e., original) image.
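The two recipes can be sketched on a toy sentence (the variable names and the 30% masking rate below are my own choices for illustration):

```python
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Auto-regression (causal): the target at step n is token n, and the
# model may only look at the first n-1 tokens -- no words "from the future".
ar_pairs = [(tokens[:n], tokens[n]) for n in range(1, len(tokens))]
# e.g., ar_pairs[1] == (["the", "cat"], "sat")

# De-noising (non-causal): corrupt some positions at random, then predict
# the whole clean sequence; the model may look at every surviving token.
random.seed(0)
corrupted = [t if random.random() > 0.3 else "<MASK>" for t in tokens]
dn_pair = (corrupted, tokens)  # (noisy input, clean target)
```

Note the asymmetry: auto-regression yields one prediction per prefix, while de-noising yields one corrupted-input/clean-output pair per corruption draw.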
More to be added to this post:
- Discrete vs. Continuous (i.e., the power of embedding)
- Conv-nets vs. Attention-based nets
- Probabilistic vs. deterministic
- Latent space vs. original space