1948
Information Theory
Published "A Mathematical Theory of Communication" — the foundation of information theory.
Introduced entropy, the quantity underlying the cross-entropy loss used to train language models.
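A minimal sketch in plain Python of the two quantities (the distributions below are made up): cross-entropy is always at least the entropy, and matches it only when the model distribution equals the true one.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum p(x) * log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum p(x) * log2 q(x): the average number of
    bits needed to encode samples from p using a code optimized for q."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical next-token distributions over a 4-symbol vocabulary.
p_true  = [0.5, 0.25, 0.125, 0.125]   # the "real" distribution
q_model = [0.4, 0.3, 0.2, 0.1]        # a model's imperfect estimate

print(entropy(p_true))                 # 1.75 bits
print(cross_entropy(p_true, q_model))  # > 1.75 bits; equals H(p) only when q == p
```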
1986
Backpropagation Popularized
Published "Learning representations by back-propagating errors" in Nature.
Made backpropagation famous and laid the groundwork for the modern deep learning era.
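A minimal sketch of the idea, not the paper's formulation: backpropagation is the chain rule applied from the loss back through each layer, shown here on a tiny two-layer scalar network with made-up numbers.

```python
import math

# Forward pass through a tiny network: x -> w1*x -> tanh -> w2*h -> loss.
x, y_target = 1.5, 0.5          # hypothetical input and target
w1, w2 = 0.8, -1.2              # hypothetical weights

z = w1 * x                      # pre-activation
h = math.tanh(z)                # hidden activation
y = w2 * h                      # output
loss = 0.5 * (y - y_target) ** 2

# Backward pass: chain rule from the loss back to each weight.
dloss_dy = y - y_target
grad_w2 = dloss_dy * h
grad_w1 = dloss_dy * w2 * (1.0 - math.tanh(z) ** 2) * x

# A single gradient-descent step lowers the loss.
lr = 0.1
w1 -= lr * grad_w1
w2 -= lr * grad_w2
```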
1997
LSTM
"Long Short-Term Memory" dominated sequence modeling for 20 years until the Transformer replaced it.
2003
Neural Language Models
"A Neural Probabilistic Language Model" — the first successful neural language model.
Learned distributed word representations and predicted next words.
2012
AlexNet
Won ImageNet by a landslide using deep CNNs on GPUs.
Single-handedly reignited the deep learning revolution
and proved neural networks could work at scale.
2013
Word2Vec
"Efficient Estimation of Word Representations in Vector Space".
Showed that vector arithmetic captures semantic relationships
(king − man + woman ≈ queen).
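A sketch of that analogy arithmetic with tiny made-up vectors (not real trained embeddings), just to show the mechanics:

```python
import numpy as np

# Toy 4-d embeddings (illustrative values, not real word2vec vectors).
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.7]),
    "queen": np.array([0.9, 0.1, 0.8, 0.7]),
    "man":   np.array([0.1, 0.9, 0.1, 0.2]),
    "woman": np.array([0.1, 0.2, 0.8, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in ("king", "man", "woman")),
           key=lambda w: cosine(emb[w], target))
print(best)  # "queen"
```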
2015
Adam Optimizer
"Adam: A Method for Stochastic Optimization" — Combines momentum and RMSprop.
Became the most widely-used optimizer in deep learning.
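A sketch of the Adam update for a single parameter, using the paper's default hyperparameters; the function and variable names here are mine.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: a momentum-style first moment plus an RMSprop-style
    second moment, each bias-corrected for the early steps."""
    m = beta1 * m + (1 - beta1) * grad          # exponential average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # exponential average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Usage: keep (m, v) per parameter and call once per training step t = 1, 2, ...
p, m, v = 0.5, 0.0, 0.0
for t in range(1, 4):
    grad = 2 * p            # hypothetical gradient of loss = p**2
    p, m, v = adam_step(p, grad, m, v, t)
```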
2016
ResNet
"Deep Residual Learning for Image Recognition" — Won CVPR Best Paper.
Residual connections enable training of 100+ layer networks.
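A sketch of the core idea without convolutions or batch norm: each block computes a residual F(x) and adds the input back through a skip connection, so a very deep stack of blocks stays trainable. Shapes and weights below are illustrative.

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x): the skip connection lets a deep stack of blocks
    default to the identity, which keeps gradients flowing."""
    h = np.maximum(0.0, x @ W1)   # ReLU(x W1)
    return x + h @ W2             # add the input back (the shortcut path)

# Illustrative sizes: 100 blocks stacked on a 16-d activation.
rng = np.random.default_rng(0)
x = rng.normal(size=16)
for _ in range(100):
    W1 = rng.normal(scale=0.05, size=(16, 16))
    W2 = rng.normal(scale=0.05, size=(16, 16))
    x = residual_block(x, W1, W2)
print(x.shape)  # (16,); the identity path carries the signal through all 100 blocks
```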
2017
⭐ ATTENTION IS ALL YOU NEED
THE MOST IMPORTANT PAPER!
Introduced the Transformer architecture: Multi-head self-attention, positional encoding, encoder-decoder structure. No recurrence or convolution needed.
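A sketch of the paper's central operation, scaled dot-product attention, for a single head in numpy; multi-head attention runs several such heads in parallel and concatenates their outputs. Sizes and weights below are illustrative.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for one head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (T, T) token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # each token mixes the values it attends to

# Illustrative sizes: 4 tokens, model width 8, head size 8.
rng = np.random.default_rng(0)
T, d = 4, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```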
2018
GPT-1
"Improving Language Understanding by Generative Pre-Training" —
Decoder-only Transformer trained autoregressively. Pre-training + fine-tuning paradigm.
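A sketch of what decoder-only, autoregressive training means in practice: a causal mask hides future positions from the attention, and each position's training target is simply the next token. The token ids are made up.

```python
import numpy as np

T = 5  # sequence length

# Causal mask: position i may attend to positions 0..i only.
mask = np.tril(np.ones((T, T), dtype=bool))

# Applied to attention scores before the softmax: future positions get -inf.
scores = np.random.default_rng(0).normal(size=(T, T))
scores = np.where(mask, scores, -np.inf)

# Autoregressive targets: the label at each position is the next token.
tokens  = np.array([17, 4, 93, 4, 2])   # hypothetical token ids
inputs  = tokens[:-1]                   # the model sees tokens 0 .. T-2
targets = tokens[1:]                    # and must predict tokens 1 .. T-1
```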
BERT
"BERT: Pre-training of Deep Bidirectional Transformers" —
Bidirectional encoder using masked language modeling.
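A sketch of the masked-language-modeling objective: randomly chosen tokens are replaced by a [MASK] id and the model must recover them from context on both sides. Token ids are made up, and the paper's 80/10/10 replacement details are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 103                                    # illustrative [MASK] token id

tokens = np.array([17, 4, 93, 4, 2, 56, 8, 21])  # hypothetical token ids
mask = rng.random(tokens.shape) < 0.15           # BERT masks roughly 15% of positions

inputs = np.where(mask, MASK_ID, tokens)         # corrupted input the model sees
labels = np.where(mask, tokens, -100)            # -100 is just an "ignore" sentinel:
                                                 # the loss is computed only at masked spots
```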
2019
GPT-2
"Language Models are Unsupervised Multitask Learners" —
10× larger than GPT-1. Demonstrated zero-shot task transfer.
Generated surprisingly coherent text.
2020
GPT-3
"Language Models are Few-Shot Learners" — 100× larger than GPT-2.
Demonstrated in-context learning and few-shot learning.
Showed emergent abilities at scale.
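A sketch of what few-shot, in-context learning looks like from the outside: the demonstrations live entirely in the prompt and no weights are updated. The task and examples below are illustrative.

```python
# A few demonstrations are placed directly in the prompt; the model is asked
# to continue the pattern for the final, unanswered example.
prompt = """Translate English to French.

sea otter -> loutre de mer
cheese -> fromage
peppermint -> menthe poivrée
plush giraffe -> """
# A model like GPT-3 completes this text; the expected continuation
# is "girafe en peluche", learned from the pattern alone.
```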
Scaling Laws
"Scaling Laws for Neural Language Models" — Showed that loss decreases
predictably with more data, compute, and parameters.
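A sketch of the functional form the paper fits: loss follows a power law in model size (and similarly in data and compute). The constants below are illustrative placeholders, not the paper's fitted values.

```python
def loss_from_params(N, N_c=1e13, alpha=0.08):
    """L(N) = (N_c / N) ** alpha: loss falls smoothly and predictably as the
    parameter count N grows. N_c and alpha here are placeholder constants."""
    return (N_c / N) ** alpha

# Each 10x increase in parameters buys a small but reliable improvement.
for n_params in (1e8, 1e9, 1e10, 1e11):
    print(f"{n_params:.0e} params -> loss {loss_from_params(n_params):.3f}")
```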
2022
ChatGPT & RLHF
InstructGPT + ChatGPT launch. Reinforcement Learning from Human Feedback
(RLHF) aligned language models with human intent.
2023
GPT-4
Multimodal model accepting images and text. State-of-the-art performance
on reasoning benchmarks.