1948
Information Theory
Published "A Mathematical Theory of Communication" — the foundation of information theory.
Introduced entropy, the quantity underlying the cross-entropy loss used to train language models.
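A minimal sketch in plain Python of the two quantities (the distributions below are made up): cross-entropy is always at least the entropy, and matches it only when the model distribution equals the true one.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum p(x) * log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum p(x) * log2 q(x): the average number of
    bits needed to encode samples from p using a code optimized for q."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical next-token distributions over a 4-symbol vocabulary.
p_true  = [0.5, 0.25, 0.125, 0.125]   # the "real" distribution
q_model = [0.4, 0.3, 0.2, 0.1]        # a model's imperfect estimate

print(entropy(p_true))                 # 1.75 bits
print(cross_entropy(p_true, q_model))  # > 1.75 bits; equals H(p) only when q == p
```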
1986
Backpropagation Popularized
Published "Learning representations by back-propagating errors" in Nature.
Made backpropagation famous and laid the groundwork for the modern deep learning era.
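A minimal sketch of the idea, not the paper's formulation: backpropagation is the chain rule applied from the loss back through each layer, shown here on a tiny two-layer scalar network with made-up numbers.

```python
import math

# Forward pass through a tiny network: x -> w1*x -> tanh -> w2*h -> loss.
x, y_target = 1.5, 0.5          # hypothetical input and target
w1, w2 = 0.8, -1.2              # hypothetical weights

z = w1 * x                      # pre-activation
h = math.tanh(z)                # hidden activation
y = w2 * h                      # output
loss = 0.5 * (y - y_target) ** 2

# Backward pass: chain rule from the loss back to each weight.
dloss_dy = y - y_target
grad_w2 = dloss_dy * h
grad_w1 = dloss_dy * w2 * (1.0 - math.tanh(z) ** 2) * x

# A single gradient-descent step lowers the loss.
lr = 0.1
w1 -= lr * grad_w1
w2 -= lr * grad_w2
```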
1997
LSTM
"Long Short-Term Memory" dominated sequence modeling for 20 years until the Transformer replaced it.
2003
Neural Language Models
"A Neural Probabilistic Language Model" — the first successful neural language model.
Learned distributed word representations and predicted next words.
2012
AlexNet
Won ImageNet by a landslide using deep CNNs on GPUs.
Single-handedly reignited the deep learning revolution
and proved neural networks could work at scale.
2013
Word2Vec
"Efficient Estimation of Word Representations in Vector Space".
Showed that vector arithmetic captures semantic relationships
(king − man + woman ≈ queen).
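A sketch of that analogy arithmetic with tiny made-up vectors (not real trained embeddings), just to show the mechanics:

```python
import numpy as np

# Toy 4-d embeddings (illustrative values, not real word2vec vectors).
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.7]),
    "queen": np.array([0.9, 0.1, 0.8, 0.7]),
    "man":   np.array([0.1, 0.9, 0.1, 0.2]),
    "woman": np.array([0.1, 0.2, 0.8, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in ("king", "man", "woman")),
           key=lambda w: cosine(emb[w], target))
print(best)  # "queen"
```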
2015
Adam Optimizer
"Adam: A Method for Stochastic Optimization" — Combines momentum and RMSprop.
Became the most widely-used optimizer in deep learning.
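A sketch of the Adam update for a single parameter, using the paper's default hyperparameters; the function and variable names here are mine.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: a momentum-style first moment plus an RMSprop-style
    second moment, each bias-corrected for the early steps."""
    m = beta1 * m + (1 - beta1) * grad          # exponential average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # exponential average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Usage: keep (m, v) per parameter and call once per training step t = 1, 2, ...
p, m, v = 0.5, 0.0, 0.0
for t in range(1, 4):
    grad = 2 * p            # hypothetical gradient of loss = p**2
    p, m, v = adam_step(p, grad, m, v, t)
```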
2016
ResNet
"Deep Residual Learning for Image Recognition" — Won CVPR Best Paper.
Residual connections enable training of 100+ layer networks.
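A sketch of the core idea without convolutions or batch norm: each block computes a residual F(x) and adds the input back through a skip connection, so a very deep stack of blocks stays trainable. Shapes and weights below are illustrative.

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x): the skip connection lets a deep stack of blocks
    default to the identity, which keeps gradients flowing."""
    h = np.maximum(0.0, x @ W1)   # ReLU(x W1)
    return x + h @ W2             # add the input back (the shortcut path)

# Illustrative sizes: 100 blocks stacked on a 16-d activation.
rng = np.random.default_rng(0)
x = rng.normal(size=16)
for _ in range(100):
    W1 = rng.normal(scale=0.05, size=(16, 16))
    W2 = rng.normal(scale=0.05, size=(16, 16))
    x = residual_block(x, W1, W2)
print(x.shape)  # (16,); the identity path carries the signal through all 100 blocks
```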
2017
⭐ ATTENTION IS ALL YOU NEED
THE MOST IMPORTANT PAPER!
Introduced the Transformer architecture: Multi-head self-attention, positional encoding, encoder-decoder structure. No recurrence or convolution needed.
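A sketch of the paper's central operation, scaled dot-product attention, for a single head in numpy; multi-head attention runs several such heads in parallel and concatenates their outputs. Sizes and weights below are illustrative.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for one head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (T, T) token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # each token mixes the values it attends to

# Illustrative sizes: 4 tokens, model width 8, head size 8.
rng = np.random.default_rng(0)
T, d = 4, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```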
2018
GPT-1
"Improving Language Understanding by Generative Pre-Training" —
Decoder-only Transformer trained autoregressively. Pre-training + fine-tuning paradigm.
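A sketch of what decoder-only, autoregressive training means in practice: a causal mask hides future positions from the attention, and each position's training target is simply the next token. The token ids are made up.

```python
import numpy as np

T = 5  # sequence length

# Causal mask: position i may attend to positions 0..i only.
mask = np.tril(np.ones((T, T), dtype=bool))

# Applied to attention scores before the softmax: future positions get -inf.
scores = np.random.default_rng(0).normal(size=(T, T))
scores = np.where(mask, scores, -np.inf)

# Autoregressive targets: the label at each position is the next token.
tokens  = np.array([17, 4, 93, 4, 2])   # hypothetical token ids
inputs  = tokens[:-1]                   # the model sees tokens 0 .. T-2
targets = tokens[1:]                    # and must predict tokens 1 .. T-1
```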
BERT
"BERT: Pre-training of Deep Bidirectional Transformers" —
Bidirectional encoder using masked language modeling.
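A sketch of the masked-language-modeling objective: randomly chosen tokens are replaced by a [MASK] id and the model must recover them from context on both sides. Token ids are made up, and the paper's 80/10/10 replacement details are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 103                                    # illustrative [MASK] token id

tokens = np.array([17, 4, 93, 4, 2, 56, 8, 21])  # hypothetical token ids
mask = rng.random(tokens.shape) < 0.15           # BERT masks roughly 15% of positions

inputs = np.where(mask, MASK_ID, tokens)         # corrupted input the model sees
labels = np.where(mask, tokens, -100)            # -100 is just an "ignore" sentinel:
                                                 # the loss is computed only at masked spots
```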
2019
GPT-2
"Language Models are Unsupervised Multitask Learners" —
10× larger than GPT-1. Demonstrated zero-shot task transfer.
Generated surprisingly coherent text.
2020
GPT-3
"Language Models are Few-Shot Learners" — 100× larger than GPT-2.
Demonstrated in-context learning and few-shot learning.
Showed emergent abilities at scale.
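A sketch of what few-shot, in-context learning looks like from the outside: the demonstrations live entirely in the prompt and no weights are updated. The task and examples below are illustrative.

```python
# A few demonstrations are placed directly in the prompt; the model is asked
# to continue the pattern for the final, unanswered example.
prompt = """Translate English to French.

sea otter -> loutre de mer
cheese -> fromage
peppermint -> menthe poivrée
plush giraffe -> """
# A model like GPT-3 completes this text; the expected continuation
# is "girafe en peluche", learned from the pattern alone.
```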
Scaling Laws
"Scaling Laws for Neural Language Models" — Showed that loss decreases
predictably with more data, compute, and parameters.
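A sketch of the functional form the paper fits: loss follows a power law in model size (and similarly in data and compute). The constants below are illustrative placeholders, not the paper's fitted values.

```python
def loss_from_params(N, N_c=1e13, alpha=0.08):
    """L(N) = (N_c / N) ** alpha: loss falls smoothly and predictably as the
    parameter count N grows. N_c and alpha here are placeholder constants."""
    return (N_c / N) ** alpha

# Each 10x increase in parameters buys a small but reliable improvement.
for n_params in (1e8, 1e9, 1e10, 1e11):
    print(f"{n_params:.0e} params -> loss {loss_from_params(n_params):.3f}")
```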
2022
ChatGPT & RLHF
InstructGPT + ChatGPT launch. Reinforcement Learning from Human Feedback
(RLHF) aligned language models with human intent.
2023
GPT-4
Multimodal model accepting images and text. State-of-the-art performance
on reasoning benchmarks.