The History of Artificial Intelligence

The Art of GPT

From Information Theory to GPT in a 200-line microGPT.
Everything else is just efficiency. - Karpathy

1948
Information Theory
Claude Shannon
Published "A Mathematical Theory of Communication" — the foundation of information theory. Introduced entropy and cross-entropy concepts.
1986
Backpropagation Popularized
Rumelhart, Hinton, Williams
Published "Learning representations by back-propagating errors" in Nature. Made backpropagation famous and started the modern deep learning era.
1997
LSTM
Sepp Hochreiter & Jürgen Schmidhuber
"Long Short-Term Memory" dominated sequence modeling for 20 years until the Transformer replaced it.
2003
Neural Language Models
Yoshua Bengio et al.
"A Neural Probabilistic Language Model" — First successful neural language model. Learned distributed word representations and predicted next words.
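The recipe is still recognizable in GPT: look up an embedding for each context word, feed the concatenation through a hidden layer, and softmax over the vocabulary to score the next word. A sketch with toy sizes (vocabulary, dimensions, weights, and word ids are all made up):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, emb_dim, context, hidden = 50, 8, 3, 16        # toy sizes

    E = rng.normal(size=(vocab, emb_dim)) * 0.1           # word embedding table
    W1 = rng.normal(size=(context * emb_dim, hidden)) * 0.1
    W2 = rng.normal(size=(hidden, vocab)) * 0.1

    def next_word_probs(context_ids):
        x = E[context_ids].reshape(-1)                    # concatenate context embeddings
        h = np.tanh(x @ W1)                               # hidden layer
        logits = h @ W2
        p = np.exp(logits - logits.max())
        return p / p.sum()                                # softmax over the vocabulary

    probs = next_word_probs([4, 17, 32])                  # made-up word ids
    print(probs.argmax(), round(probs.sum(), 6))          # most likely next id, 1.0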
2012
AlexNet
Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton
Won ImageNet by a landslide using deep CNNs on GPUs. Single-handedly reignited the deep learning revolution and proved neural networks could work at scale.
2013
Word2Vec
Tomas Mikolov et al. (Google)
"Efficient Estimation of Word Representations in Vector Space". Showed that vector arithmetic captures semantic relationships (king − man + woman ≈ queen).
2015
Adam Optimizer
Diederik Kingma & Jimmy Ba
"Adam: A Method for Stochastic Optimization" — Combines momentum and RMSprop. Became the most widely-used optimizer in deep learning.
2016
ResNet
Kaiming He et al.
"Deep Residual Learning for Image Recognition" — Won CVPR Best Paper. Residual connections enable training of 100+ layer networks.
2017
"Attention Is All You Need"
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin
Arguably the most important paper of the modern deep learning era.
Introduced the Transformer architecture: Multi-head self-attention, positional encoding, encoder-decoder structure. No recurrence or convolution needed.
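At the centre is scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V; multi-head attention runs several of these in parallel on learned projections and concatenates the results. A single-head NumPy sketch (sequence length, width, and projections are made up for illustration):

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)      # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)            # how strongly each position attends to each other
        return softmax(scores) @ V                 # weighted mix of the value vectors

    rng = np.random.default_rng(0)
    T, d_model = 5, 16                             # toy sequence length and model width
    x = rng.normal(size=(T, d_model))              # token representations
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
    out = attention(x @ Wq, x @ Wk, x @ Wv)
    print(out.shape)                               # (5, 16)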
2018
GPT-1
Alec Radford et al. (OpenAI)
"Improving Language Understanding by Generative Pre-Training" — Decoder-only Transformer trained autoregressively. Pre-training + fine-tuning paradigm.
BERT
Jacob Devlin et al. (Google)
"BERT: Pre-training of Deep Bidirectional Transformers" — Bidirectional encoder using masked language modeling.
2019
GPT-2
Alec Radford et al. (OpenAI)
"Language Models are Unsupervised Multitask Learners" — 10× larger than GPT-1. Demonstrated zero-shot task transfer. Generated surprisingly coherent text.
2020
GPT-3
Tom Brown et al. (OpenAI)
"Language Models are Few-Shot Learners" — 100× larger than GPT-2. Demonstrated in-context learning and few-shot learning. Showed emergent abilities at scale.
Scaling Laws
Jared Kaplan et al. (OpenAI)
"Scaling Laws for Neural Language Models" — Showed that loss decreases predictably with more data, compute, and parameters.
2022
ChatGPT & RLHF
OpenAI
InstructGPT + ChatGPT launch. Reinforcement Learning from Human Feedback (RLHF) aligned language models with human intent.
2023
GPT-4
OpenAI
Multimodal model accepting images and text. State-of-the-art performance on reasoning benchmarks.