Welcome to Incels.is - Involuntary Celibate Forum


Elegance - complete GPT from training to inference in 200 lines

Knajjd

200 lines of code, with no libraries beyond the standard library.

From the author:
"This file contains the full algorithmic content of what is needed: dataset of documents, tokenizer, autograd engine, a GPT-2-like neural network architecture, the Adam optimizer, training loop, and inference loop. Everything else is just efficiency. I cannot simplify this any further."

Source:

Code + Timeline:

And here is the complete code:

Microgpt
 
What's even the purpose of this
 
It is considered remarkable for structural and conceptual reasons,
not performance.

1) It compresses a modern Transformer language model into a single,
dependency-free Python file. Most real systems rely on large
frameworks (e.g., PyTorch), GPU kernels, tensor libraries, and
distributed infrastructure. microgpt removes all of that and
still reproduces the essential algorithmic structure of GPT.

2) It implements its own automatic differentiation engine.
Backpropagation via the chain rule is central to deep learning.
Building a working autograd system from scratch — and making it
function correctly through a Transformer — is technically
non-trivial.

3) It reveals how compact the intellectual core of GPT models is.
Attention, embeddings, normalization, nonlinear layers,
cross-entropy loss, and Adam optimization are sufficient to
create a functioning generative model. The mathematics itself
is not enormous.

4) It is pedagogically powerful.
Every step — raw text → tokenization → embeddings →
attention → loss → gradient → parameter update — is visible
and inspectable. Modern frameworks often hide this structure.

5) It clarifies the role of scale.
The difference between microgpt and large production models
is not new mathematics. It is scale: more parameters,
more data, more compute, and heavy engineering optimization.
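To make point 2 concrete, here is a minimal sketch of a scalar autograd engine in plain Python. It is not the author's code: the class name Value, its fields, and the choice of supported operations (add, multiply, tanh) are assumptions made for illustration. Each value records its parents and the local derivative with respect to each parent, and backward() applies the chain rule in reverse topological order.

```python
import math


class Value:
    """A scalar that remembers how it was computed (hypothetical helper)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Build a topological order so each node is processed only after
        # every node that depends on it.
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)

        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for parent, local in zip(v._parents, v._local_grads):
                parent.grad += local * v.grad  # chain rule, accumulated


x = Value(2.0)
y = Value(3.0)
z = x * y + x      # dz/dx = y + 1 = 4, dz/dy = x = 2
z.backward()
print(z.data, x.grad, y.grad)  # 8.0 4.0 2.0
```

Note that x is used twice in z, which is why gradients are accumulated with += rather than assigned.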

In short, it is remarkable because it reduces a GPT-style
language model to a compact, readable artifact while preserving
the complete training and inference loop.
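The two ends of the pipeline in point 4 (raw text to tokens, and logits to loss) can be sketched in a few lines. This is an illustration under assumptions, not the file's actual code: the toy corpus "hello", the character-level vocabulary, and the helper names encode, decode, softmax and cross_entropy are all made up here.

```python
import math

text = "hello"                       # toy corpus (assumption)
vocab = sorted(set(text))            # ['e', 'h', 'l', 'o']
stoi = {ch: i for i, ch in enumerate(vocab)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s):
    """Map each character to its integer token id."""
    return [stoi[ch] for ch in s]


def decode(ids):
    """Map token ids back to a string."""
    return "".join(itos[i] for i in ids)


def softmax(logits):
    """Turn raw scores into probabilities; subtract the max for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target])


tokens = encode(text)                # [1, 0, 2, 2, 3]
# A model that knows nothing outputs equal logits, so its loss for
# predicting the second token is log(vocabulary size) = log(4).
uniform_loss = cross_entropy([0.0] * len(vocab), tokens[1])
print(tokens, decode(tokens), uniform_loss)
```

Training amounts to nudging the parameters that produce the logits so that this loss goes down, which is where the gradient and parameter-update steps come in.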
 
Very interesting. Thanks.

I want to start studying LLMs in depth. I'm curious about small LLMs such as llama2.c and others, and I'd like to analyze their code.
 
I found the link here which carries a lot of AI articles that the mainstream sites don't carry.



I posted this a few months back, which you might also find interesting. It's the story of how two failed techs, CUDA and neural nets, created AI when combined.

 
Are you the real knajjd?
 
