Blogmark
Build a Large Language Model (From Scratch) - Sebastian Raschka
via jbranchaud@gmail.com
A lot of the examples used to build up the concepts behind LLMs use vectors somewhere in the range of 2 to 10 dimensions. The dimensionality of real-world models is much higher:
The smallest GPT-2 models (117M and 125M parameters) use an embedding size of 768 dimensions... The largest GPT-3 model (175B parameters) uses an embedding size of 12,288 dimensions.
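To make those numbers concrete, here is a minimal sketch of a 768-dimensional token embedding layer (in PyTorch, which the book uses; the token IDs below are arbitrary placeholders, not real GPT-2 output):

```python
import torch

torch.manual_seed(123)

vocab_size = 50_257   # GPT-2's BPE vocabulary size
emb_dim = 768         # embedding size of the smallest GPT-2 models

# Each token ID gets mapped to its own 768-dimensional vector
token_embedding = torch.nn.Embedding(vocab_size, emb_dim)

token_ids = torch.tensor([464, 3290, 318])   # example token IDs
vectors = token_embedding(token_ids)
print(vectors.shape)  # torch.Size([3, 768]) -- one 768-dim vector per token
```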
The Self-Attention Mechanism:
A key component of transformers and LLMs is the self-attention mechanism, which allows the model to weigh the importance of different words or tokens in a sequence relative to each other. This mechanism enables the model to capture long-range dependencies and contextual relationships within the input data, enhancing its ability to generate coherent and contextually relevant output.
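Here is a rough single-head sketch of that weighting step (assuming PyTorch, with toy dimensions and no masking, multiple heads, or dropout; this is not the book's exact implementation, just the core idea):

```python
import torch

torch.manual_seed(123)

seq_len, d_in, d_out = 6, 3, 2           # 6 tokens, toy dimensions
inputs = torch.rand(seq_len, d_in)       # stand-in token embeddings

W_query = torch.nn.Parameter(torch.rand(d_in, d_out))
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out))
W_value = torch.nn.Parameter(torch.rand(d_in, d_out))

queries = inputs @ W_query
keys    = inputs @ W_key
values  = inputs @ W_value

# Each token's query is compared against every token's key...
attn_scores = queries @ keys.T                         # shape (6, 6)
# ...and softmax turns the scaled scores into weights that sum to 1 per row.
attn_weights = torch.softmax(attn_scores / d_out**0.5, dim=-1)

# Each token's output is a weighted mix of all tokens' values, which is how
# context from anywhere in the sequence flows into each position.
context_vectors = attn_weights @ values                # shape (6, 2)
print(context_vectors.shape)
```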
Variants of the transformer architecture:
- BERT (Bidirectional Encoder Representations from Transformers) - "designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers."
- GPT (Generative Pretrained Transformers) - "primarily designed and trained to perform text completion tasks."
BERT receives input where some words are missing and then attempts to predict the most likely word to fill each of those blanks. The original text can then be used to provide feedback on the model's predictions during training.
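A tiny illustrative sketch of that setup (not BERT's actual preprocessing; the sentence and `[MASK]` placeholders are just for illustration):

```python
# Hide some tokens, ask the model to fill them in, and score its guesses
# against the original text -- the original tokens at the masked positions
# become the training targets.
original = ["the", "cat", "sat", "on", "the", "mat"]
masked   = ["the", "[MASK]", "sat", "on", "the", "[MASK]"]

targets = {
    i: tok
    for i, (tok, m) in enumerate(zip(original, masked))
    if m == "[MASK]"
}
print(targets)  # {1: 'cat', 5: 'mat'}
```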
GPT models are pretrained using self-supervised learning on next-word prediction tasks.
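In other words, the labels come for free from the text itself: the targets are just the inputs shifted one position to the right. A minimal sketch (with made-up token IDs):

```python
token_ids = [40, 367, 2885, 1464, 1807]   # arbitrary example token IDs

inputs  = token_ids[:-1]   # [40, 367, 2885, 1464]
targets = token_ids[1:]    # [367, 2885, 1464, 1807] -- the next token at each step

for x, y in zip(inputs, targets):
    print(f"given {x} -> predict {y}")
```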
Foundation models are called that because they are generalized by their pretraining and can then be fine-tuned afterward for specific tasks.