# Grokking Transformers 0/4 - series intro

Get a **deep intuition + understanding** of the **logic, math & code** behind a transformer neural network.

> define:**grok** /ɡrɒk/ verb | informal | US
> - to understand intuitively or by empathy, to establish rapport with
> - **"When you grok something, you just get it — in other words, you totally grasp its meaning."**
> - **"When you claim to 'grok' some knowledge or technique, you are asserting that you have not merely learned it in a detached instrumental way but that it has become part of you, part of your identity."**

# Series intro

What everyone started to see with the *ChatGPT hype* and everything around it, going as far as the recent resurgence of the term AGI, was only the tip of the iceberg of what was happening in the field of Deep Learning (a sub-field of Machine Learning (ML), itself a sub-field of AI). A *new neural-network architecture, the Transformer*, was revolutionizing the field: unifying the different ML sub-fields (like computer vision, natural language etc.) under a shared architecture, so that optimizations made by experts in one field became applicable to all the others, while at the same time making the training of networks with billions of parameters feasible ("cheap enough" + useful / profitable).

I'm ashamed to admit I fully jumped into the Machine Learning Engineer role a few years *after* the seminal "Attention Is All You Need" (2017) paper, and for years I lived with just a "bare minimum" understanding of this architecture - along the lines of _"oh, uhm, you have those embeddings, yeah attention matrix, uhm, yeah, that makes sense too ...auto-regressive so uhm it sort of can behave like an RNN too ...the code, uhm, that's actually simpler than I expected, uhm, yeah that and that make sense too"_. I knew it went deeper than this when I saw the success of transformers in computer vision, so going deeper was on my to-do list. And I basically procrastinated the f out of it until the ChatGPT madness and the "Sparks of AGI" (2023) "paper" started to ring some alarm sirens in the back of my head. As luck would have it, by this time the web was flooded with _really good explanations of the concepts._

After feeling I had gotten far enough myself, I even held an internal workshop/training for our own AI/ML/Data-science team (and others interested from our now big-ish corporate landscape) - _it was far from a success, with horrid time management (the first try only managed to cover 30% of the content, and I held a separately requested sub-workshop for the actual hands-on part)_. And in doing it I realized **how many far better ways of explaining things I had totally missed.**

Anyway, here it is - a totally re-imagined v2 (or 3?) of an **introduction to the transformer architecture** that is meant to be both **deep** AND **wide**:

* from **basic intuitions to proper understanding of the math and logic** (part 1)
* moving on to **fully coding a baby-GPT transformer text generative model "from scratch"** (part 2)
* proceeding to **understanding the usage of transformers in computer vision, speech-to-text, graphs etc. and of multi-modal transformer models** (part 3)
* then **ascending to higher intuitions about: emergence, meta-learning, transformers as a general differentiable architecture amongst the likes of Neural GPUs and Neural Turing Machines**, and speculation about what could be next (part 4)

We'll go **waaay beyond / deeper than** what other sources cover in terms of intuition, diving from the get-go into notions like "attention blocks understood as communication followed by computation" (not just the basic "attention in terms of search with queries and keys etc.") or "graphs being the 'native' input for a transformer" and its consequences (like the "obvious" success of applying this to problems outside text / natural language).