Blogmark
The Unreasonable Effectiveness of Recurrent Neural Networks
via jbranchaud@gmail.com
I recently started reading How to Build a Large Language Model (from scratch). Early in the introduction, the book mentions that the breakthroughs behind the current surge in LLMs are relatively recent. A 2017 paper, Attention Is All You Need, from a group of researchers including several at Google Brain, introduced major improvements over RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks). That paper introduced the Transformer architecture, which uses self-attention to enable two important things. First, it allows for better recall of long-range dependencies as input is processed (which addresses the vanishing gradient problem). Second, it overcomes the sequential-processing limitation of RNNs, allowing the input to be processed in parallel.
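To make that contrast concrete, here's a minimal NumPy sketch (my own, not from the book or the paper): the RNN-style pass has to walk the sequence one step at a time through a hidden state, while the self-attention pass relates every position to every other with a few matrix multiplications. All the dimensions and weight matrices here are made-up toy values.

```python
import numpy as np

# Toy dimensions, purely for illustration.
seq_len, d_model = 6, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))  # one sequence of 6 token embeddings

# --- RNN-style processing: inherently sequential ---
# Each hidden state depends on the previous one, so the loop can't be
# parallelized across time steps, and long-range information has to survive
# many repeated transformations (where gradients tend to vanish).
W_xh = rng.normal(size=(d_model, d_model)) * 0.1
W_hh = rng.normal(size=(d_model, d_model)) * 0.1
h = np.zeros(d_model)
rnn_states = []
for t in range(seq_len):
    h = np.tanh(x[t] @ W_xh + h @ W_hh)
    rnn_states.append(h)

# --- Self-attention: every position attends to every other at once ---
# The whole sequence is handled with a few matrix multiplications, so
# position 5 can "see" position 0 directly instead of through 5 recurrent hops.
W_q = rng.normal(size=(d_model, d_model)) * 0.1
W_k = rng.normal(size=(d_model, d_model)) * 0.1
W_v = rng.normal(size=(d_model, d_model)) * 0.1
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d_model)                 # (seq_len, seq_len)
scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
attended = weights @ V                              # (seq_len, d_model)
```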
Anyway, all of this had me feeling like it would be nice to know a bit more about Recurrent Neural Networks. A top search result I turned up was this Andrej Karpathy blog post from 2015 (two years before the Transformer paper mentioned above). It goes into a good amount of detail about how RNNs, and LSTMs in particular, work, and it has nice diagrams. You might need to brush up on your linear algebra to appreciate the whole thing.
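For reference while reading, here's roughly what a single LSTM step looks like, again as a NumPy sketch. These are the standard LSTM gate equations rather than anything lifted verbatim from Karpathy's post, and the names (lstm_step, W, U, b) are my own.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: gates decide what to forget, what to write, what to expose."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])  # forget gate
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])  # input gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])  # output gate
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])  # candidate cell contents
    c = f * c_prev + i * g       # cell state: the mostly-additive "memory" path
    h = o * np.tanh(c)           # hidden state passed on to the next time step
    return h, c

# Toy setup just to show the shapes involved.
d_in, d_hidden = 4, 5
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(d_hidden, d_in)) * 0.1 for k in "fiog"}
U = {k: rng.normal(size=(d_hidden, d_hidden)) * 0.1 for k in "fiog"}
b = {k: np.zeros(d_hidden) for k in "fiog"}

h = c = np.zeros(d_hidden)
for x in rng.normal(size=(3, d_in)):   # three time steps of made-up input
    h, c = lstm_step(x, h, c, W, U, b)
```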