Building a Language Model completely from scratch: Tech Tree
A tech tree from Numpy to a modern Language Model
Keeping up with AI is hard
As you might have heard (from every single tech outlet), AI is developing at a rapid pace and it can be hard to keep up. If a new technology or breakthrough comes out every week, how can anyone possibly stay on top of everything?
Part of the problem is that the recent breakthroughs are built on a foundation of technologies and ideas that have been developed for decades. If you miss a link in the chain, then everything that depends on it becomes harder to understand.
There are some great resources (like Andrej Karpathy's excellent nanoGPT) that detail every step you need to follow to build a modern deep learning model, but they always leave me feeling like I am missing something. Whether it is the reason why a particular algorithm works so well or the choice of a specific hyperparameter, I often get the sense that I don’t quite understand everything.
To figure out where the gaps in my knowledge are, I am building a language model completely from scratch. I want to go from nothing to a fully functional coding assistant.
I know my starting point (Python with Numpy installed) and I know the end point (a functioning code assistant) but what do I put in the middle?
Tech Tree
In the game Civilization V, you start off as a primitive civilisation that can't do much more than farm. Over the course of a game, you unlock progressively more powerful technologies: first the wheel, then steam power and eventually giant death robots. This is done through a Tech Tree, which is a list of all of the technologies that your civilisation can discover.
You can spend points to unlock technologies but there is a catch: each more advanced technology depends on simpler technologies and you can only unlock a technology if you have already unlocked all of its precursors. The plan is to do the same with developments in AI. I can only use a technology if I have unlocked it.
The Rules
A technology is unlocked if the following checklist is complete:
- It has a description of what it does and how it does it
- Any technologies it depends on (prerequisites) have also been unlocked
- I have written my own implementation of the technology
- My implementation has some unit tests that pass
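To make the checklist concrete, here is a minimal sketch of how it could be checked in code. Everything in it is an assumption made for illustration: the `Technology` class, its field names, the `is_unlocked` helper and the "Tokeniser" entry are hypothetical, and the tree is assumed to have no circular dependencies.

```python
from dataclasses import dataclass


@dataclass
class Technology:
    """One node in the tech tree (all field names are illustrative)."""
    name: str
    description: str = ""
    prerequisites: tuple = ()          # names of technologies this one depends on
    has_own_implementation: bool = False
    tests_pass: bool = False


def is_unlocked(name, tree):
    """Return True if `name` satisfies every item on the checklist.

    Prerequisites are checked recursively, so this assumes the tree
    contains no cycles.
    """
    tech = tree[name]
    return (
        bool(tech.description)                                      # has a description
        and all(is_unlocked(p, tree) for p in tech.prerequisites)   # prerequisites unlocked
        and tech.has_own_implementation                             # own implementation written
        and tech.tests_pass                                         # unit tests pass
    )


# Hypothetical two-node tree. "Tokeniser" is a made-up example entry,
# and the flag values are placeholders rather than real project status.
tree = {
    "Python + Numpy": Technology(
        name="Python + Numpy",
        description="The starting point: Python with Numpy installed.",
        has_own_implementation=True,
        tests_pass=True,
    ),
    "Tokeniser": Technology(
        name="Tokeniser",
        description="Placeholder description.",
        prerequisites=("Python + Numpy",),
    ),
}

print(is_unlocked("Python + Numpy", tree))
print(is_unlocked("Tokeniser", tree))
```

In this sketch the second technology stays locked until it has both an implementation and passing tests, even though its prerequisite is already unlocked.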
In return, once I've unlocked a particular technology, I'm allowed to use someone else's implementation in future work. The goal of the project is understanding so, rather than spending hours profiling and optimising every implementation, I'll just borrow someone else's.
That said, for some technologies (for example, using a KV cache with attention during inference), the important thing about them is how they are optimised. In these cases, I think it will be helpful to spend some time understanding how and why the optimisation was done.
Finally, when I'm writing up a given technology, I'll include links at the bottom of the post to its prerequisites as well as any technologies that it leads to. This should build up a nice map of how everything links together.
Without further ado, the first technology is: Python + Numpy