LLM 101

August Fu edited this page Aug 30, 2023 · 4 revisions

The essential idea behind a language model is its ability to complete a sentence by predicting the most probable next word. For example, given the words "once upon a", the model assigns a probability to each word that could follow (for example, [time: 0.834, day: 0.107, year: 0.021, ...]) and, in the simplest case, chooses the one with the highest probability.
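To make the selection step concrete, here is a minimal sketch in Python. It assumes the model has already produced a raw score for each candidate word; the scores, the `toy_scores` name, and the helper functions are invented for illustration and do not come from any real model or library. The sketch converts the scores into probabilities with a softmax and then picks the most probable word.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores.values())
    # Subtracting the highest score keeps exp() numerically stable.
    exps = {word: math.exp(s - highest) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

def predict_next_word(scores):
    """Pick the single most probable word (often called greedy decoding)."""
    probs = softmax(scores)
    return max(probs, key=probs.get)

# Hypothetical raw scores for words that could follow "once upon a".
toy_scores = {"time": 4.0, "year": 1.5, "day": 1.0}

probs = softmax(toy_scores)
next_word = predict_next_word(toy_scores)
print(next_word)
```

Always taking the top word is only one possible strategy. Many LLM APIs can instead sample from the probabilities, which is why the same prompt can produce different completions.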

How these probabilities are generated is rather complex and sits at the heart of LLM technology. One thing the model tries to do is quantify semantic meaning in a way that makes comparison easy. This brings us to the concept of vector embeddings. Essentially, given an expression, the language model creates a vector of floating-point numbers that carries the expression's semantic meaning. A short expression usually becomes much longer once it is converted to an embedding: a few words can turn into a vector with hundreds or thousands of numbers. This makes sense because language is a medium that carries rich information in compact form; representing all of that meaning with numbers, a fundamentally different medium, takes more space.
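As a rough illustration of what "easy comparison" can look like, the sketch below compares embeddings with cosine similarity, a common way to measure how closely two vectors point in the same direction. The four-dimensional vectors in `toy_embeddings` are made up for this example; real embeddings come from a model and are far longer.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1, 0.0],
    "queen": [0.8, 0.9, 0.2, 0.0],
    "banana": [0.0, 0.1, 0.9, 0.8],
}

sim_royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["banana"])

# Related words should score closer to 1 than unrelated ones.
print(sim_royal > sim_fruit)
```

A score near 1 means the two vectors point in nearly the same direction, which is taken to mean the expressions are close in meaning; a score near 0 means they are unrelated.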

One consequence of this next-word-prediction approach is that, under the hood, the model has no awareness of the semantic meaning of the content it generates. This is why models have been observed to perform badly on even the simplest arithmetic operations: predicting the next word is not exactly the best way to do math.

Many behaviors of LLMs are still not well understood. Their ability to produce such coherent responses came as quite a surprise; after all, they are just trying to predict the next word. LLMs are also difficult to analyze because they fall outside the traditional machine learning paradigm. Before deep learning, ML models tried to avoid overfitting: fitting the training data so closely that the model performs almost perfectly during training but poorly when it needs to generalize to test data. With the advent of deep learning, however, models trained on enormous amounts of data (at a scale the traditional view would have expected to lead to overfitting) somehow seem able not only to memorize the training data but also to synthesize it in interesting ways that let them generalize very well.
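The gap between training and test performance can be shown with a deliberately tiny sketch. The "model" here is a hypothetical lookup table that memorizes its training examples and falls back to a fixed guess for anything it has not seen. The task (deciding whether a number is even) and all names are invented for illustration; this is not how an LLM is trained.

```python
def true_rule(x):
    """The pattern hidden in the data: is x even?"""
    return x % 2 == 0

# Training examples cover 0..9; test examples are the unseen numbers 10..19.
train = [(x, true_rule(x)) for x in range(10)]
test = [(x, true_rule(x)) for x in range(10, 20)]

# "Training" is pure memorization of (input, label) pairs.
memory = dict(train)

def memorizer(x):
    """Return the memorized label, or guess False for unseen inputs."""
    return memory.get(x, False)

def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

train_acc = accuracy(memorizer, train)  # perfect on memorized data
test_acc = accuracy(memorizer, test)    # no better than a fixed guess

print(train_acc, test_acc)
```

The memorizer is perfect on the data it has seen and no better than a constant guess on new data. The surprising observation about LLMs is that they appear to avoid this failure mode despite their scale.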

This is, of course, a very simplified and basic explanation of how LLMs work. I hope it gives you a general idea, so that when you work with LLMs and call their APIs you have a better sense of what is going on under the hood. The field is changing rapidly, and the same principles might not apply tomorrow. That is what makes it exciting.
