Context
The following text is just me, a long-term programmer, trying to make sense of large language models (LLMs) and GPTs. It's a bird's-eye view on the topic with many "I think," "the way I understand it," and so on. If you read this, let me know; I'm interested to hear your thoughts.
GPT
The way I understand how a generative pre-trained transformer (GPT) works:
1. read the prompt (the user's request)
2. pass it through the LLM + runtime
3. print out the word that is (kind of) statistically the most likely first word for this particular prompt
4. repeat step 3, with the word(s) printed before appended to the prompt
5. continue until the printed words so far are (kind of) statistically the answer
Something like this (where `l` is the language model, `p` is the prompt, and `p'` is the prompt with the answer appended so far):

```
f(l + p)   = p'    (*)
f(l + p')  = p''
f(l + p'') = p'''
... and so on
```
*) There are obviously random factors I'm missing here: you do not always get the same `p'` for a specific `l + p`. However, the way I understand it, the LLM is at its core a deterministic system, and the randomness is not part of the LLM but part of the driver that runs it. In other words: I assume that if the runtime used a fixed RNG (one that always picks the same random numbers), it would work exactly as described above.
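To make the loop and the footnote concrete, here is a minimal Python sketch under my assumptions: `model` is a hypothetical callable that returns a `{word: probability}` map for the text so far, and `<end>` is an assumed stop marker. Real systems work on tokens rather than whole words, but the shape of the loop is the same.

```python
import random

def generate(model, prompt, max_words=50, seed=42):
    # The randomness lives in the runtime (this RNG), not in the model.
    # With a fixed seed the whole loop is deterministic, as claimed above.
    rng = random.Random(seed)
    text = prompt
    for _ in range(max_words):
        dist = model(text)  # deterministic: same text -> same distribution
        words = list(dist)
        weights = [dist[w] for w in words]
        word = rng.choices(words, weights=weights)[0]  # the only random step
        if word == "<end>":
            break
        text = text + " " + word  # p -> p' -> p'' -> ...
    return text
```

Running this twice with the same seed and the same model yields the same text; changing the seed changes the picks, which matches the idea that the randomness belongs to the driver, not the model.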
The language model, the thing that is used to pick the words (kind of) statistically, is generated by the learning process, which means it reads large amounts of books, websites, instructions, source code, and more. From that it learns which word most likely follows other words in a given context.
This is how the LLM learned the language. For `Hello my ...` it observed that in many sources the word `friend` followed, so that word has a high probability of being picked. An additional factor is the context of the source: in a love letter the word `darling` might be found more often than `friend`. So when one asks an LLM runtime to write a love letter, it will pick `darling` over `friend`.
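A toy illustration of that context effect (the probabilities below are invented, not measured): the distribution over the word after `Hello my` shifts with the kind of text being written.

```python
# Invented next-word probabilities after "Hello my", per source context.
NEXT_WORD = {
    "letter":      {"friend": 0.6, "dear": 0.3, "darling": 0.1},
    "love letter": {"darling": 0.5, "love": 0.3, "friend": 0.2},
}

def most_likely(context: str) -> str:
    dist = NEXT_WORD[context]
    return max(dist, key=dist.get)

print(most_likely("letter"))       # friend
print(most_likely("love letter"))  # darling
```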
In this very reduced view of the concept, an LLM is nothing more than gigabytes of probabilities in a multidimensional space that decide which word to pick next, based on the input it learned from.
Accessing knowledge
If we ask an LLM-based system (e.g. ChatGPT) `How high is the Eiffel tower?`, it doesn't know the answer. It has just observed that, in most texts it was trained on, an answer to this question would very likely begin with `The`. It also observed that an answer that already has `The` as its first word will very likely continue with `Eiffel`. And so on and so on. In the end we get: `The Eiffel Tower is 330 meters (1,083 feet) tall.`
Unlike a search engine, the LLM runtime does not need to look up an index and point to a website that likely contains an answer to the question. It has simply read the words stating that the Eiffel Tower is 330 meters so many times that it must be true. It's the most likely answer. There can't be a doubt about it.
Classic applications store this kind of information in a database. Deterministic systems. Programmers plan a data layout that can hold the information and make it accessible to the given program, and there is a strict syntax for querying it; an example is SQL. A programmer writes `select height from public_buildings where name = 'Eiffel Tower'` and gets `330.0` as a result. In such a scenario we are sure that if we put `330` into the system and ask for the value later, we always get it back unchanged. There is no learning here. It's more like storing a file in a very specific file system. A database does not know anything. It simply reads, stores, and prints whatever data we fill into it, and that's that.
And here seems to be the big difference: the LLM stored the information about the height of the Eiffel Tower on its own terms, in its own way. No programmer ever planned for it to store this particular piece of information. They just showed it so many texts and sources containing details about the Eiffel Tower that it stored all the factors that lead to the right answer when asked. The problem is that those factors can also lead to wrong answers.
The key takeaway here is that if we ask the LLM for data, we cannot know whether it is correct. Nobody can. A database can contain wrong data and therefore give incorrect results, but if the data is correct, the answers will be correct. An LLM, on the other hand, can be trained on petabytes of correct data and still be wrong about it at times. The way I understand it, this difference is not a detail that can and will be fixed in the near future but a general rule.