Large Language Models (LLMs)
Language models are used for many purposes. They can predict the next word or character in a sentence, summarize a document, translate text from one language to another, recognize speech, or convert text to speech.
Whoever introduced transformers could be blamed for how far language models have gone in parameter count (though really no one is to blame; transformers are one of the greatest inventions of the 2010s, and it is both shocking and amazing that large models keep working better when given enough data and compute). For the last five years, language models have kept growing in size.
It all started a year after the introduction of the paper Attention Is All You Need. In 2018, OpenAI released GPT (Generative Pre-trained Transformer), one of the largest language models at the time. A year later, OpenAI released GPT-2, a model with 1.5 billion parameters. Another year later came GPT-3, with 175 billion parameters, trained on 570GB of text. With 175B parameters, the model weights alone come to roughly 700GB. To grasp how big GPT-3 is: according to lambdalabs, it would take 366 years and $4.6M to train it on the lowest-priced GPU cloud on the market!
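The 700GB figure follows directly from the parameter count: storing 175 billion parameters as 32-bit floats takes 175e9 × 4 bytes. Here is a minimal back-of-the-envelope sketch of that calculation (the fp32 assumption is mine, not stated above):

```python
def model_size_gb(num_params: float, bytes_per_param: int = 4) -> float:
    """Rough size of a model's weights in gigabytes.

    bytes_per_param=4 assumes 32-bit floats (an assumption, not from the
    text); half precision (2 bytes) would halve the figure.
    """
    return num_params * bytes_per_param / 1e9  # decimal gigabytes

print(model_size_gb(175e9))  # ~700.0 GB for GPT-3's 175B parameters
print(model_size_gb(1.5e9))  # ~6.0 GB for GPT-2's 1.5B parameters
```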
The GPT-n series was just the beginning. Other large models have appeared that approach or even exceed GPT-3 in scale. Some examples: NVIDIA's Megatron-LM has 8.3B parameters, and DeepMind's Gopher has 280B parameters. Just last week (April 12th, 2022), DeepMind released another language model, dubbed Chinchilla, which has 70B parameters yet outperforms many language models despite being smaller than Gopher, GPT-3, and Megatron-Turing NLG (530B parameters).
Chinchilla's paper showed that existing large language models are undertrained: roughly, whenever the model size is doubled, the training data should be doubled as well. And then, in almost the same week, along came Google's Pathways Language Model (PaLM) with 540 billion parameters!
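To make the Chinchilla point concrete, here is a small sketch of the scaling rule described above: training tokens grow in proportion to parameters, so doubling the model means doubling the data. The ~20 tokens-per-parameter ratio used below is an illustrative assumption on my part, roughly matching what Chinchilla itself used (70B parameters, about 1.4 trillion tokens), not an exact prescription from the text.

```python
def compute_optimal_tokens(num_params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla-style rule of thumb: scale training data linearly with
    model size, so doubling parameters doubles tokens.

    tokens_per_param=20 is an illustrative ratio (Chinchilla: 70B params,
    ~1.4T tokens), not a value taken from the text above.
    """
    return num_params * tokens_per_param

# Chinchilla, GPT-3, Gopher, PaLM
for params in (70e9, 175e9, 280e9, 540e9):
    print(f"{params / 1e9:.0f}B params -> ~{compute_optimal_tokens(params) / 1e12:.1f}T tokens")
```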