Table of contents
- Official information, current trends and forecasts
- Model Size: GPT-4 Won’t Be Super Large
- Optimization: Get the most out of GPT-4
- Optimal parameterization
- Optimized computational model
- Multimodality: GPT-4 will be a text-only model
- Sparsity: GPT-4 will be a dense model
- Alignment: GPT-4 will be more aligned than GPT-3
Official information, current trends and forecasts
The day of GPT-4’s announcement is approaching.
GPT-3 was announced in May 2020, about two years ago. It came out one year after GPT-2, which in turn came out one year after the original GPT paper. If that cadence had held from version to version, GPT-4 would already be here. It isn’t, but OpenAI CEO Sam Altman said a few months ago that GPT-4 is coming. Current expectations are for a release sometime in 2022, probably around July or August.
Despite being one of the most eagerly awaited pieces of AI news, there is very little public information about what GPT-4 will look like, what features it will have, and what it will be able to do.
Altman did a Q&A last year and gave some hints about OpenAI’s ideas for GPT-4 (he asked the attendees to keep the information private, so I kept quiet, but seven months seems a reasonable time to wait before breaking the silence).
One thing’s for sure: GPT-4 won’t have 100 trillion parameters, as I hypothesized in a previous post (a model that big will have to wait).
It’s been a while since OpenAI revealed information about GPT-4. However, some new trends that are gaining a lot of attention in the field of AI, especially natural language processing, may give us a hint about GPT-4.
Based on the success of recent approaches in natural language processing, OpenAI’s involvement in that success, and Altman’s remarks, we can make some plausible predictions, and they go beyond the well-known (and tired) practice of making models bigger and bigger.
Given what I’ve learned from OpenAI and Sam Altman, as well as current trends and the state of the art in language AI, here’s what I expect from GPT-4. (I’ll make clear in context which predictions are speculation and which are certain.)
What Sam Altman said during the September 2021 Q&A session
Model Size: GPT-4 Won’t Be Super Large
GPT-4 will not be the largest language model. Altman said it wouldn’t be much bigger than GPT-3. It will certainly be large compared with previous generations of neural networks, but size won’t be its distinguishing feature. It will probably land somewhere between GPT-3 and Gopher in size (175 to 280 billion parameters).
There is a good reason for this decision (not making the model size increase the primary goal).
Megatron-Turing NLG (MT-NLG), built by Nvidia and Microsoft last year, held the title of largest dense neural network until very recently with 530 billion parameters, already three times larger than GPT-3 (Google’s PaLM now holds the title with 540 billion). But surprisingly, some smaller models that came after MT-NLG reached higher performance levels.
Bigger ≠ better.
The existence of better, smaller models has two implications.
First, companies have realized that using model size as a proxy for performance is neither the only way nor the best way to improve models. In 2020, Jared Kaplan (then a research consultant at OpenAI) and his colleagues concluded that performance improves the most when increases in computational budget are allocated primarily to scaling the number of parameters, following a power-law relationship. Companies developing language models, such as Google, Nvidia, Microsoft, OpenAI, and DeepMind, took the guidelines at face value.
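As a rough illustration of what a power-law relationship between parameter count and loss implies, the sketch below evaluates a curve of the form L(N) = (N_c / N)^alpha. The function name and the constants `ALPHA` and `N_C` are assumptions chosen for illustration (of roughly the magnitude reported in the scaling-law literature), not values taken from this article.

```python
# Illustrative power-law scaling curve: loss falls as a power of parameter count.
# ALPHA and N_C are assumed placeholder constants, not figures from the article.
ALPHA = 0.076
N_C = 8.8e13


def loss_from_params(n_params: float) -> float:
    """Predicted loss for a model with n_params parameters: (N_C / N) ** ALPHA."""
    return (N_C / n_params) ** ALPHA


for n in (13e9, 175e9, 530e9):
    print(f"{n:.0e} parameters -> predicted loss {loss_from_params(n):.3f}")

# Under a power law, doubling N always shrinks the loss by the same factor,
# no matter how large the model already is.
small_step = loss_from_params(2 * 13e9) / loss_from_params(13e9)
large_step = loss_from_params(2 * 175e9) / loss_from_params(175e9)
print(f"loss ratio after doubling: {small_step:.4f} (13B), {large_step:.4f} (175B)")
```

With an exponent this small, each doubling of the parameter count reduces the predicted loss by only a few percent, which helps explain why following this rule pushed model sizes up so quickly.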
But MT-NLG isn’t the best in terms of performance, despite its size. In fact, it isn’t the best on any single benchmark in any category. Smaller models like Gopher (280 billion) and Chinchilla (70 billion) are a fraction of the size of MT-NLG, yet they outperform it across tasks.
It turns out that model size isn’t the only factor in improving language understanding, which leads to the second implication.
Second, companies are beginning to reject the “bigger is better” dogma. Having more parameters is just one of many factors that improve performance. And considering its collateral damage (e.g. carbon footprint, computational cost, barriers to entry), it is one of the least attractive ones. Companies will think twice about building giant models if they can get similar or better results with smaller ones.
Altman said OpenAI researchers are no longer focused on making models extremely large, but on getting the best results out of smaller ones. They were early advocates of the scaling hypothesis, but may now realize that other, unexplored avenues lead to improved models.
That’s why GPT-4 won’t be much bigger than GPT-3. OpenAI will shift its focus to other aspects, such as data, algorithms, parameterization, and alignment, which can more clearly bring significant improvements. We’ll have to wait and see what a 100 trillion parameter model is capable of.
The straight line in the figure above shows the estimated compute-optimal combination of model size and number of training tokens. A model plotted above the line is undertrained for its size, and a model plotted below it is overtrained. Models on the best-fit line use their computational budget most efficiently. Naturally, Chinchilla sits on this line.
Optimization: Get the most out of GPT-4
Language models suffer from one serious limitation in terms of optimization. Training is so expensive that companies must make tradeoffs between accuracy and cost. As a result, models are often poorly optimized.
GPT-3 was trained only once, despite some errors that in other cases would have called for retraining. OpenAI did not retrain it because the cost was unaffordable. Without retraining, the researchers could not find the optimal set of hyperparameters (learning rate, batch size, sequence length, etc.) for the model.
Another consequence of the high training cost is that analysis of model behavior is limited. When Kaplan’s team concluded that model size was the most relevant variable for improving performance, they did not account for the number of training tokens, that is, the amount of data fed to the model. Doing so would have required a huge amount of computing resources.
Tech companies followed Kaplan’s conclusions because they were the best knowledge available. Ironically, it was precisely economic constraints that led Google, Microsoft, Facebook (now Meta), and others to “waste” millions of dollars on ever larger models, generating enormous amounts of pollution in the process.
Now companies, led by DeepMind and OpenAI, are exploring other approaches. They’re trying to find optimal models, not just bigger ones.
Optimal parameterization
Last month, Microsoft and OpenAI showed that GPT-3 could be further improved by training the model with optimal hyperparameters. They found that a 6.7 billion parameter version of GPT-3 improved so much that it matched the performance of the original 13 billion parameter model. In other words, hyperparameter tuning, which is not feasible for larger models, yielded a performance improvement equivalent to doubling the number of parameters.
They discovered a new parameterization (μP) under which the hyperparameters that are best for a small model are also best for a larger model of the same family. μP makes it possible to optimize models of arbitrary size at a very small fraction of the training cost, because the hyperparameters can be tuned on the small model and then transferred to the larger one at virtually no cost.
Optimized computational model
A few weeks ago, DeepMind revisited Kaplan’s findings (that the number of parameters is the most important factor in improving model performance) and found, contrary to what was believed, that the number of training tokens affects performance as much as model size does. They concluded that as more computational budget becomes available, it should be allocated equally to scaling parameters and data. To prove the hypothesis, they trained Chinchilla, a 70 billion parameter model (a quarter the size of Gopher, the previous state of the art), on four times as much data as any large language model since GPT-3 (1.4 trillion tokens, up from the typical 300 billion).
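To make the "allocate equally" rule concrete, here is a minimal sketch. It relies on the common rule of thumb that training compute is roughly 6 FLOPs per parameter per token, which is an assumption not stated in this article, and it assumes Gopher was trained on the typical 300 billion tokens mentioned above. The function names are hypothetical.

```python
import math


def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute, assuming about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


def scale_equally(params: float, tokens: float, compute_multiplier: float):
    """Split extra compute evenly: parameters and tokens both grow by sqrt(multiplier)."""
    factor = math.sqrt(compute_multiplier)
    return params * factor, tokens * factor


# Same ballpark of compute, split very differently between size and data.
gopher_flops = train_flops(280e9, 300e9)       # assumed token count for Gopher
chinchilla_flops = train_flops(70e9, 1.4e12)
print(f"Gopher     ~{gopher_flops:.2e} FLOPs, {300e9 / 280e9:.1f} tokens per parameter")
print(f"Chinchilla ~{chinchilla_flops:.2e} FLOPs, {1.4e12 / 70e9:.1f} tokens per parameter")

# With 4x Chinchilla's compute, equal scaling doubles both model size and data.
new_params, new_tokens = scale_equally(70e9, 1.4e12, 4.0)
print(f"4x compute -> {new_params:.2e} parameters, {new_tokens:.2e} tokens")
```

Under these assumptions the two models cost a similar amount to train; what changes is how the budget is divided between parameters and data.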
The results were clear-cut. Chinchilla “uniformly and significantly” outperforms Gopher, GPT-3, MT-NLG, and all other language models across many language benchmarks. Current models are undertrained and oversized.
Assuming GPT-4 will be slightly larger than GPT-3, the number of training tokens it would need to be compute-optimal (according to DeepMind’s findings) is about 5 trillion, an order of magnitude more than current datasets. In addition, the number of FLOPs required to train the model to minimal training loss would be about 10 to 20 times what was used for GPT-3 (estimated using Gopher’s compute budget as a proxy).
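The estimate above can be roughly reproduced with back-of-the-envelope arithmetic. The sketch assumes the tokens-per-parameter ratio implied by Chinchilla (1.4 trillion tokens for 70 billion parameters, i.e. 20), roughly 6 FLOPs per parameter per token, 300 billion training tokens for GPT-3, and hypothetical GPT-4 sizes between GPT-3 and Gopher. None of these are confirmed GPT-4 figures, and this is a different heuristic from the one used for the numbers in the text.

```python
# Assumption: tokens-per-parameter ratio implied by Chinchilla (1.4e12 / 70e9).
TOKENS_PER_PARAM = 20.0

# Assumption: GPT-3 (175B parameters) was trained on about 300B tokens,
# and training costs about 6 FLOPs per parameter per token.
GPT3_FLOPS = 6.0 * 175e9 * 300e9


def estimate(params: float):
    """Return (compute-optimal tokens, training compute relative to GPT-3)."""
    tokens = TOKENS_PER_PARAM * params
    flops = 6.0 * params * tokens
    return tokens, flops / GPT3_FLOPS


# Hypothetical GPT-4 sizes between GPT-3 and Gopher.
for params in (175e9, 250e9, 280e9):
    tokens, multiple = estimate(params)
    print(f"{params / 1e9:.0f}B parameters -> {tokens / 1e12:.1f}T tokens, "
          f"{multiple:.1f}x GPT-3 compute")
```

Under these assumptions the token counts fall in the trillions and the compute multiples in the tens, the same order of magnitude as the figures quoted above; the exact values depend on the assumed model size and ratio.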
Altman may have meant this when he said in the Q&A that “GPT-4 will use much more computation than GPT-3.”
OpenAI will undoubtedly apply optimization-related insights to GPT-4, but to what extent is unpredictable because its computational budget is unknown. What is certain is that they will focus on optimizing variables other than model size. Finding the optimal set of hyperparameters and the compute-optimal model size and number of parameters could result in incredible improvements across all benchmarks. All performance predictions for language models will fall short if these approaches are combined in a single model.
“People won’t believe how good a model can be without making it bigger,” Altman said. His remarks may suggest that scaling efforts are over for now.
Multimodality: GPT-4 will be a text-only model
The future of deep learning lies in multimodal models. Human brains are multisensory because we live in a multimodal world. Perceiving the world in only one mode at a time severely limits an AI’s ability to navigate and understand it.
But good multimodal models are much harder to build than language-only or vision-only models. Combining visual and textual information into a single representation is a very difficult task. Our knowledge of how the brain does it is very limited (and the deep learning community does not take into account insights from cognitive science about brain structure and function), so we don’t know how to implement it in a neural network.
Altman said in the Q&A that GPT-4 will be a text-only model, not a multimodal one (like DALL-E or MUM). My guess is that OpenAI is trying to reach the limits of language models, adjusting factors such as model and dataset size, before moving to the next generation of multimodal AI.
Sparsity: GPT-4 will be a dense model
Sparse models, which use conditional computation to route different types of inputs through different parts of the model, have seen great success recently. These models easily scale beyond 1 trillion parameters without incurring high computational costs, making model size and computational budget seemingly orthogonal. However, the benefits of the mixture-of-experts (MoE) approach diminish for very large models.
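The idea of conditional computation can be sketched with a toy mixture-of-experts layer. This is a simplified top-1 routing example in pure Python with made-up dimensions; it omits the softmax gate weighting and load balancing that real systems use, and it does not represent any particular library's or OpenAI's implementation.

```python
import random

random.seed(0)
DIM, NUM_EXPERTS = 4, 8  # hypothetical toy sizes


def rand_matrix(rows: int, cols: int):
    """Random weight matrix stored as a list of rows."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


gate = rand_matrix(NUM_EXPERTS, DIM)  # router: one score per expert
experts = [rand_matrix(DIM, DIM) for _ in range(NUM_EXPERTS)]


def moe_forward(x):
    """Route the input to the single highest-scoring expert (top-1 routing)."""
    scores = matvec(gate, x)
    chosen = max(range(NUM_EXPERTS), key=lambda i: scores[i])
    return chosen, matvec(experts[chosen], x)


# Only the router and one expert are used per input, so the parameters touched
# per input stay constant while the total grows with the number of experts.
total_params = NUM_EXPERTS * DIM * DIM + NUM_EXPERTS * DIM
active_params = DIM * DIM + NUM_EXPERTS * DIM

expert_index, output = moe_forward([1.0, 0.0, -1.0, 0.5])
print(f"routed to expert {expert_index}")
print(f"total parameters: {total_params}, used for this input: {active_params}")
```

Adding experts increases `total_params` but barely changes `active_params`, which is the sense in which model size and computational cost become decoupled.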
Given OpenAI’s history of focusing on dense language models, it’s reasonable to expect that GPT-4 will also be a dense model. And since Altman said GPT-4 won’t be much larger than GPT-3, we can conclude that sparsity isn’t an option for OpenAI, at least for now.
Given that our brains, the source of inspiration for AI, rely heavily on sparse processing, sparsity, like multimodality, will likely dominate future generations of neural networks.
Schematic diagram of hierarchical MoE model
Alignment: GPT-4 will be more aligned than GPT-3
OpenAI has put a lot of effort into tackling the AI alignment problem: how to make language models follow our intentions and adhere to our values (whatever that means). This is not only a mathematically difficult problem (how do we get AI to understand exactly what we want it to do?), but also a philosophically difficult one (there is no universal way to align AI with humans, because human values vary widely between groups and are often at odds).
But OpenAI made a first attempt with InstructGPT, a renewed GPT-3 trained with human feedback to learn to follow instructions (whether those instructions are well-intentioned or not is not yet factored into the model).
The main breakthrough of InstructGPT is that, regardless of its results on language benchmarks, human judges perceive it as a better model (the judges were OpenAI employees and English speakers, a very homogeneous group, so draw conclusions with caution). This highlights the need to move beyond using benchmarks as the sole metric for evaluating AI capabilities. How humans perceive models may be more important than benchmarks.
Given Altman’s and OpenAI’s commitment to beneficial AGI, I believe GPT-4 will build on and implement insights from InstructGPT.
OpenAI will improve on an alignment process that was limited to its own employees and English-speaking labelers. True alignment should include groups of every origin and characteristic with regard to gender, race, nationality, religion, etc. It’s a big challenge, and any step toward that goal is welcome (though we should be wary of calling it alignment when it doesn’t reflect most people).
Model size: GPT-4 will be larger than GPT-3, but not very large compared with the current largest models (MT-NLG at 530 billion and PaLM at 540 billion parameters). Model size will not be a distinguishing feature.
Optimization: GPT-4 will use more computation than GPT-3. It will also implement new findings on parameterization (optimal hyperparameters) and scaling laws (the number of training tokens is as important as model size).
Multimodality: GPT-4 will be a text-only model (not multimodal). OpenAI wants to push language models to their limits before moving fully to multimodal models like DALL-E, which are expected to surpass unimodal models in the future.
Sparsity: GPT-4, like GPT-2 and GPT-3, will be a dense model (all parameters are used to process any given input). Sparsity will become more dominant in future neural network architectures.
Alignment: GPT-4 will be more aligned with us than GPT-3. It will implement what was learned from InstructGPT, which was trained with human feedback. Still, AI alignment has a long way to go, and these efforts should be evaluated carefully and not overestimated.