Can Scale Save Us From Plasticity Loss in Large Language Models?
This research investigates a fundamental challenge in artificial intelligence: "loss of plasticity." This phenomenon occurs when a neural network gradually loses its ability to learn new information after it has been trained for a long time on previous data. While this has been studied in smaller, older systems, this paper explores whether modern, large-scale Transformer models (the architecture behind tools like GPT) are also susceptible to this decline when trained on natural language.
Testing Plasticity in Language Models
To determine if large models eventually stop learning effectively, the researchers created a "multilingual continual learning" experiment. They trained several GPT-style Transformer models of varying sizes (ranging from 5 million to 314 million parameters) on a rotating sequence of eight different languages. To measure whether these models were still capable of learning, the team periodically paused the training and tested the models on a completely new, "held-out" language: Vietnamese. By measuring how quickly the models adapted to this new language over time, the researchers could track whether the models' learning efficiency improved or degraded as they were exposed to more data.
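The protocol described above can be sketched as a schedule of training phases interleaved with held-out probes. This is a minimal illustration, not the authors' code: the language names, the phase-based rotation, the probe interval, and the function name `build_schedule` are all assumptions introduced here, since the summary does not list the eight training languages or the probing frequency.

```python
from itertools import cycle, islice

# Hypothetical placeholders: the summary mentions eight training languages
# but does not name them.
TRAIN_LANGUAGES = [f"lang_{i}" for i in range(8)]
HELD_OUT = "vietnamese"  # the held-out probe language named in the summary


def build_schedule(num_phases, probe_every):
    """Return a list of (kind, language) events.

    Training phases rotate through TRAIN_LANGUAGES; after every
    `probe_every` phases a probe on the held-out language is scheduled.
    """
    events = []
    rotation = islice(cycle(TRAIN_LANGUAGES), num_phases)
    for phase, lang in enumerate(rotation, start=1):
        events.append(("train", lang))
        if phase % probe_every == 0:
            # Assumed probe procedure: pause the main run, adapt a copy of
            # the current checkpoint to the held-out language, and record
            # how quickly its loss falls.
            events.append(("probe", HELD_OUT))
    return events


schedule = build_schedule(num_phases=10, probe_every=4)
for kind, lang in schedule:
    print(kind, lang)
```

Comparing the probe results across successive checkpoints is what would reveal whether adaptation gets faster or slower as training continues.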
The Scaling Law of Learning Decline
The study found that all tested models, regardless of size, eventually exhibit a loss of plasticity. As the models were trained for longer periods, their ability to adapt to the new Vietnamese language task began to deteriorate. However, the researchers discovered a predictable pattern: the onset of this decline follows a sublinear power-law scaling with model size. In simpler terms, larger models are more resilient and can delay the onset of plasticity loss for much longer than smaller models, but each increase in parameter count buys a less-than-proportional delay, so simply adding parameters is not a permanent solution. Eventually, even the largest models in the study reached a point where their capacity to learn new information diminished.
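To make "sublinear power law" concrete, the sketch below fits onset = c * size^alpha by least squares in log-log space. The data are synthetic and generated from a made-up law (c = 3, alpha = 0.5) purely for illustration; they are not results from the paper. Only the smallest and largest model sizes come from the summary, and the intermediate sizes and the function name `fit_power_law` are assumptions.

```python
import math


def fit_power_law(sizes, onsets):
    """Fit onset = c * size**alpha via linear regression on logs."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in onsets]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    alpha = cov / var
    c = math.exp(mean_y - alpha * mean_x)
    return c, alpha


# Synthetic data, NOT from the paper: onsets generated from 3 * size**0.5.
sizes = [5e6, 2e7, 8e7, 3.14e8]
onsets = [3.0 * n ** 0.5 for n in sizes]

c, alpha = fit_power_law(sizes, onsets)

# With 0 < alpha < 1, doubling the model size multiplies the onset time
# by 2**alpha, which is less than 2: a less-than-proportional delay.
doubling_gain = 2 ** alpha
print(f"alpha={alpha:.3f}, c={c:.3f}, doubling gain={doubling_gain:.3f}")
```

An exponent between 0 and 1 is what makes the law sublinear: scale postpones the decline but with diminishing returns.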
Stationary Training and Practical Implications
A common belief in the field has been that plasticity loss is primarily triggered by abrupt changes in tasks. This research challenges that view by finding evidence of plasticity loss even under "stationary" multilingual training, where the data distribution remains consistent. This suggests that the problem is not just a result of switching between different types of tasks, but rather a consequence of extended training itself.
Key Takeaways for Future Research
The researchers identified several internal indicators of this decline, such as dormant units and "lazy" attention heads, which provide clues for future work. However, they note that they have not yet found a definitive "smoking gun" or a simple fix to prevent this loss of plasticity. The findings suggest that as we continue to train larger language models for longer durations, we must develop new strategies to maintain their ability to adapt, as relying on scale alone will not be enough to keep these systems flexible in the long run.
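As an illustration of one such indicator, the sketch below computes the fraction of dormant units in a layer. It assumes a definition common in the plasticity literature: a unit is dormant when its mean absolute activation, normalized by the layer average, falls at or below a threshold. The summary does not state the paper's exact definition, so the threshold `tau`, the function name `dormant_fraction`, and the toy activations are all hypothetical.

```python
def dormant_fraction(activations, tau=0.1):
    """Fraction of units whose normalized activation score is <= tau.

    activations: list of samples, each a list of per-unit activations
    for a single layer.
    """
    num_samples = len(activations)
    num_units = len(activations[0])
    # Mean absolute activation of each unit across samples.
    mean_abs = [
        sum(abs(sample[j]) for sample in activations) / num_samples
        for j in range(num_units)
    ]
    layer_mean = sum(mean_abs) / num_units
    if layer_mean == 0:
        return 1.0  # every unit is silent
    # Normalize by the layer average so the score is scale-independent.
    scores = [m / layer_mean for m in mean_abs]
    return sum(1 for s in scores if s <= tau) / num_units


# Toy activations (made up): units 1 and 3 are nearly silent.
toy = [
    [1.0, 0.0, 2.0, 0.01],
    [3.0, 0.0, 2.0, 0.0],
]
fraction = dormant_fraction(toy)
print(fraction)
```

Tracking a quantity like this across checkpoints is one way such an indicator could be monitored alongside the held-out adaptation probes.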