
Paper Abstract

Loss of plasticity, the decline in a network's ability to learn new information after having already learned older information, is a fundamental challenge in creating artificial neural networks capable of continual learning. Although this phenomenon has been known for decades, it has mostly been studied in older, relatively small architectures and rarely in natural-language domains. To determine whether loss of plasticity remains a problem in the modern transformer-based LLM paradigm, we study plasticity loss in GPT-style Transformer models trained on a multilingual continual learning problem. Consistent with prior work, we find evidence of plasticity loss across models ranging from 5M to 314M non-embedding parameters, as measured by deterioration on a held-out Vietnamese probing task. We further find that the onset of plasticity loss follows a predictable scaling law, growing sublinearly with model size. These results suggest that larger models may delay the measurable effects of plasticity loss, but that increasing parameter count alone is likely to be insufficient to completely prevent it. We also find evidence of plasticity loss under stationary multilingual training, challenging the view that the phenomenon is exclusive to continual learning with abrupt task changes. Overall, our results suggest that even large Transformer language models trained on natural language will eventually lose the ability to efficiently adapt to new data after sufficiently long training, in both continual and stationary settings.

Can Scale Save Us From Plasticity Loss in Large Language Models?
This research investigates a fundamental challenge in artificial intelligence: "loss of plasticity." This phenomenon occurs when a neural network gradually loses its ability to learn new information after it has been trained for a long time on previous data. While this has been studied in smaller, older systems, this paper explores whether modern, large-scale Transformer models (the architecture behind GPT-style systems) are also susceptible to this decline when trained on natural language.

Testing Plasticity in Language Models

To determine whether large models eventually stop learning effectively, the researchers created a "multilingual continual learning" experiment. They trained several GPT-style Transformer models of varying sizes (ranging from 5 million to 314 million non-embedding parameters) on a rotating sequence of eight different languages. To measure whether these models were still capable of learning, the team periodically paused the training and tested the models on a "held-out" language that never appears in the main training sequence: Vietnamese. By measuring how quickly the models adapted to this new language at each pause, the researchers could track whether the models' learning efficiency improved or degraded as training progressed.
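The probing procedure described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's setup: a linear regression model stands in for the Transformer, eight synthetic regression tasks stand in for the eight languages, and all names (`make_task`, `probe`, `run`) and hyperparameters are assumptions made for this example. A linear model is not expected to lose plasticity, so the sketch shows only the shape of the protocol: train on rotating tasks, and at fixed intervals fine-tune a copy on the held-out task for a fixed budget and record its final loss.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy input dimension (assumption, unrelated to the paper's models)

def make_task(seed):
    """Return a sampler for a toy regression task standing in for one language."""
    w_true = np.random.default_rng(seed).normal(size=DIM)
    def sample(n):
        x = rng.normal(size=(n, DIM))
        return x, x @ w_true
    return sample

def sgd_step(w, x, y, lr=0.05):
    """One gradient step on mean squared error; returns new weights."""
    grad = 2 * x.T @ (x @ w - y) / len(y)
    return w - lr * grad

def mse(w, x, y):
    return float(np.mean((x @ w - y) ** 2))

def probe(w, held_out, budget=20):
    """Fine-tune a COPY of the weights on the held-out task for a fixed
    number of steps and report the final loss (lower = more plastic)."""
    w_probe = w.copy()  # main training weights are left untouched
    for _ in range(budget):
        x, y = held_out(32)
        w_probe = sgd_step(w_probe, x, y)
    x_eval, y_eval = held_out(256)
    return mse(w_probe, x_eval, y_eval)

def run(total_steps=400, steps_per_task=50, probe_every=100):
    train_tasks = [make_task(s) for s in range(8)]  # eight rotating tasks
    held_out = make_task(99)                        # never used in main training
    w = np.zeros(DIM)
    history = []
    for step in range(1, total_steps + 1):
        task = train_tasks[((step - 1) // steps_per_task) % len(train_tasks)]
        x, y = task(32)
        w = sgd_step(w, x, y)
        if step % probe_every == 0:
            history.append((step, probe(w, held_out)))
    return history

if __name__ == "__main__":
    for step, loss in run():
        print(f"step {step}: held-out probe loss {loss:.4f}")
```

In a real study, a rising probe loss across checkpoints (at a fixed fine-tuning budget) would be the signal of plasticity loss. Copying the weights before probing matters, because the held-out language must never leak into the main training run.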

The Scaling Law of Learning Decline

The study found that all tested models, regardless of size, eventually exhibit a loss of plasticity. As the models were trained for longer periods, their ability to adapt to the new Vietnamese language task began to deteriorate. However, the researchers discovered a predictable pattern: the point in training at which this decline begins follows a sublinear power law in model size. In simpler terms, larger models are more resilient and delay the onset of plasticity loss, but the delay grows more slowly than the parameter count, so simply adding parameters is not a permanent solution. Eventually, even the largest models in the study reached a point where their capacity to learn new information diminished.
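To make "sublinear power law" concrete: if onset time scales as a * N^b with 0 < b < 1, then multiplying the parameter count N by 10 delays onset by only a factor of 10^b, which is less than 10. The sketch below fits such a law in log-log space. The data points and the constants `true_a` and `true_b` are synthetic placeholders chosen for illustration, not values reported in the paper.

```python
import numpy as np

def fit_power_law(sizes, onsets):
    """Fit onset ~= a * size**b by least squares in log-log space.
    Returns (a, b)."""
    b, log_a = np.polyfit(np.log(sizes), np.log(onsets), deg=1)
    return float(np.exp(log_a)), float(b)

# Synthetic placeholder data, NOT results from the paper.
sizes = np.array([5e6, 2e7, 8e7, 3.14e8])  # non-embedding parameter counts
true_a, true_b = 3.0, 0.5                  # hypothetical constants
onsets = true_a * sizes ** true_b          # hypothetical onset (training steps)

a, b = fit_power_law(sizes, onsets)
delay_per_10x = 10 ** b  # onset delay factor bought by 10x more parameters

if __name__ == "__main__":
    print(f"fitted exponent b = {b:.3f}")
    print(f"10x parameters delays onset by a factor of {delay_per_10x:.2f}")
```

With an exponent below 1, each tenfold increase in parameters buys less than a tenfold delay, which is why scale alone postpones the problem rather than removing it.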

Stationary Training and Practical Implications

A common belief in the field has been that plasticity loss is primarily triggered by abrupt changes in tasks. This research challenges that view by finding evidence of plasticity loss even under "stationary" multilingual training, where the data distribution remains consistent. This suggests that the problem is not just a result of switching between different types of tasks, but rather a consequence of extended training itself.

Key Takeaways for Future Research

The researchers identified several internal indicators of this decline, such as dormant units and "lazy" attention heads, which provide clues for future work. However, they note that they have not yet found a definitive "smoking gun" or a simple fix to prevent this loss of plasticity. The findings suggest that as we continue to train larger language models for longer durations, we must develop new strategies to maintain their ability to adapt, as relying on scale alone will not be enough to keep these systems flexible in the long run.
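As an illustration of how an indicator like dormant units might be measured, the sketch below uses one common definition: a unit counts as dormant when its average absolute activation, normalized by the layer-wide average, falls at or below a small threshold. The function name `dormant_fraction`, the threshold `tau`, and the normalization are assumptions for this example; the summary above does not state which definition the paper uses.

```python
import numpy as np

def dormant_fraction(activations, tau=0.1):
    """Fraction of dormant units in one layer.

    activations: array of shape (batch, units) holding post-nonlinearity
    outputs. A unit is dormant if its mean |activation|, divided by the
    mean over all units in the layer, is <= tau (assumed definition).
    """
    per_unit = np.abs(activations).mean(axis=0)
    layer_mean = per_unit.mean()
    if layer_mean == 0:
        return 1.0  # every unit is silent
    scores = per_unit / layer_mean
    return float((scores <= tau).mean())

if __name__ == "__main__":
    # Toy layer with four units; the second unit never fires.
    acts = np.array([[1.0, 0.0, 2.0, 0.5],
                     [1.0, 0.0, 2.0, 0.5]])
    print(dormant_fraction(acts))
```

Tracking this fraction per layer across checkpoints would show whether dormancy rises alongside the decline in probe performance, although, as noted above, no single indicator has been established as the cause.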
