On Memorization and Generalization in Compact Transformers
by A. Härmä, A. Al-Saeedi, A. Changalidis, D. Verșebeniuc, M. Pietrasik, A. Wilbik
Abstract
Large language models (LLMs) appear to demonstrate human-like understanding and generalization of language. These abilities arise from the models' capacity to memorize and generalize their training content. In this paper, we review the recent literature and theories on the mechanisms of self-attention neural networks. We also report three computational experiments showing that memorization capacity in compact transformers can be empirically linked to architectural parameters, that structured domain knowledge can be retained in small decoder-only models, and that in-context abstraction requires sufficient architectural depth. These findings suggest that current models are oversized for many specific applications, especially in on-edge use cases. A better understanding of application requirements and architectural details can be expected to help in building new LLM architectures that can be implemented efficiently on dedicated on-edge circuits.