Reducing LLM Training Dimensions Through Definition-Specific Token Encoding
Abstract

This paper proposes a novel approach to reducing the dimensional complexity of Large Language Model (LLM) training by introducing a definition-specific byte encoding system that disambiguates word meanings during tokenization. By attaching a single byte that maps each word occurrence to a specific Oxford English Dictionary definition, we can potentially reduce the embedding dimensions required for semantic disambiguation.
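To make the proposed encoding concrete, the following Python sketch illustrates one way a sense byte could be appended to a token's byte sequence. The sense inventory SENSE_IDS, and the functions encode_token and decode_token, are hypothetical names introduced here for illustration; a real system would draw its sense inventory from the Oxford English Dictionary and require a word-sense disambiguation step, both of which are stubbed out below.

# A minimal sketch, assuming a hypothetical single-byte sense inventory.
# Each word's definitions map to byte values; the byte is appended to
# the word's UTF-8 encoding to form a definition-specific token.

SENSE_IDS = {
    "bank": {"financial institution": 0x01, "river edge": 0x02},
    "bat":  {"flying mammal": 0x01, "sports implement": 0x02},
}

def encode_token(word: str, definition: str) -> bytes:
    """Encode a word as its UTF-8 bytes followed by one sense byte."""
    sense_byte = SENSE_IDS[word][definition]
    return word.encode("utf-8") + bytes([sense_byte])

def decode_token(token: bytes) -> tuple[str, int]:
    """Split an encoded token back into its surface form and sense ID."""
    return token[:-1].decode("utf-8"), token[-1]

if __name__ == "__main__":
    t = encode_token("bank", "river edge")
    print(t)                # b'bank\x02'
    print(decode_token(t))  # ('bank', 2)

Under this scheme, the two senses of "bank" become distinct token byte strings before embedding, which is the mechanism by which the paper argues embedding capacity otherwise spent on contextual disambiguation could be reduced.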