A Novel Algorithm to Improve the Performance of Mixture of Experts in Complex AI Tasks
Abstract
This paper proposes DLR, a novel algorithm for improving the performance of mixture-of-experts models across complex AI tasks, including natural language processing, computer vision, and speech recognition. Building on the Mixture of Layer Experts (MoLEx) framework, DLR dynamically routes inputs through selected network layers treated as experts, using a task-aware gating mechanism that adapts to the difficulty and modality of each input. This layer-level routing promotes richer cross-layer and cross-modal information fusion while keeping additional parameter overhead minimal. Theoretical analysis shows that DLR maintains an effective parameter budget comparable to dense baselines while achieving stronger robustness under distribution shift. Empirical evaluations on the GLUE benchmark for NLP, CIFAR-100 for vision, and LibriSpeech for ASR show consistent accuracy gains of 2.4–3.2% over MoLEx and traditional sparse MoE models, with only marginal increases in computation. By enabling parallel expert processing with a lightweight shared-parameter design, DLR offers an efficient and scalable approach to parameter-efficient fine-tuning in diverse multimodal and multi-task settings.
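To make the layer-level routing idea above concrete, the following is a minimal PyTorch sketch of the general scheme the abstract describes: a task-aware gate scores a shared stack of layers ("layers as experts"), each example is processed by its top-k layers, and the weighted outputs are combined. All names and design choices here (LayerGate, LayerRoutedBlock, the mean-pooled gate input, top_k = 2) are illustrative assumptions, not the paper's actual DLR implementation.

```python
# Illustrative sketch of layer-level routing with a task-aware gate.
# Assumption-based example; not the authors' DLR code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerGate(nn.Module):
    """Scores each candidate layer (expert) from a pooled summary of the input."""

    def __init__(self, d_model: int, num_layers: int):
        super().__init__()
        self.score = nn.Linear(d_model, num_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> mean-pool over the sequence -> gate logits per layer
        pooled = x.mean(dim=1)
        return self.score(pooled)


class LayerRoutedBlock(nn.Module):
    """Routes each example through its top-k layers from a shared stack and mixes the outputs."""

    def __init__(self, d_model: int, num_layers: int, top_k: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        self.gate = LayerGate(d_model, num_layers)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.gate(x)                               # (batch, num_layers)
        weights, idx = logits.topk(self.top_k, dim=-1)      # pick k layers per example
        weights = F.softmax(weights, dim=-1)                # normalize the selected scores
        out = torch.zeros_like(x)
        for b in range(x.size(0)):
            for k in range(self.top_k):
                layer = self.layers[int(idx[b, k])]
                # Weighted sum of the selected layers' outputs for this example
                out[b] = out[b] + weights[b, k] * layer(x[b:b + 1]).squeeze(0)
        return out


# Usage: route a batch of token embeddings through 2 of 6 shared layers per example.
block = LayerRoutedBlock(d_model=64, num_layers=6, top_k=2)
x = torch.randn(8, 16, 64)        # (batch, seq_len, d_model)
print(block(x).shape)             # torch.Size([8, 16, 64])
```

Because the selected layers share one parameter stack, the routing adds only the small gating network on top of the dense model, which is consistent with the abstract's claim of minimal parameter overhead.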
DOI: http://dx.doi.org/10.70711/aitr.v2i10.7132