
A Novel Algorithm to Improve the Performance of Mixture of Experts in Complex AI Tasks

Hanjie Xu*, Rui Li, Qiyu Chen

Abstract


We introduce Dynamic Layer Routing (DLR), a novel hierarchical sparse expert-mixing algorithm designed to improve performance across complex AI tasks, including natural language processing, computer vision, and speech recognition. Building on the Mixture of Layer Experts (MoLEx) framework, DLR dynamically routes inputs through selected network layers treated as experts, using a task-aware gating mechanism that adapts to the difficulty and modality of each input. This layer-level routing promotes richer cross-layer and cross-modal information fusion while keeping additional parameter overhead minimal. Theoretical analysis demonstrates that DLR maintains an effective parameter budget comparable to dense baselines while achieving stronger robustness under distribution shifts. Empirical evaluations on the GLUE benchmark for NLP, CIFAR-100 for vision, and LibriSpeech for ASR show consistent accuracy gains of 2.4–3.2% over MoLEx and traditional sparse MoE models, with only marginal increases in computation. By enabling parallel expert processing with a lightweight shared-parameter design, DLR offers an efficient and scalable approach to parameter-efficient fine-tuning in diverse multimodal and multi-task settings.
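
The abstract does not give implementation details, but the routing idea it describes can be illustrated with a minimal sketch. The module below is an illustrative assumption rather than the authors' code: the class name DynamicLayerRouter, the mean-pooled gating input, and the top_k selection are all hypothetical choices used only to show layer-level routing with a lightweight gate over shared layers treated as experts.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicLayerRouter(nn.Module):
    # Treats a stack of shared Transformer encoder layers as "layer experts"
    # and mixes the top-k of them per input, weighted by a small learned gate.
    def __init__(self, d_model: int, num_layers: int, top_k: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        self.gate = nn.Linear(d_model, num_layers)  # input-aware gating scores
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); gate on the mean-pooled sequence.
        weights = F.softmax(self.gate(x.mean(dim=1)), dim=-1)   # (batch, num_layers)
        top_w, top_i = weights.topk(self.top_k, dim=-1)         # sparse layer selection
        out = torch.zeros_like(x)
        for b in range(x.size(0)):                              # per-sample loop for clarity
            for w, i in zip(top_w[b], top_i[b]):
                out[b] = out[b] + w * self.layers[int(i)](x[b:b+1]).squeeze(0)
        return out

# Toy usage: 8 inputs of length 16, 6 layer experts, top-2 routing per input.
router = DynamicLayerRouter(d_model=64, num_layers=6, top_k=2)
print(router(torch.randn(8, 16, 64)).shape)   # torch.Size([8, 16, 64])

In this sketch the layer experts are the backbone's own layers, so the only routing-specific parameters are the gate's d_model-by-num_layers weights, which is one way to read the abstract's claim of minimal additional parameter overhead.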

Keywords


Sparse mixture-of-experts; Transformer; Parameter-efficient fine-tuning; Hierarchical routing; Multimodal learning






DOI: http://dx.doi.org/10.70711/aitr.v2i10.7132
