An Exploration of the Application of Reinforcement Learning in Large Model Training
Abstract
which includes value functions, policy optimization, and three major algorithm types. Building on this foundation, it analyzes the integrated
architecture of large models and reinforcement learning. Starting from three key modules (agents, environment interaction, and reward
design), the paper examines the adaptation logic of key algorithms such as RLHF, PPO, and Actor-Critic, and explores their application stages
and optimization pathways throughout the entire training process of large models. The study demonstrates that reinforcement learning can
effectively align large models with human preferences, thereby enhancing output quality and training stability. Through framework optimization, algorithmic improvements, and multidimensional validation, a closed-loop optimization system can be established, significantly improving the decision-making capabilities and generalization performance of large models. This provides a feasible technical pathway for the efficient training and alignment optimization of large models.
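As an illustration of the policy-optimization step that the abstract attributes to PPO, the following minimal sketch computes the clipped surrogate objective commonly associated with that algorithm. It is not taken from the paper: the function name `ppo_clipped_objective`, its argument names, and the default `clip_eps=0.2` are assumptions made for illustration, and the advantage estimates are assumed to come from a separate critic or reward model.

```python
import math


def ppo_clipped_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    new_logps / old_logps: log-probabilities of the sampled actions (or tokens)
    under the current policy and the policy that generated the samples.
    advantages: advantage estimates for the same samples (assumed given).
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio between the current and the sampling policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to [1 - clip_eps, 1 + clip_eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the current and sampling policies agree, every ratio equals 1 and the objective reduces to the mean advantage. Clipping only takes effect once the policy has moved far enough from the sampling policy, which is one way such an objective can limit the size of each update.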
Keywords
DOI: http://dx.doi.org/10.70711/aitr.v3i11.9349