Vol. 4 No. 4 (2025)
Articles

Structured Preference Modeling for Reinforcement Learning-Based Fine-Tuning of Large Models

Published 2025-04-30

How to Cite

Zhu, L., Guo, F., Cai, G., & Ma, Y. (2025). Structured Preference Modeling for Reinforcement Learning-Based Fine-Tuning of Large Models. Journal of Computer Technology and Software, 4(4). https://doi.org/10.5281/zenodo.15340770

Abstract

This paper explores how preference modeling can improve policy optimization efficiency and the controllability of model behavior during reinforcement learning fine-tuning of large models. To address the limitations of traditional RLHF (reinforcement learning from human feedback) methods in modeling human feedback and guiding policy learning, we propose a policy optimization framework that integrates a multi-scale preference modeling mechanism. The method first constructs a structured preference scoring function from human feedback data to approximate reward signals, and then combines it with a policy gradient approach to guide the fine-tuning of language models, aligning model behavior with human preferences. The experimental section evaluates different preference modeling strategies on multiple natural language generation tasks, with a comparative analysis across several dimensions: accuracy, preference alignment, convergence speed, and training stability. The results show that the proposed method outperforms existing approaches overall, demonstrating strong preference modeling capability and improved fine-tuning effectiveness.
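As a rough illustration of the two stages summarized in the abstract, the sketch below (a minimal PyTorch example, not the authors' implementation) pairs a Bradley-Terry-style pairwise preference scorer, trained on chosen/rejected response features, with a REINFORCE-style policy gradient update driven by the learned scores. The names (PreferenceRewardModel, preference_loss, policy_gradient_step) and the random feature tensors standing in for language model embeddings are illustrative assumptions; the paper's structured, multi-scale scoring function is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreferenceRewardModel(nn.Module):
    """Maps (prompt, response) features to a scalar preference score."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)


def preference_loss(rm: PreferenceRewardModel,
                    chosen_feats: torch.Tensor,
                    rejected_feats: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss: the human-preferred response should score higher.
    return -F.logsigmoid(rm(chosen_feats) - rm(rejected_feats)).mean()


def policy_gradient_step(log_probs: torch.Tensor,
                         rewards: torch.Tensor,
                         baseline: torch.Tensor) -> torch.Tensor:
    # REINFORCE-style objective: weight the log-likelihood of sampled responses
    # by the baseline-subtracted reward from the preference model.
    advantages = (rewards - baseline).detach()
    return -(advantages * log_probs).mean()


# Toy usage with random features standing in for frozen LM embeddings.
torch.manual_seed(0)
dim, batch = 16, 8
rm = PreferenceRewardModel(dim)
opt_rm = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Stage 1: fit the preference scoring function on pairwise feedback.
chosen, rejected = torch.randn(batch, dim), torch.randn(batch, dim)
loss_rm = preference_loss(rm, chosen, rejected)
opt_rm.zero_grad()
loss_rm.backward()
opt_rm.step()

# Stage 2: use the learned scores as rewards in a policy gradient update.
# Here log_probs would come from the language model's sampled responses.
log_probs = torch.randn(batch, requires_grad=True)
with torch.no_grad():
    rewards = rm(torch.randn(batch, dim))
pg_loss = policy_gradient_step(log_probs, rewards, rewards.mean())
pg_loss.backward()
```

In this toy setup the mean reward serves as a simple baseline for variance reduction; the paper's comparative experiments on convergence speed and training stability concern the full fine-tuning pipeline rather than this simplified update.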