Instruction Alignment and Risk Calibration in Large Language Models for Safe Human-AI Interaction
Published 2025-06-30
This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Large language models (LLMs) have shown remarkable proficiency in performing complex language tasks across diverse domains. However, their widespread deployment in real-world settings is constrained by safety concerns, including hallucinations, inappropriate advice, and misalignment with user intent. In this paper, we propose a unified framework for instruction alignment and risk calibration that enhances the safety and controllability of LLM outputs. The framework integrates three core components: risk-conditioned instruction tuning, real-time risk-aware response calibration, and reinforcement learning with a composite reward based on both human preferences and automated risk estimates. Experimental results across healthcare, finance, legal, and general dialogue tasks demonstrate that our model significantly improves helpfulness and calibrated refusal accuracy over instruction-tuned and RLHF baselines. Furthermore, case studies confirm its robustness in high-stakes applications, showing improved content moderation and stronger user trust. The proposed approach offers a scalable and modular solution for building LLMs that are not only capable and coherent but also responsible and safe in deployment.
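
As a rough illustration of the third component, the sketch below shows one plausible way a composite reward of this kind might combine a human-preference score with an automated risk estimate. The function name, the linear weighting, and the value of the trade-off coefficient are assumptions made for illustration; the abstract does not specify how the two signals are combined.

```python
# Illustrative sketch only: one plausible composite RL reward that combines
# a human-preference reward with an automated risk estimate.
# Names, weighting scheme, and the default lambda are assumptions, not
# details taken from the paper.

def composite_reward(pref_score: float, risk_score: float, lam: float = 0.5) -> float:
    """Combine a preference-model score with a risk estimate.

    pref_score: scalar reward from a human-preference reward model.
    risk_score: scalar in [0, 1] from an automated risk estimator
                (higher means the response is judged riskier).
    lam:        trade-off coefficient between helpfulness and safety.
    """
    return pref_score - lam * risk_score


# Example: a helpful but slightly risky response is penalized accordingly.
print(composite_reward(pref_score=0.8, risk_score=0.3))  # 0.65
```

Under this reading, increasing lam pushes the policy toward calibrated refusals on high-risk prompts, while a small lam favors helpfulness; the paper's actual formulation may differ.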