Instruction Alignment and Risk Calibration in Large Language Models for Safe Human-AI Interaction
Published 2025-06-30
This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Large language models (LLMs) have shown remarkable proficiency in performing complex language tasks across diverse domains. However, their widespread deployment in real-world settings is constrained by safety concerns, including hallucinations, inappropriate advice, and misalignment with user intent. In this paper, we propose a unified framework for instruction alignment and risk calibration that enhances the safety and controllability of LLM outputs. The framework integrates three core components: risk-conditioned instruction tuning, real-time risk-aware response calibration, and reinforcement learning with a composite reward based on both human preferences and automated risk estimates. Experimental results across healthcare, finance, legal, and general dialogue tasks demonstrate that our model significantly improves helpfulness and calibrated refusal accuracy over instruction-tuned and RLHF baselines. Furthermore, case studies confirm its robustness in high-stakes applications, showing improved content moderation and stronger user trust. The proposed approach offers a scalable and modular solution for building LLMs that are not only capable and coherent but also responsible and safe in deployment.
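
As a rough illustration of the third component, the sketch below shows one plausible way a composite reward of this kind might combine a human-preference score with an automated risk estimate. The function name, the linear weighting, and the value of the trade-off coefficient are assumptions made for illustration; the abstract does not specify how the two signals are combined.

```python
# Illustrative sketch only: one plausible composite RL reward that combines
# a human-preference reward with an automated risk estimate.
# Names, weighting scheme, and the default lambda are assumptions, not
# details taken from the paper.

def composite_reward(pref_score: float, risk_score: float, lam: float = 0.5) -> float:
    """Combine a preference-model score with a risk estimate.

    pref_score: scalar reward from a human-preference reward model.
    risk_score: scalar in [0, 1] from an automated risk estimator
                (higher means the response is judged riskier).
    lam:        trade-off coefficient between helpfulness and safety.
    """
    return pref_score - lam * risk_score


# Example: a helpful but slightly risky response is penalized accordingly.
print(composite_reward(pref_score=0.8, risk_score=0.3))  # 0.65
```

Under this reading, increasing lam pushes the policy toward calibrated refusals on high-risk prompts, while a small lam favors helpfulness; the paper's actual formulation may differ.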