Fine-Tuning - Deep Learning Language Models

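This chapter walks through one complete fine-tuning workflow: choosing a model, preparing a dataset, applying a chat template, training, and using the result. Fine-tuning tools change quickly, so this book uses Unsloth as its tool of choice at the time of writing. Unsloth cuts memory use and training time substantially, which makes LoRA fine-tuning feasible on a single GPU.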

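Full fine-tuning, which updates every weight of a large model, is impractical on modest hardware. The chapter therefore first covers PEFT, the family of efficient fine-tuning techniques, and LoRA, its de facto standard, before the hands-on work with Unsloth.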

1 PEFT

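**PEFT (Parameter-Efficient Fine-Tuning)** is a post-training approach that keeps the original model weights frozen and updates only a very small set of added parameters or selected layers (typically about 0.1% to just under 1% of the total). It adapts the representational power of a pretrained model to a downstream task at low cost. The approach emerged because model sizes grew to billions and then hundreds of billions of parameters, at which point full fine-tuning, which updates every weight, demands compute and storage that most teams cannot afford.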

1.1 Comparison with Traditional Transfer Learning

The value of transfer learning, which adapts pretrained weights to a downstream task, is well established (Howard & Ruder, 2018; Radford et al., 2018), but earlier techniques and PEFT differ sharply in how efficiently they use hardware.

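| Aspect | Full fine-tuning (Full FT) | Feature extraction | PEFT |
|---|---|---|---|
| How it works | Updates all pretrained weights | Freezes the backbone and trains only the top task head | Freezes the backbone and trains only a few small modules |
| Share of trainable parameters | 100% | Under 0.01% | Roughly 0.1% to 1.0% |
| Training VRAM | Very high | Very low | Low (gradients and optimizer state only for the trained modules) |
| Final performance | Very strong | Average or below | Comparable to full fine-tuning, sometimes better |
| Checkpoint size | Tens to hundreds of GB | A few tens of MB at most | Tens of MB (modules saved on their own) |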

1.2 Three Families of Modern PEFT

PEFT methods fall into three families, according to how they insert, select, or re-express weights.

① Additive PEFT: inserts new parameter layers into the existing architecture, or injects virtual tokens into the input.

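- Bottleneck adapter: proposed by Houlsby et al. (2019). It projects a hidden state of dimension $D$ down to a small intermediate dimension $d$ ($d \ll D$), applies a nonlinear activation, projects back up to the original dimension, and adds the result through a residual connection. This cuts the trainable parameters per task by a factor of 100 or more.
- Prompt and prefix tuning: prepends trainable virtual tokens to the input embeddings (Lester et al., 2021), or prepends trainable prefix vectors to the keys and values of every attention layer (Li & Liang, 2021).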
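The down-project, activate, up-project, and residual steps of a bottleneck adapter can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation from Houlsby et al.: the sizes `D = 768` and `d = 8`, the ReLU activation, and the zero-initialized up-projection are all assumptions chosen for the example.

```python
import numpy as np

def bottleneck_adapter(h, W_down, b_down, W_up, b_up):
    """Apply a bottleneck adapter to hidden states h of shape (n, D)."""
    z = h @ W_down + b_down        # down-projection: (n, D) -> (n, d)
    z = np.maximum(z, 0.0)         # nonlinearity (ReLU, an illustrative choice)
    return h + z @ W_up + b_up     # up-projection plus residual connection

D, d = 768, 8                      # illustrative sizes with d << D
rng = np.random.default_rng(0)
W_down = rng.normal(scale=0.01, size=(D, d))
b_down = np.zeros(d)
W_up = np.zeros((d, D))            # zero init: the adapter starts as the identity
b_up = np.zeros(D)

h = rng.normal(size=(2, D))        # two example hidden-state vectors
out = bottleneck_adapter(h, W_down, b_down, W_up, b_up)

# Trainable parameters of one adapter: two projections plus two biases.
adapter_params = D * d + d + d * D + D
print(out.shape, adapter_params)
```

Because only the four adapter arrays would be trained, the count grows with `2 * D * d` rather than with `D * D`, which is where the savings come from.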

② Selective PEFT: adds no new modules, and unfreezes only a small subset of the existing backbone weights.

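- BitFit: freezes every weight matrix $W$ and trains only the bias vectors (Zaken et al., 2022). Training under 0.1% of the parameters was still enough for solid results on classification and other natural-language benchmarks.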
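A quick calculation shows why bias-only training touches so few parameters. The sketch below counts the bias share of a single dense layer; the 1024 by 1024 layer size is an assumption for illustration, and real models contain other parameter types (embeddings, normalization weights) that shift the exact figure.

```python
def bias_fraction(d_in, d_out):
    """Share of a dense layer's parameters that are bias terms."""
    weights = d_in * d_out
    biases = d_out
    return biases / (weights + biases)

frac = bias_fraction(1024, 1024)
print(f"{frac:.4%}")  # just under 0.1% for this layer size
```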

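③ Reparameterization PEFT: re-expresses the weight change $\Delta W$ itself in a low-dimensional form. LoRA belongs here. Because it adds no inference latency and comes closest to full fine-tuning performance, it has become the standard for adapting modern LLMs.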

1.3 Engineering Benefits of PEFT

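- Storage and deployment efficiency: a single copy of the base weights is shared, and only a module of tens of MB is stored and swapped per task (full fine-tuning duplicates tens of GB per task).
- Less catastrophic forgetting: because the base weights stay frozen, learning a new domain cannot overwrite them, which limits the loss of pretrained capabilities.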

2 LoRA

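**LoRA (Low-Rank Adaptation)** keeps the original weights fixed and trains only small low-rank decomposition matrices. Hu et al. (2021) report up to about 10,000 times fewer trainable parameters than full fine-tuning (for GPT-3 175B) at comparable quality, and LoRA is now the standard fine-tuning method.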

2.1 Limits of Full Fine-Tuning

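- Checkpoint sprawl: for a 70B model, every training run stores a full copy of the weights, about 140 GB at 16-bit precision, so supporting 10 tasks takes terabyte-scale storage.
- Task-switching overhead: each time serving switches tasks, the full 140 GB has to be loaded into VRAM again, which adds latency (Brown et al., 2020).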
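The storage figures above follow from simple arithmetic. The sketch below assumes 2 bytes per parameter (16-bit weights) and decimal units (1 GB = 10^9 bytes); both are assumptions, since the text does not state the precision.

```python
def checkpoint_gb(num_params, bytes_per_param=2):
    """Size of a full-weight checkpoint in GB (decimal units)."""
    return num_params * bytes_per_param / 1e9

full_copy_gb = checkpoint_gb(70e9)         # one full copy of a 70B model
ten_tasks_tb = 10 * full_copy_gb / 1000    # ten task-specific copies, in TB
print(full_copy_gb, ten_tasks_tb)
```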

2.2 The Mathematics of Low-Rank Decomposition

LoRA builds on the insight that weight updates can be expressed in a space of very low intrinsic rank. Given a pretrained weight $W_0 \in \mathbb{R}^{d \times k}$ and the linear operation $h = W_0x$ on input $x$, LoRA freezes $W_0$ and reparameterizes only the change $\Delta W$ as the product of two matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ of low rank $r$ ($r \ll \min(d,k)$).

$$
h = W_0x + \Delta Wx = W_0x + BAx \tag{1}
$$

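Initialization: $A$ is drawn from a Gaussian and $B$ is set to zero, so that the output at the start of training equals that of $W_0$. Since $\Delta W = BA = 0$ initially, the base model's behavior is carried over unchanged.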
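The forward pass and the initialization can be checked numerically. The following NumPy sketch uses small illustrative sizes (`d = 64`, `k = 48`, `r = 4`, all assumptions) and confirms that a freshly initialized LoRA branch leaves the output unchanged, while training far fewer parameters than the full matrix.

```python
import numpy as np

d, k, r = 64, 48, 4                         # illustrative sizes with r << min(d, k)
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, k))                # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, k))     # Gaussian initialization
B = np.zeros((d, r))                        # zero initialization, so BA = 0
x = rng.normal(size=k)

h_base = W0 @ x                             # original linear operation
h_lora = W0 @ x + B @ (A @ x)               # W0 x + B A x

full_params = d * k                         # parameters in W0
lora_params = r * (d + k)                   # parameters in A and B together
print(np.allclose(h_base, h_lora), full_params, lora_params)
```

Computing `B @ (A @ x)` rather than `(B @ A) @ x` avoids materializing the full `d` by `k` update during training.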

2.3 Hyperparameter Tips

The weight change is rescaled by a factor of $\alpha / r$ before it is applied:

$$
\Delta W = \frac{\alpha}{r} BA \tag{2}
$$

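- Rank $r$ and $\alpha$: thanks to the $\frac{\alpha}{r}$ scaling, changing $r$ between experiments does not require retuning the learning rate much. A common choice is $\alpha$ equal to twice $r$ (for example $r=16$, $\alpha=32$) or equal to $r$.
- Target modules: the original LoRA experiments (Hu et al., 2021) froze the MLP and adapted only the self-attention projections $W_q, W_k, W_v, W_o$ for parameter efficiency. The configuration used later in this chapter adapts the MLP projections as well.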

2.4 Zero-Latency Merge

Adapter methods add intermediate layers, and therefore latency, at inference time, whereas LoRA lets you merge $BA$ directly into the base weights after training.

$$
W_{\text{final}} = W_0 + \frac{\alpha}{r} BA \tag{3}
$$

After merging, the architecture is unchanged, so the model serves at **the same inference speed as a fully fine-tuned model (zero added latency)**.
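That merging changes nothing about the output can be verified directly. The NumPy sketch below uses illustrative sizes and random matrices standing in for trained adapters (all assumptions) and compares the two-branch forward pass with the merged weight from Equation (3).

```python
import numpy as np

d, k, r, alpha = 64, 48, 4, 8               # illustrative sizes and scaling
rng = np.random.default_rng(1)

W0 = rng.normal(size=(d, k))                # frozen base weight
A = rng.normal(size=(r, k))                 # stands in for a trained A
B = rng.normal(size=(d, r))                 # stands in for a trained B (nonzero)
x = rng.normal(size=k)
scale = alpha / r

h_adapter = W0 @ x + scale * (B @ (A @ x))  # separate base and LoRA branches
W_final = W0 + scale * (B @ A)              # merged weight, same shape as W0
h_merged = W_final @ x                      # a single matrix multiply at inference

print(np.allclose(h_adapter, h_merged), W_final.shape == W0.shape)
```

Since `W_final` has the same shape as `W0`, the merged model runs exactly the same operations as the base model.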

2.5 Benchmark

The table compares the VRAM requirements of full fine-tuning and LoRA on an A100.

| Model | Full fine-tuning | PEFT-LoRA | LoRA + CPU offload |
|---|---|---|---|
| 3B (T0 3B) | 47.1 GB | 14.4 GB | 9.8 GB GPU / 17.8 GB CPU |
| 7B (BLOOMZ 7B) | OOM | 32.0 GB | 18.1 GB GPU / 35.0 GB CPU |
| 12B (MT0 XXL) | OOM | 56.0 GB | 22.0 GB GPU / 52.0 GB CPU |

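Full fine-tuning often runs out of memory (OOM) on a single GPU from 7B upward, while LoRA cuts VRAM to under a third (47.1 GB to 14.4 GB for the 3B model), which makes fine-tuning feasible on low-cost hardware. The quality loss is small (for example, fully fine-tuned Flan-T5 3B scores 0.892 and T0-3B + LoRA 0.863, against a human baseline of 0.897), and the checkpoint is only 19 MB compared with 11 GB for the original model. This is why LoRA has become the dominant technique for local and agent deployments.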
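The ratios quoted above can be reproduced from the table. This is plain arithmetic on the numbers already given; treating 11 GB as 11,000 MB (decimal units) is an assumption.

```python
full_ft_gb, lora_gb = 47.1, 14.4            # 3B row of the table
vram_ratio = lora_gb / full_ft_gb           # LoRA VRAM as a share of full fine-tuning

adapter_mb, base_mb = 19, 11 * 1000         # 19 MB adapter vs 11 GB base model
ckpt_ratio = adapter_mb / base_mb           # adapter size as a share of the base

print(f"VRAM: {vram_ratio:.1%}, checkpoint: {ckpt_ratio:.2%}")
```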

3 Environment Setup

pip install unsloth

Unsloth changes quickly, so its API may differ in newer versions. The code in this chapter was verified with unsloth 2026.5.4, trl 0.24, transformers 5.5.0, and torch 2.10 (CUDA).

import unsloth

print('Unsloth', unsloth.__version__)

Unsloth 2026.5.4

4 Model Selection

from unsloth import FastLanguageModel

model_name = 'unsloth/Qwen3-4B-unsloth-bnb-4bit'
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name, max_seq_length=2048, dtype=None, load_in_4bit=True)

==((====))==  Unsloth 2026.5.4: Fast Qwen3 patching. Transformers: 5.5.0.
   \\   /|    NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 23.999 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 8.6. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.35. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
unsloth/Qwen3-4B-unsloth-bnb-4bit does not have a padding token! Will use pad_token = <|PAD_TOKEN|>.

5 Fine-Tuning Configuration

model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Larger values add capacity but also use more memory.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj",],
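    lora_alpha = 8, # Recommended: equal to the rank, or twice the rank.
    lora_dropout = 0, # 0 is usually recommended.
    bias = "none",    # Other values work, but "none" is recommended.
    use_gradient_checkpointing = "unsloth", # "unsloth" is recommended; True/False also work.
    random_state = 2025, # Random seed for reproducibility (non-negative integer).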
)

Unsloth 2026.5.4 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
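The number of trainable LoRA parameters follows from the rank and the shapes of the seven target projections. The sketch below is a back-of-the-envelope count; the Qwen3-4B layer dimensions are assumptions not stated in this chapter, although the result agrees with the trainable-parameter figure in the training log later on (16,515,072 of 4,038,983,168).

```python
# Assumed Qwen3-4B dimensions (not given in this chapter).
hidden, q_out, kv_out, mlp = 2560, 4096, 1024, 9728
layers, r = 36, 8                           # 36 patched layers, rank from the config

# (input size, output size) of each targeted projection.
shapes = {
    "q_proj": (hidden, q_out),
    "k_proj": (hidden, kv_out),
    "v_proj": (hidden, kv_out),
    "o_proj": (q_out, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}

# Each adapter holds A (r x d_in) and B (d_out x r): r * (d_in + d_out) values.
per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
trainable = per_layer * layers
share = 100 * trainable / 4_038_983_168     # total parameter count from the log

print(trainable, round(share, 2))
```

Doubling `r` doubles the count, which is the memory cost mentioned in the comment on `r`.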

6 Dataset

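An SFT dataset holds exemplary conversations for the model to imitate. Column names, role names, and formats differ from source to source, so the data is standardized into a conversation format of role and content pairs before training. This chapter uses mlabonne/FineTome-100k, which collects instruction-response pairs from several high-quality sources in conversational form.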

from datasets import load_dataset

dataset = load_dataset('mlabonne/FineTome-100k', split='train')

sample = dataset[0]
for key, value in sample.items():
    print(f'{key}: {value}\n')

conversations: [{'from': 'human', 'value': 'Explain what boolean operators are, ... across different programming languages.'}, {'from': 'gpt', 'value': 'Boolean operators are logical operators used in programming ... regardless of the language's truthiness and falsiness rules.'}]

source: infini-instruct-top-500k

score: 5.212620735168457

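In the raw data, role names (human, gpt, and so on) and column structure vary by source. standardize_data_formats normalizes them to a standard schema of role (system, user, assistant) and content, so the result can be passed to any model's chat template as is.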

from unsloth.chat_templates import standardize_data_formats

dataset = standardize_data_formats(dataset)

sample = dataset[0]
for turn in sample['conversations']:
    print(f"{turn['role']}: {turn['content']}\n")

user: Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. ... write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.

assistant: Boolean operators are logical operators used in programming to manipulate boolean values. ... This ensures that the result is always a boolean value, regardless of the language's truthiness and falsiness rules.
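Conceptually, the standardization step is a key and role renaming. The sketch below shows the idea on a toy conversation; `ROLE_MAP` and `to_role_content` are hypothetical names for illustration and are not Unsloth's implementation, which handles more source formats.

```python
# Hypothetical mapping from ShareGPT-style speaker names to standard roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def to_role_content(conversation):
    """Convert turns of the form {'from', 'value'} to {'role', 'content'}."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]

raw = [
    {"from": "human", "value": "Hi"},
    {"from": "gpt", "value": "Hello!"},
]
converted = to_role_content(raw)
print(converted)
```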

7 Chat Template

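A chat template wraps messages in role-delimiter tokens (system, user, assistant) so that the model recognizes the structure of a conversation. Base models that only went through pretraining lack this format, while instruct models that went through post-training have it, so fine-tuning data must use the same template the model was trained with. Rather than writing the template string by hand, use get_chat_template to attach the model's own tested template to the tokenizer. Here that is the Qwen3 template (qwen-3).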

from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(tokenizer, chat_template='qwen-3')

The standardized conversations are rendered with this template into a single text column for the trainer to read.

def formatting_prompts_func(examples):
    texts = [
        tokenizer.apply_chat_template(
            conversation, tokenize=False, add_generation_prompt=False)
        for conversation in examples['conversations']
    ]
    return {'text': texts}

train_dataset = dataset.map(formatting_prompts_func, batched=True)

print(train_dataset[0]['text'])

<|im_start|>user
Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. ... implemented differently across different programming languages.<|im_end|>
<|im_start|>assistant
<think>

</think>

Boolean operators are logical operators used in programming to manipulate boolean values. ... regardless of the language's truthiness and falsiness rules.<|im_end|>
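The rendered text above follows the ChatML pattern: each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers with the role on the first line. The sketch below reproduces only that basic wrapping; `render_chatml` is a hypothetical helper, and it deliberately omits details of the real Qwen3 template such as the empty `<think>` block visible in the output.

```python
def render_chatml(messages):
    """Render role/content messages in a simplified ChatML layout."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    )

rendered = render_chatml([
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
])
print(rendered)
```

In practice the tokenizer's own template should be used, as in the code above, because small differences in whitespace or special tokens change what the model sees.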

8 Model Training

All training settings, including the data column (dataset_text_field) and the sequence length (max_seq_length), go into a single SFTConfig.

from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        max_seq_length = 2048,
        packing = False, # packing=True can speed up training on short sequences by up to 5x.
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60, # When set, takes precedence over num_train_epochs.
        num_train_epochs = 1,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "qwen3-finetuned",
        report_to = "none", # Change this to log to WandB or a similar service.
    ),
)

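If the prompt (the instruction) is also trained on, the model spends capacity on imitating user utterances. train_on_responses_only masks the loss so that it applies only to the assistant response spans, which focuses the model on how to answer. The role boundaries are specified with the delimiter tokens of the Qwen3 template.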

from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|im_start|>user\n",
    response_part = "<|im_start|>assistant\n",
)

Unsloth: Removed 11 out of 100000 samples from train_dataset where all labels were -100 (no response found after truncation). This prevents NaN loss during training.
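The masking idea can be shown on toy token IDs: every position outside an assistant span gets the label -100, which the loss ignores. This is a simplified sketch, not Unsloth's implementation; `mask_non_responses` is a hypothetical helper, and the role markers are single made-up token IDs, whereas the real markers span several tokens.

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def mask_non_responses(token_ids, user_marker, assistant_marker):
    """Keep labels only for tokens that follow an assistant marker."""
    labels = []
    in_response = False
    for tok in token_ids:
        if tok == assistant_marker:
            in_response = True
            labels.append(IGNORE_INDEX)     # the marker itself is not a target
        elif tok == user_marker:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(tok if in_response else IGNORE_INDEX)
    return labels

USER, ASSISTANT = 1, 2                      # made-up marker IDs
token_ids = [USER, 10, 11, 12, ASSISTANT, 20, 21]

labels = mask_non_responses(token_ids, USER, ASSISTANT)
truncated = mask_non_responses(token_ids[:4], USER, ASSISTANT)
print(labels)
print(truncated)
```

The truncated case mirrors the log message above: when a sample is cut off before the response starts, every label is -100 and the sample contributes nothing to training, so it is dropped.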

A C compiler is required

On a Debian-based distribution such as Ubuntu, install one with:

sudo apt update
sudo apt install -y build-essential

import torch

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 3090. Max memory = 23.999 GB.
3.449 GB of memory reserved.

trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 99,989 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 16,515,072 of 4,038,983,168 (0.41% trained)
{'loss': 1.1, 'grad_norm': 0.5406, 'learning_rate': 0.0, 'epoch': 0.0001}
{'loss': 0.9904, 'grad_norm': 0.4712, 'learning_rate': 4e-05, 'epoch': 0.0002}
{'loss': 1.117, 'grad_norm': 0.5819, 'learning_rate': 8e-05, 'epoch': 0.0002}
...
{'loss': 0.6233, 'grad_norm': 0.1532, 'learning_rate': 7.273e-06, 'epoch': 0.0047}
{'loss': 0.5512, 'grad_norm': 0.131, 'learning_rate': 3.636e-06, 'epoch': 0.0048}
{'train_runtime': 466.8, 'train_loss': 0.7907, 'epoch': 0.0048}
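Several numbers in this log can be reproduced from the SFTConfig values. The sketch below recomputes the effective batch size, the fraction of an epoch covered by 60 steps, and the learning rate under linear warmup and linear decay. `linear_lr` is a hypothetical helper that mirrors the usual linear schedule, and indexing it by the number of completed steps is an assumption that happens to match the logged values.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate after `step` completed optimizer steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps            # linear warmup from 0
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)                # linear decay to 0

effective_batch = 2 * 4 * 1                 # per-device batch x accumulation x GPUs
examples_seen = effective_batch * 60        # examples consumed in 60 steps
epoch_fraction = examples_seen / 99_989     # dataset size after filtering

print(effective_batch, examples_seen, round(epoch_fraction, 4))
print(linear_lr(0), linear_lr(1), linear_lr(59))
```

With only 480 of roughly 100,000 examples seen, this run is a short demonstration; a full pass would need `max_steps` removed or raised.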

import torch

used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

466.8338 seconds used for training.
7.78 minutes used for training.
Peak reserved memory = 6.068 GB.
Peak reserved memory for training = 2.619 GB.
Peak reserved memory % of max memory = 25.284 %.
Peak reserved memory for training % of max memory = 10.913 %.

8.1 Saving the Model

model_path = "qwen3-lora-finetuned"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

9 Using the Model

9.1 Loading the Model

from pathlib import Path
from unsloth import FastLanguageModel

model_path = Path("qwen3-lora-finetuned")
if model_path.exists():
    model_path = str(model_path)
    print(f"Loading model: {model_path}")
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_path, max_seq_length=2048, dtype=None, load_in_4bit=True)
    FastLanguageModel.for_inference(model) # Enables 2x faster inference

FastLanguageModel.for_inference(model)
messages = [
    {
        "role": "user",
        "content": "Continue the fibonacci sequence: 1, 1, 2, 3, 5, 8,"
    },
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    input_ids, streamer=text_streamer, max_new_tokens=128, pad_token_id=tokenizer.eos_token_id)

<think>

</think>

The next numbers in the Fibonacci sequence are 13, 21, 34, 55, 89, 144, 233, and so on. The Fibonacci sequence is a series of numbers where each number is the sum of the two preceding numbers. The sequence starts with 1 and 1, and each subsequent number is the sum of the two previous numbers.<|im_end|>

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
messages = [
    {"role": "user",      "content": "Continue the fibonacci sequence: 1, 1, 2, 3, 5, 8"},
    {"role": "assistant", "content": "The fibonacci sequence continues as 13, 21, 34, 55 and 89."},
    {"role": "user",      "content": "What is France's tallest tower called?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer=text_streamer, max_new_tokens=128, pad_token_id=tokenizer.eos_token_id)

References

Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. https://arxiv.org/abs/1801.06146
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). Parameter-Efficient Transfer Learning for NLP. https://arxiv.org/abs/1902.00751
Lester, B., Al-Rfou, R., & Constant, N. (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. https://arxiv.org/abs/2104.08691
Li, X. L., & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. https://arxiv.org/abs/2101.00190
Zaken, E. B., Ravfogel, S., & Goldberg, Y. (2022). BitFit: Simple Parameter-Efficient Fine-Tuning for Transformer-Based Masked Language-Models. https://arxiv.org/abs/2106.10199
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. https://arxiv.org/abs/2106.09685
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165