Building Neural Networks with Keras

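In Part 1 we started from the perceptron and worked through multilayer networks, backpropagation, and CNNs by hand to learn how neural networks operate internally. This chapter builds on that understanding and moves up to high-level frameworks. We implement the same classifier structure three ways, as (1) a linear classifier, (2) PyTorch's nn.Sequential, and (3) Keras's Sequential, and compare how much more concisely each higher level of abstraction expresses the same work.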

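The starting point is to view classification as a probabilistic model. The last layer of a neural network, together with its loss function, is ultimately a device for estimating the probability that an input belongs to each class, and logistic regression shows this view in its cleanest form.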

1 Probabilistic modeling

We begin with the iris dataset: a multiclass problem that separates 3 species using 4 features, the length and width of the sepals and petals.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
print(type(iris))
print(iris.keys())

print(iris.target_names, '->', np.unique(iris.target))
print(iris.data.shape)
iris.frame = pd.DataFrame(iris.data, columns=iris.feature_names)
iris.frame['target'] = iris.target
iris.frame.head()

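Despite its name, logistic regression is a classifier. It keeps one weight vector per class, computes a linear combination of the input for each, converts those values to probabilities, and predicts the class with the highest probability.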

from sklearn.linear_model import LogisticRegression

최대학습횟수 = 1000
model = LogisticRegression(max_iter=최대학습횟수)
model.fit(iris.data, iris.target)

예측 = model.predict(iris.data)
채점 = 예측 == iris.target
print(f'{채점.sum()}/{len(채점)} = {채점.mean():.1%}')
오답필터 = np.logical_not(채점) # True <-> False
iris.frame.assign(예측=예측)[오답필터]

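Most misclassified samples lie on the boundary between two species. A scatter plot of just two features (sepal length and petal length) shows at a glance where the errors (black ×) fall.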

import matplotlib.pyplot as plt

plt.scatter(x=iris.data[:, 0], y=iris.data[:, 2], c=iris.target, cmap='brg')
# mark the misclassified samples
plt.scatter(x=iris.data[오답필터, 0], y=iris.data[오답필터, 2], c='k', marker='x')
plt.colorbar()
plt.grid()
plt.show()

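What the model has learned is a weight matrix $W$ and a bias $b$. For an input $X$, the per-class scores (logits) are computed as

$$Z = X W^{\top} + b \tag{1}$$

which matches the value of scikit-learn's decision_function. The function that turns these scores into probabilities is **softmax**.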

$$\text{softmax}(z)_k = \frac{e^{z_k}}{\sum_j e^{z_j}} \tag{2}$$

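Softmax turns a score vector into a vector of positive numbers that sum to 1, that is, a probability distribution. Because softmax is monotonic, the largest logit receives the largest probability, so the prediction obtained from argmax(Z) is the class with the highest probability. This chain, linear scores → softmax → cross-entropy, is the same output structure used by every classification network we build from here on.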

from scipy.special import softmax
# learned parameters
W = model.coef_ # weights, also called coefficients
b = model.intercept_ # intercept, also called bias
print(f'W: {W.shape} b: {b.shape}')

Z = iris.data @ W.T + b # X @ W.T + b
assert np.allclose(Z, model.decision_function(iris.data))
예측 = np.argmax(Z, axis=1) # index (position) of the largest value
assert np.all(예측 == model.predict(iris.data))
# # equivalent hand-written softmax
# softmax = lambda Z: np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
pd.DataFrame(softmax(Z, axis=1), columns=iris.target_names).round(3).assign(예측=예측, 정답=iris.target)[오답필터]
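The properties of softmax stated above can be checked on a small example. The following is a minimal sketch in plain NumPy; the function name `stable_softmax` and the logit values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def stable_softmax(z, axis=-1):
    """Softmax that subtracts the maximum first to avoid overflow in exp."""
    shifted = z - z.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Hypothetical logits for two samples and three classes (illustrative values).
logits = np.array([[2.0, 1.0, -1.0],
                   [0.5, 0.5, 3.0]])
probs = stable_softmax(logits, axis=1)

# Each row is a probability distribution: positive entries that sum to 1.
assert np.all(probs > 0) and np.allclose(probs.sum(axis=1), 1.0)
# Softmax is monotonic, so the arg-max class does not change.
assert np.array_equal(probs.argmax(axis=1), logits.argmax(axis=1))
# Adding the same constant to every logit in a row changes nothing.
assert np.allclose(stable_softmax(logits + 100.0, axis=1), probs)

# Cross-entropy: the mean negative log-probability of the true classes.
labels = np.array([0, 2])
cross_entropy = -np.log(probs[np.arange(len(labels)), labels]).mean()
print(probs.round(3))
print(round(float(cross_entropy), 3))
```

The shift invariance in the third check is what makes the max-subtraction trick safe: it changes the intermediate numbers but not the resulting probabilities.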

2 MNIST

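Next we move to MNIST, the handwritten-digit dataset. It consists of 70,000 grayscale images of 28×28 pixels (60,000 for training, 10,000 for testing) and is the "Hello World" of deep learning. We download it with torchvision.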

from torchvision.datasets import MNIST

mnist = {}
for split in ['train', 'test']:
    mnist[split] = MNIST(root='data', train=(split=='train'), download=True)
    print(f'{split:<5}: {len(mnist[split]):,}')

3 Preprocessing

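Each raw sample is a PIL image whose pixel values are integers from 0 to 255. To feed it to a neural network we need to (1) convert it to floating point, (2) divide by 255 to normalize to $[0, 1]$, and (3) add a channel dimension and turn it into a tensor. The code below spells out these steps by hand.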

import torch

sample, label = mnist['train'][0]
print(type(sample), sample.size, f'label: {label}')

xi = np.array(sample)
print(type(xi), xi.shape, xi.dtype, xi.min(), xi.max())
# ToTensor()-style preprocessing
xi = xi.astype(np.float32)
xi /= 255
xi = xi.reshape(1, 28, 28) # add the channel dimension, as ToTensor() does
xi = torch.from_numpy(xi) # numpy -> torch
print(type(xi), xi.shape, xi.dtype, xi.min(), xi.max())

display(sample)

transforms.ToTensor() performs this whole sequence in one call. It handles the type conversion, the normalization, and the channel dimension, and returns a float32 tensor of shape (1, 28, 28).

import torchvision.transforms as transforms

xi = transforms.ToTensor()(sample)
print(type(sample), '->', type(xi))
print(xi.shape, xi.dtype, xi.min(), xi.max())

To feed a fully connected layer, the image has to be flattened into a 1-D vector. A (1, 28, 28) tensor has $1 \times 28 \times 28 = 784$ elements, and the -1 in reshape(-1) means "infer this dimension from the total number of elements".

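sample, label = mnist['train'][0]
if not isinstance(sample, torch.Tensor): # the dataset loaded above yields PIL images
    sample = transforms.ToTensor()(sample)
print(type(sample), sample.shape, sample.dtype, f'label: {label}')
# reshaping a tensor (review)
# 1 x 28 x 28 = ? (-1)
for xi in [sample, sample.reshape(-1), sample.reshape(1, -1)]:
    print(f'number of elements: {xi.numel()}') # numel = number of elements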
    print(xi.shape)
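The rule for -1 is the same in NumPy and PyTorch, so it can be tried without any dataset. The sketch below uses an all-zero array as a stand-in for one MNIST image; the variable names are illustrative.

```python
import numpy as np

image = np.zeros((1, 28, 28), dtype=np.float32)  # stand-in for one MNIST tensor

flat = image.reshape(-1)      # -1 absorbs everything: 1 * 28 * 28 = 784
row = image.reshape(1, -1)    # one row; -1 is inferred as 784
rows = image.reshape(-1, 28)  # 784 / 28 = 28 rows of 28 pixels
print(flat.shape, row.shape, rows.shape)  # (784,) (1, 784) (28, 28)

# Only one dimension may be -1, and the element count must divide evenly.
try:
    image.reshape(-1, 5)  # 784 is not divisible by 5
except ValueError as error:
    print('reshape failed:', error)
```

The (1, 784) form is the one a fully connected layer expects for a single sample, because the leading dimension is read as the batch size.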

4 PyTorch model

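What we stacked layer by layer in Part 1 we now build declaratively with torch.nn. Flatten unrolls the image, Linear(784, 512) + ReLU forms the hidden layer, and Linear(512, 10) outputs the logits for the 10 digits. The loss is CrossEntropyLoss, which takes logits directly and computes the softmax and the negative log-likelihood together internally. That is why the output layer has no separate softmax.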

import torch.nn as nn # neural network module

model = nn.Sequential(
    nn.Flatten(), # (N, 1, 28, 28) -> (N, 784)
    nn.Linear(in_features=28*28, out_features=512),
    nn.ReLU(),
    nn.Linear(in_features=512, out_features=10)
)
print(model)

손실함수 = nn.CrossEntropyLoss()

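# load_mnist.py is a helper script that is not shown here; it is assumed to rebuild `mnist` with ToTensor() applied
%run load_mnist.py
sample, label = mnist['train'][0]
print(sample.shape, '->', sample.reshape(1, -1).shape)
Xs = sample.reshape(1, -1) # (1, 28, 28) -> (1, 784)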
outputs = model(Xs)
print(outputs.shape)
print(f'손실: {손실함수(outputs, torch.tensor([label])):.3f}')
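To make "softmax and negative log-likelihood together" concrete, here is a minimal NumPy sketch of a cross-entropy computed straight from logits with the log-sum-exp trick. It is an illustration of the idea under the assumption of mean reduction over the batch, not PyTorch's actual implementation; the function name is hypothetical.

```python
import numpy as np

def cross_entropy_from_logits(logits, targets):
    """Mean cross-entropy computed directly from logits via log-sum-exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# A 10-class model that outputs identical logits is maximally unsure.
uniform_logits = np.zeros((4, 10))
targets = np.array([5, 0, 4, 1])
loss = cross_entropy_from_logits(uniform_logits, targets)
print(f'{loss:.3f}')  # 2.303, which is ln(10)
```

This gives a handy sanity check: a freshly initialized 10-class network usually starts with a loss in the neighborhood of ln(10) ≈ 2.3, since its predictions are close to uniform.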

5 Training

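The skeleton of the training loop is the same as in Part 1. For every mini-batch it repeats three steps.

1. Forward pass and loss: outputs = model(X_batch) → loss = 손실함수(...)
2. Backward pass: clear the previous gradients with zero_grad(), then compute new ones with loss.backward()
3. Update: move the parameters one step with optimizer.step()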

If a GPU is available, moving the model and each batch with .to(device) accelerates the same code. The optimizer is RMSprop.

import torch.nn as nn # neural network module
from torch.utils.data import DataLoader

model = nn.Sequential(
    nn.Flatten(), # (N, 1, 28, 28) -> (N, 784)
    nn.Linear(in_features=28*28, out_features=512),
    nn.ReLU(),
    nn.Linear(in_features=512, out_features=10)
)
print(model)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device) # move the model to the GPU when one is available
손실함수 = nn.CrossEntropyLoss()
최적화기법 = torch.optim.RMSprop(model.parameters(), lr=1e-3)

# training
data_loader = {}
data_loader['train'] = DataLoader(mnist['train'], batch_size=128, shuffle=True)
loss_history = []
에폭수 = 5
for epoch in range(에폭수):
    for X_batch, y_batch in data_loader['train']:
        X_batch = X_batch.to(device) # move the inputs to the model's device
        y_batch = y_batch.to(device) # move the labels to the model's device
        # 1. forward pass and loss
        outputs = model(X_batch)
        loss = 손실함수(outputs, y_batch)
        loss_history.append(loss.item())
        # 2. gradients (backpropagation)
        최적화기법.zero_grad() # reset the gradients from the previous step
        loss.backward()
        # 3. parameter update
        최적화기법.step()
    print(f'손실 (에폭 {epoch+1}/{에폭수}): {loss.item():.3f}')
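The loop above leaves the update itself to the optimizer. As a rough sketch of what one RMSprop step does, the NumPy code below follows the plain variant (no momentum, not centered) with the smoothing constant 0.99 and epsilon 1e-8 that torch.optim.RMSprop documents as defaults. The function name and the toy objective are assumptions made for illustration.

```python
import numpy as np

def rmsprop_step(param, grad, square_avg, lr=1e-3, alpha=0.99, eps=1e-8):
    """One plain RMSprop update: scale the gradient by a running RMS of itself."""
    square_avg = alpha * square_avg + (1 - alpha) * grad**2
    param = param - lr * grad / (np.sqrt(square_avg) + eps)
    return param, square_avg

# Toy objective f(w) = sum(w**2) with gradient 2*w (illustrative only).
w = np.array([0.5, -1.0])
v = np.zeros_like(w)
for _ in range(5000):
    w, v = rmsprop_step(w, 2 * w, v)
print(w)  # both entries end up close to 0
```

Because each gradient is divided by its own running magnitude, the step size stays on the order of the learning rate regardless of how large the raw gradient is, which is the main difference from plain gradient descent.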

6 Evaluation

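Evaluation needs no gradients, so we run only the forward pass inside torch.inference_mode(). We concatenate the outputs over the whole test set, take the argmax to get predictions, and compute the accuracy.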

data_loader['test'] = DataLoader(mnist['test'], batch_size=128, shuffle=False)

all_outputs = []; all_targets = []
with torch.inference_mode(): # no gradient tracking is needed for evaluation
    for X_batch, y_batch in data_loader['test']:
        outputs = model(X_batch.to(device)) # inputs must be on the model's device
        all_outputs.append(outputs.cpu()) # collect results on the CPU
        all_targets.append(y_batch)

all_outputs = torch.cat(all_outputs, dim=0) # cat = concatenate
all_targets = torch.cat(all_targets, dim=0)
print(all_outputs.shape)
예측 = torch.argmax(all_outputs, dim=1)
채점 = 예측 == all_targets
print(f'{채점.sum()}/{len(채점)} = {채점.float().mean():.1%}')
pd.DataFrame(all_outputs.numpy()).assign(예측=예측.numpy(), 정답=all_targets.numpy()).round(3).head()

7 What nn.Linear really is

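Internally, nn.Linear is a plain matrix operation. If we take out its weight and bias and compute the result ourselves, it matches the layer's output up to floating-point tolerance.

$$\text{nn.Linear}(X) = X W^{\top} + b \tag{3}$$

Note the transpose $W^{\top}$ here. It becomes the key point when we compare with Keras shortly.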

f = nn.Linear(in_features=784, out_features=10)

X_batch, y_batch = next(iter(data_loader['train']))
X_batch = X_batch.reshape(-1, 784) # (128, 1, 28, 28) -> (128, 784)
outputs = f(X_batch)
W = f.weight
b = f.bias
print(f'W: {W.shape} b: {b.shape}')
Z = X_batch @ W.T + b
assert torch.allclose(Z, outputs)
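Since a linear layer holds an (out_features × in_features) weight matrix plus one bias per output, its parameter count follows directly from Equation (3). The helper below is a hypothetical illustration that applies this to the two layers of the MNIST model used in this chapter.

```python
def linear_param_count(in_features, out_features):
    """Weights (out_features x in_features) plus one bias per output unit."""
    return out_features * in_features + out_features

hidden = linear_param_count(28 * 28, 512)  # 784 * 512 + 512
output = linear_param_count(512, 10)       # 512 * 10 + 10
print(hidden, output, hidden + output)     # 401920 5130 407050
```

Almost all of the parameters sit in the first layer, because its input is the full 784-pixel image.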

8 Switching to Keras

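Keras is a high-level neural network API that runs on top of several backends. Setting the environment variable KERAS_BACKEND='torch' before keras is imported makes PyTorch carry out Keras's operations. In other words, we describe the model in Keras while the tensor operations and automatic differentiation are handled by PyTorch, which we already know.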

import os
# set the environment variable (must happen before `import keras`)
os.environ['KERAS_BACKEND'] = 'torch'

import torch
import keras

print('PyTorch', torch.__version__)
print('Keras', keras.__version__)
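Keras reads KERAS_BACKEND when the package is imported, so the order of the statements above matters. One way to guard against setting it too late is a small helper like the following; `select_keras_backend` is a hypothetical name, not a Keras function, and the sketch uses only the standard library.

```python
import os
import sys

def select_keras_backend(name='torch'):
    """Set KERAS_BACKEND, refusing if Keras has already been imported."""
    if 'keras' in sys.modules:
        raise RuntimeError('keras is already imported; restart the kernel first')
    os.environ['KERAS_BACKEND'] = name

select_keras_backend('torch')
print(os.environ['KERAS_BACKEND'])  # torch
```

In a notebook, this means the cell that sets the variable has to run before any cell that imports keras; changing the variable afterwards has no effect until the kernel is restarted.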

9 keras.layers.Dense and the transpose

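The fully connected layer in Keras is Dense, and it does the same job as nn.Linear. The difference is the orientation of the weight matrix.

$$\text{keras.layers.Dense}(X) = X W + b \tag{4}$$

There is no transpose. Placing the two formulas side by side makes the relationship clear.

| Framework | Operation | Weight |
|---|---|---|
| `nn.Linear` | $X W^{\top} + b$ | `weight` |
| `keras.layers.Dense` | $X W + b$ | `kernel` |

nn.Linear stores weight with shape (out_features, in_features), while Dense stores kernel with shape (in_features, out_features). Moving the weights of the same layer from one framework to the other therefore requires a transpose: nn.Linear.weight = keras.layers.Dense.kernel.T. Below we compute the output directly from the matrices returned by get_weights() and confirm that it matches the layer's output.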

f = keras.layers.Dense(10)

X_batch, y_batch = next(iter(data_loader['train']))
X_batch = X_batch.reshape(-1, 784)

outputs = f(X_batch)
print(type(outputs), outputs.shape)
W, b = f.get_weights() # fetch the parameters as NumPy arrays
print(f'W: {W.shape} b: {b.shape}')
Z = X_batch.numpy() @ W + b # nn.Linear.weight = keras.layers.Dense.kernel.T
assert np.allclose(Z, outputs.detach().cpu().numpy(), atol=1e-5)
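The relationship between the two layouts can also be verified without either framework. The sketch below uses random NumPy arrays as stand-ins for a batch and for the layer parameters; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 784))        # a batch of 4 flattened images
kernel = rng.normal(size=(784, 10))  # Dense layout: (in_features, out_features)
bias = rng.normal(size=10)

weight = kernel.T                    # Linear layout: (out_features, in_features)

dense_out = X @ kernel + bias        # Equation (4)
linear_out = X @ weight.T + bias     # Equation (3)
assert np.allclose(dense_out, linear_out)
print(kernel.shape, weight.shape, dense_out.shape)  # (784, 10) (10, 784) (4, 10)
```

The bias needs no change in either direction; only the weight matrix is transposed.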

10 Keras model: compile / fit / evaluate

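The real advantage of Keras is that you do not write the training loop yourself. You declare the model with Sequential, bundle the loss, optimizer, and metrics with compile, and train with a single call to fit. The entire nested for loop we wrote by hand collapses into one line of fit.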

The loss is SparseCategoricalCrossentropy(from_logits=True). The setting from_logits=True tells the loss that the outputs are logits that have not yet passed through softmax, which mirrors how CrossEntropyLoss took logits in PyTorch. With metrics=['accuracy'] we track accuracy alongside the loss.

model = keras.Sequential([
    keras.Input(shape=(1, 28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dense(512, activation='relu'),
    keras.layers.Dense(10)
])
model.summary()

model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),
    metrics=['accuracy'] # metric tracked in addition to the loss
)

%run load_mnist.py
data_loader = {}
for split in mnist.keys():
    data_loader[split] = DataLoader(
        mnist[split], batch_size=128, shuffle=(split=='train'))
    
history = model.fit(data_loader['train'], epochs=5)

Evaluation is likewise a single call: evaluate returns the loss and the accuracy.

test_results = model.evaluate(data_loader['test'], return_dict=True)
pd.Series(test_results).round(3)

Because the backend is PyTorch, we can pass a batch of PyTorch tensors straight into the model, and the output is a PyTorch tensor as well. Keras models and PyTorch data types mix seamlessly.

X_batch, y_batch = next(iter(data_loader['test']))
outputs = model(X_batch) # the backend performs the computation
print(type(outputs), outputs.shape)

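One reason PyTorch remains widely used in research and practice, even though Keras is this concise, is that its tensor API closely resembles NumPy's. Anyone familiar with NumPy can adapt quickly, often by doing little more than replacing np.linspace with torch.linspace. One difference is visible in the printed dtype: NumPy defaults to float64, PyTorch to float32.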

import numpy as np

x = np.linspace(0, 1, 5)
print(x)
print(type(x), x.shape, x.dtype)

import torch

x = torch.linspace(0, 1, 5)
print(x)
print(type(x), x.shape, x.dtype)

11 Mini-batches

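Mini-batching means feeding the training data in small chunks (batches) instead of all at once. DataLoader takes care of the splitting and the shuffling. The number of batches is the number of samples divided by the batch size, rounded up, and the last batch may be smaller than the rest.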

from torch.utils.data import DataLoader

배치크기 = 128
data_loader = {}
data_loader['train'] = DataLoader(mnist['train'], batch_size=배치크기, shuffle=True)#, drop_last=True)
print(f'batch size: {배치크기} -> number of batches: {len(data_loader["train"])} = ceil({len(mnist["train"])}/{배치크기})')
assert len(data_loader['train']) == np.ceil(len(mnist['train']) / 배치크기)

for X_batch, y_batch in data_loader['train']:
    try:
        assert 배치크기 == len(X_batch) == len(y_batch)
    except AssertionError:
        print(f'배치 크기 불일치: {배치크기} != {len(X_batch)}')
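The batch count and the size of the short final batch follow from integer division. The helper below is a hypothetical standard-library sketch of that arithmetic for the 60,000 MNIST training samples; it mirrors the meaning of DataLoader's drop_last option but does not call PyTorch.

```python
import math

def batch_layout(num_samples, batch_size, drop_last=False):
    """Return (number of batches, size of the final batch)."""
    full, remainder = divmod(num_samples, batch_size)
    if remainder == 0 or drop_last:
        return full, batch_size
    return full + 1, remainder

print(batch_layout(60000, 128))                  # (469, 96)
print(batch_layout(60000, 128, drop_last=True))  # (468, 128)
assert batch_layout(60000, 128)[0] == math.ceil(60000 / 128)
```

So the loop above reports exactly one mismatch per pass: the 469th batch, which holds the remaining 96 samples.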

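Batch size is a knob that trades off training stability against speed. Timing one forward pass and loss computation at several batch sizes typically shows that the total time grows more slowly than the batch size, so the cost per sample falls as more samples are processed together.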

import time

# measure the loss and the forward-pass time for different batch sizes
손실 = {}
for 배치크기 in [1, 10, 100, 300, 600, 1000, 6000, 60000]:
    print(f'batch size: {배치크기}')
    # a separate name keeps the data_loader dict from earlier cells intact
    loader = DataLoader(mnist['train'], batch_size=배치크기, shuffle=False)
    X_batch, y_batch = next(iter(loader))
    start_time = time.perf_counter()
    outputs = model(X_batch.reshape(-1, 784))
    loss = 손실함수(outputs, y_batch)
    손실[배치크기] = {'loss': loss.item(), 'time (ms)': (time.perf_counter() - start_time) * 1000}

pd.DataFrame(손실).round(2)

12 Application: Fashion MNIST

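We now apply what we have learned to a slightly harder problem. Fashion MNIST has the same format as MNIST (28×28 grayscale images, 10 classes), but the images show clothing items such as shirts and shoes instead of digits, which makes classification more difficult.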

from torchvision.datasets import FashionMNIST
import torchvision.transforms as transforms

전처리 = transforms.ToTensor()

fashion_mnist = {}
for split in ['train', 'test']:
    fashion_mnist[split] = FashionMNIST(
        root='data', train=(split=='train'), download=True, transform=전처리)
    print(f'{split:<5}: {len(fashion_mnist[split]):,}')

display(pd.DataFrame({'label': FashionMNIST.classes}).T)
plt.figure(figsize=(10, 4))
for i, (sample, label) in zip(range(10), fashion_mnist['train']):
    # print(type(sample), sample.shape, sample.dtype)
    sample = sample.squeeze() # (1, 28, 28) -> (28, 28)
    # print(type(sample), sample.shape, sample.dtype)
    # show the image
    plt.subplot(2, 5, i + 1) # position i+1 in a grid of 2 rows and 5 columns
    plt.imshow(sample, cmap='gray')
    plt.title(f'label: {label}')

plt.tight_layout()
plt.show()

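First, as a baseline, we apply logistic regression rather than a neural network. We flatten each image into a 784-dimensional vector and train a linear classifier. Its accuracy (measured here on the training set) is the starting line that the neural networks below have to beat.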

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=10000)

# preprocessing: convert the data to NumPy arrays that scikit-learn accepts
X_train = fashion_mnist['train'].data.numpy()
X_train = X_train / 255.0 # scale pixel values to [0, 1]
# flatten
X_train = X_train.reshape(-1, 28*28) # (60000, 28, 28) -> (60000, 784)
print(X_train.shape, X_train.dtype, X_train.min(), X_train.max())
y_train = fashion_mnist['train'].targets.numpy()

model.fit(X_train, y_train)
scores = {'train': model.score(X_train, y_train)}
print(pd.Series(scores).round(4))

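Now we solve the same problem with a Keras network that stacks two hidden layers. A few things differ from the previous model.

- The output layer uses activation='softmax' directly, and the loss is SparseCategoricalCrossentropy() without from_logits. In other words, the softmax lives inside the model and the loss receives probabilities.
- The optimizer is now Adam.
- The test set is passed as validation_data, so the validation loss and accuracy are recorded at the end of every epoch.

The key benefit is that the learning curves let us check for overfitting.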

from keras import layers

model = keras.Sequential()
model.add(keras.Input(shape=(1, 28, 28)))
# flatten to a vector
model.add(layers.Flatten())
# hidden layers
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(256, activation='relu'))
# output layer
model.add(layers.Dense(10, activation='softmax'))

model.summary()

model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(),
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    metrics=['accuracy']
)

# batch loaders for training and validation
dataloaders = {}
for split in fashion_mnist.keys():
    dataloaders[split] = torch.utils.data.DataLoader(
        fashion_mnist[split], batch_size=128, shuffle=(split=='train'))
    print(f'{split:<5}: {len(dataloaders[split]):,} batches')

history = model.fit(
    dataloaders['train'], epochs=10, 
    validation_data=dataloaders['test']
)
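The two output conventions used in this chapter (logits with from_logits=True, or a softmax layer with a loss that expects probabilities) describe the same loss. The NumPy sketch below illustrates this and also shows what goes wrong if softmax is applied twice. The function names and logit values are assumptions for illustration, not Keras internals.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with max-subtraction for stability."""
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def ce_from_probs(probs, targets):
    """Mean cross-entropy when the model already outputs probabilities."""
    return -np.log(probs[np.arange(len(targets)), targets]).mean()

def ce_from_logits(logits, targets):
    """Mean cross-entropy when the model outputs raw logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[4.0, 1.0, 0.0],
                   [0.0, 2.0, 1.0]])  # illustrative values
targets = np.array([0, 1])

loss_logits = ce_from_logits(logits, targets)               # logits + from_logits=True
loss_probs = ce_from_probs(softmax(logits), targets)        # softmax layer + plain loss
loss_double = ce_from_probs(softmax(softmax(logits)), targets)  # softmax applied twice

assert np.isclose(loss_logits, loss_probs)
assert loss_double > loss_logits
print(f'{loss_logits:.3f} {loss_probs:.3f} {loss_double:.3f}')
```

The practical rule is to apply softmax exactly once: either in the model or in the loss, but not in both. The logits form is generally the more numerically robust of the two, because the logarithm is taken without first rounding small probabilities.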

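The history returned by fit holds the training and validation metrics for each epoch. If the training loss keeps falling while the validation loss starts to rise again, that is a sign of overfitting.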

results = pd.DataFrame(history.history)

plt.figure(figsize=(10, 4))
왼쪽 = plt.subplot(1, 2, 1) # position 1 in a grid of 1 row and 2 columns
results.plot(y=['loss', 'val_loss'], ax=왼쪽)
오른쪽 = plt.subplot(1, 2, 2) # position 2 in a grid of 1 row and 2 columns
results.plot(y=['accuracy', 'val_accuracy'], ax=오른쪽)
plt.tight_layout()
plt.show()

display(results.round(2))

test_results = model.evaluate(dataloaders['test'], return_dict=True)
print(pd.Series(test_results).round(4))
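One simple way to read the overfitting signal off the recorded metrics is to look for the epoch with the lowest validation loss. The sketch below does this on a dictionary shaped like history.history. The numbers are made up for illustration and are not results from this chapter, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

# Made-up values shaped like history.history (not measured results).
history_like = {
    'loss':     [0.52, 0.38, 0.33, 0.30, 0.27, 0.25],
    'val_loss': [0.45, 0.39, 0.36, 0.35, 0.36, 0.38],
}
epoch = best_epoch(history_like['val_loss'])
overfitting = history_like['val_loss'][-1] > min(history_like['val_loss'])
print(epoch, overfitting)  # 4 True
```

In this made-up run the training loss keeps falling through epoch 6, but the validation loss bottoms out at epoch 4, so the later epochs mostly fit the training set more tightly.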

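Finally, we check how a Keras model handles devices when the backend is PyTorch. The input batch lives on the CPU, but the Keras model detects an available GPU and runs the computation there, which shows up in the device of the output tensor.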

X_train, y_train = next(iter(dataloaders['train']))
print(f'input device: {X_train.device}')
# the Keras model detects and uses the GPU automatically
outputs = model(X_train)
print(f'output device: {outputs.device}') # cuda:0 when a GPU is available, otherwise cpu

13 Summary

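In this chapter we implemented the same classifier structure three ways, raising the level of abstraction each time.

| Stage | Model definition | Training |
|---|---|---|
| Logistic regression | linear scores + softmax | `fit` (scikit-learn) |
| PyTorch | `nn.Sequential` | hand-written training loop |
| Keras | `Sequential` | `compile` + `fit` |

The output side of all three shares the same skeleton: linear scores (logits) → softmax → cross-entropy. nn.Linear ( $X W^{\top} + b$ ) and keras.layers.Dense ( $X W + b$ ) are the same layer and differ only in the orientation of the weight matrix, and KERAS_BACKEND='torch' lets us combine Keras's concise declarations with PyTorch's familiar tensor API in one codebase. The conclusion of the chapter is to enjoy the convenience of a one-line fit while understanding the hand-written training loop it replaces.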