Module 4: Kubernetes Resource Monitoring

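1. Understanding the K8s Resource Model

Understanding how Kubernetes allocates resources to containers is central to running a stable service. If requests and limits are set incorrectly, the result can be unexpected failures such as CPU throttling, OOM kills, or evictions.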

requests vs limits

requests

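The minimum amount of resources reserved for a container. The scheduler places Pods based on this value.

• Basis for scheduling
• Guaranteed minimum resources
• Input to the QoS class

limits

The maximum amount of resources a container may use. Usage beyond the limit is restricted.

• CPU over the limit: throttling
• Memory over the limit: OOMKilled
• Enforces resource isolation between containers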

YAML example

resources:
  requests:
    cpu: "250m"      # 0.25 vCPU
    memory: "256Mi"  # 256 MiB
  limits:
    cpu: "500m"      # 0.5 vCPU
    memory: "512Mi"  # 512 MiB
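The quantity strings in the YAML above ("250m", "256Mi") can be converted to plain numbers before they are compared or graphed. The following is a minimal sketch, not the Kubernetes parser itself: the function names are hypothetical and only a subset of suffixes is handled.

```python
from decimal import Decimal

# Subset of memory suffixes: binary (Ki, Mi, ...) and decimal (k, M, ...).
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "250m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def parse_memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "256Mi" to bytes."""
    # Try two-character suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(_MEMORY_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            number = Decimal(quantity[: -len(suffix)])
            return int(number * _MEMORY_SUFFIXES[suffix])
    return int(quantity)  # no suffix: plain byte count


print(parse_cpu_millicores("250m"))   # 250
print(parse_memory_bytes("256Mi"))    # 268435456
```

Note that "Mi" (mebibytes, 2^20) and "M" (megabytes, 10^6) are different units, which is why the suffix table keeps them separate.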

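QoS Class comparison

QoS Class	Condition	Eviction priority	Suitable workloads
Guaranteed	requests = limits for CPU and memory (every container)	Lowest (evicted last)	Critical production services
Burstable	At least one request or limit is set, but the Guaranteed condition is not met (typically requests < limits)	Medium	General services
BestEffort	No requests or limits set	Highest (evicted first)	Batch jobs, tests

Important: Always set requests and limits for production services. BestEffort Pods are the first to be evicted when a node comes under resource pressure.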
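The QoS rules can be expressed as a small classifier. This is a sketch under stated assumptions: the function name and the dict layout are hypothetical, and quantities are compared as strings, so they are assumed to be written in the same notation (for example "500m" on both sides, not "0.5").

```python
def qos_class(containers: list) -> str:
    """Classify a Pod from its containers' resources.

    Each container is a dict with optional "requests" and "limits" dicts,
    keyed by "cpu" and "memory".
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # If only a limit is given, Kubernetes defaults the request to it.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


example = [{
    "requests": {"cpu": "250m", "memory": "256Mi"},
    "limits": {"cpu": "500m", "memory": "512Mi"},
}]
print(qos_class(example))  # Burstable
```

The container from the YAML example is Burstable because its requests are lower than its limits. Setting both to the same values would make it Guaranteed.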

2. Pod-Level Metrics

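CPU metrics

Key metrics

• kubernetes.cpu.usage.total: actual CPU usage (Datadog K8s Integration)
• CPU throttling: when a container reaches its CPU limit, the Linux CFS (Completely Fair Scheduler) restricts its CPU time
• CPU utilization: actual usage / requests × 100%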

Datadog query example

avg:kubernetes.cpu.usage.total{pod_name:myapp-xxx}

Caution: CPU throttling is caused by CPU limits, and it is often a hidden cause of response latency.
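The utilization formula is easiest to read with numbers. A minimal sketch, assuming usage has already been converted to millicores (the helper name and the 490m sample are illustrative):

```python
def utilization_pct(usage_millicores: int, reference_millicores: int) -> float:
    """Return usage as a percentage of a reference value (requests or limits)."""
    if reference_millicores <= 0:
        raise ValueError("reference must be positive")
    # Multiply first so that exact integer ratios stay exact.
    return usage_millicores * 100 / reference_millicores


usage, request, limit = 490, 250, 500  # millicores; usage is a sample value
print(utilization_pct(usage, request))  # 196.0
print(utilization_pct(usage, limit))    # 98.0
```

Utilization above 100% of requests is normal for a Burstable Pod. The figure to watch for throttling is how close usage is to the limit.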

Memory metrics

Key metrics

• kubernetes.memory.working_set: memory in active use, excluding inactive file cache (Datadog K8s Integration)
• kubernetes.memory.rss: RSS (Resident Set Size) (Datadog K8s Integration)
• OOMKilled: when a container exceeds its memory limit, the kernel kills the process

OOMKilled: causes and response

Symptoms

• Pod status = OOMKilled
• CrashLoopBackOff

How to check

kubectl describe pod <name>

Last State: Terminated, Reason: OOMKilled

Causes

• The memory limit is lower than what the workload actually needs
• Memory leak
• Processing large volumes of data

Response order

1) Raise the limit
2) Profile memory usage
3) Fix the code

OOMKilled debugging flowchart

Pod OOMKilled
  │
  ├─ Recurring? ──Yes──▶ Suspect a memory leak
  │                      → Analyze a heap dump
  │                      → Attach a profiler
  │
  └─ First occurrence? ──Yes──▶ Transient load?
                                ├─Yes─▶ Raise limits (20-30%)
                                └─No──▶ Check data size
                                        → Tune batch size
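The flowchart above can be sketched as a small decision helper. The function names are hypothetical, and the 25% default is simply a value inside the 20-30% range from the flowchart.

```python
import math


def oom_next_steps(recurring: bool, transient_load: bool = False) -> list:
    """Return the suggested next steps for an OOMKilled Pod."""
    if recurring:
        return ["suspect a memory leak", "analyze a heap dump", "attach a profiler"]
    if transient_load:
        return ["raise the memory limit by 20-30%"]
    return ["check the data size", "tune the batch size"]


def raised_limit_mib(current_mib: int, increase: float = 0.25) -> int:
    """Return the memory limit after a fractional increase, rounded up to whole MiB."""
    return math.ceil(current_mib * (1 + increase))


print(oom_next_steps(recurring=False, transient_load=True))
print(raised_limit_mib(512))  # 640
```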

3. Node-Level Metrics

Node Conditions

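Condition	Meaning	Trigger
MemoryPressure	Node is low on memory	available memory < eviction threshold
DiskPressure	Node is low on disk	available disk < eviction threshold (kubelet defaults on Linux: 10% for the node filesystem, 15% for the image filesystem)
PIDPressure	Process count near the limit	PID count > threshold
Ready	Node is operating normally	kubelet is healthy

Behavior under node pressure

1. The kubelet starts evicting Pods, roughly in QoS order
2. BestEffort → Burstable → Guaranteed
3. New Pods are no longer scheduled onto the node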

Allocatable resources

Total node resources ≠ resources available to Pods
Allocatable = Capacity - System Reserved - Kube Reserved - Eviction Threshold

Example kubectl describe node output

Capacity:
  cpu:    4
  memory: 16Gi
Allocatable:
  cpu:    3920m
  memory: 14.5Gi

Tip: Check Allocatable with kubectl describe node. It can differ significantly from Capacity.
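The Allocatable arithmetic can be reproduced with a few lines. This is a sketch only: the reservation values below are hypothetical splits chosen to match the sample output (the real values come from the kubelet configuration), and the hard eviction threshold is included as a further deduction the kubelet applies.

```python
def allocatable(capacity: int, system_reserved: int = 0,
                kube_reserved: int = 0, eviction_threshold: int = 0) -> int:
    """Return what is left for Pods after node-level reservations."""
    remaining = capacity - system_reserved - kube_reserved - eviction_threshold
    return max(remaining, 0)


# CPU in millicores: 4 cores capacity, 80m reserved in total (hypothetical split).
print(allocatable(4000, system_reserved=40, kube_reserved=40))  # 3920

# Memory in MiB: 16Gi capacity, 1.5Gi reserved in total (hypothetical split).
memory_mib = allocatable(16 * 1024, system_reserved=1024,
                         kube_reserved=412, eviction_threshold=100)
print(memory_mib / 1024)  # 14.5
```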

4. Understanding HPA Behavior

How the HPA (Horizontal Pod Autoscaler) works

desired replicas = ceil(current replicas × (current metric / target metric))

Example:

CPU utilization 80%, target 50%, 3 replicas → ceil(3 × 80/50) = ceil(4.8) = 5 replicas
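The replica calculation can be sketched in a few lines. The function name is hypothetical. The result is rounded up, clamped to the minReplicas/maxReplicas range, and left unchanged when the ratio is within the tolerance band (0.1 is the default HPA tolerance).

```python
import math
from typing import Optional


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: Optional[int] = None,
                     tolerance: float = 0.1) -> int:
    """Compute the replica count the HPA formula asks for."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    desired = max(desired, min_replicas)
    if max_replicas is not None:
        desired = min(desired, max_replicas)
    return desired


print(desired_replicas(3, 80, 50, min_replicas=2, max_replicas=10))  # 5
```

With 8 replicas at 100% against a 50% target, the formula asks for 16 replicas, but the clamp holds the result at a maxReplicas of 10. That is the situation in which scale-out stops helping.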

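HPA YAML example

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp                # hypothetical name
spec:
  scaleTargetRef:            # workload to scale (assumed: a Deployment named myapp)
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # percent of the CPU request, averaged across Pods
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down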

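When HPA scale-out does not reduce latency

1. Cause: Pods take a long time to start (cold start)
   → Check the readinessProbe

2. Cause: The bottleneck is the DB or an external service, not the number of Pods
   → Check backend metrics

3. Cause: maxReplicas has been reached
   → No further scale-out is possible

4. Cause: Pods are stuck in Pending because nodes lack resources
   → Cluster Autoscaler is needed

Key point: HPA does not hide the problem; it buys time. The root cause still has to be found separately.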

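5. Core Monitoring Tools

Role of each tool

Tool	Role	Key metrics
kubectl top	Current Pod/Node resource usage	CPU, Memory (point-in-time values)
metrics-server	Cluster metrics API	Data source for kubectl top
kube-state-metrics	State of K8s objects	Pod status, Deployment replicas, HPA
Prometheus + Grafana	Time-series collection + visualization	All metrics (long-term retention)
Datadog K8s Integration	Unified monitoring	All of the above + APM/RUM integration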

kubectl command reference

# Pod CPU/Memory usage (current)

kubectl top pods -n production

# Node resource usage

kubectl top nodes

# Pod details (events, status)

kubectl describe pod myapp-abc123 -n production

# Recent events (OOMKill, Eviction, etc.)

kubectl get events -n production --sort-by='.lastTimestamp' | tail -20

# Check HPA status

kubectl get hpa -n production
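When the output of kubectl top pods has to be checked in a script, it can be parsed as whitespace-separated columns. A minimal sketch: the sample text below is hypothetical, the function name is illustrative, and memory is assumed to be reported in Mi.

```python
def parse_top_pods(output: str) -> dict:
    """Parse `kubectl top pods` text into {pod: {"cpu_m": int, "memory_mi": int}}."""
    usage = {}
    lines = output.strip().splitlines()
    for line in lines[1:]:  # the first line is the column header
        name, cpu, memory = line.split()[:3]
        # CPU is normally printed in millicores ("490m"); treat a bare number as cores.
        cpu_m = int(cpu[:-1]) if cpu.endswith("m") else int(cpu) * 1000
        usage[name] = {"cpu_m": cpu_m, "memory_mi": int(memory.removesuffix("Mi"))}
    return usage


# Hypothetical sample of `kubectl top pods -n production` output.
SAMPLE = """\
NAME           CPU(cores)   MEMORY(bytes)
myapp-abc123   490m         300Mi
myapp-def456   120m         210Mi
"""

for pod, values in parse_top_pods(SAMPLE).items():
    print(pod, values["cpu_m"], values["memory_mi"])
```

The sketch uses str.removesuffix, which requires Python 3.9 or later.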

6. Hands-On: Service Latency → K8s Resource Diagnosis Walkthrough

Scenario: An alert reports increased API response time

1. Check resource usage with kubectl top pods

kubectl top pods -n production

Result: the myapp Pod is using 490m CPU against a 500m limit (98%)

2. Check for CPU throttling (Datadog query)

avg:kubernetes.cpu.cfs.throttled.seconds{pod_name:myapp-xxx} > 0

Result: throttling is occurring

3. Check the Pod's resources

resources:
  requests:
    cpu: "250m"
  limits:
    cpu: "500m"

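4. Raise the CPU limit to 1000m

Result: CPU throttling is resolved

5. Check the HPA

kubectl get hpa -n production

Result: currently 8/10 replicas (close to maxReplicas)

6. Raise maxReplicas and check the Cluster Autoscaler

Raise maxReplicas to 15

7. Investigate the root cause

Processing time has increased for a specific API → code optimization is needed

Key flow: diagnose progressively deeper, in the order kubectl top → describe → Datadog Metrics Query.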
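The checks in this walkthrough can be collected into one triage helper. This is a sketch under stated assumptions: the function name is hypothetical, and the 90% and 80% thresholds are illustrative values, not figures from the course.

```python
def triage(cpu_usage_m: int, cpu_limit_m: int, throttled_seconds: float,
           current_replicas: int, max_replicas: int) -> list:
    """Return findings for a latency alert, in the order the walkthrough checks them."""
    findings = []
    # Steps 1 and 3: usage close to the CPU limit (assumed threshold: 90%).
    if cpu_usage_m * 100 >= cpu_limit_m * 90:
        findings.append("CPU usage is close to the limit")
    # Step 2: any throttled time means the limit is being enforced.
    if throttled_seconds > 0:
        findings.append("CPU throttling is occurring")
    # Step 5: little scale-out headroom left (assumed threshold: 80% of maxReplicas).
    if current_replicas * 100 >= max_replicas * 80:
        findings.append("HPA is close to maxReplicas")
    return findings


# Values from the scenario: 490m of a 500m limit, throttling, 8/10 replicas.
for finding in triage(490, 500, 1.5, 8, 10):
    print(finding)
```

An empty result does not prove that the service is healthy. It only means that these three resource signals are not the cause, which points the investigation toward backend dependencies or application code.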

Quiz

Check what you have learned. Answer all 5 questions.