[Retry, k8s] 34. 헬스 체크, 컨테이너 라이프사이클 - 헬스 체크

Date: 2023.11.11 Updated: 2023.11.11

카테고리: KUBERNETES

01. 헬스체크

(1) 헬스 체크 방법

Probe 종류	역할
Liveness Probe	파드 내부의 컨테이너가 정상 동작 중인지 확인 (실패 시, 컨테이너 재기동)
Readiness Probe	파드가 요청을 받아들일 수 있는지 확인 (실패 시, 트래픽 차단하고 파드를 재기동하지 않음)
Startup Probe	파드의 첫 번째 기동이 완료되었는지 확인 (실패 시, 다른 Probe 실행을 시작하지 않음)

‘Liveness Probe’는 프로세스의 장기간 운영 중 발생할 수 있는 문제, 예를 들어 메모리 누수 등으로 인해 응답하지 않게 된 경우를 대비하여 사용된다. 이를 통해 문제가 발생한 컨테이너를 재시작하여 복구할 수 있다.

‘Readiness Probe’는 백엔드 데이터베이스 접속, 캐시 로딩, 또는 기동 시간이 긴 프로세스의 기동 완료 등을 체크한다. 파드 기동 후에 이 Probe가 실패하면, 해당 파드는 서비스 트래픽을 받지 않게 되지만 ‘Liveness Probe’와는 달리 문제 발생 시 컨테이너를 재기동하지 않는다. (정상적으로 요청을 처리하는 수를 제어하기 위해 ‘Readiness Probe’를 실패시키는 방법도 사용할 수 있다.)

sssss

로드밸런서 타입의 서비스에서는 로드밸런서가 쿠버네티스 노드의 헬스 체크를 ICMP를 통해 간단하게만 수행하므로, 파드 상태를 체크하고 트래픽을 제어하려면 ‘Readiness Probe’나 ‘Liveness Probe’를 설정해야 한다.

초기 기동 시간이 긴 경우, ‘Liveness Probe’를 사용하려면 체크 시작 시간을 연장하거나 실패 판단 시간을 연장하면 문제를 완화할 수 있지만, 이런 방식은 초기 기동이 긴 경우 재기동을 반복하거나, 장애 발생 시 즉각적인 실패 판단이 어려운 문제를 야기할 수 있다.

이러한 문제를 해결하기 위해 ‘Startup Probe’가 도입되었으며, ‘Startup Probe’는 파드의 초기 기동 완료 여부를 체크하는 헬스 체크로, 이 헬스 체크가 끝나기 전에는 ‘Liveness Probe’나 ‘Readiness Probe’가 시작되지 않으며, 파드가 서비스되거나 정지되지 않는다.

dadads

(2) 헬스 체크 방식

헬스 체크 방식	역할
exec	명령어를 실행하고 종료 코드가 0이 아니면 실패
httpGet	HTTP GET 요청을 실행하고 Status Code가 200~399가 아니면 실패
tcpSocket	TCP 세션이 연결되지 않으면 실패

# exec - 명령어 기반 체크
# 명령어는 컨테이너별로 실행되어 명령어에서 사용하는 파일 등도 컨테이너별로 필요하다.

livenessProbe:
    exec:
      command: ["test", "-e", "/ok.txt"]

# httpGet - HTTP 기반의 체크
# GET 요청은 kubelet에서 이루어 지며, 컨테이너에는 kubelet이 가진 파드용 네트워크 인터페이스 IP 주소로 HTTP GET 요청이 오게 된다.

livenessProbe:
    httpGet:
      path: /health
      port: 80
      scheme: HTTP
      host: web.example.com
      httpHeaders:
        - name: Authorization
          value: Bearer TOKEN

# tcpSocket - TCP 기반의 체크
# TCP 세션 활성화를 검증

livenessProbe:
    tcpSocket:
      prot: 80

(3) 헬스 체크 간격

설정 항목	내용
initialDelaySeconds	첫 번째 헬스 체크 시작까지의 지연(최대 failureThreshold만큼 연장)
periodSeconds	헬스 체크 간격 시간(초)
timeoutSeconds	타임아웃까지의 시간(초)
successThreshold	성공이라고 판단하기까지의 체크 횟수
failureThreshold	실패라고 판단하기까지의 체크 횟수

(4) 헬스 체크 생성

Liveness Probe로 http://localhost:80/index.html에 HTTP GET 요청을 체크하고, Readiness Probe로 /usr/share/nginx/html/50x.html 파일이 있는지 확인한다.

# sample-healthcheck.yaml

apiVersion: v1
kind: Pod
metadata:
    name: sample-healthcheck
spec:
    containers:
    - name: nginx-container
      image: nginx:1.16
      livenessProbe:
        httpGet:
            path: /index.html
            port: 80
            scheme: HTTP
        timeoutSeconds: 1
        successThreshold: 1
        failureThreshold: 2
        initialDelaySeconds: 5
        periodSeconds: 3
      readinessProbe:
        exec:
            command: ["ls", "/usr/share/nginx/html/50x.html"]
        timeoutSeconds: 1
        successThreshold: 2
        failureThreshold: 1
        initialDelaySeconds: 5
        periodSeconds: 3

# 파드에 설정된 Probe 확인
$ kubectl describe pod sample-healthcheck | egrep -E "Liveness|Readiness"
    Liveness:       http-get http://:80/index.html delay=5s timeout=1s period=3s #success=1 #failure=2
    Readiness:      exec [ls /usr/share/nginx/html/50x.html] delay=5s timeout=1s period=3s #success=2 #failure=1

(5) Liveness Probe 실패

HTTP GET 요청에 대해 200~399 Status Code가 반환되는지 체크한다.

# sample-liveness.yaml

apiVersion: v1
kind: Pod
metadata:
    name: sample-liveness
spec:
    containers:
    - name: nginx-container
      image: nginx:1.16
      livenessProbe:
        httpGet:
            path: /index.html
            port: 80
            scheme: HTTP
        timeoutSeconds: 1
        successThreshold: 1
        failureThreshold: 2
        initialDelaySeconds: 5
        periodSeconds: 3

# 체크가 실패하도록 index.html을 삭제하고 404를 반환하게 하여 동작을 확인한다.
$ kubectl exec -it sample-liveness -- rm -f /usr/share/nginx/html/index.html


# 다른 터미널로 --watch 옵션을 사용하여 파드 상태 변화를 확인
# Liveness Probe가 실패한 경우 파드가 컨테이너를 재시작한다. (RESTARTS 카운트가 증가한 것을 확인할 수 있다.)
$ kubectl get pods sample-liveness --watch
NAME              READY   STATUS              RESTARTS   AGE
sample-liveness   0/1     ContainerCreating   0          1s
sample-liveness   1/1     Running             0          2s
sample-liveness   1/1     Running             1 (0s ago)   13s # 직전에 Liveness Probe 실패


# Liveness Probe 이력 확인
$ kubectl describe pod sample-liveness

...~(생략)~

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  2m25s                  default-scheduler  Successfully assigned default/sample-liveness to k8s-worker02
  Normal   Pulled     2m12s (x2 over 2m24s)  kubelet            Container image "nginx:1.16" already present on machine
  Normal   Created    2m12s (x2 over 2m24s)  kubelet            Created container nginx-container
  Normal   Started    2m12s (x2 over 2m24s)  kubelet            Started container nginx-container
  Warning  Unhealthy  2m12s (x2 over 2m15s)  kubelet            Liveness probe failed: HTTP probe failed with statuscode: 404
  Normal   Killing    2m12s                  kubelet            Container nginx-container failed liveness probe, will be restarted

(6) Readiness Probe 실패

/usr/share/nginx/html/50x.html 파일이 있는지 명령어 기반으로 체크

# sample-readiness.yaml

apiVersion: v1
kind: Pod
metadata:
    name: sample-liveness
spec:
    containers:
    - name: nginx-container
      image: nginx:1.16
      readinessProbe:
        exec:
            command: ["ls", "/usr/share/nginx/html/50x.html"]
        timeoutSeconds: 1
        successThreshold: 2
        failureThreshold: 1
        initialDelaySeconds: 5
        periodSeconds: 3

# Readiness Probe에서 감시하고 있는 파일 삭제(Probe가 실패하도록 한다.)
$ kubectl exec -it sample-readiness -- rm -f /usr/share/nginx/html/50x.html


# Readiness Probe에서 감시하고 있는 파일 생성(Probe가 성공하도록 한다.)
$ kubectl exec -it sample-readiness -- touch /usr/share/nginx/html/50x.html


# Readiness Probe가 실패했을 때의 파드 상태
$ kubectl get pods sample-readiness --watch
NAME               READY   STATUS              RESTARTS   AGE
sample-readiness   0/1     ContainerCreating   0          1s
sample-readiness   0/1     Running             0          2s
sample-readiness   1/1     Running             0          9s
sample-readiness   0/1     Running             0          24s  # 직전에 Readiness Probe 실패
sample-readiness   1/1     Running             0          36s  # 직전에 Readiness Probe 성공


# Readiness Probe가 실패했을 때 파드 이력 정보
$ kubectl describe pod sample-readiness

...~(생략)~

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  3m58s                  default-scheduler  Successfully assigned default/sample-readiness to k8s-worker02
  Normal   Pulled     3m57s                  kubelet            Container image "nginx:1.16" already present on machine
  Normal   Created    3m57s                  kubelet            Created container nginx-container
  Normal   Started    3m57s                  kubelet            Started container nginx-container
  Warning  Unhealthy  3m28s (x4 over 3m34s)  kubelet            Readiness probe failed: ls: cannot access '/usr/share/nginx/html/50x.html': No such file or directory

(7) 추가 Ready 조건을 추가하는 파드 ready++(ReadinessGate)

‘파드 ready++’는 특정 상황에서 파드의 Ready 상태를 추가로 확인하는 기능으로, 이는 클라우드 외부 로드밸런서와의 연계에 시간이 필요한 경우나, 단순히 파드의 Ready 판단만으로는 롤링 업데이트 시 기존 파드의 안전한 관리가 어려운 경우에 사용한다.

# sample-readinessgate.yaml

---
apiVersion: v1
kind: Pod
metadata:
    name: sample-readinessgate
    labels: 
        app: sample-readinessgate
spec:
    readinessGates:
    - conditionType: "amsy.dev/sample-condition"
    containers:
    - name: nginx-container
      image: nginx:1.16
---
apiVersion: v1
kind: Service
metadata:
    name: sample-readinessgate
spec:
    type: ClusterIP
    ports:
    - name: "http-port"
      protocol: "TCP"
      port: 8080
      targetPort: 80
    selector:
      app: sample-readinessgate

ReadinessGate가 설정된 파드를 기동하면 READINESS GATES 항목에서 추가 상태가 표시된다. 기동하고 있는 상태에서도 서비스에 전송 대상이 등록되어 있지만, API 서버에 직접 상태 변경을 요청하고 정보를 업데이트하면 서비스에 전송 대상이 등록된 것을 확인할 수 있다.

# 파드의 ReadinessGate 상태를 확인
$ kubectl get pod sample-readinessgate -o wide
NAME                   READY   STATUS    RESTARTS   AGE     IP          NODE           NOMINATED NODE   READINESS GATES
sample-readinessgate   1/1     Running   0          3m32s   10.46.0.1   k8s-worker02   <none>           0/1


# 서비스 전송 대상 확인
$ kubectl describe service sample-readinessgate
...(생략)...
TargetPort:        80/TCP
Endpoints:
Session Affinity:  None
Events:            <none>


# API 서버에 프록시를 로컬로 생성
$ kubectl proxy


# ReadinessGate 상태를 업데이트
$ curl -X PATCH -H "Content-Type: application/json-patch+json" \
-d '[{"op": "add", "path": "/status/conditions/-", "value": {"type": "amsy.dev/sample-condition", "status": "True"}}]' \
http://localhost:8001/api/v1/namespaces/default/pods/sample-readinessgate/status



# ReadinessGate 상태 확인
$ kubectl get pod sample-readinessgate -o wide
NAME                   READY   STATUS    RESTARTS   AGE     IP          NODE           NOMINATED NODE   READINESS GATES
sample-readinessgate   1/1     Running   0          7m12s   10.46.0.1   k8s-worker02   <none>           1/1

$ kubectl describe service sample-readinessgate
...(생략)...
TargetPort:        80/TCP
Endpoints:         10.46.0.1:80
Session Affinity:  None
Events:            <none>

(8) Readiness Probe를 무시한 서비스 생성

‘Readiness Probe’가 실패하는 경우에는 서비스(엔드포인트)에서 제외된 상태가 된다. 반면 스테이트풀셋에서 Headless 서비스를 사용할 때는 파드가 Ready 상태가 되지 않아도 클러스터를 구성하기 위해 각 파드의 이름 해석이 필요한 경우가 있다. (‘Readiness Probe’가 실패한 경우에도 서비스에 연결되게 하려면 spec.publishNotReadyAddresses를 true로 설정한다.)

# sample-publish-notready.yaml

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
    name: sample-publish-notready
spec:
    serviceName: sample-publish-notready
    replicas: 3
    podManagementPolicy: Parallel
    selector:
        matchLabels:
            app: publish-notready
    template:
        metadata:
            labels:
                app: publish-notready
        spec:
            containers:
            - name: nginx-container
              image: amsy810/echo-nginx:v2.0
              readinessProbe:
                exec:
                    command: ["sh", "-c", "exit 1"]
---
apiVersion: v1
kind: Service
metadata:
    name: sample-publish-notready
spec:
    type: ClusterIP
    publishNotReadyAddresses: true
    ports:
    - name: "http-port"
      protocol: "TCP"
      port: 8080
      targetPort: 80
    selector:
      app: publish-notready

publishNotReadyAddresses가 True로 된 경우 파드가 Ready 상태가 아니어도 이름 해석이 가능하다. 이는 로드밸런서에도 포함되어 버리기 때문에 특별한 경우에만 사용하자.

$ kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP          NODE           NOMINATED NODE   READINESS GATES
sample-publish-notready-0   0/1     Running   0          14s   10.46.0.2   k8s-worker02   <none>           <none>
sample-publish-notready-1   0/1     Running   0          14s   10.40.0.1   k8s-worker01   <none>           <none>
sample-publish-notready-2   0/1     Running   0          14s   10.46.0.1   k8s-worker02   <none>           <none>


# 서비스와 연결된 엔드포인트 목록 확인
$ kubectl get endpoints sample-publish-notready
NAME                      ENDPOINTS                                AGE
sample-publish-notready   10.40.0.1:80,10.46.0.1:80,10.46.0.2:80   3m36s


# 서비스 이름 해석 확인
$ kubectl run --image=amsy810/tools:v2.0 --restart=Never --rm -i testpod \
-- dig sample-publish-notready-0.sample-publish-notready.default.svc.cluster.local

...(생략)...

;; QUESTION SECTION:
;sample-publish-notready-0.sample-publish-notready.default.svc.cluster.local. IN        A

;; ANSWER SECTION:
sample-publish-notready-0.sample-publish-notready.default.svc.cluster.local. 30 IN A 10.46.0.2


# publishNotReadyAddresses 비활성화
$ kubectl patch service sample-publish-notready --patch '{"spec": {"publishNotReadyAddresses": false}}'
service/sample-publish-notready patched


# 서비스에 연결된 엔드포인트 목록 확인
$ kubectl get endpoints sample-publish-notready
NAME                      ENDPOINTS   AGE
sample-publish-notready               8m27s


# 서비스 이름 해석 확인
$ kubectl run --image=amsy810/tools:v2.0 --restart=Never --rm -i testpod \
-- dig sample-publish-notready-0.sample-publish-notready.default.svc.cluster.local

...(생략)...

;; QUESTION SECTION:
;sample-publish-notready-0.sample-publish-notready.default.svc.cluster.local. IN        A

;; AUTHORITY SECTION:
cluster.local.          30      IN      SOA     ns.dns.cluster.local. hostmaster.cluster.local. 1704292651 7200 1800 86400 30

(9) Startup Probe를 사용한 지연 체크와 실패

/root/startup 파일이 존재하는지를 명령어 기반으로 체크하고, 명령어 실행 시간과 어느 Probe가 실행되었는지가 /root/log 파일에 기록되어 있다.

# sample-startup.yaml

apiVersion: v1
kind: Pod
metadata:
    name: sample-startup
spec:
    containers:
    - name: tools-container
      image: amsy810/tools:v2.0
      livenessProbe:
        exec:
            command: ["sh", "-c", "echo [$(date)] liveness >> /root/log; test ! -e /root/liveness"]
        periodSeconds: 3
      readinessProbe:
        exec:
            command: ["sh", "-c", "echo [$(date)] readiness >> /root/log; test ! -e /root/readiness"]
        periodSeconds: 3
      startupProbe:
        exec:
            command: ["sh", "-c", "echo [$(date)] startup >> /root/log; test ! -e /root/startup"]
        failureThreshold: 100
        periodSeconds: 3

파드 상태 확인

# Startup Probe에서 감시하고 있는 파일 생성(Probe 성공)
$ {
	kubectl apply -f sample-startup.yaml;
	sleep 3;
	kubectl exec -it sample-startup -- touch /root/startup;
}


# Startup Probe가 성공한 상태
$ kubectl get pod sample-startup --watch
NAME             READY   STATUS              RESTARTS   AGE
sample-startup   0/1     ContainerCreating   0          1s
sample-startup   0/1     Running             0          2s      # Startup Probe 첫 번째[실패] (기동 완료)
sample-startup   0/1     Running             1 (1s ago)   52s   # Startup Probe 두 번째[실패] (표시 안됨)
sample-startup   0/1     Running             1 (5s ago)   56s   # Startup Probe 세 번째[성공] (직전에 /root/startup 생성)
sample-startup   1/1     Running             1 (6s ago)   57s   # 첫 회 헬스 체크 완료


$ kubectl exec -it sample-startup -- head /root/log
[Wed Jan 3 17:03:46 UTC 2024] startup       # Startup Probe 첫 번째
[Wed Jan 3 17:03:51 UTC 2024] startup       # Startup Probe 두 번째
[Wed Jan 3 17:03:56 UTC 2024] startup       # Startup Probe 세 번째 (성공)
[Wed Jan 3 17:04:07 UTC 2024] readiness
[Wed Jan 3 17:04:11 UTC 2024] readiness
[Wed Jan 3 17:04:11 UTC 2024] liveness
[Wed Jan 3 17:04:16 UTC 2024] liveness

# sample-startup-shortfail.yaml

#> startup Probe 경우도 Liveness Probe, Readiness Probe와 마찬가지로 파라미터를 사용할 수 있음

apiVersion: v1
kind: Pod
metadata:
    name: sample-startup-shortfail
spec:
    containers:
    - name: tools-container
      image: amsy810/tools:v2.0
      readinessProbe:
        exec:
            command: ["sh", "-c", "echo [$(date)] readiness >> /root/log; test ! -e /root/readiness"]
        periodSeconds: 3
      startupProbe:
        exec:
            command: ["sh", "-c", "echo [$(date)] startup >> /root/log; test ! -e /root/startup"]
        failureThreshold: 100
        periodSeconds: 3
        initialDelaySeconds: 5
        periodSeconds: 3

revenge1005 🧗

[Retry, k8s] 34. 헬스 체크, 컨테이너 라이프사이클 - 헬스 체크

01. 헬스체크

(1) 헬스 체크 방법

(2) 헬스 체크 방식

(3) 헬스 체크 간격

(4) 헬스 체크 생성

(5) Liveness Probe 실패

(6) Readiness Probe 실패

(7) 추가 Ready 조건을 추가하는 파드 ready++(ReadinessGate)

(8) Readiness Probe를 무시한 서비스 생성

(9) Startup Probe를 사용한 지연 체크와 실패

KUBERNETES 카테고리 내 다른 글 보러가기

댓글 남기기

최근 글 10 개 :)