
EKS suddenly failing due to disk pressure

We have an EKS cluster with two t3.small nodes, each with 20Gi of ephemeral storage. At the moment the cluster runs only two small Node.js applications (node:12-alpine).

This worked fine for several weeks, and now we are suddenly getting disk pressure errors.

$ kubectl describe nodes
Name:               ip-192-168-101-158.ap-southeast-1.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Sun, 31 Mar 2019 17:14:58 +0800
Taints:             node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable:      false
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Sun, 12 May 2019 12:22:47 +0800   Sun, 31 Mar 2019 17:14:58 +0800   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Sun, 12 May 2019 12:22:47 +0800   Sun, 31 Mar 2019 17:14:58 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     True    Sun, 12 May 2019 12:22:47 +0800   Sun, 12 May 2019 06:51:38 +0800   KubeletHasDiskPressure       kubelet has disk pressure
  PIDPressure      False   Sun, 12 May 2019 12:22:47 +0800   Sun, 31 Mar 2019 17:14:58 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sun, 12 May 2019 12:22:47 +0800   Sun, 31 Mar 2019 17:15:31 +0800   KubeletReady                 kubelet is posting ready status
  InternalDNS:  ip-192-168-101-158.ap-southeast-1.compute.internal
  ExternalDNS:  ec2-54-169-250-255.ap-southeast-1.compute.amazonaws.com
  Hostname:     ip-192-168-101-158.ap-southeast-1.compute.internal
Capacity:
 attachable-volumes-aws-ebs:  25
 cpu:                         2
 ephemeral-storage:           20959212Ki
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      2002320Ki
 pods:                        11
Allocatable:
 attachable-volumes-aws-ebs:  25
 cpu:                         2
 ephemeral-storage:           19316009748
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      1899920Ki
 pods:                        11
System Info:
 Machine ID:                 ec2aa2ecfbbbdd798e2da086fc04afb6
 System UUID:                EC2AA2EC-FBBB-DD79-8E2D-A086FC04AFB6
 Boot ID:                    62c5eb9d-5f19-4558-8883-2da48ab1969c
 Kernel Version:             4.14.106-97.85.amzn2.x86_64
 OS Image:                   Amazon Linux 2
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://18.6.1
 Kubelet Version:            v1.12.7
 Kube-Proxy Version:         v1.12.7
ProviderID:                  aws:///ap-southeast-1a/i-0a38342b60238d83e
Non-terminated Pods:         (0 in total)
  Namespace                  Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests  Limits
  --------                    --------  ------
  cpu                         0 (0%)    0 (0%)
  memory                      0 (0%)    0 (0%)
  ephemeral-storage           0 (0%)    0 (0%)
  attachable-volumes-aws-ebs  0         0
  Type     Reason                Age                    From                                                         Message
  ----     ------                ----                   ----                                                         -------
  Warning  ImageGCFailed         5m15s (x333 over 40h)  kubelet, ip-192-168-101-158.ap-southeast-1.compute.internal  (combined from similar events): failed to garbage collect required amount of images. Wanted to free 1423169945 bytes, but freed 0 bytes
  Warning  EvictionThresholdMet  17s (x2809 over 3d4h)  kubelet, ip-192-168-101-158.ap-southeast-1.compute.internal  Attempting to reclaim ephemeral-storage

Name:               ip-192-168-197-198.ap-southeast-1.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Sun, 31 Mar 2019 17:15:02 +0800
Taints:             node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable:      false
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Sun, 12 May 2019 12:22:42 +0800   Thu, 09 May 2019 06:50:56 +0800   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Sun, 12 May 2019 12:22:42 +0800   Thu, 09 May 2019 06:50:56 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     True    Sun, 12 May 2019 12:22:42 +0800   Sat, 11 May 2019 21:53:44 +0800   KubeletHasDiskPressure       kubelet has disk pressure
  PIDPressure      False   Sun, 12 May 2019 12:22:42 +0800   Sun, 31 Mar 2019 17:15:02 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sun, 12 May 2019 12:22:42 +0800   Thu, 09 May 2019 06:50:56 +0800   KubeletReady                 kubelet is posting ready status
  InternalDNS:  ip-192-168-197-198.ap-southeast-1.compute.internal
  ExternalDNS:  ec2-13-229-138-38.ap-southeast-1.compute.amazonaws.com
  Hostname:     ip-192-168-197-198.ap-southeast-1.compute.internal
Capacity:
 attachable-volumes-aws-ebs:  25
 cpu:                         2
 ephemeral-storage:           20959212Ki
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      2002320Ki
 pods:                        11
Allocatable:
 attachable-volumes-aws-ebs:  25
 cpu:                         2
 ephemeral-storage:           19316009748
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      1899920Ki
 pods:                        11
System Info:
 Machine ID:                 ec27ee0765e86a14ed63d771073e63fb
 System UUID:                EC27EE07-65E8-6A14-ED63-D771073E63FB
 Boot ID:                    7869a0ee-dc2f-4082-ae3f-42c5231ab0e3
 Kernel Version:             4.14.106-97.85.amzn2.x86_64
 OS Image:                   Amazon Linux 2
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://18.6.1
 Kubelet Version:            v1.12.7
 Kube-Proxy Version:         v1.12.7
ProviderID:                  aws:///ap-southeast-1c/i-0bd4038f4dade284e
Non-terminated Pods:         (0 in total)
  Namespace                  Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests  Limits
  --------                    --------  ------
  cpu                         0 (0%)    0 (0%)
  memory                      0 (0%)    0 (0%)
  ephemeral-storage           0 (0%)    0 (0%)
  attachable-volumes-aws-ebs  0         0
  Type     Reason                Age                      From                                                         Message
  ----     ------                ----                     ----                                                         -------
  Warning  EvictionThresholdMet  5m40s (x4865 over 3d5h)  kubelet, ip-192-168-197-198.ap-southeast-1.compute.internal  Attempting to reclaim ephemeral-storage
  Warning  ImageGCFailed         31s (x451 over 45h)      kubelet, ip-192-168-197-198.ap-southeast-1.compute.internal  (combined from similar events): failed to garbage collect required amount of images. Wanted to free 4006422937 bytes, but freed 0 bytes

I'm not quite sure how to debug this, but it looks like K8s is unable to delete old, unused Docker images on the nodes. Is there any way to verify this assumption? Any other thoughts?
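
To put the ImageGCFailed numbers in perspective, here is a small sketch that converts the "Wanted to free" byte counts into GiB and into a share of the node's reported capacity. The helper names bytes_to_gib and percent_of_capacity_ki are made up for illustration, and the percentage assumes the image filesystem is the same 20959212Ki volume reported as ephemeral-storage, which the output above does not confirm.

```shell
# Hypothetical helpers for reading the ImageGCFailed events above.

# Convert a byte count to GiB, rounded to two decimals.
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}

# Express a byte count as a percentage of a capacity given in Ki,
# rounded to one decimal. Assumes the capacity refers to the same
# filesystem the images live on.
percent_of_capacity_ki() {
  awk -v b="$1" -v ki="$2" 'BEGIN { printf "%.1f\n", 100 * b / (ki * 1024) }'
}
```

With the values from the two nodes, bytes_to_gib 1423169945 prints 1.33 and bytes_to_gib 4006422937 prints 3.73, while percent_of_capacity_ki 4006422937 20959212 prints 18.7. So the second node was asked to free roughly 3.73 GiB, about 18.7% of its reported capacity, and freed nothing.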

This is my workaround:

kubectl drain --delete-local-data --ignore-daemonsets $NODE_NAME && kubectl uncordon $NODE_NAME  

It drains the node, wiping all local data and evicting all pods, and then lets the pods be scheduled again. But I'm looking for the root cause.
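
For what it's worth, the same workaround could be applied to every node that currently carries the disk-pressure taint, along these lines. This is only a sketch: the function names are hypothetical, and it assumes kubectl is already pointed at the cluster. It does nothing about the root cause either.

```shell
#!/usr/bin/env bash
# Sketch: run the drain/uncordon workaround on every node tainted with
# node.kubernetes.io/disk-pressure. All function names are hypothetical.

# Reads "<node-name><TAB><space-separated taint keys>" lines on stdin and
# prints the names of nodes that carry the disk-pressure taint.
filter_disk_pressure() {
  awk -F'\t' 'index($2, "node.kubernetes.io/disk-pressure") > 0 { print $1 }'
}

# Lists every node with its taint keys, one node per line.
list_node_taints() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
}

# Drains and then uncordons each affected node, one at a time.
drain_pressured_nodes() {
  list_node_taints | filter_disk_pressure | while read -r NODE_NAME; do
    kubectl drain --delete-local-data --ignore-daemonsets "$NODE_NAME" \
      && kubectl uncordon "$NODE_NAME"
  done
}
```

Calling drain_pressured_nodes would handle the nodes sequentially. In a two-node cluster where both nodes are tainted, evicted pods will likely sit in Pending until the first node is uncordoned and the taint clears.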