We have an EKS cluster with two t3.small nodes, each with 20Gi of ephemeral storage. At the moment only two small Node.js applications (node:12-alpine) are running in the cluster.
This worked fine for several weeks, but now we are suddenly getting disk pressure errors.
$ kubectl describe nodes
Name: ip-192-168-101-158.ap-southeast-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t3.small
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=ap-southeast-1
failure-domain.beta.kubernetes.io/zone=ap-southeast-1a
kubernetes.io/hostname=ip-192-168-101-158.ap-southeast-1.compute.internal
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 31 Mar 2019 17:14:58 +0800
Taints: node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Sun, 12 May 2019 12:22:47 +0800 Sun, 31 Mar 2019 17:14:58 +0800 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Sun, 12 May 2019 12:22:47 +0800 Sun, 31 Mar 2019 17:14:58 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Sun, 12 May 2019 12:22:47 +0800 Sun, 12 May 2019 06:51:38 +0800 KubeletHasDiskPressure kubelet has disk pressure
PIDPressure False Sun, 12 May 2019 12:22:47 +0800 Sun, 31 Mar 2019 17:14:58 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sun, 12 May 2019 12:22:47 +0800 Sun, 31 Mar 2019 17:15:31 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.101.158
ExternalIP: 54.169.250.255
InternalDNS: ip-192-168-101-158.ap-southeast-1.compute.internal
ExternalDNS: ec2-54-169-250-255.ap-southeast-1.compute.amazonaws.com
Hostname: ip-192-168-101-158.ap-southeast-1.compute.internal
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 2002320Ki
pods: 11
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 19316009748
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1899920Ki
pods: 11
System Info:
Machine ID: ec2aa2ecfbbbdd798e2da086fc04afb6
System UUID: EC2AA2EC-FBBB-DD79-8E2D-A086FC04AFB6
Boot ID: 62c5eb9d-5f19-4558-8883-2da48ab1969c
Kernel Version: 4.14.106-97.85.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.6.1
Kubelet Version: v1.12.7
Kube-Proxy Version: v1.12.7
ProviderID: aws:///ap-southeast-1a/i-0a38342b60238d83e
Non-terminated Pods: (0 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 0 (0%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ImageGCFailed 5m15s (x333 over 40h) kubelet, ip-192-168-101-158.ap-southeast-1.compute.internal (combined from similar events): failed to garbage collect required amount of images. Wanted to free 1423169945 bytes, but freed 0 bytes
Warning EvictionThresholdMet 17s (x2809 over 3d4h) kubelet, ip-192-168-101-158.ap-southeast-1.compute.internal Attempting to reclaim ephemeral-storage
Name: ip-192-168-197-198.ap-southeast-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t3.small
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=ap-southeast-1
failure-domain.beta.kubernetes.io/zone=ap-southeast-1c
kubernetes.io/hostname=ip-192-168-197-198.ap-southeast-1.compute.internal
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 31 Mar 2019 17:15:02 +0800
Taints: node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Sun, 12 May 2019 12:22:42 +0800 Thu, 09 May 2019 06:50:56 +0800 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Sun, 12 May 2019 12:22:42 +0800 Thu, 09 May 2019 06:50:56 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Sun, 12 May 2019 12:22:42 +0800 Sat, 11 May 2019 21:53:44 +0800 KubeletHasDiskPressure kubelet has disk pressure
PIDPressure False Sun, 12 May 2019 12:22:42 +0800 Sun, 31 Mar 2019 17:15:02 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sun, 12 May 2019 12:22:42 +0800 Thu, 09 May 2019 06:50:56 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.197.198
ExternalIP: 13.229.138.38
InternalDNS: ip-192-168-197-198.ap-southeast-1.compute.internal
ExternalDNS: ec2-13-229-138-38.ap-southeast-1.compute.amazonaws.com
Hostname: ip-192-168-197-198.ap-southeast-1.compute.internal
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 2002320Ki
pods: 11
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 19316009748
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1899920Ki
pods: 11
System Info:
Machine ID: ec27ee0765e86a14ed63d771073e63fb
System UUID: EC27EE07-65E8-6A14-ED63-D771073E63FB
Boot ID: 7869a0ee-dc2f-4082-ae3f-42c5231ab0e3
Kernel Version: 4.14.106-97.85.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.6.1
Kubelet Version: v1.12.7
Kube-Proxy Version: v1.12.7
ProviderID: aws:///ap-southeast-1c/i-0bd4038f4dade284e
Non-terminated Pods: (0 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 0 (0%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning EvictionThresholdMet 5m40s (x4865 over 3d5h) kubelet, ip-192-168-197-198.ap-southeast-1.compute.internal Attempting to reclaim ephemeral-storage
Warning ImageGCFailed 31s (x451 over 45h) kubelet, ip-192-168-197-198.ap-southeast-1.compute.internal (combined from similar events): failed to garbage collect required amount of images. Wanted to free 4006422937 bytes, but freed 0 bytes
I'm not quite sure how to debug this, but it looks like Kubernetes is unable to clean up old, unused Docker images on the nodes. Is there any way to verify this assumption? Any other ideas?
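I haven't checked this yet, but assuming SSH (or SSM) access to the worker nodes, I suppose standard Docker commands on the node itself should show whether stale images are what is filling the ephemeral disk, e.g.:
$ df -h /var/lib/docker   # how full the filesystem backing Docker actually is
$ docker system df        # space used by images, containers and the build cache
$ docker images           # anything old and unused listed here would be a GC candidate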
This is my workaround:
kubectl drain --delete-local-data --ignore-daemonsets $NODE_NAME && kubectl uncordon $NODE_NAME
It deletes all local data and evicts every pod, and the pods are then restarted once the node is uncordoned. But I'm looking for the root cause of the problem.
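In case it's relevant: the kubelet also exposes its own per-node and per-pod view of disk usage via the standard summary API, which should show whether images, writable container layers or logs are taking the space, and it can be queried through the API server proxy without SSH:
$ kubectl get --raw "/api/v1/nodes/$NODE_NAME/proxy/stats/summary"
If I read the kubelet defaults correctly, image GC only starts once disk usage goes above --image-gc-high-threshold (85%), while eviction already starts at nodefs.available<10% / imagefs.available<15%, so on a 20Gi disk the gap between "GC kicks in" and "eviction kicks in" is quite small.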