Назад | Перейти на главную страницу

Повреждение файловой системы Docker overlay2

У меня есть небольшой домашний сервер, на котором я размещаю несколько сервисов, которые мне нравятся. Недавно сервер перешел в состояние только для чтения на главной LVM объем, это объем, где Ubuntu установлен. У меня очень мало опыта исследования подобных проблем.

Проблема

Каждые несколько недель я буду пытаться получить доступ к одной из служб, размещенных на сервере, и замечу, что они не отвечают. По опыту знаю, что это означает /dev/mapper/ubuntu--vg-ubuntu--lv том пошел только для чтения.

Мое исправление

Я могу бегать fsck /dev/mapper/ubuntu--vg-ubuntu--lv что приводит к сообщению и исправлению множества ошибок в /var/lib/docker/overlay2/.

Сокращенный вывод команды:

fsck from util-linux 2.31.1
e2fsck 1.44.1 (24-Mar-2018)
/dev/mapper/ubuntu--vg-ubuntu--lv: recovering journal
JBD2: Invalid checksum recovering block 6294682 in log
JBD2: Invalid checksum recovering block 5767200 in log
JBD2: Invalid checksum recovering block 5769164 in log
...
Journal checksum error found in /dev/mapper/ubuntu--vg-ubuntu--lv
/dev/mapper/ubuntu--vg-ubuntu--lv contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found.  Fix<y>? yes
Inode 395236 was part of the orphaned inode list.  FIXED.
...
Pass 2: Checking directory structure
Symlink /var/lib/docker/overlay2/e2983dd0abaadcf1745ffa84433e9c84f8b1a4ab79ee4b9b059f039758b885d1/diff/opt/gitlab/embedded/service/gitlab-shell/.gitlab_shell_secret (inode #1613356) is invalid.
Clear<y>? yes
...
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(7406080--7406211) -(7407552--7407614) -(7408736--7408765) -(7411072--7411125) -(7413760--7413824) -(7414784--7415295)
Fix? yes
/dev/mapper/ubuntu--vg-ubuntu--lv: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/ubuntu--vg-ubuntu--lv: ***** REBOOT SYSTEM *****
/dev/mapper/ubuntu--vg-ubuntu--lv: 453053/2097152 files (0.1% non-contiguous), 3134566/8388608 blocks

Я следую за этим, бегая sudo docker image rm gitlab/gitlab-ee -f для всех затронутых изображений.

Я явно не могу продолжать это делать, так как мне нужно удалить все поврежденные файлы и повторно развернуть все образы докеров.

Вопрос

Как мне исправить это раз и навсегда? Как мне узнать, где искать?

Обновление 1

Выход dmesg через несколько минут после того, как произошло разрушение. Обратите внимание, что это всего лишь небольшая часть, dmesg возвращает много из них, но все они одного формата Файловая система только для чтения формат.

[250377.826952] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250377.827922] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250378.236228] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250378.237456] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250381.765351] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250381.766273] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250381.982447] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250381.983362] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250382.639171] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250382.640142] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250382.766604] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250382.767619] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system

Я думаю, что уже слишком поздно бежать dmesg с этой точки зрения. Сомневаюсь, что смогу заметить проблему быстрее.

Обновление 2

Я наконец вовремя уловил ошибку.

[94813.533642] IPv6: ADDRCONF(NETDEV_UP): veth13a279e: link is not ready
[94813.533654] br-069ee8939c9c: port 1(veth13a279e) entered blocking state
[94813.533658] br-069ee8939c9c: port 1(veth13a279e) entered forwarding state
[94813.534526] br-069ee8939c9c: port 1(veth13a279e) entered disabled state
[94813.603964] br-54483b957e18: port 3(vetha0f7425) entered blocking state
[94813.603967] br-54483b957e18: port 3(vetha0f7425) entered disabled state
[94813.604076] device vetha0f7425 entered promiscuous mode
[94813.604292] IPv6: ADDRCONF(NETDEV_UP): vetha0f7425: link is not ready
[94813.604298] br-54483b957e18: port 3(vetha0f7425) entered blocking state
[94813.604299] br-54483b957e18: port 3(vetha0f7425) entered forwarding state
[94814.483126] br-54483b957e18: port 3(vetha0f7425) entered disabled state
[94814.816681] eth0: renamed from vethfda95fe
[94814.831541] IPv6: ADDRCONF(NETDEV_CHANGE): veth13a279e: link becomes ready
[94814.831591] br-069ee8939c9c: port 1(veth13a279e) entered blocking state
[94814.831593] br-069ee8939c9c: port 1(veth13a279e) entered forwarding state
[94815.842053] eth0: renamed from vethba9b3b5
[94815.859813] IPv6: ADDRCONF(NETDEV_CHANGE): vetha0f7425: link becomes ready
[94815.859882] br-54483b957e18: port 3(vetha0f7425) entered blocking state
[94815.859886] br-54483b957e18: port 3(vetha0f7425) entered forwarding state
[94816.829692] eth0: renamed from veth8fea459
[94816.847459] IPv6: ADDRCONF(NETDEV_CHANGE): vetha6ab738: link becomes ready
[94816.847491] br-54483b957e18: port 1(vetha6ab738) entered blocking state
[94816.847495] br-54483b957e18: port 1(vetha6ab738) entered forwarding state
[95248.835077] r8169 0000:13:00.0 enp19s0: link down
[95257.640648] r8169 0000:13:00.0 enp19s0: link up
[96145.500660] sd 0:0:0:0: [sda] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.500670] sd 0:0:0:0: [sda] tag#6 CDB: Write(10) 2a 00 00 a4 7d 40 00 00 30 00
[96145.500674] print_req_error: I/O error, dev sda, sector 10779968
[96145.500826] Aborting journal on device dm-0-8.
[96145.500828] sd 0:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.500834] sd 0:0:0:0: [sda] tag#7 CDB: Write(10) 2a 00 07 64 41 a8 00 00 08 00
[96145.500895] print_req_error: I/O error, dev sda, sector 124010920
[96145.500932] EXT4-fs error (device dm-0) in __ext4_new_inode:1082: Journal has aborted
[96145.500947] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 1709169 (offset 0 size 4096 starting block 6849845)
[96145.500982] Buffer I/O error on device dm-1, logical block 6849845
[96145.501023] sd 0:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.501025] sd 0:0:0:0: [sda] tag#10 CDB: Write(10) 2a 00 05 d0 58 88 00 00 08 00
[96145.501027] print_req_error: I/O error, dev sda, sector 97540232
[96145.501058] EXT4-fs warning (device dm-0): ext4_end_bio:323: I/O error 10 writing to inode 450308 (offset 0 size 0 starting block 7735313)
[96145.501060] Buffer I/O error on device dm-0, logical block 7735313
[96145.501105] sd 0:0:0:0: [sda] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.501110] sd 0:0:0:0: [sda] tag#9 CDB: Write(10) 2a 00 0a 24 94 80 00 00 08 00
[96145.501114] print_req_error: I/O error, dev sda, sector 170169472
[96145.501147] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 3148652 (offset 0 size 4096 starting block 12619664)
96145.501148] Buffer I/O error on device dm-1, logical block 12619664
[96145.501187] sd 0:0:0:0: [sda] tag#8 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.501189] sd 0:0:0:0: [sda] tag#8 CDB: Write(10) 2a 00 03 27 16 40 00 00 80 00
[96145.501190] print_req_error: I/O error, dev sda, sector 52893248
[96145.501266] sd 0:0:0:0: [sda] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.501268] sd 0:0:0:0: [sda] tag#11 CDB: Write(10) 2a 00 07 64 58 e0 00 00 08 00
[96145.501269] print_req_error: I/O error, dev sda, sector 124016864
[96145.501301] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 1706133 (offset 0 size 4096 starting bl
ock 6850588)
[96145.501302] Buffer I/O error on device dm-1, logical block 6850588
[96145.501382] EXT4-fs (dm-0): Delayed block allocation failed for inode 450308 at logical offset 18450 with max blocks 7 with error 
30
[96145.501438] EXT4-fs (dm-0): This should not happen!! Data will be lost
[96145.501477] EXT4-fs error (device dm-0) in ext4_writepages:2895: Journal has aborted
[96145.504583] Aborting journal on device dm-1-8.
[96145.507210] EXT4-fs error (device dm-0): ext4_journal_check_start:61: Detected aborted journal
[96145.507306] EXT4-fs (dm-0): Remounting filesystem read-only
[96145.513169] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:5769: Journal has aborted
[96145.514680] EXT4-fs error (device dm-0) in ext4_mkdir:2660: IO failure
[96145.516071] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.516095] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.516121] EXT4-fs (dm-1): Remounting filesystem read-only
[96145.518645] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.520899] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.531523] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.542876] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.542955] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.944229] traps: postgres[2215] general protection ip:7fba363d97a1 sp:7ffc943cc880 error:0 in ld-musl-x86_64.so.1[7fba363d2000+4
6000]
[96146.339588] br-ae1db04d3386: port 4(vethb6cf243) entered disabled state
[96146.345493] veth3fdc38e: renamed from eth0
[96146.441684] overlayfs: upper fs is r/o, try multi-lower layers mount
[96149.670401] br-ae1db04d3386: port 7(veth2bcd3f7) entered disabled state
[96149.670617] vethf7402f2: renamed from eth0
[96149.776068] overlayfs: upper fs is r/o, try multi-lower layers mount
[96180.516686] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.516842] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
[96180.516985] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.517093] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
[96180.517192] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.517558] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
[96180.517753] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.518982] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
[96180.520186] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.521475] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system

Обновление 3

Я подключил диск к хосту Windows и проверил наличие обновлений прошивки. На накопителе установлена ​​последняя версия прошивки.

SMART также сообщает, как мне кажется, очень разумные цифры.

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   000   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       5638
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       75
148 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       3
149 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
167 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
169 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       26
170 Unknown_Attribute       0x0000   100   100   010    Old_age   Offline      -       18
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       2424903
181 Program_Fail_Cnt_Total  0x0032   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0000   100   100   000    Old_age   Offline      -       1
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       52
194 Temperature_Celsius     0x0022   077   068   000    Old_age   Always       -       23 (Min/Max 22/32)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
218 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
231 Temperature_Celsius     0x0000   004   004   000    Old_age   Offline      -       96
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       3775
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       2489
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       976
244 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       37
245 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       71
246 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       141104

Я установлю новую ОС и восстановлю данные из резервной копии, чтобы посмотреть, решит ли это проблему.

Информация о докере

Поскольку кажется, что поврежден только каталог оверлея докеров, вот конфигурация хранилища докеров:

Server Version: 19.03.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true