I have a small home server on which I host a few services I like. Recently the server has started going read-only on the main LVM volume, which is the volume Ubuntu is installed on. I have very little experience investigating this kind of problem.
Problem
Every few weeks I try to access one of the services hosted on the server and notice that it isn't responding. From experience, I know this means the /dev/mapper/ubuntu--vg-ubuntu--lv volume has gone read-only.
My fix
I can run fsck /dev/mapper/ubuntu--vg-ubuntu--lv, which reports and fixes a large number of errors under /var/lib/docker/overlay2/.
Abbreviated output of the command:
fsck from util-linux 2.31.1
e2fsck 1.44.1 (24-Mar-2018)
/dev/mapper/ubuntu--vg-ubuntu--lv: recovering journal
JBD2: Invalid checksum recovering block 6294682 in log
JBD2: Invalid checksum recovering block 5767200 in log
JBD2: Invalid checksum recovering block 5769164 in log
...
Journal checksum error found in /dev/mapper/ubuntu--vg-ubuntu--lv
/dev/mapper/ubuntu--vg-ubuntu--lv contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found. Fix<y>? yes
Inode 395236 was part of the orphaned inode list. FIXED.
...
Pass 2: Checking directory structure
Symlink /var/lib/docker/overlay2/e2983dd0abaadcf1745ffa84433e9c84f8b1a4ab79ee4b9b059f039758b885d1/diff/opt/gitlab/embedded/service/gitlab-shell/.gitlab_shell_secret (inode #1613356) is invalid.
Clear<y>? yes
...
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(7406080--7406211) -(7407552--7407614) -(7408736--7408765) -(7411072--7411125) -(7413760--7413824) -(7414784--7415295)
Fix? yes
/dev/mapper/ubuntu--vg-ubuntu--lv: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/ubuntu--vg-ubuntu--lv: ***** REBOOT SYSTEM *****
/dev/mapper/ubuntu--vg-ubuntu--lv: 453053/2097152 files (0.1% non-contiguous), 3134566/8388608 blocks
I follow this up by running sudo docker image rm gitlab/gitlab-ee -f for all affected images.
I obviously can't keep doing this, since each time I have to delete all the corrupted files and redeploy all the Docker images.
Question
How do I fix this once and for all? How do I find out where to look?
Update 1
Output of dmesg a few minutes after the failure occurred. Note that this is only a small portion; dmesg returns a lot of these lines, but they all have the same "Read-only file system" format.
[250377.826952] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250377.827922] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250378.236228] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250378.237456] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250381.765351] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250381.766273] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250381.982447] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250381.983362] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250382.639171] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250382.640142] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250382.766604] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
[250382.767619] systemd-journald[445]: Failed to write entry (21 items, 858 bytes), ignoring: Read-only file system
I think it is already too late to run dmesg at this point. I doubt I'll be able to notice the problem any sooner.
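One way to cut through the journald noise, sketched here under assumptions, is to filter the kernel log for storage-related messages instead of reading it top to bottom. The function name storage_errors and the pattern list are illustrative choices, not a definitive set, and this only helps while the earlier messages are still in the kernel ring buffer.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read kernel log text on stdin and keep only lines
# that look like storage or filesystem failures. The pattern list is an
# assumption, not an exhaustive catalogue.
storage_errors() {
    grep -E 'I/O error|FAILED Result|Aborting journal|EXT4-fs (error|warning)|Remounting filesystem read-only'
}

# Usage (needs permission to read the kernel log):
#   dmesg | storage_errors | head -n 20
```

The journald "Failed to write entry" lines do not match any of these patterns, so they are dropped.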
Update 2
I finally caught the error in time.
[94813.533642] IPv6: ADDRCONF(NETDEV_UP): veth13a279e: link is not ready
[94813.533654] br-069ee8939c9c: port 1(veth13a279e) entered blocking state
[94813.533658] br-069ee8939c9c: port 1(veth13a279e) entered forwarding state
[94813.534526] br-069ee8939c9c: port 1(veth13a279e) entered disabled state
[94813.603964] br-54483b957e18: port 3(vetha0f7425) entered blocking state
[94813.603967] br-54483b957e18: port 3(vetha0f7425) entered disabled state
[94813.604076] device vetha0f7425 entered promiscuous mode
[94813.604292] IPv6: ADDRCONF(NETDEV_UP): vetha0f7425: link is not ready
[94813.604298] br-54483b957e18: port 3(vetha0f7425) entered blocking state
[94813.604299] br-54483b957e18: port 3(vetha0f7425) entered forwarding state
[94814.483126] br-54483b957e18: port 3(vetha0f7425) entered disabled state
[94814.816681] eth0: renamed from vethfda95fe
[94814.831541] IPv6: ADDRCONF(NETDEV_CHANGE): veth13a279e: link becomes ready
[94814.831591] br-069ee8939c9c: port 1(veth13a279e) entered blocking state
[94814.831593] br-069ee8939c9c: port 1(veth13a279e) entered forwarding state
[94815.842053] eth0: renamed from vethba9b3b5
[94815.859813] IPv6: ADDRCONF(NETDEV_CHANGE): vetha0f7425: link becomes ready
[94815.859882] br-54483b957e18: port 3(vetha0f7425) entered blocking state
[94815.859886] br-54483b957e18: port 3(vetha0f7425) entered forwarding state
[94816.829692] eth0: renamed from veth8fea459
[94816.847459] IPv6: ADDRCONF(NETDEV_CHANGE): vetha6ab738: link becomes ready
[94816.847491] br-54483b957e18: port 1(vetha6ab738) entered blocking state
[94816.847495] br-54483b957e18: port 1(vetha6ab738) entered forwarding state
[95248.835077] r8169 0000:13:00.0 enp19s0: link down
[95257.640648] r8169 0000:13:00.0 enp19s0: link up
[96145.500660] sd 0:0:0:0: [sda] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.500670] sd 0:0:0:0: [sda] tag#6 CDB: Write(10) 2a 00 00 a4 7d 40 00 00 30 00
[96145.500674] print_req_error: I/O error, dev sda, sector 10779968
[96145.500826] Aborting journal on device dm-0-8.
[96145.500828] sd 0:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.500834] sd 0:0:0:0: [sda] tag#7 CDB: Write(10) 2a 00 07 64 41 a8 00 00 08 00
[96145.500895] print_req_error: I/O error, dev sda, sector 124010920
[96145.500932] EXT4-fs error (device dm-0) in __ext4_new_inode:1082: Journal has aborted
[96145.500947] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 1709169 (offset 0 size 4096 starting block 6849845)
[96145.500982] Buffer I/O error on device dm-1, logical block 6849845
[96145.501023] sd 0:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.501025] sd 0:0:0:0: [sda] tag#10 CDB: Write(10) 2a 00 05 d0 58 88 00 00 08 00
[96145.501027] print_req_error: I/O error, dev sda, sector 97540232
[96145.501058] EXT4-fs warning (device dm-0): ext4_end_bio:323: I/O error 10 writing to inode 450308 (offset 0 size 0 starting block 7735313)
[96145.501060] Buffer I/O error on device dm-0, logical block 7735313
[96145.501105] sd 0:0:0:0: [sda] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.501110] sd 0:0:0:0: [sda] tag#9 CDB: Write(10) 2a 00 0a 24 94 80 00 00 08 00
[96145.501114] print_req_error: I/O error, dev sda, sector 170169472
[96145.501147] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 3148652 (offset 0 size 4096 starting block 12619664)
[96145.501148] Buffer I/O error on device dm-1, logical block 12619664
[96145.501187] sd 0:0:0:0: [sda] tag#8 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.501189] sd 0:0:0:0: [sda] tag#8 CDB: Write(10) 2a 00 03 27 16 40 00 00 80 00
[96145.501190] print_req_error: I/O error, dev sda, sector 52893248
[96145.501266] sd 0:0:0:0: [sda] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[96145.501268] sd 0:0:0:0: [sda] tag#11 CDB: Write(10) 2a 00 07 64 58 e0 00 00 08 00
[96145.501269] print_req_error: I/O error, dev sda, sector 124016864
[96145.501301] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 1706133 (offset 0 size 4096 starting block 6850588)
[96145.501302] Buffer I/O error on device dm-1, logical block 6850588
[96145.501382] EXT4-fs (dm-0): Delayed block allocation failed for inode 450308 at logical offset 18450 with max blocks 7 with error 30
[96145.501438] EXT4-fs (dm-0): This should not happen!! Data will be lost
[96145.501477] EXT4-fs error (device dm-0) in ext4_writepages:2895: Journal has aborted
[96145.504583] Aborting journal on device dm-1-8.
[96145.507210] EXT4-fs error (device dm-0): ext4_journal_check_start:61: Detected aborted journal
[96145.507306] EXT4-fs (dm-0): Remounting filesystem read-only
[96145.513169] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:5769: Journal has aborted
[96145.514680] EXT4-fs error (device dm-0) in ext4_mkdir:2660: IO failure
[96145.516071] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.516095] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.516121] EXT4-fs (dm-1): Remounting filesystem read-only
[96145.518645] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.520899] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.531523] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.542876] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.542955] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[96145.944229] traps: postgres[2215] general protection ip:7fba363d97a1 sp:7ffc943cc880 error:0 in ld-musl-x86_64.so.1[7fba363d2000+46000]
[96146.339588] br-ae1db04d3386: port 4(vethb6cf243) entered disabled state
[96146.345493] veth3fdc38e: renamed from eth0
[96146.441684] overlayfs: upper fs is r/o, try multi-lower layers mount
[96149.670401] br-ae1db04d3386: port 7(veth2bcd3f7) entered disabled state
[96149.670617] vethf7402f2: renamed from eth0
[96149.776068] overlayfs: upper fs is r/o, try multi-lower layers mount
[96180.516686] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.516842] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
[96180.516985] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.517093] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
[96180.517192] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.517558] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
[96180.517753] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.518982] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
[96180.520186] systemd-journald[392]: Failed to write entry (21 items, 603 bytes), ignoring: Read-only file system
[96180.521475] systemd-journald[392]: Failed to write entry (21 items, 694 bytes), ignoring: Read-only file system
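As a cross-check on the log above, the Write(10) CDB bytes can be decoded by hand: byte 0 is the opcode (2a), bytes 2 to 5 are the big-endian LBA, and bytes 7 to 8 are the transfer length in blocks. On a disk with 512-byte logical blocks the LBA should match the sector in the print_req_error line that follows. The sketch below is only an illustration; decode_write10 is a made-up name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode a SCSI Write(10) CDB given as ten
# space-separated hex bytes, in the form the kernel prints it.
decode_write10() {
    local -a b
    read -r -a b <<< "$1"
    # Expect exactly 10 bytes with the Write(10) opcode 0x2a first.
    if [ "${#b[@]}" -ne 10 ] || [ "${b[0]}" != "2a" ]; then
        echo "not a Write(10) CDB" >&2
        return 1
    fi
    local lba=$(( 16#${b[2]}${b[3]}${b[4]}${b[5]} ))   # bytes 2-5, big-endian
    local len=$(( 16#${b[7]}${b[8]} ))                 # bytes 7-8, big-endian
    echo "lba=$lba blocks=$len"
}

decode_write10 "2a 00 00 a4 7d 40 00 00 30 00"   # prints "lba=10779968 blocks=48"
```

The decoded LBA 10779968 equals the sector reported for tag#6, so the failed requests are ordinary writes that timed out at the device (DRIVER_TIMEOUT), and the ext4 journal aborts follow from them.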
Update 3
I connected the drive to a Windows host and checked for firmware updates. The drive already has the latest firmware installed.
SMART also reports what seem to me to be very reasonable numbers.
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0032 000 100 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 5638
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 75
148 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 3
149 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
167 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
168 Unknown_Attribute 0x0012 100 100 000 Old_age Always - 0
169 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 26
170 Unknown_Attribute 0x0000 100 100 010 Old_age Offline - 18
172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
173 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 2424903
181 Program_Fail_Cnt_Total 0x0032 100 100 000 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0000 100 100 000 Old_age Offline - 1
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0012 100 100 000 Old_age Always - 52
194 Temperature_Celsius 0x0022 077 068 000 Old_age Always - 23 (Min/Max 22/32)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
218 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
231 Temperature_Celsius 0x0000 004 004 000 Old_age Offline - 96
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 3775
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 2489
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 976
244 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 37
245 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 71
246 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 141104
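To avoid eyeballing the whole table, a few failure-related attributes can be pulled out with awk. This is a rough sketch: the function name smart_nonzero and the choice of IDs (181, 182, 187, 196, 199) are assumptions about what is worth watching, and attribute meanings vary by vendor.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a smartctl attribute table on stdin and print
# NAME=RAW for selected attribute IDs whose raw value is not 0.
# Column 1 is the ID, column 2 the name, column 10 the raw value.
smart_nonzero() {
    awk '$1 ~ /^(181|182|187|196|199)$/ && $10 != 0 { print $2 "=" $10 }'
}

# Usage (smartmontools must be installed; /dev/sda is the disk from the log):
#   sudo smartctl -A /dev/sda | smart_nonzero
```

Fed the table above, this prints Erase_Fail_Count_Total=1 and Reallocated_Event_Count=1. The other selected attributes have a raw value of 0.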
I am going to install a fresh OS and restore the data from a backup to see whether that solves the problem.
Docker info
Since it seems that only the Docker overlay directory gets corrupted, here is the Docker storage configuration:
Server Version: 19.03.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true