
Drive fails SMART self-test, but overwriting the affected sector does not reallocate it

I have a drive (Seagate Momentus 7200.4, 2.5-inch, 7200 RPM, 500 GB, device model ST9500420AS) that is behaving strangely. I know the answer is probably just "throw it away", but I'm curious whether there is any explanation for what I'm seeing. There is currently no data on the drive, so I'm free to experiment with it.

The drive started reallocating some sectors and failing SMART self-tests, so I began manually writing to those sectors using hdparm --write-sector. However, I'm now stuck at a point where the SMART self-test fails at a sector, reading that sector with hdparm --read-sector fails, and writing the sector with hdparm --write-sector appears to succeed but does not increase Reallocated_Event_Count. Rerunning the test then fails at the same sector.
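As a rough sketch of the per-sector cycle described above (read, overwrite, read again), the following Python snippet only builds the hdparm command lines and does not run anything. The function name sector_cycle_commands and the default device /dev/sda are illustrative assumptions; the hdparm flags are the ones used in this post.

```python
from typing import List


def sector_cycle_commands(lba: int, device: str = "/dev/sda") -> List[List[str]]:
    """Build argv lists for a read / overwrite / re-read cycle on one sector.

    Hypothetical helper: nothing is executed here, so it is safe to call.
    The caller decides whether and how to run the commands (as root).
    """
    if lba < 0:
        raise ValueError("LBA must be non-negative")
    sector = str(lba)
    return [
        # 1. Try to read the suspect sector (expected to fail if it is bad).
        ["hdparm", "--read-sector", sector, device],
        # 2. Overwrite it with zeros; this destroys the sector's contents.
        ["hdparm", "--write-sector", sector,
         "--yes-i-know-what-i-am-doing", device],
        # 3. Read it back to see whether the write made it readable again.
        ["hdparm", "--read-sector", sector, device],
    ]


if __name__ == "__main__":
    for argv in sector_cycle_commands(918494442):
        print(" ".join(argv))
```

Keeping the commands as data rather than executing them makes it easy to review exactly what would be written before touching the disk.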

Specifically, here is the drive's status before the test:

$ sudo smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [aarch64-linux-4.19.0-8-arm64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Momentus 7200.4
Device Model:     ST9500420AS
[...]

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   112   090   006    Pre-fail  Always       -       132093263
  3 Spin_Up_Time            0x0002   099   097   000    Old_age   Always       -       0
  4 Start_Stop_Count        0x0033   093   093   000    Pre-fail  Always       -       8061
  5 Reallocated_Sector_Ct   0x0033   091   091   036    Pre-fail  Always       -       187
  7 Seek_Error_Rate         0x000f   068   060   030    Pre-fail  Always       -       90354048054
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       22236
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0033   099   037   020    Pre-fail  Always       -       1725
183 Runtime_Bad_Block       0x0032   100   253   000    Old_age   Always       -       19
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       260
188 Command_Timeout         0x0032   100   096   000    Old_age   Always       -       12885098584
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   050   045   045    Old_age   Always   In_the_past 50 (Min/Max 24/55)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       212
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       47
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       340982
194 Temperature_Celsius     0x0022   050   055   000    Old_age   Always       -       50 (0 15 0 0 0)
195 Hardware_ECC_Recovered  0x001a   052   032   000    Old_age   Always       -       132093263
196 Reallocated_Event_Count 0x0033   091   091   036    Pre-fail  Always       -       187
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       53
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

I run a long self-test:

$ sudo smartctl -t long /dev/sda
smartctl 6.6 2017-11-05 r4594 [aarch64-linux-4.19.0-8-arm64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 112 minutes for test to complete.
Test will complete after Wed Apr 29 14:37:18 2020

Use smartctl -X to abort test.

After a while, the test fails:

$ sudo smartctl -a /dev/sda     
smartctl 6.6 2017-11-05 r4594 [aarch64-linux-4.19.0-8-arm64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Momentus 7200.4
Device Model:     ST9500420AS
[...]

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   090   006    Pre-fail  Always       -       132093263
  3 Spin_Up_Time            0x0002   099   097   000    Old_age   Always       -       0
  4 Start_Stop_Count        0x0033   093   093   000    Pre-fail  Always       -       8061
  5 Reallocated_Sector_Ct   0x0033   091   091   036    Pre-fail  Always       -       187
  7 Seek_Error_Rate         0x000f   068   060   030    Pre-fail  Always       -       90354048074
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       22237
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0033   099   037   020    Pre-fail  Always       -       1725
183 Runtime_Bad_Block       0x0032   100   253   000    Old_age   Always       -       19
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       261
188 Command_Timeout         0x0032   100   096   000    Old_age   Always       -       12885098584
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   056   045   045    Old_age   Always   In_the_past 44 (Min/Max 24/55)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       212
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       47
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       340983
194 Temperature_Celsius     0x0022   044   055   000    Old_age   Always       -       44 (0 15 0 0 0)
195 Hardware_ECC_Recovered  0x001a   052   032   000    Old_age   Always       -       132093263
196 Reallocated_Event_Count 0x0033   091   091   036    Pre-fail  Always       -       187
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       53
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

[...]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     22236         918494442
[...]

Reading the corresponding sector fails:

$ sudo hdparm --read-sector 918494442 /dev/sda

/dev/sda:
reading sector 918494442: The running kernel lacks CONFIG_IDE_TASK_IOCTL support for this device.
FAILED: Invalid argument

Writing to the sector succeeds:

$ sudo hdparm --write-sector 918494442 --yes-i-know-what-i-am-doing /dev/sda

/dev/sda:
re-writing sector 918494442: succeeded

Now reading that sector works:

$ sudo hdparm --read-sector 918494442 /dev/sda                              

/dev/sda:
reading sector 918494442: succeeded
0000 0000 0000 0000 0000 0000 0000 0000
[...]

The SMART attributes show that Current_Pending_Sector has decreased, but Reallocated_Event_Count has not increased:

$ sudo smartctl -a /dev/sda                              
smartctl 6.6 2017-11-05 r4594 [aarch64-linux-4.19.0-8-arm64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Momentus 7200.4
Device Model:     ST9500420AS
[...]

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   110   090   006    Pre-fail  Always       -       132096978
  3 Spin_Up_Time            0x0002   099   097   000    Old_age   Always       -       0
  4 Start_Stop_Count        0x0033   093   093   000    Pre-fail  Always       -       8061
  5 Reallocated_Sector_Ct   0x0033   091   091   036    Pre-fail  Always       -       187
  7 Seek_Error_Rate         0x000f   068   060   030    Pre-fail  Always       -       90354049556
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       22241
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0033   099   037   020    Pre-fail  Always       -       1725
183 Runtime_Bad_Block       0x0032   100   253   000    Old_age   Always       -       19
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       262
188 Command_Timeout         0x0032   100   096   000    Old_age   Always       -       12885098584
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   049   045   045    Old_age   Always   In_the_past 51 (Min/Max 24/55)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       212
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       47
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       340984
194 Temperature_Celsius     0x0022   051   055   000    Old_age   Always       -       51 (0 15 0 0 0)
195 Hardware_ECC_Recovered  0x001a   052   032   000    Old_age   Always       -       132096978
196 Reallocated_Event_Count 0x0033   091   091   036    Pre-fail  Always       -       187
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       53
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0
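Comparing two of these attribute tables by eye is error-prone, so here is a minimal Python sketch that extracts the RAW_VALUE column from smartctl -a output and reports which attributes changed. The helper names parse_raw_values and changed_raw_values are hypothetical, and the regular expression assumes the ten-column table layout shown in this post. The sample rows are copied from the tables before and after the write.

```python
import re
from typing import Dict, Tuple

# ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE.
# Only the leading integer of RAW_VALUE is captured, e.g. 50 from
# "50 (Min/Max 24/55)". This layout is an assumption based on the output above.
ATTR_LINE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_raw_values(text: str) -> Dict[str, int]:
    """Map attribute name -> integer raw value for every attribute row."""
    values = {}
    for line in text.splitlines():
        match = ATTR_LINE.match(line)
        if match:
            values[match.group(2)] = int(match.group(3))
    return values


def changed_raw_values(before: Dict[str, int],
                       after: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return {name: (old, new)} for attributes whose raw value differs."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


BEFORE = """
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       261
196 Reallocated_Event_Count 0x0033   091   091   036    Pre-fail  Always       -       187
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
"""

AFTER = """
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       262
196 Reallocated_Event_Count 0x0033   091   091   036    Pre-fail  Always       -       187
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
"""

if __name__ == "__main__":
    diff = changed_raw_values(parse_raw_values(BEFORE), parse_raw_values(AFTER))
    for name, (old, new) in diff.items():
        print(f"{name}: {old} -> {new}")
```

On these excerpts it reports Reported_Uncorrect going from 261 to 262 and Current_Pending_Sector from 2 to 1. Reallocated_Event_Count is absent from the result because it stayed at 187.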

So I run the self-test again:

$ sudo smartctl -t long /dev/sda
smartctl 6.6 2017-11-05 r4594 [aarch64-linux-4.19.0-8-arm64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 112 minutes for test to complete.
Test will complete after Tue Apr 28 21:06:52 2020

Use smartctl -X to abort test.

And after a while the test fails again, at exactly the same location:

$ sudo smartctl -a /dev/sda           
smartctl 6.6 2017-11-05 r4594 [aarch64-linux-4.19.0-8-arm64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Momentus 7200.4
Device Model:     ST9500420AS
[...]

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   104   090   006    Pre-fail  Always       -       143254342
  3 Spin_Up_Time            0x0002   099   097   000    Old_age   Always       -       0
  4 Start_Stop_Count        0x0033   093   093   000    Pre-fail  Always       -       8061
  5 Reallocated_Sector_Ct   0x0033   091   091   036    Pre-fail  Always       -       187
  7 Seek_Error_Rate         0x000f   068   060   030    Pre-fail  Always       -       90354061019
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       22283
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0033   099   037   020    Pre-fail  Always       -       1725
183 Runtime_Bad_Block       0x0032   100   253   000    Old_age   Always       -       19
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       275
188 Command_Timeout         0x0032   100   096   000    Old_age   Always       -       12885098584
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   058   045   045    Old_age   Always   In_the_past 42 (Min/Max 24/55)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       212
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       31
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       341000
194 Temperature_Celsius     0x0022   042   055   000    Old_age   Always       -       42 (0 15 0 0 0)
195 Hardware_ECC_Recovered  0x001a   035   032   000    Old_age   Always       -       143254342
196 Reallocated_Event_Count 0x0033   091   091   036    Pre-fail  Always       -       187
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       53
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

[...]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     22262         918494442
# 2  Extended offline    Completed: read failure       90%     22236         918494442
[...]

I have repeated this several times, to no avail (apart from one time when the test passed and one time when it failed at a different location). Here is the full test log so far (with the first tests corresponding to failures where writing to the sector to force reallocation did work):

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%     22272         918494442
# 2  Extended offline    Completed: read failure       10%     22270         918494442
# 3  Extended offline    Completed: read failure       10%     22267         918494442
# 4  Extended offline    Completed: read failure       90%     22265         871827620
# 5  Extended offline    Completed: read failure       90%     22262         918494442
# 6  Extended offline    Completed: read failure       90%     22236         918494442
# 7  Extended offline    Completed: read failure       10%     22220         918494442
# 8  Extended offline    Completed: read failure       10%     22216         918494442
# 9  Extended offline    Completed: read failure       10%     22214         918494442
#10  Extended offline    Completed: read failure       10%     22211         918494442
#11  Extended offline    Completed: read failure       10%     22200         918494442
#12  Extended offline    Completed without error       00%     22198         -
#13  Extended offline    Completed: read failure       10%     22196         871847239
#14  Extended offline    Completed: read failure       10%     22193         871814225
#15  Extended offline    Completed: read failure       90%     22189         918480478
#16  Extended offline    Completed: read failure       90%     22188         918480478
#17  Extended offline    Completed: read failure       90%     22175         918512077
#18  Extended offline    Completed: read failure       90%     22169         918509168
#19  Extended offline    Completed: read failure       90%     22169         918442466
#20  Extended offline    Completed: read failure       90%     22169         918440526
#21  Extended offline    Completed: read failure       90%     22169         918441496
9 of 20 failed self-tests are outdated by newer successful extended offline self-test #12
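To see how often each LBA shows up, the self-test log can be tallied with a short script. This is only a sketch: the helper name first_error_lbas is hypothetical, and it assumes the last whitespace-separated field of each "#" row is LBA_of_first_error, as in the log above. The data is the log copied verbatim.

```python
from collections import Counter
from typing import List

SELF_TEST_LOG = """
# 1  Extended offline    Completed: read failure       10%     22272         918494442
# 2  Extended offline    Completed: read failure       10%     22270         918494442
# 3  Extended offline    Completed: read failure       10%     22267         918494442
# 4  Extended offline    Completed: read failure       90%     22265         871827620
# 5  Extended offline    Completed: read failure       90%     22262         918494442
# 6  Extended offline    Completed: read failure       90%     22236         918494442
# 7  Extended offline    Completed: read failure       10%     22220         918494442
# 8  Extended offline    Completed: read failure       10%     22216         918494442
# 9  Extended offline    Completed: read failure       10%     22214         918494442
#10  Extended offline    Completed: read failure       10%     22211         918494442
#11  Extended offline    Completed: read failure       10%     22200         918494442
#12  Extended offline    Completed without error       00%     22198         -
#13  Extended offline    Completed: read failure       10%     22196         871847239
#14  Extended offline    Completed: read failure       10%     22193         871814225
#15  Extended offline    Completed: read failure       90%     22189         918480478
#16  Extended offline    Completed: read failure       90%     22188         918480478
#17  Extended offline    Completed: read failure       90%     22175         918512077
#18  Extended offline    Completed: read failure       90%     22169         918509168
#19  Extended offline    Completed: read failure       90%     22169         918442466
#20  Extended offline    Completed: read failure       90%     22169         918440526
#21  Extended offline    Completed: read failure       90%     22169         918441496
"""


def first_error_lbas(log_text: str) -> List[int]:
    """Collect LBA_of_first_error from every log row that reports one."""
    lbas = []
    for line in log_text.splitlines():
        if not line.lstrip().startswith("#"):
            continue  # skip headers, blank lines and summary lines
        fields = line.split()
        # Rows for tests that completed without error end in "-".
        if fields and fields[-1].isdigit():
            lbas.append(int(fields[-1]))
    return lbas


if __name__ == "__main__":
    counts = Counter(first_error_lbas(SELF_TEST_LOG))
    for lba, count in counts.most_common():
        print(f"{lba}: {count}")
```

On this log, LBA 918494442 accounts for 10 of the 20 failed tests, 918480478 for two, and every other LBA appears once.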

I also tried running sudo badblocks -swv /dev/sda (at least the first pass), but it did not seem to trigger any errors or sector reallocations.

Again, I know this drive can no longer be trusted, but I just don't understand why it is behaving so strangely. Any ideas? Thanks!