
mpt3sas_scsih_issue_tm: timeout during a ZFS scrub

Overview

We are currently scrubbing a ZFS pool made up of 12 RAID-Z1 vdevs, each with 12 drives. Each vdev corresponds to one enclosure. The hardware is a Dell PowerEdge 730xd with two Dell 12 Gbps SAS controllers (LSI SAS3008) and 12 Dell MD1400 enclosures. The operating system is CentOS 7.6.1810.

We have not been able to complete the scrub: after a while drives become FAULTED in ZFS, and we have to run zpool clear to continue. The drives that become FAULTED appear to be random, and smartctl reports their SMART status as OK.

The only common thread is that, before the drives are marked FAULTED, the error message mpt3sas_scsih_issue_tm: timeout appears in dmesg, followed by a controller reset and a flood of ZED errors and read errors.

I am currently stuck on the following:

What we've tried

We have tried the following:

I also looked at this answer, but it did not help.

Details

Here is the journalctl output from the moment the event begins:

Apr 12 04:42:07 kernel: sd 5:0:18:0: attempting task abort! scmd(ffff8d36c295a4c0)
Apr 12 04:42:07 kernel: sd 5:0:4:0: attempting task abort! scmd(ffff8d3745b20540)
Apr 12 04:42:07 kernel: sd 5:0:4:0: [sdac] CDB: Read(32)
Apr 12 04:42:07 kernel: sd 5:0:4:0: [sdac] CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
Apr 12 04:42:07 kernel: sd 5:0:4:0: [sdac] CDB[10]: 60 2a b8 c8 60 2a b8 c8 00 00 00 00 00 00 00 08
Apr 12 04:42:07 kernel: scsi target5:0:4: handle(0x000e), sas_address(0x5000c500a6bb846e), phy(4)
Apr 12 04:42:07 kernel: scsi target5:0:4: enclosure logical id(0x5204747299f56500), slot(4) 
Apr 12 04:42:07 kernel: scsi target5:0:4: enclosure level(0x0000), connector name( 1   )
Apr 12 04:42:07 kernel: sd 5:0:18:0: [sdap] CDB: Read(32)
Apr 12 04:42:07 kernel: sd 5:0:18:0: [sdap] CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
Apr 12 04:42:07 kernel: sd 5:0:18:0: [sdap] CDB[10]: 60 2b f7 f8 60 2b f7 f8 00 00 00 00 00 00 00 08
Apr 12 04:42:07 kernel: scsi target5:0:18: handle(0x001d), sas_address(0x5000c500a6bb68ce), phy(5)
Apr 12 04:42:07 kernel: scsi target5:0:18: enclosure logical id(0x5204747299f5dd00), slot(0) 
Apr 12 04:42:07 kernel: scsi target5:0:18: enclosure level(0x0001), connector name( 1   )
Apr 12 04:42:37 kernel: mpt3sas_cm1: mpt3sas_scsih_issue_tm: timeout
Apr 12 04:42:37 kernel: mf:

Apr 12 04:42:37 kernel: 0100000e 
Apr 12 04:42:37 kernel: 00000100 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 

Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 000000b6 
Apr 12 04:42:37 kernel: 
Apr 12 04:42:47 kernel: mpt3sas_cm1: sending diag reset !!
Apr 12 04:42:48 kernel: mpt3sas_cm1: diag reset: SUCCESS
Apr 12 04:42:48 kernel: mpt3sas_cm1: LSISAS3008: FWVersion(16.00.04.00), ChipRevision(0x02), BiosVersion(18.00.00.00)
Apr 12 04:42:48 kernel: mpt3sas_cm1: Protocol=(
Apr 12 04:42:48 kernel: Initiator
Apr 12 04:42:48 kernel: ,Target
Apr 12 04:42:48 kernel: ), 
Apr 12 04:42:48 kernel: Capabilities=(
Apr 12 04:42:48 kernel: TLR
Apr 12 04:42:48 kernel: ,EEDP
Apr 12 04:42:48 kernel: ,Snapshot Buffer
Apr 12 04:42:48 kernel: ,Diag Trace Buffer
Apr 12 04:42:48 kernel: ,Task Set Full
Apr 12 04:42:48 kernel: ,NCQ
Apr 12 04:42:48 kernel: )
Apr 12 04:42:48 kernel: mpt3sas_cm1: sending port enable !!
Apr 12 04:42:55 kernel: mpt3sas_cm1: port enable: SUCCESS
Apr 12 04:42:55 kernel: mpt3sas_cm1: search for end-devices: start
Apr 12 04:42:55 kernel: scsi target5:0:0: handle(0x000a), sas_addr(0x5000c500a6bc5ef6)
Apr 12 04:42:55 kernel: scsi target5:0:0: enclosure logical id(0x5204747299f56500), slot(9)
Apr 12 04:42:55 kernel: scsi target5:0:1: handle(0x000b), sas_addr(0x5000c500a6bc6e66)
Apr 12 04:42:55 kernel: scsi target5:0:1: enclosure logical id(0x5204747299f56500), slot(5)
Apr 12 04:42:55 kernel: scsi target5:0:2: handle(0x000c), sas_addr(0x5000c500a6bbd86e)
Apr 12 04:42:55 kernel: scsi target5:0:2: enclosure logical id(0x5204747299f56500), slot(1)

The handle and enclosure lines repeat for every drive attached to the controller.

This is followed by:

Apr 12 04:42:57 kernel: mpt3sas_cm1: search for end-devices: complete
Apr 12 04:42:57 kernel: mpt3sas_cm1: search for expanders: start
Apr 12 04:42:57 kernel:         expander present: handle(0x0009), sas_addr(0x5204747299f565ff)
Apr 12 04:42:57 kernel:         expander present: handle(0x0016), sas_addr(0x5204747299f5ddff)
Apr 12 04:42:57 kernel:         expander present: handle(0x0024), sas_addr(0x520474729a0a68ff)
Apr 12 04:42:57 kernel:         expander present: handle(0x0032), sas_addr(0x520474729a0b61ff)
Apr 12 04:42:57 kernel:         expander present: handle(0x0040), sas_addr(0x520474729a09f1ff)
Apr 12 04:42:57 kernel: mpt3sas_cm1: search for expanders: complete
Apr 12 04:42:57 kernel: sd 5:0:4:0: task abort: SUCCESS scmd(ffff8d3745b20540)
Apr 12 04:42:57 kernel: mpt3sas_cm1: removing unresponding devices: start
Apr 12 04:42:57 kernel: mpt3sas_cm1: removing unresponding devices: end-devices
Apr 12 04:42:57 kernel: mpt3sas_cm1: removing unresponding devices: expanders
Apr 12 04:42:57 kernel: mpt3sas_cm1: removing unresponding devices: complete
Apr 12 04:42:57 kernel: mpt3sas_cm1: scan devices: start
Apr 12 04:42:57 kernel: sd 5:0:18:0: task abort: SUCCESS scmd(ffff8d36c295a4c0)
Apr 12 04:42:57 kernel: scsi_io_completion: 13 callbacks suppressed
Apr 12 04:42:57 kernel: sd 5:0:18:0: [sdap] FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
Apr 12 04:42:57 kernel: sd 5:0:18:0: [sdap] CDB: Read(32)
Apr 12 04:42:57 kernel: sd 5:0:18:0: [sdap] CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
Apr 12 04:42:57 kernel: sd 5:0:18:0: [sdap] CDB[10]: 60 2b f7 f8 60 2b f7 f8 00 00 00 00 00 00 00 08
Apr 12 04:42:57 kernel: blk_update_request: 13 callbacks suppressed
Apr 12 04:42:57 kernel: blk_update_request: I/O error, dev sdap, sector 1613494264
Apr 12 04:42:57 kernel: sd 5:0:21:0: attempting task abort! scmd(ffff8d3acfef0540)
Apr 12 04:42:57 kernel: sd 5:0:21:0: [sdas] CDB: Read(32)
Apr 12 04:42:57 kernel: sd 5:0:21:0: [sdas] CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 03
Apr 12 04:42:57 kernel: sd 5:0:21:0: [sdas] CDB[10]: 01 af 8c b0 01 af 8c b0 00 00 00 00 00 00 00 08
Apr 12 04:42:57 kernel: scsi target5:0:21: handle(0x0020), sas_address(0x5000c500a6bc5f82), phy(8)
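As a sanity check on the log above, the Read(32) CDB logged for sdap can be decoded by hand. This is only a minimal sketch: it assumes the standard SBC READ(32) layout (opcode 0x7f, service action 0x0009 in bytes 8-9, RDPROTECT in the top three bits of byte 10, LBA in bytes 12-19, expected reference tag in bytes 20-23, transfer length in bytes 28-31), and the name decode_read32 is made up for illustration.

```python
def decode_read32(cdb_hex):
    """Decode selected fields of a READ(32) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 32 or cdb[0] != 0x7F:
        raise ValueError("not a 32-byte variable-length CDB")
    if int.from_bytes(cdb[8:10], "big") != 0x0009:
        raise ValueError("service action is not READ(32)")
    return {
        "rdprotect": cdb[10] >> 5,                       # protection checking mode
        "lba": int.from_bytes(cdb[12:20], "big"),        # starting logical block
        "ref_tag": int.from_bytes(cdb[20:24], "big"),    # expected initial reference tag
        "blocks": int.from_bytes(cdb[28:32], "big"),     # transfer length in blocks
    }


# The CDB[00] and CDB[10] halves logged for sdap, joined into one string.
SDAP_CDB = (
    "7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00 "
    "60 2b f7 f8 60 2b f7 f8 00 00 00 00 00 00 00 08"
)

if __name__ == "__main__":
    print(decode_read32(SDAP_CDB))
```

For the sdap CDB the decoded LBA is 1613494264, the same sector reported in the blk_update_request I/O error, and the transfer length is 8 blocks. The RDPROTECT field is non-zero, so the read was issued with protection information in use.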

Many more read timeouts follow. We then see a lot of zed errors:

Apr 12 04:42:57 zed[137074]: eid=2425 class=delay pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc59bb-part1
Apr 12 04:42:57 zed[137076]: eid=2426 class=delay pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc59bb-part1
Apr 12 04:42:57 zed[137078]: eid=2427 class=io pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc59bb-part1
Apr 12 04:42:57 zed[137080]: eid=2428 class=io pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc59bb-part1
Apr 12 04:42:57 zed[137082]: eid=2429 class=delay pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc4337-part1
Apr 12 04:42:57 zed[137084]: eid=2430 class=delay pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc4337-part1
Apr 12 04:42:57 zed[137086]: eid=2431 class=io pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc4337-part1
Apr 12 04:42:57 zed[137088]: eid=2432 class=io pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc4337-part1
Apr 12 04:42:57 zed[137090]: eid=2433 class=io pool_guid=0x3317CEBDDE480DA0
Apr 12 04:42:57 zed[137092]: eid=2434 class=io pool_guid=0x3317CEBDDE480DA0
Apr 12 04:42:57 zed[137094]: eid=2435 class=delay pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc5f83-part1
Apr 12 04:42:57 zed[137096]: eid=2436 class=delay pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc5f83-part1
Apr 12 04:42:57 zed[137098]: eid=2437 class=io pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc5f83-part1
Apr 12 04:42:57 zed[137100]: eid=2438 class=io pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc5f83-part1
Apr 12 04:42:57 zed[137102]: eid=2439 class=delay pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bb68cf-part1
Apr 12 04:42:57 zed[137104]: eid=2440 class=io pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bb68cf-part1
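With this many events it helps to tally them per device rather than read them line by line. The following is a rough sketch that assumes the zed line format shown above; count_zed_events and the "(no vdev_path)" label are illustrative names, not anything provided by ZFS.

```python
import re
from collections import Counter

# vdev_path is optional because some events in the excerpt carry only a pool_guid.
ZED_RE = re.compile(
    r"zed\[\d+\]: eid=\d+ class=(\S+) pool_guid=\S+(?: vdev_path=(\S+))?"
)


def count_zed_events(lines):
    """Return a Counter keyed by (vdev_path, event class)."""
    counts = Counter()
    for line in lines:
        m = ZED_RE.search(line)
        if m:
            event_class, vdev_path = m.groups()
            counts[(vdev_path or "(no vdev_path)", event_class)] += 1
    return counts


# A few lines copied from the excerpt above.
SAMPLE = """\
Apr 12 04:42:57 zed[137074]: eid=2425 class=delay pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc59bb-part1
Apr 12 04:42:57 zed[137076]: eid=2426 class=delay pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc59bb-part1
Apr 12 04:42:57 zed[137078]: eid=2427 class=io pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc59bb-part1
Apr 12 04:42:57 zed[137090]: eid=2433 class=io pool_guid=0x3317CEBDDE480DA0
""".splitlines()

if __name__ == "__main__":
    for (vdev, cls), n in sorted(count_zed_events(SAMPLE).items()):
        print(n, cls, vdev)
```

Feeding the full journal text through the same function would show whether the delay and io events cluster on a few devices or are spread across everything behind the controller that was reset.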

After this, the drives are marked DEGRADED or FAULTED. I am also including some additional information that may be useful.

Here is the zpool status output for two vdevs with FAULTED devices:

    raidz1-4                                         DEGRADED     0     0     0
      scsi-35000cca2513f78b8                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca25157bfd0                         ONLINE       0     0     0  (repairing)
      scsi-35000cca251597aa4                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca2515de7b0                         FAULTED      0     0     0  too many errors
      scsi-35000cca2516278c8                         DEGRADED     0     0     0  too many errors
      scsi-35000cca25163ea64                         ONLINE       0     0     0  (repairing)
      scsi-35000cca251644664                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca2516576a0                         DEGRADED     0     0     0  too many errors
      scsi-35000cca251699f68                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca25169bd10                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca25169be5c                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca25169c09c                         DEGRADED     0     0     0  too many errors  (repairing)
    raidz1-5                                         DEGRADED     0     0     0
      scsi-35000cca2516bc234                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca2516bc26c                         ONLINE       0     0     0
      scsi-35000cca2516c8e78                         ONLINE       0     0     0
      scsi-35000cca2516ca244                         ONLINE       0     0     0
      scsi-35000cca2516ca334                         ONLINE       0     0     0  (repairing)
      scsi-35000cca2516ca848                         ONLINE       0     0     0  (repairing)
      scsi-35000cca2516cb3e0                         ONLINE       0     0     0  (repairing)
      scsi-35000cca2516cb420                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca2516cc210                         ONLINE       0     0     0
      scsi-35000cca2516ce390                         FAULTED      0     0     0  too many errors  (repairing)
      scsi-35000cca2516ce8e4                         ONLINE       0     0     0
      scsi-35000cca2516cf224                         ONLINE       0     0     0
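In the listing above, raidz1-4 has 1 FAULTED, 9 DEGRADED and 2 ONLINE members, all with zero READ/WRITE/CKSUM counters. To get the same per-vdev summary for the whole pool, something like the sketch below could be used. It assumes the column layout shown above and that vdev rows start with "raidz"; the function name summarize is hypothetical.

```python
import re
from collections import Counter, defaultdict

# name, state, then the READ / WRITE / CKSUM columns (which may be e.g. "1.2K").
ROW_RE = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)\s+\S+\s+\S+\s+\S+"
)


def summarize(status_text):
    """Count member-device states under each raidz vdev."""
    per_vdev = defaultdict(Counter)
    current = None
    for line in status_text.splitlines():
        m = ROW_RE.match(line)
        if not m:
            continue
        name, state = m.groups()
        if name.startswith("raidz"):
            current = name              # a new vdev row starts a new group
        elif current is not None:
            per_vdev[current][state] += 1
    return per_vdev


# Four member rows of raidz1-5 copied from the output above.
SAMPLE = """\
    raidz1-5                                         DEGRADED     0     0     0
      scsi-35000cca2516bc234                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca2516bc26c                         ONLINE       0     0     0
      scsi-35000cca2516ce390                         FAULTED      0     0     0  too many errors  (repairing)
      scsi-35000cca2516cf224                         ONLINE       0     0     0
"""

if __name__ == "__main__":
    for vdev, states in summarize(SAMPLE).items():
        print(vdev, dict(states))
```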

Here is the smartctl -a output for the FAULTED drive in raidz1-4:

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH721010AL5200
Revision:             LS15
Compliance:           SPC-4
User Capacity:        9,796,820,402,176 bytes [9.79 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Formatted with type 2 protection
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca2515de7b0
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Apr 12 13:40:57 2019 CDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     29 C
Drive Trip Temperature:        50 C

Manufactured in week 02 of year 2017
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  5
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  889
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 30677043943309312

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       40         0       294   10394513     118610.223           0
write:         0        0         0         0     239773      43528.082           0
verify:        0        0         0         0      18403        101.563           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                  96   18243                 - [-   -    -]
# 2  Background short  Completed                  96   16753                 - [-   -    -]
# 3  Reserved(7)       Completed                  64       2                 - [-   -    -]

Long (extended) Self Test duration: 64033 seconds [1067.2 minutes]

sysctl -a | grep -v 'net.' | grep -v 'kernel.sched_domain.':

abi.vsyscall32 = 1
crypto.fips_enabled = 0
debug.exception-trace = 1
debug.kprobes-optimization = 1
debug.panic_on_rcu_stall = 0
dev.hpet.max-user-freq = 64
dev.mac_hid.mouse_button2_keycode = 97
dev.mac_hid.mouse_button3_keycode = 100
dev.mac_hid.mouse_button_emulation = 0
dev.raid.speed_limit_max = 200000
dev.raid.speed_limit_min = 1000
dev.scsi.logging_level = 0
fs.aio-max-nr = 65536
fs.aio-nr = 0
fs.binfmt_misc.status = enabled
fs.dentry-state = 235028  190450  45  0 0 0
fs.dir-notify-enable = 1
fs.epoll.max_user_watches = 108185722
fs.file-max = 52384239
fs.file-nr = 2080 0 52384239
fs.inode-nr = 102807  662
fs.inode-state = 102807 662 0 0 0 0 0
fs.inotify.max_queued_events = 16384
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 8192
fs.lease-break-time = 45
fs.leases-enable = 1
fs.may_detach_mounts = 0
fs.mount-max = 100000
fs.mqueue.msg_default = 10
fs.mqueue.msg_max = 10
fs.mqueue.msgsize_default = 8192
fs.mqueue.msgsize_max = 8192
fs.mqueue.queues_max = 256
fs.nfs.nlm_grace_period = 0
fs.nfs.nlm_tcpport = 0
fs.nfs.nlm_timeout = 10
fs.nfs.nlm_udpport = 0
fs.nfs.nsm_local_state = 3
fs.nfs.nsm_use_hostnames = 0
fs.nr_open = 1048576
fs.overflowgid = 65534
fs.overflowuid = 65534
fs.pipe-max-size = 1048576
fs.pipe-user-pages-hard = 0
fs.pipe-user-pages-soft = 16384
fs.protected_hardlinks = 1
fs.protected_symlinks = 1
fs.quota.allocated_dquots = 0
fs.quota.cache_hits = 0
fs.quota.drops = 0
fs.quota.free_dquots = 0
fs.quota.lookups = 0
fs.quota.reads = 0
fs.quota.syncs = 0
fs.quota.warnings = 1
fs.quota.writes = 0
fs.suid_dumpable = 0
fs.xfs.age_buffer_centisecs = 1500
fs.xfs.error_level = 3
fs.xfs.filestream_centisecs = 3000
fs.xfs.inherit_noatime = 1
fs.xfs.inherit_nodefrag = 1
fs.xfs.inherit_nodump = 1
fs.xfs.inherit_nosymlinks = 0
fs.xfs.inherit_sync = 1
fs.xfs.irix_sgid_inherit = 0
fs.xfs.irix_symlink_mode = 0
fs.xfs.panic_mask = 0
fs.xfs.rotorstep = 1
fs.xfs.speculative_prealloc_lifetime = 300
fs.xfs.stats_clear = 0
fs.xfs.xfsbufd_centisecs = 100
fs.xfs.xfssyncd_centisecs = 3000
kernel.acct = 4 2 30
kernel.acpi_video_flags = 0
kernel.auto_msgmni = 1
kernel.bootloader_type = 114
kernel.bootloader_version = 2
kernel.cad_pid = 1
kernel.cap_last_cap = 36
kernel.compat-log = 1
kernel.core_pattern = core
kernel.core_pipe_limit = 0
kernel.core_uses_pid = 1
kernel.ctrl-alt-del = 0
kernel.dmesg_restrict = 0
kernel.domainname = (none)
kernel.ftrace_dump_on_oops = 0
kernel.ftrace_enabled = 1
kernel.hardlockup_all_cpu_backtrace = 0
kernel.hardlockup_panic = 1
kernel.hostname = htc-sblock-node197
kernel.hotplug = 
kernel.hung_task_check_count = 4194304
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 120
kernel.hung_task_warnings = 0
kernel.io_delay_type = 0
kernel.kexec_load_disabled = 0
kernel.keys.gc_delay = 300
kernel.keys.maxbytes = 20000
kernel.keys.maxkeys = 200
kernel.keys.persistent_keyring_expiry = 259200
kernel.keys.root_maxbytes = 25000000
kernel.keys.root_maxkeys = 1000000
kernel.kptr_restrict = 0
kernel.max_lock_depth = 1024
kernel.modprobe = /sbin/modprobe
kernel.modules_disabled = 0
kernel.msg_next_id = -1
kernel.msgmax = 8192
kernel.msgmnb = 16384
kernel.msgmni = 32768
kernel.ngroups_max = 65536
kernel.nmi_watchdog = 1
kernel.ns_last_pid = 176562
kernel.numa_balancing = 1
kernel.numa_balancing_scan_delay_ms = 1000
kernel.numa_balancing_scan_period_max_ms = 60000
kernel.numa_balancing_scan_period_min_ms = 1000
kernel.numa_balancing_scan_size_mb = 256
kernel.numa_balancing_settle_count = 4
kernel.osrelease = 3.10.0-957.5.1.el7.x86_64
kernel.ostype = Linux
kernel.overflowgid = 65534
kernel.overflowuid = 65534
kernel.panic = 0
kernel.panic_on_io_nmi = 0
kernel.panic_on_oops = 1
kernel.panic_on_stackoverflow = 0
kernel.panic_on_unrecovered_nmi = 0
kernel.panic_on_warn = 0
kernel.perf_cpu_time_max_percent = 25
kernel.perf_event_max_sample_rate = 32000
kernel.perf_event_mlock_kb = 516
kernel.perf_event_paranoid = 2
kernel.pid_max = 196608
kernel.poweroff_cmd = /sbin/poweroff
kernel.print-fatal-signals = 0
kernel.printk = 7 4 1 7
kernel.printk_delay = 0
kernel.printk_ratelimit = 5
kernel.printk_ratelimit_burst = 10
kernel.pty.max = 4096
kernel.pty.nr = 4
kernel.pty.reserve = 1024
kernel.random.boot_id = 5bd2b4ab-221e-4157-98ad-fe4a81da7784
kernel.random.entropy_avail = 4034
kernel.random.poolsize = 4096
kernel.random.read_wakeup_threshold = 64
kernel.random.urandom_min_reseed_secs = 60
kernel.random.uuid = 4f4a6d22-d974-452d-b550-0e19b7a3c74e
kernel.random.write_wakeup_threshold = 896
kernel.randomize_va_space = 2
kernel.real-root-dev = 0
kernel.sched_autogroup_enabled = 0
kernel.sched_cfs_bandwidth_slice_us = 5000
kernel.sched_child_runs_first = 0
kernel.sched_latency_ns = 24000000
kernel.sched_migration_cost_ns = 500000
kernel.sched_min_granularity_ns = 3000000
kernel.sched_nr_migrate = 32
kernel.sched_rr_timeslice_ms = 100
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_schedstats = 0
kernel.sched_shares_window_ns = 10000000
kernel.sched_time_avg_ms = 1000
kernel.sched_tunable_scaling = 1
kernel.sched_wakeup_granularity_ns = 4000000
kernel.seccomp.actions_avail = kill trap errno trace allow
kernel.seccomp.actions_logged = kill trap errno trace
kernel.sem = 250  32000 32  128
kernel.sem_next_id = -1
kernel.shm_next_id = -1
kernel.shm_rmid_forced = 0
kernel.shmall = 18446744073692774399
kernel.shmmax = 18446744073692774399
kernel.shmmni = 4096
kernel.softlockup_all_cpu_backtrace = 0
kernel.softlockup_panic = 0
kernel.spl.hostid = 0
kernel.spl.kmem.slab_kmem_alloc = 0
kernel.spl.kmem.slab_kmem_max = 0
kernel.spl.kmem.slab_kmem_total = 0
kernel.spl.kmem.slab_vmem_alloc = 305947392
kernel.spl.kmem.slab_vmem_max = 732324608
kernel.spl.kmem.slab_vmem_total = 347979264
kernel.spl.version = SPL v0.7.12-1
kernel.stack_tracer_enabled = 0
kernel.sysctl_writes_strict = 1
kernel.sysrq = 16
kernel.tainted = 12289
kernel.threads-max = 4126958
kernel.timer_migration = 1
kernel.traceoff_on_warning = 0
kernel.unknown_nmi_panic = 0
kernel.usermodehelper.bset = 4294967295 31
kernel.usermodehelper.inheritable = 4294967295  31
kernel.version = #1 SMP Fri Feb 1 14:54:57 UTC 2019
kernel.watchdog = 1
kernel.watchdog_cpumask = 0-191
kernel.watchdog_thresh = 10
kernel.yama.ptrace_scope = 0
sunrpc.max_resvport = 1023
sunrpc.min_resvport = 665
sunrpc.nfs_debug = 0x0000
sunrpc.nfsd_debug = 0x0000
sunrpc.nlm_debug = 0x0000
sunrpc.rpc_debug = 0x0000
sunrpc.tcp_fin_timeout = 15
sunrpc.tcp_max_slot_table_entries = 65536
sunrpc.tcp_slot_table_entries = 2
sunrpc.transports = tcp 1048576
sunrpc.transports = udp 32768
sunrpc.transports = tcp-bc 1048576
sunrpc.udp_slot_table_entries = 16
user.max_ipc_namespaces = 2063479
user.max_mnt_namespaces = 2063479
user.max_pid_namespaces = 2063479
user.max_user_namespaces = 0
user.max_uts_namespaces = 2063479
vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.drop_caches = 0
vm.extfrag_threshold = 500
vm.hugepages_treat_as_movable = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256 256 32
vm.max_map_count = 65530
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 90112
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.mmap_rnd_bits = 28
vm.mmap_rnd_compat_bits = 8
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.stat_interval = 1
vm.swappiness = 60
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0

Let me know if there is anything else I can include that would be helpful.

This is a freebie, since I think the scope of work here extends into paid ZFS consulting:

  • How are your enclosures cabled?
  • You have 12 external JBODs, but there is no indication that multipath is enabled.
  • Think about where the drives that drop out sit relative to the enclosures and the zpool.
  • I would always recommend a ring SAS cabling topology when working with this many enclosures.
  • If that is not in place, I would work on it.
  • Your pool should also be built from multipath /dev/mapper devices in this situation.
  • Can you show your /etc/modprobe.d/zfs.conf?
  • Are all the drives SAS?
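To look at where the dropped drives sit relative to the enclosures, the kernel log from the question already has what is needed: each "attempting task abort!" line names a SCSI target, and the "enclosure logical id(...), slot(...)" lines map targets to an enclosure and slot. The following is only a sketch under that assumption about the log format; aborts_by_enclosure is an illustrative name.

```python
import re
from collections import Counter

ABORT_RE = re.compile(r"sd (\d+:\d+:\d+):\d+: attempting task abort!")
ENCL_RE = re.compile(
    r"scsi target(\d+:\d+:\d+): enclosure logical id\((0x[0-9a-f]+)\), slot\((\d+)\)"
)


def aborts_by_enclosure(lines):
    """Count task-abort attempts per enclosure logical id."""
    location = {}   # "host:channel:target" -> (enclosure id, slot)
    aborted = []    # targets that logged a task abort attempt
    for line in lines:
        m = ENCL_RE.search(line)
        if m:
            location[m.group(1)] = (m.group(2), int(m.group(3)))
            continue
        m = ABORT_RE.search(line)
        if m:
            aborted.append(m.group(1))
    # Resolve targets at the end, since the abort line is logged before
    # the enclosure line for the same target.
    return Counter(location.get(t, ("unknown", -1))[0] for t in aborted)


# Lines copied from the journalctl excerpt in the question.
SAMPLE = """\
Apr 12 04:42:07 kernel: sd 5:0:18:0: attempting task abort! scmd(ffff8d36c295a4c0)
Apr 12 04:42:07 kernel: sd 5:0:4:0: attempting task abort! scmd(ffff8d3745b20540)
Apr 12 04:42:07 kernel: scsi target5:0:4: enclosure logical id(0x5204747299f56500), slot(4)
Apr 12 04:42:07 kernel: scsi target5:0:18: enclosure logical id(0x5204747299f5dd00), slot(0)
""".splitlines()

if __name__ == "__main__":
    for enclosure, n in aborts_by_enclosure(SAMPLE).most_common():
        print(enclosure, n)
```

In the short excerpt the two aborted targets sit in two different enclosures. Running the same tally over the whole log would show whether the aborts follow one enclosure or cable chain or are spread across everything behind mpt3sas_cm1.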

Example of multipath SAS cabling: