Один из наших производственных кластеров, управляемый XCP, внезапно перестал отвечать. После перезагрузки и некоторого исследования мы обнаружили такие логи в системном журнале машины dom0:
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659040] irq 339: nobody cared (try booting with the "irqpoll" option)
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659058] Pid: 0, comm: swapper/3 Tainted: G C O 3.2.0-24-generic #37-Ubuntu
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659060] Call Trace:
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659062] <IRQ> [<ffffffff810db37d>] __report_bad_irq+0x3d/0xe0
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659071] [<ffffffff810db605>] note_interrupt+0x135/0x190
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659074] [<ffffffff810d8e69>] handle_irq_event_percpu+0xa9/0x220
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659078] [<ffffffff8130ff3b>] ? radix_tree_lookup+0xb/0x10
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659081] [<ffffffff810d9031>] handle_irq_event+0x51/0x80
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659084] [<ffffffff810dc187>] handle_edge_irq+0x87/0x140
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659089] [<ffffffff813a8829>] __xen_evtchn_do_upcall+0x199/0x250
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659092] [<ffffffff813aa96f>] xen_evtchn_do_upcall+0x2f/0x50
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659096] [<ffffffff81666d3e>] xen_do_hypervisor_callback+0x1e/0x30
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659097] <EOI> [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659104] [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659107] [<ffffffff8100a1d0>] ? xen_safe_halt+0x10/0x20
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659110] [<fff
IRQ 339 в cat / proc / interrupts:
339: ... xen-pirq-msi-x eth0
где eth0 - аппаратная сетевая карта.
Пока хост-машина, кажется, зависает, гостевые машины продолжают работать, поэтому наш крошечный внутренний мониторинг на одном из виртуальных хостов зарегистрировал что-то вроде этого:
[2012-10-26 20:31:51] [OK......] 200 OK : 113159149 ns
[2012-10-26 20:32:40] [DISASTER] 500 Can't connect to [hostname]:80 (No route to host) : 47763284432 ns
...
[2012-10-26 20:34:40] [DISASTER] 500 Can't connect to [hostname]:80 (No route to host) : 46894835070 ns
[2012-10-26 20:34:57] [DISASTER] 500 Can't connect to [hostname]:80 (Bad hostname) : 16821741955 ns
...
[2012-10-26 20:38:18] [DISASTER] 500 Can't connect to [hostname]:80 (Bad hostname) : 20103298289 ns
[2012-10-26 20:38:37] [DISASTER] 500 Can't connect to [hostname]:80 (Bad hostname) : 17895754943 ns
Хост и гостевая ОС: Ubuntu 12.04 LTS,
05:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
Subsystem: ASUSTeK Computer Inc. Device 8369
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 17
Region 0: Memory at fe500000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at e000 [size=32]
Region 3: Memory at fe520000 (32-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: e1000e
Kernel modules: e1000e
Любые подсказки, как это отладить?