We have a server (an ESXi virtual machine) that freezes from time to time with "kernel panic: out of memory and no killable processes ..."
The host has 12 GB of memory.
Virtual machine configuration:
SuSE 11.3 (64-bit), kernel 2.6.34-12
Firebird, PostgreSQL, DB2
The machine is not heavily used. It crashes once a day or once every two days; sometimes it takes a week.
How can I find out what is causing the server to crash?
Excerpt from vmware.log:
Apr 03 07:21:22.266: vcpu-0| Vix: [17514025 vmxCommands.c:7612]: VMAutomation_HandleCLIHLTEvent. Do nothing.
Apr 03 07:21:22.266: vcpu-0| Msg_Hint: msg.monitorevent.halt (sent)
Apr 03 07:21:22.266: vcpu-0| The CPU has been disabled by the guest operating system. You will need to power off or reset the virtual machine at this point.
Apr 03 07:21:22.266: vcpu-0| ---------------------------------------
Apr 03 07:21:37.167: vmx| GuestRpcSendTimedOut: message to toolbox timed out.
Apr 03 07:21:37.167: vmx| GuestRpc: app toolbox's second ping timeout; assuming app is down
Apr 03 22:30:06.017: mks| MKS: Base polling period is 10000us
Excerpt from /var/log/messages, where it all (probably) starts. I am going to remove /opt/eduserver/bin/php
from cron, and we will see whether the crash happens again.
Apr 9 22:15:02 testing /usr/sbin/cron[4312]: (root) CMD (/opt/eduserver/bin/php /srv/www/htdocs/imacs/radek/trunk/lib/views/edu_scheduler/controllers/action_scheduler.php >/var/lib/edumate/imacs/radek/trunk/scheduler )
Apr 9 22:15:20 testing kernel: [115148.493482] oom_kill_process: 3 callbacks suppressed
Apr 9 22:15:20 testing kernel: [115148.493485] php invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Apr 9 22:15:20 testing kernel: [115148.493488] Pid: 4317, comm: php Not tainted 2.6.34-12-desktop #1
Apr 9 22:15:20 testing kernel: [115148.493490] Call Trace:
Apr 9 22:15:20 testing kernel: [115148.493511] [<ffffffff81005ca9>] dump_trace+0x79/0x340
Apr 9 22:15:20 testing kernel: [115148.493516] [<ffffffff8149e612>] dump_stack+0x69/0x6f
Apr 9 22:15:20 testing kernel: [115148.493522] [<ffffffff810dbae0>] dump_header.clone.1+0x70/0x1a0
Apr 9 22:15:20 testing kernel: [115148.493525] [<ffffffff810dbc8e>] oom_kill_process.clone.0+0x7e/0x150
Apr 9 22:15:20 testing kernel: [115148.493529] [<ffffffff810dc0cb>] __out_of_memory+0x10b/0x180
Apr 9 22:15:20 testing kernel: [115148.493533] [<ffffffff810dc3c8>] out_of_memory+0x88/0x190
Apr 9 22:15:20 testing kernel: [115148.493536] [<ffffffff810e073a>] __alloc_pages_nodemask+0x69a/0x6b0
Apr 9 22:15:20 testing kernel: [115148.493541] [<ffffffff810e35a4>] __do_page_cache_readahead+0x114/0x290
Apr 9 22:15:20 testing kernel: [115148.493545] [<ffffffff810e389c>] ra_submit+0x1c/0x30
Apr 9 22:15:20 testing kernel: [115148.493548] [<ffffffff810d9e9f>] filemap_fault+0x3cf/0x410
Apr 9 22:15:20 testing kernel: [115148.493553] [<ffffffff810f4fc2>] __do_fault+0x52/0x520
Apr 9 22:15:20 testing kernel: [115148.493557] [<ffffffff810f9933>] handle_mm_fault+0x1a3/0x450
Apr 9 22:15:20 testing kernel: [115148.493561] [<ffffffff814a4b34>] do_page_fault+0x194/0x450
Apr 9 22:15:20 testing kernel: [115148.493565] [<ffffffff814a1fcf>] page_fault+0x1f/0x30
Apr 9 22:15:20 testing kernel: [115148.493587] [<00007f52b7d4cce5>] 0x7f52b7d4cce5
Apr 9 22:15:20 testing kernel: [115148.493588] Mem-Info:
Apr 9 22:15:20 testing kernel: [115148.493590] Node 0 DMA per-cpu:
Apr 9 22:15:20 testing kernel: [115148.493592] CPU 0: hi: 0, btch: 1 usd: 0
Apr 9 22:15:20 testing kernel: [115148.493593] CPU 1: hi: 0, btch: 1 usd: 0
Apr 9 22:15:20 testing kernel: [115148.493595] Node 0 DMA32 per-cpu:
Apr 9 22:15:20 testing kernel: [115148.493597] CPU 0: hi: 186, btch: 31 usd: 155
Apr 9 22:15:20 testing kernel: [115148.493598] CPU 1: hi: 186, btch: 31 usd: 161
Apr 9 22:15:20 testing kernel: [115148.493600] Node 0 Normal per-cpu:
Apr 9 22:15:20 testing kernel: [115148.493601] CPU 0: hi: 186, btch: 31 usd: 173
Apr 9 22:15:20 testing kernel: [115148.493603] CPU 1: hi: 186, btch: 31 usd: 57
Apr 9 22:15:20 testing kernel: [115148.493607] active_anon:1465647 inactive_anon:288016 isolated_anon:0
Apr 9 22:15:20 testing kernel: [115148.493607] active_file:129 inactive_file:784 isolated_file:0
Apr 9 22:15:20 testing kernel: [115148.493608] unevictable:0 dirty:0 writeback:0 unstable:0
Apr 9 22:15:20 testing kernel: [115148.493609] free:11853 slab_reclaimable:4721 slab_unreclaimable:64985
Apr 9 22:15:20 testing kernel: [115148.493609] mapped:14998 shmem:15500 pagetables:161144 bounce:0
Apr 9 22:15:20 testing kernel: [115148.493611] Node 0 DMA free:15812kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15708kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 9 22:15:20 testing kernel: [115148.493618] lowmem_reserve[]: 0 3000 8050 8050
Apr 9 22:15:20 testing kernel: [115148.493621] Node 0 DMA32 free:24432kB min:4272kB low:5340kB high:6408kB active_anon:2097640kB inactive_anon:524448kB active_file:52kB inactive_file:64kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072160kB mlocked:0kB dirty:0kB writeback:0kB mapped:448kB shmem:360kB slab_reclaimable:1988kB slab_unreclaimable:97472kB kernel_stack:17712kB pagetables:239608kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:144 all_unreclaimable? no
Apr 9 22:15:20 testing kernel: [115148.493629] lowmem_reserve[]: 0 0 5050 5050
Apr 9 22:15:20 testing kernel: [115148.493631] Node 0 Normal free:7168kB min:7192kB low:8988kB high:10788kB active_anon:3764948kB inactive_anon:627616kB active_file:464kB inactive_file:3072kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:5171200kB mlocked:0kB dirty:0kB writeback:0kB mapped:59544kB shmem:61640kB slab_reclaimable:16896kB slab_unreclaimable:162468kB kernel_stack:28984kB pagetables:404968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1440 all_unreclaimable? yes
Apr 9 22:15:20 testing kernel: [115148.493639] lowmem_reserve[]: 0 0 0 0
Apr 9 22:15:20 testing kernel: [115148.493641] Node 0 DMA: 3*4kB 1*8kB 1*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15812kB
Apr 9 22:15:20 testing kernel: [115148.493648] Node 0 DMA32: 272*4kB 140*8kB 31*16kB 127*32kB 84*64kB 42*128kB 11*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 24432kB
Apr 9 22:15:20 testing kernel: [115148.493655] Node 0 Normal: 840*4kB 26*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7168kB
Apr 9 22:15:20 testing kernel: [115148.493662] 19767 total pagecache pages
Apr 9 22:15:20 testing kernel: [115148.493663] 3345 pages in swap cache
Apr 9 22:15:20 testing kernel: [115148.493664] Swap cache stats: add 531666, delete 528321, find 103411/104065
Apr 9 22:15:20 testing kernel: [115148.493666] Free swap = 0kB
Apr 9 22:15:20 testing kernel: [115148.493667] Total swap = 2103292kB
Apr 9 22:15:20 testing kernel: [115148.514162] 2097136 pages RAM
Apr 9 22:15:20 testing kernel: [115148.514164] 48271 pages reserved
Apr 9 22:15:20 testing kernel: [115148.514165] 106772 pages shared
Apr 9 22:15:20 testing kernel: [115148.514166] 2006923 pages non-shared
Apr 9 22:15:20 testing kernel: [115148.514169] Out of memory: kill process 3016 (cron) score 308233 or a child
Apr 9 22:15:20 testing kernel: [115148.514171] Killed process 15546 (cron) vsz:50064kB, anon-rss:272kB, file-rss:32kB
Apr 9 22:16:01 testing /usr/sbin/cron[4347]: (root) CMD (/usr/bin/ruby /root/radek/scripts/freemem.rb)
Apr 9 22:17:07 testing kernel: [115255.428734] vmtoolsd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Apr 9 22:17:07 testing kernel: [115255.428738] Pid: 2772, comm: vmtoolsd Not tainted 2.6.34-12-desktop #1
Apr 9 22:17:08 testing kernel: [115255.428740] Call Trace:
Apr 9 22:17:08 testing kernel: [115255.428751] [<ffffffff81005ca9>] dump_trace+0x79/0x340
Apr 9 22:17:08 testing kernel: [115255.428756] [<ffffffff8149e612>] dump_stack+0x69/0x6f
Apr 9 22:17:08 testing kernel: [115255.428762] [<ffffffff810dbae0>] dump_header.clone.1+0x70/0x1a0
Apr 9 22:17:08 testing kernel: [115255.428765] [<ffffffff810dbc8e>] oom_kill_process.clone.0+0x7e/0x150
Apr 9 22:17:08 testing kernel: [115255.428769] [<ffffffff810dc0cb>] __out_of_memory+0x10b/0x180
Apr 9 22:17:08 testing kernel: [115255.428773] [<ffffffff810dc3c8>] out_of_memory+0x88/0x190
Apr 9 22:17:08 testing kernel: [115255.428777] [<ffffffff810e073a>] __alloc_pages_nodemask+0x69a/0x6b0
Apr 9 22:17:08 testing kernel: [115255.428781] [<ffffffff810e35a4>] __do_page_cache_readahead+0x114/0x290
Apr 9 22:17:08 testing kernel: [115255.428785] [<ffffffff810e389c>] ra_submit+0x1c/0x30
Apr 9 22:17:08 testing kernel: [115255.428788] [<ffffffff810d9e9f>] filemap_fault+0x3cf/0x410
Apr 9 22:17:08 testing kernel: [115255.428793] [<ffffffff810f4fc2>] __do_fault+0x52/0x520
Apr 9 22:17:08 testing kernel: [115255.428802] [<ffffffff810f9933>] handle_mm_fault+0x1a3/0x450
Apr 9 22:17:08 testing kernel: [115255.428824] [<ffffffff814a4b34>] do_page_fault+0x194/0x450
Apr 9 22:17:08 testing kernel: [115255.428828] [<ffffffff814a1fcf>] page_fault+0x1f/0x30
Apr 9 22:17:08 testing kernel: [115255.428841] [<00007f09951973c0>] 0x7f09951973c0
Apr 9 22:17:08 testing kernel: [115255.428842] Mem-Info:
Apr 9 22:17:08 testing kernel: [115255.428844] Node 0 DMA per-cpu:
Apr 9 22:17:08 testing kernel: [115255.428846] CPU 0: hi: 0, btch: 1 usd: 0
Apr 9 22:17:08 testing kernel: [115255.428847] CPU 1: hi: 0, btch: 1 usd: 0
Apr 9 22:17:08 testing kernel: [115255.428848] Node 0 DMA32 per-cpu:
Apr 9 22:17:08 testing kernel: [115255.428850] CPU 0: hi: 186, btch: 31 usd: 44
Apr 9 22:17:08 testing kernel: [115255.428852] CPU 1: hi: 186, btch: 31 usd: 174
Apr 9 22:17:08 testing kernel: [115255.428853] Node 0 Normal per-cpu:
Apr 9 22:17:08 testing kernel: [115255.428855] CPU 0: hi: 186, btch: 31 usd: 146
Apr 9 22:17:08 testing kernel: [115255.428856] CPU 1: hi: 186, btch: 31 usd: 171
Apr 9 22:17:08 testing kernel: [115255.428860] active_anon:1464570 inactive_anon:287629 isolated_anon:0
Apr 9 22:17:08 testing kernel: [115255.428861] active_file:66 inactive_file:2047 isolated_file:64
Apr 9 22:17:08 testing kernel: [115255.428862] unevictable:0 dirty:0 writeback:0 unstable:0
Apr 9 22:17:08 testing kernel: [115255.428862] free:11882 slab_reclaimable:4727 slab_unreclaimable:64987
Apr 9 22:17:08 testing kernel: [115255.428863] mapped:15715 shmem:15500 pagetables:161192 bounce:0
Apr 9 22:17:08 testing kernel: [115255.428865] Node 0 DMA free:15812kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15708kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 9 22:17:08 testing kernel: [115255.428872] lowmem_reserve[]: 0 3000 8050 8050
Apr 9 22:17:08 testing kernel: [115255.428875] Node 0 DMA32 free:24448kB min:4272kB low:5340kB high:6408kB active_anon:2091648kB inactive_anon:522644kB active_file:176kB inactive_file:7944kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072160kB mlocked:0kB dirty:0kB writeback:0kB mapped:3496kB shmem:360kB slab_reclaimable:2004kB slab_unreclaimable:97488kB kernel_stack:17712kB pagetables:239656kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:210 all_unreclaimable? yes
Apr 9 22:17:08 testing kernel: [115255.428882] lowmem_reserve[]: 0 0 5050 5050
Apr 9 22:17:08 testing kernel: [115255.428885] Node 0 Normal free:7268kB min:7192kB low:8988kB high:10788kB active_anon:3766632kB inactive_anon:627872kB active_file:88kB inactive_file:244kB unevictable:0kB isolated(anon):0kB isolated(file):256kB present:5171200kB mlocked:0kB dirty:0kB writeback:0kB mapped:59364kB shmem:61640kB slab_reclaimable:16904kB slab_unreclaimable:162460kB kernel_stack:29000kB pagetables:405112kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:129 all_unreclaimable? yes
Apr 9 22:17:08 testing kernel: [115255.428893] lowmem_reserve[]: 0 0 0 0
Apr 9 22:17:08 testing kernel: [115255.428895] Node 0 DMA: 3*4kB 1*8kB 1*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15812kB
Apr 9 22:17:08 testing kernel: [115255.428902] Node 0 DMA32: 278*4kB 127*8kB 33*16kB 119*32kB 81*64kB 44*128kB 6*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 24448kB
Apr 9 22:17:08 testing kernel: [115255.428909] Node 0 Normal: 881*4kB 20*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7268kB
Apr 9 22:17:08 testing kernel: [115255.428915] 18755 total pagecache pages
Apr 9 22:17:08 testing kernel: [115255.428916] 1043 pages in swap cache
Apr 9 22:17:08 testing kernel: [115255.428918] Swap cache stats: add 531680, delete 530637, find 103628/104282
Apr 9 22:17:08 testing kernel: [115255.428919] Free swap = 0kB
Apr 9 22:17:08 testing kernel: [115255.428920] Total swap = 2103292kB
Apr 9 22:17:08 testing kernel: [115255.447686] 2097136 pages RAM
Apr 9 22:17:08 testing kernel: [115255.447688] 48271 pages reserved
Apr 9 22:17:08 testing kernel: [115255.447689] 64969 pages shared
Apr 9 22:17:08 testing kernel: [115255.447690] 2006202 pages non-shared
Apr 9 22:17:08 testing kernel: [115255.447693] Out of memory: kill process 3016 (cron) score 308364 or a child
Apr 9 22:17:08 testing kernel: [115255.447696] Killed process 15547 (cron) vsz:50064kB, anon-rss:316kB, file-rss:4kB
Apr 9 22:17:08 testing kernel: [115255.753860] db2sysc invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Apr 9 22:17:08 testing kernel: [115255.753864] Pid: 3346, comm: db2sysc Not tainted 2.6.34-12-desktop #1
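An excerpt like the one above can be summarized with a short script before reading it line by line. The following is only a sketch: the function name summarize_oom and both regular expressions are my own, written against the line format shown in this log, and they may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Patterns written against the syslog lines shown in the excerpt (assumption:
# other kernels may word these messages differently).
INVOKED = re.compile(r"kernel: \[\s*[\d.]+\] (\S+) invoked oom-killer")
KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\) vsz:(\d+)kB, anon-rss:(\d+)kB")


def summarize_oom(lines):
    """Return (who triggered the OOM killer, which processes were killed)."""
    invokers = Counter()
    killed = []
    for line in lines:
        match = INVOKED.search(line)
        if match:
            invokers[match.group(1)] += 1
            continue
        match = KILLED.search(line)
        if match:
            pid, name, anon_rss_kb = match.group(1), match.group(2), match.group(4)
            killed.append((int(pid), name, int(anon_rss_kb)))
    return invokers, killed


def summarize_file(path="/var/log/messages"):
    """Convenience wrapper: read a syslog file and summarize it."""
    with open(path, errors="replace") as fh:
        return summarize_oom(fh)
```

In the excerpt, the processes that get killed are cron children with an anon-rss of 272 kB and 316 kB, so each kill frees almost nothing. The process that triggers the OOM killer (php, vmtoolsd, db2sysc) is simply whichever one needed a page at that moment, not necessarily the one holding the memory.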
You need to find the culprit that is using too much memory. You can do that with a simple script that logs the output of ps
from time to time, or with a monitoring tool such as Munin.
Without closely watching what is happening, it is hard to tell what is eating your memory and swap to the point where nothing is left, although I would be inclined to suspect the databases first.
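The ps-logging idea above could look roughly like this. It is a minimal sketch, not a finished tool: the log path /var/log/ps-mem.log, the 60-second interval, the --run flag and all function names are assumptions of mine, and it relies on ps accepting -eo pid,rss,comm (as procps does) and on Python 3.7 or later.

```python
import subprocess
import sys
import time
from datetime import datetime


def top_memory_processes(ps_output, limit=10):
    """Parse `ps -eo pid,rss,comm` output; return the largest processes by RSS."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[1].isdigit():
            continue
        pid, rss_kb, comm = parts
        rows.append((int(rss_kb), int(pid), comm.strip()))
    rows.sort(reverse=True)  # biggest resident set first
    return rows[:limit]


def snapshot(limit=10):
    """Run ps once and return the top memory consumers."""
    result = subprocess.run(
        ["ps", "-eo", "pid,rss,comm"],
        capture_output=True, text=True, check=True,
    )
    return top_memory_processes(result.stdout, limit)


def log_forever(path="/var/log/ps-mem.log", interval_s=60):
    """Append a timestamped snapshot to `path` every `interval_s` seconds."""
    while True:
        stamp = datetime.now().isoformat(timespec="seconds")
        with open(path, "a") as fh:
            for rss_kb, pid, comm in snapshot():
                fh.write(f"{stamp} pid={pid} rss_kb={rss_kb} comm={comm}\n")
        time.sleep(interval_s)


# Hypothetical flag: only start the loop when asked to explicitly.
if __name__ == "__main__" and "--run" in sys.argv:
    log_forever()
```

After the next crash, the last few snapshots in the log show which processes were growing just before memory ran out. Writing to a file on every pass matters here because the machine halts and anything held only in memory is lost.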
How much memory is allocated to the SuSE instance? Given that you are running a lot of resource-hungry services on it (3 DBMS plus memcached), it will need a substantial amount of memory to run, on the order of 8 GB.
You will need to check both the memory reservation and the limit setting in ESXi for the SuSE instance. Keep in mind that the limit setting can cause the machine to shut down or even crash if it is set too low.
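On the question of how much memory the guest actually has, the OOM report in the question already contains the numbers. The sketch below converts the figures from the first report (22:15:20) to GiB. The 4 kB page size is an assumption on my part (it is the usual x86-64 default), and the helper names are made up for illustration.

```python
# Figures copied from the first OOM report in the question (22:15:20).
PAGE_KB = 4  # assumption: 4 kB pages, the usual x86-64 default


def pages_to_gib(pages, page_kb=PAGE_KB):
    """Convert a page count from the Mem-Info dump to GiB."""
    return pages * page_kb / (1024 * 1024)


def kb_to_gib(kb):
    """Convert a kB figure (as printed for swap) to GiB."""
    return kb / (1024 * 1024)


report = {
    "total RAM": pages_to_gib(2097136),                  # "2097136 pages RAM"
    "anonymous memory": pages_to_gib(1465647 + 288016),  # active_anon + inactive_anon
    "page tables": pages_to_gib(161144),                 # "pagetables:161144"
    "page cache": pages_to_gib(19767),                   # "19767 total pagecache pages"
    "total swap": kb_to_gib(2103292),                    # "Total swap = 2103292kB"
    "free swap": kb_to_gib(0),                           # "Free swap = 0kB"
}

for name, gib in report.items():
    print(f"{name:>17}: {gib:5.2f} GiB")
```

Read this way, the guest sees about 8 GiB of RAM, most of it is anonymous (process) memory, the page cache is almost empty and the roughly 2 GiB of swap is completely used. That points at processes inside the guest growing too large rather than at the file cache, which fits the advice to track per-process usage and to compare the guest's needs with the ESXi reservation and limit.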