I have three MariaDB servers in a Galera cluster. I use one server as the "primary" master (i.e., Galera is there for failover only; the application does not actively use multiple masters).
About once every two weeks the primary master crashes. The other two servers in the cluster are fine; I can restart the crashed server and it recovers.
I have rotated which of the three servers acts as the "primary" master, and the crash happens regardless of which server I pick. So it seems unlikely to be hardware-related.
The question is: why does this happen? How can I track it down? Should I just submit this to MariaDB as a bug?
2015-04-09 02:02:38 7f788745a700 InnoDB: Assertion failure in thread 140155642291968 in file rem0rec.cc line 580
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
150409 2:02:38 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see http://kb.askmonty.org/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.0.16-MariaDB-1~trusty-wsrep-log
key_buffer_size=52428800
read_buffer_size=131072
max_used_connections=128
max_threads=402
thread_count=11
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 934441 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0x7f75176b3008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f7887459df0 thread_stack 0x30000
150409 2:02:44 [Warning] WSREP: last inactive check more than PT1.5S ago (PT5.98149S), skipping check
150409 2:02:44 [Note] WSREP: (c86d2afe-da1f-11e4-befa-264d853d1e46, 'tcp://0.0.0.0:4567') address 'tcp://192.168.178.10:4567' pointing to uuid c86d2afe-da1f-11e4-befa-264d853d1e46 is blacklisted, skipping
150409 2:02:44 [Note] WSREP: (c86d2afe-da1f-11e4-befa-264d853d1e46, 'tcp://0.0.0.0:4567') address 'tcp://192.168.178.10:4567' pointing to uuid c86d2afe-da1f-11e4-befa-264d853d1e46 is blacklisted, skipping
150409 2:02:44 [Note] WSREP: (c86d2afe-da1f-11e4-befa-264d853d1e46, 'tcp://0.0.0.0:4567') address 'tcp://192.168.178.10:4567' pointing to uuid c86d2afe-da1f-11e4-befa-264d853d1e46 is blacklisted, skipping
150409 2:02:44 [Note] WSREP: (c86d2afe-da1f-11e4-befa-264d853d1e46, 'tcp://0.0.0.0:4567') address 'tcp://192.168.178.10:4567' pointing to uuid c86d2afe-da1f-11e4-befa-264d853d1e46 is blacklisted, skipping
150409 2:02:44 [Note] WSREP: view(view_id(NON_PRIM,70802785-d454-11e4-9152-2b6d076ff37a,26) memb {
c86d2afe-da1f-11e4-befa-264d853d1e46,0
} joined {
} left {
} partitioned {
70802785-d454-11e4-9152-2b6d076ff37a,0
e18a3f1a-c314-11e4-a25a-c6a751e32d91,0
})
150409 2:02:44 [Note] WSREP: view(view_id(NON_PRIM,c86d2afe-da1f-11e4-befa-264d853d1e46,27) memb {
c86d2afe-da1f-11e4-befa-264d853d1e46,0
} joined {
} left {
} partitioned {
70802785-d454-11e4-9152-2b6d076ff37a,0
e18a3f1a-c314-11e4-a25a-c6a751e32d91,0
})
150409 2:02:44 [Note] WSREP: (c86d2afe-da1f-11e4-befa-264d853d1e46, 'tcp://0.0.0.0:4567') address 'tcp://192.168.178.10:4567' pointing to uuid c86d2afe-da1f-11e4-befa-264d853d1e46 is blacklisted, skipping
150409 2:02:44 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
150409 2:02:44 [Note] WSREP: Flow-control interval: [16, 16]
150409 2:02:44 [Note] WSREP: Received NON-PRIMARY.
150409 2:02:44 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 497086935)
150409 2:02:44 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
150409 2:02:44 [Note] WSREP: Flow-control interval: [16, 16]
150409 2:02:44 [Note] WSREP: Received NON-PRIMARY.
150409 2:02:44 [Note] WSREP: New cluster view: global state: ec05ddd0-c265-11e4-b715-e69a238eb511:497086935, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
150409 2:02:44 [Warning] WSREP: Send action {(nil), 250, TORDERED} returned -107 (Transport endpoint is not connected)
150409 2:02:44 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
150409 2:02:44 [Note] WSREP: New cluster view: global state: ec05ddd0-c265-11e4-b715-e69a238eb511:497086935, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
150409 2:02:44 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
150409 2:02:44 [Note] WSREP: (c86d2afe-da1f-11e4-befa-264d853d1e46, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.177.11:4567 tcp://192.168.179.12:4567
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x7f7898d74c7e]
/usr/sbin/mysqld(handle_fatal_signal+0x457)[0x7f78988ac8a7]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7f7897059340]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39)[0x7f78966b0cc9]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f78966b40d8]
/usr/sbin/mysqld(+0x8832eb)[0x7f7898b9f2eb]
/usr/sbin/mysqld(+0x8858ff)[0x7f7898ba18ff]
/usr/sbin/mysqld(+0x802c9e)[0x7f7898b1ec9e]
/usr/sbin/mysqld(+0x892af5)[0x7f7898baeaf5]
/usr/sbin/mysqld(+0x895133)[0x7f7898bb1133]
/usr/sbin/mysqld(+0x8bece8)[0x7f7898bdace8]
/usr/sbin/mysqld(+0x8c3361)[0x7f7898bdf361]
/usr/sbin/mysqld(+0x8c3c27)[0x7f7898bdfc27]
/usr/sbin/mysqld(+0x8a4689)[0x7f7898bc0689]
/usr/sbin/mysqld(+0x804fb7)[0x7f7898b20fb7]
/usr/sbin/mysqld(_ZN7handler13ha_delete_rowEPKh+0x3f7)[0x7f78988b7b27]
/usr/sbin/mysqld(_Z12mysql_deleteP3THDP10TABLE_LISTP4ItemP10SQL_I_ListI8st_orderEyyP13select_result+0xf3e)[0x7f78989f047e]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x23cb)[0x7f7898723fcb]
/usr/sbin/mysqld(+0x40f7b7)[0x7f789872b7b7]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x1ebb)[0x7f789872dd1b]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x20f)[0x7f789872e9bf]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x1fb)[0x7f78987fcbcb]
/usr/sbin/mysqld(handle_one_connection+0x40)[0x7f78987fcdb0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7f7897051182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f789677447d]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f750940f020): is an invalid pointer
Connection ID (thread ID): 25689442
Status: NOT_KILLED
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
150409 02:02:46 mysqld_safe Number of processes running now: 0
150409 02:02:46 mysqld_safe WSREP: not restarting wsrep node automatically
150409 02:02:46 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
Yes. Always submit the stack trace to MariaDB as a bug.
I don't see anything similar. I would definitely upgrade to the latest stable 10.0 release first.
Try running with log-slave-updates and binary logging enabled. That should help identify the SQL statement that caused the crash.
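For the bug report, it can help to condense the crash log into the few facts a maintainer looks for first: the assertion site, the server version, and the named stack frames. The following is a minimal Python sketch, not an official tool. The function name summarize_crash and the regular expressions are my own assumptions, based only on the log format shown in the question.

```python
import re

# Patterns are assumptions derived from the log format in the question.
ASSERT_RE = re.compile(r"Assertion failure in thread \d+ in file (\S+) line (\d+)")
VERSION_RE = re.compile(r"^Server version: (\S+)", re.MULTILINE)
# Matches backtrace lines such as
#   /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x7f7898d74c7e]
#   /usr/sbin/mysqld(+0x8832eb)[0x7f7898b9f2eb]   (no symbol name)
FRAME_RE = re.compile(r"^(\S+)\(([^+)]*)\+0x[0-9a-f]+\)\[0x[0-9a-f]+\]", re.MULTILINE)


def summarize_crash(log_text):
    """Hypothetical helper: extract key crash facts from a mysqld error log."""
    assertion = ASSERT_RE.search(log_text)
    version = VERSION_RE.search(log_text)
    frames = FRAME_RE.findall(log_text)  # list of (binary, symbol) pairs
    return {
        # (source file, line number) of the failed InnoDB assertion, if any
        "assertion": (assertion.group(1), int(assertion.group(2))) if assertion else None,
        "version": version.group(1) if version else None,
        # Frames that carry a (possibly C++-mangled) symbol name
        "symbols": [symbol for _binary, symbol in frames if symbol],
        # Frames with only an offset; these need debug symbols to resolve
        "unresolved_frames": sum(1 for _binary, symbol in frames if not symbol),
    }
```

Calling summarize_crash() on the contents of the error log gives a dictionary you can paste into the report. Mangled names can be passed through c++filt; in the log above, _ZN7handler13ha_delete_rowEPKh demangles to handler::ha_delete_row, and together with the mysql_delete frame it shows the crash happened while executing a DELETE statement.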