У меня есть 2 сервера Asterisk на CentOS в группе HA с миротворцем. Кластер находится в активном пассивном режиме с настроенными группами ресурсов. Проблема в том, что какое-то время сервер переключается с первичного на вторичный без очевидной для меня причины. Журнал начинается с :
PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 327: pingd=0
с последующей остановкой групп ресурсов и
Oct 3 19:30:39 PBX1 Filesystem(DrbdFS)[14168]: INFO: Trying to unmount /drbd
Oct 3 19:30:39 PBX1 Filesystem(DrbdFS)[14168]: INFO: unmounted /drbd successfully
Oct 3 19:30:39 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation DrbdFS_stop_0 (call=2046, rc=0, cib-update=2735, confirmed=true) ok
Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 32: demote drbd_drbd0_demote_0 on pbx1(local)
Oct 3 19:30:39 PBX1 kernel: block drbd0: role( Primary -> Secondary )
Oct 3 19:30:39 PBX1 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Oct 3 19:30:39 PBX1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Oct 3 19:30:39 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_demote_0 (call=2050, rc=0, cib-update=2736, confirmed=true) ok
Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 159: notify drbd_drbd0_post_notify_demote_0 on pbx2
Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 161: notify drbd_drbd0_post_notify_demote_0 on pbx1 (local)
Oct 3 19:30:39 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_notify_0 (call=2053, rc=0, cib-update=0, confirmed=true) ok
Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 156: notify drbd_drbd0_pre_notify_stop_0 on pbx2
Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 157: notify drbd_drbd0_pre_notify_stop_0 on pbx1 (local)
Oct 3 19:30:39 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_notify_0 (call=2056, rc=0, cib-update=0, confirmed=true) ok
Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 31: stop drbd_drbd0_stop_0 on pbx2
Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 33: stop drbd_drbd0_stop_0 on pbx1 (local)
Oct 3 19:30:39 PBX1 kernel: block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
Oct 3 19:30:39 PBX1 kernel: block drbd0: asender terminated
Oct 3 19:30:39 PBX1 kernel: block drbd0: Terminating drbd0_asender
Oct 3 19:30:39 PBX1 kernel: block drbd0: Connection closed
Oct 3 19:30:39 PBX1 kernel: block drbd0: conn( Disconnecting -> StandAlone )
Oct 3 19:30:39 PBX1 kernel: block drbd0: receiver terminated
Oct 3 19:30:39 PBX1 kernel: block drbd0: Terminating drbd0_receiver
Oct 3 19:30:39 PBX1 kernel: block drbd0: disk( UpToDate -> Failed )
Oct 3 19:30:39 PBX1 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Oct 3 19:30:39 PBX1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Oct 3 19:30:39 PBX1 kernel: block drbd0: disk( Failed -> Diskless )
Oct 3 19:30:39 PBX1 kernel: block drbd0: drbd_bm_resize called with capacity == 0
Oct 3 19:30:39 PBX1 kernel: block drbd0: worker terminated
Oct 3 19:30:39 PBX1 kernel: block drbd0: Terminating drbd0_worker
Oct 3 19:30:39 PBX1 attrd[2503]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_drbd0 (<null>)
Oct 3 19:30:39 PBX1 attrd[2503]: notice: attrd_perform_update: Sent delete 329: node=pbx1, attr=master-drbd_drbd0, id=<n/a>, set=(null), section=status
Oct 3 19:30:39 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_stop_0 (call=2059, rc=0, cib-update=2737, confirmed=true) ok
Oct 3 19:30:39 PBX1 crmd[2505]: notice: run_graph: Transition 2112 (Complete=63, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3294.bz2): Stopped
Oct 3 19:30:39 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 3 19:30:39 PBX1 pengine[2504]: notice: process_pe_message: Calculated Transition 2113: /var/lib/pacemaker/pengine/pe-input-3295.bz2
Oct 3 19:30:39 PBX1 crmd[2505]: notice: run_graph: Transition 2113 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3295.bz2): Complete
Oct 3 19:30:39 PBX1 crmd[2505]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 3 19:30:41 PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 333: pingd=1000
Oct 3 19:30:41 PBX1 crmd[2505]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 3 19:30:41 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 3 19:30:41 PBX1 pengine[2504]: notice: LogActions: Start drbd_drbd0:0#011(pbx2)
Oct 3 19:30:41 PBX1 pengine[2504]: notice: process_pe_message: Calculated Transition 2114: /var/lib/pacemaker/pengine/pe-input-3296.bz2
Oct 3 19:30:41 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 3 19:30:41 PBX1 pengine[2504]: notice: LogActions: Start drbd_drbd0:0#011(pbx1)
Oct 3 19:30:41 PBX1 pengine[2504]: notice: LogActions: Start drbd_drbd0:1#011(pbx2)
Oct 3 19:30:41 PBX1 pengine[2504]: notice: process_pe_message: Calculated Transition 2115: /var/lib/pacemaker/pengine/pe-input-3297.bz2
Oct 3 19:30:41 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 7: start drbd_drbd0_start_0 on pbx1 (local)
Oct 3 19:30:41 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 9: start drbd_drbd0:1_start_0 on pbx2
Oct 3 19:30:41 PBX1 kernel: block drbd0: Starting worker thread (from cqueue [2046])
Oct 3 19:30:41 PBX1 kernel: block drbd0: disk( Diskless -> Attaching )
Oct 3 19:30:41 PBX1 kernel: block drbd0: Found 4 transactions (192 active extents) in activity log.
Oct 3 19:30:41 PBX1 kernel: block drbd0: Method to ensure write ordering: flush
Oct 3 19:30:41 PBX1 kernel: block drbd0: drbd_bm_resize called with capacity == 28411864
Oct 3 19:30:41 PBX1 kernel: block drbd0: resync bitmap: bits=3551483 words=55492 pages=109
Oct 3 19:30:41 PBX1 kernel: block drbd0: size = 14 GB (14205932 KB)
Oct 3 19:30:41 PBX1 kernel: block drbd0: bitmap READ of 109 pages took 13 jiffies
Oct 3 19:30:41 PBX1 kernel: block drbd0: recounting of set bits took additional 1 jiffies
Oct 3 19:30:41 PBX1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Oct 3 19:30:41 PBX1 kernel: block drbd0: disk( Attaching -> UpToDate )
Oct 3 19:30:41 PBX1 kernel: block drbd0: attached to UUIDs 760BE94B0D6CDF8A:0000000000000000:819257FD75678FC9:819157FD75678FC9
Oct 3 19:30:41 PBX1 kernel: block drbd0: conn( StandAlone -> Unconnected )
Oct 3 19:30:41 PBX1 kernel: block drbd0: Starting receiver thread (from drbd0_worker [14408])
Oct 3 19:30:41 PBX1 kernel: block drbd0: receiver (re)started
Oct 3 19:30:41 PBX1 kernel: block drbd0: conn( Unconnected -> WFConnection )
Oct 3 19:30:41 PBX1 attrd[2503]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_drbd0 (1000)
Oct 3 19:30:41 PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 336: master-drbd_drbd0=1000
Oct 3 19:30:41 PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 338: master-drbd_drbd0=1000
Oct 3 19:30:41 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_start_0 (call=2062, rc=0, cib-update=2741, confirmed=true) ok
Oct 3 19:30:41 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 106: notify drbd_drbd0_post_notify_start_0 on pbx1 (local)
Oct 3 19:30:41 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 107: notify drbd_drbd0:1_post_notify_start_0 on pbx2
Oct 3 19:30:41 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_notify_0 (call=2065, rc=0, cib-update=0, confirmed=true) ok
Oct 3 19:30:41 PBX1 crmd[2505]: notice: run_graph: Transition 2115 (Complete=10, Pending=0, Fired=0, Skipped=2, Incompl
часть для ping из конфигурации ПК
Attributes: drbd_resource=drbd0
Operations: monitor interval=10s (drbd_drbd0-monitor-interval-10s)
Clone: Connectivity
Resource: p_ping (class=ocf provider=pacemaker type=ping)
Attributes: host_list="10.66.9.4 10.66.9.5 10.66.9.11 10.66.9.252 10.66.9.253 10.66.9.254" multiplier=1000 dampen=5s
Operations: monitor interval=1s (p_ping-monitor-interval-1s)
Важная информация заключается в том, что только 10.66.9.11 жив (я планирую удалить недопустимый адрес, но это другая история) с этих адресов из-за того, что редизайн произошел раньше, чем год назад, но проблема видна через месяц или два макс. Может быть так или скажите, если у вас есть другие идеи. P.S Кабели хорошие, и в них нет потери интерфейса icmp для 10.66.9.11 в соответствии с cacti witch, настроенным для мониторинга
Oct 3 19:18:31 PBX1 asterisk(asterisk)[1338]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:19:01 PBX1 asterisk(asterisk)[1960]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:19:14 PBX1 Xinetd(Xinetd)[2200]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct 3 19:19:32 PBX1 asterisk(asterisk)[2347]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:20:02 PBX1 asterisk(asterisk)[2994]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:20:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct 3 19:20:32 PBX1 asterisk(asterisk)[3369]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:21:02 PBX1 asterisk(asterisk)[3970]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:21:08 PBX1 crmd[2505]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Oct 3 19:21:08 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 3 19:21:08 PBX1 pengine[2504]: notice: process_pe_message: Calculated Transition 2110: /var/lib/pacemaker/pengine/pe-input-3292.bz2
Oct 3 19:21:08 PBX1 crmd[2505]: notice: run_graph: Transition 2110 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3292.bz2): Complete
Oct 3 19:21:08 PBX1 crmd[2505]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 3 19:21:14 PBX1 Xinetd(Xinetd)[4176]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct 3 19:21:32 PBX1 asterisk(asterisk)[4318]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:22:02 PBX1 asterisk(asterisk)[4989]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:22:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct 3 19:22:32 PBX1 asterisk(asterisk)[5352]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:23:02 PBX1 asterisk(asterisk)[5939]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:23:14 PBX1 Xinetd(Xinetd)[6160]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct 3 19:23:32 PBX1 asterisk(asterisk)[6302]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:24:02 PBX1 asterisk(asterisk)[6933]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:24:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct 3 19:24:32 PBX1 asterisk(asterisk)[7276]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:25:02 PBX1 asterisk(asterisk)[7895]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:25:14 PBX1 Xinetd(Xinetd)[8118]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct 3 19:25:32 PBX1 asterisk(asterisk)[8259]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns.isc.afilias-nst.info/A/IN': 2a01:8840:6::1#53
Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns.isc.afilias-nst.info/AAAA/IN': 2a01:8840:6::1#53
Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns1.isc.ultradns.net/A/IN': 2001:dcd:4::2#53
Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns2.isc.ultradns.net/A/IN': 2001:dcd:4::2#53
Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns1.isc.ultradns.net/AAAA/IN': 2001:dcd:4::2#53
Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns2.isc.ultradns.net/AAAA/IN': 2001:dcd:4::2#53
Oct 3 19:26:02 PBX1 asterisk(asterisk)[8873]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:26:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct 3 19:26:32 PBX1 asterisk(asterisk)[9233]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:27:02 PBX1 asterisk(asterisk)[9841]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:27:14 PBX1 Xinetd(Xinetd)[10088]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct 3 19:27:32 PBX1 asterisk(asterisk)[10230]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:28:02 PBX1 asterisk(asterisk)[10862]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:28:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct 3 19:28:32 PBX1 asterisk(asterisk)[11224]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:29:03 PBX1 asterisk(asterisk)[11810]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:29:14 PBX1 Xinetd(Xinetd)[12031]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct 3 19:29:33 PBX1 asterisk(asterisk)[12174]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:30:03 PBX1 asterisk(asterisk)[12818]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:30:23 PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 327: pingd=0
Oct 3 19:30:23 PBX1 crmd[2505]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 3 19:30:23 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop drbd_drbd0:0#011(pbx2..bg)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: process_pe_message: Calculated Transition 2111: /var/lib/pacemaker/pengine/pe-input-3293.bz2
Oct 3 19:30:23 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop drbd_drbd0:0#011(pbx2..bg)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Demote drbd_drbd0:1#011(Master -> Stopped pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop ClusPBXIP#011(pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop ClusNetIP#011(pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop ClusPBXIP-R#011(pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop ClusPBXIP-R2#011(pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop DSKPBXIP-R#011(pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop Fixr_ClusPBXIP#011(pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop DSKPBXIP-R5#011(pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop DSKPBXIP-R6#011(pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop DSKPBXIP-R7#011(pbx1)
В PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 327: pingd=0
вроде отдает это. Здесь оценка pingd была равна 0 (нет ping). а затем показывает, что останавливает все ресурсы (3 октября 19:30:23).
Позже мы видим, что показатель pingd достигает 1000, а затем службы снова перезагружаются: PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 333: pingd=1000
(3 октября 19:30:41).
У меня нет полного CIB Pacemaker, но я подозреваю, что у вас есть какое-то правило местоположения / колокации, согласно которому ресурсы не могут работать там, где оценка pingd равна 0. Таким образом, когда ваша оценка pingd достигает нуля, ресурсы остановите весь кластер, а затем перезапустите, когда ping снова вернусь. Хотя возможно, что ресурсы могут запускаться на одноранговом узле, я не считаю, что это проблема аварийного переключения, а скорее останавливаются и перезапускаются службы из-за ресурса и правил pingd.
Хотя вы упомянули, что у cacti нет проблем с пингами:
P.S Кабели хорошие, и в них нет потери интерфейса icmp для 10.66.9.11 в соответствии с cacti witch, настроенным для мониторинга
Как часто за этим следят кактусы? Кардиостимулятор среагирует, если пинги не вернутся в течение 5 секунд. Кроме того, данные cacti, вероятно, представляют собой только эхо-запросы с сервера cacti на сервер 10.66.9.11. Это не обязательно означает, что между кластером АТС и системой 10.66.9.11 нет проблемы.
Я бы предложил либо добавить больше узлов ping в host_list, либо увеличить интервал мониторинга p_ping и время демпфирования, либо просто полностью отказаться от ресурса и правил p_ping.