Автоматическое переключение настраивается при настройке первичного / вторичного сервера HDR.
Кластер HDR будет подключен, и репликация включена. Если служба корректно остановлена на первичном сервере (NODE-A) или виртуальная машина прекращает работу, происходит переключение на резерв, и вторичный сервер (NODE-B) становится первичным.
Следующее работает нормально, пока подключен диспетчер подключений / ICM:
service informix stop
или
onmade -yuk
или просто выключите / завершите работу основного узла БД (NODE-A)
Однако, если в случае события аварийного восстановления, когда и основной сервер БД (NODE-A), и диспетчер подключений / ICM вышли из строя одновременно, тогда компьютер диспетчера подключений был восстановлен, он не переключается на резервный узел (УЗЕЛ-B)
На серверах Informix awwdst10a
и awwdst10b
У меня следующие параметры репликации
DRAUTO 3
DRINTERVAL 0
DRTIMEOUT 30
HA_ALIAS
HA_FOC_ORDER
DRLOSTFOUND /logs/informix/t10/
DRIDXAUTO 1
LOG_INDEX_BUILDS 1
SDS_ENABLE 0
SDS_TIMEOUT 20
SDS_PAGING
UPDATABLE_SECONDARY 0
FAILOVER_CALLBACK
TEMPTAB_NOLOG 1
В sqlhosts
файл для обоих серверов Informix:
cluster_1 group - - i=10
awwdst10a_t10 onsoctcp awwdst10a informix_t10 g=cluster_1
awwdst10b_t10 onsoctcp awwdst10b informix_t10 g=cluster_1
sla_cluster_1 onsoctcp localhost informix_t10
Репликация настроена и работает, на secondary
awwdst10b бегу:
onstat -g cluster
IBM Informix Dynamic Server Version 12.10.FC10 -- Read-Only (Sec) -- Up 02:42:00 -- 1556768 Kbytes
Primary Server:awwdst10a_t10
Index page logging status: Enabled
Index page logging was enabled at: 2018/06/06 09:47:35
Server ACKed Log Supports Status
(log, page) Updates
awwdst10b_t10 54494,1135 No ASYNC(HDR),Connected,On
на primary
awwdst10a запускаю:
$ onstat -g cluster
IBM Informix Dynamic Server Version 12.10.FC10 -- On-Line (Prim) -- Up 02:37:16 -- 1556768 Kbytes
Primary Server:awwdst10a_t10
Current Log Page:54494,1135
Index page logging status: Enabled
Index page logging was enabled at: 2018/06/06 09:47:35
Server ACKed Log Applied Log Supports Status
(log, page) (log, page) Updates
awwdst10b_t10 54494,1135 54494,1135 No ASYNC(HDR),Connected,On
Итак, репликация работает и серверы синхронизированы.
На третьей машине awguht10
Я настраиваю диспетчер подключений / ICM со следующими параметрами:
NAME whics-t10
DEBUG 1
LOG 1
LOGFILE /logs/informix/cmlog
CM_TIMEOUT 60
EVENT_TIMEOUT 60
SECONDARY_EVENT_TIMEOUT 60
SQLHOSTS local
CLUSTER cluster_1
{
INFORMIXSERVER cluster_1
SLA sla_cluster_1 DBSERVERS=primary
FOC ORDER=ENABLED \
PRIORITY=1
#CMALARMPROGRAM $INFORMIXDIR/etc/cmalarmprogram.sh
}
И sqlhosts
файл для диспетчера соединений:
cluster_1 group - - i=10
awwdst40a_t40 onsoctcp awwdst40a informix_t10 g=cluster_1
awwdst40b_t40 onsoctcp awwdst40b informix_t10 g=cluster_1
sla_cluster_whics onsoctcp localhost informix_t10
Теперь я насильно сбиваю awwdst10a
Я получаю следующее в онлайн-журнале сервера Informix awwdst10b
:
Mon Oct 22 22:46:30 2018
22:46:30 Insufficient free huge pages in /proc/meminfo for shared memory segment.
Requested: 56774656 bytes. Available: 0 bytes.
The default memory page size will be used.
22:46:30 Requested shared memory segment size rounded from 55443KB to 55444KB
22:46:30 Segment locked: addr=0x73877000, size=56774656
22:46:30 Successfully added a bufferpool of page size 2K.
22:46:30 Insufficient free huge pages in /proc/meminfo for shared memory segment.
Requested: 107974656 bytes. Available: 0 bytes.
The default memory page size will be used.
22:46:30 Requested shared memory segment size rounded from 105443KB to 105444KB
22:46:30 Segment locked: addr=0x76e9c000, size=107974656
22:46:30 Successfully added a bufferpool of page size 4K.
22:46:30 Insufficient free huge pages in /proc/meminfo for shared memory segment.
Requested: 210374656 bytes. Available: 0 bytes.
The default memory page size will be used.
22:46:30 Requested shared memory segment size rounded from 205443KB to 205444KB
22:46:30 Segment locked: addr=0x7d595000, size=210374656
22:46:30 Successfully added a bufferpool of page size 8K.
22:46:30 Insufficient free huge pages in /proc/meminfo for shared memory segment.
Requested: 415174656 bytes. Available: 0 bytes.
The default memory page size will be used.
22:46:30 Requested shared memory segment size rounded from 405443KB to 405444KB
22:46:30 Segment locked: addr=0x89e36000, size=415174656
22:46:30 Successfully added a bufferpool of page size 16K.
22:46:30 Warning: ONCONFIG dump directory (DUMPDIR) '/dump/informix/t10' has insecure permissions
22:46:30 Event alarms enabled. ALARMPROG = '/usr/informix/etc/no_log.sh'
22:46:30 Booting Language <c> from module <>
22:46:30 Loading Module <CNULL>
22:46:30 Booting Language <builtin> from module <>
22:46:30 Loading Module <BUILTINNULL>
22:46:36 Affinitied VP 1 to phys proc 1
22:46:36 DR: DRAUTO is 3 (CMSM)
22:46:36 DR: ENCRYPT_HDR is 0 (HDR encryption Disabled)
22:46:36 Event notification facility epoll enabled.
22:46:36 Trusted host cache successfully built:/etc/hosts.equiv.
22:46:36 CCFLAGS2 value set to 0x200
22:46:36 SQL_FEAT_CTRL value set to 0x8008
22:46:36 SQL_DEF_CTRL value set to 0x4b0
22:46:36 IBM Informix Dynamic Server Version 12.10.FC10 Software Serial Number AAA#B000000
22:46:37 Performance Advisory: The current size of the physical log buffer is smaller than recommended.
22:46:37 Results: Transaction performance might not be optimal.
22:46:37 Action: For better performance, increase the physical log buffer size to 128.
22:46:38 IBM Informix Dynamic Server Initialized -- Shared Memory Initialized.
22:46:38 Started 1 B-tree scanners.
22:46:38 B-tree scanner threshold set at 5000.
22:46:38 B-tree scanner range scan size set to -1.
22:46:38 B-tree scanner ALICE mode set to 6.
22:46:38 B-tree scanner index compression level set to med.
22:46:38 Physical Recovery Started at Page (3:164234).
22:46:38 Physical Recovery Complete: 0 Pages Examined, 0 Pages Restored.
22:46:38 DR: Trying to connect to primary server = awwdst10a_t10
22:48:39 DR: Cannot connect to primary server
22:48:39 DR: Turned off on secondary server
22:48:39 Dataskip is now OFF for all dbspaces
22:48:39 Restartable Restore has been ENABLED
22:48:39 Recovery Mode
23:23:50 Booting Language <spl> from module <>
23:23:50 Loading Module <SPLNULL>
23:24:12 CM:Connection manager whics-t10 registered with the server
Connection Manager
- отказ сообщения
Mon Oct 22 23:23:44 2018
23:23:44 IBM Informix Connection Manager
23:23:44 IBM Informix CSDK Version 4.10, IBM Informix-ESQL Version 4.10.FC10
23:23:44 Build Number: N046
23:23:44 Build Host: hans
23:23:44 Build OS: Linux 2.6.18-128.el5
23:23:44 Build Date: Mon Oct 23 07:06:25 CDT 2017
23:23:44 GLS Version: glslib-6.00.FC11
23:23:44
23:23:44 set CM_TIMEOUT 60
23:23:44 set global EVENT_TIMEOUT 60
23:23:44 set global SECONDARY_EVENT_TIMEOUT 60
23:23:44 DEBUG[TID139636001822496]:add type primary to SLA sla_cluster_1 [cmsm_sla.c:parse_sla:2412]
23:23:44 SLA sla_cluster_1 listener mode REDIRECT
23:23:44 Connection Manager name is whics-t10
23:23:44 Starting Connection Manager...
23:23:44 DEBUG[TID139636001822496]:Password File /opt/informix-12.10.fc10/etc/passwd_file failed error:No such file or directory[2] [cmsm_main.c:cmsm_pw_init:1782]
23:23:44 Warning: Password Manager failed; working in trusted node mode
23:23:44 the current maximum number of file descriptors is 32767
23:23:44 DEBUG[TID139636001822496]:new daemon pid is 3804 [cmsm_main.c:cmsm_daemonize:4653]
23:23:44 DEBUG[TID139636001822496]:dbservername = cluster_1 [cmsm_server.ec:cmsm_primary:2819]
23:23:44 DEBUG[TID139636001822496]:nettype = [cmsm_server.ec:cmsm_primary:2820]
23:23:44 DEBUG[TID139636001822496]:hostname = - [cmsm_server.ec:cmsm_primary:2821]
23:23:44 DEBUG[TID139636001822496]:servicename = - [cmsm_server.ec:cmsm_primary:2822]
23:23:44 DEBUG[TID139636001822496]:options = i=10 [cmsm_server.ec:cmsm_primary:2823]
23:23:44 DEBUG[TID139636001822496]:connect to cluster_1 for CLUSTER cluster_1 [cmsm_server.ec:cmsm_primary:2839]
23:23:44 DEBUG[TID139636001822496]:connect to group cluster_1 server awwdst10a_t10 for CLUSTER cluster_1 [cmsm_server.ec:cmsm_primary:2860]
23:23:44 DEBUG[TID139636001822496]:create new thread -1995749632 for awwdst10a_t10 [cmsm_sla.c:cmsm_update_server_ex:996]
23:23:44 DEBUG[TID139636001822496]:connect to group cluster_1 server awwdst10b_t10 for CLUSTER cluster_1 [cmsm_server.ec:cmsm_primary:2860]
23:23:44 DEBUG[TID139636001822496]:create new thread -1997850880 for awwdst10b_t10 [cmsm_sla.c:cmsm_update_server_ex:996]
23:23:44 listener sla_cluster_1 initializing
23:23:44 DEBUG[TID139635976775424]:listenter sla_cluster_1 host=localhost service=informix_t10 nettype=soctcp engtypeon [cmsm_main.c:cmsm_listener_thread:1166]
23:23:49 DEBUG[TID139635980977920]:CONNECT to @awwdst10a_t10|onsoctcp|awwdst10a|informix_t10 AS awwdst10a_t101 SQLCODE = (-908,107,awwdst10a_t10) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
23:23:49 ALARM 4001 unable to connect to Informix server awwdst10a_t10 from awguht10 error -908
23:23:49 DEBUG[TID139635980977920]:Arbitrator return, server awwdst10a_t10 is not primary [cmsm_arb.c:arb_server_down:1399]
23:23:50 DEBUG[TID139635978876672]:CONNECT to @awwdst10b_t10|onsoctcp|awwdst10b|informix_t10 AS awwdst10b_t101 SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
23:23:50 DEBUG[TID139635978876672]:database sysmaster @awwdst10b_t10|onsoctcp|awwdst10b|informix_t10 AS awwdst10b_t101 SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:145]
23:23:50 Listener sla_cluster_1 DBSERVERS=primary is active with 4 worker threads
23:23:50 DEBUG[TID139635978876672]:awwdst10b_t10 protocols = 1 [cmsm_server.ec:cmsm_add_dbaliases:2427]
23:23:50 DEBUG[TID139635978876672]:skip awwdst10b_t10 alias awwdst10b_t10_drda with protocol 1 [cmsm_server.ec:cmsm_add_dbaliases:2694]
23:23:50 DEBUG[TID139635978876672]:SQL fetch sysconfig_cursor sqlcode = (100,0,) [cmsm_server.ec:cmsm_add_dbaliases:2540]
23:23:52 DEBUG[TID139635980977920]:CONNECT to @awwdst10a_t10|onsoctcp|awwdst10a|informix_t10 AS awwdst10a_t102 SQLCODE = (-908,107,awwdst10a_t10) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
23:23:52 DEBUG[TID139635980977920]:Arbitrator return, server awwdst10a_t10 is not primary [cmsm_arb.c:arb_server_down:1399]
23:24:12 DEBUG[TID139635980977920]:Monitor awwdst10a_t10 exit by arbitrator [cmsm_er.ec:cmsm_er_monitor:2766]
23:24:12 DEBUG[TID139636001822496]:fcntl(/opt/informix-12.10.fc10/tmp/cmsm.pid.whics-t10) success error:Success[0] [cmsm_main.c:cmsm_pidfile:2193]
23:24:12 DEBUG[TID139635968210688]:Warning: no server returned for SLA sla_cluster_1 [cmsm_sla.c:select_server_from_sla:3037]
**23:24:12 Connection Manager successfully connected to awwdst10b_t10**
**23:24:12 DEBUG[TID139635966109440]:Warning: no server returned for SLA sla_cluster_1 [cmsm_sla.c:select_server_from_sla:3037]**
**23:24:12 DEBUG[TID139635966109440]:SLA sla_cluster_1 cannot find database server for SQLI client [cmsm_main.c:cmsm_process_sqli:1522]**
23:24:12 Connection Manager started successfully
23:24:12 DEBUG[TID139635980977920]:all monitors exited, server awwdst10a_t10 register count = 0 [cmsm_er.ec:cmsm_er_monitor:3133]
23:24:12 DEBUG[TID139635968210688]:SLA sla_cluster_1 cannot find database server for SQLI client [cmsm_main.c:cmsm_process_sqli:1522]
23:24:12 The server type of cluster cluster_1 server awwdst10a_t10 is Primary.
23:24:12 DEBUG[TID139635978876672]:create new thread -1995749632 for awwdst10a_t10 [cmsm_sla.c:cmsm_update_server_ex:996]
23:24:12 DEBUG[TID139635978876672]:cluster cluster_1 primary is not active [cmsm_arb.c:arb_set_primary:774]
23:24:12 DEBUG[TID139635978876672]:server awwdst10b_t10 version is 12.10 [cmsm_er.ec:cmsm_er_monitor:2971]
23:24:12 DEBUG[TID139635978876672]:ACTIVE MONITOR awwdst10b_t10 awwdst10b_t10 172.29.26.122 53089 [cmsm_er.ec:cmsm_er_monitor:3013]
23:24:12 DEBUG[TID139635978876672]:register CM successfully whics-t10 with token 1838996547 [cmsm_server.ec:cmsm_cm_register:2330]
23:24:12 DEBUG[TID139635978876672]:Register server awwdst10b_t10 for awwdst10b_t10 count = 1 [cmsm_er.ec:cmsm_er_event_process:1280]
23:24:12 DEBUG[TID139635978876672]:get awwdst10b_t10 CLUST_CHG event 1:4[DISCONNECT] awwdst10a_t10|Primary| [cmsm_er.ec:cmsm_er_event_process:1733]
23:24:17 DEBUG[TID139635980977920]:CONNECT to @awwdst10a_t10|onsoctcp|awwdst10a|informix_t10 AS awwdst10a_t103 SQLCODE = (-908,107,awwdst10a_t10) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
23:24:17 DEBUG[TID139635980977920]:Arbitrator return, server awwdst10a_t10 is not a DBSERVERALIAS of primary [cmsm_arb.c:arb_server_down:1417]
23:24:17 DEBUG[TID139635980977920]:Monitor awwdst10a_t10 exit by arbitrator [cmsm_er.ec:cmsm_er_monitor:2766]
23:24:17 DEBUG[TID139635980977920]:all monitors exited, server awwdst10a_t10 register count = 0 [cmsm_er.ec:cmsm_er_monitor:3133]
23:24:46 The server type of cluster cluster_1 server awwdst10a_t10 is Primary.
23:24:46 DEBUG[TID139635978876672]:create new thread -1995749632 for awwdst10a_t10 [cmsm_sla.c:cmsm_update_server_ex:996]
23:24:52 DEBUG[TID139635980977920]:CONNECT to @awwdst10a_t10|onsoctcp|awwdst10a|informix_t10 AS awwdst10a_t104 SQLCODE = (-908,107,awwdst10a_t10) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
23:24:52 DEBUG[TID139635980977920]:Arbitrator return, server awwdst10a_t10 is not a DBSERVERALIAS of primary [cmsm_arb.c:arb_server_down:1417]
23:24:52 DEBUG[TID139635980977920]:Monitor awwdst10a_t10 exit by arbitrator [cmsm_er.ec:cmsm_er_monitor:2766]
23:24:52 DEBUG[TID139635980977920]:all monitors exited, server awwdst10a_t10 register count = 0 [cmsm_er.ec:cmsm_er_monitor:3133]
23:25:02 DEBUG[TID139635896350464]:Warning: no server returned for SLA sla_cluster_1 [cmsm_sla.c:select_server_from_sla:3037]
23:25:02 DEBUG[TID139635896350464]:SLA sla_cluster_1 cannot find database server for SQLI client [cmsm_main.c:cmsm_process_sqli:1522]
23:25:20 The server type of cluster cluster_1 server awwdst10a_t10 is Primary.
23:25:23 DEBUG[TID139635980977920]:Arbitrator return, server awwdst10a_t10 is not a DBSERVERALIAS of primary [cmsm_arb.c:arb_server_down:1417]
23:25:23 DEBUG[TID139635980977920]:Monitor awwdst10a_t10 exit by arbitrator [cmsm_er.ec:cmsm_er_monitor:2766]
23:25:23 DEBUG[TID139635980977920]:all monitors exited, server awwdst10a_t10 register count = 0 [cmsm_er.ec:cmsm_er_monitor:3133]
Connection Manager
- до отказа
17:09:02 IBM Informix Connection Manager
17:09:02 IBM Informix CSDK Version 4.10, IBM Informix-ESQL Version 4.10.FC10
.....
17:09:02 set CM_TIMEOUT 60
17:09:02 set global EVENT_TIMEOUT 60
17:09:02 set global SECONDARY_EVENT_TIMEOUT 60
17:09:02 DEBUG[TID139992173627168]:add type primary to SLA sla_cluster_1 [cmsm_sla.c:parse_sla:2412]
17:09:02 SLA sla_cluster_1 listener mode REDIRECT
17:09:02 Connection Manager name is whics-t10
17:09:02 Starting Connection Manager...
17:09:02 DEBUG[TID139992173627168]:dbservername = cluster_1 [cmsm_server.ec:cmsm_primary:2819]
17:09:02 DEBUG[TID139992173627168]:nettype = [cmsm_server.ec:cmsm_primary:2820]
17:09:02 DEBUG[TID139992173627168]:hostname = - [cmsm_server.ec:cmsm_primary:2821]
17:09:02 DEBUG[TID139992173627168]:servicename = - [cmsm_server.ec:cmsm_primary:2822]
17:09:02 DEBUG[TID139992173627168]:options = i=10 [cmsm_server.ec:cmsm_primary:2823]
17:09:02 DEBUG[TID139992173627168]:connect to cluster_1 for CLUSTER cluster_1 [cmsm_server.ec:cmsm_primary:2839]
17:09:02 DEBUG[TID139992173627168]:connect to group cluster_1 server awwdst10b_t10 for CLUSTER cluster_1 [cmsm_server.ec:cmsm_primary:2860]
17:09:02 DEBUG[TID139992173627168]:create new thread 1986635520 for awwdst10b_t10 [cmsm_sla.c:cmsm_update_server_ex:996]
17:09:02 listener sla_cluster_1 initializing
17:09:02 DEBUG[TID139992152782592]:CONNECT to @awwdst10a_t10|onsoctcp|awwdst10a|informix_t10 AS awwdst10a_t101 SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
17:09:02 DEBUG[TID139992152782592]:database sysmaster @awwdst10a_t10|onsoctcp|awwdst10a|informix_t10 AS awwdst10a_t101 SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:145]
17:09:02 DEBUG[TID139992152782592]:awwdst10a_t10 protocols = 1 [cmsm_server.ec:cmsm_add_dbaliases:2427]
17:09:02 DEBUG[TID139992152782592]:skip awwdst10a_t10 alias awwdst10a_t10_drda with protocol 1 [cmsm_server.ec:cmsm_add_dbaliases:2694]
.......
setting primary name = awwdst10a_t10 [cmsm_arb.c:arb_set_primary:795]
17:09:02 Cluster cluster_1 Arbitrator FOC ORDER=ENABLED PRIORITY=1
17:09:02 Connection Manager successfully connected to awwdst10a_t10
17:09:02 The server type of cluster cluster_1 server awwdst10b_t10 is HDR.
17:09:02 DEBUG[TID139992150681344]:CONNECT to @awwdst10b_t10|onsoctcp|awwdst10b|informix_t10 AS awwdst10b_t101 SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
17:09:02 DEBUG[TID139992152782592]:server awwdst10a_t10 version is 12.10 [cmsm_er.ec:cmsm_er_monitor:2971]
17:09:02 DEBUG[TID139992152782592]:register CM successfully whics-t10 with token 216854841 [cmsm_server.ec:cmsm_cm_register:2330]
17:09:02 DEBUG[TID139992152782592]:Register server awwdst10a_t10 for awwdst10a_t10 count = 1 [cmsm_er.ec:cmsm_er_event_process:1280]
17:09:03 DEBUG[TID139992150681344]:database sysmaster @awwdst10b_t10|onsoctcp|awwdst10b|informix_t10 AS awwdst10b_t101 SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:145]
17:09:03 Listener sla_cluster_1 DBSERVERS=primary is active with 4 worker threads
17:09:03 DEBUG[TID139992150681344]:awwdst10b_t10 protocols = 1 [cmsm_server.ec:cmsm_add_dbaliases:2427]
17:09:03 DEBUG[TID139992150681344]:skip awwdst10b_t10 alias awwdst10b_t10_drda with protocol 1 [cmsm_server.ec:cmsm_add_dbaliases:2694]
17:09:14 Connection Manager successfully connected to awwdst10b_t10
17:09:14 DEBUG[TID139992173627168]:fcntl(/opt/informix-12.10.fc10/tmp/cmsm.pid.whics-t10) success error:Success[0] [cmsm_main.c:cmsm_pidfile:2193]
17:09:14 Connection Manager started successfully
17:09:14 SLA sla_cluster_1 redirect SQLI client from ::1 to awwdst10a_t10 172.29.26.129.51001
17:09:14 DEBUG[TID139992150681344]:server awwdst10b_t10 version is 12.10 [cmsm_er.ec:cmsm_er_monitor:2971]
..........
17:11:01 SLA sla_cluster_1 redirect SQLI client from ::1 to awwdst10a_t10 172.29.26.129.51001
awwdst10a
- до отказа
16:47:21 IBM Informix Dynamic Server Started.
16:47:21 Requested shared memory segment size rounded from 275020KB to 280576K
Mon Oct 22 16:47:23 2018
16:47:23 Requested shared memory segment size rounded from 405443KB to 405504KB
16:47:23 Shared memory segment will use huge pages.
16:47:23 Segment locked: addr=0x8a600000, size=415236096
16:47:23 Successfully added a bufferpool of page size 16K.
......
16:47:28 Affinitied VP 1 to phys proc 1
16:47:28 DR: DRAUTO is 3 (CMSM)
16:47:28 DR: ENCRYPT_HDR is 0 (HDR encryption Disabled)
16:47:28 Event notification facility epoll enabled.
16:47:28 Trusted host cache successfully built:/etc/hosts.equiv.
16:47:28 IBM Informix Dynamic Server Version 12.10.FC10 Software
16:47:29 Performance Advisory: The current size of the physical log buffer is smaller than recommended.
16:47:29 Results: Transaction performance might not be optimal.
16:47:29 Action: For better performance, increase the physical log buffer size to 128.
16:47:30 IBM Informix Dynamic Server Initialized -- Shared Memory Initialized.
16:47:30 Started 1 B-tree scanners.
16:47:30 B-tree scanner threshold set at 5000.
16:47:30 B-tree scanner range scan size set to -1.
16:47:30 B-tree scanner ALICE mode set to 6.
16:47:30 B-tree scanner index compression level set to med.
16:47:30 DR: Reservation of the last logical log for log backup turned on
16:47:30 DR: Trying to connect to secondary server = awwdst10b_t10
16:47:30 DR: Cannot connect to secondary server
16:47:30 DR: Turned off on primary server
16:47:30 Physical Recovery Started at Page (3:152094).
16:47:30 Physical Recovery Complete: 0 Pages Examined, 0 Pages Restored.
16:47:30 Logical Recovery Started.
16:47:30 10 recovery worker threads will be started.
16:47:30 Logical Recovery has reached the transaction cleanup phase.
16:47:30 Logical Recovery Complete.
0 Committed, 0 Rolled Back, 0 Open, 0 Bad Locks
16:47:31 Dataskip is now OFF for all dbspaces
16:47:32 Checkpoint Completed: duration was 0 seconds.
16:47:32 Mon Oct 22 - loguniq 54478, logpos 0xe30c0, timestamp: 0xd41004bc Interval: 399667
16:47:39 Logical Log 54478 Complete, timestamp: 0xd41102e7.
16:47:41 Action: Increase the size of the individual logical log files so
that it takes at least 30 seconds to fill each one. Look at the
online log to determine how quickly the log files are filling, and
then increase the size of the files proportionately.
16:47:41 Logical Log 54479 Complete, timestamp: 0xd411f36b.
16:47:44 DR: Primary server connected
16:47:44 DR: Secondary server needs failure recovery
16:47:46 DR_ERR set to -1
16:47:46 DR: Failure recovery error (14)
16:47:46 Logical Log 54480 Complete, timestamp: 0xd4131922.
16:47:47 DR: Turned off on primary server
16:47:47 DR: Cannot connect to secondary server
16:47:54 Logical Log 54483 Complete, timestamp: 0xd415bc16.
16:47:58 DR: Primary server connected
16:47:58 DR: Secondary server needs failure recovery
16:48:00 DR: Sending log 54478, size 2500 pages, 100.00 percent used
16:48:00 Logical Log 54484 Complete, timestamp: 0xd4172711.
.
16:48:05 Logical Log 54485 Complete, timestamp: 0xd4180cc0.
16:48:07 DR: Sending log 54482, size 2500 pages, 100.00 percent used
16:48:20 Checkpoint Completed: duration was 1 seconds.
17:09:02 CM:Connection manager whics-t10 registered with the server
17:09:18 Logical Log 54478 - Backup Started
17:09:22 Logical Log 54488 - Backup Completed
17:09:22 Logical Log 54489 - Backup Started
17:33:32 Checkpoint Completed: duration was 0 seconds.
17:33:32 Mon Oct 22 - loguniq 54494, logpos 0x3aa018, timestamp: 0xd41e0c9d Interval: 399671
17:33:32 Maximum server connections 2
17:33:32 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 24, Llog used 9
awwdst10b
- до отказа
16:47:46 IBM Informix Dynamic Server Started.
Mon Oct 22 16:47:47 2018
16:47:48 Requested shared memory segment size rounded from 405443KB to 405504KB
16:47:48 Shared memory segment will use huge pages.
16:47:48 Segment locked: addr=0x8a600000, size=415236096
16:47:48 Successfully added a bufferpool of page size 16K.
16:47:48 Warning: ONCONFIG dump directory (DUMPDIR) '/dump/informix/t10' has insecure permissions
16:47:48 Event alarms enabled. ALARMPROG = '/usr/informix/etc/no_log.sh'
16:47:48 Booting Language <c> from module <>
16:47:48 Loading Module <CNULL>
16:47:48 Booting Language <builtin> from module <>
16:47:48 Loading Module <BUILTINNULL>
16:47:53 Affinitied VP 1 to phys proc 1
16:47:53 DR: DRAUTO is 3 (CMSM)
16:47:53 DR: ENCRYPT_HDR is 0 (HDR encryption Disabled)
16:47:53 Event notification facility epoll enabled.
16:47:53 Trusted host cache successfully built:/etc/hosts.equiv.
16:47:53 CCFLAGS2 value set to 0x200
16:47:53 SQL_FEAT_CTRL value set to 0x8008
16:47:53 SQL_DEF_CTRL value set to 0x4b0
16:47:53 IBM Informix Dynamic Server Version 12.10.FC10 Software Serial Number AAA#B000000
16:47:54 Performance Advisory: The current size of the physical log buffer is smaller than recommended.
16:47:54 Results: Transaction performance might not be optimal.
16:47:54 Action: For better performance, increase the physical log buffer size to 128.
16:47:54 IBM Informix Dynamic Server Initialized -- Shared Memory Initialized.
16:47:54 Started 1 B-tree scanners.
16:47:54 B-tree scanner threshold set at 5000.
16:47:54 B-tree scanner range scan size set to -1.
16:47:54 B-tree scanner ALICE mode set to 6.
16:47:54 B-tree scanner index compression level set to med.
16:47:54 Physical Recovery Started at Page (3:151957).
16:47:54 Physical Recovery Complete: 0 Pages Examined, 0 Pages Restored.
16:47:54 DR: Trying to connect to primary server = awwdst10a_t10
16:47:54 smx creates 1 transports to server awwdst10a_t10
16:47:55 Dataskip is now OFF for all dbspaces
16:47:55 Restartable Restore has been ENABLED
16:47:55 Recovery Mode
16:47:57 DR: Secondary server connected
16:47:58 DR: Secondary server needs failure recovery
16:47:59 DR: Failure recovery from disk in progress ...
16:47:59 Logical Recovery Started.
16:47:59 10 recovery worker threads will be started.
16:47:59 Start Logical Recovery - Start Log 54478, End Log ?
16:47:59 Starting Log Position - 54478 0xdd018
16:48:00 Started processing open transactions on secondary during startup
Server in fast recovery until these open transactions finish
16:48:00 Finished processing open transactions on secondary during startup.
16:48:00 Checkpoint Completed: duration was 0 seconds.
16:48:00 Mon Oct 22 - loguniq 54478, logpos 0xdf0c0, timestamp: 0xd410117e Interval: 399666
16:48:00 Checkpoint Completed: duration was 0 seconds.
16:48:00 Mon Oct 22 - loguniq 54478, logpos 0xe30c0, timestamp: 0xd41011cf Interval: 399668
16:48:00 Maximum server connections 0
16:48:00 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 2, Llog used 0
16:48:03 Logical Log 54478 Complete, timestamp: 0xd410f549.
16:48:17 Logical Log 54486 Complete, timestamp: 0xd418e8e4.
16:48:19 B-tree scanners disabled.
16:48:20 DR: HDR secondary server operational
16:48:20 Checkpoint Completed: duration was 0 seconds.
16:48:20 Mon Oct 22 - loguniq 54487, logpos 0xd018, timestamp: 0xd418e973 Interval: 399669
16:48:20 Maximum server connections 0
16:48:20 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 7761, Llog used 0
16:48:51 Logical Log 54492 Complete, timestamp: 0xd41d1f30.
16:56:04 Logical Log 54493 Complete, timestamp: 0xd41dafde.
17:03:31 Checkpoint Completed: duration was 0 seconds.
17:03:31 Mon Oct 22 - loguniq 54494, logpos 0x35f598, timestamp: 0xd41dff42 Interval: 399670
17:09:03 Booting Language <spl> from module <>
17:09:03 Loading Module <SPLNULL>
17:09:14 CM:Connection manager whics-t10 registered with the server
17:18:32 Checkpoint Completed: duration was 0 seconds.
17:18:32 Mon Oct 22 - loguniq 54494, logpos 0x3a1018, timestamp:
17:18:32 Maximum server connections 1
17:18:32 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 27, Llog used 0
17:33:32 Checkpoint Completed: duration was 0 seconds.
17:33:32 Mon Oct 22 - loguniq 54494, logpos 0x3aa018, timestamp: 0xd41e0c7e Interval: 399672
Для аварийного переключения необходим хотя бы один диспетчер подключений, так как роль диспетчера подключений (помимо других) заключается в обнаружении сбоя узла и автоматическом переключении для обеспечения высокой доступности базы данных. Если у вас есть только один диспетчер подключений перед кластером базы данных, после его выключения автоматического переключения при отказе не произойдет, поэтому вам придется выполнить
onmode -d make primary force
на вторичном сервере, чтобы повысить его до первичного. Если у вас есть несколько правильно настроенных диспетчеров соединений, аварийное переключение гарантировано.
Первичный сервер назначает роль «активного арбитра» диспетчеру соединений (CM), поскольку менеджеров соединений может быть несколько. Если первичный сервер и CM «активный арбитр» будут убиты одновременно, то после перезапуска CM не сможет повысить HDR до первичного, поскольку CM не назначена роль активного арбитра. Informix не поддерживает автоматическое восстановление после многоточечного сбоя. CM необходимо запускать на оборудовании, отличном от оборудования основного сервера.
Сделал лабораторную настройку, и я опубликую некоторые настройки и журналы, чтобы вы могли сравнить.
У меня есть сервер Informix ifx_a
в качестве основного на machine1
и сервер Informix ifx_b
как вторичный HDR на machine2
.
На серверах Informix ifx_a
У меня следующие параметры репликации. HA_ALIAS
равно DBSERVERNAME
, чтобы было проще). Сервер Informix ifx_b
параметры такие же, кроме HA_ALIAS
и DBSERVERNAME
.
DRAUTO 3
DRINTERVAL 0
HDR_TXN_SCOPE ASYNC
DRTIMEOUT 30
HA_ALIAS ifx_a
HA_FOC_ORDER SDS,HDR,RSS
DRLOSTFOUND $INFORMIXDIR/etc/dr.lostfound
DRIDXAUTO 0
LOG_INDEX_BUILDS 1
SDS_ENABLE
SDS_TIMEOUT 20
SDS_TEMPDBS
SDS_PAGING
SDS_LOGCHECK 10
SDS_ALTERNATE NONE
SDS_FLOW_CONTROL 0
UPDATABLE_SECONDARY 0
FAILOVER_CALLBACK
FAILOVER_TX_TIMEOUT 0
TEMPTAB_NOLOG 1
DELAY_APPLY 0
STOP_APPLY 0
LOG_STAGING_DIR $INFORMIXDIR/tmp
RSS_FLOW_CONTROL 0
SMX_NUMPIPES 1
ENABLE_SNAPSHOT_COPY 0
SMX_COMPRESS 0
SMX_PING_INTERVAL 10
SMX_PING_RETRY 6
CLUSTER_TXN_SCOPE SERVER
В sqlhosts
файл для обоих серверов Informix:
ifx_a onsoctcp machine1.local 15010 k=1
ifx_b onsoctcp machine2.local 15020 k=1
Репликация настроена и работает, на вторичном ifx_b
Я бегу:
$ onstat -g cluster
IBM Informix Dynamic Server Version 12.10.FC12 -- Read-Only (Sec) -- Up 00:05:08 -- 156276 Kbytes
Primary Server:ifx_a
Index page logging status: Enabled
Index page logging was enabled at: 2018/07/31 23:44:14
Server ACKed Log Supports Status
(log, page) Updates
ifx_b 16,6 No ASYNC(HDR),Connected,On
$ onstat -g dri
IBM Informix Dynamic Server Version 12.10.FC12 -- Read-Only (Sec) -- Up 00:12:23 -- 156276 Kbytes
Data Replication at 0x45ac6028:
Type State Paired server Last DR CKPT (id/pg) Supports Proxy Writes
HDR Secondary on ifx_a 16 / 5 N
DRINTERVAL 0
DRTIMEOUT 30
DRAUTO 3
DRLOSTFOUND /opt/IBM/informix/12.10/etc/dr.lostfound
DRIDXAUTO 0
ENCRYPT_HDR 0
Backlog 0
Last Send 2018/08/01 00:12:12
Last Receive 2018/08/01 00:12:12
Last Ping 2018/08/01 00:12:03
Last log page applied(log id,page): 0,0
На первичном ifx_a
Я бегу:
$ onstat -g cluster
IBM Informix Dynamic Server Version 12.10.FC12 -- On-Line (Prim) -- Up 00:21:30 -- 156276 Kbytes
Primary Server:ifx_a
Current Log Page:16,6
Index page logging status: Enabled
Index page logging was enabled at: 2018/07/31 23:44:14
Server ACKed Log Applied Log Supports Status
(log, page) (log, page) Updates
ifx_b 16,6 16,6 No ASYNC(HDR),Connected,On
$ onstat -g dri
IBM Informix Dynamic Server Version 12.10.FC12 -- On-Line (Prim) -- Up 00:27:49 -- 156276 Kbytes
Data Replication at 0x45ac6028:
Type State Paired server Last DR CKPT (id/pg) Supports Proxy Writes
primary on ifx_b 16 / 5 NA
DRINTERVAL 0
DRTIMEOUT 30
DRAUTO 3
DRLOSTFOUND /opt/IBM/informix/12.10/etc/dr.lostfound
DRIDXAUTO 0
ENCRYPT_HDR 0
Backlog 0
Last Send 2018/08/01 00:11:53
Last Receive 2018/08/01 00:11:53
Last Ping 2018/08/01 00:11:32
Last log page applied(log id,page): 16,6
Итак, репликация работает, и серверы синхронизированы.
На третьей машине machine3
Я настраиваю диспетчер соединений со следующими параметрами:
NAME icm_1
LOG 1
LOGFILE ${INFORMIXDIR}/tmp/icm_1.log
DEBUG 1
CM_TIMEOUT 60
EVENT_TIMEOUT 60
SECONDARY_EVENT_TIMEOUT 60
SQLHOSTS LOCAL
CLUSTER ifx_abc
{
INFORMIXSERVER servers_abc
SLA sla_ifx_abc DBSERVERS=PRI
FOC ORDER=ENABLED \
PRIORITY=1
}
И sqlhosts
файл для диспетчера соединений:
servers_abc group - - i=10,c=1,e=ifx_b
ifx_a onsoctcp machine1.local 15010 k=1,g=servers_abc
ifx_b onsoctcp machine2.local 15020 k=1,g=servers_abc
sla_ifx_abc onsoctcp machine3.local 15030 k=1
Теперь я насильно сбиваю machine1
Я получаю следующее в онлайн-журнале сервера Informix ifx_b
:
08/02/18 01:00:26 Maximum server connections 1
08/02/18 01:00:26 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 38, Llog used 0
08/02/18 01:06:39 The SMX connection between high availability servers was closed because the
peer server was unresponsive for the timeout period (60 seconds times the
number of retries).
08/02/18 01:06:39 The SMX connection between high availability servers was closed because the
peer server was unresponsive for the timeout period (60 seconds times the
number of retries).
08/02/18 01:07:28 DR: ping timeout
08/02/18 01:07:28 DR: Receive error
08/02/18 01:07:28 dr_secrcv thread : asfcode = -25582: oserr = 4: errstr = : Network connection is broken.
System error = 4.
08/02/18 01:07:28 DR_ERR set to -1
08/02/18 01:07:28 DR: Receive Btree error
08/02/18 01:07:29 DR: Turned off on secondary server
08/02/18 01:07:36 SCHAPI: Issued Task() or Admin() command "task( 'ha make primary force', 'ifx_b' )".
08/02/18 01:07:37 Skipping failover callback.
08/02/18 01:07:48 Logical Recovery has reached the transaction cleanup phase.
08/02/18 01:07:48 Checkpoint Completed: duration was 0 seconds.
08/02/18 01:07:48 Thu Aug 2 - loguniq 19, logpos 0xcb4018, timestamp: 0xf287b Interval: 148
08/02/18 01:07:48 Maximum server connections 2
08/02/18 01:07:48 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 72, Llog used 1
08/02/18 01:07:48 Logical Recovery Complete.
9001 Committed, 0 Rolled Back, 0 Open, 0 Bad Locks
08/02/18 01:07:48 Logical Recovery Complete.
08/02/18 01:07:48 Performance Advisory: Based on the current workload, the physical log might be too small to
accommodate the time it takes to flush the buffer pool.
08/02/18 01:07:48 Results: The server might block transactions during checkpoints.
08/02/18 01:07:48 Action: If transactions are blocked during the checkpoint, increase the size of the
physical log to at least 716800 KB.
08/02/18 01:07:48 Performance Advisory: Based on the current workload, the logical log space might be too small to
accommodate the time it takes to flush the buffer pool.
08/02/18 01:07:48 Results: The server might block transactions during checkpoints.
08/02/18 01:07:48 Action: If transactions are blocked during the checkpoint, increase the size of the
logical log space to at least 89600 KB.
08/02/18 01:07:48 Performance Advisory: The physical log is too small for automatic checkpoints.
08/02/18 01:07:48 Results: Automatic checkpoints are disabled.
08/02/18 01:07:48 Action: To enable automatic checkpoints, increase the physical log to at least 716800 KB.
08/02/18 01:07:48 Quiescent Mode
08/02/18 01:07:48 Checkpoint Completed: duration was 0 seconds.
08/02/18 01:07:48 Thu Aug 2 - loguniq 19, logpos 0xcb6018, timestamp: 0xf288d Interval: 149
08/02/18 01:07:48 Maximum server connections 2
08/02/18 01:07:48 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 4, Llog used 2
08/02/18 01:07:48 B-tree scanners enabled.
08/02/18 01:07:48 DR: Reservation of the last logical log for log backup turned on
08/02/18 01:07:48 DR: new type = primary, secondary server name = ifx_a
08/02/18 01:07:48 DR: Trying to connect to secondary server = ifx_a
08/02/18 01:07:48 Starting BldNotification
08/02/18 01:07:49 On-Line Mode
08/02/18 01:07:50 SCHAPI: Started dbScheduler thread.
08/02/18 01:07:50 Auto Registration is synced
08/02/18 01:07:50 SCHAPI: Started 2 dbWorker threads.
08/02/18 01:07:51 DR: Cannot connect to secondary server
08/02/18 01:07:51 DR: Turned off on primary server
08/02/18 01:07:51 DDR Log staging: Using the directory /opt/IBM/informix/12.10/tmp/ifmxddrlog_2.
08/02/18 01:07:51 SCHAPI: dbutil threads is already running.
08/02/18 01:07:51 SCHAPI: dbScheduler threads is already running.
08/02/18 01:07:52 Defragmenter cleaner thread now running
08/02/18 01:07:52 Defragmenter cleaner thread cleaned:0 partitions
08/02/18 01:08:12 DR: Cannot connect to secondary server
08/02/18 01:08:12 DR: Turned off on primary server
08/02/18 01:09:14 DR: Cannot connect to secondary server
08/02/18 01:09:14 DR: Turned off on primary server
08/02/18 01:10:14 DR: Cannot connect to secondary server
08/02/18 01:10:14 DR: Turned off on primary server
08/02/18 01:11:14 DR: Cannot connect to secondary server
08/02/18 01:11:14 DR: Turned off on primary server
08/02/18 01:12:14 DR: Cannot connect to secondary server
08/02/18 01:12:14 DR: Turned off on primary server
08/02/18 01:12:53 Checkpoint Completed: duration was 0 seconds.
08/02/18 01:12:53 Thu Aug 2 - loguniq 19, logpos 0xcf0704, timestamp: 0xf3ab0 Interval: 150
08/02/18 01:12:53 Maximum server connections 2
08/02/18 01:12:53 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 39, Llog used 58
08/02/18 01:13:14 DR: Cannot connect to secondary server
08/02/18 01:13:14 DR: Turned off on primary server
И в журнале диспетчера соединений я получаю:
Thu Aug 2 00:57:42 2018
00:57:42 IBM Informix Connection Manager
00:57:42 IBM Informix CSDK Version 4.10, IBM Informix-ESQL Version 4.10.FC12
00:57:42 Build Number: N037
00:57:42 Build Host: hans
00:57:42 Build OS: Linux 2.6.18-128.el5
00:57:42 Build Date: Tue Jun 26 08:55:45 CDT 2018
00:57:42 GLS Version: glslib-6.00.FC13
00:57:42
00:57:42 set CM_TIMEOUT 60
00:57:42 set global EVENT_TIMEOUT 60
00:57:42 set global SECONDARY_EVENT_TIMEOUT 60
00:57:42 DEBUG[TID139959615362880]:add type PRI to SLA sla_ifx_abc [cmsm_sla.c:parse_sla:2412]
00:57:42 SLA sla_ifx_abc listener mode REDIRECT
00:57:42 Connection Manager name is icm_1
00:57:42 Starting Connection Manager...
00:57:42 DEBUG[TID139959615362880]:Password File /opt/IBM/informix/4.10/etc/passwd_file failed error:No such file or directory[2] [cmsm_main.c:cmsm_pw_init:1782]
00:57:42 Warning: Password Manager failed; working in trusted node mode
00:57:42 set the maximum number of file descriptors 32767 failed
00:57:42 DEBUG[TID139959615362880]:setrlimit(RLIMIT_NOFILE, 32767) = -1, error:Operation not permitted[1] [cmsm_main.c:cmsm_run:4901]
00:57:42 the current maximum number of file descriptors is 1024
00:57:42 DEBUG[TID139959615362880]:new daemon pid is 11761 [cmsm_main.c:cmsm_daemonize:4653]
00:57:42 DEBUG[TID139959615362880]:dbservername = servers_abc [cmsm_server.ec:cmsm_primary:2825]
00:57:42 DEBUG[TID139959615362880]:nettype = [cmsm_server.ec:cmsm_primary:2826]
00:57:42 DEBUG[TID139959615362880]:hostname = - [cmsm_server.ec:cmsm_primary:2827]
00:57:42 DEBUG[TID139959615362880]:servicename = - [cmsm_server.ec:cmsm_primary:2828]
00:57:42 DEBUG[TID139959615362880]:options = i=10,c=1,e=ifx_b [cmsm_server.ec:cmsm_primary:2829]
00:57:42 DEBUG[TID139959615362880]:connect to servers_abc for CLUSTER ifx_abc [cmsm_server.ec:cmsm_primary:2845]
00:57:42 DEBUG[TID139959615362880]:connect to group servers_abc server ifx_a for CLUSTER ifx_abc [cmsm_server.ec:cmsm_primary:2866]
00:57:42 DEBUG[TID139959615362880]:create new thread -509528320 for ifx_a [cmsm_sla.c:cmsm_update_server_ex:996]
00:57:42 DEBUG[TID139959615362880]:connect to group servers_abc server ifx_b for CLUSTER ifx_abc [cmsm_server.ec:cmsm_primary:2866]
00:57:42 DEBUG[TID139959615362880]:create new thread -511629568 for ifx_b [cmsm_sla.c:cmsm_update_server_ex:996]
00:57:42 listener sla_ifx_abc initializing
00:57:42 DEBUG[TID139959585543936]:listenter sla_ifx_abc host=machine3.local service=15030 nettype=soctcp engtypeon [cmsm_main.c:cmsm_listener_thread:1166]
00:57:42 DEBUG[TID139959589746432]:CONNECT to @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a1 SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
00:57:42 DEBUG[TID139959589746432]:database sysmaster @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a1 SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:145]
00:57:42 DEBUG[TID139959589746432]:ifx_a protocols = 1 [cmsm_server.ec:cmsm_add_dbaliases:2433]
00:57:42 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator setting primary name = ifx_a [cmsm_arb.c:arb_set_primary:793]
00:57:42 Cluster ifx_abc Arbitrator FOC ORDER=ENABLED PRIORITY=1
00:57:42 Connection Manager successfully connected to ifx_a
00:57:42 DEBUG[TID139959587645184]:CONNECT to @ifx_b|onsoctcp|machine2.local|15020 AS ifx_b1 SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
00:57:42 DEBUG[TID139959587645184]:database sysmaster @ifx_b|onsoctcp|machine2.local|15020 AS ifx_b1 SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:145]
00:57:42 Listener sla_ifx_abc DBSERVERS=PRI is active with 4 worker threads
00:57:42 DEBUG[TID139959587645184]:ifx_b protocols = 1 [cmsm_server.ec:cmsm_add_dbaliases:2433]
00:57:42 Connection Manager successfully connected to ifx_b
00:57:42 DEBUG[TID139959589746432]:server ifx_a version is 12.10 [cmsm_er.ec:cmsm_er_monitor:2994]
00:57:42 DEBUG[TID139959589746432]:ACTIVE MONITOR ifx_a ifx_a 192.168.56.101 49464 [cmsm_er.ec:cmsm_er_monitor:3036]
00:57:42 DEBUG[TID139959589746432]:register CM successfully icm_1 with token 1782701540 [cmsm_server.ec:cmsm_cm_register:2336]
00:57:42 DEBUG[TID139959587645184]:server ifx_b version is 12.10 [cmsm_er.ec:cmsm_er_monitor:2994]
00:57:42 DEBUG[TID139959587645184]:ACTIVE MONITOR ifx_b ifx_b 192.168.56.101 44376 [cmsm_er.ec:cmsm_er_monitor:3036]
00:57:42 DEBUG[TID139959589746432]:Register server ifx_a for ifx_a count = 1 [cmsm_er.ec:cmsm_er_event_process:1281]
00:57:42 DEBUG[TID139959587645184]:register CM successfully icm_1 with token 1120237392 [cmsm_server.ec:cmsm_cm_register:2336]
00:57:42 DEBUG[TID139959587645184]:Register server ifx_b for ifx_b count = 1 [cmsm_er.ec:cmsm_er_event_process:1281]
00:57:43 DEBUG[TID139959615362880]:fcntl(/opt/IBM/informix/4.10/tmp/cmsm.pid.icm_1) success error:Success[0] [cmsm_main.c:cmsm_pidfile:2193]
00:57:43 Connection Manager started successfully
00:57:43 DEBUG[TID139959589746432]:SQL get ifx_a event SRV_ADM 16:5 SDS,HDR,RSS [cmsm_er.ec:cmsm_er_event_process:1883]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator reinitialized CM names [cmsm_arb.c:arb_clear_cms_list:655]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator added CM name = icm_1 [cmsm_arb.c:arb_add_cm:584]
00:57:43 CM icm_1 arbitrator for ifx_abc is active
00:57:43 Cluster ifx_abc Arbitrator FOC ORDER=SDS,HDR,RSS PRIORITY=1 TIMEOUT=1
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator reinitialized CM names [cmsm_arb.c:arb_clear_cms_list:655]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator added CM name = icm_1 [cmsm_arb.c:arb_add_cm:584]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator reinitialized CM names [cmsm_arb.c:arb_clear_cms_list:655]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator added CM name = icm_1 [cmsm_arb.c:arb_add_cm:584]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator reinitialized CM names [cmsm_arb.c:arb_clear_cms_list:655]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator added CM name = icm_1 [cmsm_arb.c:arb_add_cm:584]
01:01:25 DEBUG[TID139959583442688]:SLA sla_ifx_abc 3 ifx_a time=1 latency= 0.00 readyQ=19 session=0 adjustSession=0 [cmsm_sla.c:select_server_from_sla:2947]
01:01:25 SLA sla_ifx_abc redirect SQLI client from 192.168.56.101 to ifx_a 192.168.56.102.15010
01:01:39 DEBUG[TID139959581341440]:SLA sla_ifx_abc 3 ifx_a time=0 latency= 0.00 readyQ=18 session=0 adjustSession=0 [cmsm_sla.c:select_server_from_sla:2947]
01:01:39 SLA sla_ifx_abc redirect SQLI client from 192.168.56.101 to ifx_a 192.168.56.102.15010
01:01:39 DEBUG[TID139959579240192]:SLA sla_ifx_abc 3 ifx_a time=0 latency= 0.00 readyQ=18 session=0 adjustSession=1 [cmsm_sla.c:select_server_from_sla:2947]
01:01:39 SLA sla_ifx_abc redirect SQLI client from 192.168.56.101 to ifx_a 192.168.56.102.15010
01:02:09 DEBUG[TID139959577138944]:SLA sla_ifx_abc 3 ifx_a time=6 latency= 0.00 readyQ=16 session=0 adjustSession=0 [cmsm_sla.c:select_server_from_sla:2947]
01:02:09 SLA sla_ifx_abc redirect SQLI client from 192.168.56.101 to ifx_a 192.168.56.102.15010
01:06:16 Server ifx_a no response over EVENT_TIMEOUT 60 seconds
01:06:16 Connection Manager disconnected from ifx_a
01:06:16 DEBUG[TID139959589746432]:start failover in sqlbreakcallback for ifx_a [cmsm_server.ec:cmsm_event_callback_common:1274]
01:06:16 DEBUG[TID139959575037696]:starting the failover process [cmsm_server.ec:cmsm_failover_incallback:1184]
01:06:17 Cluster ifx_abc Arbitrator detected primary server is down.
01:06:17 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:06:17 Cluster ifx_abc Arbitrator will not perform failover.
01:06:17 DEBUG[TID139959575037696]:failover process returns -1 [cmsm_server.ec:cmsm_failover_incallback:1207]
01:06:39 DEBUG[TID139959587645184]:create new thread -524237056 for ifx_a [cmsm_sla.c:cmsm_update_server_ex:996]
01:06:39 DEBUG[TID139959587645184]:create new thread -526338304 for ifx_a [cmsm_sla.c:cmsm_update_server_ex:996]
01:06:39 DEBUG[TID139959575037696]:Exit server ifx_a monitor machine3.local thread status=8 [cmsm_er.ec:cmsm_er_monitor:2608]
01:07:19 DEBUG[TID139959589746432]:fetch sysrepstats_cursor SQLCODE = (-213,0,) [cmsm_er.ec:cmsm_er_event_process:1711]
01:07:19 DEBUG[TID139959589746432]:unregister event failed, sqlcode = (-1811,0,) [cmsm_server.ec:cmsm_cm_unregister:2383]
01:07:19 DEBUG[TID139959589746432]:Unregister server ifx_a for ifx_a count = 0 [cmsm_er.ec:cmsm_er_event_process:2092]
01:07:19 Connection Manager disconnected from ifx_a
01:07:19 ALARM 3002 detected lost connection to Informix server ifx_a from machine3.local
01:07:19 DEBUG[TID139959589746432]:Connection Manager disconnected from ifx_a sqlcode=-1803 [cmsm_er.ec:cmsm_er_monitor:3090]
01:07:19 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:19 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:19 Cluster ifx_abc Arbitrator will not perform failover.
01:07:19 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:19 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:19 Cluster ifx_abc Arbitrator will not perform failover.
01:07:20 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:20 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:20 Cluster ifx_abc Arbitrator will not perform failover.
01:07:21 DEBUG[TID139959572936448]:CONNECT to @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a2 SQLCODE = (-908,107,ifx_a) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
01:07:21 ALARM 3001 unable to connect to Informix server ifx_a from machine3.local error -908
01:07:21 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:21 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:21 Cluster ifx_abc Arbitrator will not perform failover.
01:07:22 DEBUG[TID139959572936448]:Exit server ifx_a monitor machine3.local thread status=8 [cmsm_er.ec:cmsm_er_monitor:2608]
01:07:24 DEBUG[TID139959589746432]:CONNECT to @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a3 SQLCODE = (-908,107,ifx_a) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
01:07:24 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:24 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:24 Cluster ifx_abc Arbitrator will not perform failover.
01:07:27 DEBUG[TID139959589746432]:CONNECT to @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a4 SQLCODE = (-908,107,ifx_a) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
01:07:27 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:27 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:27 Cluster ifx_abc Arbitrator will not perform failover.
01:07:30 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:07:30 DEBUG[TID139959589746432]:CONNECT to @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a5 SQLCODE = (-908,107,ifx_a) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
01:07:30 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator start failover logic now... [cmsm_arb.c:arb_server_down:1564]
01:07:30 ALARM 2001 failover arbitrator automated failover in progress.
01:07:30 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator try failover on generic SDS [cmsm_arb.c:arb_failover_by_foc:1174]
01:07:30 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator try failover on generic HDR [cmsm_arb.c:arb_failover_by_foc:1200]
01:07:30 DEBUG[TID139959589746432]:CLuster ifx_abc Arbitrator failover to ifx_b, waiting 150 seconds for confirmation [cmsm_arb.c:arb_post_wait:1764]
01:07:30 DEBUG[TID139959589746432]:Arbitrator found failover node = ifx_b [cmsm_server_arb.ec:arb_failover_to:303]
01:07:37 DEBUG[TID139959589746432]:CONNECT to @ifx_b|onsoctcp|machine2.local|15020 AS ifx_b2 SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
01:07:37 DEBUG[TID139959589746432]:database sysmaster @ifx_b|onsoctcp|machine2.local|15020 AS ifx_b2 SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:145]
01:07:37 DEBUG[TID139959589746432]:Arbitrator connected to failover node = ifx_b [cmsm_server_arb.ec:arb_failover_to:316]
01:07:37 DEBUG[TID139959589746432]:Arbitrator get HA_ALIAS on ifx_b, HA_ALIAS [ifx_b] [cmsm_server_arb.ec:arb_failover_to:326]
01:07:50 DEBUG[TID139959587645184]:SQL get ifx_b event SRV_ADM 16:3 1 [cmsm_er.ec:cmsm_er_event_process:1883]
01:07:50 Server ifx_b is in quiescent mode.
01:07:50 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:07:50 DEBUG[TID139959587645184]:SQL get ifx_b event SRV_ADM 16:3 5 [cmsm_er.ec:cmsm_er_event_process:1883]
01:07:50 The server type of cluster ifx_abc server ifx_b is Primary.
01:07:50 DEBUG[TID139959587645184]:Cluster ifx_abc Arbitrator setting primary name = ifx_b [cmsm_arb.c:arb_set_primary:793]
01:07:50 Cluster ifx_abc Arbitrator FOC ORDER=SDS,HDR,RSS PRIORITY=1 TIMEOUT=1
01:07:50 Server ifx_b is in on-line mode.
01:07:50 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:5[NEWPRI] ifx_b [cmsm_er.ec:cmsm_er_event_process:1734]
01:07:50 Arbitrator make primary on node = ifx_b successful
01:07:50 ALARM 2002 failover arbitrator automated failover completed
01:07:50 DEBUG[TID139959589746432]:Monitor ifx_a exit by arbitrator [cmsm_er.ec:cmsm_er_monitor:2789]
01:07:50 DEBUG[TID139959589746432]:all monitors exited, server ifx_a register count = 0 [cmsm_er.ec:cmsm_er_monitor:3156]
01:07:50 DEBUG[TID139959589746432]:enqueue a cleanup task for ifx_a [cmsm_er.ec:cmsm_enqueu_cleanup:2378]
01:07:50 DEBUG[TID139959587645184]:server monitor ifx_b dequeue a cleanup task [cmsm_er.ec:cmsm_er_event_process:2060]
01:07:50 DEBUG[TID139959587645184]:SQL get ifx_b event SRV_ADM 16:5 SDS,HDR,RSS [cmsm_er.ec:cmsm_er_event_process:1883]
01:07:50 DEBUG[TID139959587645184]:Cluster ifx_abc Arbitrator reinitialized CM names [cmsm_arb.c:arb_clear_cms_list:655]
01:07:50 DEBUG[TID139959587645184]:Cluster ifx_abc Arbitrator added CM name = icm_1 [cmsm_arb.c:arb_add_cm:584]
01:07:57 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:08:13 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:09:19 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:10:17 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:11:22 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:12:20 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:13:17 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
Итак, есть различия, так как кажется, что у вас была AF (ошибка утверждения) во время продвижения вторичного сервера.
Не могли бы вы опубликовать еще свои журналы? Справа перед отключением первичного звена до тех пор, пока вторичный не застревает при восстановлении.