

Setting tune.ssl.default-dh-param to 1024 by default, if your workload permits it you should set it to at least 2048. Please set a value >= 1024 to make this warning disappear.

Getting this?

haproxy-systemd-wrapper[17526]: [WARNING] 186/151058 (17527) : Setting tune.ssl.default-dh-param to 1024 by default, if your workload permits it you should set it to at least 2048. Please set a value >= 1024 to make this warning disappear.

Add this to the global section of your HAProxy config:

global
        log                     127.0.0.1       local3  warning
        pidfile                 /var/run/haproxy.pid
        maxconn                 4000
        user                    haproxy
        group                   haproxy
        daemon

        stats socket            /etc/haproxy/stats
        tune.ssl.default-dh-param 2048
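To apply the change safely, validate the syntax first and then reload. A sketch, assuming a standard package install of HAProxy under systemd:

```shell
# Validate the configuration before touching the running service, then
# reload (a reload keeps existing connections, unlike a restart).
haproxy -c -f /etc/haproxy/haproxy.cfg && systemctl reload haproxy
```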

Cheers,
TK

Unexpected error. Unable to verify database connection. / Cannot find CDH's bigtop-detect-javahome.

If you get these errors:

Unexpected error. Unable to verify database connection. 

Checking the logs, we see the following:


2019-06-24 23:42:18,402 INFO scm-web-118667:com.cloudera.enterprise.JavaMelodyFacade: Entering HTTP Operation: Method:POST, Path:/clusters/2/healthStatusBar
2019-06-24 23:42:18,404 INFO scm-web-119745:com.cloudera.enterprise.JavaMelodyFacade: Entering HTTP Operation: Method:POST, Path:/clusters/2/statusContent
2019-06-24 23:42:18,412 INFO scm-web-118667:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/clusters/2/healthStatusBar, Status:200
2019-06-24 23:42:18,420 INFO scm-web-119745:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/clusters/2/statusContent, Status:200
2019-06-24 23:42:19,481 INFO scm-web-119111:com.cloudera.enterprise.JavaMelodyFacade: Entering HTTP Operation: Method:POST, Path:/dbTestConn/testConnection
2019-06-24 23:42:19,489 INFO scm-web-119111:com.cloudera.cmf.service.ServiceHandlerRegistry: Executing host command HiveTestDatabaseConnection with sensitive arguments. Host: DbHost{id=28, hostId=a787204f-931d-47e0-8aef-534ab6355a7a, hostName=cm-r01en01.mws.mds.xyz}
2019-06-24 23:42:19,500 INFO scm-web-119111:com.cloudera.cmf.service.ServiceHandlerRegistry: Executing host command OozieTestDatabaseConnection with sensitive arguments. Host: DbHost{id=28, hostId=a787204f-931d-47e0-8aef-534ab6355a7a, hostName=cm-r01en01.mws.mds.xyz}
2019-06-24 23:42:19,506 INFO scm-web-119111:com.cloudera.cmf.service.ServiceHandlerRegistry: Executing host command HueTestDatabaseConnection with sensitive arguments. Host: DbHost{id=28, hostId=a787204f-931d-47e0-8aef-534ab6355a7a, hostName=cm-r01en01.mws.mds.xyz}
2019-06-24 23:42:19,809 INFO scm-web-119111:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/dbTestConn/testConnection, Status:200
2019-06-24 23:42:19,826 INFO avro-servlet-hb-processor-1:com.cloudera.server.common.AgentAvroServlet: (45 skipped) AgentAvroServlet: heartbeat processing stats: average=12ms, min=8ms, max=79ms.
2019-06-24 23:42:21,847 INFO scm-web-120167:com.cloudera.enterprise.JavaMelodyFacade: Entering HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult
2019-06-24 23:42:21,854 INFO scm-web-120167:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult, Status:200
2019-06-24 23:42:23,890 INFO scm-web-120477:com.cloudera.enterprise.JavaMelodyFacade: Entering HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult
2019-06-24 23:42:23,896 INFO scm-web-120477:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult, Status:200
2019-06-24 23:42:24,373 WARN avro-servlet-hb-processor-0:com.cloudera.server.cmf.AgentProtocolImpl: Received Process Heartbeat for unknown (or duplicate) process. Ignoring. This is expected to happen once after old process eviction or process deletion (as happens in restarts). id=242 name=null host=a787204f-931d-47e0-8aef-534ab6355a7a/cm-r01en01.mws.mds.xyz
2019-06-24 23:42:24,374 WARN avro-servlet-hb-processor-0:com.cloudera.server.cmf.AgentProtocolImpl: Received Process Heartbeat for unknown (or duplicate) process. Ignoring. This is expected to happen once after old process eviction or process deletion (as happens in restarts). id=243 name=null host=a787204f-931d-47e0-8aef-534ab6355a7a/cm-r01en01.mws.mds.xyz
2019-06-24 23:42:24,374 WARN avro-servlet-hb-processor-0:com.cloudera.server.cmf.AgentProtocolImpl: Received Process Heartbeat for unknown (or duplicate) process. Ignoring. This is expected to happen once after old process eviction or process deletion (as happens in restarts). id=244 name=null host=a787204f-931d-47e0-8aef-534ab6355a7a/cm-r01en01.mws.mds.xyz
2019-06-24 23:42:25,935 INFO scm-web-119807:com.cloudera.enterprise.JavaMelodyFacade: Entering HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult
2019-06-24 23:42:25,942 INFO scm-web-119807:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult, Status:200
2019-06-24 23:42:28,049 INFO scm-web-119778:com.cloudera.enterprise.JavaMelodyFacade: Entering HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult
2019-06-24 23:42:28,056 INFO scm-web-119778:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult, Status:200
2019-06-24 23:42:29,383 INFO scm-web-120247:com.cloudera.enterprise.JavaMelodyFacade: Entering HTTP Operation: Method:POST, Path:/clusters/2/healthStatusBar
2019-06-24 23:42:29,394 INFO scm-web-120247:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/clusters/2/healthStatusBar, Status:200
2019-06-24 23:42:29,395 INFO scm-web-120167:com.cloudera.enterprise.JavaMelodyFacade: Entering HTTP Operation: Method:POST, Path:/clusters/2/statusContent
2019-06-24 23:42:29,422 INFO scm-web-120167:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/clusters/2/statusContent, Status:200
2019-06-24 23:42:30,084 INFO scm-web-118667:com.cloudera.enterprise.JavaMelodyFacade: Entering HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult
2019-06-24 23:42:30,091 INFO scm-web-118667:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult, Status:200
2019-06-24 23:42:32,124 INFO scm-web-120167:com.cloudera.enterprise.JavaMelodyFacade: Entering HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult
2019-06-24 23:42:32,131 INFO scm-web-120167:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult, Status:200
2019-06-24 23:42:33,205 INFO CommandPusher:com.cloudera.cmf.service.AbstractOneOffHostCommand: Unsuccessful 'HueTestDatabaseConnection'
2019-06-24 23:42:33,207 INFO CommandPusher:com.cloudera.cmf.service.AbstractDbConnectionTestCommand: Command exited with code: 1
2019-06-24 23:42:33,207 INFO CommandPusher:com.cloudera.cmf.service.AbstractDbConnectionTestCommand: [24/Jun/2019 23:42:28 +0000] 30846 MainThread redactor     INFO     Started launcher: /opt/cloudera/cm-agent/service/hue/hue.sh is_db_alive
[24/Jun/2019 23:42:28 +0000] 30846 MainThread redactor     ERROR    Redaction rules file doesn't exist, not redacting logs. file: redaction-rules.json, directory: /run/cloudera-scm-agent/process/247-HUE-test-db-connection
[24/Jun/2019 23:42:28 +0000] 30846 MainThread redactor     INFO     Re-exec watcher: /opt/cloudera/cm-agent/bin/cm proc_watcher 30867
+ date
+ date
Mon Jun 24 23:42:30 EDT 2019
++ dirname /opt/cloudera/cm-agent/service/hue/hue.sh
+ cloudera_config=/opt/cloudera/cm-agent/service/hue
++ cd /opt/cloudera/cm-agent/service/hue/../common
++ pwd
+ cloudera_config=/opt/cloudera/cm-agent/service/common
+ . /opt/cloudera/cm-agent/service/common/cloudera-config.sh
++ : /opt/cloudera/cm
++ export CLOUDERA_DIR
++ set -x
+ source_parcel_environment
+ '[' '!' -z '' ']'
+ locate_cdh_java_home
+ '[' -z '' ']'
+ '[' -z /usr/libexec/bigtop-utils ']'
+ local BIGTOP_DETECT_JAVAHOME=
+ for candidate in '"${JSVC_HOME}"' '"${JSVC_HOME}/.."' '"/usr/lib/bigtop-utils"' '"/usr/libexec"'
+ '[' -e /usr/libexec/bigtop-utils/bigtop-detect-javahome ']'
+ for candidate in '"${JSVC_HOME}"' '"${JSVC_HOME}/.."' '"/usr/lib/bigtop-utils"' '"/usr/libexec"'
+ '[' -e /usr/libexec/bigtop-utils/../bigtop-detect-javahome ']'
+ for candidate in '"${JSVC_HOME}"' '"${JSVC_HOME}/.."' '"/usr/lib/bigtop-utils"' '"/usr/libexec"'
+ '[' -e /usr/lib/bigtop-utils/bigtop-detect-javahome ']'
+ for candidate in '"${JSVC_HOME}"' '"${JSVC_HOME}/.."' '"/usr/lib/bigtop-utils"' '"/usr/libexec"'
+ '[' -e /usr/libexec/bigtop-detect-javahome ']'
+ '[' -z '' ']'
+ echo 'Cannot find CDH'\''s bigtop-detect-javahome.'
+ exit 1

2019-06-24 23:42:33,207 ERROR CommandPusher:com.cloudera.cmf.model.DbCommand: Command 742(HueTestDatabaseConnection) has completed. finalstate:FINISHED, success:false, msg:Unexpected error. Unable to verify database connection.
2019-06-24 23:42:34,168 INFO scm-web-119807:com.cloudera.enterprise.JavaMelodyFacade: Entering HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult
2019-06-24 23:42:34,175 INFO scm-web-119807:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult, Status:200
2019-06-24 23:42:36,228 INFO scm-web-120167:com.cloudera.enterprise.JavaMelodyFacade: Entering HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult
2019-06-24 23:42:36,234 INFO scm-web-120167:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult, Status:200
2019-06-24 23:42:37,481 INFO CommandPusher:com.cloudera.cmf.service.AbstractOneOffHostCommand: Successful 'OozieTestDatabaseConnection'
2019-06-24 23:42:37,483 INFO CommandPusher:com.cloudera.cmf.service.AbstractDbConnectionTestCommand: Command exited with code: 0
2019-06-24 23:42:37,483 INFO CommandPusher:com.cloudera.cmf.service.AbstractDbConnectionTestCommand: + local MISCJAVA_HOME_CANDIDATES
+ case ${BIGTOP_JAVA_MAJOR} in
+ JAVA_HOME_CANDIDATES=(${JAVA8_HOME_CANDIDATES[@]} ${MISCJAVA_HOME_CANDIDATES[@]} ${OPENJAVA8_HOME_CANDIDATES[@]} ${JAVA11_HOME_CANDIDATES[@]} ${OPENJAVA11_HOME_CANDIDATES[@]})
+ '[' -z '' ']'
+ for candidate_regex in '${JAVA_HOME_CANDIDATES[@]}'
++ ls -rvd /usr/java/jdk1.8.0_181-cloudera
+ for candidate in '`ls -rvd ${candidate_regex}* 2>/dev/null`'
+ '[' -e /usr/java/jdk1.8.0_181-cloudera/bin/java ']'
+ export JAVA_HOME=/usr/java/jdk1.8.0_181-cloudera
+ JAVA_HOME=/usr/java/jdk1.8.0_181-cloudera
+ break 2
+ verify_java_home
+ '[' -z /usr/java/jdk1.8.0_181-cloudera ']'
+ echo JAVA_HOME=/usr/java/jdk1.8.0_181-cloudera
+ source_parcel_environment
+ '[' '!' -z '' ']'
+ JAVA=/usr/java/jdk1.8.0_181-cloudera/bin/java
+ [[ -z /opt/cloudera/cm ]]
+ MGMT_CLASSPATH='/opt/cloudera/cm/lib/*'
+ JDBC_JARS=/usr/share/java/mysql-connector-java.jar:/opt/cloudera/cm/lib/postgresql-42.1.4.jre7.jar:/usr/share/java/oracle-connector-java.jar
+ MGMT_CLASSPATH='/usr/share/java/mysql-connector-java.jar:/opt/cloudera/cm/lib/postgresql-42.1.4.jre7.jar:/usr/share/java/oracle-connector-java.jar:/opt/cloudera/cm/lib/*'
++ pwd
+ MGMT_CLASSPATH='/run/cloudera-scm-agent/process/246-OOZIE.OOZIE_SERVER-test-db-connection:/usr/share/java/mysql-connector-java.jar:/opt/cloudera/cm/lib/postgresql-42.1.4.jre7.jar:/usr/share/java/oracle-connector-java.jar:/opt/cloudera/cm/lib/*'
+ MGMT_JAVA_OPTS='-Djava.net.preferIPv4Stack=true '
+ exec /usr/java/jdk1.8.0_181-cloudera/bin/java -Djava.net.preferIPv4Stack=true -Djava.security.egd=file:///dev/urandom -cp '/run/cloudera-scm-agent/process/246-OOZIE.OOZIE_SERVER-test-db-connection:/usr/share/java/mysql-connector-java.jar:/opt/cloudera/cm/lib/postgresql-42.1.4.jre7.jar:/usr/share/java/oracle-connector-java.jar:/opt/cloudera/cm/lib/*' com.cloudera.enterprise.dbutil.DbCommandExecutor db.properties

2019-06-24 23:42:37,483 INFO CommandPusher:com.cloudera.cmf.model.DbCommand: Command 741(OozieTestDatabaseConnection) has completed. finalstate:FINISHED, success:true, msg:Successful
2019-06-24 23:42:37,497 INFO CommandPusher:com.cloudera.cmf.service.AbstractOneOffHostCommand: Successful 'HiveTestDatabaseConnection'
2019-06-24 23:42:37,498 INFO CommandPusher:com.cloudera.cmf.service.AbstractDbConnectionTestCommand: Command exited with code: 0
2019-06-24 23:42:37,498 INFO CommandPusher:com.cloudera.cmf.service.AbstractDbConnectionTestCommand: + MISCJAVA_HOME_CANDIDATES=('/Library/Java/Home' '/usr/java/default' '/usr/lib/jvm/default-java' '/usr/lib/jvm/java-openjdk' '/usr/lib/jvm/jre-openjdk')
+ local MISCJAVA_HOME_CANDIDATES
+ case ${BIGTOP_JAVA_MAJOR} in
+ JAVA_HOME_CANDIDATES=(${JAVA8_HOME_CANDIDATES[@]} ${MISCJAVA_HOME_CANDIDATES[@]} ${OPENJAVA8_HOME_CANDIDATES[@]} ${JAVA11_HOME_CANDIDATES[@]} ${OPENJAVA11_HOME_CANDIDATES[@]})
+ '[' -z '' ']'
+ for candidate_regex in '${JAVA_HOME_CANDIDATES[@]}'
++ ls -rvd /usr/java/jdk1.8.0_181-cloudera
+ for candidate in '`ls -rvd ${candidate_regex}* 2>/dev/null`'
+ '[' -e /usr/java/jdk1.8.0_181-cloudera/bin/java ']'
+ export JAVA_HOME=/usr/java/jdk1.8.0_181-cloudera
+ JAVA_HOME=/usr/java/jdk1.8.0_181-cloudera
+ break 2
+ verify_java_home
+ '[' -z /usr/java/jdk1.8.0_181-cloudera ']'
+ echo JAVA_HOME=/usr/java/jdk1.8.0_181-cloudera
+ source_parcel_environment
+ '[' '!' -z '' ']'
+ JAVA=/usr/java/jdk1.8.0_181-cloudera/bin/java
+ [[ -z /opt/cloudera/cm ]]
+ MGMT_CLASSPATH='/opt/cloudera/cm/lib/*'
+ JDBC_JARS=/usr/share/java/mysql-connector-java.jar:/opt/cloudera/cm/lib/postgresql-42.1.4.jre7.jar:/usr/share/java/oracle-connector-java.jar
+ MGMT_CLASSPATH='/usr/share/java/mysql-connector-java.jar:/opt/cloudera/cm/lib/postgresql-42.1.4.jre7.jar:/usr/share/java/oracle-connector-java.jar:/opt/cloudera/cm/lib/*'
++ pwd
+ MGMT_CLASSPATH='/run/cloudera-scm-agent/process/245-HIVE-test-db-connection:/usr/share/java/mysql-connector-java.jar:/opt/cloudera/cm/lib/postgresql-42.1.4.jre7.jar:/usr/share/java/oracle-connector-java.jar:/opt/cloudera/cm/lib/*'
+ MGMT_JAVA_OPTS='-Djava.net.preferIPv4Stack=true '
+ exec /usr/java/jdk1.8.0_181-cloudera/bin/java -Djava.net.preferIPv4Stack=true -Djava.security.egd=file:///dev/urandom -cp '/run/cloudera-scm-agent/process/245-HIVE-test-db-connection:/usr/share/java/mysql-connector-java.jar:/opt/cloudera/cm/lib/postgresql-42.1.4.jre7.jar:/usr/share/java/oracle-connector-java.jar:/opt/cloudera/cm/lib/*' com.cloudera.enterprise.dbutil.DbCommandExecutor db.properties

2019-06-24 23:42:37,498 INFO CommandPusher:com.cloudera.cmf.model.DbCommand: Command 740(HiveTestDatabaseConnection) has completed. finalstate:FINISHED, success:true, msg:Successful
2019-06-24 23:42:38,336 INFO scm-web-118667:com.cloudera.enterprise.JavaMelodyFacade: Entering HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult
2019-06-24 23:42:38,351 INFO scm-web-118667:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/dbTestConn/checkConnectionResult, Status:200
2019-06-24 23:42:40,393 INFO scm-web-120237:com.cloudera.enterprise.JavaMelodyFacade: Entering HTTP Operation: Method:POST, Path:/clusters/2/healthStatusBar
2019-06-24 23:42:40,418 INFO scm-web-119778:com.cloudera.enterprise.JavaMelodyFacade: Entering HTTP Operation: Method:POST, Path:/clusters/2/statusContent
2019-06-24 23:42:40,421 INFO scm-web-120237:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/clusters/2/healthStatusBar, Status:200
2019-06-24 23:42:40,435 INFO scm-web-119778:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/clusters/2/statusContent, Status:200
^C
[root@cm-r01nn01 ~]#

Verify you have the right bigtop packages installed:

[root@cm-r01nn01 cloudera-scm-agent]#
[root@cm-r01nn01 cloudera-scm-agent]#
[root@cm-r01nn01 cloudera-scm-agent]# ls -altri /usr/libexec/bigtop-detect-javahome
ls: cannot access /usr/libexec/bigtop-detect-javahome: No such file or directory
[root@cm-r01nn01 cloudera-scm-agent]# find / -iname bigtop-detect-javahome*
[root@cm-r01nn01 cloudera-scm-agent]# ssh cm-r01nn02
Password:
Last login: Tue Jun 25 23:21:59 2019 from cm-r01nn01.mws.mds.xyz
[root@cm-r01nn02 ~]#
[root@cm-r01nn02 ~]#
[root@cm-r01nn02 ~]#
[root@cm-r01nn02 ~]# find / -iname bigtop-detect-javahome*
/usr/bin/bigtop-detect-javahome
/usr/lib/bigtop-utils/bigtop-detect-javahome
[root@cm-r01nn02 ~]# rpm -qf /usr/bin/bigtop-detect-javahome
bigtop-utils-0.7.0+cdh6.2.0-967373.el7.noarch
[root@cm-r01nn02 ~]# rpm -qf /usr/lib/bigtop-utils/bigtop-detect-javahome
bigtop-utils-0.7.0+cdh6.2.0-967373.el7.noarch
[root@cm-r01nn02 ~]#
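Since the rpm query on the good node shows the script belongs to the bigtop-utils package, the likely fix on the broken node is simply to install that package (this assumes the CDH 6 yum repository is configured on the host):

```shell
# bigtop-utils owns /usr/lib/bigtop-utils/bigtop-detect-javahome per the
# rpm -qf output above; requires the CDH repo to be available.
yum install -y bigtop-utils
```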

 

Second, ensure that JAVA_HOME in the Hosts Configuration of CM points to the correct JDK home.  This Ansible ad-hoc command can help create a link to the latest copy of the JDK:

[root@awx01 ansible]# cd /ansible && ansible 'cm*' -m shell -a 'ls -altrid /usr/java/jdk1.8.0_181-cloudera/; ln -s /usr/java/jdk1.8.0_181-cloudera/ /usr/java/latest; ls -altrid /usr/java/latest'

Then use /usr/java/latest in the Hosts Configuration (Look for JAVA_HOME and set it).

Thx,
TK

ERROR: cannot execute CREATE TABLE in a read-only transaction

If you're getting the following on your patroni cluster:

Jun 24 23:30:50 psql02 etcd: failed to send out heartbeat on time (exceeded the 250ms timeout for 1.229927109s)
Jun 24 23:30:50 psql02 etcd: server is likely overloaded
Jun 24 23:30:50 psql02 etcd: failed to send out heartbeat on time (exceeded the 250ms timeout for 1.22987947s)
Jun 24 23:30:50 psql02 etcd: server is likely overloaded

ERROR:  cannot execute CREATE TABLE in a read-only transaction

Then you likely commented out the health-check lines in your HAProxy config:

[root@psql03 ~]# cat /etc/haproxy/haproxy.cfg
global
    log         127.0.0.1 local0
    stats       socket /var/run/haproxy.sock mode 0600 level admin
    user        haproxy
    group       haproxy
    daemon
    debug
    maxconn 1024

defaults
    mode tcp
    log global
    option                  dontlognull
    option                  redispatch
    retries 2
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout check 5s

frontend psql-in
    mode tcp
    bind psql-c01:5432
    option tcplog
    default_backend             psql-back


backend psql-back
    mode        tcp
    option      tcplog
#    option     httpchk
#    http-check         expect status 200

    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server      psql01.nix.mds.xyz    psql01.nix.mds.xyz:5432 maxconn 1024 check port 8008
    server      psql02.nix.mds.xyz    psql02.nix.mds.xyz:5432 maxconn 1024 check port 8008
    server      psql03.nix.mds.xyz    psql03.nix.mds.xyz:5432 maxconn 1024 check port 8008

listen stats
    bind :9000
    mode http
    stats enable
    stats hide-version
    stats realm Haproxy\ Statistics
    stats uri /haproxy-stats
    stats auth admin:nopassword
[root@psql03 ~]#

Uncomment the two health-check lines (option httpchk and http-check expect status 200) in the backend above to resolve the issue.
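With the two lines restored, the backend section should read:

```
backend psql-back
    mode        tcp
    option      tcplog
    option      httpchk
    http-check  expect status 200

    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server      psql01.nix.mds.xyz    psql01.nix.mds.xyz:5432 maxconn 1024 check port 8008
    server      psql02.nix.mds.xyz    psql02.nix.mds.xyz:5432 maxconn 1024 check port 8008
    server      psql03.nix.mds.xyz    psql03.nix.mds.xyz:5432 maxconn 1024 check port 8008
```

The HTTP check against Patroni's REST API on port 8008 normally returns 200 only on the current leader, which is what keeps connections off the read-only replicas; with the check commented out, HAProxy happily routes writes to a replica and you get the read-only transaction error.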

Thx,
TK

 

User cannot run DDL statements on the specified database. Attempt to create and drop a table failed.

In case you receive the following:


cloudera-scm-server.log:2019-06-21 01:16:57,763 ERROR CommandPusher:com.cloudera.cmf.model.DbCommand: Command 609(HiveTestDatabaseConnection) has completed. finalstate:FINISHED, success:false, msg:User cannot run DDL statements on the specified database. Attempt to create and drop a table failed.

Check the logs, or test by creating a temporary table:

cloudera-scm-server.log:Caused by: org.postgresql.util.PSQLException: ERROR: cannot execute UPDATE in a read-only transaction
cloudera-scm-server.log:Caused by: java.sql.BatchUpdateException: Batch entry 0 update COMMANDS_DETAIL set OPTIMISTIC_LOCK_VERSION=4, INTERNAL_STATE=? where COMMAND_ID=618 and OPTIMISTIC_LOCK_VERSION=3 was aborted: ERROR: cannot execute UPDATE in a read-only transaction  Call getNextException to see other errors in the batch.
cloudera-scm-server.log:Caused by: org.postgresql.util.PSQLException: ERROR: cannot execute UPDATE in a read-only transaction
cloudera-scm-server.log:Caused by: java.sql.BatchUpdateException: Batch entry 0 update COMMANDS_DETAIL set OPTIMISTIC_LOCK_VERSION=6, INTERNAL_STATE=? where COMMAND_ID=618 and OPTIMISTIC_LOCK_VERSION=5 was aborted: ERROR: cannot execute UPDATE in a read-only transaction  Call getNextException to see other errors in the batch.
cloudera-scm-server.log:Caused by: org.postgresql.util.PSQLException: ERROR: cannot execute UPDATE in a read-only transaction
[root@cm-r01nn01 cloudera-scm-server]#

Try creating a test table:

hue_mws01=>
hue_mws01=> CREATE TABLE account(
   user_id serial PRIMARY KEY,
   username VARCHAR (50) UNIQUE NOT NULL,
   password VARCHAR (50) NOT NULL,
   email VARCHAR (355) UNIQUE NOT NULL,
   created_on TIMESTAMP NOT NULL,
   last_login TIMESTAMP
);
ERROR:  cannot execute CREATE TABLE in a read-only transaction
hue_mws01=>

In our case the above was resolved by restarting haproxy, keepalived and Patroni on psql02 (Node 2) of our PostgreSQL cluster.  We then tested and received:

[root@psql01 ~]# sudo su - postgres
Last login: Sat Jun 22 00:16:39 EDT 2019 on pts/0
-bash-4.2$ psql -h psql-c01.nix.mds.xyz -p 5432 -W -U hue_mws01 -d hue_mws01
Password for user hue_mws01:
psql (10.5)
Type "help" for help.

hue_mws01=> CREATE TABLE account(
   user_id serial PRIMARY KEY,
   username VARCHAR (50) UNIQUE NOT NULL,
   password VARCHAR (50) NOT NULL,
   email VARCHAR (355) UNIQUE NOT NULL,
   created_on TIMESTAMP NOT NULL,
   last_login TIMESTAMP
);
CREATE TABLE
hue_mws01=>
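For reference, the restart on psql02 amounted to something like the following (the patroni unit name is an assumption here; adjust it to however your install names the service):

```shell
# Restart the stack on the affected node so it rejoins as a proper
# replica/leader; 'patroni' is the assumed systemd unit name.
systemctl restart haproxy keepalived patroni
```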

Tell us if this solved your issue.

Cheers,
TK

 

Cannot read XML: (41) Specification mandate value for attribute data-pjax-transient [Line: 38 | Column: 40].

When you get this:

Cannot read XML: (41) Specification mandate value for attribute data-pjax-transient [Line: 38 | Column: 40].

ensure you download the XML (or any other file) from GitHub using the clone or Download ZIP option, instead of Save As / Save Link As, which saves the rendered HTML page rather than the raw file.

Thx,
TK

ERROR 1044 (42000) at line 951: Access denied for user 'root'@'%.nix.mds.xyz' to database 'performance_schema'

Problems restoring a database?

No worries:

https://forums.mysql.com/read.php?20,675392,675404#msg-675404

Here's the session:

[root@mysql01 tmp]# ls -altri
total 592120
134766607 -rw-r--r--.  1 root root 606322782 Jun  5 07:53 mysql-c01.sql.1556440201
134604506 drwxr-xr-x.  2 root root        37 Jun  5 07:54 .
201326721 dr-xr-x---. 14 root root      4096 Jun  8 13:03 ..
[root@mysql01 tmp]# mysql -u root -h mysql01.nix.mds.xyz -p zabbix < mysql-c01.sql.1556440201
Enter password:
ERROR 1130 (HY000): Host 'mysql01.nix.mds.xyz' is not allowed to connect to this MySQL server
[root@mysql01 tmp]#
[root@mysql01 tmp]#
[root@mysql01 tmp]# mysql -u root -h mysql02.nix.mds.xyz -p zabbix < mysql-c01.sql.1556440201
Enter password:
ERROR 1044 (42000) at line 951: Access denied for user 'root'@'%.nix.mds.xyz' to database 'performance_schema'
[root@mysql01 tmp]# select user,host,authentication_string from mysql.user;^C
[root@mysql01 tmp]# mysql -u root -h mysql02 -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1657
Server version: 5.7.25-log MySQL Community Server - (GPL), wsrep_25.17

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select user,host,authentication_string from mysql.user;
+---------------+---------------+-------------------------------------------+
| user          | host          | authentication_string                     |
+---------------+---------------+-------------------------------------------+
| root          | localhost     | *725EA9257B824D03786B3160A70CE38949079EAD |
| mysql.session | localhost     | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE |
| mysql.sys     | localhost     | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE |
| haproxy       | %             |                                           |
| haproxy_root  | %             | *74B1C21ACE0C2D6B0678A5E503D2A60E8F9651A3 |
| root          | %.mws.mds.xyz | *725EA9257B824D03786B3160A70CE38949079EAD |
| root          | %.nix.mds.xyz | *725EA9257B824D03786B3160A70CE38949079EAD |
| zabbix        | %.mws.mds.xyz | *DEEF4D7D88CD046ECA02A80393B7780A63E7E789 |
| zabbix        | %.nix.mds.xyz | *DEEF4D7D88CD046ECA02A80393B7780A63E7E789 |
| zabbix        | localhost     | *DEEF4D7D88CD046ECA02A80393B7780A63E7E789 |
| scm           | %             | *45E6E3C68BDF1AC7EBB5C5A3BCBD5E9437B293BE |
| rman          | %             | *819397F8B454D58DA4E9F42A88A4873756B8C96D |
| hive          | %             | *4DF1D66463C18D44E3B001A8FB1BBFBEA13E27FC |
| oozie         | %             | *2B03FE0359FAD3B80620490CE614F8622E0828CD |
| hue           | %             | *15221DE9A04689C4D312DEAC3B87DDF542AF439E |
| sentry        | %             | *78CD2A8CB7403E99F97588EDF05B3EEAC6494302 |
| amon          | %             | *DD0008B5FF196DE1F1FF14303BBA1B463CF83411 |
| nav           | %             | *1FE59D14EF0D6A47A590517437D8518EA3BFF064 |
| navms         | %             | *6F20A779B8CA430934928BE1BE2FB425DA47AC00 |
+---------------+---------------+-------------------------------------------+
19 rows in set (0.00 sec)

mysql>
mysql>
mysql>
mysql> use zabbix;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql>
mysql>
mysql>
mysql> source /root/tmp/mysql-c01.sql.1556440201
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

In certain cases, you may be forced to reinitialize a database.  THIS ASSUMES YOU HAVE A WORKING BACKUP.  Do the following:

  • Take a gzip backup of all the database files under /var/lib/mysql.
  • Remove all the database files from /var/lib/mysql on all three nodes.
  • Look for the node that has safe_to_bootstrap: 1 :

    [root@mysql03 mysql]# cat grastate.dat
    # GALERA saved state
    version: 2.1
    uuid:    00000000-0000-0000-0000-000000000000
    seqno:   -1
    safe_to_bootstrap: 1
    [root@mysql03 mysql]#

  • Start mysqld in safe mode to reset the password, if need be:

    [root@mysql03 mysql]# systemctl set-environment MYSQLD_OPTS="--wsrep-new-cluster --skip-grant-tables"

  • Start mysql on the node with safe_to_bootstrap using:

    /usr/bin/mysqld_bootstrap

  • Log in to mysql:

    [root@mysql03 mysql]# mysql -u root -p

    or

    [root@mysql02 mysql]# mysql -u root -h mysql-c01.nix.mds.xyz -p

  • Restore the database using the steps in the first part of this document.

Regards,
BK

[ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out) at gcomm/src/pc.cpp:connect():158

Another issue we can run into is the following set of messages:

2019-06-08T04:56:24.518538Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():158
2019-06-08T04:56:24.518591Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():209: Failed to open backend connection: -110 (Connection timed out)
2019-06-08T04:56:24.518764Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1458: Failed to open channel 'galera_cluster1' at 'gcomm://192.168.0.126,192.168.0.107,192.168.0.114': -110 (Connection timed out)
2019-06-08T04:56:24.518793Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2019-06-08T04:56:24.518812Z 0 [ERROR] WSREP: wsrep::connect(gcomm://192.168.0.126,192.168.0.107,192.168.0.114) failed: 7
2019-06-08T04:56:24.518835Z 0 [ERROR] Aborting

To solve this, take a look at your /etc/my.cnf file.  The following fields have to match the server that you are on:

wsrep_node_address="192.168.0.126"
wsrep_node_name="mysql01"
server_id=1
bind-address=192.168.0.126

If they don't, the above error is thrown.  server_id must be unique for each node.  This can happen when you're restoring your database from one node to another, or trying recovery steps where you copy /etc/my.cnf over to another host.
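A quick sanity check along these lines can catch a copied-over my.cnf. This is only a sketch: it parses a sample file generated on the spot, so point cnf at /etc/my.cnf on a real node and compare the result against the host's interfaces (e.g. ip -o addr).

```shell
# Extract wsrep_node_address from a my.cnf-style file so it can be compared
# with the addresses actually configured on this host.
cnf=$(mktemp)
printf 'wsrep_node_address="192.168.0.126"\nwsrep_node_name="mysql01"\nserver_id=1\n' > "$cnf"
addr=$(awk -F'"' '/^wsrep_node_address/{print $2}' "$cnf")
echo "configured wsrep_node_address: $addr"
rm -f "$cnf"
```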

Another solution involves copying the data directory from the master, or most current host, to the target host we want as the new master.  We did this because we wanted to test with the master offline, working with only 2/3 nodes.  Here is our situation:

mysql01 GOOD
mysql02 BAD
mysql03 BAD

Copy the data dir from mysql01 to mysql02:

mysql02 # cd /var/lib/mysql; scp -rp 'mysql01:/var/lib/mysql/*' .

Set the safe_to_bootstrap flag to 1:

cat grastate.dat
# GALERA saved state
version: 2.1
uuid:    f25fc12b-8a0b-11e9-b58d-bfb801e3b36d
seqno:   -1
safe_to_bootstrap: 1

(If grant tables are an issue) Start mysqld in safe mode to reset the password, if need be:

[root@mysql03 mysql]# systemctl set-environment MYSQLD_OPTS="--wsrep-new-cluster --skip-grant-tables"

(Recommended) Or just set the following variable, indicating this is a newly bootstrapped cluster:

[root@mysql03 mysql]# systemctl set-environment MYSQLD_OPTS="--wsrep-new-cluster"

Bootstrap this node, in other words, make it primary:

mysql02 # /usr/bin/mysqld_bootstrap

Then check:

mysql02 # mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"
+----------------------+-------------+
| Variable_name        | Value       |
+----------------------+-------------+
| wsrep_cluster_status | non-Primary |
+----------------------+-------------+

On the third node, mysql03, remove all files from the /var/lib/mysql folder because we'll let it sync up from mysql02:

mysql03 # cd /var/lib; mv mysql mysql-bk01; mkdir mysql; chown mysql:mysql mysql; cd mysql;
OR
mysql03 # cd /var/lib/mysql; rm -rf *

Start mysql on mysql03 so it syncs from mysql02:

mysql03 # systemctl start mysqld

Let it sync.  You should have an accessible 2/3-node cluster at this point.  If the original node you bootstrapped from exhibits the same issue, also clear its data directory and restart it to allow data to be copied from the two good nodes.

Clear the flag set earlier on the original bootstrapped node and restart it, rebooting to test after the next two steps:

[root@mysql01 mysql]# systemctl set-environment MYSQLD_OPTS="";
[root@mysql01 mysql]# systemctl restart mysqld;

Restart the cluster one node at a time at first, then all together, to verify the cluster comes back up properly.

 

2025 Dec 31st

Another way to recover is to use the last known commit:

mysql02 # /usr/sbin/mysqld --wsrep-recover

And look for a line like this:

mysql02 # 2025-12-31T19:43:57.479659Z 0 [Note] WSREP: Recovered position: 751c8bd5-224f-11ea-b912-c7063c41e2a2:127588425

The number after the final colon is the recovered commit (seqno).  Add it, along with the uuid, to the following file:

mysql02 # cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 751c8bd5-224f-11ea-b912-c7063c41e2a2
seqno: 127588425
safe_to_bootstrap: 1
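That edit can be sketched as follows. The uuid and seqno are the ones recovered above; the sketch works on a temporary copy, whereas on the real node the file is /var/lib/mysql/grastate.dat and mysqld must be stopped first.

```shell
# Patch the uuid:seqno recovered by 'mysqld --wsrep-recover' into a
# grastate.dat-style file and mark the node safe to bootstrap.
f=$(mktemp)
printf '# GALERA saved state\nversion: 2.1\nuuid: 751c8bd5-224f-11ea-b912-c7063c41e2a2\nseqno: -1\nsafe_to_bootstrap: 0\n' > "$f"
sed -i -e 's/^seqno:.*/seqno: 127588425/' -e 's/^safe_to_bootstrap:.*/safe_to_bootstrap: 1/' "$f"
cat "$f"
```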

Then when you start, you should see the following:

mysql02 # /usr/sbin/mysqld --user=mysql --wsrep-new-cluster --daemonize --pid-file=/var/run/mysqld/mysqld.pid --log-error=/var/log/mysqld.log
mysql02 # mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_status | Primary |
+----------------------+---------+

If you get this error:

Dec 31 14:50:21 mysql03.nix.mds.xyz mysqld[30802]: 2025-12-31T19:50:21.276605Z 1 [ERROR] WSREP: Requesting state transfer failed: -77(File descriptor in bad state)
Dec 31 14:50:21 mysql03.nix.mds.xyz mysqld[30802]: 2025-12-31T19:50:21.276730Z 1 [ERROR] WSREP: State transfer request failed unrecoverably: 77 (File descriptor in bad state). Most likely it is due to inability to communicate with the cluster primary component. Restart required.

It is almost certainly SELinux-related:

type=AVC msg=audit(1767210964.307:165110): avc: denied { write } for pid=19521 comm="mysql" name="mysql.sock" dev="dm-0" ino=201702945 scontext=system_u:system_r:zabbix_agent_t:s0 tcontext=unconfined_u:object_r:mysqld_db_t:s0 tclass=sock_file permissive=0
type=AVC msg=audit(1767210965.452:165111): avc: denied { write } for pid=19525 comm="mysql" name="mysql.sock" dev="dm-0" ino=201702945 scontext=system_u:system_r:zabbix_agent_t:s0 tcontext=unconfined_u:object_r:mysqld_db_t:s0 tclass=sock_file permissive=0

on the node being used as the bootstrap node (node 2 in this case).  Use this command to fix that up:

mysql02 # grep AVC /var/log/audit/audit.log* >> /var/log/audit/audit-denied-all.log; cat /var/log/audit/audit-denied-all.log | audit2allow -M systemd-allow; semodule -i systemd-allow.pp

After clearing the mysql03 /var/lib/mysql/ folder of any files and restarting it (see earlier), the successful messages were:

Dec 31 15:20:07 mysql03.nix.mds.xyz rsyncd[6921]: connect from mysql02.nix.mds.xyz (192.168.0.107)
Dec 31 15:20:07 mysql03.nix.mds.xyz rsyncd[6921]: rsync to rsync_sst/ from mysql02.nix.mds.xyz (192.168.0.107)
Dec 31 15:20:07 mysql03.nix.mds.xyz rsyncd[6921]: receiving file list
Dec 31 15:20:07 mysql03.nix.mds.xyz rsyncd[6921]: sent 48 bytes received 190 bytes total size 47
Dec 31 15:20:07 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:07.250428Z 0 [Note] WSREP: 1.0 (mysql02): State transfer to 0.0 (mysql03) complete.
Dec 31 15:20:07 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:07.251352Z 0 [Note] WSREP: Member 1.0 (mysql02) synced with group.
Dec 31 15:20:07 mysql03.nix.mds.xyz mysqld[4008]: WSREP_SST: [INFO] Joiner cleanup. rsync PID: 4068 (20251231 15:20:07.727)
Dec 31 15:20:08 mysql03.nix.mds.xyz mysqld[4008]: WSREP_SST: [INFO] Joiner cleanup done. (20251231 15:20:08.244)
Dec 31 15:20:08 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:08.249592Z 0 [Note] WSREP: SST complete, seqno: 127589412
Dec 31 15:20:08 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:08.301565Z 0 [Note] InnoDB: PUNCH HOLE support available
Dec 31 15:20:08 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:08.301659Z 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
Dec 31 15:20:08 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:08.301683Z 0 [Note] InnoDB: Uses event mutexes
Dec 31 15:20:08 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:08.301696Z 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
Dec 31 15:20:08 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:08.301709Z 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
Dec 31 15:20:08 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:08.301727Z 0 [Note] InnoDB: Using Linux native AIO
Dec 31 15:20:08 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:08.302643Z 0 [Note] InnoDB: Number of pools: 1
Dec 31 15:20:08 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:08.303038Z 0 [Note] InnoDB: Using CPU crc32 instructions
Dec 31 15:20:08 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:08.311476Z 0 [Note] InnoDB: Initializing buffer pool, total size = 128M, instances = 1, chunk size = 128M
Dec 31 15:20:08 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:08.334984Z 0 [Note] InnoDB: Completed initialization of buffer pool
Dec 31 15:20:08 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:08.341182Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
Dec 31 15:20:08 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:08.426660Z 0 [Note] InnoDB: Highest supported file format is Barracuda.
Dec 31 15:20:08 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:08.442088Z 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 434182622771
Dec 31 15:20:09 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:09.210975Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 434187865600
Dec 31 15:20:10 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:10.093181Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 434193108480
Dec 31 15:20:10 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:10.904396Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 434198351360
Dec 31 15:20:11 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:11.628316Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 434203594240
Dec 31 15:20:12 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:12.576014Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 434208837120
Dec 31 15:20:13 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:13.030583Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 434211868844
Dec 31 15:20:13 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:13.036111Z 0 [Note] InnoDB: Database was not shutdown normally!
Dec 31 15:20:13 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:13.036199Z 0 [Note] InnoDB: Starting crash recovery.
Dec 31 15:20:13 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:13.665787Z 0 [Note] InnoDB: 4 transaction(s) which must be rolled back or cleaned up in total 41 row operations to undo
Dec 31 15:20:13 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:13.665893Z 0 [Note] InnoDB: Trx id counter is 403032320
Dec 31 15:20:13 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:13.665951Z 0 [Note] InnoDB: Starting an apply batch of log records to the database…
Dec 31 15:20:16 mysql03.nix.mds.xyz mysqld[4008]: InnoDB: Progress in percent: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
Dec 31 15:20:16 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:16.631310Z 0 [Note] InnoDB: Apply batch completed
Dec 31 15:20:16 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:16.632266Z 0 [Note] InnoDB: Last MySQL binlog file position 0 11806, file name mysql_binary_log.000432
Dec 31 15:20:19 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:19.498303Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
Dec 31 15:20:19 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:19.498340Z 0 [Note] InnoDB: Starting in background the rollback of uncommitted transactions
Dec 31 15:20:19 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:19.498675Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
Dec 31 15:20:19 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:19.498740Z 0 [Note] InnoDB: Rolling back trx with id 403031914, 9 rows to undo
Dec 31 15:20:19 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:19.558833Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
Dec 31 15:20:19 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:19.567082Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
Dec 31 15:20:19 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:19.567148Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
Dec 31 15:20:19 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:19.569051Z 0 [Note] InnoDB: Waiting for purge to start
Dec 31 15:20:19 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:19.619491Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 11278ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
Dec 31 15:20:19 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:19.622393Z 0 [Note] InnoDB: 5.7.25 started; log sequence number 434211868844
Dec 31 15:20:19 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:19.624115Z 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
Dec 31 15:20:19 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:19.722659Z 0 [Note] Plugin 'FEDERATED' is disabled.
Dec 31 15:20:19 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:19.743210Z 0 [Note] InnoDB: Buffer pool(s) load completed at 251231 15:20:19
Dec 31 15:20:20 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:20.131948Z 0 [Note] InnoDB: Rollback of trx with id 403031914 completed
Dec 31 15:20:20 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:20.132056Z 0 [Note] InnoDB: Rolling back trx with id 403031913, 8 rows to undo
Dec 31 15:20:20 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:20.293691Z 0 [Note] InnoDB: Rollback of trx with id 403031913 completed
Dec 31 15:20:20 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:20.293810Z 0 [Note] InnoDB: Rolling back trx with id 403031908, 12 rows to undo
Dec 31 15:20:20 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:20.431091Z 0 [Note] Found ca.pem, server-cert.pem and server-key.pem in data directory. Trying to enable SSL support using them.
Dec 31 15:20:20 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:20.456280Z 0 [Warning] CA certificate ca.pem is self signed.
Dec 31 15:20:20 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:20.459943Z 0 [Note] Server hostname (bind-address): '192.168.0.114'; port: 3306
Dec 31 15:20:20 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:20.460097Z 0 [Note] - '192.168.0.114' resolves to '192.168.0.114';
Dec 31 15:20:20 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:20.460310Z 0 [Note] Server socket created on IP: '192.168.0.114'.
Dec 31 15:20:20 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:20.467839Z 0 [Note] InnoDB: Rollback of trx with id 403031908 completed
Dec 31 15:20:20 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:20.467987Z 0 [Note] InnoDB: Rolling back trx with id 403031888, 12 rows to undo
Dec 31 15:20:20 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:20.637832Z 0 [Note] InnoDB: Rollback of trx with id 403031888 completed
Dec 31 15:20:20 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:20.637915Z 0 [Note] InnoDB: Rollback of non-prepared transactions completed
Dec 31 15:20:20 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:20.811238Z 0 [Note] Failed to start slave threads for channel ''
Dec 31 15:20:21 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:21.227860Z 0 [Note] Event Scheduler: Loaded 0 events
Dec 31 15:20:21 mysql03.nix.mds.xyz mysqld[4008]: 2025-12-31T20:20:21.228014Z 0 [Note] WSREP: Signalling provider to continue.
Dec 31 15:20:21 mysql03.nix.mds.xyz systemd[1]: Started MySQL Server.
-- Subject: Unit mysqld.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit mysqld.service has finished starting up.

-- The start-up result is done.

Now that mysql03 and mysql01 are synced, and only then, stop mysql02 and bring it back to a standard state:

mysql02 # kill $(cat /var/run/mysqld/mysqld.pid)

Start mysqld using systemd:

mysql02 # systemctl start mysqld

Verify:

mysql02 # mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 3     |
+--------------------+-------+

Check sync status:

mysql02 # mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+

Verify by creating a test database on one node:

mysql02 # mysql -e "CREATE DATABASE cluster_test;"

mysql03 # mysql -e "show databases;"
+--------------------+
| Database           |
+--------------------+
| information_schema |
| cluster_test       |
| galera_test        |
| haproxy            |
| mysql              |
| opennebula         |
| performance_schema |
| sys                |
| zabbix             |
+--------------------+

Then drop the database on mysql03 and verify off mysql02:

mysql03 # mysql -e "drop database cluster_test;"

mysql02 # mysql -e "show databases;"
+--------------------+
| Database           |
+--------------------+
| information_schema |
| galera_test        |
| haproxy            |
| mysql              |
| opennebula         |
| performance_schema |
| sys                |
| zabbix             |
+--------------------+

It should be showing as gone on all the nodes.

Thx,
TK

PostgreSQL Full Backup

Let's set up some PostgreSQL backups.

In this post, we'll set up a pull methodology to take backups of our PostgreSQL cluster.  The backups will be saved remotely, so any failure in our cluster is independent of our PostgreSQL backups.

First, let's prepare the postgres account so we can log in remotely without a password.  This will allow us to run the pg_dump command remotely.  Generate a key pair in an account of your choice and exchange the keys with the remote hosts like this:

[root@mbpc-pc .ssh]# cat id_rsa.pub
ssh-rsa <SECRET KEY TEXT>
[root@mbpc-pc .ssh]#
[root@mbpc-pc .ssh]#
[root@mbpc-pc .ssh]# ssh postgres@psql01
FIPS integrity verification test failed.
Last login: Sun Jun  2 18:46:53 2019
-bash-4.2$ cat .ssh/authorized_keys
ssh-rsa <SECRET KEY TEXT>
-bash-4.2$ logout
Connection to psql01 closed.
[root@mbpc-pc .ssh]# ssh postgres@psql02
FIPS integrity verification test failed.
Last login: Sun Jun  2 17:15:30 2019 from mbpc-pc.nix.mds.xyz
-bash-4.2$ logout
Connection to psql02 closed.
[root@mbpc-pc .ssh]# ssh postgres@psql03
FIPS integrity verification test failed.
Last login: Sun Jun  2 18:30:51 2019 from psql01.nix.mds.xyz
-bash-4.2$
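The key exchange above can be sketched with ssh-keygen and ssh-copy-id; the key path below is a throwaway for illustration, and the node names are this post's:

```shell
# Generate a key pair (only needed if one doesn't already exist).
keyfile=/tmp/demo_id_rsa
rm -f "$keyfile" "$keyfile.pub"
ssh-keygen -q -t rsa -b 2048 -N '' -f "$keyfile"
ls -l "$keyfile" "$keyfile.pub"

# Then push the public key to each cluster node:
#   for h in psql01 psql02 psql03; do ssh-copy-id -i "$keyfile.pub" postgres@$h; done
```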

Once you have the keys exchanged, you'll need a .pgpass within the root folder of the postgres account on each of the above cluster hosts:

-bash-4.2$ cat .pgpass
psql-c01.nix.mds.xyz:5432:*:postgres:<SECRET>
psql01.nix.mds.xyz:5432:*:postgres:<SECRET>
psql02.nix.mds.xyz:5432:*:postgres:<SECRET>
psql03.nix.mds.xyz:5432:*:postgres:<SECRET>
-bash-4.2$
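One assumption worth making explicit: libpq ignores .pgpass unless the file is readable by its owner only, so lock it down on each node (the filename below is a stand-in for ~postgres/.pgpass):

```shell
# .pgpass must be mode 0600 or psql/pg_dump will silently skip it.
touch .pgpass.demo          # stand-in for ~postgres/.pgpass
chmod 0600 .pgpass.demo
stat -c '%a' .pgpass.demo   # prints 600
```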

Next, we'll write a short script to log in to a postgres node and take a backup, saving the gzipped file locally on the calling server:

[root@mbpc-pc .ssh]# cat /mnt/postgres-backup.sh
#!/bin/bash

PSQLH="";
PSQLR="";

# For a reason I've yet to investigate, the cluster IP doesn't work here.  So we determine the node with a running instance the quick-and-dirty way.
for KEY in $( echo psql-c01 psql01 psql02 psql03 ); do
        PSQLH=$( ssh postgres@$KEY "hostname" 2>/dev/null );
        PSQLR=$( ssh postgres@$KEY "ps -ef|grep -Ei \"pgsql-10.*postgres\"|grep -v grep" 2>/dev/null);
        [[ "$PSQLR" != "" ]] && {
                echo $PSQLH"|"$PSQLR; break;
        };
done

[[ "$PSQLH" == "" ]] && {
        echo "ERROR: PSQLH var was empty.  Should be a hostname.";
        exit 1;
};

ssh postgres@psql-c01.nix.mds.xyz "pg_dumpall -U postgres -h $PSQLH -p 5432 | gzip -vc" > ./psql-c01.sql.$(date +%s).gz && find /mnt/SomeBigDisk/psql-backup/ -type f -name '*.sql.*.gz' -mtime +180 -exec rm {} \;
[root@mbpc-pc .ssh]#

 

Notice the find line in the above.  It will clear out any backups older than 180 days.
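The retention rule can be exercised in isolation; the demo directory below is an assumption for illustration, not the path from the script:

```shell
# Anything matching *.sql.*.gz and older than 180 days gets removed.
mkdir -p /tmp/psql-backup-demo
touch -d '200 days ago' /tmp/psql-backup-demo/psql-c01.sql.1.gz   # stale
touch /tmp/psql-backup-demo/psql-c01.sql.2.gz                     # fresh
find /tmp/psql-backup-demo -type f -name '*.sql.*.gz' -mtime +180 -exec rm {} \;
ls /tmp/psql-backup-demo   # only the fresh file remains
```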

Schedule this via cron or any other scheduling software you prefer:

[root@mbpc-pc .ssh]# crontab -l|grep postgres
30 3 * * * /mnt/postgres-backup.sh
[root@mbpc-pc .ssh]#

The beauty of this method is that pg_dump will always be the correct version to match the PostgreSQL software running on the node, and it leaves no intermediate files on the cluster.  You won't need to ensure pg_dump always matches your DB cluster.

Thx,
TK

PANIC:  replication checkpoint has wrong magic 0 instead of 307747550

So we ran into a little problem getting our PostgreSQL Patroni w/ ETCD cluster going after a rather serious failure.

# sudo su - postgres

$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --listen_addresses=192.168.0.118 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100 --hot_standby=on --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5

This resulted in one of the three messages below; hence this post.  If I can start a single instance, I should be fine, since I could then 1) replicate over to the other two nodes or 2) simply take a dump, reinitialize all the databases, then restore the dump.

Using the above procedure I get one of three error messages when using the data files of each node:

[ PSQL01 ]
postgres: postgres: startup process waiting for 000000010000000000000008

[ PSQL02 ]
PANIC:  replication checkpoint has wrong magic 0 instead of 307747550

[ PSQL03 ]
FATAL:  syntax error in history file: f2W

 

Unfortunately, we couldn't do anything about PSQL03 and PSQL01, the standbys, since their database base/ folders were way out of sync, meaning there were no tables there:

[ PSQL03 ]

[root@psql03 base]# ls -altri
total 40
    42424 drwx------.  2 postgres postgres 8192 Oct 29  2018 1
 67714749 drwx------.  2 postgres postgres 8192 Oct 29  2018 13805
202037206 drwx------.  5 postgres postgres   38 Oct 29  2018 .
134312175 drwx------.  2 postgres postgres 8192 May 22 01:55 13806
    89714 drwxr-xr-x. 20 root     root     4096 May 22 22:43 ..
[root@psql03 base]#

 

[ PSQL02 ]

[root@psql02 base]# ls -altri
total 412
201426668 drwx------.  2 postgres postgres  8192 Oct 29  2018 1
   743426 drwx------.  2 postgres postgres  8192 Mar 24 03:47 13805
135326327 drwx------.  2 postgres postgres 16384 Mar 24 20:15 40970
   451699 drwx------.  2 postgres postgres 40960 Mar 25 19:47 16395
  1441696 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131137
 68396137 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131138
135671065 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131139
204353100 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131140
135326320 drwx------. 17 postgres postgres  4096 Apr 14 10:08 .
 68574415 drwx------.  2 postgres postgres 12288 Apr 28 06:06 131142
   288896 drwx------.  2 postgres postgres 16384 Apr 28 06:06 131141
203015232 drwx------.  2 postgres postgres  8192 Apr 28 06:06 131136
135326328 drwx------.  2 postgres postgres 40960 May  5 22:09 24586
 67282461 drwx------.  2 postgres postgres  8192 May  5 22:09 13806
 67640961 drwx------.  2 postgres postgres 20480 May  5 22:09 131134
203500274 drwx------.  2 postgres postgres 16384 May  5 22:09 155710
134438257 drwxr-xr-x. 20 root     root      4096 May 22 01:44 ..
[root@psql02 base]# pwd
/root/postgres-patroni-backup/base
[root@psql02 base]#

 

[ PSQL01 ]

[root@psql01 base]# ls -altri
total 148
134704615 drwx------.  2 postgres postgres  8192 Oct 29  2018 1
201547700 drwx------.  2 postgres postgres  8192 Oct 29  2018 13805
   160398 drwx------.  2 postgres postgres  8192 Feb 24 23:53 13806
 67482137 drwx------.  7 postgres postgres    62 Feb 24 23:54 .
135909671 drwx------.  2 postgres postgres 24576 Feb 24 23:54 24586
134444555 drwx------.  2 postgres postgres 24576 Feb 24 23:54 16395
 67178716 drwxr-xr-x. 20 root     root      4096 May 22 01:53 ..
[root@psql01 base]# pwd
/root/postgresql-patroni-etcd/base
[root@psql01 base]#

So we could only work with PSQL02, the original primary node.  Every other node has nothing.

Looks like our replorigin_checkpoint is at fault resulting in a rather nasty replication error:

open("pg_wal/000000BE000000000000004C", O_RDONLY) = 5
open("pg_wal/000000BE000000000000004C", O_RDONLY) = 5
openat(AT_FDCWD, "base", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
openat(AT_FDCWD, "pg_tblspc", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
openat(AT_FDCWD, "pg_replslot", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
openat(AT_FDCWD, "pg_replslot", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
open("pg_logical/replorigin_checkpoint", O_RDONLY) = 6
write(2, "2019-06-02 14:50:34.777 EDT [283"..., 1062019-06-02 14:50:34.777 EDT [28362] PANIC:  replication checkpoint has wrong magic 0 instead of 307747550
-bash-4.2$ cat pg_logical/replorigin_checkpoint
cat: pg_logical/replorigin_checkpoint: No such file or directory
-bash-4.2$ pwd
/data/patroni/tmp
-bash-4.2$ cd ..
-bash-4.2$ cat pg_logical/replorigin_checkpoint
øÉKíÛ0ðð bø{nð- Ðð à Ð à4Ø4-bash-4.2$
-bash-4.2$
-bash-4.2$ strings pg_logical/replorigin_checkpoint
-bash-4.2$ ls -altri pg_logical/replorigin_checkpoint
67894871 -rw-------. 1 postgres postgres 16384 Oct 29  2018 pg_logical/replorigin_checkpoint
-bash-4.2$ ls -altri pg_logical/
total 20
 67894871 -rw-------.  1 postgres postgres 16384 Oct 29  2018 replorigin_checkpoint
136946383 drwx------.  2 postgres postgres     6 Oct 29  2018 snapshots
204367784 drwx------.  2 postgres postgres     6 Oct 29  2018 mappings
 67894870 drwx------.  4 postgres postgres    65 Apr 28 06:06 .
135326272 drwx------. 21 postgres postgres  4096 Jun  2 14:50 ..
-bash-4.2$


So let's copy a good one from another host (I guess we could delete it but I haven't tried):


[root@psql03 pg_logical]#
[root@psql03 pg_logical]# ls -altri
total 8
 68994432 drwx------.  2 postgres postgres    6 Oct 29  2018 snapshots
134984156 drwx------.  2 postgres postgres    6 Oct 29  2018 mappings
   566745 -rw-------.  1 postgres postgres    8 May 22 01:55 replorigin_checkpoint
   566731 drwx------.  4 postgres postgres   65 May 22 01:55 .
    89714 drwxr-xr-x. 20 root     root     4096 May 22 22:43 ..
[root@psql03 pg_logical]#
[root@psql03 pg_logical]#
[root@psql03 pg_logical]#
[root@psql03 pg_logical]# scp replorigin_checkpoint psql02:/data/patroni/pg_logical/
Password:
replorigin_checkpoint                                                                                 100%    8    10.1KB/s   00:00
[root@psql03 pg_logical]#
[root@psql03 pg_logical]#


Now we can get to the backend in standalone mode:


-bash-4.2$
-bash-4.2$ /usr/pgsql-10/bin/postgres --single -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=off --listen_addresses=192.168.0.124 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=20 -d 5 2>&1
2019-06-02 15:00:48.981 EDT [29057] DEBUG:  invoking IpcMemoryCreate(size=144687104)
2019-06-02 15:00:48.982 EDT [29057] DEBUG:  mmap(144703488) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-06-02 15:00:48.993 EDT [29057] DEBUG:  SlruScanDirectory invoking callback on pg_notify/0000
2019-06-02 15:00:48.993 EDT [29057] DEBUG:  removing file "pg_notify/0000"
2019-06-02 15:00:48.993 EDT [29057] DEBUG:  dynamic shared memory system will support 128 segments
2019-06-02 15:00:48.994 EDT [29057] DEBUG:  created dynamic shared memory control segment 1025202362 (3088 bytes)
2019-06-02 15:00:48.994 EDT [29057] DEBUG:  InitPostgres
2019-06-02 15:00:48.994 EDT [29057] DEBUG:  my backend ID is 1
2019-06-02 15:00:48.994 EDT [29057] LOG:  database system was interrupted; last known up at 2019-04-28 06:06:24 EDT
2019-06-02 15:00:49.265 EDT [29057] LOG:  invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-02 15:00:49.266 EDT [29057] LOG:  invalid primary checkpoint record
2019-06-02 15:00:49.266 EDT [29057] LOG:  using previous checkpoint record at 0/4C34EDA8
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  redo record is at 0/4C34ED70; shutdown FALSE
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  next transaction ID: 0:1409831; next OID: 237578
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  next MultiXactId: 48; next MultiXactOffset: 174
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  oldest unfrozen transaction ID: 549, in database 1
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  oldest MultiXactId: 1, in database 1
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  commit timestamp Xid oldest/newest: 0/0
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  transaction ID wrap limit is 2147484196, limited by database with OID 1
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  starting up replication slots
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  starting up replication origin progress state
2019-06-02 15:00:49.266 EDT [29057] LOG:  database system was not properly shut down; automatic recovery in progress
2019-06-02 15:00:49.267 EDT [29057] DEBUG:  resetting unlogged relations: cleanup 1 init 0
2019-06-02 15:00:49.269 EDT [29057] LOG:  redo starts at 0/4C34ED70
2019-06-02 15:00:49.273 EDT [29057] DEBUG:  attempting to remove WAL segments newer than log file 000000BE000000000000004C
2019-06-02 15:00:49.273 EDT [29057] LOG:  invalid record length at 0/4C35CDC0: wanted 24, got 0
2019-06-02 15:00:49.273 EDT [29057] LOG:  redo done at 0/4C35CD90
2019-06-02 15:00:49.273 EDT [29057] LOG:  last completed transaction was at log time 2019-04-28 06:05:44.017446-04
2019-06-02 15:00:49.273 EDT [29057] DEBUG:  resetting unlogged relations: cleanup 0 init 1
2019-06-02 15:00:49.280 EDT [29057] DEBUG:  performing replication slot checkpoint
2019-06-02 15:00:49.288 EDT [29057] DEBUG:  attempting to remove WAL segments older than log file 000000000000000000000043
2019-06-02 15:00:49.289 EDT [29057] DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-06-02 15:00:49.290 EDT [29057] DEBUG:  oldest MultiXactId member is at offset 1
2019-06-02 15:00:49.290 EDT [29057] DEBUG:  MultiXact member stop limit is now 4294914944 based on MultiXact 1
2019-06-02 15:00:49.292 EDT [29057] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-02 15:00:49.302 EDT [29057] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0

PostgreSQL stand-alone backend 10.5
backend>


But we choose not to use the standalone backend at this time.  Instead, we'll start the database as Patroni would, using the following command:


-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=off --listen_addresses=192.168.0.124 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=20 -d 5 2>&1
2019-06-02 15:11:55.379 EDT [29789] DEBUG:  postgres: PostmasterMain: initial environment dump:
2019-06-02 15:11:55.380 EDT [29789] DEBUG:  -----------------------------------------
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      XDG_SESSION_ID=171
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      HOSTNAME=psql02.nix.mds.xyz
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      SHELL=/bin/bash
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      TERM=xterm
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      HISTSIZE=1000
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      USER=postgres
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      MAIL=/var/spool/mail/postgres
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/pgsql-10/bin/
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      PWD=/data/patroni/pg_logical
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LANG=en_US.UTF-8
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      HISTCONTROL=ignoredups
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      SHLVL=1
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      HOME=/var/lib/pgsql
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LOGNAME=postgres
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      PGDATA=/data/patroni
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LESSOPEN=||/usr/bin/lesspipe.sh %s
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      _=/usr/pgsql-10/bin/postgres
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      OLDPWD=/data/patroni
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      PGLOCALEDIR=/usr/pgsql-10/share/locale
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      PGSYSCONFDIR=/etc/sysconfig/pgsql
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LC_COLLATE=en_US.UTF-8
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LC_CTYPE=en_US.UTF-8
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LC_MESSAGES=en_US.UTF-8
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LC_MONETARY=C
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LC_NUMERIC=C
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LC_TIME=C
2019-06-02 15:11:55.380 EDT [29789] DEBUG:  -----------------------------------------
2019-06-02 15:11:55.383 EDT [29789] LOG:  listening on IPv4 address "192.168.0.124", port 5432
2019-06-02 15:11:55.385 EDT [29789] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-06-02 15:11:55.386 EDT [29789] DEBUG:  invoking IpcMemoryCreate(size=144687104)
2019-06-02 15:11:55.387 EDT [29789] DEBUG:  mmap(144703488) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-06-02 15:11:55.398 EDT [29789] DEBUG:  SlruScanDirectory invoking callback on pg_notify/0000
2019-06-02 15:11:55.398 EDT [29789] DEBUG:  removing file "pg_notify/0000"
2019-06-02 15:11:55.398 EDT [29789] DEBUG:  dynamic shared memory system will support 128 segments
2019-06-02 15:11:55.398 EDT [29789] DEBUG:  created dynamic shared memory control segment 721092148 (3088 bytes)
2019-06-02 15:11:55.401 EDT [29789] DEBUG:  max_safe_fds = 985, usable_fds = 1000, already_open = 5
2019-06-02 15:11:55.404 EDT [29789] LOG:  redirecting log output to logging collector process
2019-06-02 15:11:55.404 EDT [29789] HINT:  Future log output will appear in directory "log".

And voila!  We are in our database and can see all of our databases:

-bash-4.2$ psql -h psql02 -p 5432 -W
Password:
psql (10.5)
Type "help" for help.

postgres=# \l
                                          List of databases
      Name       |    Owner     | Encoding |   Collate   |    Ctype    |      Access privileges
-----------------+--------------+----------+-------------+-------------+-----------------------------
 amon_mws01      | amon_mws01   | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 awx             | awx          | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 confluence      | postgres     | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =Tc/postgres               +
                 |              |          |             |             | postgres=CTc/postgres      +
                 |              |          |             |             | confluenceuser=CTc/postgres
 hue_mws01       | hue_mws01    | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 metastore_mws01 | hive_mws01   | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 nav_mws01       | nav_mws01    | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 navms_mws01     | navms_mws01  | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 oozie_mws01     | oozie_mws01  | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 postgres        | postgres     | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 rman_mws01      | rman_mws01   | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 scm_mws01       | scm_mws01    | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 sentry_mws01    | sentry_mws01 | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0       | postgres     | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres                +
                 |              |          |             |             | postgres=CTc/postgres
 template1       | postgres     | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres                +
                 |              |          |             |             | postgres=CTc/postgres
 twr             | postgres     | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =Tc/postgres               +
                 |              |          |             |             | postgres=CTc/postgres      +
                 |              |          |             |             | twr=CTc/postgres
(15 rows)

postgres=#

And checking after to ensure PostgreSQL / Patroni replicated everything just fine:

[root@psql01 base]# ls -altri
total 364
 67518653 drwx------.  2 postgres postgres  8192 Jun  2 20:35 1
134766555 drwx------.  2 postgres postgres  8192 Jun  2 20:35 13805
   152733 drwx------.  2 postgres postgres 24576 Jun  2 20:35 16395
 68698149 drwx------.  2 postgres postgres 24576 Jun  2 20:35 24586
134741283 drwx------.  2 postgres postgres 16384 Jun  2 20:35 40970
202922441 drwx------.  2 postgres postgres 16384 Jun  2 20:35 131134
   871098 drwx------.  2 postgres postgres  8192 Jun  2 20:35 131136
 68026687 drwx------.  2 postgres postgres  8192 Jun  2 20:35 131137
135079123 drwx------.  2 postgres postgres  8192 Jun  2 20:35 131138
202874795 drwx------.  2 postgres postgres  8192 Jun  2 20:35 131139
   871469 drwx------.  2 postgres postgres  8192 Jun  2 20:35 131140
 68280133 drwx------.  2 postgres postgres 16384 Jun  2 20:35 131141
135080185 drwx------.  2 postgres postgres 12288 Jun  2 20:35 131142
   152732 drwx------. 17 postgres postgres  4096 Jun  2 20:35 .
202879025 drwx------.  2 postgres postgres 16384 Jun  2 20:35 155710
 67482133 drwx------. 21 postgres postgres  4096 Jun  2 20:36 ..
201711623 drwx------.  2 postgres postgres  8192 Jun  2 20:36 13806
[root@psql01 base]#
[root@psql01 base]#
[root@psql01 base]#
[root@psql01 base]# pwd
/data/patroni/base
[root@psql01 base]#

 

[root@psql02 base]# ls -altri
total 368
204367267 drwx------.  2 postgres postgres  8192 Mar 24 03:47 13805
 68669097 drwx------. 17 postgres postgres  4096 Apr 14 10:08 .
204362619 drwx------.  2 postgres postgres 16384 Jun  2 20:31 40970
134473951 drwx------.  2 postgres postgres 24576 Jun  2 20:31 24586
 68669102 drwx------.  2 postgres postgres 24576 Jun  2 20:31 16395
138812710 drwx------.  2 postgres postgres  8192 Jun  2 20:31 1
204366769 drwx------.  2 postgres postgres 12288 Jun  2 20:31 131142
136945631 drwx------.  2 postgres postgres 16384 Jun  2 20:31 131141
 67894451 drwx------.  2 postgres postgres  8192 Jun  2 20:31 131140
  1403920 drwx------.  2 postgres postgres  8192 Jun  2 20:31 131139
204366412 drwx------.  2 postgres postgres  8192 Jun  2 20:31 131138
136945273 drwx------.  2 postgres postgres  8192 Jun  2 20:31 131137
 67894080 drwx------.  2 postgres postgres  8192 Jun  2 20:31 131136
  1403182 drwx------.  2 postgres postgres 16384 Jun  2 20:31 131134
  1404278 drwx------.  2 postgres postgres 16384 Jun  2 20:31 155710
 11395780 drwx------.  2 postgres postgres  8192 Jun  2 20:31 13806
135326272 drwx------. 21 postgres postgres  4096 Jun  2 20:31 ..
[root@psql02 base]# pwd
/data/patroni/base
[root@psql02 base]#

 

[root@psql03 audit]# cd /data/patroni/base/
[root@psql03 base]# ls -altri
total 372
 67130854 drwx------.  2 postgres postgres  8192 Jun  2 20:37 1
134446297 drwx------.  2 postgres postgres  8192 Jun  2 20:37 13805
    79298 drwx------.  2 postgres postgres 24576 Jun  2 20:37 16395
 69209007 drwx------.  2 postgres postgres 24576 Jun  2 20:37 24586
135152677 drwx------.  2 postgres postgres 16384 Jun  2 20:37 40970
201954381 drwx------.  2 postgres postgres 16384 Jun  2 20:37 131134
    80500 drwx------.  2 postgres postgres  8192 Jun  2 20:37 131136
 68241705 drwx------.  2 postgres postgres  8192 Jun  2 20:37 131137
134443358 drwx------.  2 postgres postgres  8192 Jun  2 20:37 131138
201808206 drwx------.  2 postgres postgres  8192 Jun  2 20:37 131139
    80871 drwx------.  2 postgres postgres  8192 Jun  2 20:37 131140
 68242063 drwx------.  2 postgres postgres 16384 Jun  2 20:37 131141
134443716 drwx------.  2 postgres postgres 12288 Jun  2 20:37 131142
    79297 drwx------. 17 postgres postgres  4096 Jun  2 20:37 .
201828372 drwx------.  2 postgres postgres 16384 Jun  2 20:37 155710
134812989 drwx------. 21 postgres postgres  4096 Jun  2 20:38 ..
201807458 drwx------.  2 postgres postgres  8192 Jun  2 20:38 13806
[root@psql03 base]# pwd
/data/patroni/base
[root@psql03 base]#
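Eyeballing those listings across three nodes is error-prone, especially since inode numbers and timestamps legitimately differ between replicas; only the set of database OID directories should match. A quick sketch is to hash the sorted directory names on each node (the fingerprint helper is illustrative, and /data/patroni/base is this cluster's path):

```shell
# Fingerprint a data directory by hashing its sorted entry names;
# matching fingerprints on every node suggest the same set of databases.
fingerprint() { ls "$1" | sort | md5sum | cut -d' ' -f1; }

# On each node, against this cluster's layout:
#   fingerprint /data/patroni/base

# Demo against a scratch directory with a few OID-style entries:
d=$(mktemp -d)
mkdir "$d/1" "$d/13805" "$d/16395"
fingerprint "$d"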

And that should get you up and running again.  Don't forget to set up some decent PostgreSQL backups as well.

ALTERNATE

We have not tried this ourselves, but the above could potentially also be resolved with pg_resetwal:

[root@psql03 ~]# find / -iname pg_resetwal
/usr/pgsql-10/bin/pg_resetwal
[root@psql03 ~]#
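A hedged sketch of how that might look. Again, we have not run this: pg_resetwal can discard recently committed data, so stop Patroni/PostgreSQL on the node first and always dry-run with -n (which only prints what would change) before doing anything destructive:

```shell
# Untested sketch: pg_resetwal rewrites WAL control data when the WAL is
# corrupt or missing. Dangerous: it can lose recent commits.
PGBIN=/usr/pgsql-10/bin      # path found with the find command above
DATADIR=/data/patroni
if [ -x "$PGBIN/pg_resetwal" ]; then
  sudo -u postgres "$PGBIN/pg_resetwal" -n "$DATADIR"   # dry run, changes nothing
  # sudo -u postgres "$PGBIN/pg_resetwal" "$DATADIR"    # the real reset
else
  echo "pg_resetwal not found under $PGBIN"
fi
```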

 

Thx,
TK

FATAL:  the database system is starting up

If you receive the following when PostgreSQL (with Patroni) is starting up:

2019-04-04 14:59:15.715 EDT [26025] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000390000000000000008 has already been removed
2019-04-04 14:59:16.420 EDT [26029] FATAL:  the database system is starting up

consider running the postgres command line on its own in debug mode, like this, to reveal the true cause:

-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100 -d 5
2019-05-23 08:40:23.585 EDT [10792] DEBUG:  postgres: PostmasterMain: initial environment dump:
2019-05-23 08:40:23.586 EDT [10792] DEBUG:  -----------------------------------------
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      XDG_SESSION_ID=25
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      HOSTNAME=psql01.nix.mds.xyz
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      SHELL=/bin/bash
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      TERM=xterm
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      HISTSIZE=1000
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      USER=postgres
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      MAIL=/var/spool/mail/postgres
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/pgsql-10/bin/
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      PWD=/data/patroni
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LANG=en_US.UTF-8
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      HISTCONTROL=ignoredups
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      SHLVL=1
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      HOME=/var/lib/pgsql
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LOGNAME=postgres
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      PGDATA=/data/patroni
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LESSOPEN=||/usr/bin/lesspipe.sh %s
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      _=/usr/pgsql-10/bin/postgres
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      OLDPWD=/data
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      PGLOCALEDIR=/usr/pgsql-10/share/locale
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      PGSYSCONFDIR=/etc/sysconfig/pgsql
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_COLLATE=en_US.UTF-8
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_CTYPE=en_US.UTF-8
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_MESSAGES=en_US.UTF-8
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_MONETARY=C
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_NUMERIC=C
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_TIME=C
2019-05-23 08:40:23.586 EDT [10792] DEBUG:  -----------------------------------------
2019-05-23 08:40:23.589 EDT [10792] DEBUG:  registering background worker "logical replication launcher"
2019-05-23 08:40:23.590 EDT [10792] LOG:  listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:40:23.595 EDT [10792] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:40:23.597 EDT [10792] DEBUG:  invoking IpcMemoryCreate(size=148545536)
2019-05-23 08:40:23.598 EDT [10792] DEBUG:  mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-05-23 08:40:23.619 EDT [10792] DEBUG:  SlruScanDirectory invoking callback on pg_notify/0000
2019-05-23 08:40:23.619 EDT [10792] DEBUG:  removing file "pg_notify/0000"
2019-05-23 08:40:23.619 EDT [10792] DEBUG:  dynamic shared memory system will support 288 segments
2019-05-23 08:40:23.620 EDT [10792] DEBUG:  created dynamic shared memory control segment 499213675 (6928 bytes)
2019-05-23 08:40:23.623 EDT [10792] DEBUG:  max_safe_fds = 985, usable_fds = 1000, already_open = 5
2019-05-23 08:40:23.626 EDT [10792] LOG:  redirecting log output to logging collector process
2019-05-23 08:40:23.626 EDT [10792] HINT:  Future log output will appear in directory "log".
^C2019-05-23 08:41:04.346 EDT [10793] DEBUG:  logger shutting down
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  proc_exit(0): 0 callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  exit(0)
-bash-4.2$ 2019-05-23 08:41:04.346 EDT [10793] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  proc_exit(-1): 0 callbacks to make

-bash-4.2$
 

-bash-4.2$ free
              total        used        free      shared  buff/cache   available
Mem:        3881708      218672     1687436      219292     1975600     3113380
Swap:       4063228           0     4063228
-bash-4.2$

 

The "mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory" line above indicates a lack of memory on this VM, typically caused by memory overcommitment on the underlying physical host.  You'll need to a) assign more memory to the VM if the physical host has plenty to spare, b) purchase more memory for the physical host, or c) relocate the VM to a host with more free memory.  If this doesn't solve the problem, we need to look deeper and check the running process using strace:
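As a rough sanity check, you can compare MemAvailable against the shared-memory segment size PostgreSQL tried to allocate in the debug output above (the byte count below is hard-coded from that log line; Linux is assumed):

```shell
# Compare MemAvailable with the ~142 MB segment from the mmap() line above.
need_kb=$((148897792 / 1024))                           # size from the debug log
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
avail_kb=${avail_kb:-0}                                 # very old kernels lack MemAvailable
if [ "$avail_kb" -lt "$need_kb" ]; then
  echo "short on memory: need ${need_kb} kB, have ${avail_kb} kB"
else
  echo "memory looks OK: ${avail_kb} kB available"
fi
```

Note that the huge-page mmap failure on its own is non-fatal (PostgreSQL falls back to regular pages), but it is a useful symptom of memory pressure worth checking this way.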

[root@psql01 ~]# ps -ef|grep -Ei "patroni|postgres"
root      2217  2188  0 00:38 pts/1    00:00:00 tail -f postgresql-Thu.log
postgres  2512     1  4 00:42 ?        00:00:01 /usr/bin/python2 /bin/patroni /etc/patroni.yml
postgres  2533     1  0 00:42 ?        00:00:00 /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
postgres  2535  2533  0 00:42 ?        00:00:00 postgres: postgres: logger process
postgres  2536  2533  0 00:42 ?        00:00:00 postgres: postgres: startup process   waiting for 000000010000000000000008
root      2664  2039  0 00:42 pts/0    00:00:00 grep --color=auto -Ei patroni|postgres
[root@psql01 ~]#

Then trace the startup process from the listing above (PID 2536, waiting for WAL segment 000000010000000000000008):

[root@psql01 ~]# strace -p 2536
read(5, 0x7fff9cb4eb87, 1)              = -1 EAGAIN (Resource temporarily unavailable)
read(5, 0x7fff9cb4eb87, 1)              = -1 EAGAIN (Resource temporarily unavailable)
open("pg_wal/00000098.history", O_RDONLY) = -1 ENOENT (No such file or directory)
epoll_create1(EPOLL_CLOEXEC)            = 3
epoll_ctl(3, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=16954624, u64=16954624}}) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=16954648, u64=16954648}}) = 0
epoll_wait(3, ^Cstrace: Process 2536 detached
 <detached …>
[root@psql01 ~]# 

Ensure you set permissions on the copied files as well or you may receive this:

[root@psql01 pg_wal]# tail -f ../log/postgresql-Fri.log
2019-05-24 01:22:32.979 EDT [13127] LOG:  aborting startup due to startup process failure
2019-05-24 01:22:32.982 EDT [13127] LOG:  database system is shut down
2019-05-24 01:22:33.692 EDT [13146] LOG:  database system was shut down in recovery at 2019-05-24 01:15:31 EDT
2019-05-24 01:22:33.693 EDT [13146] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-24 01:22:33.693 EDT [13146] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-24 01:22:33.693 EDT [13146] FATAL:  could not open file "pg_wal/0000003A.history": Permission denied
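Files copied between nodes as root end up root-owned, which produces exactly this "Permission denied" error. A small sketch of the cleanup (the fix_wal_perms helper and the scratch-directory demo are illustrative; the real target is /data/patroni/pg_wal):

```shell
# Restore the ownership and 0600 mode PostgreSQL expects on WAL and
# history files; skips subdirectories such as archive_status/.
fix_wal_perms() {
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    chown postgres:postgres "$f" 2>/dev/null || true  # needs root + a postgres user
    chmod 600 "$f"
  done
}

# On the broken replica:
#   fix_wal_perms /data/patroni/pg_wal

# Demo on a scratch directory:
w=$(mktemp -d)
touch "$w/0000003A.history"; chmod 644 "$w/0000003A.history"
fix_wal_perms "$w"
stat -c '%a' "$w/0000003A.history"   # -> 600
```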

Alternatively, the missing segments could be copied over to the node that lacks them in its archive. However, this did not work for me, since the primary appeared to be too far ahead of the failed replica (about five days' worth of WAL):

psql02 # patronictl -c /etc/patroni.yml list; du -sh .
+----------+-------------+--------------------+--------+---------+-----------+
| Cluster  |    Member   |        Host        |  Role  |  State  | Lag in MB |
+----------+-------------+--------------------+--------+---------+-----------+
| postgres | postgresql0 | psql01.nix.mds.xyz |        | running |     235.0 |
| postgres | postgresql1 | psql02.nix.mds.xyz | Leader | running |       0.0 |
| postgres | postgresql2 | psql03.nix.mds.xyz |        | running |       0.0 |
+----------+-------------+--------------------+--------+---------+-----------+
193M    .
psql02 # pwd
/data/patroni/pg_wal/
psql02 # ls -altri
total 196952
140042537 -rw-------.  1 postgres postgres    54942 Dec 27  2021 000004C4.history
140121953 -rw-------.  1 postgres postgres    54989 Dec 27  2021 000004C5.history
140042529 -rw-------.  1 postgres postgres    55036 Dec 27  2021 000004C6.history
140042535 -rw-------.  1 postgres postgres    55083 Jan 15  2022 000004C7.history
140042527 -rw-------.  1 postgres postgres    55130 Jan 16  2022 000004C8.history
140042506 -rw-------.  1 postgres postgres    55177 Sep 30 19:21 000004C9.history
204315808 drwx------.  2 postgres postgres        6 Sep 30 23:01 archive_status
135355261 -rw-------.  1 postgres postgres 16777216 Dec 17 22:26 000004C90000003E000000A7
135127587 -rw-------.  1 postgres postgres 16777216 Dec 17 23:42 000004C90000003E000000A8
 68940244 drwx------. 21 postgres postgres     4096 Dec 18 00:00 ..
140042505 -rw-------.  1 postgres postgres 16777216 Dec 18 01:30 000004C90000003E000000A9
140042509 -rw-------.  1 postgres postgres 16777216 Dec 18 03:17 000004C90000003E0000009E
140042507 -rw-------.  1 postgres postgres 16777216 Dec 18 05:07 000004C90000003E0000009F
135355263 -rw-------.  1 postgres postgres 16777216 Dec 18 06:50 000004C90000003E000000A0
140042510 -rw-------.  1 postgres postgres 16777216 Dec 18 08:40 000004C90000003E000000A1
135127616 -rw-------.  1 postgres postgres 16777216 Dec 18 10:28 000004C90000003E000000A2
135341490 -rw-------.  1 postgres postgres 16777216 Dec 18 12:13 000004C90000003E000000A3
135355259 -rw-------.  1 postgres postgres 16777216 Dec 18 14:01 000004C90000003E000000A4
140042524 -rw-------.  1 postgres postgres 16777216 Dec 18 15:51 000004C90000003E000000A5
140121952 drwx------.  3 postgres postgres     4096 Dec 18 15:51 .
135355260 -rw-------.  1 postgres postgres 16777216 Dec 18 17:21 000004C90000003E000000A6
psql02 # scp * root@psql01.nix.mds.xyz:/data/patroni/pg_wal/

Checking replication again:

# patronictl -c /etc/patroni.yml list;
+----------+-------------+--------------------+--------+---------+-----------+
| Cluster  |    Member   |        Host        |  Role  |  State  | Lag in MB |
+----------+-------------+--------------------+--------+---------+-----------+
| postgres | postgresql0 | psql01.nix.mds.xyz |        | running |       0.0 |
| postgres | postgresql1 | psql02.nix.mds.xyz | Leader | running |       0.0 |
| postgres | postgresql2 | psql03.nix.mds.xyz |        | running |       0.0 |
+----------+-------------+--------------------+--------+---------+-----------+

Still another alternative is to set

wal_keep_segments

to a value higher than the default of '8'.  For example:

# patronictl -c /etc/patroni.yml edit-config
loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    max_connections: 256
    max_replication_slots: 64
    max_wal_senders: 32
    wal_keep_segments: 64
  use_pg_rewind: true
retry_timeout: 10
ttl: 30

However, note that some Patroni versions carried a notable bug in how this setting was handled:

https://github.com/zalando/patroni/issues/2062

patroni/patroni/postgresql/config.py

Line 837 in 258e7e2

 parameters.setdefault('wal_keep_size', str(wal_keep_segments * 16) + 'MB') 
2021-09-06 13:31:38,413 WARNING: Value=100100100100100100100100100100100100100100100100MB of parameter=wal_keep_size is too big, decreasing to 2147483647MB
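The absurd value in that warning is Python string repetition at work: wal_keep_segments arrives from the YAML config as the string '100', and multiplying a string by 16 repeats the text instead of doing arithmetic. A minimal illustration (python3 assumed available):

```shell
# The buggy path: the value is the *string* '100', mirroring the
# str(wal_keep_segments * 16) + 'MB' line above.
python3 -c "s='100'; print(str(s * 16) + 'MB')"       # repeats '100' 16 times, then appends MB
# The intended arithmetic, after casting to int first:
python3 -c "s='100'; print(str(int(s) * 16) + 'MB')"  # -> 1600MB
```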

This code was introduced by a change shipped in a release more recent than the one used here:

https://github.com/zalando/patroni/pull/1654/files

so it may cause issues on newer versions. In our case, since we're running an older version that predates the change, we'll simply set the value to 64 instead of 8, as per the above.

# patronictl -c /etc/patroni.yml edit-config

+++
@@ -5,6 +5,7 @@
     max_connections: 256
     max_replication_slots: 64
     max_wal_senders: 32
+    wal_keep_segments: 64
   use_pg_rewind: true
 retry_timeout: 10
 ttl: 30

Apply these changes? [y/N]: y
Configuration changed

And update the patroni.yml config as well:

# cat /etc/patroni.yml
.
.
.
postgresql:
    parameters:
        unix_socket_directories: '.'
        max_connections: '256'
        max_replication_slots: '64'
        wal_keep_segments: '64'

 

Thx,
TK


     
  Copyright © 2003 - 2025 Tom Kacperski (microdevsys.com). All rights reserved.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License

 
