

Cannot read XML: (41) Specification mandate value for attribute data-pjax-transient [Line: 38 | Column: 40].

When you get this:

Cannot read XML: (41) Specification mandate value for attribute data-pjax-transient [Line: 38 | Column: 40].

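make sure you download the XML (or any other file) from GitHub using the Clone or Download ZIP option instead of Save As / Save Link As….  Saving the link typically stores GitHub's HTML page for the file rather than the file itself, which is where the data-pjax-transient attribute in the error comes from.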
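A quick way to tell whether a saved file is really XML or an HTML page is to look for HTML markers near the top of the file. This is only a rough sketch, not a validator, and the function name looks_like_html is made up for illustration:

```shell
# Hypothetical helper: returns 0 if the file looks like an HTML page
# (HTML markers within its first 20 lines), 1 otherwise.
looks_like_html() {
    head -n 20 "$1" | grep -qiE '<!DOCTYPE html|<html'
}

# Usage:
#   looks_like_html download.xml && echo "This is HTML, not XML; re-download the file."
```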

Thx,
TK

[ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out) at gcomm/src/pc.cpp:connect():158

Another issue we can run into is the following set of messages:

2019-06-08T04:56:24.518538Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
         at gcomm/src/pc.cpp:connect():158
2019-06-08T04:56:24.518591Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():209: Failed to open backend connection: -110 (Connection timed out)
2019-06-08T04:56:24.518764Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1458: Failed to open channel 'galera_cluster1' at 'gcomm://192.168.0.126,192.168.0.107,192.168.0.114': -110 (Connection timed out)
2019-06-08T04:56:24.518793Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2019-06-08T04:56:24.518812Z 0 [ERROR] WSREP: wsrep::connect(gcomm://192.168.0.126,192.168.0.107,192.168.0.114) failed: 7
2019-06-08T04:56:24.518835Z 0 [ERROR] Aborting

To solve this, take a look at your /etc/my.cnf file. The following fields have to match the server that you are on:

wsrep_node_address="192.168.0.126"
wsrep_node_name="mysql01"
server_id=1
bind-address=192.168.0.126

If they don't, the above error is thrown.  server_id must be unique for each node.  This can happen when you're restoring your database from one node to another or trying recovery steps where you copy the /etc/my.cnf over to another host. 
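Here is a small sketch for checking a copied config against the host it now lives on. The helper names cnf_value and cnf_matches_host are hypothetical, and the parsing assumes simple key=value lines like the ones shown above:

```shell
# Hypothetical helper: print the value of a key from a my.cnf-style file,
# stripping surrounding blanks and double quotes. The last match wins.
cnf_value() {
    awk -F= -v key="$1" '
        { k = $1; gsub(/[ \t]/, "", k) }
        k == key {
            v = $0
            sub(/^[^=]*=/, "", v)
            gsub(/^[ \t"]+|[ \t"]+$/, "", v)
            val = v
        }
        END { if (val != "") print val }
    ' "$2"
}

# Returns 0 when wsrep_node_address and bind-address both equal the given IP.
cnf_matches_host() {
    [ "$(cnf_value wsrep_node_address "$1")" = "$2" ] &&
    [ "$(cnf_value bind-address "$1")" = "$2" ]
}

# Usage (the IP is the address of the node you are on):
#   cnf_matches_host /etc/my.cnf 192.168.0.126 || echo "my.cnf does not match this host"
```

server_id and wsrep_node_name still need a manual look, since only you know which values belong to which node.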

 

Another solution involves copying the data directory from the master (or most current) host to the host we want as the new master.  We did this to test running with only 2 of 3 nodes while the original master is offline.  Here is our situation:

mysql01 GOOD
mysql02 BAD
mysql03 BAD

Copy the data dir from mysql01 to mysql02:

mysql02 # cd /var/lib/mysql; scp -rp mysql01:/var/lib/mysql/  .

Set the safe_to_bootstrap flag to 1:

cat grastate.dat
# GALERA saved state
version: 2.1
uuid:    f25fc12b-8a0b-11e9-b58d-bfb801e3b36d
seqno:   -1
safe_to_bootstrap: 1
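If you'd rather not edit the file by hand, the flag can be flipped with sed. A minimal sketch, assuming MySQL is stopped on the node and the file lives at /var/lib/mysql/grastate.dat; the function name is made up:

```shell
# Hypothetical helper: set safe_to_bootstrap to 1 in a grastate.dat file.
# Only that one line is rewritten; the original is kept as <file>.bak.
set_safe_to_bootstrap() {
    sed -i.bak 's/^safe_to_bootstrap:.*/safe_to_bootstrap: 1/' "$1"
}

# Usage:
#   set_safe_to_bootstrap /var/lib/mysql/grastate.dat
```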

Bootstrap this node:

mysql02 # /usr/bin/mysqld_bootstrap

On the third node, mysql03, remove all files from the /var/lib/mysql folder because we'll let it sync up from mysql02:

mysql03 # cd /var/lib/mysql; rm -rf *

Start mysql on mysql03 so it syncs from mysql02:

mysql03 # systemctl start mysqld

Let it sync.  You should have an accessible 2/3 node cluster at this point.
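To confirm the node count, you can read the wsrep_cluster_size status variable. The sketch below only parses the client output; it assumes a local mysql client with credentials already configured, and the function name is hypothetical:

```shell
# Hypothetical helper: read "name<TAB>value" status lines on stdin and
# print the value of wsrep_cluster_size.
cluster_size() {
    awk '$1 == "wsrep_cluster_size" { print $2 }'
}

# Usage:
#   mysql -N -B -e "SHOW STATUS LIKE 'wsrep_cluster_size'" | cluster_size
```

With mysql02 and mysql03 up, you would expect this to print 2.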

Thx,
TK

PostgreSQL Pull Backup

Let's set up some PostgreSQL backups.

In this post, we'll set up a pull methodology to get backups of our PostgreSQL cluster.  The backups will be saved remotely, so any failure in our cluster will not affect our PostgreSQL backups.

First, let's prepare the postgres account so we can log in remotely without a password.  This will allow us to run the pg_dump command remotely.  Generate a key pair in an account of your choice and copy the public key to the postgres account on the remote hosts like this:

[root@mbpc-pc .ssh]# cat id_rsa.pub
ssh-rsa <SECRET KEY TEXT>
[root@mbpc-pc .ssh]#
[root@mbpc-pc .ssh]#
[root@mbpc-pc .ssh]# ssh postgres@psql01
FIPS integrity verification test failed.
Last login: Sun Jun  2 18:46:53 2019
-bash-4.2$ cat .ssh/authorized_keys
ssh-rsa <SECRET KEY TEXT>
-bash-4.2$ logout
Connection to psql01 closed.
[root@mbpc-pc .ssh]# ssh postgres@psql02
FIPS integrity verification test failed.
Last login: Sun Jun  2 17:15:30 2019 from mbpc-pc.nix.mds.xyz
-bash-4.2$ logout
Connection to psql02 closed.
[root@mbpc-pc .ssh]# ssh postgres@psql03
FIPS integrity verification test failed.
Last login: Sun Jun  2 18:30:51 2019 from psql01.nix.mds.xyz
-bash-4.2$

Once you have the keys exchanged, you'll need a .pgpass in the home directory of the postgres account on each of the above cluster hosts:

-bash-4.2$ cat .pgpass
psql-c01.nix.mds.xyz:5432:*:postgres:<SECRET>
psql01.nix.mds.xyz:5432:*:postgres:<SECRET>
psql02.nix.mds.xyz:5432:*:postgres:<SECRET>
psql03.nix.mds.xyz:5432:*:postgres:<SECRET>
-bash-4.2$
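One thing to watch: the PostgreSQL client ignores .pgpass if group or others have any access to it, so the file should be mode 0600. Here is a small check, assuming GNU stat (the -c option); the function name is made up:

```shell
# Hypothetical helper: returns 0 if the file grants no access to group or
# others (for example mode 600), 1 if it does, 2 if stat fails.
# stat -c is the GNU coreutils form; this is an assumption about the platform.
pgpass_mode_ok() {
    mode=$(stat -c '%a' "$1") || return 2
    case "$mode" in
        ?00) return 0 ;;
        *)   return 1 ;;
    esac
}

# Usage:
#   pgpass_mode_ok ~/.pgpass || chmod 600 ~/.pgpass
```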

Next, we'll write a short script that logs in to a postgres node and takes a backup, saving the gzip file locally on the calling server:

[root@mbpc-pc .ssh]# cat /mnt/postgres-backup.sh
#!/bin/bash

PSQLH="";
PSQLR="";

# For a reason I've yet to investigate, cluster IP doesn't work here.  So determining the node with a running instance the real shitty way.
for KEY in psql-c01 psql01 psql02 psql03; do
        PSQLH=$( ssh postgres@$KEY "hostname" 2>/dev/null );
        PSQLR=$( ssh postgres@$KEY "ps -ef|grep -Ei \"pgsql-10.*postgres\"|grep -v grep" 2>/dev/null);
        # Stop at the first host that reports a running postgres process.
        [[ "$PSQLR" != "" ]] && {
                echo "$PSQLH|$PSQLR"; break;
        };
done

[[ "$PSQLH" == "" ]] && {
        echo "ERROR: PSQLH var was empty.  Should be a hostname.";
        exit 1;
};

ssh postgres@psql-c01.nix.mds.xyz "pg_dumpall -U postgres -h $PSQLH -p 5432 | gzip -vc" > ./psql-c01.sql.$(date +%s).gz && find /mnt/SomeBigDisk/psql-backup/ -type f -name '*.sql.*.gz' -mtime +180 -exec rm {} \;
[root@mbpc-pc .ssh]#

 

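Notice the find line in the above.  It deletes any backups older than 180 days.  Since the dump is written to the current directory (./), the script needs to run from /mnt/SomeBigDisk/psql-backup/ for the find to see those files.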
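To see what -mtime +180 does before trusting it with real backups, you can try it on throwaway files. A minimal sketch, assuming GNU touch for backdating the test files; the function name is hypothetical:

```shell
# Hypothetical helper: delete backup files in directory $1 whose
# modification time is more than $2 days old.
prune_old_backups() {
    find "$1" -type f -name '*.sql.*.gz' -mtime +"$2" -exec rm -f {} \;
}

# Dry run on a scratch directory (touch -d is GNU-specific):
#   d=$(mktemp -d)
#   touch -d '200 days ago' "$d/psql-c01.sql.1000.gz"
#   touch "$d/psql-c01.sql.2000.gz"
#   prune_old_backups "$d" 180; ls "$d"
```

The age is based on each file's modification time, not on how many backups exist.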

Schedule this via cron or any other scheduling software you prefer:

[root@mbpc-pc .ssh]# crontab -l|grep postgres
30 3 * * * /mnt/postgres-backup.sh
[root@mbpc-pc .ssh]#

The beauty of this method is that the dump runs on a cluster host over SSH, so pg_dumpall will always be the correct version to match the PostgreSQL software running there, and it leaves no intermediate files behind.  You won't need to keep a matching pg_dump on the backup server.

Thx,
TK

PANIC:  replication checkpoint has wrong magic 0 instead of 307747550

So we ran into a little problem getting our PostgreSQL Patroni w/ ETCD cluster going after a rather serious failure.

# sudo su - postgres

$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --listen_addresses=192.168.0.118 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100 --hot_standby=on --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5

This failed with one of the three messages shown below, hence this post.  If I can start a single instance, I should be fine, since I could then 1) replicate over to the other two or 2) simply take a dump, reinitialize all the databases, then restore the dump.

Using the above procedure, I get a different error message with the data files of each node:

[ PSQL01 ]
postgres: postgres: startup process waiting for 000000010000000000000008

[ PSQL02 ]
PANIC:  replication checkpoint has wrong magic 0 instead of 307747550

[ PSQL03 ]
FATAL:  syntax error in history file: f2W

 

Unfortunately, we couldn't do anything with PSQL03 and PSQL01, the standbys, since their database base/ folders were way out of sync, meaning there were no tables there:

[ PSQL03 ]

[root@psql03 base]# ls -altri
total 40
    42424 drwx------.  2 postgres postgres 8192 Oct 29  2018 1
 67714749 drwx------.  2 postgres postgres 8192 Oct 29  2018 13805
202037206 drwx------.  5 postgres postgres   38 Oct 29  2018 .
134312175 drwx------.  2 postgres postgres 8192 May 22 01:55 13806
    89714 drwxr-xr-x. 20 root     root     4096 May 22 22:43 ..
[root@psql03 base]#

 

[ PSQL02 ]

 [root@psql02 base]# ls -altri

total 412
201426668 drwx------.  2 postgres postgres  8192 Oct 29  2018 1
   743426 drwx------.  2 postgres postgres  8192 Mar 24 03:47 13805
135326327 drwx------.  2 postgres postgres 16384 Mar 24 20:15 40970
   451699 drwx------.  2 postgres postgres 40960 Mar 25 19:47 16395
  1441696 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131137
 68396137 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131138
135671065 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131139
204353100 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131140
135326320 drwx------. 17 postgres postgres  4096 Apr 14 10:08 .
 68574415 drwx------.  2 postgres postgres 12288 Apr 28 06:06 131142
   288896 drwx------.  2 postgres postgres 16384 Apr 28 06:06 131141
203015232 drwx------.  2 postgres postgres  8192 Apr 28 06:06 131136
135326328 drwx------.  2 postgres postgres 40960 May  5 22:09 24586
 67282461 drwx------.  2 postgres postgres  8192 May  5 22:09 13806
 67640961 drwx------.  2 postgres postgres 20480 May  5 22:09 131134
203500274 drwx------.  2 postgres postgres 16384 May  5 22:09 155710
134438257 drwxr-xr-x. 20 root     root      4096 May 22 01:44 ..
[root@psql02 base]# pwd
/root/postgres-patroni-backup/base
[root@psql02 base]#

 

[ PSQL01 ]

[root@psql01 base]# ls -altri
total 148
134704615 drwx------.  2 postgres postgres  8192 Oct 29  2018 1
201547700 drwx------.  2 postgres postgres  8192 Oct 29  2018 13805
   160398 drwx------.  2 postgres postgres  8192 Feb 24 23:53 13806
 67482137 drwx------.  7 postgres postgres    62 Feb 24 23:54 .
135909671 drwx------.  2 postgres postgres 24576 Feb 24 23:54 24586
134444555 drwx------.  2 postgres postgres 24576 Feb 24 23:54 16395
 67178716 drwxr-xr-x. 20 root     root      4096 May 22 01:53 ..
[root@psql01 base]# pwd
/root/postgresql-patroni-etcd/base
[root@psql01 base]#

So we could only work with PSQL02, the original primary node.  Every other node has nothing.

Looks like our replorigin_checkpoint is at fault resulting in a rather nasty replication error:

open("pg_wal/000000BE000000000000004C", O_RDONLY) = 5
open("pg_wal/000000BE000000000000004C", O_RDONLY) = 5
openat(AT_FDCWD, "base", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
openat(AT_FDCWD, "pg_tblspc", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
openat(AT_FDCWD, "pg_replslot", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
openat(AT_FDCWD, "pg_replslot", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
open("pg_logical/replorigin_checkpoint", O_RDONLY) = 6
write(2, "2019-06-02 14:50:34.777 EDT [283"..., 106
2019-06-02 14:50:34.777 EDT [28362] PANIC:  replication checkpoint has wrong magic 0 instead of 307747550
-bash-4.2$ cat pg_logical/replorigin_checkpoint
cat: pg_logical/replorigin_checkpoint: No such file or directory
-bash-4.2$ pwd
/data/patroni/tmp
-bash-4.2$ cd ..
-bash-4.2$ cat pg_logical/replorigin_checkpoint
øÉKíÛ0ðð bø{nð- Ðð à Ð à4Ø4-bash-4.2$ PuTTY
-bash: PuTTY: command not found
-bash-4.2$
-bash-4.2$ strings pg_logical/replorigin_checkpoint
-bash-4.2$ ls -altri pg_logical/replorigin_checkpoint
67894871 -rw-------. 1 postgres postgres 16384 Oct 29  2018 pg_logical/replorigin_checkpoint
-bash-4.2$ ls -altri pg_logical/
total 20
 67894871 -rw-------.  1 postgres postgres 16384 Oct 29  2018 replorigin_checkpoint
136946383 drwx------.  2 postgres postgres     6 Oct 29  2018 snapshots
204367784 drwx------.  2 postgres postgres     6 Oct 29  2018 mappings
 67894870 drwx------.  4 postgres postgres    65 Apr 28 06:06 .
135326272 drwx------. 21 postgres postgres  4096 Jun  2 14:50 ..
-bash-4.2$
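Rather than cat'ing a binary file to the terminal, the magic number from the PANIC message can be read directly with od. This is a sketch under a couple of assumptions: that the magic is the first 4 bytes of the file, and that you're on a little-endian host such as x86, where od prints them the way the server reads them. The function names are made up:

```shell
# Expected magic as quoted in the PANIC message (0x1257DADE in hex).
REPLORIGIN_MAGIC=307747550

# Hypothetical helper: print the first 4 bytes of a file as an unsigned
# 32-bit integer in host byte order.
replorigin_magic() {
    od -An -tu4 -N4 "$1" | tr -d ' \n'
}

# Returns 0 if the file starts with the expected magic, 1 otherwise.
replorigin_magic_ok() {
    [ "$(replorigin_magic "$1")" = "$REPLORIGIN_MAGIC" ]
}

# Usage:
#   replorigin_magic_ok /data/patroni/pg_logical/replorigin_checkpoint || echo "bad magic"
```

A file that prints 0 here is consistent with the "wrong magic 0" in the error.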


So let's copy a good one from another host (I guess we could delete it but I haven't tried):


[root@psql03 pg_logical]#
[root@psql03 pg_logical]# ls -altri
total 8
 68994432 drwx------.  2 postgres postgres    6 Oct 29  2018 snapshots
134984156 drwx------.  2 postgres postgres    6 Oct 29  2018 mappings
   566745 -rw-------.  1 postgres postgres    8 May 22 01:55 replorigin_checkpoint
   566731 drwx------.  4 postgres postgres   65 May 22 01:55 .
    89714 drwxr-xr-x. 20 root     root     4096 May 22 22:43 ..
[root@psql03 pg_logical]#
[root@psql03 pg_logical]#
[root@psql03 pg_logical]#
[root@psql03 pg_logical]# scp replorigin_checkpoint psql02:/data/patroni/pg_logical/
Password:
replorigin_checkpoint                                                                                 100%    8    10.1KB/s   00:00
[root@psql03 pg_logical]#
[root@psql03 pg_logical]#


Now we can get to the backend in standalone mode:


-bash-4.2$
-bash-4.2$ /usr/pgsql-10/bin/postgres --single -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=off --listen_addresses=192.168.0.124 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=20 -d 5 2>&1
2019-06-02 15:00:48.981 EDT [29057] DEBUG:  invoking IpcMemoryCreate(size=144687104)
2019-06-02 15:00:48.982 EDT [29057] DEBUG:  mmap(144703488) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-06-02 15:00:48.993 EDT [29057] DEBUG:  SlruScanDirectory invoking callback on pg_notify/0000
2019-06-02 15:00:48.993 EDT [29057] DEBUG:  removing file "pg_notify/0000"
2019-06-02 15:00:48.993 EDT [29057] DEBUG:  dynamic shared memory system will support 128 segments
2019-06-02 15:00:48.994 EDT [29057] DEBUG:  created dynamic shared memory control segment 1025202362 (3088 bytes)
2019-06-02 15:00:48.994 EDT [29057] DEBUG:  InitPostgres
2019-06-02 15:00:48.994 EDT [29057] DEBUG:  my backend ID is 1
2019-06-02 15:00:48.994 EDT [29057] LOG:  database system was interrupted; last known up at 2019-04-28 06:06:24 EDT
2019-06-02 15:00:49.265 EDT [29057] LOG:  invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-02 15:00:49.266 EDT [29057] LOG:  invalid primary checkpoint record
2019-06-02 15:00:49.266 EDT [29057] LOG:  using previous checkpoint record at 0/4C34EDA8
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  redo record is at 0/4C34ED70; shutdown FALSE
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  next transaction ID: 0:1409831; next OID: 237578
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  next MultiXactId: 48; next MultiXactOffset: 174
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  oldest unfrozen transaction ID: 549, in database 1
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  oldest MultiXactId: 1, in database 1
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  commit timestamp Xid oldest/newest: 0/0
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  transaction ID wrap limit is 2147484196, limited by database with OID 1
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  starting up replication slots
2019-06-02 15:00:49.266 EDT [29057] DEBUG:  starting up replication origin progress state
2019-06-02 15:00:49.266 EDT [29057] LOG:  database system was not properly shut down; automatic recovery in progress
2019-06-02 15:00:49.267 EDT [29057] DEBUG:  resetting unlogged relations: cleanup 1 init 0
2019-06-02 15:00:49.269 EDT [29057] LOG:  redo starts at 0/4C34ED70
2019-06-02 15:00:49.273 EDT [29057] DEBUG:  attempting to remove WAL segments newer than log file 000000BE000000000000004C
2019-06-02 15:00:49.273 EDT [29057] LOG:  invalid record length at 0/4C35CDC0: wanted 24, got 0
2019-06-02 15:00:49.273 EDT [29057] LOG:  redo done at 0/4C35CD90
2019-06-02 15:00:49.273 EDT [29057] LOG:  last completed transaction was at log time 2019-04-28 06:05:44.017446-04
2019-06-02 15:00:49.273 EDT [29057] DEBUG:  resetting unlogged relations: cleanup 0 init 1
2019-06-02 15:00:49.280 EDT [29057] DEBUG:  performing replication slot checkpoint
2019-06-02 15:00:49.288 EDT [29057] DEBUG:  attempting to remove WAL segments older than log file 000000000000000000000043
2019-06-02 15:00:49.289 EDT [29057] DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-06-02 15:00:49.290 EDT [29057] DEBUG:  oldest MultiXactId member is at offset 1
2019-06-02 15:00:49.290 EDT [29057] DEBUG:  MultiXact member stop limit is now 4294914944 based on MultiXact 1
2019-06-02 15:00:49.292 EDT [29057] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-02 15:00:49.302 EDT [29057] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0

PostgreSQL stand-alone backend 10.5
backend>


But we choose not to use the backend capabilities at this time.  We'll start the database as Patroni would, using the following command:


-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=off --listen_addresses=192.168.0.124 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=20 -d 5 2>&1
2019-06-02 15:11:55.379 EDT [29789] DEBUG:  postgres: PostmasterMain: initial environment dump:
2019-06-02 15:11:55.380 EDT [29789] DEBUG:  -----------------------------------------
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      XDG_SESSION_ID=171
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      HOSTNAME=psql02.nix.mds.xyz
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      SHELL=/bin/bash
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      TERM=xterm
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      HISTSIZE=1000
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      USER=postgres
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      MAIL=/var/spool/mail/postgres
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/pgsql-10/bin/
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      PWD=/data/patroni/pg_logical
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LANG=en_US.UTF-8
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      HISTCONTROL=ignoredups
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      SHLVL=1
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      HOME=/var/lib/pgsql
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LOGNAME=postgres
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      PGDATA=/data/patroni
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LESSOPEN=||/usr/bin/lesspipe.sh %s
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      _=/usr/pgsql-10/bin/postgres
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      OLDPWD=/data/patroni
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      PGLOCALEDIR=/usr/pgsql-10/share/locale
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      PGSYSCONFDIR=/etc/sysconfig/pgsql
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LC_COLLATE=en_US.UTF-8
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LC_CTYPE=en_US.UTF-8
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LC_MESSAGES=en_US.UTF-8
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LC_MONETARY=C
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LC_NUMERIC=C
2019-06-02 15:11:55.380 EDT [29789] DEBUG:      LC_TIME=C
2019-06-02 15:11:55.380 EDT [29789] DEBUG:  -----------------------------------------
2019-06-02 15:11:55.383 EDT [29789] LOG:  listening on IPv4 address "192.168.0.124", port 5432
2019-06-02 15:11:55.385 EDT [29789] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-06-02 15:11:55.386 EDT [29789] DEBUG:  invoking IpcMemoryCreate(size=144687104)
2019-06-02 15:11:55.387 EDT [29789] DEBUG:  mmap(144703488) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-06-02 15:11:55.398 EDT [29789] DEBUG:  SlruScanDirectory invoking callback on pg_notify/0000
2019-06-02 15:11:55.398 EDT [29789] DEBUG:  removing file "pg_notify/0000"
2019-06-02 15:11:55.398 EDT [29789] DEBUG:  dynamic shared memory system will support 128 segments
2019-06-02 15:11:55.398 EDT [29789] DEBUG:  created dynamic shared memory control segment 721092148 (3088 bytes)
2019-06-02 15:11:55.401 EDT [29789] DEBUG:  max_safe_fds = 985, usable_fds = 1000, already_open = 5
2019-06-02 15:11:55.404 EDT [29789] LOG:  redirecting log output to logging collector process
2019-06-02 15:11:55.404 EDT [29789] HINT:  Future log output will appear in directory "log".

And voila!  We are in our database and can see all of our databases:

-bash-4.2$ psql -h psql02 -p 5432 -W
Password:
psql (10.5)
Type "help" for help.

postgres=# \l
                                          List of databases
      Name       |    Owner     | Encoding |   Collate   |    Ctype    |      Access privileges
-----------------+--------------+----------+-------------+-------------+-----------------------------
 amon_mws01      | amon_mws01   | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 awx             | awx          | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 confluence      | postgres     | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =Tc/postgres               +
                 |              |          |             |             | postgres=CTc/postgres      +
                 |              |          |             |             | confluenceuser=CTc/postgres
 hue_mws01       | hue_mws01    | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 metastore_mws01 | hive_mws01   | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 nav_mws01       | nav_mws01    | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 navms_mws01     | navms_mws01  | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 oozie_mws01     | oozie_mws01  | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 postgres        | postgres     | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 rman_mws01      | rman_mws01   | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 scm_mws01       | scm_mws01    | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 sentry_mws01    | sentry_mws01 | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0       | postgres     | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres                +
                 |              |          |             |             | postgres=CTc/postgres
 template1       | postgres     | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres                +
                 |              |          |             |             | postgres=CTc/postgres
 twr             | postgres     | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =Tc/postgres               +
                 |              |          |             |             | postgres=CTc/postgres      +
                 |              |          |             |             | twr=CTc/postgres
(15 rows)

postgres=#

And checking afterwards to ensure PostgreSQL / Patroni replicated everything to all three nodes just fine:

[root@psql01 base]# ls -altri
total 364
 67518653 drwx------.  2 postgres postgres  8192 Jun  2 20:35 1
134766555 drwx------.  2 postgres postgres  8192 Jun  2 20:35 13805
   152733 drwx------.  2 postgres postgres 24576 Jun  2 20:35 16395
 68698149 drwx------.  2 postgres postgres 24576 Jun  2 20:35 24586
134741283 drwx------.  2 postgres postgres 16384 Jun  2 20:35 40970
202922441 drwx------.  2 postgres postgres 16384 Jun  2 20:35 131134
   871098 drwx------.  2 postgres postgres  8192 Jun  2 20:35 131136
 68026687 drwx------.  2 postgres postgres  8192 Jun  2 20:35 131137
135079123 drwx------.  2 postgres postgres  8192 Jun  2 20:35 131138
202874795 drwx------.  2 postgres postgres  8192 Jun  2 20:35 131139
   871469 drwx------.  2 postgres postgres  8192 Jun  2 20:35 131140
 68280133 drwx------.  2 postgres postgres 16384 Jun  2 20:35 131141
135080185 drwx------.  2 postgres postgres 12288 Jun  2 20:35 131142
   152732 drwx------. 17 postgres postgres  4096 Jun  2 20:35 .
202879025 drwx------.  2 postgres postgres 16384 Jun  2 20:35 155710
 67482133 drwx------. 21 postgres postgres  4096 Jun  2 20:36 ..
201711623 drwx------.  2 postgres postgres  8192 Jun  2 20:36 13806
[root@psql01 base]#
[root@psql01 base]#
[root@psql01 base]#
[root@psql01 base]# pwd
/data/patroni/base
[root@psql01 base]#

 

[root@psql02 base]# ls -altri
total 368
204367267 drwx------.  2 postgres postgres  8192 Mar 24 03:47 13805
 68669097 drwx------. 17 postgres postgres  4096 Apr 14 10:08 .
204362619 drwx------.  2 postgres postgres 16384 Jun  2 20:31 40970
134473951 drwx------.  2 postgres postgres 24576 Jun  2 20:31 24586
 68669102 drwx------.  2 postgres postgres 24576 Jun  2 20:31 16395
138812710 drwx------.  2 postgres postgres  8192 Jun  2 20:31 1
204366769 drwx------.  2 postgres postgres 12288 Jun  2 20:31 131142
136945631 drwx------.  2 postgres postgres 16384 Jun  2 20:31 131141
 67894451 drwx------.  2 postgres postgres  8192 Jun  2 20:31 131140
  1403920 drwx------.  2 postgres postgres  8192 Jun  2 20:31 131139
204366412 drwx------.  2 postgres postgres  8192 Jun  2 20:31 131138
136945273 drwx------.  2 postgres postgres  8192 Jun  2 20:31 131137
 67894080 drwx------.  2 postgres postgres  8192 Jun  2 20:31 131136
  1403182 drwx------.  2 postgres postgres 16384 Jun  2 20:31 131134
  1404278 drwx------.  2 postgres postgres 16384 Jun  2 20:31 155710
 11395780 drwx------.  2 postgres postgres  8192 Jun  2 20:31 13806
135326272 drwx------. 21 postgres postgres  4096 Jun  2 20:31 ..
[root@psql02 base]# pwd
/data/patroni/base
[root@psql02 base]#

 

[root@psql03 audit]# cd /data/patroni/base/
[root@psql03 base]# ls -altri
total 372
 67130854 drwx------.  2 postgres postgres  8192 Jun  2 20:37 1
134446297 drwx------.  2 postgres postgres  8192 Jun  2 20:37 13805
    79298 drwx------.  2 postgres postgres 24576 Jun  2 20:37 16395
 69209007 drwx------.  2 postgres postgres 24576 Jun  2 20:37 24586
135152677 drwx------.  2 postgres postgres 16384 Jun  2 20:37 40970
201954381 drwx------.  2 postgres postgres 16384 Jun  2 20:37 131134
    80500 drwx------.  2 postgres postgres  8192 Jun  2 20:37 131136
 68241705 drwx------.  2 postgres postgres  8192 Jun  2 20:37 131137
134443358 drwx------.  2 postgres postgres  8192 Jun  2 20:37 131138
201808206 drwx------.  2 postgres postgres  8192 Jun  2 20:37 131139
    80871 drwx------.  2 postgres postgres  8192 Jun  2 20:37 131140
 68242063 drwx------.  2 postgres postgres 16384 Jun  2 20:37 131141
134443716 drwx------.  2 postgres postgres 12288 Jun  2 20:37 131142
    79297 drwx------. 17 postgres postgres  4096 Jun  2 20:37 .
201828372 drwx------.  2 postgres postgres 16384 Jun  2 20:37 155710
134812989 drwx------. 21 postgres postgres  4096 Jun  2 20:38 ..
201807458 drwx------.  2 postgres postgres  8192 Jun  2 20:38 13806
[root@psql03 base]# pwd
/data/patroni/base
[root@psql03 base]#
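Eyeballing three listings gets tedious, so here is a rough sketch that compares just the directory names (the database OIDs) under two base/ folders. It says nothing about whether the contents match, and the function name is hypothetical:

```shell
# Hypothetical helper: returns 0 when two directories contain the same
# set of entry names, 1 when they differ, 2 if either cannot be listed.
same_db_dirs() {
    a=$(ls -1 "$1" | sort) || return 2
    b=$(ls -1 "$2" | sort) || return 2
    [ "$a" = "$b" ]
}

# Usage, after copying or mounting the listings locally (paths are examples):
#   same_db_dirs /tmp/psql01-base /tmp/psql02-base && echo "same database OIDs"
```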

And that should get you up and running again.  Don't forget to get some decent PostgreSQL backups.

ALTERNATE

We have not tried this but the above could potentially be resolved by pg_resetwal as well:

[root@psql03 ~]# find / -iname pg_resetwal
/usr/pgsql-10/bin/pg_resetwal
[root@psql03 ~]#

 

Thx,
TK

FATAL:  the database system is starting up

If you are receiving the following when postgresql ( w/ Patroni ) is starting up:

2019-04-04 14:59:15.715 EDT [26025] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000390000000000000008 has already been removed
2019-04-04 14:59:16.420 EDT [26029] FATAL:  the database system is starting up

consider running the individual postgres line separately in debug mode like this to reveal the true cause:

-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100 -d 5
2019-05-23 08:40:23.585 EDT [10792] DEBUG:  postgres: PostmasterMain: initial environment dump:
2019-05-23 08:40:23.586 EDT [10792] DEBUG:  -----------------------------------------
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      XDG_SESSION_ID=25
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      HOSTNAME=psql01.nix.mds.xyz
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      SHELL=/bin/bash
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      TERM=xterm
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      HISTSIZE=1000
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      USER=postgres
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      MAIL=/var/spool/mail/postgres
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/pgsql-10/bin/
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      PWD=/data/patroni
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LANG=en_US.UTF-8
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      HISTCONTROL=ignoredups
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      SHLVL=1
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      HOME=/var/lib/pgsql
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LOGNAME=postgres
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      PGDATA=/data/patroni
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LESSOPEN=||/usr/bin/lesspipe.sh %s
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      _=/usr/pgsql-10/bin/postgres
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      OLDPWD=/data
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      PGLOCALEDIR=/usr/pgsql-10/share/locale
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      PGSYSCONFDIR=/etc/sysconfig/pgsql
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_COLLATE=en_US.UTF-8
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_CTYPE=en_US.UTF-8
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_MESSAGES=en_US.UTF-8
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_MONETARY=C
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_NUMERIC=C
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_TIME=C
2019-05-23 08:40:23.586 EDT [10792] DEBUG:  -----------------------------------------
2019-05-23 08:40:23.589 EDT [10792] DEBUG:  registering background worker "logical replication launcher"
2019-05-23 08:40:23.590 EDT [10792] LOG:  listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:40:23.595 EDT [10792] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:40:23.597 EDT [10792] DEBUG:  invoking IpcMemoryCreate(size=148545536)
2019-05-23 08:40:23.598 EDT [10792] DEBUG:  mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-05-23 08:40:23.619 EDT [10792] DEBUG:  SlruScanDirectory invoking callback on pg_notify/0000
2019-05-23 08:40:23.619 EDT [10792] DEBUG:  removing file "pg_notify/0000"
2019-05-23 08:40:23.619 EDT [10792] DEBUG:  dynamic shared memory system will support 288 segments
2019-05-23 08:40:23.620 EDT [10792] DEBUG:  created dynamic shared memory control segment 499213675 (6928 bytes)
2019-05-23 08:40:23.623 EDT [10792] DEBUG:  max_safe_fds = 985, usable_fds = 1000, already_open = 5
2019-05-23 08:40:23.626 EDT [10792] LOG:  redirecting log output to logging collector process
2019-05-23 08:40:23.626 EDT [10792] HINT:  Future log output will appear in directory "log".
^C2019-05-23 08:41:04.346 EDT [10793] DEBUG:  logger shutting down
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  proc_exit(0): 0 callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  exit(0)
-bash-4.2$ 2019-05-23 08:41:04.346 EDT [10793] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  proc_exit(-1): 0 callbacks to make

-bash-4.2$
 

-bash-4.2$ free
              total        used        free      shared  buff/cache   available
Mem:        3881708      218672     1687436      219292     1975600     3113380
Swap:       4063228           0     4063228
-bash-4.2$
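
As a quick cross-check of the free output above, here is a minimal sketch that turns it into a single "percent available" figure. The function name mem_available_pct is our own, and it assumes the newer free layout shown above, where "available" is the seventh column of the Mem: line.

```shell
# Print the "available" column of free(1) as a whole percentage of total memory.
# Reads free output on stdin, so it also works on saved output.
# Assumes the layout shown above: Mem: total used free shared buff/cache available
mem_available_pct() {
    awk '$1 == "Mem:" { printf "%d\n", ($7 * 100) / $2 }'
}

# Example (commented out):
#   free | mem_available_pct
```

To see whether any huge pages are reserved at all, grep -i huge /proc/meminfo shows the HugePages_Total and HugePages_Free counters.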

 

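The "mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory" line above (shown in red in the original post) means PostgreSQL could not allocate huge pages.  With the default huge_pages = try setting this is normally harmless, since PostgreSQL falls back to regular pages.  It can also indicate a lack of system memory on this VM due to a lack of memory on the underlying physical host (overcommitment).  In that case you'll need to a) assign more memory to the VM, if you see the physical host has plenty, b) purchase more memory for the physical host, or c) relocate the VM to a host with more memory.  If this doesn't solve the problem, we need to look deeper and check the running process using strace: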

[root@psql01 ~]# ps -ef|grep -Ei "patroni|postgres"
root      2217  2188  0 00:38 pts/1    00:00:00 tail -f postgresql-Thu.log
postgres  2512     1  4 00:42 ?        00:00:01 /usr/bin/python2 /bin/patroni /etc/patroni.yml
postgres  2533     1  0 00:42 ?        00:00:00 /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
postgres  2535  2533  0 00:42 ?        00:00:00 postgres: postgres: logger process
postgres  2536  2533  0 00:42 ?        00:00:00 postgres: postgres: startup process   waiting for 000000010000000000000008
root      2664  2039  0 00:42 pts/0    00:00:00 grep --color=auto -Ei patroni|postgres
[root@psql01 ~]#

Then trace the startup process (PID 2536, the "waiting for" line shown in red in the original post):

[root@psql01 ~]# strace -p 2536
read(5, 0x7fff9cb4eb87, 1)              = -1 EAGAIN (Resource temporarily unavailable)
read(5, 0x7fff9cb4eb87, 1)              = -1 EAGAIN (Resource temporarily unavailable)
open("pg_wal/00000098.history", O_RDONLY) = -1 ENOENT (No such file or directory)
epoll_create1(EPOLL_CLOEXEC)            = 3
epoll_ctl(3, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=16954624, u64=16954624}}) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=16954648, u64=16954648}}) = 0
epoll_wait(3, ^Cstrace: Process 2536 detached
 <detached ...>
[root@psql01 ~]# 
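
The startup process in the ps listing above is waiting for a specific WAL segment. A small sketch for pulling that segment name out of the ps output, so it can be checked against the contents of pg_wal, might look like this. The function name waiting_wal_segment is our own, and the pattern assumes the process title format shown above.

```shell
# Print the 24-character WAL segment name that the PostgreSQL startup
# process reports it is waiting for. Reads ps output on stdin.
waiting_wal_segment() {
    sed -n 's/.*startup process[[:space:]]*waiting for \([0-9A-F]\{24\}\).*/\1/p'
}

# Example (commented out; /data/patroni is the data directory from the ps output):
#   ls -l "/data/patroni/pg_wal/$(ps -ef | waiting_wal_segment)"
```

If ls reports that the file does not exist, the standby is most likely still waiting for that segment to be copied or streamed in.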

Ensure you also set the ownership and permissions on the copied files, or you may receive this:

[root@psql01 pg_wal]# tail -f ../log/postgresql-Fri.log
2019-05-24 01:22:32.979 EDT [13127] LOG:  aborting startup due to startup process failure
2019-05-24 01:22:32.982 EDT [13127] LOG:  database system is shut down
2019-05-24 01:22:33.692 EDT [13146] LOG:  database system was shut down in recovery at 2019-05-24 01:15:31 EDT
2019-05-24 01:22:33.693 EDT [13146] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-24 01:22:33.693 EDT [13146] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-24 01:22:33.693 EDT [13146] FATAL:  could not open file "pg_wal/0000003A.history": Permission denied
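
A minimal sketch for spotting the files that trigger the "Permission denied" error above: list everything under the copied directory that is not owned by the expected user. The function name not_owned_by is our own, and the postgres user and group in the example are assumptions based on the ps output earlier in this post.

```shell
# List files under a directory that are NOT owned by the expected user.
# $1 = directory, $2 = expected owner (user name or numeric uid).
not_owned_by() {
    find "$1" ! -user "$2" -print
}

# Example (commented out; run as root, paths and owner are assumptions):
#   not_owned_by /data/patroni/pg_wal postgres
#   chown -R postgres:postgres /data/patroni/pg_wal
#   chmod 600 /data/patroni/pg_wal/*.history   # only if the mode was lost in the copy too
```

An empty listing after the chown suggests ownership is no longer the problem.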

 

Thx,
TK

psql01 etcd: read wal error (walpb: crc mismatch) and cannot be repaired

To fix:

May 22 00:29:31 psql01 etcd: read wal error (walpb: crc mismatch) and cannot be repaired

Do the following.  First stop etcd and move the old wal files out of the way:

[root@psql01 wal]# ls -altri
total 375092
201347741 -rw-------. 1 etcd etcd 64000056 Mar 30 15:44 0000000000000027-000000000181bbc9.wal
201347715 -rw-------. 1 etcd etcd 64000104 Apr  1 18:46 0000000000000028-000000000188798c.wal
201347727 -rw-------. 1 etcd etcd 64000056 Apr  3 18:02 0000000000000029-00000000018f2f2c.wal
201347690 -rw-------. 1 etcd etcd 64000040 Apr 22 11:24 000000000000002a-0000000001959a44.wal
201547677 -rw-------. 1 etcd etcd 64000000 Apr 22 11:24 1.tmp
201528077 -rw-------. 1 etcd etcd 64000000 Apr 28 06:06 000000000000002b-0000000001aace2a.wal
 69149887 drwx------. 4 etcd etcd       27 May 22 00:29 ..
201547666 drwx------. 2 etcd etcd     4096 May 22 00:29 .
[root@psql01 wal]# systemctl stop etcd
[root@psql01 wal]# mkdir /root/etcd-backup
[root@psql01 wal]# mv * /root/etcd-backup/

 

Next, start ETCD on the other 2 members.  Once the other two ETCD servers have started, start ETCD on psql01 (the first cluster member, or whichever member was failing in your cluster).

ETCD should now be restarted and synced up from its donors (the other cluster members).
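
The steps above can be sketched as a small helper plus the restart order. This is only a sketch: the function name move_wal_aside is our own, and /path/to/member/wal is a placeholder for the wal directory of your etcd member.

```shell
# Move everything in the etcd WAL directory into a backup directory.
# $1 = WAL directory, $2 = backup directory (created if missing).
move_wal_aside() {
    wal_dir=$1
    backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    for f in "$wal_dir"/*; do
        # Skip the unexpanded glob when the directory is already empty.
        if [ -e "$f" ]; then
            mv "$f" "$backup_dir"/ || return 1
        fi
    done
}

# Intended order on the failing member (commented out; run by hand as root):
#   systemctl stop etcd
#   move_wal_aside /path/to/member/wal /root/etcd-backup
# Then start etcd on the other two members first, and on this member last:
#   systemctl start etcd
```

Keeping the old files in a backup directory rather than deleting them means they can be put back if the member fails to sync.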

Thx,
TK

Linux LVM: Adding Disk Space to Virtual of Physical Drives

In this writeup, we will aim to increase the size of the root drive that has:

1) Standard drive partitioning using fdisk and no LVM: /dev/sda1
2) Has LVM for the OS and files: /dev/sda2

This procedure will help to avoid the "can't find the centos-root logical volume" error.

DNS issue: Can’t ping but nslookup works

You can do several things in this case. Open Services, then restart the DHCP Client service.  Running ipconfig /flushdns and netsh int ip reset resettcpip.txt can fix this temporarily as well.

I've elected to simply stop DHCP Client and let the system do all lookups against my internal DNS servers.

This still leaves the problem of the DHCP Client not working correctly, the cause of which I'm not 100% sure about.

You can check Event Viewer to determine the issue; however, there was nothing in Event Viewer for this.

Cheers,
TK

REF: https://merabheja.com/fix-nslookup-works-but-ping-fails-in-windows-10/ 

Setup a USB Null Modem for Kernel Dump Captures

We will set up a serial null modem cable for connecting to and administering a physical machine from another one, in the event that:

1) We want to capture kernel crashes and dumps.
2) We want to log in to the machine remotely from another Linux box to do things like restart the network.

For this we will need:  

1) One of DB9 RS232 Serial Null Modem Cable F/F
2) Two of USB to RS232 Serial Port DB9 9 Pin Male

Connect the USB to Serial Adapter to both systems.  Following that, set the tty-specific settings on ttyUSB0:

stty -F /dev/ttyUSB0 115200 cs8 -cstopb -parenb
stty -F /dev/ttyUSB0 -a
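
To confirm the settings took effect, the output of stty -a can be checked for the four values set above. This is a minimal sketch; the function name serial_settings_ok is our own, and it assumes the usual GNU stty -a output where disabled flags are printed with a leading dash.

```shell
# Check stty -a output (on stdin) for 115200 baud, 8 data bits,
# one stop bit (-cstopb) and no parity (-parenb).
# Prints "ok", or the first missing setting, and returns non-zero on failure.
serial_settings_ok() {
    settings=$(cat)
    for want in "speed 115200 baud" "cs8" "-cstopb" "-parenb"; do
        case "$settings" in
            *"$want"*) ;;
            *) echo "missing: $want"; return 1 ;;
        esac
    done
    echo "ok"
}

# Example (commented out):
#   stty -F /dev/ttyUSB0 -a | serial_settings_ok
```

Both ends of the null modem link need matching settings, so it is worth running the same check on the second machine.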

 

Test the serial connection by running the following:

/sbin/agetty -L 115200 ttyUSB0
 

Use minicom from the connecting Linux host.  While /sbin/agetty -L 115200 ttyUSB0 is running on the target machine, you should see a login prompt:

[root@rfc1178-01 ~]# minicom

Welcome to minicom 2.6.2

OPTIONS: I18n
Compiled on Jun 25 2013, 10:33:48.
Port /dev/ttyUSB0, 11:30:08

Press CTRL-A Z for help on special keys

Scientific Linux release 6.10 (Carbon)
Kernel 4.18.19 on an x86_64

mbpc-pc login: root
Password:
Last login: Fri Apr 19 12:51:19 from 192.168.0.76
[root@mbpc-pc ~]#
[root@mbpc-pc ~]#
[root@mbpc-pc ~]#
[root@mbpc-pc ~]# uptime
 13:03:19 up 14 min,  1 user,  load average: 0.06, 0.13, 0.18
[root@mbpc-pc ~]#

 

You should be able to log in as above, confirming that the physical layer (USB to Serial -> Null Modem Female-to-Female -> Serial to USB) functions correctly and that root is allowed to log in.  Configure the kernel to send messages on the tty by adding console= parameters to the kernel line of the boot entry:

title Scientific Linux (4.18.19)
        root (hd0,0)
        kernel /vmlinuz-4.18.19 ro root=/dev/mapper/mbpcvg-rootlv rd_LVM_LV=mbpcvg/rootlv rd_LVM_LV=VGEntertain/olv_swap rd_LVM_LV=mbpcvg/swaplv rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rhgb nomodeset irqpoll pcie_aspm=off amd_iommu=on crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M pci=nomsi nohpet clocksource=rtc console=ttyUSB0,115200n8 console=tty0
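
After the next reboot, a quick way to confirm the console= parameters were picked up is to list them from the running kernel's command line. A minimal sketch, with console_args being our own name for the helper:

```shell
# Print the value of every console= parameter found in a kernel
# command line read from stdin, one per line, in order.
console_args() {
    tr ' ' '\n' | sed -n 's/^console=//p'
}

# Example (commented out):
#   console_args < /proc/cmdline
```

With the kernel line above this should list ttyUSB0,115200n8 followed by tty0. As far as we know, kernel messages go to every console listed, while the last entry is the one that becomes /dev/console.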

Allow root logins on ttyUSB0 in /etc/securetty and create an Upstart job that keeps agetty running on it:

[root@mbpc-pc ~]# cat /etc/securetty |grep USB
ttyUSB0
[root@mbpc-pc ~]# cat /etc/init/ttyUSB0.conf
# ttyUSB0 - agetty
#
# This service maintains a agetty on ttyUSB0.

stop on runlevel [S06]
start on runlevel [12435]

respawn
exec agetty -L /dev/ttyUSB0 115200
[root@mbpc-pc ~]#

 

Configure the minicom settings on the external host (press CTRL-A followed by Z and look for the option cOnfigure Minicom..O, or go there directly with CTRL-A followed by O):

+-----[configuration]------+
| Filenames and paths      |
| File transfer protocols  |
| Serial port setup        |
| Modem and dialing        |
| Screen and keyboard      |
| Save setup as dfl        |
| Save setup as..          |
| Exit                     |
+--------------------------+

Followed by the settings below:

+-----------------------------------------------------------------------+
| A -    Serial Device      : /dev/ttyUSB0                              |
|                                                                       |
| C -   Callin Program      :                                           |
| D -  Callout Program      :                                           |
| E -    Bps/Par/Bits       : 115200 8N1                                |
| F - Hardware Flow Control : No                                        |
| G - Software Flow Control : Yes                                       |
|                                                                       |
|    Change which setting?                                              |
+-----------------------------------------------------------------------+

Hit ESC when done and save the configuration:

| Save setup as dfl        |

Restart the server to ensure changes take effect.  You should now see messages from the minicom terminal on the secondary system:

Welcome to minicom 2.6.2

OPTIONS: I18n
Compiled on Jun 25 2013, 10:33:48.
Port /dev/ttyUSB0, 12:03:52

Press CTRL-A Z for help on special keys


Scientific Linux release 6.10 (Carbon)
Kernel 4.18.19 on an x86_64

mbpc-pc login:

Next, test restart with the console connected to see restart messages being printed:

Linux version 4.18.19 (root@mbpc-pc) (gcc version 4.4.7 201209
Command line: ro root=/dev/mapper/mbpcvg-rootlv rd_LVM_LV=mbpcvg/rootlv rd_LVM_8
x86/fpu: x87 FPU will use FXSAVE
BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x0000000000093fff] usable
BIOS-e820: [mem 0x000000000009f800-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000dfceffff] usable
BIOS-e820: [mem 0x00000000dfcf0000-0x00000000dfcf0fff] ACPI NVS
BIOS-e820: [mem 0x00000000dfcf1000-0x00000000dfcfffff] ACPI data
BIOS-e820: [mem 0x00000000dfd00000-0x00000000dfdfffff] reserved
BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
BIOS-e820: [mem 0x00000000fec00000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000011fffffff] usable
NX (Execute Disable) protection: active
SMBIOS 2.4 present.
DMI: Gigabyte Technology Co., Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010
AGP: No AGP bridge found

 

Testing can be done using this:

[root@mbpc-pc cores]# echo "This is a ttyUSB0 test from mbpc-pc." > /dev/ttyUSB0
[root@mbpc-pc cores]#

 

Result on the console is:

[root@mbpc-pc ~]# This is a ttyUSB0 test from mbpc-pc.
CTRL-A Z for help |115200 8N1 | NOR | Minicom 2.6.2  | VT102 | Online 08:12

 

If you get a prompt but no kernel messages, ensure you compile the following options into the kernel:

CONFIG_USB_SERIAL=y
CONFIG_USB_SERIAL_CONSOLE=y
CONFIG_USB_SERIAL_EDGEPORT_TI=y
CONFIG_USB_SERIAL_MOS7840=y

You can find the above in the driver sections of make menuconfig.  Press forward slash ( / ) and enter the search string CONFIG_USB_SERIAL to get the path of the option:


  |   Location:                                               |
  |     -> Device Drivers                                     |
  |       -> USB support (USB_SUPPORT [=y])                   |
  |         -> USB Serial Converter support (USB_SERIAL [=y]) |
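
Before rebuilding, the kernel .config can be checked for the four options listed above. This is a minimal sketch; the function name missing_kernel_options is our own. The last two options are adapter-specific drivers, so treat them as an assumption that matches our adapters and swap in the driver for yours.

```shell
# Print each required option that is not built in (=y) in a kernel .config.
# $1 = path to the .config file. Prints nothing when all options are set.
missing_kernel_options() {
    for opt in CONFIG_USB_SERIAL CONFIG_USB_SERIAL_CONSOLE \
               CONFIG_USB_SERIAL_EDGEPORT_TI CONFIG_USB_SERIAL_MOS7840; do
        grep -q "^${opt}=y$" "$1" || echo "$opt"
    done
}

# Example (commented out; run from the kernel source directory):
#   missing_kernel_options .config
```

An option built as a module (=m) is reported as missing on purpose, since the list above asks for these to be compiled into the kernel.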

 

If you get kernel messages but no prompt (after enabling the additional kernel options above), then try extending the Upstart job with the container conditions on the stop stanza and the vt100 terminal type on the exec line:

[root@mbpc-pc linux-4.18.19]# cat /etc/init/ttyUSB0.conf
# ttyUSB0 - agetty
#
# This service maintains a agetty on ttyUSB0.

stop on runlevel [S06] and (
            not-container or
            container CONTAINER=lxc or
            container CONTAINER=lxc-libvirt)

start on runlevel [12435]

respawn
exec agetty -L /dev/ttyUSB0 115200 vt100
[root@mbpc-pc linux-4.18.19]#

 

However, for us it was just a matter of restarting again, since agetty didn't come up the first time.  If, with the above additions (shown in green in the original post), you now get a console, all is good and you should be all set to capture the kernel messages when crashes happen!

REF: https://wiki.freepbx.org/display/PC/Capturing+Kernel+Panic+via+Serial+Port

Cheers,
TK

com.cloudera.cmf.service.CommandException: java.io.IOException: Cannot create command directory: /var/lib/cloudera-scm-server/temp/commands/114

Getting this?

com.cloudera.cmf.service.CommandException: java.io.IOException: Cannot create command directory: /var/lib/cloudera-scm-server/temp/commands/114

It's because we blew the folder away.  Reinstall the packages to recreate it:

[root@cm-r01nn01 ~]# yum reinstall cloudera-manager-daemons cloudera-manager-agent cloudera-manager-server -y

Thx,
TK


     
  Copyright © 2003 - 2013 Tom Kacperski (microdevsys.com). All rights reserved.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License