FATAL: the database system is starting up
If you are receiving the following when PostgreSQL (with Patroni) is starting up:
2019-04-04 14:59:15.715 EDT [26025] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000390000000000000008 has already been removed
2019-04-04 14:59:16.420 EDT [26029] FATAL: the database system is starting up
consider running the postgres command on its own in debug mode, like this, to reveal the true cause:
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100 -d 5
2019-05-23 08:40:23.585 EDT [10792] DEBUG: postgres: PostmasterMain: initial environment dump:
2019-05-23 08:40:23.586 EDT [10792] DEBUG: -----------------------------------------
2019-05-23 08:40:23.586 EDT [10792] DEBUG: XDG_SESSION_ID=25
2019-05-23 08:40:23.586 EDT [10792] DEBUG: HOSTNAME=psql01.nix.mds.xyz
2019-05-23 08:40:23.586 EDT [10792] DEBUG: SHELL=/bin/bash
2019-05-23 08:40:23.586 EDT [10792] DEBUG: TERM=xterm
2019-05-23 08:40:23.586 EDT [10792] DEBUG: HISTSIZE=1000
2019-05-23 08:40:23.586 EDT [10792] DEBUG: USER=postgres
2019-05-23 08:40:23.586 EDT [10792] DEBUG: LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
2019-05-23 08:40:23.586 EDT [10792] DEBUG: MAIL=/var/spool/mail/postgres
2019-05-23 08:40:23.586 EDT [10792] DEBUG: PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/pgsql-10/bin/
2019-05-23 08:40:23.586 EDT [10792] DEBUG: PWD=/data/patroni
2019-05-23 08:40:23.586 EDT [10792] DEBUG: LANG=en_US.UTF-8
2019-05-23 08:40:23.586 EDT [10792] DEBUG: HISTCONTROL=ignoredups
2019-05-23 08:40:23.586 EDT [10792] DEBUG: SHLVL=1
2019-05-23 08:40:23.586 EDT [10792] DEBUG: HOME=/var/lib/pgsql
2019-05-23 08:40:23.586 EDT [10792] DEBUG: LOGNAME=postgres
2019-05-23 08:40:23.586 EDT [10792] DEBUG: PGDATA=/data/patroni
2019-05-23 08:40:23.586 EDT [10792] DEBUG: LESSOPEN=||/usr/bin/lesspipe.sh %s
2019-05-23 08:40:23.586 EDT [10792] DEBUG: _=/usr/pgsql-10/bin/postgres
2019-05-23 08:40:23.586 EDT [10792] DEBUG: OLDPWD=/data
2019-05-23 08:40:23.586 EDT [10792] DEBUG: PGLOCALEDIR=/usr/pgsql-10/share/locale
2019-05-23 08:40:23.586 EDT [10792] DEBUG: PGSYSCONFDIR=/etc/sysconfig/pgsql
2019-05-23 08:40:23.586 EDT [10792] DEBUG: LC_COLLATE=en_US.UTF-8
2019-05-23 08:40:23.586 EDT [10792] DEBUG: LC_CTYPE=en_US.UTF-8
2019-05-23 08:40:23.586 EDT [10792] DEBUG: LC_MESSAGES=en_US.UTF-8
2019-05-23 08:40:23.586 EDT [10792] DEBUG: LC_MONETARY=C
2019-05-23 08:40:23.586 EDT [10792] DEBUG: LC_NUMERIC=C
2019-05-23 08:40:23.586 EDT [10792] DEBUG: LC_TIME=C
2019-05-23 08:40:23.586 EDT [10792] DEBUG: -----------------------------------------
2019-05-23 08:40:23.589 EDT [10792] DEBUG: registering background worker "logical replication launcher"
2019-05-23 08:40:23.590 EDT [10792] LOG: listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:40:23.595 EDT [10792] LOG: listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:40:23.597 EDT [10792] DEBUG: invoking IpcMemoryCreate(size=148545536)
2019-05-23 08:40:23.598 EDT [10792] DEBUG: mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-05-23 08:40:23.619 EDT [10792] DEBUG: SlruScanDirectory invoking callback on pg_notify/0000
2019-05-23 08:40:23.619 EDT [10792] DEBUG: removing file "pg_notify/0000"
2019-05-23 08:40:23.619 EDT [10792] DEBUG: dynamic shared memory system will support 288 segments
2019-05-23 08:40:23.620 EDT [10792] DEBUG: created dynamic shared memory control segment 499213675 (6928 bytes)
2019-05-23 08:40:23.623 EDT [10792] DEBUG: max_safe_fds = 985, usable_fds = 1000, already_open = 5
2019-05-23 08:40:23.626 EDT [10792] LOG: redirecting log output to logging collector process
2019-05-23 08:40:23.626 EDT [10792] HINT: Future log output will appear in directory "log".
^C2019-05-23 08:41:04.346 EDT [10793] DEBUG: logger shutting down
2019-05-23 08:41:04.346 EDT [10793] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG: proc_exit(0): 0 callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG: exit(0)
-bash-4.2$ 2019-05-23 08:41:04.346 EDT [10793] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG: proc_exit(-1): 0 callbacks to make
-bash-4.2$
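Note that once the server logs "redirecting log output to logging collector process", further messages go to files under the data directory's log/ subdirectory rather than to the terminal, so keep a tail on the current file while testing. The weekday-named files seen elsewhere in this post suggest log_filename is postgresql-%a.log, so something like this should follow today's file:
-bash-4.2$ tail -f /data/patroni/log/postgresql-$(date +%a).log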
-bash-4.2$ free
              total        used        free      shared  buff/cache   available
Mem:        3881708      218672     1687436      219292     1975600     3113380
Swap:       4063228           0     4063228
-bash-4.2$
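Beyond free, a couple of quick checks on the guest can show whether huge pages or memory overcommit are in play (a minimal sketch, assuming a standard Linux VM; the hypervisor-side view depends on your platform):
-bash-4.2$ grep -i huge /proc/meminfo                    # any huge pages reserved or free on this guest?
-bash-4.2$ sysctl vm.nr_hugepages vm.overcommit_memory   # kernel huge page count and overcommit policy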
The DEBUG line above showing mmap with MAP_HUGETLB failing ("Cannot allocate memory") indicates a lack of memory on this VM, caused here by a lack of memory on the underlying physical host (overcommitment). You'll need to either a) assign more memory to the VM if the physical host has plenty to spare, b) purchase more memory for the physical host, or c) relocate the VM to a host with more memory. If this doesn't solve the problem, look deeper and check the running process using strace:
[root@psql01 ~]# ps -ef|grep -Ei "patroni|postgres"
root 2217 2188 0 00:38 pts/1 00:00:00 tail -f postgresql-Thu.log
postgres 2512 1 4 00:42 ? 00:00:01 /usr/bin/python2 /bin/patroni /etc/patroni.yml
postgres 2533 1 0 00:42 ? 00:00:00 /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
postgres 2535 2533 0 00:42 ? 00:00:00 postgres: postgres: logger process
postgres 2536 2533 0 00:42 ? 00:00:00 postgres: postgres: startup process waiting for 000000010000000000000008
root 2664 2039 0 00:42 pts/0 00:00:00 grep --color=auto -Ei patroni|postgres
[root@psql01 ~]#
Then trace the startup process shown above, PID 2536, which is waiting for segment 000000010000000000000008:
[root@psql01 ~]# strace -p 2536
read(5, 0x7fff9cb4eb87, 1) = -1 EAGAIN (Resource temporarily unavailable)
read(5, 0x7fff9cb4eb87, 1) = -1 EAGAIN (Resource temporarily unavailable)
open("pg_wal/00000098.history", O_RDONLY) = -1 ENOENT (No such file or directory)
epoll_create1(EPOLL_CLOEXEC) = 3
epoll_ctl(3, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=16954624, u64=16954624}}) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=16954648, u64=16954648}}) = 0
epoll_wait(3, ^Cstrace: Process 2536 detached
<detached …>
[root@psql01 ~]#
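The trace shows the startup process simply polling: it checks for a newer timeline history file, gets ENOENT, and goes back to waiting on its sockets, i.e. it is stuck waiting for WAL that never arrives. Comparing pg_wal on the stuck replica against a healthy member makes the gap visible (paths match this setup; adjust to your PGDATA):
[root@psql01 ~]# ls /data/patroni/pg_wal/        # what the stuck replica still has
psql02 # ls /data/patroni/pg_wal/                # what a healthy member keeps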
If you end up copying WAL or history files over from another node (as described below), ensure you also set ownership and permissions on the copied files, or you may receive this (a fix is shown after the log):
[root@psql01 pg_wal]# tail -f ../log/postgresql-Fri.log
2019-05-24 01:22:32.979 EDT [13127] LOG: aborting startup due to startup process failure
2019-05-24 01:22:32.982 EDT [13127] LOG: database system is shut down
2019-05-24 01:22:33.692 EDT [13146] LOG: database system was shut down in recovery at 2019-05-24 01:15:31 EDT
2019-05-24 01:22:33.693 EDT [13146] WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-24 01:22:33.693 EDT [13146] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-24 01:22:33.693 EDT [13146] FATAL: could not open file "pg_wal/0000003A.history": Permission denied
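A quick way to correct that on the receiving node (a sketch; the path matches this setup, and the copied WAL and history files all start with 000 here, so the glob only touches them):
[root@psql01 pg_wal]# chown postgres:postgres /data/patroni/pg_wal/000*
[root@psql01 pg_wal]# chmod 0600 /data/patroni/pg_wal/000*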
Alternatively, the missing segments could be copied over to the node whose archive lacks them; however, this did not work for me, since it appears the primary was too far ahead of the failed replica (5 days' worth):
psql02 # patronictl -c /etc/patroni.yml list; du -sh .
+----------+-------------+--------------------+--------+---------+-----------+
| Cluster  |   Member    |        Host        |  Role  |  State  | Lag in MB |
+----------+-------------+--------------------+--------+---------+-----------+
| postgres | postgresql0 | psql01.nix.mds.xyz |        | running |     235.0 |
| postgres | postgresql1 | psql02.nix.mds.xyz | Leader | running |       0.0 |
| postgres | postgresql2 | psql03.nix.mds.xyz |        | running |       0.0 |
+----------+-------------+--------------------+--------+---------+-----------+
193M .
psql02 # pwd
/data/patroni/pg_wal/
psql02 # ls -altri
total 196952
140042537 -rw-------. 1 postgres postgres 54942 Dec 27 2021 000004C4.history
140121953 -rw-------. 1 postgres postgres 54989 Dec 27 2021 000004C5.history
140042529 -rw-------. 1 postgres postgres 55036 Dec 27 2021 000004C6.history
140042535 -rw-------. 1 postgres postgres 55083 Jan 15 2022 000004C7.history
140042527 -rw-------. 1 postgres postgres 55130 Jan 16 2022 000004C8.history
140042506 -rw-------. 1 postgres postgres 55177 Sep 30 19:21 000004C9.history
204315808 drwx------. 2 postgres postgres 6 Sep 30 23:01 archive_status
135355261 -rw-------. 1 postgres postgres 16777216 Dec 17 22:26 000004C90000003E000000A7
135127587 -rw-------. 1 postgres postgres 16777216 Dec 17 23:42 000004C90000003E000000A8
68940244 drwx------. 21 postgres postgres 4096 Dec 18 00:00 ..
140042505 -rw-------. 1 postgres postgres 16777216 Dec 18 01:30 000004C90000003E000000A9
140042509 -rw-------. 1 postgres postgres 16777216 Dec 18 03:17 000004C90000003E0000009E
140042507 -rw-------. 1 postgres postgres 16777216 Dec 18 05:07 000004C90000003E0000009F
135355263 -rw-------. 1 postgres postgres 16777216 Dec 18 06:50 000004C90000003E000000A0
140042510 -rw-------. 1 postgres postgres 16777216 Dec 18 08:40 000004C90000003E000000A1
135127616 -rw-------. 1 postgres postgres 16777216 Dec 18 10:28 000004C90000003E000000A2
135341490 -rw-------. 1 postgres postgres 16777216 Dec 18 12:13 000004C90000003E000000A3
135355259 -rw-------. 1 postgres postgres 16777216 Dec 18 14:01 000004C90000003E000000A4
140042524 -rw-------. 1 postgres postgres 16777216 Dec 18 15:51 000004C90000003E000000A5
140121952 drwx------. 3 postgres postgres 4096 Dec 18 15:51 .
135355260 -rw-------. 1 postgres postgres 16777216 Dec 18 17:21 000004C90000003E000000A6
psql02 # scp * root@psql01.nix.mds.xyz:/data/patroni/pg_wal/
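Since scp run this way creates the files owned by root on the target (hence the permissions note above), rsync is an alternative that preserves the postgres ownership and modes (a sketch; --exclude keeps archive_status local, adjust paths to your PGDATA):
psql02 # rsync -a --exclude archive_status /data/patroni/pg_wal/ root@psql01.nix.mds.xyz:/data/patroni/pg_wal/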
Checking replication again:
# patronictl -c /etc/patroni.yml list;
+----------+-------------+--------------------+--------+---------+-----------+
| Cluster  |   Member    |        Host        |  Role  |  State  | Lag in MB |
+----------+-------------+--------------------+--------+---------+-----------+
| postgres | postgresql0 | psql01.nix.mds.xyz |        | running |       0.0 |
| postgres | postgresql1 | psql02.nix.mds.xyz | Leader | running |       0.0 |
| postgres | postgresql2 | psql03.nix.mds.xyz |        | running |       0.0 |
+----------+-------------+--------------------+--------+---------+-----------+
Yet another alternative is to raise wal_keep_segments from the 8 used here to something larger (with 16 MB segments, 64 segments keeps roughly 1 GB of WAL around on the primary). For example:
# patronictl -c /etc/patroni.yml edit-config
loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    max_connections: 256
    max_replication_slots: 64
    max_wal_senders: 32
    wal_keep_segments: 64
  use_pg_rewind: true
retry_timeout: 10
ttl: 30
However, in some versions there was a notable bug around this parameter:
https://github.com/zalando/patroni/issues/2062
The conversion in patroni/postgresql/config.py (line 837 at commit 258e7e2):
parameters.setdefault('wal_keep_size', str(wal_keep_segments * 16) + 'MB')
produces warnings like:
2021-09-06 13:31:38,413 WARNING: Value=100100100100100100100100100100100100100100100100MB of parameter=wal_keep_size is too big, decreasing to 2147483647MB
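That absurd value is Python string multiplication at work: if wal_keep_segments reaches the line above as a string rather than an integer, multiplying it by 16 repeats the string instead of doing arithmetic. It's easy to reproduce on its own (illustration only; 100 stands in for whatever value was configured, and python3 behaves the same way):
# python2 -c "wal_keep_segments = '100'; print(str(wal_keep_segments * 16) + 'MB')"
100100100100100100100100100100100100100100100100MB
# python2 -c "wal_keep_segments = 100; print(str(wal_keep_segments * 16) + 'MB')"
1600MB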
This stems from a change in a more recent release than the one used here (https://github.com/zalando/patroni/pull/1654/files), and it is on releases containing that change that the issue can appear. In our case, since we're running an older version, we'll simply set wal_keep_segments to 64 instead of 8, as per the above.
# patronictl -c /etc/patroni.yml edit-config
---
+++
@@ -5,6 +5,7 @@
     max_connections: 256
     max_replication_slots: 64
     max_wal_senders: 32
+    wal_keep_segments: 64
   use_pg_rewind: true
 retry_timeout: 10
 ttl: 30
Apply these changes? [y/N]: y
Configuration changed
And update the patroni.yml config as well:
# cat /etc/patroni.yml
.
.
.
postgresql:
  parameters:
    unix_socket_directories: '.'
    max_connections: '256'
    max_replication_slots: '64'
    wal_keep_segments: '64'
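Once the change is in place, you can confirm what the server is actually using (a quick check; adjust host, user and authentication to your environment):
# psql -h 192.168.0.108 -U postgres -c 'SHOW wal_keep_segments;'    # expect 64 once Patroni has applied the change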
Thx,
TK