

FATAL:  the database system is starting up

If you are receiving errors like the following when PostgreSQL (running under Patroni) is starting up:

2019-04-04 14:59:15.715 EDT [26025] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000390000000000000008 has already been removed
2019-04-04 14:59:16.420 EDT [26029] FATAL:  the database system is starting up
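
The first message says the replica is requesting a WAL segment the primary no longer has; you can confirm that on the current leader before going further. A quick check sketched here, with the pg_wal path taken from later in this post:

# ls -l /data/patroni/pg_wal/000000390000000000000008 || echo "segment already recycled on the leader"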

consider running the postgres command line that Patroni invokes on its own, in debug mode (-d 5), like this, to reveal the true cause:

-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100 -d 5
2019-05-23 08:40:23.585 EDT [10792] DEBUG:  postgres: PostmasterMain: initial environment dump:
2019-05-23 08:40:23.586 EDT [10792] DEBUG:  -----------------------------------------
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      XDG_SESSION_ID=25
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      HOSTNAME=psql01.nix.mds.xyz
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      SHELL=/bin/bash
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      TERM=xterm
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      HISTSIZE=1000
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      USER=postgres
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      MAIL=/var/spool/mail/postgres
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/pgsql-10/bin/
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      PWD=/data/patroni
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LANG=en_US.UTF-8
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      HISTCONTROL=ignoredups
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      SHLVL=1
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      HOME=/var/lib/pgsql
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LOGNAME=postgres
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      PGDATA=/data/patroni
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LESSOPEN=||/usr/bin/lesspipe.sh %s
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      _=/usr/pgsql-10/bin/postgres
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      OLDPWD=/data
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      PGLOCALEDIR=/usr/pgsql-10/share/locale
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      PGSYSCONFDIR=/etc/sysconfig/pgsql
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_COLLATE=en_US.UTF-8
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_CTYPE=en_US.UTF-8
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_MESSAGES=en_US.UTF-8
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_MONETARY=C
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_NUMERIC=C
2019-05-23 08:40:23.586 EDT [10792] DEBUG:      LC_TIME=C
2019-05-23 08:40:23.586 EDT [10792] DEBUG:  -----------------------------------------
2019-05-23 08:40:23.589 EDT [10792] DEBUG:  registering background worker "logical replication launcher"
2019-05-23 08:40:23.590 EDT [10792] LOG:  listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:40:23.595 EDT [10792] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:40:23.597 EDT [10792] DEBUG:  invoking IpcMemoryCreate(size=148545536)
2019-05-23 08:40:23.598 EDT [10792] DEBUG:  mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-05-23 08:40:23.619 EDT [10792] DEBUG:  SlruScanDirectory invoking callback on pg_notify/0000
2019-05-23 08:40:23.619 EDT [10792] DEBUG:  removing file "pg_notify/0000"
2019-05-23 08:40:23.619 EDT [10792] DEBUG:  dynamic shared memory system will support 288 segments
2019-05-23 08:40:23.620 EDT [10792] DEBUG:  created dynamic shared memory control segment 499213675 (6928 bytes)
2019-05-23 08:40:23.623 EDT [10792] DEBUG:  max_safe_fds = 985, usable_fds = 1000, already_open = 5
2019-05-23 08:40:23.626 EDT [10792] LOG:  redirecting log output to logging collector process
2019-05-23 08:40:23.626 EDT [10792] HINT:  Future log output will appear in directory "log".
^C2019-05-23 08:41:04.346 EDT [10793] DEBUG:  logger shutting down
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  proc_exit(0): 0 callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  exit(0)
-bash-4.2$ 2019-05-23 08:41:04.346 EDT [10793] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-05-23 08:41:04.346 EDT [10793] DEBUG:  proc_exit(-1): 0 callbacks to make

-bash-4.2$
 

-bash-4.2$ free
              total        used        free      shared  buff/cache   available
Mem:        3881708      218672     1687436      219292     1975600     3113380
Swap:       4063228           0     4063228
-bash-4.2$

 

The "mmap(...) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory" line above indicates a lack of memory on this VM, typically caused by a lack of memory on the underlying physical host (overcommitment).  You'll need to a) assign more memory to the VM if you see the physical host has plenty, b) purchase more memory for the physical host, or c) relocate the VM to a host with more memory.  A few quick checks can confirm the memory and huge-page situation first.
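
These are standard Linux commands, sketched here as a convenience; they were not part of the original troubleshooting session:

# Free memory and swap on the VM (compare against the hypervisor host as well):
free -m
# Huge-page accounting; a failed MAP_HUGETLB mmap() usually means huge pages are not configured or HugePages_Free is exhausted:
grep -i huge /proc/meminfo
# Kernel overcommit policy:
sysctl vm.overcommit_memory vm.overcommit_ratio

If this doesn't solve the problem, we need to look deeper and check the running process using strace: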

[root@psql01 ~]# ps -ef|grep -Ei "patroni|postgres"
root      2217  2188  0 00:38 pts/1    00:00:00 tail -f postgresql-Thu.log
postgres  2512     1  4 00:42 ?        00:00:01 /usr/bin/python2 /bin/patroni /etc/patroni.yml
postgres  2533     1  0 00:42 ?        00:00:00 /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
postgres  2535  2533  0 00:42 ?        00:00:00 postgres: postgres: logger process
postgres  2536  2533  0 00:42 ?        00:00:00 postgres: postgres: startup process   waiting for 000000010000000000000008
root      2664  2039  0 00:42 pts/0    00:00:00 grep --color=auto -Ei patroni|postgres
[root@psql01 ~]#

Then strace the startup process (PID 2536, the one shown above waiting for 000000010000000000000008):

[root@psql01 ~]# strace -p 2536
read(5, 0x7fff9cb4eb87, 1)              = -1 EAGAIN (Resource temporarily unavailable)
read(5, 0x7fff9cb4eb87, 1)              = -1 EAGAIN (Resource temporarily unavailable)
open("pg_wal/00000098.history", O_RDONLY) = -1 ENOENT (No such file or directory)
epoll_create1(EPOLL_CLOEXEC)            = 3
epoll_ctl(3, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=16954624, u64=16954624}}) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=16954648, u64=16954648}}) = 0
epoll_wait(3, ^Cstrace: Process 2536 detached
 <detached …>
[root@psql01 ~]# 

The strace above shows the startup process repeatedly polling pg_wal for timeline history and WAL files it does not have locally (compare the ps output: waiting for 000000010000000000000008).  If you copy the missing WAL and history files over from a healthy node (as shown further below), ensure you also set ownership and permissions on the copied files, or you may receive this:

[root@psql01 pg_wal]# tail -f ../log/postgresql-Fri.log
2019-05-24 01:22:32.979 EDT [13127] LOG:  aborting startup due to startup process failure
2019-05-24 01:22:32.982 EDT [13127] LOG:  database system is shut down
2019-05-24 01:22:33.692 EDT [13146] LOG:  database system was shut down in recovery at 2019-05-24 01:15:31 EDT
2019-05-24 01:22:33.693 EDT [13146] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-24 01:22:33.693 EDT [13146] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-24 01:22:33.693 EDT [13146] FATAL:  could not open file "pg_wal/0000003A.history": Permission denied
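
Fixing that is a matter of ownership and mode on the copied files; a minimal sketch, assuming the same paths used throughout this post and that the files should belong to the postgres user:

[root@psql01 ~]# chown postgres:postgres /data/patroni/pg_wal/*
[root@psql01 ~]# chmod 600 /data/patroni/pg_wal/*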

Alternatively, the missing segments can be copied over to the node that lacks them in its archive.  However, this did not work for me, since it appears the primary was too far ahead of the failed replica (about 5 days' worth of WAL):

psql02 # patronictl -c /etc/patroni.yml list; du -sh .
+----------+-------------+--------------------+--------+---------+-----------+
| Cluster  |    Member   |        Host        |  Role  |  State  | Lag in MB |
+----------+-------------+--------------------+--------+---------+-----------+
| postgres | postgresql0 | psql01.nix.mds.xyz |        | running |     235.0 |
| postgres | postgresql1 | psql02.nix.mds.xyz | Leader | running |       0.0 |
| postgres | postgresql2 | psql03.nix.mds.xyz |        | running |       0.0 |
+----------+-------------+--------------------+--------+---------+-----------+
193M    .
psql02 # pwd
/data/patroni/pg_wal/
psql02 # ls -altri
total 196952
140042537 -rw-------.  1 postgres postgres    54942 Dec 27  2021 000004C4.history
140121953 -rw-------.  1 postgres postgres    54989 Dec 27  2021 000004C5.history
140042529 -rw-------.  1 postgres postgres    55036 Dec 27  2021 000004C6.history
140042535 -rw-------.  1 postgres postgres    55083 Jan 15  2022 000004C7.history
140042527 -rw-------.  1 postgres postgres    55130 Jan 16  2022 000004C8.history
140042506 -rw-------.  1 postgres postgres    55177 Sep 30 19:21 000004C9.history
204315808 drwx------.  2 postgres postgres        6 Sep 30 23:01 archive_status
135355261 -rw-------.  1 postgres postgres 16777216 Dec 17 22:26 000004C90000003E000000A7
135127587 -rw-------.  1 postgres postgres 16777216 Dec 17 23:42 000004C90000003E000000A8
 68940244 drwx------. 21 postgres postgres     4096 Dec 18 00:00 ..
140042505 -rw-------.  1 postgres postgres 16777216 Dec 18 01:30 000004C90000003E000000A9
140042509 -rw-------.  1 postgres postgres 16777216 Dec 18 03:17 000004C90000003E0000009E
140042507 -rw-------.  1 postgres postgres 16777216 Dec 18 05:07 000004C90000003E0000009F
135355263 -rw-------.  1 postgres postgres 16777216 Dec 18 06:50 000004C90000003E000000A0
140042510 -rw-------.  1 postgres postgres 16777216 Dec 18 08:40 000004C90000003E000000A1
135127616 -rw-------.  1 postgres postgres 16777216 Dec 18 10:28 000004C90000003E000000A2
135341490 -rw-------.  1 postgres postgres 16777216 Dec 18 12:13 000004C90000003E000000A3
135355259 -rw-------.  1 postgres postgres 16777216 Dec 18 14:01 000004C90000003E000000A4
140042524 -rw-------.  1 postgres postgres 16777216 Dec 18 15:51 000004C90000003E000000A5
140121952 drwx------.  3 postgres postgres     4096 Dec 18 15:51 .
135355260 -rw-------.  1 postgres postgres 16777216 Dec 18 17:21 000004C90000003E000000A6
psql02 # scp * root@psql01.nix.mds.xyz:/data/patroni/pg_wal/

Checking replication again:

# patronictl -c /etc/patroni.yml list;
+----------+-------------+--------------------+--------+---------+-----------+
| Cluster  |    Member   |        Host        |  Role  |  State  | Lag in MB |
+----------+-------------+--------------------+--------+---------+-----------+
| postgres | postgresql0 | psql01.nix.mds.xyz |        | running |       0.0 |
| postgres | postgresql1 | psql02.nix.mds.xyz | Leader | running |       0.0 |
| postgres | postgresql2 | psql03.nix.mds.xyz |        | running |       0.0 |
+----------+-------------+--------------------+--------+---------+-----------+

Yet another alternative is to set wal_keep_segments to a value higher than the current '8'.  wal_keep_segments is the minimum number of past WAL segments PostgreSQL keeps in pg_wal for standby servers, so raising it gives a lagging replica more time to catch up before segments are recycled.  For example:

# patronictl -c /etc/patroni.yml edit-config
loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    max_connections: 256
    max_replication_slots: 64
    max_wal_senders: 32
    wal_keep_segments: 64
  use_pg_rewind: true
retry_timeout: 10
ttl: 30

However, in some Patroni versions there was a notable bug in the conversion of wal_keep_segments to wal_keep_size:

https://github.com/zalando/patroni/issues/2062

patroni/patroni/postgresql/config.py, line 837 in 258e7e2:

 parameters.setdefault('wal_keep_size', str(wal_keep_segments * 16) + 'MB')
2021-09-06 13:31:38,413 WARNING: Value=100100100100100100100100100100100100100100100100MB of parameter=wal_keep_size is too big, decreasing to 2147483647MB
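
To see where that absurd value comes from: the bug bites when wal_keep_segments reaches that line as a string rather than an integer, because Python's * operator then repeats the string instead of multiplying.  A minimal illustration (not Patroni code, just a throwaway python -c invocation, assuming a Python interpreter is available):

$ python -c "wal_keep_segments = '100'; print(str(wal_keep_segments * 16) + 'MB')"
100100100100100100100100100100100100100100100100MB
$ python -c "wal_keep_segments = 100; print(str(wal_keep_segments * 16) + 'MB')"
1600MB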

This conversion came in via a fix in a recent release, prior to the version used here:

https://github.com/zalando/patroni/pull/1654/files

which may cause issues.  In our case, since we're running an older version, we'll set wal_keep_segments to 64 instead of 8 as per the above.

# patronictl -c /etc/patroni.yml edit-config

+++
@@ -5,6 +5,7 @@
     max_connections: 256
     max_replication_slots: 64
     max_wal_senders: 32
+    wal_keep_segments: 64
   use_pg_rewind: true
 retry_timeout: 10
 ttl: 30

Apply these changes? [y/N]: y
Configuration changed

And update the patroni.yml config as well:

# cat /etc/patroni.yml
.
.
.
postgresql:
    parameters:
        unix_socket_directories: '.'
        max_connections: '256'
        max_replication_slots: '64'
        wal_keep_segments: '64'
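
Once both the DCS configuration and /etc/patroni.yml have been updated, have Patroni re-read them.  A sketch of the usual ways to do that; the cluster name "postgres" comes from the patronictl listing above, while the systemd unit name "patroni" is an assumption, not something confirmed in this post:

# patronictl -c /etc/patroni.yml reload postgres

or, to bounce the agent entirely:

# systemctl restart patroni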

 

Thx,
TK
