

rpc mount export: RPC: Unable to receive; errno = Connection refused

If you see the errors below:

[root@psql02 log]# showmount -e nfs02
rpc mount export: RPC: Unable to receive; errno = Connection refused
[root@psql02 log]#

Apr 16 01:12:37 nfs02 kernel: FINAL_REJECT: IN=eth0 OUT= MAC=00:50:56:86:2d:21:00:50:56:86:3c:c7:08:00 SRC=192.168.0.124 DST=192.168.0.119 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=44729 DF PROTO=TCP SPT=978 DPT=20048 WINDOW=29200 RES=0x00 SYN URGP=0

[root@nfs02 log]#
[root@nfs02 log]#
[root@nfs02 log]#
[root@nfs02 log]# firewall-cmd --zone=public --list-all
public
  target: default
  icmp-block-inversion: no
  interfaces:
  sources:
  services: haproxy
  ports: 20048/udp 2049/tcp 111/tcp 111/udp 24007-24008/tcp 38465-38469/tcp 4501/tcp 4501/udp 22/tcp 22/udp 49000-59999/udp 49000-59999/tcp 9000/tcp 9000/udp 137/udp 138/udp
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

[root@nfs02 log]#

Ensure you have port 20048 TCP added to your firewall:

  995  firewall-cmd --zone=public --permanent --add-port=20048/tcp
  996  firewall-cmd --reload
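
After the reload, confirm the port is now open and that mountd answers from the client (a quick check; rpcinfo and showmount ship with the standard NFS utilities):

firewall-cmd --zone=public --list-ports | tr ' ' '\n' | grep 20048
rpcinfo -t nfs02 mountd
showmount -e nfs02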

Cheers,
TK

psql: error while loading shared libraries: libpq.so.rh-postgresql95-5: cannot open shared object file: No such file or directory

Well heck:

-bash-4.2$ psql
psql: error while loading shared libraries: libpq.so.rh-postgresql95-5: cannot open shared object file: No such file or directory
-bash-4.2$

So let's see what's going on:

[root@ovirt01 ~]# find / -iname libpq.so*
/usr/lib64/libpq.so.5
/usr/lib64/libpq.so.5.5
/opt/rh/rh-postgresql95/root/usr/lib64/libpq.so.rh-postgresql95-5
/opt/rh/rh-postgresql95/root/usr/lib64/libpq.so.rh-postgresql95-5.8
[root@ovirt01 ~]#

So we can see it's in the lib64 path.  And within the postgres root folder, we see that the usr/lib directory that lib points to is empty:

[root@ovirt01 root]# find / -iname psql
/opt/rh/rh-postgresql95/root/usr/bin/psql
[root@ovirt01 root]# pwd
/opt/rh/rh-postgresql95/root
[root@ovirt01 root]# ls -altrid lib
201829110 lrwxrwxrwx. 1 root root 7 Feb 12 11:08 lib -> usr/lib
[root@ovirt01 root]# pwd
/opt/rh/rh-postgresql95/root
[root@ovirt01 root]# ls -altri usr/lib/
total 4
134420487 dr-xr-xr-x.  2 root root    6 Feb 16  2016 .
 67638534 drwxr-xr-x. 13 root root 4096 Feb 12 11:08 ..
[root@ovirt01 root]#

So obviously, nothing that uses usr/lib/ will get anything useful out of it.  But the following path under the same folder has lots of useful things:

[root@ovirt01 root]# ls -altrid lib64
201829111 lrwxrwxrwx. 1 root root 9 Feb 12 11:08 lib64 -> usr/lib64
[root@ovirt01 root]#

Since the lib folder doesn't have anything useful in it, a simple solution is to repoint the lib symlink at usr/lib64 instead.  So let's do that.
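
Something along these lines does it (a sketch; the exact commands weren't captured in the original session):

cd /opt/rh/rh-postgresql95/root
rm lib                  # drop the old symlink that points at the empty usr/lib
ln -s usr/lib64 lib     # point lib at the populated usr/lib64 instead

Sure enough: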

201829110 lrwxrwxrwx.  1 root root    9 Apr 15 00:56 lib -> usr/lib64

And here we go again:

-bash-4.2$ strace psql
execve("/opt/rh/rh-postgresql95/root/usr/bin/psql", ["psql"], [/* 21 vars */]) = 0
brk(NULL)                               = 0x17b5000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa7a1ff5000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=42118, ...}) = 0
mmap(NULL, 42118, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa7a1fea000
close(3)                                = 0
open("/lib64/tls/x86_64/libpq.so.rh-postgresql95-5", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/lib64/tls/x86_64", 0x7ffcdfe3c4a0) = -1 ENOENT (No such file or directory)
open("/lib64/tls/libpq.so.rh-postgresql95-5", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/lib64/tls", {st_mode=S_IFDIR|0555, st_size=6, ...}) = 0
open("/lib64/x86_64/libpq.so.rh-postgresql95-5", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/lib64/x86_64", 0x7ffcdfe3c4a0)   = -1 ENOENT (No such file or directory)
open("/lib64/libpq.so.rh-postgresql95-5", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/lib64", {st_mode=S_IFDIR|0555, st_size=40960, ...}) = 0
open("/usr/lib64/tls/x86_64/libpq.so.rh-postgresql95-5", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/tls/x86_64", 0x7ffcdfe3c4a0) = -1 ENOENT (No such file or directory)
open("/usr/lib64/tls/libpq.so.rh-postgresql95-5", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/tls", {st_mode=S_IFDIR|0555, st_size=6, ...}) = 0
open("/usr/lib64/x86_64/libpq.so.rh-postgresql95-5", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/x86_64", 0x7ffcdfe3c4a0) = -1 ENOENT (No such file or directory)
open("/usr/lib64/libpq.so.rh-postgresql95-5", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib64", {st_mode=S_IFDIR|0555, st_size=40960, ...}) = 0
writev(2, [{"psql", 4}, {": ", 2}, {"error while loading shared libra"..., 36}, {": ", 2}, {"libpq.so.rh-postgresql95-5", 26}, {": ", 2}, {"cannot open shared object file", 30}, {": ", 2}, {"No such file or directory", 25}, {"\n", 1}], 10psql: error while loading shared libraries: libpq.so.rh-postgresql95-5: cannot open shared object file: No such file or directory
) = 130
exit_group(127)                         = ?
+++ exited with 127 +++
-bash-4.2$

 

So we need to add it to the default library path.  Easy enough:
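
Creating the drop-in file is a one-liner; a sketch, assuming the same file name shown below:

printf '%s\n' /opt/rh/rh-postgresql95/root/lib /opt/rh/rh-postgresql95/root/lib64 > /etc/ld.so.conf.d/postgres-x86_64.conf

Then rebuild the cache and verify: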

[root@ovirt01 ld.so.conf.d]#
[root@ovirt01 ld.so.conf.d]# cat postgres-x86_64.conf
/opt/rh/rh-postgresql95/root/lib
/opt/rh/rh-postgresql95/root/lib64
[root@ovirt01 ld.so.conf.d]# ldconfig
[root@ovirt01 ld.so.conf.d]# strings /etc/ld.so.cache |grep -Ei postgresql95
libpq.so.rh-postgresql95-5
/opt/rh/rh-postgresql95/root/lib64/libpq.so.rh-postgresql95-5
libpgtypes.so.rh-postgresql95-3
/opt/rh/rh-postgresql95/root/lib64/libpgtypes.so.rh-postgresql95-3
libecpg_compat.so.rh-postgresql95-3
/opt/rh/rh-postgresql95/root/lib64/libecpg_compat.so.rh-postgresql95-3
libecpg.so.rh-postgresql95-6
/opt/rh/rh-postgresql95/root/lib64/libecpg.so.rh-postgresql95-6
[root@ovirt01 ld.so.conf.d]#
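
Equivalently, ldconfig can print the cache in a friendlier form:

ldconfig -p | grep -i postgresql95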

And let's try again.  And sure enough, we have a winner:

-bash-4.2$
-bash-4.2$ psql
psql (9.5.9)
Type "help" for help.

postgres=# \l
                                             List of databases
         Name         |        Owner         | Encoding |   Collate   |    Ctype    |   Access privileges
----------------------+----------------------+----------+-------------+-------------+-----------------------
 engine               | engine               | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 ovirt_engine_history | ovirt_engine_history | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 postgres             | postgres             | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0            | postgres             | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
                      |                      |          |             |             | postgres=CTc/postgres
 template1            | postgres             | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
                      |                      |          |             |             | postgres=CTc/postgres
(5 rows)

postgres=#

Cheers,
TK

 

NFS reply xid reply ERR 20: Auth Invalid failure code 13, and AD logins hang when using NFS home directories.

You get the following error message:

15:27:35.430633 IP 192.168.0.80.nfs > 192.168.0.145.843: Flags [P.], seq 29:53, ack 417, win 235, options [nop,nop,TS val 6635947 ecr 5159126], length 24: NFS reply xid 2938911306 reply ERR 20: Auth Invalid failure code 13

Checking the nfs01 / nfs02 servers, we see a lot of the following:

[2018-04-08 17:19:39.014330] E [MSGID: 108006] [afr-common.c:5006:__afr_handle_child_down_event] 0-gv01-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.

[2018-04-08 17:32:21.714643] E [MSGID: 114058] [client-handshake.c:1571:client_query_portmap_cbk] 0-gv01-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.

[2018-04-08 17:32:21.734187] E [MSGID: 101046] [dht-common.c:1501:dht_lookup_dir_cbk] 0-gv01-dht: dict is null

[2018-04-08 17:32:21.734437] I [MSGID: 104039] [glfs-resolve.c:935:__glfs_active_subvol] 0-gv01: first lookup on graph 6e667330-322e-6e69-782e-6d64732e7879 (0) failed (Transport endpoint is not connected) [Transport endpoint is not connected]

Looks like the backend store for our NFS mount is not in good shape right now, so we move on to troubleshooting Gluster.
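
The logs themselves point at the first checks to run on the Gluster nodes (standard Gluster CLI, using the gv01 volume named above):

gluster peer status
gluster volume status gv01
gluster volume heal gv01 info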

 

Cheers,
TK

/usr/bin/ganesha.nfsd: /lib64/libntirpc.so.1.6: version `NTIRPC_1.6.1' not found (required by /usr/bin/ganesha.nfsd)

What to do when you get this message below:


[root@nfs01 ganesha]# systemctl status nfs-ganesha -l
● nfs-ganesha.service - NFS-Ganesha file server
   Loaded: loaded (/usr/lib/systemd/system/nfs-ganesha.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Sun 2018-04-08 13:22:32 EDT; 1min 8s ago
     Docs: http://github.com/nfs-ganesha/nfs-ganesha/wiki
  Process: 2033 ExecStart=/bin/bash -c ${NUMACTL} ${NUMAOPTS} /usr/bin/ganesha.nfsd ${OPTIONS} ${EPOCH} (code=exited, status=1/FAILURE)

Apr 08 13:22:32 nfs01.nix.mds.xyz systemd[1]: Starting NFS-Ganesha file server...
Apr 08 13:22:32 nfs01.nix.mds.xyz bash[2033]: /usr/bin/ganesha.nfsd: /lib64/libntirpc.so.1.6: version `NTIRPC_1.6.1' not found (required by /usr/bin/ganesha.nfsd)
Apr 08 13:22:32 nfs01.nix.mds.xyz systemd[1]: nfs-ganesha.service: control process exited, code=exited status=1
Apr 08 13:22:32 nfs01.nix.mds.xyz systemd[1]: Failed to start NFS-Ganesha file server.
Apr 08 13:22:32 nfs01.nix.mds.xyz systemd[1]: Unit nfs-ganesha.service entered failed state.
Apr 08 13:22:32 nfs01.nix.mds.xyz systemd[1]: nfs-ganesha.service failed.
[root@nfs01 ganesha]#

I simply went into my custom build of NFS Ganesha and reinstalled the binaries:

[root@nfs02 nfs-ganesha]# make install
[ 16%] Built target ntirpc
[ 17%] Built target log
[ 19%] Built target config_parsing
[ 22%] Built target cidr
[ 23%] Built target avltree
[ 23%] Built target hashtable
[ 28%] Built target sal
[ 29%] Built target rpcal
[ 29%] Built target nfs4callbacks
[ 53%] Built target nfsproto
[ 55%] Built target nfs_mnt_xdr
[ 58%] Built target nlm
[ 58%] Built target gos
[ 58%] Built target string_utils
[ 60%] Built target rquota
[ 68%] Built target 9p
[ 69%] Built target sm_notify.ganesha
[ 70%] Built target hash
[ 70%] Built target netgroup_cache
[ 75%] Built target support
[ 75%] Built target uid2grp
[ 77%] Built target fsalnull
[ 80%] Built target fsalmdcache
[ 81%] Built target fsalpseudo
[ 82%] Built target fsalproxy
[ 87%] Built target fsalgpfs
[ 89%] Built target fsalgluster
[ 90%] Built target fsalmem
[ 90%] Built target idmap
[ 93%] Built target FsalCore
[ 96%] Built target MainServices
[100%] Built target ganesha.nfsd
Install the project...
-- Install configuration: "Debug"
-- Skipping: /etc/ganesha/ganesha.conf (already exists)
Files "/root/ganesha/nfs-ganesha/src/config_samples/ganesha.conf.example" to "/etc/ganesha/ganesha.conf" are different.
-- Installing: /etc/ganesha/ganesha.conf.example
-- Installing: /usr/share/doc/ganesha/config_samples
-- Up-to-date: /usr/share/doc/ganesha/config_samples/ds.conf
-- Up-to-date: /usr/share/doc/ganesha/config_samples/export.txt
-- Up-to-date: /usr/share/doc/ganesha/config_samples/ganesha.conf.example
-- Up-to-date: /usr/share/doc/ganesha/config_samples/gluster.conf
-- Up-to-date: /usr/share/doc/ganesha/config_samples/gpfs.conf
-- Up-to-date: /usr/share/doc/ganesha/config_samples/gpfs.ganesha.exports.conf
-- Up-to-date: /usr/share/doc/ganesha/config_samples/gpfs.ganesha.log.conf
-- Up-to-date: /usr/share/doc/ganesha/config_samples/gpfs.ganesha.main.conf
-- Up-to-date: /usr/share/doc/ganesha/config_samples/gpfs.ganesha.nfsd.conf
-- Up-to-date: /usr/share/doc/ganesha/config_samples/logging.txt
-- Up-to-date: /usr/share/doc/ganesha/config_samples/logrotate_fsal_gluster
-- Up-to-date: /usr/share/doc/ganesha/config_samples/logrotate_ganesha
-- Up-to-date: /usr/share/doc/ganesha/config_samples/mem.conf
-- Up-to-date: /usr/share/doc/ganesha/config_samples/rgw.conf
-- Up-to-date: /usr/share/doc/ganesha/config_samples/rgw_bucket.conf
-- Up-to-date: /usr/share/doc/ganesha/config_samples/vfs.conf
-- Up-to-date: /usr/share/doc/ganesha/config_samples/xfs.conf
-- Up-to-date: /usr/share/doc/ganesha/config_samples/README
-- Up-to-date: /usr/share/doc/ganesha/config_samples/ceph.conf
-- Up-to-date: /usr/share/doc/ganesha/config_samples/config.txt
-- Installing: /var/run/ganesha
-- Installing: /usr/lib64/pkgconfig/libntirpc.pc
-- Installing: /usr/include/ntirpc
-- Installing: /usr/include/ntirpc/fpmath.h
-- Installing: /usr/include/ntirpc/getpeereid.h
-- Installing: /usr/include/ntirpc/intrinsic.h
-- Installing: /usr/include/ntirpc/libc_private.h
-- Installing: /usr/include/ntirpc/misc
-- Installing: /usr/include/ntirpc/misc/abstract_atomic.h
-- Installing: /usr/include/ntirpc/misc/bsd_epoll.h
-- Installing: /usr/include/ntirpc/misc/city.h
-- Installing: /usr/include/ntirpc/misc/citycrc.h
-- Installing: /usr/include/ntirpc/misc/event.h
-- Installing: /usr/include/ntirpc/misc/opr.h
-- Installing: /usr/include/ntirpc/misc/opr_queue.h
-- Installing: /usr/include/ntirpc/misc/os_epoll.h
-- Installing: /usr/include/ntirpc/misc/portable.h
-- Installing: /usr/include/ntirpc/misc/queue.h
-- Installing: /usr/include/ntirpc/misc/rbtree.h
-- Installing: /usr/include/ntirpc/misc/rbtree_x.h
-- Installing: /usr/include/ntirpc/misc/socket.h
-- Installing: /usr/include/ntirpc/misc/stdint.h
-- Installing: /usr/include/ntirpc/misc/stdio.h
-- Installing: /usr/include/ntirpc/misc/timespec.h
-- Installing: /usr/include/ntirpc/misc/wait_queue.h
-- Installing: /usr/include/ntirpc/misc/winpthreads.h
-- Installing: /usr/include/ntirpc/namespace.h
-- Installing: /usr/include/ntirpc/netconfig.h
-- Installing: /usr/include/ntirpc/reentrant.h
-- Installing: /usr/include/ntirpc/rpc
-- Installing: /usr/include/ntirpc/rpc/auth_inline.h
-- Installing: /usr/include/ntirpc/rpc/auth_stat.h
-- Installing: /usr/include/ntirpc/rpc/clnt_stat.h
-- Installing: /usr/include/ntirpc/rpc/des.h
-- Installing: /usr/include/ntirpc/rpc/des_crypt.h
-- Installing: /usr/include/ntirpc/rpc/gss_internal.h
-- Installing: /usr/include/ntirpc/rpc/nettype.h
-- Installing: /usr/include/ntirpc/rpc/pool_queue.h
-- Installing: /usr/include/ntirpc/rpc/rpc.h
-- Installing: /usr/include/ntirpc/rpc/rpc_cksum.h
-- Installing: /usr/include/ntirpc/rpc/rpc_com.h
-- Installing: /usr/include/ntirpc/rpc/rpc_err.h
-- Installing: /usr/include/ntirpc/rpc/rpc_msg.h
-- Installing: /usr/include/ntirpc/rpc/rpcent.h
-- Installing: /usr/include/ntirpc/rpc/svc_rqst.h
-- Installing: /usr/include/ntirpc/rpc/tirpc_compat.h
-- Installing: /usr/include/ntirpc/rpc/xdr_ioq.h
-- Installing: /usr/include/ntirpc/rpc/auth.h
-- Installing: /usr/include/ntirpc/rpc/auth_gss.h
-- Installing: /usr/include/ntirpc/rpc/auth_unix.h
-- Installing: /usr/include/ntirpc/rpc/clnt.h
-- Installing: /usr/include/ntirpc/rpc/pmap_prot.h
-- Installing: /usr/include/ntirpc/rpc/pmap_rmt.h
-- Installing: /usr/include/ntirpc/rpc/rpcb_clnt.h
-- Installing: /usr/include/ntirpc/rpc/rpcb_prot.h
-- Installing: /usr/include/ntirpc/rpc/rpcb_prot.x
-- Installing: /usr/include/ntirpc/rpc/svc.h
-- Installing: /usr/include/ntirpc/rpc/svc_auth.h
-- Installing: /usr/include/ntirpc/rpc/types.h
-- Installing: /usr/include/ntirpc/rpc/work_pool.h
-- Installing: /usr/include/ntirpc/rpc/xdr.h
-- Installing: /usr/include/ntirpc/rpc/xdr_inline.h
-- Installing: /usr/include/ntirpc/rpcsvc
-- Installing: /usr/include/ntirpc/rpcsvc/crypt.h
-- Installing: /usr/include/ntirpc/un-namespace.h
-- Installing: /usr/include/ntirpc/version.h
-- Up-to-date: /usr/lib64/libntirpc.so.1.6.1
-- Installing: /usr/lib64/libntirpc.so.1.6
-- Installing: /usr/lib64/libntirpc.so
-- Up-to-date: /usr/lib64/ganesha/libfsalnull.so.4.2.0
-- Up-to-date: /usr/lib64/ganesha/libfsalnull.so.4
-- Up-to-date: /usr/lib64/ganesha/libfsalnull.so
-- Up-to-date: /usr/lib64/ganesha/libfsalproxy.so.4.2.0
-- Up-to-date: /usr/lib64/ganesha/libfsalproxy.so.4
-- Up-to-date: /usr/lib64/ganesha/libfsalproxy.so
-- Up-to-date: /usr/lib64/ganesha/libfsalgpfs.so.4.2.0
-- Up-to-date: /usr/lib64/ganesha/libfsalgpfs.so.4
-- Up-to-date: /usr/lib64/ganesha/libfsalgpfs.so
-- Up-to-date: /usr/lib64/ganesha/libfsalgluster.so.4.2.0
-- Up-to-date: /usr/lib64/ganesha/libfsalgluster.so.4
-- Up-to-date: /usr/lib64/ganesha/libfsalgluster.so
-- Up-to-date: /usr/lib64/ganesha/libfsalmem.so.4.2.0
-- Up-to-date: /usr/lib64/ganesha/libfsalmem.so.4
-- Up-to-date: /usr/lib64/ganesha/libfsalmem.so
-- Up-to-date: /usr/bin/ganesha.nfsd
[root@nfs02 nfs-ganesha]#
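
To confirm the reinstalled library really does export the symbol version the daemon was complaining about, and that the service comes back, something like this works (treat the exact grep as a sketch; readelf -V dumps the version definitions):

readelf -V /usr/lib64/libntirpc.so.1.6 | grep NTIRPC_1.6.1
systemctl restart nfs-ganesha
systemctl status nfs-ganesha -l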

 

I should, however, move over to the RPM packages now that they are at Ganesha V2.6.

Cheers,
TK

kernel: NFS: nfs4_discover_server_trunking unhandled error -512. Exiting with error EIO and mount hangs

The mount still hangs, and restarting NFS and autofs does not resolve the error below:

Apr  8 12:12:46 ovirt01 systemd: Stopping Automounts filesystems on demand...
Apr  8 12:12:46 ovirt01 kernel: NFS: nfs4_discover_server_trunking unhandled error -512. Exiting with error EIO
Apr  8 12:12:47 ovirt01 automount[1487]: umount_autofs_indirect: ask umount returned busy /n
Apr  8 12:12:48 ovirt01 ovsdb-server: ovs|111192|stream_ssl|ERR|Private key must be configured to use SSL
Apr  8 12:12:48 ovirt01 ovsdb-server: ovs|111193|stream_ssl|ERR|Certificate must be configured to use SSL
Apr  8 12:12:49 ovirt01 systemd: Starting Automounts filesystems on demand...
Apr  8 12:12:49 ovirt01 automount[2165]: lookup_read_map: lookup(sss): getautomntent_r: No such file or directory
Apr  8 12:12:49 ovirt01 systemd: Started Automounts filesystems on demand.

This looks like a potential kernel bug, as indicated in this thread, and there is a Red Hat document that speaks to it in more detail.  We upgraded the kernel and tried again, but despite the newer kernel the issue continued, even through reboots; the upgrade didn't appear to help at all (see the next dated post for more info).  So we investigated by attempting to mount the share remotely using both participating nodes and the VIP:

[root@ipaclient01 /]# mount nfs02:/n /n -v
mount.nfs: timeout set for Sun Apr  8 19:58:13 2018
mount.nfs: trying text-based options 'vers=4.1,addr=192.168.0.119,clientaddr=192.168.0.236'
mount.nfs: mount(2): No such file or directory
mount.nfs: trying text-based options 'addr=192.168.0.119'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying 192.168.0.119 prog 100003 vers 3 prot TCP port 2049
mount.nfs: prog 100005, trying vers=3, prot=17
mount.nfs: trying 192.168.0.119 prog 100005 vers 3 prot UDP port 20048
mount.nfs: mount(2): Permission denied
mount.nfs: access denied by server while mounting nfs02:/n

[root@ipaclient01 /]# showmount -e nfs02
Export list for nfs02:
/n (everyone)
[root@ipaclient01 /]#

Next we try with the following, remembering to unmount /n right after each attempt:

mount nfs02:/n /n -v
mount nfs01:/n /n -v
mount nfs-c01:/n /n -v

[root@ipaclient01 /]# showmount -e nfs02
Export list for nfs02:
/n (everyone)
[root@ipaclient01 /]#

 

We're still not 100% sure of the exact change that fixed this, but we did recompile and reinstall NFS Ganesha, took the Gluster FS offline and online numerous times, did the same with NFS Ganesha, and made sure HAProxy and keepalived were working on both nodes.  We also added HAProxy statistics to the configuration with these lines at the end of each /etc/haproxy/haproxy.cfg file:

[root@nfs02 ganesha]# cat /etc/haproxy/haproxy.cfg
global
    log         127.0.0.1 local2 debug
    stats       socket /var/run/haproxy.sock mode 0600 level admin
    # stats     socket /var/lib/haproxy/stats
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon
    debug

defaults
    mode                    tcp
    log                     global
    option                  dontlognull
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000


frontend nfs-in
    bind nfs-c01:2049
    mode tcp
    option tcplog
    default_backend             nfs-back


backend nfs-back
    balance     roundrobin
    server      nfs02.nix.mds.xyz    nfs02.nix.mds.xyz:2049 check
    server      nfs01.nix.mds.xyz    nfs01.nix.mds.xyz:2049 check

listen stats
    bind :9000
    mode http
    stats enable
    stats hide-version
    stats realm Haproxy\ Statistics
    stats uri /haproxy-stats
    stats auth admin:passw0rd
[root@nfs02 ganesha]#
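
With the stats section above in place, the statistics page answers on port 9000 using the URI and credentials from the config (swap in your own password):

curl -u admin:passw0rd http://nfs02:9000/haproxy-stats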

Sure enough, once this was done, the mounts worked like a charm from all the other hosts.  Another issue was that we were calling dhclient within a certain script at boot time.  This allocated a secondary IP to the interface, potentially confusing the NFS servers, though this is unconfirmed.  Nonetheless, the host was reconfigured with only one static IP.

Digging some more:


==> n.log <==
[2018-04-09 05:08:13.704156] I [MSGID: 100030] [glusterfsd.c:2556:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.13.2 (args: /usr/sbin/glusterfs --process-name fuse --volfile-server=nfs01 --volfile-id=/gv01 /n)
[2018-04-09 05:08:13.711255] W [MSGID: 101002] [options.c:995:xl_opt_validate] 0-glusterfs: option 'address-family' is deprecated, preferred is 'transport.address-family', continuing with correction
[2018-04-09 05:08:13.728297] W [socket.c:3216:socket_connect] 0-glusterfs: Error disabling sockopt IPV6_V6ONLY: "Protocol not available"
[2018-04-09 05:08:13.729025] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2018-04-09 05:08:13.737757] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2018-04-09 05:08:13.738114] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 3
[2018-04-09 05:08:13.738203] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 4
[2018-04-09 05:08:13.738324] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 5
[2018-04-09 05:08:13.738330] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 6
[2018-04-09 05:08:13.738655] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 7
[2018-04-09 05:08:13.738742] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 8
[2018-04-09 05:08:13.739460] W [MSGID: 101174] [graph.c:363:_log_if_unknown_option] 0-gv01-readdir-ahead: option 'parallel-readdir' is not recognized
[2018-04-09 05:08:13.739787] I [MSGID: 114020] [client.c:2360:notify] 0-gv01-client-0: parent translators are ready, attempting connect on transport
[2018-04-09 05:08:13.747040] W [socket.c:3216:socket_connect] 0-gv01-client-0: Error disabling sockopt IPV6_V6ONLY: "Protocol not available"
[2018-04-09 05:08:13.747372] I [MSGID: 114020] [client.c:2360:notify] 0-gv01-client-1: parent translators are ready, attempting connect on transport
[2018-04-09 05:08:13.747883] E [MSGID: 114058] [client-handshake.c:1571:client_query_portmap_cbk] 0-gv01-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2018-04-09 05:08:13.748026] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-gv01-client-0: disconnected from gv01-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2018-04-09 05:08:13.748070] W [MSGID: 108001] [afr-common.c:5391:afr_notify] 0-gv01-replicate-0: Client-quorum is not met
[2018-04-09 05:08:13.754493] W [socket.c:3216:socket_connect] 0-gv01-client-1: Error disabling sockopt IPV6_V6ONLY: "Protocol not available"
Final graph:
+------------------------------------------------------------------------------+
  1: volume gv01-client-0
  2:     type protocol/client
  3:     option ping-timeout 42
  4:     option remote-host nfs01
  5:     option remote-subvolume /bricks/0/gv01
  6:     option transport-type socket
  7:     option transport.address-family inet
  8:     option username 916ccf06-dc1d-467f-bc3d-f00a7449618f
  9:     option password a44739e0-9587-411f-8e6a-9a6a4e46156c
 10:     option event-threads 8
 11:     option transport.tcp-user-timeout 0
 12:     option transport.socket.keepalive-time 20
 13:     option transport.socket.keepalive-interval 2
 14:     option transport.socket.keepalive-count 9
 15:     option send-gids true
 16: end-volume
 17:
 18: volume gv01-client-1
 19:     type protocol/client
 20:     option ping-timeout 42
 21:     option remote-host nfs02
 22:     option remote-subvolume /bricks/0/gv01
 23:     option transport-type socket
 24:     option transport.address-family inet
 25:     option username 916ccf06-dc1d-467f-bc3d-f00a7449618f
 26:     option password a44739e0-9587-411f-8e6a-9a6a4e46156c
 27:     option event-threads 8
 28:     option transport.tcp-user-timeout 0
 29:     option transport.socket.keepalive-time 20
 30:     option transport.socket.keepalive-interval 2
 31:     option transport.socket.keepalive-count 9
 32:     option send-gids true
 33: end-volume
 34:
 35: volume gv01-replicate-0
 36:     type cluster/replicate
 37:     option afr-pending-xattr gv01-client-0,gv01-client-1
 38:     option quorum-type auto
 39:     option use-compound-fops off
 40:     subvolumes gv01-client-0 gv01-client-1
 41: end-volume
 42:
 43: volume gv01-dht
 44:     type cluster/distribute
 45:     option lock-migration off
 46:     subvolumes gv01-replicate-0
 47: end-volume
 48:
 49: volume gv01-write-behind
 50:     type performance/write-behind
 51:     option cache-size 8MB
 52:     subvolumes gv01-dht
 53: end-volume
 54:
 55: volume gv01-read-ahead
 56:     type performance/read-ahead
 57:     subvolumes gv01-write-behind
 58: end-volume
 59:
 60: volume gv01-readdir-ahead
 61:     type performance/readdir-ahead
 62:     option parallel-readdir off
 63:     option rda-request-size 131072
 64:     option rda-cache-limit 10MB
 65:     subvolumes gv01-read-ahead
 66: end-volume
 67:
 68: volume gv01-io-cache
 69:     type performance/io-cache
 70:     option cache-size 1GB
 71:     subvolumes gv01-readdir-ahead
 72: end-volume
 73:
 74: volume gv01-quick-read
 75:     type performance/quick-read
 76:     option cache-size 1GB
 77:     subvolumes gv01-io-cache
 78: end-volume
 79:
 80: volume gv01-open-behind
 81:     type performance/open-behind
 82:     subvolumes gv01-quick-read
 83: end-volume
 84:
 85: volume gv01-md-cache
 86:     type performance/md-cache
 87:     subvolumes gv01-open-behind
 88: end-volume
 89:
 90: volume gv01
 91:     type debug/io-stats
 92:     option log-level INFO
 93:     option latency-measurement off
 94:     option count-fop-hits off
 95:     subvolumes gv01-md-cache
 96: end-volume
 97:
 98: volume meta-autoload
 99:     type meta
100:     subvolumes gv01
101: end-volume
102:
+------------------------------------------------------------------------------+
[2018-04-09 05:08:13.922631] E [socket.c:2374:socket_connect_finish] 0-gv01-client-1: connection to 192.168.0.119:24007 failed (No route to host); disconnecting socket
[2018-04-09 05:08:13.922690] E [MSGID: 108006] [afr-common.c:5164:__afr_handle_child_down_event] 0-gv01-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2018-04-09 05:08:13.926201] I [fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.22
[2018-04-09 05:08:13.926245] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to graph 0
[2018-04-09 05:08:13.926518] I [MSGID: 108006] [afr-common.c:5444:afr_local_init] 0-gv01-replicate-0: no subvolumes up
[2018-04-09 05:08:13.926671] E [MSGID: 101046] [dht-common.c:1501:dht_lookup_dir_cbk] 0-gv01-dht: dict is null
[2018-04-09 05:08:13.926762] E [fuse-bridge.c:4271:fuse_first_lookup] 0-fuse: first lookup on root failed (Transport endpoint is not connected)
[2018-04-09 05:08:13.927207] I [MSGID: 108006] [afr-common.c:5444:afr_local_init] 0-gv01-replicate-0: no subvolumes up
[2018-04-09 05:08:13.927262] E [MSGID: 101046] [dht-common.c:1501:dht_lookup_dir_cbk] 0-gv01-dht: dict is null
[2018-04-09 05:08:13.927301] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve (Transport endpoint is not connected)
[2018-04-09 05:08:13.927339] E [fuse-bridge.c:900:fuse_getattr_resume] 0-glusterfs-fuse: 2: GETATTR 1 (00000000-0000-0000-0000-000000000001) resolution failed
[2018-04-09 05:08:13.931497] I [MSGID: 108006] [afr-common.c:5444:afr_local_init] 0-gv01-replicate-0: no subvolumes up
[2018-04-09 05:08:13.931558] E [MSGID: 101046] [dht-common.c:1501:dht_lookup_dir_cbk] 0-gv01-dht: dict is null
[2018-04-09 05:08:13.931599] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve (Transport endpoint is not connected)
[2018-04-09 05:08:13.931623] E [fuse-bridge.c:900:fuse_getattr_resume] 0-glusterfs-fuse: 3: GETATTR 1 (00000000-0000-0000-0000-000000000001) resolution failed
[2018-04-09 05:08:13.937258] I [fuse-bridge.c:5093:fuse_thread_proc] 0-fuse: initating unmount of /n
[2018-04-09 05:08:13.938043] W [glusterfsd.c:1393:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e25) [0x7fb80b05ae25] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x560b52471675] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x560b5247149b] ) 0-: received signum (15), shutting down
[2018-04-09 05:08:13.938086] I [fuse-bridge.c:5855:fini] 0-fuse: Unmounting '/n'.
[2018-04-09 05:08:13.938106] I [fuse-bridge.c:5860:fini] 0-fuse: Closing fuse connection to '/n'.

==> glusterd.log <==
[2018-04-09 05:08:15.118078] W [socket.c:3216:socket_connect] 0-management: Error disabling sockopt IPV6_V6ONLY: "Protocol not available"

==> glustershd.log <==
[2018-04-09 05:08:15.282192] W [socket.c:3216:socket_connect] 0-gv01-client-0: Error disabling sockopt IPV6_V6ONLY: "Protocol not available"
[2018-04-09 05:08:15.289508] W [socket.c:3216:socket_connect] 0-gv01-client-1: Error disabling sockopt IPV6_V6ONLY: "Protocol not available"

We see that when one node in the cluster is offline, Gluster takes the volume offline on the remaining node too, since client quorum is not met.  We might need a third server?

[root@nfs01 /]# mount -t glusterfs nfs01:/gv01 /n
Mount failed. Please check the log file for more details.
[root@nfs01 /]#

Can't mount the volume:

[root@nfs01 /]# gluster volume start gv01
volume start: gv01: failed: Quorum not met. Volume operation not allowed.
[root@nfs01 /]#

In this case we set two gluster tunables to allow the single brick to start:

[root@nfs01 glusterfs]# gluster volume set VOL cluster.server-quorum-type none
volume set: failed: Volume VOL does not exist
[root@nfs01 glusterfs]# gluster volume set gv01 cluster.server-quorum-type none
volume set: success
[root@nfs01 glusterfs]# gluster volume set gv01 cluster.quorum-type none
volume set: success
[root@nfs01 glusterfs]#
[root@nfs01 glusterfs]#
[root@nfs01 glusterfs]#
[root@nfs01 glusterfs]# gluster volume status
Status of volume: gv01
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick nfs01:/bricks/0/gv01                  49152     0          Y       28139
Self-heal Daemon on localhost               N/A       N/A        Y       28026

Task Status of Volume gv01
------------------------------------------------------------------------------
There are no active volume tasks

[root@nfs01 glusterfs]#

And set the same in /etc/glusterfs/glusterd.vol:

[root@nfs01 glusterfs]# cat /etc/glusterfs/glusterd.vol
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option ping-timeout 0
    option event-threads 1
    option cluster.quorum-type none
    option cluster.server-quorum-type none
#   option lock-timer 180
#   option transport.address-family inet6
#   option base-port 49152
#   option max-port  65535
end-volume
[root@nfs01 glusterfs]#
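
Keep in mind this trades safety for availability: with quorum disabled on a two-node replica, split brain becomes possible.  Once the second node is back and healed, the same tunables can be put back (a sketch):

gluster volume set gv01 cluster.server-quorum-type server
gluster volume set gv01 cluster.quorum-type auto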

Then you can mount remotely again.  But alas, that didn't resolve this either. 

ULTIMATE SOLUTION THAT WORKED

Until we started to look at auditd:

type=AVC msg=audit(1526965320.850:4094): avc:  denied  { write } for  pid=8714 comm="ganesha.nfsd" name="nfs_0" dev="dm-0" ino=201547689 scontext=system_u:system_r:ganesha_t:s0 tcontext=system_u:object_r:krb5_host_rcache_t:s0 tclass=file
type=SYSCALL msg=audit(1526965320.850:4094): arch=c000003e syscall=2 success=no exit=-13 a0=7f23b0003150 a1=2 a2=180 a3=2 items=0 ppid=1 pid=8714 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ganesha.nfsd" exe="/usr/bin/ganesha.nfsd" subj=system_u:system_r:ganesha_t:s0 key=(null)
type=PROCTITLE msg=audit(1526965320.850:4094): proctitle=2F7573722F62696E2F67616E657368612E6E667364002D4C002F7661722F6C6F672F67616E657368612F67616E657368612E6C6F67002D66002F6574632F67616E657368612F67616E657368612E636F6E66002D4E004E49565F4556454E54
type=AVC msg=audit(1526965320.850:4095): avc:  denied  { unlink } for  pid=8714 comm="ganesha.nfsd" name="nfs_0" dev="dm-0" ino=201547689 scontext=system_u:system_r:ganesha_t:s0 tcontext=system_u:object_r:krb5_host_rcache_t:s0 tclass=file
type=SYSCALL msg=audit(1526965320.850:4095): arch=c000003e syscall=87 success=no exit=-13 a0=7f23b0004100 a1=7f23b0000050 a2=7f23b0004100 a3=5 items=0 ppid=1 pid=8714 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ganesha.nfsd" exe="/usr/bin/ganesha.nfsd" subj=system_u:system_r:ganesha_t:s0 key=(null)
type=PROCTITLE msg=audit(1526965320.850:4095): proctitle=2F7573722F62696E2F67616E657368612E6E667364002D4C002F7661722F6C6F672F67616E657368612F67616E657368612E6C6F67002D66002F6574632F67616E657368612F67616E657368612E636F6E66002D4E004E49565F4556454E54

Running a few commands like the following solved the issue for us:

grep AVC /var/log/audit/audit.log | audit2allow -M systemd-allow

semodule -i systemd-allow.pp
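
If you'd rather see what the generated module will allow before loading it, audit2allow can print the rules to stdout, and audit2why explains each denial (standard usage of the same tools):

grep AVC /var/log/audit/audit.log | audit2allow
grep AVC /var/log/audit/audit.log | audit2why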

The error thrown also included this kernel panic message:

May 21 23:53:13 psql01 kernel: CPU: 3 PID: 2273 Comm: mount.nfs Tainted: G             L ------------   3.10.0-693.21.1.el7.x86_64 #1
May 21 23:53:13 psql01 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/14/2014
May 21 23:53:13 psql01 kernel: task: ffff880136335ee0 ti: ffff8801376b0000 task.ti: ffff8801376b0000
May 21 23:53:13 psql01 kernel: RIP: 0010:[]  [] _raw_spin_unlock_irqrestore+0x15/0x20
May 21 23:53:13 psql01 kernel: RSP: 0018:ffff8801376b3a60  EFLAGS: 00000206
May 21 23:53:13 psql01 kernel: RAX: ffffffffc05ab078 RBX: ffff880036973928 RCX: dead000000000200
May 21 23:53:13 psql01 kernel: RDX: ffffffffc05ab078 RSI: 0000000000000206 RDI: 0000000000000206
May 21 23:53:13 psql01 kernel: RBP: ffff8801376b3a60 R08: ffff8801376b3ab8 R09: ffff880137de1200
May 21 23:53:13 psql01 kernel: R10: ffff880036973928 R11: 0000000000000000 R12: ffff880036973928
May 21 23:53:13 psql01 kernel: R13: ffff8801376b3a58 R14: ffff88013fd98a40 R15: ffff8801376b3a58
May 21 23:53:13 psql01 kernel: FS:  00007fab48f07880(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
May 21 23:53:13 psql01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
May 21 23:53:13 psql01 kernel: CR2: 00007f99793d93cc CR3: 000000013761e000 CR4: 00000000000007e0
May 21 23:53:13 psql01 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 21 23:53:13 psql01 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May 21 23:53:13 psql01 kernel: Call Trace:
May 21 23:53:13 psql01 kernel: [] finish_wait+0x56/0x70
May 21 23:53:13 psql01 kernel: [] nfs_wait_client_init_complete+0xa1/0xe0 [nfs]
May 21 23:53:13 psql01 kernel: [] ? wake_up_atomic_t+0x30/0x30
May 21 23:53:13 psql01 kernel: [] nfs_get_client+0x22b/0x470 [nfs]
May 21 23:53:13 psql01 kernel: [] nfs4_set_client+0x98/0x130 [nfsv4]
May 21 23:53:13 psql01 kernel: [] nfs4_create_server+0x13e/0x3b0 [nfsv4]
May 21 23:53:13 psql01 kernel: [] nfs4_remote_mount+0x2e/0x60 [nfsv4]
May 21 23:53:13 psql01 kernel: [] mount_fs+0x3e/0x1b0
May 21 23:53:13 psql01 kernel: [] ? __alloc_percpu+0x15/0x20
May 21 23:53:13 psql01 kernel: [] vfs_kern_mount+0x67/0x110
May 21 23:53:13 psql01 kernel: [] nfs_do_root_mount+0x86/0xc0 [nfsv4]
May 21 23:53:13 psql01 kernel: [] nfs4_try_mount+0x44/0xc0 [nfsv4]
May 21 23:53:13 psql01 kernel: [] ? get_nfs_version+0x27/0x90 [nfs]
May 21 23:53:13 psql01 kernel: [] nfs_fs_mount+0x4cb/0xda0 [nfs]
May 21 23:53:13 psql01 kernel: [] ? nfs_clone_super+0x140/0x140 [nfs]
May 21 23:53:13 psql01 kernel: [] ? param_set_portnr+0x70/0x70 [nfs]
May 21 23:53:13 psql01 kernel: [] mount_fs+0x3e/0x1b0
May 21 23:53:13 psql01 kernel: [] ? __alloc_percpu+0x15/0x20
May 21 23:53:13 psql01 kernel: [] vfs_kern_mount+0x67/0x110
May 21 23:53:13 psql01 kernel: [] do_mount+0x233/0xaf0
May 21 23:53:13 psql01 kernel: [] SyS_mount+0x96/0xf0
May 21 23:53:13 psql01 kernel: [] system_call_fastpath+0x1c/0x21
May 21 23:53:13 psql01 kernel: [] ? system_call_after_swapgs+0xae/0x146

Best of luck!

Cheers,
TK

command ConnectStoragePoolVDS failed: Cannot find master domain:

So we receive the following error from oVirt:

VDSM mdskvm-p01.mds.xyz command ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=87ec67c6-8da8-4161-afdf-180778a4b595, msdUUID=73fa156c-f085-466f-b409-130a9795a667'

and dig in a bit deeper to see what's going on:

[root@mdskvm-p01 log]# systemctl status vdsmd.service
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2018-03-30 23:18:02 EDT; 23h ago
  Process: 2787 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
 Main PID: 2875 (vdsmd)
   CGroup: /system.slice/vdsmd.service
           ├─ 2875 /usr/bin/python2 /usr/share/vdsm/vdsmd
           └─16845 /usr/libexec/ioprocess --read-pipe-fd 51 --write-pipe-fd 50 --max-threads 10 --max-queued-requests 10

Mar 31 00:39:03 mdskvm-p01.mds.xyz vdsm[2875]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=d8dfd596-1e87-4e98-87ff-269edd…001d610>
                                               Traceback (most recent call last):
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task…
Mar 31 00:40:03 mdskvm-p01.mds.xyz vdsm[2875]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=d8dfd596-1e87-4e98-87ff-269edd…c0b96d0>
                                               Traceback (most recent call last):
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task…
Mar 31 00:40:23 mdskvm-p01.mds.xyz vdsm[2875]: WARN unhandled close event
Mar 31 00:40:35 mdskvm-p01.mds.xyz fence_ilo[20843]: Unable to connect/login to fencing device
Mar 31 00:40:37 mdskvm-p01.mds.xyz fence_ilo[20889]: Unable to connect/login to fencing device
Mar 31 00:41:03 mdskvm-p01.mds.xyz vdsm[2875]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=d8dfd596-1e87-4e98-87ff-269edd…009b650>
                                               Traceback (most recent call last):
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task…
Mar 31 00:41:57 mdskvm-p01.mds.xyz vdsm[2875]: WARN File: /var/lib/libvirt/qemu/channels/d8dfd596-1e87-4e98-87ff-269edd92bdf1.ovirt-guest-agen… removed
Mar 31 00:41:57 mdskvm-p01.mds.xyz vdsm[2875]: WARN File: /var/lib/libvirt/qemu/channels/d8dfd596-1e87-4e98-87ff-269edd92bdf1.org.qemu.guest_a… removed
Mar 31 00:43:29 mdskvm-p01.mds.xyz vdsm[2875]: WARN File: /var/lib/libvirt/qemu/channels/d8dfd596-1e87-4e98-87ff-269edd92bdf1.ovirt-guest-agen… removed
Mar 31 00:43:29 mdskvm-p01.mds.xyz vdsm[2875]: WARN File: /var/lib/libvirt/qemu/channels/d8dfd596-1e87-4e98-87ff-269edd92bdf1.org.qemu.guest_a… removed
Hint: Some lines were ellipsized, use -l to show in full.
[root@mdskvm-p01 log]#
[root@mdskvm-p01 log]#
[root@mdskvm-p01 log]# systemctl status vdsmd.service -l
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2018-03-30 23:18:02 EDT; 23h ago
  Process: 2787 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
 Main PID: 2875 (vdsmd)
   CGroup: /system.slice/vdsmd.service
           ├─ 2875 /usr/bin/python2 /usr/share/vdsm/vdsmd
           └─16845 /usr/libexec/ioprocess --read-pipe-fd 51 --write-pipe-fd 50 --max-threads 10 --max-queued-requests 10

Mar 31 00:39:03 mdskvm-p01.mds.xyz vdsm[2875]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=d8dfd596-1e87-4e98-87ff-269edd92bdf1 at 0x7fcd3c0b9950> timeout=30.0, duration=0 at 0x7fcd4001d610>
                                               Traceback (most recent call last):
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
                                                   task()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
                                                   self._callable()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 349, in __call__
                                                   self._execute()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 391, in _execute
                                                   self._vm.updateDriveVolume(drive)
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4209, in updateDriveVolume
                                                   vmDrive.volumeID)
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6119, in _getVolumeSize
                                                   (domainID, volumeID))
                                               StorageUnavailableError: Unable to get volume size for domain 73fa156c-f085-466f-b409-130a9795a667 volume 81186557-9080-42d1-ba6a-633fb8b805e5
Mar 31 00:40:03 mdskvm-p01.mds.xyz vdsm[2875]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=d8dfd596-1e87-4e98-87ff-269edd92bdf1 at 0x7fcd5805cd90> timeout=30.0, duration=0 at 0x7fcd3c0b96d0>
                                               Traceback (most recent call last):
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
                                                   task()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
                                                   self._callable()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 349, in __call__
                                                   self._execute()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 391, in _execute
                                                   self._vm.updateDriveVolume(drive)
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4209, in updateDriveVolume
                                                   vmDrive.volumeID)
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6119, in _getVolumeSize
                                                   (domainID, volumeID))
                                               StorageUnavailableError: Unable to get volume size for domain 73fa156c-f085-466f-b409-130a9795a667 volume 81186557-9080-42d1-ba6a-633fb8b805e5
Mar 31 00:40:23 mdskvm-p01.mds.xyz vdsm[2875]: WARN unhandled close event
Mar 31 00:40:35 mdskvm-p01.mds.xyz fence_ilo[20843]: Unable to connect/login to fencing device
Mar 31 00:40:37 mdskvm-p01.mds.xyz fence_ilo[20889]: Unable to connect/login to fencing device
Mar 31 00:41:03 mdskvm-p01.mds.xyz vdsm[2875]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=d8dfd596-1e87-4e98-87ff-269edd92bdf1 at 0x3adeb90> timeout=30.0, duration=0 at 0x7fcd2009b650>
                                               Traceback (most recent call last):
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
                                                   task()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
                                                   self._callable()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 349, in __call__
                                                   self._execute()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 391, in _execute
                                                   self._vm.updateDriveVolume(drive)
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4209, in updateDriveVolume
                                                   vmDrive.volumeID)
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6119, in _getVolumeSize
                                                   (domainID, volumeID))
                                               StorageUnavailableError: Unable to get volume size for domain 73fa156c-f085-466f-b409-130a9795a667 volume 81186557-9080-42d1-ba6a-633fb8b805e5
Mar 31 00:41:57 mdskvm-p01.mds.xyz vdsm[2875]: WARN File: /var/lib/libvirt/qemu/channels/d8dfd596-1e87-4e98-87ff-269edd92bdf1.ovirt-guest-agent.0 already removed
Mar 31 00:41:57 mdskvm-p01.mds.xyz vdsm[2875]: WARN File: /var/lib/libvirt/qemu/channels/d8dfd596-1e87-4e98-87ff-269edd92bdf1.org.qemu.guest_agent.0 already removed
Mar 31 00:43:29 mdskvm-p01.mds.xyz vdsm[2875]: WARN File: /var/lib/libvirt/qemu/channels/d8dfd596-1e87-4e98-87ff-269edd92bdf1.ovirt-guest-agent.0 already removed
Mar 31 00:43:29 mdskvm-p01.mds.xyz vdsm[2875]: WARN File: /var/lib/libvirt/qemu/channels/d8dfd596-1e87-4e98-87ff-269edd92bdf1.org.qemu.guest_agent.0 already removed
[root@mdskvm-p01 log]#
[root@mdskvm-p01 log]#
[root@mdskvm-p01 log]#
[root@mdskvm-p01 log]# systemctl restart vdsmd.service -l
[root@mdskvm-p01 log]# systemctl status vdsmd.service -l
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2018-04-01 00:04:52 EDT; 2s ago
  Process: 22701 ExecStopPost=/usr/libexec/vdsm/vdsmd_init_common.sh --post-stop (code=exited, status=0/SUCCESS)
  Process: 22705 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
 Main PID: 22783 (vdsmd)
   CGroup: /system.slice/vdsmd.service
           └─22783 /usr/bin/python2 /usr/share/vdsm/vdsmd

Apr 01 00:04:50 mdskvm-p01.mds.xyz vdsmd_init_common.sh[22705]: vdsm: Running prepare_transient_repository
Apr 01 00:04:51 mdskvm-p01.mds.xyz vdsmd_init_common.sh[22705]: vdsm: Running syslog_available
Apr 01 00:04:51 mdskvm-p01.mds.xyz vdsmd_init_common.sh[22705]: vdsm: Running nwfilter
Apr 01 00:04:51 mdskvm-p01.mds.xyz vdsmd_init_common.sh[22705]: vdsm: Running dummybr
Apr 01 00:04:52 mdskvm-p01.mds.xyz vdsmd_init_common.sh[22705]: vdsm: Running tune_system
Apr 01 00:04:52 mdskvm-p01.mds.xyz vdsmd_init_common.sh[22705]: vdsm: Running test_space
Apr 01 00:04:52 mdskvm-p01.mds.xyz vdsmd_init_common.sh[22705]: vdsm: Running test_lo
Apr 01 00:04:52 mdskvm-p01.mds.xyz systemd[1]: Started Virtual Desktop Server Manager.
Apr 01 00:04:53 mdskvm-p01.mds.xyz vdsm[22783]: WARN MOM not available.
Apr 01 00:04:53 mdskvm-p01.mds.xyz vdsm[22783]: WARN MOM not available, KSM stats will be missing.
[root@mdskvm-p01 log]#
[root@mdskvm-p01 log]#

XFS metadata corruption shows up ( /var/log/messages ):

Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): Metadata corruption detected at xfs_agi_read_verify+0x5e/0x110 [xfs], xfs_agi block 0xebffc502
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): Unmount and run xfs_repair
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): First 64 bytes of corrupted metadata buffer:
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811e7aa1200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811e7aa1210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811e7aa1220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811e7aa1230: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): metadata I/O error: block 0xebffc502 ("xfs_trans_read_buf_map") error 117 numblks 1
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): Metadata corruption detected at xfs_agi_read_verify+0x5e/0x110 [xfs], xfs_agi block 0xefffc402
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): Unmount and run xfs_repair
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): First 64 bytes of corrupted metadata buffer:
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811e7aa1200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811e7aa1210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811e7aa1220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 29 09:37:55 mdskvm-p01 kernel: ovirtmgmt: received packet on bond0 with own address as source address (addr:78:e7:d1:8f:4d:26, vlan:0)
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811e7aa1230: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 29 09:37:55 mdskvm-p01 kernel: ovirtmgmt: received packet on bond0 with own address as source address (addr:78:e7:d1:8f:4d:26, vlan:0)
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): metadata I/O error: block 0xefffc402 ("xfs_trans_read_buf_map") error 117 numblks 1
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): Metadata corruption detected at xfs_agi_read_verify+0x5e/0x110 [xfs], xfs_agi block 0xf3ffc302
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): Unmount and run xfs_repair
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): First 64 bytes of corrupted metadata buffer:
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811f8ba2200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811f8ba2210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811f8ba2220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811f8ba2230: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): metadata I/O error: block 0xf3ffc302 ("xfs_trans_read_buf_map") error 117 numblks 1
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): Metadata corruption detected at xfs_agi_read_verify+0x5e/0x110 [xfs], xfs_agi block 0xf7ffc202
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): Unmount and run xfs_repair
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): First 64 bytes of corrupted metadata buffer:
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8808e5335c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8808e5335c10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8808e5335c20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8808e5335c30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): metadata I/O error: block 0xf7ffc202 ("xfs_trans_read_buf_map") error 117 numblks 1

So we fix it using the following, after going into runlevel 1 and unmounting the volume:

xfs_repair -n /dev/mdskvmsanvg/mdskvmsanlv 2>&1 | more
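
The -n flag is a read-only dry run.  Once the reported problems look sane, the actual repair is the same command without -n, still with the volume unmounted:

xfs_repair /dev/mdskvmsanvg/mdskvmsanlv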

Cheers,
TK

missing icmp_seq numbers

In case you see these missing icmp_seq numbers:

[root@mdskvm-p01 ~]# ping 192.168.0.149
PING 192.168.0.149 (192.168.0.149) 56(84) bytes of data.
64 bytes from 192.168.0.149: icmp_seq=1 ttl=64 time=0.536 ms
64 bytes from 192.168.0.149: icmp_seq=3 ttl=64 time=0.240 ms
64 bytes from 192.168.0.149: icmp_seq=7 ttl=64 time=0.330 ms
64 bytes from 192.168.0.149: icmp_seq=11 ttl=64 time=0.331 ms
64 bytes from 192.168.0.149: icmp_seq=15 ttl=64 time=0.353 ms
64 bytes from 192.168.0.149: icmp_seq=19 ttl=64 time=0.271 ms

 

Then you should check the BONDING_OPTS setting in ifcfg-bond0:

BONDING_OPTS='mode=4 miimon=100'

In our case, only 2 of the 4 NIC ports were plugged into the switch, so mode=4 (802.3ad link aggregation) tried to aggregate across inactive ports.  Switching to mode=1 (active-backup) resolved it:

BONDING_OPTS='mode=1 miimon=100'

We also tried mode=2 but that didn't work.  Further reading is available from RedHat Bonding Modes.
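
A quick way to see which slave ports the bond actually considers usable is the kernel's bonding status file:

cat /proc/net/bonding/bond0    # each "Slave Interface" section reports an MII Status; unplugged ports show "down"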

Some of the other common symptoms of this problem include:

no route to host
Destination Host Unreachable

 

Cheers,
Tom

 

Extending the size of your mdadm array.

Now that you've replaced all the failed disks, we can double the size of our array from 4TB to 8TB.

We start off with this array:

[root@mbpc-pc log]# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Mon Mar 26 00:06:24 2012
     Raid Level : raid6
     Array Size : 3907045632 (3726.05 GiB 4000.81 GB)
  Used Dev Size : 976761408 (931.51 GiB 1000.20 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Thu Mar 29 23:02:24 2018
          State : active
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : mbpc:0
           UUID : 2f36ac48:5e3e4c54:72177c53:bea3e41e
         Events : 1333503

    Number   Major   Minor   RaidDevice State
       8       8       64        0      active sync   /dev/sde
       9       8       32        1      active sync   /dev/sdc
       7       8       16        2      active sync   /dev/sdb
      11       8       48        3      active sync   /dev/sdd
       6       8       80        4      active sync   /dev/sdf
      10       8        0        5      active sync   /dev/sda
[root@mbpc-pc log]#

So let's do this:

[root@mbpc-pc log]#
[root@mbpc-pc log]# mdadm --grow /dev/md0 --size=max
mdadm: component size of /dev/md0 has been set to 1953513536K
unfreeze
[root@mbpc-pc log]# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Mon Mar 26 00:06:24 2012
     Raid Level : raid6
     Array Size : 7814054144 (7452.06 GiB 8001.59 GB)
  Used Dev Size : 1953513536 (1863.02 GiB 2000.40 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Thu Mar 29 23:42:32 2018
          State : active, resyncing
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

  Resync Status : 51% complete

           Name : mbpc:0
           UUID : 2f36ac48:5e3e4c54:72177c53:bea3e41e
         Events : 1333507

    Number   Major   Minor   RaidDevice State
       8       8       64        0      active sync   /dev/sde
       9       8       32        1      active sync   /dev/sdc
       7       8       16        2      active sync   /dev/sdb
      11       8       48        3      active sync   /dev/sdd
       6       8       80        4      active sync   /dev/sdf
      10       8        0        5      active sync   /dev/sda
[root@mbpc-pc log]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdd[11] sde[8] sdc[9] sdb[7] sdf[6] sda[10]
      7814054144 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/6] [UUUUUU]
      [==========>..........]  resync = 51.5% (1007560660/1953513536) finish=373.8min speed=42168K/sec
      bitmap: 7/8 pages [28KB], 131072KB chunk

unused devices: <none>
[root@mbpc-pc log]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdd[11] sde[8] sdc[9] sdb[7] sdf[6] sda[10]
      7814054144 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/6] [UUUUUU]
      [==========>..........]  resync = 51.5% (1007603712/1953513536) finish=405.9min speed=38830K/sec
      bitmap: 8/8 pages [32KB], 131072KB chunk

unused devices: <none>
[root@mbpc-pc log]#

And now you wait.  Once the resync completes, use the usual LVM commands (pvresize, then lvextend) to grow those components, checking your work with pvs, vgs and lvs along the way.
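
If you'd rather not poll /proc/mdstat by hand, a minimal sketch that simply blocks until the resync finishes and then confirms the new size:

# Wait for the resync/recovery to complete before touching LVM.
mdadm --wait /dev/md0

# Confirm the array now reports the grown size.
mdadm --detail /dev/md0 | grep 'Array Size'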

Some reading available here.

Now that you've done that, it's time to resize the LVM physical volume:

[root@mbpc-pc ~]# pvs
  PV         VG          Fmt  Attr PSize   PFree
  /dev/md0   MBPCStorage lvm2 a--    3.64t 931.70g
  /dev/sdg2  mbpcvg      lvm2 a--    1.18t      0
  /dev/sdg4  mbpcvg      lvm2 a--  465.75g 415.75g
[root@mbpc-pc ~]# pvresize /dev/md0
  Physical volume "/dev/md0" changed
  1 physical volume(s) resized / 0 physical volume(s) not resized
[root@mbpc-pc ~]# pvs
  PV         VG          Fmt  Attr PSize   PFree
  /dev/md0   MBPCStorage lvm2 a--    7.28t   4.55t
  /dev/sdg2  mbpcvg      lvm2 a--    1.18t      0
  /dev/sdg4  mbpcvg      lvm2 a--  465.75g 415.75g
[root@mbpc-pc ~]# vgs
  VG          #PV #LV #SN Attr   VSize   VFree
  MBPCStorage   1   1   0 wz--n-   7.28t   4.55t
  mbpcvg        2   3   0 wz--n-   1.64t 415.75g
[root@mbpc-pc ~]# lvs
  LV         VG          Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  MBPCBackup MBPCStorage -wi-ao----   2.73t
  fmlv       mbpcvg      -wi-ao----   1.15t
  rootlv     mbpcvg      -wi-ao----  81.25g
  swaplv     mbpcvg      -wi-ao----   4.00g
[root@mbpc-pc ~]#
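
Note that pvresize only grows the physical volume.  If you also want an existing LV and its filesystem to take up the new space, something along these lines works (a sketch using the MBPCBackup LV from the output above; resize2fs assumes an ext3/ext4 filesystem, which isn't shown here, so substitute xfs_growfs for XFS):

# Grow the logical volume into the newly available extents ...
lvextend -l +100%FREE /dev/MBPCStorage/MBPCBackup

# ... then grow the filesystem to match the new LV size.
resize2fs /dev/MBPCStorage/MBPCBackup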

 

And you're set.

Cheers,
TK

pam_reply called with result [4]: System error.

So you're trying to log in and get these messages on ovirt01 (192.168.0.145) and ipaclient01 (192.168.0.236).  What could be wrong:

(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [ldb] (0x4000): cancel ldb transaction (nesting: 2)
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [sysdb_mod_group_member] (0x0080): ldb_modify failed: [No such object](32)[ldb_wait from ldb_modify with LDB_WAIT_ALL: No such object (32)]
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [sysdb_mod_group_member] (0x0400): Error: 2 (No such file or directory)
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [sysdb_update_members_ex] (0x0020): Could not add member [tom@mds.xyz] to group [name=tom@mds.xyz,cn=groups,cn=mds.xyz,cn=sysdb]. Skipping.

(Thu Mar 22 23:59:26 2018) [[sssd[krb5_child[3246]]]] [k5c_setup_fast] (0x0020): check_fast_ccache failed.
(Thu Mar 22 23:59:26 2018) [[sssd[krb5_child[3246]]]] [k5c_setup_fast] (0x0020): 2618: [-1765328203][Key table entry not found]
(Thu Mar 22 23:59:26 2018) [[sssd[krb5_child[3246]]]] [privileged_krb5_setup] (0x0040): Cannot set up FAST
(Thu Mar 22 23:59:26 2018) [[sssd[krb5_child[3246]]]] [main] (0x0020): privileged_krb5_setup failed.
(Thu Mar 22 23:59:26 2018) [[sssd[krb5_child[3246]]]] [main] (0x0020): krb5_child failed!

(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [read_pipe_handler] (0x0400): EOF received, client finished

(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [parse_krb5_child_response] (0x0020): message too short.
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [krb5_auth_done] (0x0040): The krb5_child process returned an error. Please inspect the krb5_child.log file or the journal for more information
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [krb5_auth_done] (0x0040): Could not parse child response [22]: Invalid argument
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [check_wait_queue] (0x1000): Wait queue for user [tom@mds.xyz] is empty.
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [krb5_auth_queue_done] (0x0040): krb5_auth_recv failed with: 22
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [ipa_pam_auth_handler_krb5_done] (0x0040): KRB5 auth failed [22]: Invalid argument
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [dp_req_done] (0x0400): DP Request [PAM Preauth #2]: Request handler finished [0]: Success

(Thu Mar 22 23:59:26 2018) [sssd[pam]] [pam_dp_process_reply] (0x0200): received: [4 (System error)][mds.xyz]
(Thu Mar 22 23:59:26 2018) [sssd[pam]] [pam_reply] (0x0200): pam_reply called with result [4]: System error.
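
For reference, these entries come from the SSSD domain log and krb5_child.log.  If you aren't seeing this much detail, a quick sketch to raise the verbosity (debug_level = 9 and the default log locations are assumptions on my part; remember to lower it again afterwards):

# Add under the [domain/nix.mds.xyz] section of /etc/sssd/sssd.conf:
#   debug_level = 9
systemctl restart sssd

# Then retry the login and watch the logs.
tail -f /var/log/sssd/sssd_nix.mds.xyz.log /var/log/sssd/krb5_child.log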

More intriguing is that the reverse dig output had two PTR records for one IP and none for the other:

[root@ovirt01 network-scripts]# dig -x 192.168.0.145

; <<>> DiG 9.9.4-RedHat-9.9.4-51.el7_4.2 <<>> -x 192.168.0.145
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47551
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 2, ADDITIONAL: 3

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;145.0.168.192.in-addr.arpa.    IN      PTR

;; ANSWER SECTION:
145.0.168.192.in-addr.arpa. 1200 IN     PTR     ovirt01.nix.mds.xyz.
145.0.168.192.in-addr.arpa. 1200 IN     PTR     ipaclient01.nix.mds.xyz.

;; AUTHORITY SECTION:
0.168.192.in-addr.arpa. 86400   IN      NS      idmipa01.nix.mds.xyz.
0.168.192.in-addr.arpa. 86400   IN      NS      idmipa02.nix.mds.xyz.

;; ADDITIONAL SECTION:
idmipa01.nix.mds.xyz.   1200    IN      A       192.168.0.44
idmipa02.nix.mds.xyz.   1200    IN      A       192.168.0.45

;; Query time: 1 msec
;; SERVER: 192.168.0.44#53(192.168.0.44)
;; WHEN: Fri Mar 23 00:04:25 EDT 2018
;; MSG SIZE  rcvd: 192

[root@ovirt01 network-scripts]#

Whilst the other IP had no PTR records returned:

[root@ovirt01 network-scripts]# dig -x 192.168.0.236

; <<>> DiG 9.9.4-RedHat-9.9.4-51.el7_4.2 <<>> -x 192.168.0.236
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 64699
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;236.0.168.192.in-addr.arpa.    IN      PTR

;; AUTHORITY SECTION:
0.168.192.in-addr.arpa. 3600    IN      SOA     idmipa01.nix.mds.xyz. hostmaster.nix.mds.xyz. 1521778151 3600 900 1209600 3600

;; Query time: 1 msec
;; SERVER: 192.168.0.44#53(192.168.0.44)
;; WHEN: Fri Mar 23 00:27:22 EDT 2018
;; MSG SIZE  rcvd: 122

[root@ovirt01 network-scripts]#
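
To clean up the reverse zone itself, something like the following should do it (a sketch using the FreeIPA CLI; zone and record values are taken from the dig output above, and you'll need admin credentials first):

kinit admin

# Remove the stray PTR pointing 192.168.0.145 at ipaclient01 ...
ipa dnsrecord-del 0.168.192.in-addr.arpa. 145 --ptr-rec=ipaclient01.nix.mds.xyz.
# ... and add it where it belongs, on 192.168.0.236.
ipa dnsrecord-add 0.168.192.in-addr.arpa. 236 --ptr-rec=ipaclient01.nix.mds.xyz.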

The underlying cause, however, was that I had copied the /etc/sssd/sssd.conf config from one client to the other.  More specifically, the config from ipaclient01 ended up on ovirt01:

[root@ipaclient01 ~]# grep -Ei ipa_hostname /etc/sssd/sssd.conf
ipa_hostname = ipaclient01.nix.mds.xyz
[root@ipaclient01 ~]#

[root@ovirt01 network-scripts]# grep -Ei ipa_hostname /etc/sssd/sssd.conf
ipa_hostname = ipaclient01.nix.mds.xyz
[root@ovirt01 network-scripts]#

Changing the above quickly resolved my login issue.
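
For reference, a minimal sketch of the change on ovirt01 (the sss_cache step is my own addition to flush any stale cached entries):

# Point ipa_hostname at the correct client and restart SSSD.
sed -i 's/^ipa_hostname = .*/ipa_hostname = ovirt01.nix.mds.xyz/' /etc/sssd/sssd.conf
sss_cache -E
systemctl restart sssd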

Cheers,
TK

Windows 7 Cannot Resolve hostnames via PING but nslookup works.

It may happen that you can't ping a hostname, whether it resolves through a local DNS server you run or through an external one, even though nslookup works.  Flushing the DNS cache may not help either:

C:\Users\tom>ipconfig /flushdns

Windows IP Configuration

Successfully flushed the DNS Resolver Cache.

C:\Users\tom>ping vcsa01
Ping request could not find host vcsa01. Please check the name and try again.

C:\Users\tom>ping vcsa01

One thing you can do is disable IPv6 in Windows for the interface you're using, under Control Panel -> Network and Sharing Center.

If that doesn't work, check whether you are using OpenVPN.  If the OpenVPN client is up and you're using the VPN to move in and out of your infrastructure, try turning it off or restarting the DHCP Client service in Windows Services.

Alternatively, you may have an extra DNS server listed in the IPv4 Advanced Properties panel of your network card properties in Windows.  Review your DNS settings and remove any entries that can't resolve the hostnames you are trying to reach.
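
If you prefer the command line, a rough sketch of reviewing and pruning the resolver list (the interface name and 192.168.0.99 are placeholders; run these from an elevated prompt):

rem List the DNS servers configured on each interface.
netsh interface ipv4 show dnsservers

rem Remove the extra entry that can't resolve your internal hostnames.
netsh interface ipv4 delete dnsservers "Local Area Connection" 192.168.0.99

rem Restart the DHCP Client service mentioned above.
net stop dhcp && net start dhcp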

Cheers,
TK


     