Header Shadow Image


kernel: NFS: nfs4_discover_server_trunking unhandled error -512. Exiting with error EIO and mount hangs

Mount still hangs and restarting NFS and autofs doesn't work to resolve this error below:

Apr  8 12:12:46 ovirt01 systemd: Stopping Automounts filesystems on demand…
Apr  8 12:12:46 ovirt01 kernel: NFS: nfs4_discover_server_trunking unhandled error -512. Exiting with error EIO
Apr  8 12:12:47 ovirt01 automount[1487]: umount_autofs_indirect: ask umount returned busy /n
Apr  8 12:12:48 ovirt01 ovsdb-server: ovs|111192|stream_ssl|ERR|Private key must be configured to use SSL
Apr  8 12:12:48 ovirt01 ovsdb-server: ovs|111193|stream_ssl|ERR|Certificate must be configured to use SSL
Apr  8 12:12:49 ovirt01 systemd: Starting Automounts filesystems on demand…
Apr  8 12:12:49 ovirt01 automount[2165]: lookup_read_map: lookup(sss): getautomntent_r: No such file or directory
Apr  8 12:12:49 ovirt01 systemd: Started Automounts filesystems on demand.

This looks like a potential kernel bug as indicated through this thread.  Upgrade kernel and try again.  And this is the RH document that speak to more of this.  However despite the kernel versions, the issue continued again.  See the next dated post for more info.  Now despite this, we continued to have issues, even through reboots and it appeared that the above kernel upgrade didn't do very much at all.  So we investigated by attempting to mount the share remotely using both participating nodes and the VIP:

[root@ipaclient01 /]# mount nfs02:/n /n -v
mount.nfs: timeout set for Sun Apr  8 19:58:13 2018
mount.nfs: trying text-based options 'vers=4.1,addr=192.168.0.119,clientaddr=192.168.0.236'
mount.nfs: mount(2): No such file or directory
mount.nfs: trying text-based options 'addr=192.168.0.119'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying 192.168.0.119 prog 100003 vers 3 prot TCP port 2049
mount.nfs: prog 100005, trying vers=3, prot=17
mount.nfs: trying 192.168.0.119 prog 100005 vers 3 prot UDP port 20048
mount.nfs: mount(2): Permission denied
mount.nfs: access denied by server while mounting nfs02:/n

[root@ipaclient01 /]# showmount -e nfs02
Export list for nfs02:
/n (everyone)
[root@ipaclient01 /]#

Next we try with the following, remembering to unmount /n right after each attempt:

mount nfs02:/n /n -v
mount nfs01:/n /n -v
mount nfs-c01:/n /n -v

[root@ipaclient01 /]# showmount -e nfs02
Export list for nfs02:
/n (everyone)
[root@ipaclient01 /]#

 

Though we're still not 100% the exact solution that fixed this but we did recompile NFS Ganesha and reinstalled it also taking the Gluster FS offline and online numerous times as well as doing the same to NFS Ganesha and ensuring that HAPROXY and keepalived is working on both nodes nodes.  We also added HAPROXY statistics to the configuration with these lines at the end of each /etc/haproxy/haproxy.cfg file:

[root@nfs02 ganesha]# cat /etc/haproxy/haproxy.cfg
global
    log         127.0.0.1 local2 debug
    stats       socket /var/run/haproxy.sock mode 0600 level admin
    # stats     socket /var/lib/haproxy/stats
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon
    debug

defaults
    mode                    tcp
    log                     global
    option                  dontlognull
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000


frontend nfs-in
    bind nfs-c01:2049
    mode tcp
    option tcplog
    default_backend             nfs-back


backend nfs-back
    balance     roundrobin
    server      nfs02.nix.mds.xyz    nfs02.nix.mds.xyz:2049 check
    server      nfs01.nix.mds.xyz    nfs01.nix.mds.xyz:2049 check

listen stats
    bind :9000
    mode http
    stats enable
    stats hide-version
    stats realm Haproxy\ Statistics
    stats uri /haproxy-stats
    stats auth admin:passw0rd
[root@nfs02 ganesha]#

Sure enough, once this was done, the mounts worked like a charm from all the other hosts.  Another issue was that we were calling dhclient within a certain script at boot time.  This allocated a secondary IP to the interface potentially confusing the NFS servers though this is unconfirmed.  Nontheless, the host was reconfigured with only one static IP.

Digging some more:


==> n.log <==
[2018-04-09 05:08:13.704156] I [MSGID: 100030] [glusterfsd.c:2556:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.13.2 (args: /usr/sbin/glusterfs –process-name fuse –volfile-server=nfs01 –volfile-id=/gv01 /n)
[2018-04-09 05:08:13.711255] W [MSGID: 101002] [options.c:995:xl_opt_validate] 0-glusterfs: option 'address-family' is deprecated, preferred is 'transport.address-family', continuing with correction
[2018-04-09 05:08:13.728297] W [socket.c:3216:socket_connect] 0-glusterfs: Error disabling sockopt IPV6_V6ONLY: "Protocol not available"
[2018-04-09 05:08:13.729025] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2018-04-09 05:08:13.737757] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2018-04-09 05:08:13.738114] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 3
[2018-04-09 05:08:13.738203] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 4
[2018-04-09 05:08:13.738324] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 5
[2018-04-09 05:08:13.738330] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 6
[2018-04-09 05:08:13.738655] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 7
[2018-04-09 05:08:13.738742] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 8
[2018-04-09 05:08:13.739460] W [MSGID: 101174] [graph.c:363:_log_if_unknown_option] 0-gv01-readdir-ahead: option 'parallel-readdir' is not recognized
[2018-04-09 05:08:13.739787] I [MSGID: 114020] [client.c:2360:notify] 0-gv01-client-0: parent translators are ready, attempting connect on transport
[2018-04-09 05:08:13.747040] W [socket.c:3216:socket_connect] 0-gv01-client-0: Error disabling sockopt IPV6_V6ONLY: "Protocol not available"
[2018-04-09 05:08:13.747372] I [MSGID: 114020] [client.c:2360:notify] 0-gv01-client-1: parent translators are ready, attempting connect on transport
[2018-04-09 05:08:13.747883] E [MSGID: 114058] [client-handshake.c:1571:client_query_portmap_cbk] 0-gv01-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2018-04-09 05:08:13.748026] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-gv01-client-0: disconnected from gv01-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2018-04-09 05:08:13.748070] W [MSGID: 108001] [afr-common.c:5391:afr_notify] 0-gv01-replicate-0: Client-quorum is not met
[2018-04-09 05:08:13.754493] W [socket.c:3216:socket_connect] 0-gv01-client-1: Error disabling sockopt IPV6_V6ONLY: "Protocol not available"
Final graph:
+——————————————————————————+
  1: volume gv01-client-0
  2:     type protocol/client
  3:     option ping-timeout 42
  4:     option remote-host nfs01
  5:     option remote-subvolume /bricks/0/gv01
  6:     option transport-type socket
  7:     option transport.address-family inet
  8:     option username 916ccf06-dc1d-467f-bc3d-f00a7449618f
  9:     option password a44739e0-9587-411f-8e6a-9a6a4e46156c
 10:     option event-threads 8
 11:     option transport.tcp-user-timeout 0
 12:     option transport.socket.keepalive-time 20
 13:     option transport.socket.keepalive-interval 2
 14:     option transport.socket.keepalive-count 9
 15:     option send-gids true
 16: end-volume
 17:
 18: volume gv01-client-1
 19:     type protocol/client
 20:     option ping-timeout 42
 21:     option remote-host nfs02
 22:     option remote-subvolume /bricks/0/gv01
 23:     option transport-type socket
 24:     option transport.address-family inet
 25:     option username 916ccf06-dc1d-467f-bc3d-f00a7449618f
 26:     option password a44739e0-9587-411f-8e6a-9a6a4e46156c
 27:     option event-threads 8
 28:     option transport.tcp-user-timeout 0
 29:     option transport.socket.keepalive-time 20
 30:     option transport.socket.keepalive-interval 2
 31:     option transport.socket.keepalive-count 9
 32:     option send-gids true
 33: end-volume
 34:
 35: volume gv01-replicate-0
 36:     type cluster/replicate
 37:     option afr-pending-xattr gv01-client-0,gv01-client-1
 38:     option quorum-type auto
 39:     option use-compound-fops off
 40:     subvolumes gv01-client-0 gv01-client-1
 41: end-volume
 42:
 43: volume gv01-dht
 44:     type cluster/distribute
 45:     option lock-migration off
 46:     subvolumes gv01-replicate-0
 47: end-volume
 48:
 49: volume gv01-write-behind
 50:     type performance/write-behind
 51:     option cache-size 8MB
 52:     subvolumes gv01-dht
 53: end-volume
 54:
 55: volume gv01-read-ahead
 56:     type performance/read-ahead
 57:     subvolumes gv01-write-behind
 58: end-volume
 59:
 60: volume gv01-readdir-ahead
 61:     type performance/readdir-ahead
 62:     option parallel-readdir off
 63:     option rda-request-size 131072
 64:     option rda-cache-limit 10MB
 65:     subvolumes gv01-read-ahead
 66: end-volume
 67:
 68: volume gv01-io-cache
 69:     type performance/io-cache
 70:     option cache-size 1GB
 71:     subvolumes gv01-readdir-ahead
 72: end-volume
 73:
 74: volume gv01-quick-read
 75:     type performance/quick-read
 76:     option cache-size 1GB
 77:     subvolumes gv01-io-cache
 78: end-volume
 79:
 80: volume gv01-open-behind
 81:     type performance/open-behind
 82:     subvolumes gv01-quick-read
 83: end-volume
 84:
 85: volume gv01-md-cache
 86:     type performance/md-cache
 87:     subvolumes gv01-open-behind
 88: end-volume
 89:
 90: volume gv01
 91:     type debug/io-stats
 92:     option log-level INFO
 93:     option latency-measurement off
 94:     option count-fop-hits off
 95:     subvolumes gv01-md-cache
 96: end-volume
 97:
 98: volume meta-autoload
 99:     type meta
100:     subvolumes gv01
101: end-volume
102:
+——————————————————————————+
[2018-04-09 05:08:13.922631] E [socket.c:2374:socket_connect_finish] 0-gv01-client-1: connection to 192.168.0.119:24007 failed (No route to host); disconnecting socket
[2018-04-09 05:08:13.922690] E [MSGID: 108006] [afr-common.c:5164:__afr_handle_child_down_event] 0-gv01-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2018-04-09 05:08:13.926201] I [fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.22
[2018-04-09 05:08:13.926245] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to graph 0
[2018-04-09 05:08:13.926518] I [MSGID: 108006] [afr-common.c:5444:afr_local_init] 0-gv01-replicate-0: no subvolumes up
[2018-04-09 05:08:13.926671] E [MSGID: 101046] [dht-common.c:1501:dht_lookup_dir_cbk] 0-gv01-dht: dict is null
[2018-04-09 05:08:13.926762] E [fuse-bridge.c:4271:fuse_first_lookup] 0-fuse: first lookup on root failed (Transport endpoint is not connected)
[2018-04-09 05:08:13.927207] I [MSGID: 108006] [afr-common.c:5444:afr_local_init] 0-gv01-replicate-0: no subvolumes up
[2018-04-09 05:08:13.927262] E [MSGID: 101046] [dht-common.c:1501:dht_lookup_dir_cbk] 0-gv01-dht: dict is null
[2018-04-09 05:08:13.927301] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve (Transport endpoint is not connected)
[2018-04-09 05:08:13.927339] E [fuse-bridge.c:900:fuse_getattr_resume] 0-glusterfs-fuse: 2: GETATTR 1 (00000000-0000-0000-0000-000000000001) resolution failed
[2018-04-09 05:08:13.931497] I [MSGID: 108006] [afr-common.c:5444:afr_local_init] 0-gv01-replicate-0: no subvolumes up
[2018-04-09 05:08:13.931558] E [MSGID: 101046] [dht-common.c:1501:dht_lookup_dir_cbk] 0-gv01-dht: dict is null
[2018-04-09 05:08:13.931599] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve (Transport endpoint is not connected)
[2018-04-09 05:08:13.931623] E [fuse-bridge.c:900:fuse_getattr_resume] 0-glusterfs-fuse: 3: GETATTR 1 (00000000-0000-0000-0000-000000000001) resolution failed
[2018-04-09 05:08:13.937258] I [fuse-bridge.c:5093:fuse_thread_proc] 0-fuse: initating unmount of /n
[2018-04-09 05:08:13.938043] W [glusterfsd.c:1393:cleanup_and_exit] (–>/lib64/libpthread.so.0(+0x7e25) [0x7fb80b05ae25] –>/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x560b52471675] –>/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x560b5247149b] ) 0-: received signum (15), shutting down
[2018-04-09 05:08:13.938086] I [fuse-bridge.c:5855:fini] 0-fuse: Unmounting '/n'.
[2018-04-09 05:08:13.938106] I [fuse-bridge.c:5860:fini] 0-fuse: Closing fuse connection to '/n'.

==> glusterd.log <==
[2018-04-09 05:08:15.118078] W [socket.c:3216:socket_connect] 0-management: Error disabling sockopt IPV6_V6ONLY: "Protocol not available"

==> glustershd.log <==
[2018-04-09 05:08:15.282192] W [socket.c:3216:socket_connect] 0-gv01-client-0: Error disabling sockopt IPV6_V6ONLY: "Protocol not available"
[2018-04-09 05:08:15.289508] W [socket.c:3216:socket_connect] 0-gv01-client-1: Error disabling sockopt IPV6_V6ONLY: "Protocol not available"

We see that when one volume in the cluster is offline, gluster takes the other one offline too.  We might need a third server?

[root@nfs01 /]# mount -t glusterfs nfs01:/gv01 /n
Mount failed. Please check the log file for more details.
[root@nfs01 /]#

Can't mount the volume:

[root@nfs01 /]# gluster volume start gv01
volume start: gv01: failed: Quorum not met. Volume operation not allowed.
[root@nfs01 /]#

In this case we set two gluster tunables to allow the single brick to start:

[root@nfs01 glusterfs]# gluster volume set VOL cluster.server-quorum-type none
volume set: failed: Volume VOL does not exist
[root@nfs01 glusterfs]# gluster volume set gv01 cluster.server-quorum-type none
volume set: success
[root@nfs01 glusterfs]# gluster volume set gv01 cluster.quorum-type none
volume set: success
[root@nfs01 glusterfs]#
[root@nfs01 glusterfs]#
[root@nfs01 glusterfs]#
[root@nfs01 glusterfs]# gluster volume status
Status of volume: gv01
Gluster process                             TCP Port  RDMA Port  Online  Pid
——————————————————————————
Brick nfs01:/bricks/0/gv01                  49152     0          Y       28139
Self-heal Daemon on localhost               N/A       N/A        Y       28026

Task Status of Volume gv01
——————————————————————————
There are no active volume tasks

[root@nfs01 glusterfs]#

And set the same in /etc/glusterfs/glusterd.vol:

[root@nfs01 glusterfs]# cat /etc/glusterfs/glusterd.vol
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option ping-timeout 0
    option event-threads 1
    option cluster.quorum-type none
    option cluster.server-quorum-type none
#   option lock-timer 180
#   option transport.address-family inet6
#   option base-port 49152
#   option max-port  65535
end-volume
[root@nfs01 glusterfs]#

 

Then you can mount remotely again.

Cheers,
TK

Comments are closed.


     
  Copyright © 2003 - 2013 Tom Kacperski (microdevsys.com). All rights reserved.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License