command ConnectStoragePoolVDS failed: Cannot find master domain:

So we receive the following error from oVirt:

VDSM mdskvm-p01.mds.xyz command ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=87ec67c6-8da8-4161-afdf-180778a4b595, msdUUID=73fa156c-f085-466f-b409-130a9795a667'

and dig in a bit deeper to see what's going on:

[root@mdskvm-p01 log]# systemctl status vdsmd.service
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2018-03-30 23:18:02 EDT; 23h ago
  Process: 2787 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
 Main PID: 2875 (vdsmd)
   CGroup: /system.slice/vdsmd.service
           ├─ 2875 /usr/bin/python2 /usr/share/vdsm/vdsmd
           └─16845 /usr/libexec/ioprocess --read-pipe-fd 51 --write-pipe-fd 50 --max-threads 10 --max-queued-requests 10

Mar 31 00:39:03 mdskvm-p01.mds.xyz vdsm[2875]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=d8dfd596-1e87-4e98-87ff-269edd…001d610>
                                               Traceback (most recent call last):
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task…
Mar 31 00:40:03 mdskvm-p01.mds.xyz vdsm[2875]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=d8dfd596-1e87-4e98-87ff-269edd…c0b96d0>
                                               Traceback (most recent call last):
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task…
Mar 31 00:40:23 mdskvm-p01.mds.xyz vdsm[2875]: WARN unhandled close event
Mar 31 00:40:35 mdskvm-p01.mds.xyz fence_ilo[20843]: Unable to connect/login to fencing device
Mar 31 00:40:37 mdskvm-p01.mds.xyz fence_ilo[20889]: Unable to connect/login to fencing device
Mar 31 00:41:03 mdskvm-p01.mds.xyz vdsm[2875]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=d8dfd596-1e87-4e98-87ff-269edd…009b650>
                                               Traceback (most recent call last):
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task…
Mar 31 00:41:57 mdskvm-p01.mds.xyz vdsm[2875]: WARN File: /var/lib/libvirt/qemu/channels/d8dfd596-1e87-4e98-87ff-269edd92bdf1.ovirt-guest-agen… removed
Mar 31 00:41:57 mdskvm-p01.mds.xyz vdsm[2875]: WARN File: /var/lib/libvirt/qemu/channels/d8dfd596-1e87-4e98-87ff-269edd92bdf1.org.qemu.guest_a… removed
Mar 31 00:43:29 mdskvm-p01.mds.xyz vdsm[2875]: WARN File: /var/lib/libvirt/qemu/channels/d8dfd596-1e87-4e98-87ff-269edd92bdf1.ovirt-guest-agen… removed
Mar 31 00:43:29 mdskvm-p01.mds.xyz vdsm[2875]: WARN File: /var/lib/libvirt/qemu/channels/d8dfd596-1e87-4e98-87ff-269edd92bdf1.org.qemu.guest_a… removed
Hint: Some lines were ellipsized, use -l to show in full.
[root@mdskvm-p01 log]#
[root@mdskvm-p01 log]#
[root@mdskvm-p01 log]# systemctl status vdsmd.service -l
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2018-03-30 23:18:02 EDT; 23h ago
  Process: 2787 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
 Main PID: 2875 (vdsmd)
   CGroup: /system.slice/vdsmd.service
           ├─ 2875 /usr/bin/python2 /usr/share/vdsm/vdsmd
           └─16845 /usr/libexec/ioprocess --read-pipe-fd 51 --write-pipe-fd 50 --max-threads 10 --max-queued-requests 10

Mar 31 00:39:03 mdskvm-p01.mds.xyz vdsm[2875]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=d8dfd596-1e87-4e98-87ff-269edd92bdf1 at 0x7fcd3c0b9950> timeout=30.0, duration=0 at 0x7fcd4001d610>
                                               Traceback (most recent call last):
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
                                                   task()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
                                                   self._callable()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 349, in __call__
                                                   self._execute()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 391, in _execute
                                                   self._vm.updateDriveVolume(drive)
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4209, in updateDriveVolume
                                                   vmDrive.volumeID)
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6119, in _getVolumeSize
                                                   (domainID, volumeID))
                                               StorageUnavailableError: Unable to get volume size for domain 73fa156c-f085-466f-b409-130a9795a667 volume 81186557-9080-42d1-ba6a-633fb8b805e5
Mar 31 00:40:03 mdskvm-p01.mds.xyz vdsm[2875]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=d8dfd596-1e87-4e98-87ff-269edd92bdf1 at 0x7fcd5805cd90> timeout=30.0, duration=0 at 0x7fcd3c0b96d0>
                                               Traceback (most recent call last):
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
                                                   task()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
                                                   self._callable()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 349, in __call__
                                                   self._execute()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 391, in _execute
                                                   self._vm.updateDriveVolume(drive)
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4209, in updateDriveVolume
                                                   vmDrive.volumeID)
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6119, in _getVolumeSize
                                                   (domainID, volumeID))
                                               StorageUnavailableError: Unable to get volume size for domain 73fa156c-f085-466f-b409-130a9795a667 volume 81186557-9080-42d1-ba6a-633fb8b805e5
Mar 31 00:40:23 mdskvm-p01.mds.xyz vdsm[2875]: WARN unhandled close event
Mar 31 00:40:35 mdskvm-p01.mds.xyz fence_ilo[20843]: Unable to connect/login to fencing device
Mar 31 00:40:37 mdskvm-p01.mds.xyz fence_ilo[20889]: Unable to connect/login to fencing device
Mar 31 00:41:03 mdskvm-p01.mds.xyz vdsm[2875]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=d8dfd596-1e87-4e98-87ff-269edd92bdf1 at 0x3adeb90> timeout=30.0, duration=0 at 0x7fcd2009b650>
                                               Traceback (most recent call last):
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
                                                   task()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
                                                   self._callable()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 349, in __call__
                                                   self._execute()
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 391, in _execute
                                                   self._vm.updateDriveVolume(drive)
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4209, in updateDriveVolume
                                                   vmDrive.volumeID)
                                                 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6119, in _getVolumeSize
                                                   (domainID, volumeID))
                                               StorageUnavailableError: Unable to get volume size for domain 73fa156c-f085-466f-b409-130a9795a667 volume 81186557-9080-42d1-ba6a-633fb8b805e5
Mar 31 00:41:57 mdskvm-p01.mds.xyz vdsm[2875]: WARN File: /var/lib/libvirt/qemu/channels/d8dfd596-1e87-4e98-87ff-269edd92bdf1.ovirt-guest-agent.0 already removed
Mar 31 00:41:57 mdskvm-p01.mds.xyz vdsm[2875]: WARN File: /var/lib/libvirt/qemu/channels/d8dfd596-1e87-4e98-87ff-269edd92bdf1.org.qemu.guest_agent.0 already removed
Mar 31 00:43:29 mdskvm-p01.mds.xyz vdsm[2875]: WARN File: /var/lib/libvirt/qemu/channels/d8dfd596-1e87-4e98-87ff-269edd92bdf1.ovirt-guest-agent.0 already removed
Mar 31 00:43:29 mdskvm-p01.mds.xyz vdsm[2875]: WARN File: /var/lib/libvirt/qemu/channels/d8dfd596-1e87-4e98-87ff-269edd92bdf1.org.qemu.guest_agent.0 already removed
[root@mdskvm-p01 log]#
[root@mdskvm-p01 log]#
[root@mdskvm-p01 log]#
[root@mdskvm-p01 log]# systemctl restart vdsmd.service -l
[root@mdskvm-p01 log]# systemctl status vdsmd.service -l
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2018-04-01 00:04:52 EDT; 2s ago
  Process: 22701 ExecStopPost=/usr/libexec/vdsm/vdsmd_init_common.sh --post-stop (code=exited, status=0/SUCCESS)
  Process: 22705 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
 Main PID: 22783 (vdsmd)
   CGroup: /system.slice/vdsmd.service
           └─22783 /usr/bin/python2 /usr/share/vdsm/vdsmd

Apr 01 00:04:50 mdskvm-p01.mds.xyz vdsmd_init_common.sh[22705]: vdsm: Running prepare_transient_repository
Apr 01 00:04:51 mdskvm-p01.mds.xyz vdsmd_init_common.sh[22705]: vdsm: Running syslog_available
Apr 01 00:04:51 mdskvm-p01.mds.xyz vdsmd_init_common.sh[22705]: vdsm: Running nwfilter
Apr 01 00:04:51 mdskvm-p01.mds.xyz vdsmd_init_common.sh[22705]: vdsm: Running dummybr
Apr 01 00:04:52 mdskvm-p01.mds.xyz vdsmd_init_common.sh[22705]: vdsm: Running tune_system
Apr 01 00:04:52 mdskvm-p01.mds.xyz vdsmd_init_common.sh[22705]: vdsm: Running test_space
Apr 01 00:04:52 mdskvm-p01.mds.xyz vdsmd_init_common.sh[22705]: vdsm: Running test_lo
Apr 01 00:04:52 mdskvm-p01.mds.xyz systemd[1]: Started Virtual Desktop Server Manager.
Apr 01 00:04:53 mdskvm-p01.mds.xyz vdsm[22783]: WARN MOM not available.
Apr 01 00:04:53 mdskvm-p01.mds.xyz vdsm[22783]: WARN MOM not available, KSM stats will be missing.
[root@mdskvm-p01 log]#
[root@mdskvm-p01 log]#

XFS metadata corruption shows up ( /var/log/messages ):

Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): Metadata corruption detected at xfs_agi_read_verify+0x5e/0x110 [xfs], xfs_agi block 0xebffc502
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): Unmount and run xfs_repair
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): First 64 bytes of corrupted metadata buffer:
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811e7aa1200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  …………….
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811e7aa1210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  …………….
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811e7aa1220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  …………….
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811e7aa1230: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  …………….
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): metadata I/O error: block 0xebffc502 ("xfs_trans_read_buf_map") error 117 numblks 1
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): Metadata corruption detected at xfs_agi_read_verify+0x5e/0x110 [xfs], xfs_agi block 0xefffc402
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): Unmount and run xfs_repair
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): First 64 bytes of corrupted metadata buffer:
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811e7aa1200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  …………….
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811e7aa1210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  …………….
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811e7aa1220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  …………….
Mar 29 09:37:55 mdskvm-p01 kernel: ovirtmgmt: received packet on bond0 with own address as source address (addr:78:e7:d1:8f:4d:26, vlan:0)
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811e7aa1230: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  …………….
Mar 29 09:37:55 mdskvm-p01 kernel: ovirtmgmt: received packet on bond0 with own address as source address (addr:78:e7:d1:8f:4d:26, vlan:0)
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): metadata I/O error: block 0xefffc402 ("xfs_trans_read_buf_map") error 117 numblks 1
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): Metadata corruption detected at xfs_agi_read_verify+0x5e/0x110 [xfs], xfs_agi block 0xf3ffc302
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): Unmount and run xfs_repair
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): First 64 bytes of corrupted metadata buffer:
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811f8ba2200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  …………….
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811f8ba2210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  …………….
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811f8ba2220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  …………….
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8811f8ba2230: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  …………….
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): metadata I/O error: block 0xf3ffc302 ("xfs_trans_read_buf_map") error 117 numblks 1
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): Metadata corruption detected at xfs_agi_read_verify+0x5e/0x110 [xfs], xfs_agi block 0xf7ffc202
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): Unmount and run xfs_repair
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): First 64 bytes of corrupted metadata buffer:
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8808e5335c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  …………….
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8808e5335c10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  …………….
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8808e5335c20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  …………….
Mar 29 09:37:55 mdskvm-p01 kernel: ffff8808e5335c30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  …………….
Mar 29 09:37:55 mdskvm-p01 kernel: XFS (dm-3): metadata I/O error: block 0xf7ffc202 ("xfs_trans_read_buf_map") error 117 numblks 1

So we fix this after going into runlevel 1 and unmounting the volume. Start with a no-modify check (-n) to gauge the damage:

xfs_repair -n /dev/mdskvmsanvg/mdskvmsanlv 2>&1 | more
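Once the dry run confirms the damage is something xfs_repair can handle, a minimal sketch of the actual repair and remount (the volume must stay unmounted during the repair; the remount relies on your existing fstab entry):

xfs_repair /dev/mdskvmsanvg/mdskvmsanlv      # the real repair: same command without -n, writes fixes to the volume
mount -a                                     # remount everything from fstab once the repair completes cleanly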

Cheers,
TK

missing icmp_seq numbers

In case you see these missing icmp_seq numbers:

[root@mdskvm-p01 ~]# ping 192.168.0.149
PING 192.168.0.149 (192.168.0.149) 56(84) bytes of data.
64 bytes from 192.168.0.149: icmp_seq=1 ttl=64 time=0.536 ms
64 bytes from 192.168.0.149: icmp_seq=3 ttl=64 time=0.240 ms
64 bytes from 192.168.0.149: icmp_seq=7 ttl=64 time=0.330 ms
64 bytes from 192.168.0.149: icmp_seq=11 ttl=64 time=0.331 ms
64 bytes from 192.168.0.149: icmp_seq=15 ttl=64 time=0.353 ms
64 bytes from 192.168.0.149: icmp_seq=19 ttl=64 time=0.271 ms

 

Then you should check the BONDING_OPTS setting in ifcfg-bond0:

BONDING_OPTS='mode=4 miimon=100'

In our case, only 2 of the 4 NIC ports were plugged into the switch, so the mode=4 (802.3ad) setting above tried to do link aggregation using inactive ports. Switching to active-backup resolved it:

BONDING_OPTS='mode=1 miimon=100'

We also tried mode=2 (balance-xor) but that didn't work.  Further reading is available from the Red Hat documentation on bonding modes.
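To confirm what the bond is actually doing before and after the change, the bonding driver exposes its state under /proc. A quick sketch (eth0 is a placeholder for whatever your slave NICs are called):

cat /proc/net/bonding/bond0            # shows the active Bonding Mode and per-slave MII Status
ethtool eth0 | grep "Link detected"    # repeat for each slave; only cabled ports report a link

Only slaves reporting an up link should take part in an 802.3ad (mode=4) aggregate; with half the ports unplugged, active-backup (mode=1) is the safer choice.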

Some of the other common symptoms exhibited by this include:

no route to host
Destination Host Unreachable

 

Cheers,
Tom

 

Extending the size of your mdadm array.

Now that you've replaced all the failed disks with larger ones, we can double the size of the array from 4TB to 8TB.

We start off with this array:

[root@mbpc-pc log]# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Mon Mar 26 00:06:24 2012
     Raid Level : raid6
     Array Size : 3907045632 (3726.05 GiB 4000.81 GB)
  Used Dev Size : 976761408 (931.51 GiB 1000.20 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Thu Mar 29 23:02:24 2018
          State : active
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : mbpc:0
           UUID : 2f36ac48:5e3e4c54:72177c53:bea3e41e
         Events : 1333503

    Number   Major   Minor   RaidDevice State
       8       8       64        0      active sync   /dev/sde
       9       8       32        1      active sync   /dev/sdc
       7       8       16        2      active sync   /dev/sdb
      11       8       48        3      active sync   /dev/sdd
       6       8       80        4      active sync   /dev/sdf
      10       8        0        5      active sync   /dev/sda
[root@mbpc-pc log]#

So let's do this:

[root@mbpc-pc log]#
[root@mbpc-pc log]# mdadm --grow /dev/md0 --size=max
mdadm: component size of /dev/md0 has been set to 1953513536K
unfreeze
[root@mbpc-pc log]# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Mon Mar 26 00:06:24 2012
     Raid Level : raid6
     Array Size : 7814054144 (7452.06 GiB 8001.59 GB)
  Used Dev Size : 1953513536 (1863.02 GiB 2000.40 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Thu Mar 29 23:42:32 2018
          State : active, resyncing
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

  Resync Status : 51% complete

           Name : mbpc:0
           UUID : 2f36ac48:5e3e4c54:72177c53:bea3e41e
         Events : 1333507

    Number   Major   Minor   RaidDevice State
       8       8       64        0      active sync   /dev/sde
       9       8       32        1      active sync   /dev/sdc
       7       8       16        2      active sync   /dev/sdb
      11       8       48        3      active sync   /dev/sdd
       6       8       80        4      active sync   /dev/sdf
      10       8        0        5      active sync   /dev/sda
[root@mbpc-pc log]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdd[11] sde[8] sdc[9] sdb[7] sdf[6] sda[10]
      7814054144 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/6] [UUUUUU]
      [==========>..........]  resync = 51.5% (1007560660/1953513536) finish=373.8min speed=42168K/sec
      bitmap: 7/8 pages [28KB], 131072KB chunk

unused devices: <none>
[root@mbpc-pc log]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdd[11] sde[8] sdc[9] sdb[7] sdf[6] sda[10]
      7814054144 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/6] [UUUUUU]
      [==========>..........]  resync = 51.5% (1007603712/1953513536) finish=405.9min speed=38830K/sec
      bitmap: 8/8 pages [32KB], 131072KB chunk

unused devices: <none>
[root@mbpc-pc log]#

And now you wait.  Once the resync is done, use the usual LVM commands (pvs, vgs and lvs to inspect, pvresize to grow the PV) to resize those components.

Some reading available here.

Now that you've done that, it's time to resize the LVM physical volume:

[root@mbpc-pc ~]# pvs
  PV         VG          Fmt  Attr PSize   PFree
  /dev/md0   MBPCStorage lvm2 a--    3.64t 931.70g
  /dev/sdg2  mbpcvg      lvm2 a--    1.18t      0
  /dev/sdg4  mbpcvg      lvm2 a--  465.75g 415.75g
[root@mbpc-pc ~]# pvresize /dev/md0
  Physical volume "/dev/md0" changed
  1 physical volume(s) resized / 0 physical volume(s) not resized
[root@mbpc-pc ~]# pvs
  PV         VG          Fmt  Attr PSize   PFree
  /dev/md0   MBPCStorage lvm2 a--    7.28t   4.55t
  /dev/sdg2  mbpcvg      lvm2 a--    1.18t      0
  /dev/sdg4  mbpcvg      lvm2 a--  465.75g 415.75g
[root@mbpc-pc ~]# vgs
  VG          #PV #LV #SN Attr   VSize   VFree
  MBPCStorage   1   1   0 wz--n-   7.28t   4.55t
  mbpcvg        2   3   0 wz--n-   1.64t 415.75g
[root@mbpc-pc ~]# lvs
  LV         VG          Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  MBPCBackup MBPCStorage -wi-ao----   2.73t
  fmlv       mbpcvg      -wi-ao----   1.15t
  rootlv     mbpcvg      -wi-ao----  81.25g
  swaplv     mbpcvg      -wi-ao----   4.00g
[root@mbpc-pc ~]#
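From here the new free space can be handed to a logical volume and its filesystem. A minimal sketch, assuming you want to give all the free space to the MBPCBackup LV shown above and that it carries an XFS filesystem (the /backup mount point is hypothetical; use resize2fs instead for ext4):

lvextend -l +100%FREE /dev/MBPCStorage/MBPCBackup    # grow the LV into the newly freed extents
xfs_growfs /backup                                   # grow the XFS filesystem online at its mount point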

 

And you're set.

Cheers,
TK

pam_reply called with result [4]: System error.

So you're trying to log in and you get these messages on ovirt01 (192.168.0.145) and ipaclient01 (192.168.0.236).  What could be wrong?

(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [ldb] (0x4000): cancel ldb transaction (nesting: 2)
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [sysdb_mod_group_member] (0x0080): ldb_modify failed: [No such object](32)[ldb_wait from ldb_modify with LDB_WAIT_ALL: No such object (32)]
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [sysdb_mod_group_member] (0x0400): Error: 2 (No such file or directory)
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [sysdb_update_members_ex] (0x0020): Could not add member [tom@mds.xyz] to group [name=tom@mds.xyz,cn=groups,cn=mds.xyz,cn=sysdb]. Skipping.

(Thu Mar 22 23:59:26 2018) [[sssd[krb5_child[3246]]]] [k5c_setup_fast] (0x0020): check_fast_ccache failed.
(Thu Mar 22 23:59:26 2018) [[sssd[krb5_child[3246]]]] [k5c_setup_fast] (0x0020): 2618: [-1765328203][Key table entry not found]
(Thu Mar 22 23:59:26 2018) [[sssd[krb5_child[3246]]]] [privileged_krb5_setup] (0x0040): Cannot set up FAST
(Thu Mar 22 23:59:26 2018) [[sssd[krb5_child[3246]]]] [main] (0x0020): privileged_krb5_setup failed.
(Thu Mar 22 23:59:26 2018) [[sssd[krb5_child[3246]]]] [main] (0x0020): krb5_child failed!

(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [read_pipe_handler] (0x0400): EOF received, client finished

(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [parse_krb5_child_response] (0x0020): message too short.
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [krb5_auth_done] (0x0040): The krb5_child process returned an error. Please inspect the krb5_child.log file or the journal for more information
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [krb5_auth_done] (0x0040): Could not parse child response [22]: Invalid argument
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [check_wait_queue] (0x1000): Wait queue for user [tom@mds.xyz] is empty.
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [krb5_auth_queue_done] (0x0040): krb5_auth_recv failed with: 22
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [ipa_pam_auth_handler_krb5_done] (0x0040): KRB5 auth failed [22]: Invalid argument
(Thu Mar 22 23:59:26 2018) [sssd[be[nix.mds.xyz]]] [dp_req_done] (0x0400): DP Request [PAM Preauth #2]: Request handler finished [0]: Success

(Thu Mar 22 23:59:26 2018) [sssd[pam]] [pam_dp_process_reply] (0x0200): received: [4 (System error)][mds.xyz]
(Thu Mar 22 23:59:26 2018) [sssd[pam]] [pam_reply] (0x0200): pam_reply called with result [4]: System error.

More intriguing is that the reverse dig output had two PTR records for one IP and none for the other:

[root@ovirt01 network-scripts]# dig -x 192.168.0.145

; <<>> DiG 9.9.4-RedHat-9.9.4-51.el7_4.2 <<>> -x 192.168.0.145
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47551
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 2, ADDITIONAL: 3

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;145.0.168.192.in-addr.arpa.    IN      PTR

;; ANSWER SECTION:
145.0.168.192.in-addr.arpa. 1200 IN     PTR     ovirt01.nix.mds.xyz.
145.0.168.192.in-addr.arpa. 1200 IN     PTR     ipaclient01.nix.mds.xyz.

;; AUTHORITY SECTION:
0.168.192.in-addr.arpa. 86400   IN      NS      idmipa01.nix.mds.xyz.
0.168.192.in-addr.arpa. 86400   IN      NS      idmipa02.nix.mds.xyz.

;; ADDITIONAL SECTION:
idmipa01.nix.mds.xyz.   1200    IN      A       192.168.0.44
idmipa02.nix.mds.xyz.   1200    IN      A       192.168.0.45

;; Query time: 1 msec
;; SERVER: 192.168.0.44#53(192.168.0.44)
;; WHEN: Fri Mar 23 00:04:25 EDT 2018
;; MSG SIZE  rcvd: 192

[root@ovirt01 network-scripts]#

Whilst the other IP had no PTR records returned:

[root@ovirt01 network-scripts]# dig -x 192.168.0.236

; <<>> DiG 9.9.4-RedHat-9.9.4-51.el7_4.2 <<>> -x 192.168.0.236
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 64699
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;236.0.168.192.in-addr.arpa.    IN      PTR

;; AUTHORITY SECTION:
0.168.192.in-addr.arpa. 3600    IN      SOA     idmipa01.nix.mds.xyz. hostmaster.nix.mds.xyz. 1521778151 3600 900 1209600 3600

;; Query time: 1 msec
;; SERVER: 192.168.0.44#53(192.168.0.44)
;; WHEN: Fri Mar 23 00:27:22 EDT 2018
;; MSG SIZE  rcvd: 122

[root@ovirt01 network-scripts]#

The cause was that I had copied the /etc/sssd/sssd.conf config from one client to the other.  More specifically, I copied the config from ipaclient01 to ovirt01, so both clients ended up claiming the same ipa_hostname:

[root@ipaclient01 ~]# grep -Ei ipa_hostname /etc/sssd/sssd.conf
ipa_hostname = ipaclient01.nix.mds.xyz
[root@ipaclient01 ~]#

[root@ovirt01 network-scripts]# grep -Ei ipa_hostname /etc/sssd/sssd.conf
ipa_hostname = ipaclient01.nix.mds.xyz
[root@ovirt01 network-scripts]#

Changing the above quickly resolved my login issue.
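For reference, a minimal sketch of the fix on ovirt01 (the hostname comes from the PTR record above; sss_cache clears stale cached entries so the change takes effect cleanly):

sed -i 's/^ipa_hostname = .*/ipa_hostname = ovirt01.nix.mds.xyz/' /etc/sssd/sssd.conf
sss_cache -E                    # expire everything in the sssd cache
systemctl restart sssd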

Cheers,
TK

Windows 7 Cannot Resolve hostnames via PING but nslookup works.

It may happen that you can't ping a hostname, whether it resolves against a local DNS server you run or an external one, even though nslookup works.  Flushing the DNS cache may not help either:

C:\Users\tom>ipconfig /flushdns

Windows IP Configuration

Successfully flushed the DNS Resolver Cache.

C:\Users\tom>ping vcsa01
Ping request could not find host vcsa01. Please check the name and try again.

C:\Users\tom>ping vcsa01

One thing you can do is disable IPv6 in Windows for the interface you're using, under Control Panel -> Network and Sharing Center.

If that doesn't work, consider whether you are using OpenVPN.  If the OpenVPN client is up and you're using the VPN to move in and out of your infrastructure, consider turning it off or restarting the DHCP Client service in Windows Services.

Alternately, you may have a third DNS server listed in the IPv4 Advanced Properties panel of your network card properties in Windows.  Review your DNS settings and remove any extra DNS entries that can't resolve the hostnames you are trying to get to.

Cheers,
TK

failed command: READ FPDMA QUEUED FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

So my last Seagate SATA drive in my RAID 6 array died spectacularly, taking out my 4.8.4 kernel and locking up my storage to the point where the only way I could get to it was via the kernel boot parameter init=/bin/bash.  The disk lasted about 5.762 years.


GlusterFS: Configuration and Setup w/ NFS-Ganesha for an HA NFS Cluster (Quick Start Guide)

This is a much shorter version of our earlier troubleshooting article on NFS Ganesha, meant as a quick start guide for those who just want to get this server up and running quickly.  The point of High Availability is that the best-implemented HA solutions never allow any outage to be noticed by the client.  It's not the client's job to put up with the fallout of a failure; it's the sysadmin's job to ensure they never have to.  In this configuration we will use a 3 node Gluster Cluster.  In short, we'll be using the following technologies to set up an HA configuration:

  • GlusterFS
  • NFS Ganesha
  • CentOS 7 
  • HAPROXY
  • keepalived
  • firewalld
  • selinux

Here's a summary configuration for this whole work. Each step below lists the HOST(s) it applies to, the SETTING (commands and config) to apply, and a DESCRIPTION:

nfs01 / nfs02 / nfs03

Create and reserve some IPs for your hosts.  We are using the FreeIPA project to provide DNS and Kerberos functionality here:

192.168.0.80 nfs-c01 (nfs01, nfs02, nfs03)  VIP DNS Entry

192.168.0.131 nfs01
192.168.0.119 nfs02
192.168.0.125 nfs03

Add the hosts to your DNS server for a clean setup. Alternately, add them to /etc/hosts (ugly); a minimal sketch follows.
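If you do go the /etc/hosts route, a minimal sketch (the nix.mine.dom domain is an assumption taken from the configs later in this guide):

192.168.0.80    nfs-c01.nix.mine.dom   nfs-c01
192.168.0.131   nfs01.nix.mine.dom     nfs01
192.168.0.119   nfs02.nix.mine.dom     nfs02
192.168.0.125   nfs03.nix.mine.dom     nfs03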
nfs01 / nfs02 / nfs03

wget https://github.com/nfs-ganesha/nfs-ganesha/archive/V2.6-.0.tar.gz

[root@nfs01 ~]# ganesha.nfsd -v
NFS-Ganesha Release = V2.6.0
nfs-ganesha compiled on Feb 20 2018 at 08:55:23
Release comment = GANESHA file server is 64 bits compliant and supports NFS v3,4.0,4.1 (pNFS) and 9P
Git HEAD = 97867975b2ee69d475876e222c439b1bc9764a78
Git Describe = V2.6-.0-0-g9786797
[root@nfs01 ~]#

DETAILED INSTRUCTIONS:

https://github.com/nfs-ganesha/nfs-ganesha/wiki/Compiling

https://github.com/nfs-ganesha/nfs-ganesha/wiki/GLUSTER
https://github.com/nfs-ganesha/nfs-ganesha/wiki/XFSLUSTRE

PACKAGES:

yum install glusterfs-api-devel.x86_64
yum install xfsprogs-devel.x86_64
yum install xfsprogs.x86_64
xfsdump-3.1.4-1.el7.x86_64
libguestfs-xfs-1.36.3-6.el7_4.3.x86_64
libntirpc-devel-1.5.4-1.el7.x86_64
libntirpc-1.5.4-1.el7.x86_64

libnfsidmap-devel-0.25-17.el7.x86_64
jemalloc-devel-3.6.0-1.el7.x86_64

COMMANDS

git clone https://github.com/nfs-ganesha/nfs-ganesha.git
cd nfs-ganesha;
git checkout V2.6-stable

git submodule update --init --recursive
yum install gcc-c++
yum install cmake

ccmake /root/ganesha/nfs-ganesha/src/
# Press the c, e, c, g keys to create and generate the config and make files.
make
make install

Compile and build NFS Ganesha 2.6.0+ from source.  (At this time the RPM packages did not work.)  Install the listed packages before compiling as well.
nfs01 / nfs02 / nfs03

Add a disk to the VM such as /dev/sdb.

Add a secondary disk for the shared GlusterFS.
nfs01 / nfs02 / nfs03

Create the FS on the new disk and mount it:

mkfs.xfs /dev/sdb
mkdir -p /bricks/0
mount /dev/sdb /bricks/0

yum install centos-release-gluster
systemctl enable glusterd.service
yum -y install glusterfs glusterfs-fuse glusterfs-server glusterfs-api glusterfs-cli


On node01 ONLY: 

gluster volume create gv01 replica 2 nfs01:/bricks/0/gv01 nfs02:/bricks/0/gv01

gluster volume info
gluster volume status

Add subsequent bricks:

(from an existing cluster member) gluster peer probe nfs03
gluster volume add-brick gv01 replica 3 nfs03:/bricks/0/gv01

Mount the storage locally:

systemctl disable autofs
mkdir /n

Example:

[root@nfs01 ~]# mount -t glusterfs nfs01:/gv01 /n
[root@nfs02 ~]# mount -t glusterfs nfs02:/gv01 /n
[root@nfs03 ~]# mount -t glusterfs nfs03:/gv01 /n

Ensure the following options are set on the gluster volume:

[root@nfs01 glusterfs]# gluster volume set gv01 cluster.quorum-type auto
volume set: success
[root@nfs01 glusterfs]# gluster volume set gv01 cluster.server-quorum-type server
volume set: success

Here is an example Gluster volume configuration we used; a short sketch of applying these options follows the list:

cluster.server-quorum-type: server
cluster.quorum-type: auto
server.event-threads: 8
client.event-threads: 8
performance.readdir-ahead: on
performance.write-behind-window-size: 8MB
performance.io-thread-count: 16
performance.cache-size: 1GB
nfs.trusted-sync: on
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
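These are applied one option at a time with gluster volume set; a short sketch using two of the values above, plus a check (gluster volume get assumes a reasonably recent GlusterFS release):

gluster volume set gv01 performance.cache-size 1GB
gluster volume set gv01 server.event-threads 8
gluster volume get gv01 all | grep -Ei "cache-size|event-threads"    # confirm the options took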

 

Configure the GlusterFS filesystem using the commands above.
nfs01 / nfs02 / nfs03

PACKAGES:
yum install haproxy     # ( 1.5.18-6.el7.x86_64 used in this case )

/etc/haproxy/haproxy.cfg

global
    log         127.0.0.1 local2
    stats       socket /var/run/haproxy.sock mode 0600 level admin
    # stats     socket /var/lib/haproxy/stats
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon
    debug

defaults
    mode                    tcp
    log                     global
    option                  dontlognull
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000

frontend nfs-in
    bind nfs-c01:2049
    mode tcp
    option tcplog
    default_backend             nfs-back


backend nfs-back
    balance     roundrobin
    server      nfs01.nix.mine.dom    nfs01.nix.mine.dom:2049 check
    server      nfs02.nix.mine.dom    nfs02.nix.mine.dom:2049 check

    server      nfs03.nix.mine.dom    nfs03.nix.mine.dom:2049 check

listen stats
    bind :9000
    mode http
    stats enable
    stats hide-version
    stats realm Haproxy\ Statistics
    stats uri /haproxy-stats
    stats auth admin:s3cretw0rd

Install and configure HAPROXY.  (A great source helped with this part.)
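Before relying on it, the config can be syntax-checked and the service enabled. A quick sketch (standard CentOS 7 unit name assumed):

haproxy -c -f /etc/haproxy/haproxy.cfg    # prints "Configuration file is valid" on success
systemctl enable haproxy
systemctl restart haproxy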
nfs01 / nfs02 / nfs03

# echo "net.ipv4.ip_nonlocal_bind = 1" >> /etc/sysctl.conf
# echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf
# sysctl -p
net.ipv4.ip_nonlocal_bind = 1
net.ipv4.ip_forward = 1
Turn on kernel parameters.  These allow keepalived below to function properly.
nfs01 / nfs02 / nfs03 

PACKAGES:

yum install keepalived    # ( Used 1.3.5-1.el7.x86_64 in this case )

NFS01:

vrrp_script chk_haproxy {
  script "killall -0 haproxy"           # check the haproxy process
  interval 2                            # every 2 seconds
  weight 2                              # add 2 points if OK
}

vrrp_instance VI_1 {
  interface eth0                        # interface to monitor
  state MASTER                          # MASTER on haproxy1, BACKUP on haproxy2
  virtual_router_id 51
  priority 101                          # 101 on haproxy1, 100 on haproxy2
  virtual_ipaddress {
       192.168.0.80                        # virtual ip address
  }
  track_script {
       chk_haproxy
  }
}

NFS02:

vrrp_script chk_haproxy {
  script "killall -0 haproxy"           # check the haproxy process
  interval 2                            # every 2 seconds
  weight 2                              # add 2 points if OK
}

vrrp_instance VI_1 {
  interface eth0                        # interface to monitor
  state BACKUP                          # MASTER on haproxy1, BACKUP on haproxy2
  virtual_router_id 51
  priority 102                          # 101 on haproxy1, 100 on haproxy2
  virtual_ipaddress {
    192.168.0.80                        # virtual ip address
  }
  track_script {
    chk_haproxy
  }
}

Configure keepalived.  (A great source helped with this part as well.)
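A quick sketch for enabling keepalived and confirming where the VIP landed (eth0 and 192.168.0.80 come from the configs above):

systemctl enable keepalived
systemctl restart keepalived
ip addr show eth0 | grep 192.168.0.80    # the VIP should appear only on the current MASTER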

nfs01 / nfs02 / nfs03

This step can be made quicker by copying the xml definitions from one host to the other if you already have one defined:

/etc/firewalld/zones/dmz.xml
/etc/firewalld/zones/public.xml

Individual setup:

# cat public.bash

firewall-cmd --zone=public --permanent --add-port=2049/tcp
firewall-cmd --zone=public --permanent --add-port=111/tcp
firewall-cmd --zone=public --permanent --add-port=111/udp
firewall-cmd --zone=public --permanent --add-port=24007-24008/tcp
firewall-cmd --zone=public --permanent --add-port=49152/tcp
firewall-cmd --zone=public --permanent --add-port=38465-38469/tcp
firewall-cmd --zone=public --permanent --add-port=4501/tcp
firewall-cmd --zone=public --permanent --add-port=4501/udp
firewall-cmd --zone=public --permanent --add-port=20048/udp
firewall-cmd --zone=public --permanent --add-port=20048/tcp
firewall-cmd --reload

# cat dmz.bash

firewall-cmd --zone=dmz --permanent --add-port=2049/tcp
firewall-cmd --zone=dmz --permanent --add-port=111/tcp
firewall-cmd --zone=dmz --permanent --add-port=111/udp
firewall-cmd --zone=dmz --permanent --add-port=24007-24008/tcp
firewall-cmd --zone=dmz --permanent --add-port=49152/tcp
firewall-cmd --zone=dmz --permanent --add-port=38465-38469/tcp
firewall-cmd --zone=dmz --permanent --add-port=4501/tcp
firewall-cmd --zone=dmz --permanent --add-port=4501/udp
firewall-cmd --zone=dmz --permanent --add-port=20048/tcp
firewall-cmd --zone=dmz --permanent --add-port=20048/udp
firewall-cmd --reload

#

# On Both

firewall-cmd --permanent --direct --add-rule ipv4 filter INPUT 0 -m pkttype --pkt-type multicast -j ACCEPT
firewall-cmd --reload

 

HANDY STUFF:

firewall-cmd --zone=dmz --list-all
firewall-cmd --zone=public --list-all
firewall-cmd --set-log-denied=all
firewall-cmd --permanent --add-service=haproxy
firewall-cmd --list-all
firewall-cmd --runtime-to-permanent

Configure firewalld.  DO NOT disable firewalld.
nfs01 / nfs02 / nfs03

Run any of the following commands, or a combination of them, against the deny entries in /var/log/audit/audit.log that may appear as you stop, start or install the above services:

METHOD 1:
grep AVC /var/log/audit/audit.log | audit2allow -M systemd-allow
semodule -i systemd-allow.pp

METHOD 2:
audit2allow -a
audit2allow -a -M ganesha_<NUM>_port
semodule -i ganesha_<NUM>_port.pp

USEFUL THINGS:

ausearch --interpret
aureport

Configure selinux.  Don't disable it.  This actually makes your host safer and is easy to work with using just these commands.
nfs01 / nfs02 / nfs03

NODE 1:

[root@nfs01 ~]# cat /etc/ganesha/ganesha.conf
###################################################
#
# EXPORT
#
# To function, all that is required is an EXPORT
#
# Define the absolute minimal export
#
###################################################


NFS_Core_Param {
        Bind_addr = 192.168.0.131;
        NFS_Port = 2049;
        MNT_Port = 20048;
        NLM_Port = 38468;
        Rquota_Port = 4501;
}

%include "/etc/ganesha/export.conf"
[root@nfs01 ~]# cat /etc/ganesha/export.conf
EXPORT{
    Export_Id = 1 ;                             # Export ID unique to each export
    Path = "/n";                                # Path of the volume to be exported. Eg: "/test_volume"

    FSAL {
        name = GLUSTER;
        hostname = "nfs01.nix.mine.dom";         # IP of one of the nodes in the trusted pool
        volume = "gv01";                        # Volume name. Eg: "test_volume"
    }

    Access_type = RW;                           # Access permissions
    Squash = No_root_squash;                    # To enable/disable root squashing
    Disable_ACL = FALSE;                        # To enable/disable ACL
    Pseudo = "/n";                              # NFSv4 pseudo path for this export. Eg: "/test_volume_pseudo"
    Protocols = "3","4";                        # NFS protocols supported
    Transports = "UDP","TCP" ;                  # Transport protocols supported
    SecType = "sys";                            # Security flavors supported
}
[root@nfs01 ~]#

NODE 2:

[root@nfs02 ~]# cd /etc/ganesha/
[root@nfs02 ganesha]# cat ganesha.conf
###################################################
#
# EXPORT
#
# To function, all that is required is an EXPORT
#
# Define the absolute minimal export
#
###################################################


NFS_Core_Param {
        Bind_addr=192.168.0.119;
        NFS_Port=2049;
        MNT_Port=20048;
        NLM_Port=38468;
        Rquota_Port=4501;
}

%include "/etc/ganesha/export.conf"
[root@nfs02 ganesha]# cat export.conf
EXPORT{
    Export_Id = 1 ;                             # Export ID unique to each export
    Path = "/n";                                # Path of the volume to be exported. Eg: "/test_volume"

    FSAL {
        name = GLUSTER;
        hostname = "nfs02.nix.mine.dom";         # IP of one of the nodes in the trusted pool
        volume = "gv01";                        # Volume name. Eg: "test_volume"
    }

    Access_type = RW;                           # Access permissions
    Squash = No_root_squash;                    # To enable/disable root squashing
    Disable_ACL = FALSE;                        # To enable/disable ACL
    Pseudo = "/n";                              # NFSv4 pseudo path for this export. Eg: "/test_volume_pseudo"
    Protocols = "3","4";                        # NFS protocols supported
    Transports = "UDP","TCP" ;                  # Transport protocols supported
    SecType = "sys";                            # Security flavors supported
}
[root@nfs02 ganesha]#

 

NODE 3:

[root@nfs03 ~]# cd /etc/ganesha/
[root@nfs03 ganesha]# cat ganesha.conf
###################################################
#
# EXPORT
#
# To function, all that is required is an EXPORT
#
# Define the absolute minimal export
#
###################################################


NFS_Core_Param {
        Bind_addr=192.168.0.125;
        NFS_Port=2049;
        MNT_Port=20048;
        NLM_Port=38468;
        Rquota_Port=4501;
}

%include "/etc/ganesha/export.conf"
[root@nfs03 ganesha]# cat export.conf
EXPORT{
    Export_Id = 1 ;                             # Export ID unique to each export
    Path = "/n";                                # Path of the volume to be exported. Eg: "/test_volume"

    FSAL {
        name = GLUSTER;
        hostname = "nfs03.nix.mine.dom";         # IP of one of the nodes in the trusted pool
        volume = "gv01";                        # Volume name. Eg: "test_volume"
    }

    Access_type = RW;                           # Access permissions
    Squash = No_root_squash;                    # To enable/disable root squashing
    Disable_ACL = FALSE;                        # To enable/disable ACL
    Pseudo = "/n";                              # NFSv4 pseudo path for this export. Eg: "/test_volume_pseudo"
    Protocols = "3","4";                        # NFS protocols supported
    Transports = "UDP","TCP" ;                  # Transport protocols supported
    SecType = "sys";                            # Security flavors supported
}
[root@nfs03 ganesha]#

STARTUP:

systemctl start nfs-ganesha
(Only if you did not extract the startup script) /usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT

 

Configure NFS Ganesha
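Once ganesha is running, a quick way to confirm the export and listeners (a sketch; showmount relies on the NFSv3 MNT service configured via MNT_Port above):

showmount -e nfs01                     # should list the /n export
netstat -pnlt | grep ganesha.nfsd      # 2049, 20048, 38468 and 4501 should be listening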
nfs01 / nfs02 / nfs03

 

[root@nfs01 ~]# cat /etc/fstab|grep -Ei "brick|gv01"
/dev/sdb /bricks/0                              xfs     defaults        0 0
nfs01:/gv01 /n                                  glusterfs defaults      0 0
[root@nfs01 ~]#

[root@nfs01 ~]# mount|grep -Ei "brick|gv01"
/dev/sdb on /bricks/0 type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
nfs01:/gv01 on /n type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
[root@nfs01 ~]#

 

[root@nfs01 ~]# ps -ef|grep -Ei "haproxy|keepalived|ganesha"; netstat -pnlt|grep -Ei "haproxy|ganesha|keepalived"
root      1402     1  0 00:59 ?        00:00:00 /usr/sbin/keepalived -D
root      1403  1402  0 00:59 ?        00:00:00 /usr/sbin/keepalived -D
root      1404  1402  0 00:59 ?        00:00:02 /usr/sbin/keepalived -D
root     13087     1  0 01:02 ?        00:00:00 /usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
haproxy  13088 13087  0 01:02 ?        00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds
haproxy  13089 13088  0 01:02 ?        00:00:01 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds
root     13129     1 15 01:02 ?        00:13:11 /usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     19742 15633  0 02:30 pts/2    00:00:00 grep --color=auto -Ei haproxy|keepalived|ganesha
tcp        0      0 192.168.0.80:2049       0.0.0.0:*               LISTEN      13089/haproxy
tcp6       0      0 192.168.0.131:20048     :::*                    LISTEN      13129/ganesha.nfsd
tcp6       0      0 :::564                  :::*                    LISTEN      13129/ganesha.nfsd
tcp6       0      0 192.168.0.131:4501      :::*                    LISTEN      13129/ganesha.nfsd
tcp6       0      0 192.168.0.131:2049      :::*                    LISTEN      13129/ganesha.nfsd
tcp6       0      0 192.168.0.131:38468     :::*                    LISTEN      13129/ganesha.nfsd
[root@nfs01 ~]#

 

Ensure mounts are done and everything is started up.
nfs01 / nfs02 / nfs03

yumdownloader nfs-ganesha.x86_64
rpm2cpio nfs-ganesha-2.5.5-1.el7.x86_64.rpm | cpio -idmv ./usr/lib/systemd/system/nfs-ganesha-lock.service
rpm2cpio nfs-ganesha-2.5.5-1.el7.x86_64.rpm | cpio -idmv ./usr/lib/systemd/system/nfs-ganesha.service
rpm2cpio nfs-ganesha-2.5.5-1.el7.x86_64.rpm | cpio -idmv ./usr/lib/systemd/system/nfs-ganesha-config.service
rpm2cpio nfs-ganesha-2.5.5-1.el7.x86_64.rpm | cpio -idmv ./usr/libexec/ganesha/nfs-ganesha-config.sh

Copy the above to the same folders under / instead of ./ :

systemctl enable nfs-ganesha.service
systemctl status nfs-ganesha.service

Since you compiled from source you don't have nice startup scripts.  To get your nice startup scripts from an existing ganesha RPM do the following.  Then use systemctl to stop and start nfs-ganesha as you would any other service.
 
ANY

Enable dumps:

gluster volume set gv01 server.statedump-path /var/log/glusterfs/
gluster volume statedump gv01

 

Enable state dumps for issue isolation.
Enable Samba / SMB for Windows File Sharing ( Optional )

Packages:

samba-common-4.7.1-6.el7.noarch
samba-client-libs-4.7.1-6.el7.x86_64
libsmbclient-4.7.1-6.el7.x86_64
samba-libs-4.7.1-6.el7.x86_64
samba-4.7.1-6.el7.x86_64
libsmbclient-devel-4.7.1-6.el7.x86_64
samba-common-libs-4.7.1-6.el7.x86_64
samba-common-tools-4.7.1-6.el7.x86_64
samba-client-4.7.1-6.el7.x86_64

# cat /etc/samba/smb.conf|grep NFS -A 12
[NFS]
        comment = NFS Shared Storage
        path = /n
        valid users = root
        public = no
        writable = yes
        read only = no
        browseable = yes
        guest ok = no
        printable = no
        write list = root tom@mds.xyz tomk@nix.mds.xyz
        directory mask = 0775
        create mask = 664

Start the service after enabling it:

systemctl enable smb
systemctl start smb
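Before restarting smb, the share definition can be sanity-checked with testparm, which ships with the samba packages listed above. A quick sketch:

testparm -s /etc/samba/smb.conf        # parse the config and dump the effective [NFS] share settings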

Samba permissions to access NFS directories, fusefs and allow export.

Likewise for fusefs filesystems:

# setsebool -P samba_share_fusefs on
# getsebool samba_share_fusefs
samba_share_fusefs --> on

 

Likewise, for NFS shares, you'll need the following to allow sharing out of NFS shares:

# setsebool -P samba_share_nfs on
# getsebool samba_share_nfs
samba_share_nfs --> on
#

And some firewalls ports to go along with it:

firewall-cmd --zone=public --permanent --add-port=445/tcp
firewall-cmd --zone=public --permanent --add-port=139/tcp
firewall-cmd --zone=public --permanent --add-port=138/udp
firewall-cmd --zone=public --permanent --add-port=137/udp
firewall-cmd --reload

 

We can also enable SMB / Samba file sharing on the individual cluster hosts and allow visibility to the GlusterFS / NFS-Ganesha storage from Windows.

TESTING

Now let's do some checks on our NFS HA.  Mount the share using the VIP from a client then create a test file:

[root@ipaclient01 /]# mount -t nfs4 nfs-c01:/n /n
[root@ipaclient01 n]# echo -ne "Hacked It.  Gluster, NFS Ganesha, HAPROXY, keepalived scalable NFS server." > some-people-find-this-awesome.txt

[root@ipaclient01 n]# mount|grep nfs4
nfs-c01:/n on /n type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.236,local_lock=none,addr=192.168.0.80)
[root@ipaclient01 n]#

 

Then check each brick to see if the file was replicated:

[root@nfs01 n]# cat /bricks/0/gv01/some-people-find-this-awesome.txt
Hacked It.  Gluster, NFS Ganesha, HAPROXY, keepalived scalable NFS server.
[root@nfs01 n]# mount|grep -Ei gv01
nfs01:/gv01 on /n type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
[root@nfs01 n]#

[root@nfs02 n]# cat /bricks/0/gv01/some-people-find-this-awesome.txt
Hacked It.  Gluster, NFS Ganesha, HAPROXY, keepalived scalable NFS server.
[root@nfs02 n]# mount|grep -Ei gv01
nfs02:/gv01 on /n type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
[root@nfs02 n]#

Good!  Now let's hard-shutdown one node, nfs01, the primary node.  Expected behaviour is that we see a failover to nfs02, and then when we bring the nfs01 server back, we see the file is still replicated.  While we do this, the client ipaclient01 is not supposed to lose any connection to the NFS mount via the VIP.  Here are the results:

[root@nfs02 n]# ps -ef|grep -Ei "haproxy|ganesha|keepalived"
root     12245     1  0 Feb19 ?        00:00:03 /usr/sbin/keepalived -D
root     12246 12245  0 Feb19 ?        00:00:03 /usr/sbin/keepalived -D
root     12247 12245  0 Feb19 ?        00:00:41 /usr/sbin/keepalived -D
root     12409     1 16 Feb20 ?        00:13:05 /usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     17892     1  0 00:37 ?        00:00:00 /usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
haproxy  17893 17892  0 00:37 ?        00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds
haproxy  17894 17893  0 00:37 ?        00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds
root     17918 21084  0 00:38 pts/0    00:00:00 grep --color=auto -Ei haproxy|ganesha|keepalived
[root@nfs02 n]# ps -ef|grep -Ei "haproxy|ganesha|keepalived"; netstat -pnlt|grep -Ei ganesha; netstat -pnlt|grep -Ei haproxy; netstat -pnlt|grep -Ei keepalived
root     12245     1  0 Feb19 ?        00:00:03 /usr/sbin/keepalived -D
root     12246 12245  0 Feb19 ?        00:00:03 /usr/sbin/keepalived -D
root     12247 12245  0 Feb19 ?        00:00:41 /usr/sbin/keepalived -D
root     12409     1 16 Feb20 ?        00:13:09 /usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     17892     1  0 00:37 ?        00:00:00 /usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
haproxy  17893 17892  0 00:37 ?        00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds
haproxy  17894 17893  0 00:37 ?        00:00:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds
root     17947 21084  0 00:38 pts/0    00:00:00 grep --color=auto -Ei haproxy|ganesha|keepalived
tcp6       0      0 192.168.0.119:20048     :::*                    LISTEN      12409/ganesha.nfsd
tcp6       0      0 :::564                  :::*                    LISTEN      12409/ganesha.nfsd
tcp6       0      0 192.168.0.119:4501      :::*                    LISTEN      12409/ganesha.nfsd
tcp6       0      0 192.168.0.119:2049      :::*                    LISTEN      12409/ganesha.nfsd
tcp6       0      0 192.168.0.119:38468     :::*                    LISTEN      12409/ganesha.nfsd
tcp        0      0 192.168.0.80:2049       0.0.0.0:*               LISTEN      17894/haproxy
[root@nfs02 n]#
[root@nfs02 n]#
[root@nfs02 n]#
[root@nfs02 n]# ssh nfs-c01
Password:
Last login: Wed Feb 21 00:37:28 2018 from nfs-c01.nix.mine.dom
[root@nfs02 ~]# logout
Connection to nfs-c01 closed.
[root@nfs02 n]#

From the client we can still see all the files (seamless, with no interruption to the NFS service).  As a bonus, when we started this first test we noticed that HAPROXY was offline on nfs02.  While trying to list the files from the client, the mount appeared hung, but it still responded and listed the files right after we started HAPROXY on nfs02:

[root@ipaclient01 n]# ls -altri some-people-find-this-awesome.txt
11782527620043058273 -rw-r--r--. 1 nobody nobody 74 Feb 21 00:26 some-people-find-this-awesome.txt
[root@ipaclient01 n]# df -h .
Filesystem      Size  Used Avail Use% Mounted on
nfs-c01:/n      128G   43M  128G   1% /n
[root@ipaclient01 n]# ssh nfs-c01
Password:
Last login: Wed Feb 21 00:41:06 2018 from nfs-c01.nix.mine.dom
[root@nfs02 ~]#

Checking the gluster volume on nfs02:

[root@nfs02 n]# gluster volume status
Status of volume: gv01
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick nfs02:/bricks/0/gv01                  49152     0          Y       16103
Self-heal Daemon on localhost               N/A       N/A        Y       16094

Task Status of Volume gv01
------------------------------------------------------------------------------
There are no active volume tasks

[root@nfs02 n]#

Now let's bring back the first node, then fail the second after nfs01 is up again.  As soon as we bring nfs01 back up, the VIP fails over to nfs01 without any hiccup or manual intervention on the client end:

[root@ipaclient01 n]# ls -altri
total 11
                 128 dr-xr-xr-x. 21 root   root   4096 Feb 18 22:24 ..
11782527620043058273 -rw-r--r--.  1 nobody nobody   74 Feb 21 00:26 some-people-find-this-awesome.txt
                   1 drwxr-xr-x.  3 nobody nobody 4096 Feb 21 00:26 .
[root@ipaclient01 n]#
[root@ipaclient01 n]#
[root@ipaclient01 n]#
[root@ipaclient01 n]# ssh nfs-c01
Password:
Last login: Wed Feb 21 00:59:56 2018
[root@nfs01 ~]#

So now let's fail the second node.  NFS still works:

[root@ipaclient01 ~]# ssh nfs-c01
Password:
Last login: Wed Feb 21 01:31:50 2018
[root@nfs01 ~]# logout
Connection to nfs-c01 closed.
[root@ipaclient01 ~]# cd /n
[root@ipaclient01 n]# ls -altri some-people-find-this-awesome.txt
11782527620043058273 -rw-r--r--. 1 nobody nobody 74 Feb 21 00:26 some-people-find-this-awesome.txt
[root@ipaclient01 n]# df -h .
Filesystem      Size  Used Avail Use% Mounted on
nfs-c01:/n      128G   43M  128G   1% /n
[root@ipaclient01 n]#

So we bring the second node back up.  And that concludes the configuration!  All works like a charm!

You can also check out our guest post for the same on loadbalancer.org!

Good Luck!

Cheers,
Tom K.

Cannot find key for kvno in keytab

If you are getting this:

krb5_child.log:(Tue Mar  6 23:18:46 2018) [[sssd[krb5_child[3193]]]] [map_krb5_error] (0x0020): 1655: [-1765328340][Cannot find key for nfs/nfs01.nix.my.dom@NIX.my.dom kvno 6 in keytab]

Then you can resolve it by copying the old keytab file back (or removing the incorrect entries using ktutil).  In our case we had made a saved copy and re-added the NFS principals to the keytab file.  You can list out the current principals in the keytab file using:

klist -kte /etc/krb5.keytab

This was followed up by re-adding the missing keytab keys from the IPA server:

ipa-getkeytab -s idmipa01.nix.my.dom -p nfs/nfs-c01.nix.my.dom -k /etc/krb5.keytab
ipa-getkeytab -s idmipa01.nix.my.dom -p nfs/nfs01.nix.my.dom -k /etc/krb5.keytab

Alternately, create the keytab entries manually using ktutil above.
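To verify the refreshed keytab actually works, a short sketch (the principal name is taken from the error above):

klist -kte /etc/krb5.keytab                        # confirm the new kvno entries are present
kinit -kt /etc/krb5.keytab nfs/nfs01.nix.my.dom    # should obtain a ticket without prompting for a password
klist                                              # show the ticket we just got
kdestroy                                           # clean up the test ticket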

Cheers,
Tom

 

Name resolution for the name timed out after none of the configured DNS servers responded.

You're getting this: 

Name resolution for the name <URL> timed out after none of the configured DNS servers responded.

One of the resolutions is to adjust a few network parameters: 

netsh interface tcp set global rss=disabled
netsh interface tcp set global autotuninglevel=disabled
netsh int ip set global taskoffload=disabled

Then set these registry options: 

regedit: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
 
EnableTCPChimney=dword:00000000
EnableTCPA=dword:00000000
EnableRSS=dword:00000000

Cheers,
Tom K.

Ping request could not find host HOST. Please check the name and try again.

ping cannot find a host but nslookup on the same host works just fine:

Ping request could not find host HOST. Please check the name and try again.

Restart the DNS Client Service in Windows Services to resolve this one.  A few other commands to try:  

ipconfig /flushdns
ipconfig /registerdns

Following this, check Event Viewer to see why it stopped working to begin with.  The service is started using:

C:\Windows\system32\svchost.exe -k NetworkService

Alternately, stopping the caching daemon also works.

Cheers,
Tom K


     