

GlusterFS: Configuration and Setup w/ NFS-Ganesha for an HA NFS Cluster

In this post we will go over how to set up a highly available NFS Cluster using:

  • GlusterFS
  • NFS Ganesha
  • CentOS 7 
  • HAPROXY
  • keepalived
  • firewalld
  • selinux

This post is lengthy and covers quite a few details on the way to configuring this setup.  We document virtually every step, including how to build out a GlusterFS filesystem in either physical or virtual environments.  For those interested in a quick setup, please skip to the SUMMARY or TESTING sections at the bottom for a summary of the commands and configuration files used.  If you run into problems, just search the page for the issue you have, as it's likely listed, and read the solution given.


Replication bind with GSSAPI auth failed: LDAP error 49 (Invalid credentials) ()

FreeIPA replication fails for about 13 minutes, with no activity logged on the first IDM server.  It was not clear why at first.

Feb 12 10:06:56 idmipa01 named-pkcs11[2529]: zone nix.mds.xyz/IN: sending notifies (serial 1518448016)
Feb 12 10:07:06 idmipa01 named-pkcs11[2529]: error (chase DS servers) resolving 'mds.xyz/DS/IN': 192.168.0.224#53
Feb 12 10:07:14 idmipa01 ns-slapd: [12/Feb/2018:10:07:14.130840773 -0500] – ERR – NSMMReplicationPlugin – bind_and_check_pwp – agmt="cn=meToidmipa02.nix.mds.xyz" (idmipa02:389) – Replication bind with GSSAPI auth failed: LDAP error 49 (Invalid credentials) ()
Feb 12 10:20:01 idmipa01 systemd: Created slice user-0.slice.
Feb 12 10:20:01 idmipa01 systemd: Starting user-0.slice.

The problem was again with NTP and time/date settings.

[root@idmipa02 log]# date
Wed Feb 14 00:05:58 EST 2018
[root@idmipa02 log]#

 

[root@idmipa01 log]# date
Wed Feb 14 00:00:14 EST 2018
You have new mail in /var/spool/mail/root
[root@idmipa01 log]#
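
To put a number on the drift, the two clocks can be compared as epoch seconds.  The sketch below is illustrative only: clock_skew and skew_ok are hypothetical helper names, the 300 second limit is the usual Kerberos default clock skew tolerance, and the two epoch values are simply the date outputs above converted to seconds.  The two outputs were captured a few moments apart, so treat the result as approximate.

```shell
#!/bin/sh
# Hypothetical helpers, not part of FreeIPA or ntp.

# Print the absolute difference between two epoch timestamps, in seconds.
clock_skew() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# Succeed only if the skew is within the limit (default 300 seconds).
skew_ok() {
    [ "$(clock_skew "$1" "$2")" -le "${3:-300}" ]
}

# The two date outputs above as epoch seconds (EST is UTC-5).
idmipa02=1518584758   # Wed Feb 14 00:05:58 EST 2018
idmipa01=1518584414   # Wed Feb 14 00:00:14 EST 2018

echo "skew: $(clock_skew "$idmipa02" "$idmipa01") seconds"
if skew_ok "$idmipa02" "$idmipa01"; then
    echo "within tolerance"
else
    echo "too far apart for Kerberos"
fi
```

On live hosts the two values would come from running date +%s on each machine at the same moment.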

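That is a difference of over 5 minutes, which is more than the 5 minute clock skew Kerberos tolerates by default, so the GSSAPI bind is rejected.  Checking further, we see the following in the logs: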

Feb 12 10:13:00 idmipa02 rc.local: Error resolving ca.pool.ntp.org: Name or service not known (-2)
Feb 12 10:13:00 idmipa02 rc.local: 12 Feb 10:13:00 ntpdate[963]: Can't find host ca.pool.ntp.org: Name or service not known (-2)
Feb 12 10:13:00 idmipa02 rc.local: 12 Feb 10:13:00 ntpdate[963]: no servers can be used, exiting

So we need to keep the time on the two masters in sync, otherwise this replication issue will recur.  That in turn means our NTP servers must be resolvable, so we may need extra fallbacks in our NTP setup.  We currently have:

[root@idmipa01 log]# cat /etc/rc.local |grep -Evi "#"

touch /var/lock/subsys/local
ntpdate -u ca.pool.ntp.org;
[root@idmipa01 log]#

But we should fall back to fixed IPs in case name resolution fails (we are using NLB on our AD DC servers and noted a failure on that host earlier, which we have just fixed):

[root@idmipa01 log]# cat /etc/rc.local |grep -Evi "#"

touch /var/lock/subsys/local
ntpdate -u ca.pool.ntp.org || ntpdate -u 206.108.0.132 || ntpdate -u 159.203.8.72;

[root@idmipa01 log]#
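
The same fallback chain can be written as a loop, which is easier to extend when more servers are added.  This is only a sketch: sync_first_available is a hypothetical helper name, and the sync command is passed in as an argument so the function can be tried out without touching the clock.

```shell
#!/bin/sh
# Hypothetical helper: try each server in turn with the given sync command
# and print the first server that succeeds. Returns non-zero if none do.
sync_first_available() {
    sync_cmd=$1
    shift
    for server in "$@"; do
        # $sync_cmd is left unquoted on purpose so that "ntpdate -u"
        # splits into the command and its option.
        if $sync_cmd "$server" >/dev/null 2>&1; then
            echo "$server"
            return 0
        fi
    done
    return 1
}

# Example call for rc.local (needs root and the ntpdate package):
#   sync_first_available "ntpdate -u" ca.pool.ntp.org 206.108.0.132 159.203.8.72
```

Printing the server that worked makes it obvious from the boot log whether the name or one of the fixed IPs was used.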

This gives us some safety in case the name can't be resolved due to DNS issues.  We will also reconfigure our NTP servers as follows:

[root@idmipa02 log]# grep -Evi "#" /etc/ntp.conf | sed -e "/^$/d"
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
fudge   127.127.1.0 stratum 10
restrict 192.168.0.0 mask 255.255.255.0 nomodify notrap
restrict 127.0.0.1
restrict ::1
driftfile /var/lib/ntp/ntp.drift
logfile /var/log/ntp.log
server 0.ca.pool.ntp.org prefer
server 1.ca.pool.ntp.org
server 2.ca.pool.ntp.org
server 3.ca.pool.ntp.org

server 198.50.139.209

server 207.210.46.249
includefile /etc/ntp/crypto/pw
keys /etc/ntp/keys
disable monitor
[root@idmipa02 log]#

and

[root@idmipa01 log]# grep -Evi "#" /etc/ntp.conf|sed -e "/^$/d"
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
fudge   127.127.1.0 stratum 10
restrict 192.168.0.0 mask 255.255.255.0 nomodify notrap
restrict 127.0.0.1
restrict ::1
driftfile /var/lib/ntp/ntp.drift
logfile /var/log/ntp.log

server 207.210.46.249
server 198.50.139.209
server 0.ca.pool.ntp.org
server 1.ca.pool.ntp.org
server 2.ca.pool.ntp.org
server 3.ca.pool.ntp.org prefer
includefile /etc/ntp/crypto/pw
keys /etc/ntp/keys
disable monitor
[root@idmipa01 log]#

Notice that the preferred NTP server is different on each of our two NTP servers.  We're attempting to prevent a scenario where the same external NTP server is polled by both of our servers simultaneously.  There is no clear evidence that this causes an issue, but setting a different preferred server on each host avoids it, just in case.  We also add 2 IPs from one of the domains above so that DNS errors cannot leave us without a time source.  The difference between the two hosts is significant:

[root@idmipa02 log]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*LOCAL(0)        .LOCL.          10 l    4   64    1    0.000    0.000   0.000
 k8s-w04.tblflp. 152.2.133.55     2 u    3   64    1   21.943  906.098   0.000
 echo.baxterit.n 213.251.128.249  2 u    2   64    1   39.255  908.220   0.000
 k8s-w01.tblflp. 152.2.133.55     2 u    1   64    1   18.415  903.549   0.000
 portal.switch.c 213.251.128.249  2 u    -   64    1   16.560  901.799   0.000
 mirror3.rafal.c .INIT.          16 u    -   64    0    0.000    0.000   0.000
 198.50.139.209  .INIT.          16 u    -   64    0    0.000    0.000   0.000
[root@idmipa02 log]#

[root@idmipa01 log]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*LOCAL(0)        .LOCL.          10 l   34   64    1    0.000    0.000   0.000
 198.50.139.209  35.73.197.144    2 u   33   64    1   19.071  -84.149   0.000
 mirror3.rafal.c 53.27.192.223    2 u   32   64    1   18.490  -56.439   0.000
 ns522433.ip-158 18.26.4.105      2 u   31   64    1   17.833  -80.900   0.000
 echo.baxterit.n 213.251.128.249  2 u   30   64    1   16.688  -82.694   0.000
 209.115.181.102 206.108.0.133    2 u   29   64    1   72.834  -82.194   0.000
 mongrel.ahem.ca .INIT.          16 u    -   64    0    0.000    0.000   0.000
[root@idmipa01 log]#
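
In both listings some peers still show a reach of 0, meaning no reply has been received from them yet.  As a rough sketch, they can be picked out of the ntpq -p output with awk.  unreachable_peers is a hypothetical helper name, and the column positions assume the standard ntpq -p layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `ntpq -p` output on stdin and print the name of
# every peer whose reach register is 0 (no reply received yet).
unreachable_peers() {
    awk 'NR > 2 {
        # Column 1 of each line is the tally code (*, +, space, ...),
        # so drop it before splitting the line into fields.
        split(substr($0, 2), f, " ")
        if (f[7] == "0") print f[1]
    }'
}

# Example call:
#   ntpq -p | unreachable_peers
```

A reach of 1, as on the other peers above, only says that the most recent poll was answered, which is what you would expect shortly after the daemon starts.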

Good Luck!

Cheers,
TK

Getting asked for password when using host shortname with kerberos delegation

When trying to ssh into a host using the server's short name, you get challenged for a password.  Check the following settings:

  • First item to set is the following:

dns_canonicalize_hostname = true

in /etc/krb5.conf.  This should stop the password prompt when using the short name.  Using the server's FQDN works without issues either way.

  • Second, ensure your sshd_config contains the following lines (this may or may not work, as I haven't tested all the configuration options):

KerberosAuthentication yes
ChallengeResponseAuthentication yes

  • The other important item to check is that /etc/resolv.conf and the ifcfg-eth0 interface file are properly configured.  After configuring the items above, this is what finally got passwordless single sign-on to work (DOMAIN is reported to work on certain Linux versions and SEARCH on others, so it doesn't hurt to set both.  In either case the order is important:  mds.xyz before nix.mds.xyz):

[root@cm-r01en02 ssh]# cat /etc/resolv.conf
; generated by /usr/sbin/dhclient-script
nameserver 192.168.0.44
nameserver 192.168.0.45
search mds.xyz nix.mds.xyz

[root@cm-r01en02 ssh]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
TYPE=Ethernet
NAME=eth0
BOOTPROTO=static
PEERDNS=no
UUID=62904293-0bde-4ea9-b4a1-6a65191663f3
ONBOOT=yes
IPADDR=192.168.0.133
NETMASK="255.255.255.0"
GATEWAY="192.168.0.1"
USERCTL=no
NM_CONTROLLED=no
HOSTNAME=cm-r01en02.nix.mds.xyz
DOMAIN="mds.xyz nix.mds.xyz"
SEARCH="mds.xyz nix.mds.xyz"
DNS1=192.168.0.44
DNS2=192.168.0.45
DNS3=192.168.0.224

[root@cm-r01en02 ssh]#
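
To see why the order of the search list matters, it helps to list the fully qualified names the resolver will try for a short host name.  The sketch below assumes the usual resolver behaviour for a name without dots (search domains are tried in the order given, and the last search line in the file wins).  search_candidates is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the FQDN candidates the resolver would try for
# a short host name, in order, based on the last "search" line of a
# resolv.conf-style file.
search_candidates() {
    host=$1
    conf=${2:-/etc/resolv.conf}
    awk -v host="$host" '
        $1 == "search" { n = split($0, d, " ") }   # last search line wins
        END { for (i = 2; i <= n; i++) print host "." d[i] }
    ' "$conf"
}

# Example call:
#   search_candidates ipaclient01
```

With the resolv.conf above, the mds.xyz candidate is printed first and the nix.mds.xyz candidate second.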

My entire sshd_config file had the following set:

AcceptEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
AcceptEnv LC_IDENTIFICATION LC_ALL LANGUAGE
AcceptEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
AcceptEnv XMODIFIERS
AuthorizedKeysCommandUser nobody
AuthorizedKeysCommand /usr/bin/sss_ssh_authorizedkeys
AuthorizedKeysFile      .ssh/authorized_keys
ChallengeResponseAuthentication no
GSSAPIAuthentication yes
GSSAPICleanupCredentials no
HostKey /etc/ssh/ssh_host_ecdsa_key
HostKey /etc/ssh/ssh_host_ed25519_key
HostKey /etc/ssh/ssh_host_rsa_key
KerberosAuthentication yes
PasswordAuthentication yes
PubkeyAuthentication yes
Subsystem       sftp    /usr/libexec/openssh/sftp-server
SyslogFacility AUTHPRIV
UsePAM yes
X11Forwarding yes

Note that PEERDNS is set to no.  This is important, or your /etc/resolv.conf will be overwritten on reboot or network restart.  If you can't set it to no for some reason, set the immutable bit on /etc/resolv.conf using chattr +i /etc/resolv.conf (chattr -i removes it again).

Still doesn't work?  You just might need a little bit of patience now:

-sh-4.2$ ssh ipaclient01 -vvvv
debug1: Unspecified GSS failure.  Minor code may provide more information
Clock skew too great

debug3: send packet: type 50

This means your NTP daemon hasn't synced the clock yet.  Give it some time, then try again.

Good luck!

Cheers,
TK

 

kinit: Cannot find KDC for realm while getting initial credentials

The problem is that you need

dns_lookup_kdc = true

in the [libdefaults] section of your /etc/krb5.conf file:

[root@mysql01 ~]# kinit tom@mds.xyz
kinit: Cannot find KDC for realm "mds.xyz" while getting initial credentials
[root@mysql01 ~]#
[root@mysql01 ~]# vi /etc/krb5.conf
[root@mysql01 ~]# systemctl restart sssd
[root@mysql01 ~]# kinit tom@mds.xyz
Password for tom@mds.xyz:
[root@mysql01 ~]#
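
Before restarting anything, it can be worth confirming that the setting really sits under [libdefaults] and not in another section.  A minimal sketch follows, assuming the plain key = value layout of krb5.conf.  krb5_libdefault is a hypothetical helper name, and include directives are not followed.

```shell
#!/bin/sh
# Hypothetical helper: print the value of KEY from the [libdefaults] section
# of a krb5.conf-style file. Exits non-zero if the key is not set there.
krb5_libdefault() {
    key=$1
    conf=${2:-/etc/krb5.conf}
    awk -v key="$key" '
        # A new [section] header switches the section flag on or off.
        /^[ \t]*\[/ { in_section = ($0 ~ /^[ \t]*\[libdefaults\]/); next }
        /^[ \t]*[#;]/ { next }                  # skip comment lines
        in_section {
            eq = index($0, "=")
            if (eq == 0) next
            k = substr($0, 1, eq - 1)
            v = substr($0, eq + 1)
            gsub(/[ \t]/, "", k)                # strip blanks around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)      # trim the value
            if (k == key) { value = v; found = 1 }
        }
        END { if (found) print value; else exit 1 }
    ' "$conf"
}

# Example call:
#   krb5_libdefault dns_lookup_kdc
```

If the call prints nothing and returns non-zero, the key is missing from [libdefaults] and kinit falls back to the library default.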

Cheers,
TK

 

8524 The DSA operation is unable to proceed because of a DNS lookup failure.

Reason for the below failure:

The Active Directory Domain Services Installation Wizard (Dcpromo) was unable to establish connection with the following domain controller. 

 
Domain controller:
winad01.mds.xyz 
 
Additional Data 
Error value:
8524 The DSA operation is unable to proceed because of a DNS lookup failure.

and the subsequent failure promoting a server to an Active Directory Domain Controller was due to the two NICs on each host having DNS settings other than 127.0.0.1.  Two NICs were present on each host, one for the LAN and the other for NLB.  Once fixed, AD DC promotion got further but still failed.

This ended up being a DNS issue between the two AD DCs.  The first AD DC also runs a DNS server, so it has to list itself as its DNS server: enter the first DNS server's own IP into the DNS 1 field and the router's IP (usually 192.168.0.1) into the DNS 2 field.

Likewise for the second DNS / AD DC server: enter its own IP into the NIC's DNS 1 field.  DNS 2 should be the main router, 192.168.0.1.

DNS / AD DC 1:
IP: 192.168.0.123
DNS 1: 192.168.0.123
DNS 2: 192.168.0.1

DNS / AD DC 2:
IP: 192.168.0.124
DNS 1: 192.168.0.124
DNS 2: 192.168.0.1

Cheers,
TK

The Directory Server detected that the database has been replaced.

The full error is as follows:

The Directory Server detected that the database has been replaced.  This is an unsafe and unsupported operation. The service will stop until the problem is corrected.
 
 User Action:
 Restore the previous copy of the database that was in use on this machine.
 In the future, the user is strongly encouraged to use the backup and restore facility to rollback the database.
 
 This error can be suppressed and the database repaired by removing the following registry key.
 
 
Additional Data 
Registry key:
System\CurrentControlSet\Services\NTDS\Parameters 
Registry value:
DSA Database Epoch

This ended up being a DNS issue between the two AD DCs.  The first AD DC also runs a DNS server, so it has to list itself as its DNS server: enter the first DNS server's own IP into the DNS 1 field and the router's IP (usually 192.168.0.1) into the DNS 2 field.

Likewise for the second DNS / AD DC server: enter its own IP into the NIC's DNS 1 field.  DNS 2 should be the main router, 192.168.0.1.

Cheers,
TK

The Local Security Authority cannot be contacted.

You are likely getting this error because you have the following checked off:

Allow connections only from computers running Remote Desktop with Network Level Authentication (recommended)

Uncheck this setting and you will be able to log back in using RDP.

We got this error trying to add a secondary AD DC to a cluster.  After some troubleshooting, we removed this setting so we could log in and complete the AD DC cluster.

Unless your cluster is part of an AD DC, you will not be able to log in with your AD credentials until you add it to a domain.

Cheers,
TK

The Local Security Authority cannot be contacted

Move the windows server into a dummy workgroup then back into the domain it was originally on to resolve:

The Local Security Authority cannot be contacted

However you may get:

The following error occurred attempting to join the domain "abc.123":

The request is not supported.

Checking the logs we see this:

%systemroot%\debug\netsetup.log

01/23/2018 19:57:02:446 NetpIsTargetImageADC: Determined this is a DC image as RegQueryValueExW loaded Services\NTDS\Parameters\DSA Database file: 0x0
01/23/2018 19:57:02:446 NetpOpenRegistry: The image at C:\Windows\system32\config\SYSTEM is a DC: 0x32.

In this case find out the path of the DSA Database file:

Location: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\NTDS\Parameters\
Field: DSA Database file

then remove it.  Or move it out of the way:

PS C:\windows\ntds> mv ntds.dit ntds.dit-bad
PS C:\windows\ntds> dir


    Directory: C:\windows\ntds


Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---         1/23/2018   2:21 PM       8192 edb.chk
-a---         1/23/2018   2:21 PM   10485760 edb.log
-a---         1/23/2018   2:21 PM   10485760 edbres00001.jrs
-a---         1/23/2018   2:21 PM   10485760 edbres00002.jrs
-a---         1/23/2018   2:21 PM   10485760 edbtmp.log
-a---         1/23/2018   2:21 PM   12599296 ntds.dit-bad
-a---         1/23/2018   2:21 PM    2113536 temp.edb


PS C:\windows\ntds>

Cheers,
TK

Cloning NTFS 1TB disk to 240GB SSD

The key to this is that the entirety of the data HAS to fit into the new space.  If it doesn't, you'll either need to get a bigger disk or start deleting stuff off the old disk.  In my case the data was 120GB, so it easily fits on a 240GB SSD.  Now, if you just want a quick solution, skip to step 21.  If you want the longer Linux attempt, read the steps through first, then follow them from top to bottom.  In my case I could not get the Linux solution to work, though others have.  NOTE: I don't make any guarantees here.  Ensure you make a backup copy of your drive first.  Never work directly with the original if given a choice.
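
The fit check itself is simple arithmetic and worth doing before anything else.  A small sketch, where fits_on_target is a hypothetical helper name and the 10 percent reserve is an arbitrary assumption of mine, not a requirement of any tool:

```shell
#!/bin/sh
# Hypothetical helper: succeed if USED fits on TARGET while leaving a
# percentage of the target free (default 10 percent, an arbitrary choice).
# USED and TARGET must be in the same unit, for example GB or bytes.
fits_on_target() {
    used=$1
    target=$2
    reserve_pct=${3:-10}
    usable=$(( target * (100 - reserve_pct) / 100 ))
    [ "$used" -le "$usable" ]
}

if fits_on_target 120 240; then
    echo "120GB of data fits on a 240GB disk"
else
    echo "free up space or get a bigger disk"
fi
```

The used figure should be the space actually occupied on the old partitions, not the partition sizes.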


1) Create smaller partitions, including the Recovery, System Reserved and main partitions, USING Windows.  Try to mimic the partition definition of your existing disk, except make the primary partitions that hold all the data smaller.  I found the Linux commands don't really create the boot records appropriately, and your system won't boot if you try to use Linux to create these.  YMMV however.  Tools like ntfs-3g help but still fall short.

2) Once you have the partitions defined, add the disk to a Linux system and rsync the various folders of the old disk to the new disk.  To make this safer and less prone to potential errors, I used the ntfs-3g tools to clone the old disk partitions to images, then mounted those images on my Linux system via the loopback device ( mount -o loop ….. ).  In case of issues, I only corrupt the image I've taken, NOT the original drive.

3) Mount the images on Linux.

4) Rsync everything over.  In this case I used something like this for all 3 windows partitions:

rsync -rltDvu --modify-window=1 --progress --delete /mnt/sdm1/ /mnt/sdg1/
rsync -rltDvu --modify-window=1 --progress --delete /mnt/sdm2/ /mnt/sdg2/
rsync -rltDvu --modify-window=1 --progress --delete /mnt/sdm3/ /mnt/sdg3/

5) Wait.

6) Wait some more.

7) Or just let the thing run overnight.

8) Make the original primary partition active.  You can check your windows original disk to see what that was or fdisk -l will tell you. 

9) Try to boot up.

10) You might get this error:

Recovery

Your PC/Device needs to be repaired
The application or operating system couldn't be loaded because a required file is missing or contains errors.
File:\\WINDOWS\system32\winload.exe
Error code: 0xc000000e

You'll need to use recovery tools.  If you don't have any installation media (like a disc or USB device), contact your PC administrator or PC/Device manufacturer.

11) Remount the disk on Linux and check the C:\WINDOWS\ folder.  It might be empty. 

12) You may get this error:

[root@mbpc-pc mnt]# mount /dev/sdg2 /mnt/sdg2
Windows is hibernated, refused to mount.
Failed to mount '/dev/sdg2': Operation not permitted
The NTFS partition is in an unsafe state. Please resume and shutdown
Windows fully (no hibernation or fast restarting), or mount the volume
read-only with the 'ro' mount option.
[root@mbpc-pc mnt]#

13) You may be told this will fix it:

[root@mbpc-pc mnt]# ntfsfix /dev/sdg2
Mounting volume… Windows is hibernated, refused to mount.
FAILED
Attempting to correct errors…
Processing $MFT and $MFTMirr…
Reading $MFT… OK
Reading $MFTMirr… OK
Comparing $MFTMirr to $MFT… OK
Processing of $MFT and $MFTMirr completed successfully.
Setting required flags on partition… OK
Going to empty the journal ($LogFile)… OK
Windows is hibernated, refused to mount.
Remount failed: Operation not permitted
[root@mbpc-pc mnt]#

14) But in reality this is what you might need (CAREFUL: this removes the hibernation file, so you might lose saved data):

remove_hiberfile
              When the NTFS volume is hibernated, a read-write mount is denied and a read-only mount is forced.
              One needs either to resume Windows and shutdown it properly, or use this option which will remove
              the Windows hibernation file. Please note, this means that the saved Windows session will be
              completely lost. Use this option under your own responsibility.

15) You're going to try it.

[root@mbpc-pc mnt]# ntfs-3g -o remove_hiberfile /dev/sdg2 /mnt/sdg2
[root@mbpc-pc mnt]#

16) Happy Moment!

[root@mbpc-pc sdg2]# mount|grep -Ei sdg2
/dev/sdg2 on /mnt/sdg2 type fuseblk (rw,allow_other,blksize=4096)
[root@mbpc-pc sdg2]#

17) Run the rsync again.  This time pipe the output to a log file to analyze.

rsync -rltDvu --modify-window=1 --progress --delete /mnt/sdm2/ /mnt/sdg2/ 2>&1 | tee -a ./what-the-hell.log

18) Analyze the log or check what was missed.

19) But alas, it was not meant to be.  However I tried to add the 3 plugins from the ntfs-3g page, they failed to be detected on this Scientific Linux 6.7 with kernel 4.8.4.  The underlying issue is unsupported reparse points.

20) Fcuk!

21) Ok, we're wasting time now.  Obviously the Linux route isn't fully mature yet and is rough around the edges.  Great project, but not quite there yet.  Searching, I came across AOMEI Backupper.  ( And believe me, it functions better than the name would suggest )

22) Following the instructions, the free version did EVERYTHING, cloning the 1TB disk over to the 240GB SSD.  And yes, it even shrank the partitions to scale.  The only thing you have to do is make sure the data on the disk will fit on a 240GB SSD.  Otherwise all bets are off.

And to think, this process was way easier 10-15 years ago.  :D

Cheers,
TK

ntfs-3g / ntfsclone warning: careful about use of parameters in this manner.

Careful when copying or cloning using the ntfs-3g / ntfsclone command with overwrite.  In the below example /dev/sdg4 is NOT the source.  It is the TARGET.  :(

ntfsclone --overwrite /dev/sdg4 /dev/sdm1

Though no harm was done here, as these were two test volumes, this is the opposite of the way commands like cp, scp and rsync work, where the first entry is the source, the last entry is the target, and the parameters only modify how the command behaves during the operation.

In the case of the ntfsclone command, the overwrite parameter takes a file as its target, which completes the command but reverses the apparent order in a potentially destructive way.  Hence, if you're thinking of the standard <COMMAND> <SOURCE> <TARGET> syntax, this command will not work:

ntfsclone /dev/sdg4 /dev/sdm1

It just prints the help page without indicating what's wrong, though you might expect behaviour similar to other Linux commands.  For new Linux users this might not be a major issue, but for seasoned professionals who are used to the traditional Linux syntax, this can be a nasty gotcha: the TARGET is not the last parameter.

[root@mbpc-pc ntfs-3g_ntfsprogs-2017.3.23]# ntfsclone --overwrite /dev/sdg4 /dev/sdm1
ntfsclone v2017.3.23 (libntfs-3g)
NTFS volume version: 3.1
Cluster size       : 4096 bytes
Current volume size: 240054763520 bytes (240055 MB)
Current device size: 240054764544 bytes (240055 MB)
Scanning volume …
100.00 percent completed
Accounting clusters …
Space in use       : 2346 MB (1.0%)
Cloning NTFS …
100.00 percent completed
Syncing …
[root@mbpc-pc ntfs-3g_ntfsprogs-2017.3.23]#

Though the man page hints at this:

       -O, --overwrite FILE
              Clone NTFS to FILE, which can be an existing partition or a regular file which will be overwritten if it exists.

someone used to the traditional syntax could quickly forget this and overwrite their files.  Hence this version of the command makes more sense, given the function described in the man page:

ntfsclone /dev/sdm1 --overwrite /dev/sdg4
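
One way to guard against the mix-up is a small wrapper that takes the devices in the familiar source-then-target order and only prints the command for review.  This is a sketch, not part of ntfsprogs: ntfsclone_cmd is a hypothetical helper name, and it never runs ntfsclone itself.

```shell
#!/bin/sh
# Hypothetical helper: take SOURCE and TARGET in the familiar cp order and
# print the equivalent ntfsclone command line. It only prints the command,
# it never runs it, so the result can be reviewed before anything is written.
ntfsclone_cmd() {
    source_dev=$1
    target_dev=$2
    if [ -z "$source_dev" ] || [ -z "$target_dev" ]; then
        echo "usage: ntfsclone_cmd SOURCE TARGET" >&2
        return 2
    fi
    if [ "$source_dev" = "$target_dev" ]; then
        echo "source and target must differ" >&2
        return 1
    fi
    echo "ntfsclone --overwrite $target_dev $source_dev"
}

ntfsclone_cmd /dev/sdm1 /dev/sdg4
# prints: ntfsclone --overwrite /dev/sdg4 /dev/sdm1
```

Reading the printed line back with the man page entry for --overwrite in mind makes it clear which device is about to be written to.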

Cheers,
TK


     
  Copyright © 2003 - 2013 Tom Kacperski (microdevsys.com). All rights reserved.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License