Cloudera: SSLError: certificate verify failed
Receiving the following error when enabling SSL certificates on remote Cloudera worker nodes in Azure, AWS, or GCP?
[17/May/2020 13:07:32 +0000] 3332 MainThread agent ERROR Heartbeating to 108.168.115.113:7182 failed.
Traceback (most recent call last):
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/agent.py", line 1387, in _send_heartbeat
self.cfg.max_cert_depth)
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/https.py", line 139, in __init__
self.conn.connect()
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/httpslib.py", line 69, in connect
sock.connect((self.host, self.port))
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 309, in connect
ret = self.connect_ssl()
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 295, in connect_ssl
return m2.ssl_connect(self.ssl, self._timeout)
SSLError: certificate verify failed
Resolve it by ensuring the private key of the new server is correctly added to the keystore used by Cloudera:
[root@cm-r01nn01 .ssh]# keytool -importkeystore -srckeystore /root/cm-awn01.nix.mds.xyz.keystore.jks -destkeystore /etc/cloudera/keystore -srcalias cm-awn01.nix.mds.xyz -deststorepass <PASS> -srcstorepass <SRCPASS> -destalias cm-awn01.nix.mds.xyz
Importing keystore /root/cm-awn01.nix.mds.xyz.keystore.jks to /etc/cloudera/keystore...
Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore /etc/cloudera/keystore -destkeystore /etc/cloudera/keystore -deststoretype pkcs12".
[root@cm-r01nn01 .ssh]# keytool -list -keystore /etc/cloudera/keystore -storepass <PASS> | grep -Ei "cm|srv"
cm-awn01.nix.mds.xyz, May 17, 2020, PrivateKeyEntry,
Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore /etc/cloudera/keystore -destkeystore /etc/cloudera/keystore -deststoretype pkcs12".
[root@cm-r01nn01 .ssh]#
Then retry your connection once more.
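In practice, "retry" here means restarting the agent on the affected host and watching its log for the next heartbeat. A minimal sketch, assuming the standard Cloudera Manager agent service name and log location:

# Assumed standard CM agent service name and log path; adjust if your install differs.
systemctl restart cloudera-scm-agent
tail -f /var/log/cloudera-scm-agent/cloudera-scm-agent.log | grep -iE "heartbeat|error"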
A similar symptom with a different resolution occurs when the client's agent.pem configuration points to the wrong PEM file. Take this example:
[18/May/2020 13:03:51 +0000] 2413 MainThread agent ERROR Heartbeating to srv-c01.mws.mds.xyz:7182 failed.
Traceback (most recent call last):
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/agent.py", line 1387, in _send_heartbeat
self.cfg.max_cert_depth)
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/https.py", line 139, in __init__
self.conn.connect()
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/httpslib.py", line 69, in connect
sock.connect((self.host, self.port))
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 309, in connect
ret = self.connect_ssl()
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 295, in connect_ssl
return m2.ssl_connect(self.ssl, self._timeout)
SSLError: certificate verify failed
The connection is being made to srv-c01.mws.mds.xyz:7182 but the agent.pem points to a different PEM file:
[root@cm-awn01 pki]# ls -altri
total 44
655285 -rw-r--r-- 1 cloudera-scm cloudera-scm 1453 May 12 08:06 cm-r01nn01.mws.mds.xyz.pem
655286 -rw-r--r-- 1 cloudera-scm cloudera-scm 1453 May 12 08:06 cm-r01nn02.mws.mds.xyz.pem
655287 lrwxrwxrwx 1 cloudera-scm cloudera-scm 53 May 12 08:19 agent.pem -> /opt/cloudera/security/pki/cm-r01nn01.mws.mds.xyz.pem
[root@cm-awn01 pki]#
Repoint to the correct PEM file:
[root@cm-awn01 pki]# ls -altri
652341 -rw-r--r-- 1 cloudera-scm cloudera-scm 1505 May 12 08:50 srv-c01.mws.mds.xyz.pem
691030 lrwxrwxrwx 1 root root 23 May 18 13:04 agent.pem -> srv-c01.mws.mds.xyz.pem
691034 drwxr-xr-x 2 root root 4096 May 18 13:04 .
[root@cm-awn01 pki]#
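For reference, a minimal sketch of the repoint itself, assuming the same paths as above and that you run it as root on the worker:

cd /opt/cloudera/security/pki
# Point agent.pem at the PEM that matches the host the agent actually connects to.
ln -sfn srv-c01.mws.mds.xyz.pem agent.pem
# Restart the agent so it picks up the new link (assumed standard service name).
systemctl restart cloudera-scm-agent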
The key is understanding where the connection is being made and which PEM files that connection requires.
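A quick way to see both at once is to pull the TLS-related settings out of the agent configuration. A sketch, assuming the stock key names in /etc/cloudera-scm-agent/config.ini:

# Which CM server the agent heartbeats to, and which certificate files it uses for TLS.
grep -E "^(server_host|server_port|use_tls|verify_cert_file|client_key_file|client_cert_file)=" /etc/cloudera-scm-agent/config.ini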
Another source of this error is when packages are getting downloaded:
[18/May/2020 17:22:57 +0000] 5627 Thread-13 https ERROR Failed to retrieve/store URL: https://cm-r01nn01.mws.mds.xyz:7183/cmf/parcel/download/CDH-6.2.0-1.cdh6.2.0.p0.967373-el7.parcel.torrent -> /opt/cloudera/parcel-cache/CDH-6.2.0-1.cdh6.2.0.p0.967373-el7.parcel.torrent certificate verify failed
Traceback (most recent call last):
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/https.py", line 193, in fetch_to_file
resp = self.open(req_url)
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/https.py", line 188, in open
return self.opener(url, *pargs, **kwargs)
File "/usr/lib64/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/https.py", line 179, in https_open
return self.do_open(opener, req)
File "/usr/lib64/python2.7/urllib2.py", line 1211, in do_open
h.request(req.get_method(), req.get_selector(), req.data, headers)
File "/usr/lib64/python2.7/httplib.py", line 1041, in request
self._send_request(method, url, body, headers)
File "/usr/lib64/python2.7/httplib.py", line 1075, in _send_request
self.endheaders(body)
File "/usr/lib64/python2.7/httplib.py", line 1037, in endheaders
self._send_output(message_body)
File "/usr/lib64/python2.7/httplib.py", line 881, in _send_output
self.send(msg)
File "/usr/lib64/python2.7/httplib.py", line 843, in send
self.connect()
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/httpslib.py", line 69, in connect
sock.connect((self.host, self.port))
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 309, in connect
ret = self.connect_ssl()
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 295, in connect_ssl
return m2.ssl_connect(self.ssl, self._timeout)
SSLError: certificate verify failed
So what do we do with this one? Use the following command to verify the certificate much like the Cloudera agent would:
openssl s_client -connect cm-r01nn01:7183 -CAfile $(grep -v '^#' /etc/cloudera-scm-agent/config.ini | grep "verify_cert_file=" |sed s/verify_cert_file=//) -verify_hostname cm-r01nn01.mws.mds.xyz </dev/null
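Here is the same check split up for readability; a sketch that assumes verify_cert_file is set (and not commented out) in config.ini:

# Extract the CA file the agent is configured to trust, then verify the server with it.
CA_FILE=$(awk -F= '/^verify_cert_file=/ {print $2}' /etc/cloudera-scm-agent/config.ini)
openssl s_client -connect cm-r01nn01:7183 -CAfile "$CA_FILE" -verify_hostname cm-r01nn01.mws.mds.xyz </dev/null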
Run the check on a non-working host:
[root@cm-awn01 pki]# openssl s_client -connect cm-r01nn01:7183 -CAfile $(grep -v '^#' /etc/cloudera-scm-agent/config.ini | grep "verify_cert_file=" |sed s/verify_cert_file=//) -verify_hostname cm-r01nn01.mws.mds.xyz </dev/null
CONNECTED(00000003)
depth=0 C = US, ST = California, L = Los Angeles, O = MDS, OU = MDS, CN = cm-c01.mws.mds.xyz
verify error:num=18:self signed certificate
verify return:1
depth=0 C = US, ST = California, L = Los Angeles, O = MDS, OU = MDS, CN = cm-c01.mws.mds.xyz
verify return:1
.
.
.
Start Time: 1589842560
Timeout : 300 (sec)
Verify return code: 18 (self signed certificate)
---
DONE
and on a working host:
[root@cm-r01wn01 pki]# openssl s_client -connect cm-r01nn01:7183 -CAfile $(grep -v '^#' /etc/cloudera-scm-agent/config.ini | grep "verify_cert_file=" |sed s/verify_cert_file=//) -verify_hostname cm-r01nn01.mws.mds.xyz </dev/null
CONNECTED(00000003)
depth=0 C = US, ST = California, L = Los Angeles, O = MDS, OU = MDS, CN = srv-c01.mws.mds.xyz
verify return:1
.
.
.
Start Time: 1589842676
Timeout : 300 (sec)
Verify return code: 0 (ok)
---
DONE
[root@cm-r01wn01 pki]#
This is because on a working node (cm-r01wn01), verify_cert_file points to cluster-vip.pem, and the PKI directory contains these links:
/opt/cloudera/security/pki:
server.jks -> /opt/cloudera/security/pki/cm-r01wn01.mws.mds.xyz.keystore.jks
cluster-vip.pem -> srv-c01.mws.mds.xyz.pem
and on a non-working host (cm-awn01) we have much the same:
/opt/cloudera/security/pki:
server.jks -> cm-awn01.nix.mds.xyz.keystore.jks
cluster-vip.pem -> srv-c01.mws.mds.xyz.pem
However, the non-working node connects through an HAProxy and Keepalived VIP, whereas the working node reaches the CM server directly and is presented its certificate straight from it. As a result, the certificate returned to the non-working node carries a different hostname (CN = cm-c01.mws.mds.xyz above), and that certificate doesn't appear in any truststore, hence return code 18 (self signed certificate). Our configuration defines a subjectAltName that makes the certificate valid for three hostnames:
openssl x509 -in srv-c01.mws.mds.xyz.pem -noout -text|grep DNS
DNS:srv-c01.mws.mds.xyz, DNS:cm-r01nn01.mws.mds.xyz, DNS:cm-r01nn02.mws.mds.xyz
On cm-r01nn01:7183, server.jks points to srv-c01.mws.mds.xyz.keystore.jks, which allows validation to succeed:
[root@cm-r01nn01 pki]# ls -altri
135962827 -rw-r--r--. 1 cloudera-scm cloudera-scm 2422 Jul 18 2019 srv-c01.mws.mds.xyz.keystore.jks
135962831 lrwxrwxrwx. 1 root root 32 Jul 27 2019 server.jks -> srv-c01.mws.mds.xyz.keystore.jks
135962833 -rw-r--r--. 1 cloudera-scm cloudera-scm 1505 Jul 27 2019 srv-c01.mws.mds.xyz.pem
135962837 lrwxrwxrwx. 1 root root 23 Jul 27 2019 cluster-vip.pem -> srv-c01.mws.mds.xyz.pem
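If you also want to confirm the SAN entries inside the keystore itself rather than the PEM, keytool can print them; a sketch with the store password as a placeholder:

# <PASS> is a placeholder; keytool -list -v prints each entry's extensions, including the SANs.
keytool -list -v -keystore /opt/cloudera/security/pki/srv-c01.mws.mds.xyz.keystore.jks -storepass <PASS> | grep -A3 -i "SubjectAlternativeName"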
This keystore carries all the correct SAN entries, per the openssl check above. Most importantly, the hostname returned to our Azure box, cm-c01.mws.mds.xyz, throws certificate verification off: cm-c01.mws.mds.xyz isn't in our truststore and doesn't match any of those SAN entries.
In other words, our Azure box has the following (a quick check to confirm the mismatch is sketched after these lists):
- cluster-vip.pem -> srv-c01.mws.mds.xyz.pem
- Connects to HAProxy / Keepalived via the external VIP ( /etc/hosts -> 100.100.100.10 cm-r01nn01.mws.mds.xyz cm-r01nn01 )
- HAProxy and Keepalived return the certificate for cm-c01.mws.mds.xyz
- srv-c01.mws.mds.xyz.pem doesn't match cm-c01.mws.mds.xyz
Our local box has:
- cluster-vip.pem -> srv-c01.mws.mds.xyz.pem
- Connects directly to host cm-r01nn01:7183
- Host returns srv-c01.mws.mds.xyz
- srv-c01.mws.mds.xyz.pem matches srv-c01.mws.mds.xyz
- Keystore validation succeeds because server.jks -> srv-c01.mws.mds.xyz.keystore.jks on the namenode, so the agent's check via verify_cert_file=/opt/cloudera/security/pki/cluster-vip.pem passes
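A quick way to see the mismatch side by side from the worker is to compare what the endpoint actually presents with what the agent is configured to trust; a sketch using the paths above:

# Subject of the certificate the endpoint (VIP or direct host) presents to the agent.
openssl s_client -connect cm-r01nn01.mws.mds.xyz:7183 </dev/null 2>/dev/null | openssl x509 -noout -subject
# Subject and SAN entries of the certificate the agent trusts via verify_cert_file.
openssl x509 -in /opt/cloudera/security/pki/cluster-vip.pem -noout -text | grep -E "Subject:|DNS:"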
So how do we fix this mess? The easiest way, without mucking around with more SSL certificates, is to pass the port 7183 traffic straight through to the CM servers behind the VIP instead of terminating TLS at the VIP with its own certificate. Here is the corrected HAProxy config:
frontend cm7183in
    log 127.0.0.1:514 local0 debug
    bind cm-c01:7183
    default_backend cm7183back

backend cm7183back
    log 127.0.0.1:514 local0 debug
    mode tcp
    balance source
    server cm-r01nn01.mws.mds.xyz cm-r01nn01.mws.mds.xyz:7183 check
    server cm-r01nn02.mws.mds.xyz cm-r01nn02.mws.mds.xyz:7183 check
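After editing, it is worth validating the file and reloading HAProxy; a sketch assuming the default config path and a systemd-managed service:

# Syntax/config check only; returns non-zero if the configuration is invalid.
haproxy -c -f /etc/haproxy/haproxy.cfg
# Apply the change without dropping established connections.
systemctl reload haproxy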
What we had before was this:
# frontend cm7183in
#    bind cm-c01:7183 ssl crt /etc/haproxy/certs/cm-c01.mws.mds.xyz-haproxy.pem no-sslv3
#    default_backend cmback

backend cmback
    mode http
    balance roundrobin
    server cm-r01nn01.mws.mds.xyz cm-r01nn01.mws.mds.xyz:7183 ssl check verify none port 7183 inter 12000 rise 3 fall 3
    server cm-r01nn02.mws.mds.xyz cm-r01nn02.mws.mds.xyz:7183 ssl check verify none port 7183 inter 12000 rise 3 fall 3
That configuration terminated TLS at the VIP and applied another certificate, cm-c01.mws.mds.xyz, to the communication between the worker and the CM servers. Now let's verify that the new configuration actually works as advertised.
[root@cm-awn01 pki]# openssl s_client -connect cm-r01nn01.mws.mds.xyz:7183 -CAfile $(grep -v '^#' /etc/cloudera-scm-agent/config.ini | grep "verify_cert_file=" |sed s/verify_cert_file=//) -verify_hostname cm-r01nn01.mws.mds.xyz </dev/null
CONNECTED(00000003)
depth=0 C = US, ST = California, L = Los Angeles, O = MDS, OU = MDS, CN = srv-c01.mws.mds.xyz
verify return:1
---
.
.
.
Start Time: 1589846762
Timeout : 300 (sec)
Verify return code: 0 (ok)
---
DONE
[root@cm-awn01 pki]#
Because we're now passing the traffic straight through via TCP, with no additional certificate applied along the way, the verification returns OK. Makes sense?
Warmest and Kindest Regards w/ Hugs,
TK