Cloudera: SSLError: certificate verify failed

Are you receiving the following error when enabling SSL certificates on remote Cloudera worker nodes in Azure, AWS or GCP?

[17/May/2020 13:07:32 +0000] 3332 MainThread agent ERROR    Heartbeating to 108.168.115.113:7182 failed.
Traceback (most recent call last):
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/agent.py", line 1387, in _send_heartbeat
    self.cfg.max_cert_depth)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/https.py", line 139, in __init__
    self.conn.connect()
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/httpslib.py", line 69, in connect
    sock.connect((self.host, self.port))
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 309, in connect
    ret = self.connect_ssl()
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 295, in connect_ssl
    return m2.ssl_connect(self.ssl, self._timeout)
SSLError: certificate verify failed

Resolve it by ensuring the private key of the new server is correctly added to the keystore used by Cloudera:

[root@cm-r01nn01 .ssh]# keytool -importkeystore -srckeystore /root/cm-awn01.nix.mds.xyz.keystore.jks -destkeystore /etc/cloudera/keystore -srcalias cm-awn01.nix.mds.xyz  -deststorepass <PASS> -srcstorepass <SRCPASS> -destalias cm-awn01.nix.mds.xyz
Importing keystore /root/cm-awn01.nix.mds.xyz.keystore.jks to /etc/cloudera/keystore

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore /etc/cloudera/keystore -destkeystore /etc/cloudera/keystore -deststoretype pkcs12".

[root@cm-r01nn01 .ssh]# keytool -list -keystore /etc/cloudera/keystore -storepass <PASS> | grep -Ei "cm|srv"
cm-awn01.nix.mds.xyz, May 17, 2020, PrivateKeyEntry,

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore /etc/cloudera/keystore -destkeystore /etc/cloudera/keystore -deststoretype pkcs12".
[root@cm-r01nn01 .ssh]#

Then retry your connection once more.
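
Before retrying, it can help to confirm that the alias really landed as a PrivateKeyEntry and not just a trusted certificate. Below is a minimal sketch of that check; the function name has_private_key_entry is made up for illustration, and it only inspects keytool -list output fed to it on stdin (the sample line is copied from the listing above).

```shell
# Hypothetical helper: succeeds if the given alias appears as a
# PrivateKeyEntry in `keytool -list` output supplied on stdin.
has_private_key_entry() {
    alias_name=$1
    # keytool prints "<alias>, <date>, <entry type>," on one line per entry.
    # This is a plain substring match on "<alias>,".
    grep -F -- "${alias_name}," | grep -q 'PrivateKeyEntry'
}

# Sample listing line taken from the keytool output above.
listing='cm-awn01.nix.mds.xyz, May 17, 2020, PrivateKeyEntry,'

if printf '%s\n' "$listing" | has_private_key_entry cm-awn01.nix.mds.xyz; then
    echo "private key entry present"
else
    echo "private key entry missing"
fi
```

On a real host you would pipe the keytool -list command shown above into the function instead of the sample line.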

A similar failure with a different resolution occurs when the agent.pem symlink on the client points to the wrong PEM file.  Take this example:

[18/May/2020 13:03:51 +0000] 2413 MainThread agent        ERROR    Heartbeating to srv-c01.mws.mds.xyz:7182 failed.
Traceback (most recent call last):
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/agent.py", line 1387, in _send_heartbeat
    self.cfg.max_cert_depth)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/https.py", line 139, in __init__
    self.conn.connect()
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/httpslib.py", line 69, in connect
    sock.connect((self.host, self.port))
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 309, in connect
    ret = self.connect_ssl()
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 295, in connect_ssl
    return m2.ssl_connect(self.ssl, self._timeout)
SSLError: certificate verify failed

The connection is being made to srv-c01.mws.mds.xyz:7182 but the agent.pem points to a different PEM file:

[root@cm-awn01 pki]# ls -altri
total 44
  655285 -rw-r--r-- 1 cloudera-scm cloudera-scm 1453 May 12 08:06 cm-r01nn01.mws.mds.xyz.pem
  655286 -rw-r--r-- 1 cloudera-scm cloudera-scm 1453 May 12 08:06 cm-r01nn02.mws.mds.xyz.pem
  655287 lrwxrwxrwx 1 cloudera-scm cloudera-scm   53 May 12 08:19 agent.pem -> /opt/cloudera/security/pki/cm-r01nn01.mws.mds.xyz.pem
[root@cm-awn01 pki]#

Repoint to the correct PEM file:

[root@cm-awn01 pki]# ls -altri
  652341 -rw-r--r-- 1 cloudera-scm cloudera-scm 1505 May 12 08:50 srv-c01.mws.mds.xyz.pem
  691030 lrwxrwxrwx 1 root         root           23 May 18 13:04 agent.pem -> srv-c01.mws.mds.xyz.pem
  691034 drwxr-xr-x 2 root         root         4096 May 18 13:04 .
[root@cm-awn01 pki]#

The key is understanding where the connection is being made and which PEM files that connection requires.
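
One way to check where agent.pem ends up is to resolve the symlink and compare the file name with the host the agent heartbeats to. The sketch below assumes the <hostname>.pem naming seen in the listings and GNU readlink; pem_link_matches is a hypothetical helper name, and the demo runs in a scratch directory rather than the real pki directory.

```shell
# Hypothetical helper: succeeds if symlink $1 resolves to a file named "$2.pem".
pem_link_matches() {
    link_path=$1
    expected_host=$2
    # readlink -f follows the whole symlink chain (GNU coreutils).
    target=$(readlink -f -- "$link_path") || return 1
    [ "$(basename -- "$target")" = "${expected_host}.pem" ]
}

# Self-contained demo in a scratch directory standing in for the pki directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/srv-c01.mws.mds.xyz.pem"
ln -s srv-c01.mws.mds.xyz.pem "$demo_dir/agent.pem"

if pem_link_matches "$demo_dir/agent.pem" srv-c01.mws.mds.xyz; then
    echo "agent.pem points at the expected PEM"
else
    echo "agent.pem points elsewhere"
fi
rm -rf "$demo_dir"
```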

------------------------------------------------------------------------

Another source of this error is when parcels are being downloaded:

[18/May/2020 17:22:57 +0000] 5627 Thread-13 https        ERROR    Failed to retrieve/store URL: https://cm-r01nn01.mws.mds.xyz:7183/cmf/parcel/download/CDH-6.2.0-1.cdh6.2.0.p0.967373-el7.parcel.torrent -> /opt/cloudera/parcel-cache/CDH-6.2.0-1.cdh6.2.0.p0.967373-el7.parcel.torrent certificate verify failed
Traceback (most recent call last):
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/https.py", line 193, in fetch_to_file
    resp = self.open(req_url)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/https.py", line 188, in open
    return self.opener(url, *pargs, **kwargs)
  File "/usr/lib64/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/https.py", line 179, in https_open
    return self.do_open(opener, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1211, in do_open
    h.request(req.get_method(), req.get_selector(), req.data, headers)
  File "/usr/lib64/python2.7/httplib.py", line 1041, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib64/python2.7/httplib.py", line 1075, in _send_request
    self.endheaders(body)
  File "/usr/lib64/python2.7/httplib.py", line 1037, in endheaders
    self._send_output(message_body)
  File "/usr/lib64/python2.7/httplib.py", line 881, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.7/httplib.py", line 843, in send
    self.connect()
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/httpslib.py", line 69, in connect
    sock.connect((self.host, self.port))
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 309, in connect
    ret = self.connect_ssl()
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 295, in connect_ssl
    return m2.ssl_connect(self.ssl, self._timeout)
SSLError: certificate verify failed

So what do we do with this guy?  Use this line to verify the certificate much like the Cloudera agent would:

openssl s_client -connect cm-r01nn01:7183 -CAfile $(grep -v '^#' /etc/cloudera-scm-agent/config.ini | grep "verify_cert_file=" |sed s/verify_cert_file=//) -verify_hostname cm-r01nn01.mws.mds.xyz </dev/null

Run the above on a non-working host:

[root@cm-awn01 pki]# openssl s_client -connect cm-r01nn01:7183 -CAfile $(grep -v '^#' /etc/cloudera-scm-agent/config.ini | grep "verify_cert_file=" |sed s/verify_cert_file=//) -verify_hostname cm-r01nn01.mws.mds.xyz </dev/null
CONNECTED(00000003)
depth=0 C = US, ST = California, L = Los Angeles, O = MDS, OU = MDS, CN = cm-c01.mws.mds.xyz
verify error:num=18:self signed certificate
verify return:1
depth=0 C = US, ST = California, L = Los Angeles, O = MDS, OU = MDS, CN = cm-c01.mws.mds.xyz
verify return:1
.
.
.
    Start Time: 1589842560
    Timeout   : 300 (sec)
    Verify return code: 18 (self signed certificate)

DONE

and on a working host:

[root@cm-r01wn01 pki]# openssl s_client -connect cm-r01nn01:7183 -CAfile $(grep -v '^#' /etc/cloudera-scm-agent/config.ini | grep "verify_cert_file=" |sed s/verify_cert_file=//) -verify_hostname cm-r01nn01.mws.mds.xyz </dev/null
CONNECTED(00000003)
depth=0 C = US, ST = California, L = Los Angeles, O = MDS, OU = MDS, CN = srv-c01.mws.mds.xyz
verify return:1
.
.
.
    Start Time: 1589842676
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)

DONE
[root@cm-r01wn01 pki]#

On the working node (cm-r01wn01), verify_cert_file points to cluster-vip.pem, which is linked as follows:

/opt/cloudera/security/pki:
server.jks -> /opt/cloudera/security/pki/cm-r01wn01.mws.mds.xyz.keystore.jks
cluster-vip.pem -> srv-c01.mws.mds.xyz.pem

and on the non-working host (cm-awn01), cluster-vip.pem links to the same file:

/opt/cloudera/security/pki:
server.jks -> cm-awn01.nix.mds.xyz.keystore.jks
cluster-vip.pem -> srv-c01.mws.mds.xyz.pem

However, the non-working node goes through an HAProxy and Keepalived VIP, whereas the working node receives the certificate directly from the CM server.  So first, the hostname (CN) returned is different: cm-c01.mws.mds.xyz on the non-working host versus srv-c01.mws.mds.xyz on the working one.  Second, the cm-c01 certificate doesn't appear in any truststore, hence 18 (self signed certificate).  Our srv-c01 certificate has subjectAltName defined, which makes it valid for three hostnames:

openssl x509 -in srv-c01.mws.mds.xyz.pem -noout -text|grep DNS
                DNS:srv-c01.mws.mds.xyz, DNS:cm-r01nn01.mws.mds.xyz, DNS:cm-r01nn02.mws.mds.xyz
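
The same SAN line can be checked for a specific hostname. The following is a small sketch, not part of any Cloudera or OpenSSL tooling: san_covers is a hypothetical helper that does an exact string comparison on the DNS entries (no wildcard handling), fed here with the line printed above.

```shell
# Hypothetical helper: succeeds if the SAN text on stdin contains an exact
# "DNS:<hostname>" entry. Wildcard entries are NOT interpreted.
san_covers() {
    wanted=$1
    # One entry per line, trim surrounding whitespace, then match whole lines.
    tr ',' '\n' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -Fxq -- "DNS:${wanted}"
}

# SAN line copied from the openssl x509 output above.
san_line='                DNS:srv-c01.mws.mds.xyz, DNS:cm-r01nn01.mws.mds.xyz, DNS:cm-r01nn02.mws.mds.xyz'

for host in cm-r01nn01.mws.mds.xyz cm-c01.mws.mds.xyz; do
    if printf '%s\n' "$san_line" | san_covers "$host"; then
        echo "$host: covered by the certificate"
    else
        echo "$host: NOT covered by the certificate"
    fi
done
```

On a live system the input would come from the openssl x509 ... | grep DNS command shown above rather than from the san_line variable.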

On cm-r01nn01:7183, the server.jks points to the srv-c01.mws.mds.xyz.keystore.jks file allowing validation to succeed:

[root@cm-r01nn01 pki]# ls -altri
135962827 -rw-r--r--. 1 cloudera-scm cloudera-scm 2422 Jul 18  2019 srv-c01.mws.mds.xyz.keystore.jks
135962831 lrwxrwxrwx. 1 root         root           32 Jul 27  2019 server.jks -> srv-c01.mws.mds.xyz.keystore.jks
135962833 -rw-r--r--. 1 cloudera-scm cloudera-scm 1505 Jul 27  2019 srv-c01.mws.mds.xyz.pem
135962837 lrwxrwxrwx. 1 root         root           23 Jul 27  2019 cluster-vip.pem -> srv-c01.mws.mds.xyz.pem

This keystore has all the correct SAN entries per the openssl check above.  Most importantly, the certificate returned to our Azure box through the VIP is for cm-c01.mws.mds.xyz, which throws the verification off: cm-c01.mws.mds.xyz isn't in our truststore and the hostname doesn't match.

In other words, our Azure box has:

  • cluster-vip.pem -> srv-c01.mws.mds.xyz.pem
  • Connects to HAProxy / Keepalived via the external VIP ( /etc/hosts -> 100.100.100.10 cm-r01nn01.mws.mds.xyz cm-r01nn01 )
  • HAProxy and Keepalived return the certificate for cm-c01.mws.mds.xyz
  • srv-c01.mws.mds.xyz.pem doesn't match cm-c01.mws.mds.xyz

Our local box has:

  • cluster-vip.pem -> srv-c01.mws.mds.xyz.pem
  • Connects directly to host cm-r01nn01:7183
  • Host returns srv-c01.mws.mds.xyz
  • srv-c01.mws.mds.xyz.pem matches srv-c01.mws.mds.xyz
  • Validation succeeds because server.jks on the namenode points to srv-c01.mws.mds.xyz.keystore.jks, which matches verify_cert_file=/opt/cloudera/security/pki/cluster-vip.pem on the agent
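
The difference between the two boxes shows up in the depth=0 line of the s_client output, so it can be pulled out and compared directly. This is a rough sketch under a couple of assumptions: peer_cn is a hypothetical helper, and it expects CN to be the last field on the depth=0 line, as it is in the outputs above.

```shell
# Hypothetical helper: print the CN from the first "depth=0" line of
# `openssl s_client` output on stdin. Assumes CN is the last subject field.
# Accepts both "CN = host" and "CN=host" spellings.
peer_cn() {
    sed -n 's/^depth=0 .*CN *= *\(.*\)$/\1/p' | head -n 1
}

# depth=0 lines copied from the non-working and working hosts above.
via_vip='depth=0 C = US, ST = California, L = Los Angeles, O = MDS, OU = MDS, CN = cm-c01.mws.mds.xyz'
direct='depth=0 C = US, ST = California, L = Los Angeles, O = MDS, OU = MDS, CN = srv-c01.mws.mds.xyz'

cn_vip=$(printf '%s\n' "$via_vip" | peer_cn)
cn_direct=$(printf '%s\n' "$direct" | peer_cn)

if [ "$cn_vip" = "$cn_direct" ]; then
    echo "both paths present the same certificate CN: $cn_direct"
else
    echo "CN differs: via VIP=$cn_vip, direct=$cn_direct"
fi
```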

So how do we fix this mess?  The easiest way, without mucking around with too many SSL certs, is to pass the traffic on port 7183 straight through to the CM servers behind the VIP, in TCP mode, instead of applying a new certificate to the VIP itself.  Here is the correct config:

frontend cm7183in
        log                             127.0.0.1:514   local0          debug
        bind    cm-c01:7183
        default_backend cm7183back

backend cm7183back
        log                             127.0.0.1:514   local0          debug
        mode tcp
        balance source

        server cm-r01nn01.mws.mds.xyz cm-r01nn01.mws.mds.xyz:7183 check
        server cm-r01nn02.mws.mds.xyz cm-r01nn02.mws.mds.xyz:7183 check

What we had before was this:

# frontend cm7183in
#         bind    cm-c01:7183 ssl crt /etc/haproxy/certs/cm-c01.mws.mds.xyz-haproxy.pem no-sslv3
#         default_backend cmback

backend cmback
        mode http
        balance roundrobin

        server cm-r01nn01.mws.mds.xyz cm-r01nn01.mws.mds.xyz:7183 ssl check verify none port 7183 inter 12000 rise 3 fall 3
        server cm-r01nn02.mws.mds.xyz cm-r01nn02.mws.mds.xyz:7183 ssl check verify none port 7183 inter 12000 rise 3 fall 3

This applied another certificate, cm-c01.mws.mds.xyz, to the communication between the worker and the CM servers.  So now let's verify that the new config actually works as advertised:

[root@cm-awn01 pki]# openssl s_client -connect cm-r01nn01.mws.mds.xyz:7183 -CAfile $(grep -v '^#' /etc/cloudera-scm-agent/config.ini | grep "verify_cert_file=" |sed s/verify_cert_file=//) -verify_hostname cm-r01nn01.mws.mds.xyz </dev/null
CONNECTED(00000003)
depth=0 C = US, ST = California, L = Los Angeles, O = MDS, OU = MDS, CN = srv-c01.mws.mds.xyz
verify return:1

.
.
.
    Start Time: 1589846762
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)

DONE
[root@cm-awn01 pki]#

Because we're passing the traffic straight through via TCP, and not applying any certs along the way, our verification now returns OK.  Make sense?
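
If you want to script this check across several worker nodes, the verify return code is the one value worth extracting. Here is a minimal sketch; verify_code is a hypothetical helper that only parses s_client output given on stdin, and the sample text is taken from the run above.

```shell
# Hypothetical helper: print the numeric code from the first
# "Verify return code:" line of `openssl s_client` output on stdin.
verify_code() {
    sed -n 's/^[[:space:]]*Verify return code: \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Tail of the s_client output from the fixed host above.
sample='    Start Time: 1589846762
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)'

code=$(printf '%s\n' "$sample" | verify_code)
if [ "$code" = "0" ]; then
    echo "certificate verified"
else
    echo "verification failed (code ${code:-unknown})"
fi
```

In practice the input would be the output of the openssl s_client command used throughout this post, with stderr merged in if you also want the depth lines.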

Warmest and Kindest Regards w/ Hugs,
TK

  Copyright © 2003 - 2013 Tom Kacperski (microdevsys.com). All rights reserved.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License