

FATAL: remaining connection slots are reserved for non-replication superuser connections

Getting this?

FATAL:  remaining connection slots are reserved for non-replication superuser connections

Fix it by updating the Patroni configuration like so:

[root@psql01 log]# patronictl -c /etc/patroni.yml edit-config postgres

+++
@@ -1,9 +1,10 @@
 loop_wait: 10
 maximum_lag_on_failover: 1048576
 postgresql:
+  parameters:
-  max_connections: 256
+    max_connections: 256
-  max_replication_slots: 64
+    max_replication_slots: 64
-  max_wal_senders: 32
+    max_wal_senders: 32
   use_pg_rewind: true
 retry_timeout: 10
 ttl: 30

Apply these changes? [y/N]: y
Configuration changed
[root@psql01 log]#
[root@psql01 log]#
[root@psql01 log]# patronictl -c /etc/patroni.yml restart postgres
+----------+-------------+---------------+--------+---------+-----------+
| Cluster  |    Member   |      Host     |  Role  |  State  | Lag in MB |
+----------+-------------+---------------+--------+---------+-----------+
| postgres | postgresql0 | 192.168.0.108 | Leader | running |       0.0 |
| postgres | postgresql1 | 192.168.0.124 |        | running |       0.0 |
| postgres | postgresql2 | 192.168.0.118 |        | running |       0.0 |
+----------+-------------+---------------+--------+---------+-----------+
Are you sure you want to restart members postgresql0, postgresql1, postgresql2? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2)  []:
When should the restart take place (e.g. 2015-10-01T14:30)  [now]:
Success: restart on member postgresql0
Success: restart on member postgresql1
Success: restart on member postgresql2
[root@psql01 log]# sudo su - postgres
Last login: Sat Sep 14 09:15:34 EDT 2019 on pts/0
-bash-4.2$ psql -h psql-c01 -p 5432 -W
Password:
psql (10.5)
Type "help" for help.

postgres=#
postgres=#
postgres=#
postgres=# show max_connections; show  max_replication_slots;
 max_connections
-----------------
 256
(1 row)

 max_replication_slots
-----------------------
 64
(1 row)

postgres=#
 

Keep in mind that cluster name above is your scope from the config file:

[root@psql01 patroni]# cat /etc/patroni.yml
scope: postgres

Alternatively, if you're not running Patroni, update the PostgreSQL settings directly with the values above.
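For a plain PostgreSQL instance (no Patroni), a minimal sketch of the equivalent change, assuming superuser access on the database host; these parameters only take effect after a full restart:

```shell
# Raise the limits via ALTER SYSTEM (writes postgresql.auto.conf),
# then restart for the changes to apply.
psql -U postgres -c "ALTER SYSTEM SET max_connections = 256;"
psql -U postgres -c "ALTER SYSTEM SET max_replication_slots = 64;"
psql -U postgres -c "ALTER SYSTEM SET max_wal_senders = 32;"
systemctl restart postgresql
# Verify:
psql -U postgres -c "SHOW max_connections;"
```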

Cheers,
TK

REF: My post on the project page: https://github.com/zalando/patroni/issues/1177

touch: cannot touch /atlas/atlassian/confluence/logs/catalina.out: Permission denied

Getting this?

[confluence@atlas02 logs]$ logout
[root@atlas02 atlassian]# systemctl status confluence.service -l
● confluence.service - LSB: Atlassian Confluence
   Loaded: loaded (/etc/rc.d/init.d/confluence; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2019-09-10 22:07:18 EDT; 2min 5s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 11361 ExecStop=/etc/rc.d/init.d/confluence stop (code=exited, status=0/SUCCESS)
  Process: 11925 ExecStart=/etc/rc.d/init.d/confluence start (code=exited, status=1/FAILURE)

Sep 10 22:07:18 atlas02.nix.mds.xyz confluence[11925]: at com.atlassian.confluence.bootstrap.SynchronyProxyWatchdog.main(SynchronyProxyWatchdog.java:47)
Sep 10 22:07:18 atlas02.nix.mds.xyz confluence[11925]: 2019-09-10 22:07:18,348 INFO [main] [atlassian.confluence.bootstrap.SynchronyProxyWatchdog] A Context element for ${confluence.context.path}/synchrony-proxy is found in /atlas/atlassian/confluence/conf/server.xml. No further action is required
Sep 10 22:07:18 atlas02.nix.mds.xyz confluence[11925]: ---------------------------------------------------------------------------
Sep 10 22:07:18 atlas02.nix.mds.xyz confluence[11925]: touch: cannot touch '/atlas/atlassian/confluence/logs/catalina.out': Permission denied
Sep 10 22:07:18 atlas02.nix.mds.xyz confluence[11925]: /atlas/atlassian/confluence/bin/catalina.sh: line 464: /atlas/atlassian/confluence/logs/catalina.out: Permission denied
Sep 10 22:07:18 atlas02.nix.mds.xyz runuser[11930]: pam_unix(runuser:session): session closed for user confluence1
Sep 10 22:07:18 atlas02.nix.mds.xyz systemd[1]: confluence.service: control process exited, code=exited status=1
Sep 10 22:07:18 atlas02.nix.mds.xyz systemd[1]: Failed to start LSB: Atlassian Confluence.
Sep 10 22:07:18 atlas02.nix.mds.xyz systemd[1]: Unit confluence.service entered failed state.
Sep 10 22:07:18 atlas02.nix.mds.xyz systemd[1]: confluence.service failed.
[root@atlas02 atlassian]# ls -altri /atlas/atlassian/confluence/conf/server.xml.
ls: cannot access /atlas/atlassian/confluence/conf/server.xml.: No such file or directory
[root@atlas02 atlassian]#

 

And seeing this from journalctl -xe:

-- Unit confluence.service has begun starting up.
Sep 10 22:11:18 atlas02.nix.mds.xyz confluence[12241]: To run Confluence in the foreground, start the server with start-confluence.sh -fg
Sep 10 22:11:18 atlas02.nix.mds.xyz confluence[12241]: executing using dedicated user: confluence1
Sep 10 22:11:18 atlas02.nix.mds.xyz runuser[12246]: pam_unix(runuser:session): session opened for user confluence1 by (uid=0)
Sep 10 22:11:18 atlas02.nix.mds.xyz confluence[12241]: If you encounter issues starting up Confluence, please see the Installation guide at http:/
Sep 10 22:11:18 atlas02.nix.mds.xyz confluence[12241]: Server startup logs are located in /atlas/atlassian/confluence/logs/catalina.out
Sep 10 22:11:18 atlas02.nix.mds.xyz confluence[12241]: ---------------------------------------------------------------------------
Sep 10 22:11:18 atlas02.nix.mds.xyz confluence[12241]: Using Java: /atlas/atlassian/confluence/jre//bin/java
Sep 10 22:11:18 atlas02.nix.mds.xyz automount[5344]: st_expire: state 1 path /n
Sep 10 22:11:18 atlas02.nix.mds.xyz automount[5344]: expire_proc: exp_proc = 140606617675520 path /n
Sep 10 22:11:18 atlas02.nix.mds.xyz automount[5344]: expire_proc_indirect: expire /n/mds.xyz
Sep 10 22:11:18 atlas02.nix.mds.xyz automount[5344]: 1 remaining in /n
Sep 10 22:11:18 atlas02.nix.mds.xyz automount[5344]: expire_cleanup: got thid 140606617675520 path /n stat 3
Sep 10 22:11:18 atlas02.nix.mds.xyz automount[5344]: expire_cleanup: sigchld: exp 140606617675520 finished, switching from 2 to 1
Sep 10 22:11:18 atlas02.nix.mds.xyz automount[5344]: st_ready: st_ready(): state = 2 path /n
Sep 10 22:11:19 atlas02.nix.mds.xyz confluence[12241]: log4j:ERROR setFile(null,true) call failed.
Sep 10 22:11:19 atlas02.nix.mds.xyz confluence[12241]: java.io.FileNotFoundException: /atlas/atlassian/confluence/logs/synchrony-proxy-watchdog.lo
Sep 10 22:11:19 atlas02.nix.mds.xyz confluence[12241]: at java.io.FileOutputStream.open0(Native Method)
Sep 10 22:11:19 atlas02.nix.mds.xyz confluence[12241]: at java.io.FileOutputStream.open(FileOutputStream.java:270)
Sep 10 22:11:19 atlas02.nix.mds.xyz confluence[12241]: at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
Sep 10 22:11:19 atlas02.nix.mds.xyz confluence[12241]: at java.io.FileOutputStream.<init>(FileOutputStream.java:133)
Sep 10 22:11:19 atlas02.nix.mds.xyz confluence[12241]: at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
Sep 10 22:11:19 atlas02.nix.mds.xyz confluence[12241]: at org.apache.log4j.RollingFileAppender.setFile(RollingFileAppender.java:207)
Sep 10 22:11:19 atlas02.nix.mds.xyz confluence[12241]: at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165)
Sep 10 22:11:19 atlas02.nix.mds.xyz confluence[12241]: at com.atlassian.confluence.bootstrap.SynchronyProxyWatchdog.addLogFileAppender(SynchronyPr
Sep 10 22:11:19 atlas02.nix.mds.xyz confluence[12241]: at com.atlassian.confluence.bootstrap.SynchronyProxyWatchdog.main(SynchronyProxyWatchdog.ja
Sep 10 22:11:19 atlas02.nix.mds.xyz confluence[12241]: 2019-09-10 22:11:19,321 INFO [main] [atlassian.confluence.bootstrap.SynchronyProxyWatchdog]
Sep 10 22:11:19 atlas02.nix.mds.xyz confluence[12241]: ---------------------------------------------------------------------------
Sep 10 22:11:19 atlas02.nix.mds.xyz confluence[12241]: touch: cannot touch '/atlas/atlassian/confluence/logs/catalina.out': Permission denied
Sep 10 22:11:19 atlas02.nix.mds.xyz confluence[12241]: /atlas/atlassian/confluence/bin/catalina.sh: line 464: /atlas/atlassian/confluence/logs/cat
Sep 10 22:11:19 atlas02.nix.mds.xyz runuser[12246]: pam_unix(runuser:session): session closed for user confluence1
Sep 10 22:11:19 atlas02.nix.mds.xyz systemd[1]: confluence.service: control process exited, code=exited status=1
Sep 10 22:11:19 atlas02.nix.mds.xyz systemd[1]: Failed to start LSB: Atlassian Confluence.
-- Subject: Unit confluence.service has failed
-- Defined-By: systemd

 

It turns out that Confluence creates a new user every time you install it.  Why?  Who knows.  It's the first time I have ever seen an application do anything like that.  It's unusual and annoying, especially if you reinstall Confluence only to find it has made itself yet another user.  Searching for the real user with standard process commands can also be misleading when two or more of these users exist:

[root@atlas02 logs]# ps -ef|grep -Ei confluence|grep logs
conflue+ 10256     1 43 01:23 ?        00:01:30 /atlas/atlassian/confluence/jre//bin/java

To fix this, do the following.  

Change the user to the earlier confluence user.  In our case, change confluence1 to confluence:

[root@atlas02 bin]# grep -Ei confluence1 *
grep: synchrony: Is a directory
user.sh:CONF_USER="confluence1" # user created by installer
[root@atlas02 bin]#
[root@atlas02 bin]#
[root@atlas02 bin]#
[root@atlas02 bin]# vi user.sh
[root@atlas02 bin]# pwd
/atlas/atlassian/confluence/bin
[root@atlas02 bin]#

Next change the directory permissions on the confluence folder:

[root@atlas02 atlas]# pwd
/atlas
[root@atlas02 atlas]# ls -altri
total 17
11318803973829525516 -rw-r--r--.  1 root       root          8 Nov 15  2018 you.there
                 128 dr-xr-xr-x. 24 root       root       4096 Mar 12 21:23 ..
12124534773086893833 drwxr-xr-x.  4 root       root       4096 Mar 23 12:34 atlassian.bak
                   1 drwxr-xr-x.  5 root       root       4096 Mar 23 13:23 .
13456417161533701348 drwxr-xr-x.  4 confluence confluence 4096 Mar 23 13:28 atlassian
[root@atlas02 atlas]# chown -R confluence.confluence atlassian

And restart confluence using:

systemctl restart confluence
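The fix above can be sketched as three commands; a sketch assuming the same install paths and user names as this system:

```shell
# Point the startup script back at the original service account...
sed -i 's/^CONF_USER="confluence1"/CONF_USER="confluence"/' /atlas/atlassian/confluence/bin/user.sh
# ...make the tree owned by that account...
chown -R confluence:confluence /atlas/atlassian
# ...and restart the service.
systemctl restart confluence
```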

Cheers,
TK

 

Application application_1567571625367_0006 failed 2 times due to AM Container for appattempt_1567571625367_0006_000002 exited with  exitCode: -1000

Getting this?

19/09/07 23:41:56 ERROR repl.Main: Failed to initialize Spark session.
org.apache.spark.SparkException: Application application_1567571625367_0006 failed 2 times due to AM Container for appattempt_1567571625367_0006_000002 exited with  exitCode: -1000
Failing this attempt.Diagnostics: [2019-09-07 23:41:54.934]Application application_1567571625367_0006 initialization failed (exitCode=255) with output: main : command provided 0
main : run as user is tom
main : requested yarn user is tom
User tom not found

For more detailed output, check the application tracking page: http://cm-r01nn02.mws.mds.xyz:8088/cluster/app/application_1567571625367_0006 Then click on links to logs of each attempt.
. Failing the application.

 

This is likely due to incorrect auth_to_local rules in HDFS -> Configuration:


RULE:[2:$1@$0](HTTP@\QMWS.MDS.XYZ\E$)s/@\QMWS.MDS.XYZ\E$//
RULE:[1:$1@$0](.*@\QMWS.MDS.XYZ\E$)s/@\QMWS.MDS.XYZ\E$///L
RULE:[2:$1@$0](.*@\QMWS.MDS.XYZ\E$)s/@\QMWS.MDS.XYZ\E$///L
RULE:[2:$1@$0](HTTP@\Qmws.mds.xyz\E$)s/@\Qmws.mds.xyz\E$//
RULE:[1:$1@$0](.*@\Qmws.mds.xyz\E$)s/@\Qmws.mds.xyz\E$///L
RULE:[2:$1@$0](.*@\Qmws.mds.xyz\E$)s/@\Qmws.mds.xyz\E$///L
RULE:[2:$1@$0](HTTP@\QMDS.XYZ\E$)s/@\QMDS.XYZ\E$//
RULE:[1:$1@$0](.*@\QMDS.XYZ\E$)s/@\QMDS.XYZ\E$///L
RULE:[2:$1@$0](.*@\QMDS.XYZ\E$)s/@\QMDS.XYZ\E$///L
RULE:[2:$1@$0](HTTP@\Qmds.xyz\E$)s/@\Qmds.xyz\E$//
RULE:[1:$1@$0](.*@\Qmds.xyz\E$)s/@\Qmds.xyz\E$///L
RULE:[2:$1@$0](.*@\Qmds.xyz\E$)s/@\Qmds.xyz\E$///L

 

In our case, we removed the above rules.  More fine-tuning would be needed to make them both HDFS- and Spark-friendly.
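Each auth_to_local rule is essentially a regex match on the principal plus a sed-style substitution, with a trailing /L lowercasing the result. What a rule like RULE:[1:$1@$0](.*@\QMWS.MDS.XYZ\E$)s/@\QMWS.MDS.XYZ\E$///L does to a matching principal can be approximated with plain sed and tr:

```shell
# Strip the realm from a matching principal, then lowercase it (the /L flag):
echo 'Tom@MWS.MDS.XYZ' | sed -E 's/@MWS\.MDS\.XYZ$//' | tr '[:upper:]' '[:lower:]'
# → tom
```

The "User tom not found" failure above means the mapped short name must exist as a local (or SSSD-resolvable) user on every NodeManager, so a rule that maps to a name the OS can't resolve breaks container localization.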

Cheers,
TK

Configure Cloudera HUE with FreeIPA

Configuring HUE with LDAP / FreeIPA:

[root@idmipa03 ~]# LDAPTLS_CACERT=/etc/ipa/ca.crt ldapsearch -H ldaps://idmipa03.mws.mds.xyz:636 -D "uid=admin,cn=users,cn=compat,dc=mws,dc=mds,dc=xyz" -w "<SECRET>" -b "dc=mws,dc=mds,dc=xyz" -v "(&(objectClass=posixAccount)(uid=*))"  |grep dn:
ldap_initialize( ldaps://idmipa03.mws.mds.xyz:636/??base )
filter: (&(objectClass=posixAccount)(uid=*))
requesting: All userApplication attributes
dn: uid=cmadmin-530029b6,cn=users,cn=compat,dc=mws,dc=mds,dc=xyz
dn: uid=admin,cn=users,cn=compat,dc=mws,dc=mds,dc=xyz
dn: uid=admin,cn=users,cn=accounts,dc=mws,dc=mds,dc=xyz
dn: uid=cmadmin-530029b6,cn=users,cn=accounts,dc=mws,dc=mds,dc=xyz
[root@idmipa03 ~]#

Ensure the following settings:

Authentication Backend ( backend ) : desktop.authentication.backend.LdapBackend
PAM Backend Service Name ( pam_service) : login
LDAP URL  ( ldap_url ) : ldaps://idmipa03.mws.mds.xyz:636
LDAP Server CA Certificate ( ldap_cert ) : /etc/ipa/ca.crt
Enable LDAP TLS ( use_start_tls ) : <CHECKED>
Use Search Bind Authentication (search_bind_authentication) : <CHECKED>
Create LDAP users on login ( create_users_on_login ) : <CHECKED>
LDAP Search Base ( base_dn ) : cn=compat,dc=mws,dc=mds,dc=xyz
LDAP Bind User Distinguished Name ( bind_dn ) : uid=admin,cn=users,cn=accounts,dc=mws,dc=mds,dc=xyz
LDAP Bind Password ( bind_password ) : <SECRET>
LDAP User Filter ( user_filter ) : (objectClass=posixAccount)
LDAP Username Attribute ( user_name_attr ) : uid
LDAP Group Filter ( group_filter ) : (objectClass=posixGroup)
LDAP Group Name Attribute ( group_name_attr ) : cn
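If you manage hue.ini directly rather than through Cloudera Manager, the same settings live under the [desktop] [[ldap]] section; a sketch using the values above (bind_password elided):

```ini
[desktop]
  [[ldap]]
    ldap_url=ldaps://idmipa03.mws.mds.xyz:636
    ldap_cert=/etc/ipa/ca.crt
    use_start_tls=true
    search_bind_authentication=true
    create_users_on_login=true
    base_dn="cn=compat,dc=mws,dc=mds,dc=xyz"
    bind_dn="uid=admin,cn=users,cn=accounts,dc=mws,dc=mds,dc=xyz"
    bind_password=<SECRET>
    [[[users]]]
      user_filter="(objectClass=posixAccount)"
      user_name_attr=uid
    [[[groups]]]
      group_filter="(objectClass=posixGroup)"
      group_name_attr=cn
```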

Test the configuration ( Hue – Actions – Test LDAP Configuration ):

Test LDAP Configuration
Status  Finished  Context 
Hue
  Sep 2, 7:09:09 PM  35.8s 
Hue's LDAP configuration is valid.
 
Completed 1 of 1 step(s).
Testing the Hue LDAP configuration.        
Hue Server (cm-r01en01)
Sep 2, 7:09:09 PM    35.8s

 

You may receive this error:

[root@cm-r01en01 hue-httpd]# ldapsearch -Y GSSAPI -w "<SECRET>" -H 'ldaps://idmipa-c01.mws.mds.xyz:636' -b 'dc=mws,dc=mds,dc=xyz' -D 'uid=admin,cn=users,cn=accounts,dc=mws,dc=mds,dc=xyz' '(&(objectClass=posixAccount)(uid=tom))' -d1 |grep dn:
TLS: hostname (idmipa-c01.mws.mds.xyz) does not match common name in certificate (idmipa04.mws.mds.xyz).

This means you'll need a SAN certificate with 1) the VIP, 2) idmipa03 and 3) idmipa04 listed as valid hostnames.  Otherwise, point at a single IPA server node.

To find users in AD DC ( Active Directory / Domain Controllers ) use the explicit format:

[root@cm-r01en01 cloudera-scm-agent]# LDAPTLS_CACERT=/etc/ipa/ca.crt   ldapsearch -Y GSSAPI -H ldaps://idmipa03.mws.mds.xyz:636 -D "uid=admin,cn=users,cn=accounts,dc=mws,dc=mds,dc=xyz" -w "<SECRET>" -b "dc=mws,dc=mds,dc=xyz" "(uid=tom@mds.xyz)" -v|grep dn:

As per RFC 2307.

While configuring, we ran into the following:

/var/run/cloudera-scm-agent/process/2231-hue-HUE_SERVER/logs/stderr.log
[02/Sep/2019 20:05:57 +0000] backend      WARNING  Cannot configure LDAP with SSL and enable STARTTLS.
[02/Sep/2019 20:05:58 +0000] config       ERROR    search_s('dc=mws,dc=mds,dc=xyz', 2, '(&(uid=tom@mds.xyz)(*))') raised FILTER_ERROR({'desc': 'Bad search filter'},)
[02/Sep/2019 20:05:58 +0000] config       DEBUG    search_s('dc=mws,dc=mds,dc=xyz', 2, '(&(uid=%(user)s)(*))') returned 0 objects:
[02/Sep/2019 20:05:58 +0000] backend      DEBUG    Authentication failed for tom@mds.xyz: failed to map the username to a DN.
[02/Sep/2019 20:05:59 +0000] access       WARNING  192.168.0.76 -anon- – "POST /hue/accounts/login HTTP/1.1" (mem: 132mb)– Failed login for user: tom@mds.xyz

Debugging a little further reveals:

[root@cm-r01en01 logs]# LDAPTLS_CACERT=/etc/ipa/ca.crt   ldapsearch -Y GSSAPI -H ldaps://idmipa03.mws.mds.xyz:636 -D "uid=admin,cn=users,cn=accounts,dc=mws,dc=mds,dc=xyz" -w "<SECRET>" -b "cn=compat,dc=mws,dc=mds,dc=xyz" "(uid=tom@mds.xyz))" -v|grep dn:
ldap_initialize( ldaps://idmipa03.mws.mds.xyz:636/??base )
SASL/GSSAPI authentication started
SASL username: hdfs/cm-r01en01.mws.mds.xyz@MWS.MDS.XYZ
SASL SSF: 256
SASL data security layer installed.
filter: (uid=tom@mds.xyz))
requesting: All userApplication attributes
ldap_search_ext: Bad search filter (-7)
[root@cm-r01en01 logs]#

With a few commands, we quickly figure out the correct mappings:

USER:
LDAPTLS_CACERT=/etc/ipa/ca.crt   ldapsearch -Y GSSAPI -H ldaps://idmipa03.mws.mds.xyz:636 -D "uid=admin,cn=users,cn=accounts,dc=mws,dc=mds,dc=xyz" -w "<SECRET>" -b "cn=compat,dc=mws,dc=mds,dc=xyz" "(&(uid=tom@mds.xyz)(objectClass=posixAccount))" -v

GROUP:
LDAPTLS_CACERT=/etc/ipa/ca.crt   ldapsearch -Y GSSAPI -H ldaps://idmipa03.mws.mds.xyz:636 -D "uid=admin,cn=users,cn=accounts,dc=mws,dc=mds,dc=xyz" -w "<SECRET>" -b "cn=compat,dc=mws,dc=mds,dc=xyz" "(&(cn=cdhadmins)(objectClass=posixGroup))" -v

And we are greeted with a successful login message:

[02/Sep/2019 20:34:19 +0000] middleware   DEBUG    {"username": "tom@mds.xyz", "impersonator": "hue", "eventTime": 1567481659975, "operationText": "Successful login for user: tom@mds.xyz", "service": "hue", "url": "/hue/accounts/login", "allowed": true, "operation": "USER_LOGIN", "ipAddress": "192.168.0.76"}

using our AD DC user!  

Successful Hue IPA Integration

Cheers,
TK

There is a problem processing audits for HIVESERVER2.

Getting this?

There is a problem processing audits for HIVESERVER2.

[02/Sep/2019 12:36:30 +0000] 32165 Audit-Plugin throttling_logger ERROR    (341 skipped) Error occurred when sending entry to server:
[02/Sep/2019 12:36:30 +0000] 32165 Audit-Plugin throttling_logger INFO     (341 skipped) Unable to send data to nav server. Will try again.

Digging further, we see this error as well:

[02/Sep/2019 11:31:55 +0000] 4044 Profile-Plugin navigator_plugin INFO     Pipelines updated for Profile Plugin: set([])
[02/Sep/2019 11:31:55 +0000] 4044 Audit-Plugin navigator_plugin_pipeline INFO     Starting with navigator log None for role HIVESERVER2 and pipeline HiveSentryOnFailureHookTP
[02/Sep/2019 11:31:55 +0000] 4044 Metadata-Plugin navigator_plugin ERROR    Exception caught when trying to refresh Metadata Plugin for conf.cloudera.spark_on_yarn with count 0 pipelines names [].
Traceback (most recent call last):
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/audit/navigator_plugin.py", line 198, in immediate_refresh
    self._recreate_pipelines_for_csd()
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/audit/navigator_plugin.py", line 157, in _recreate_pipelines_for_csd
    existing_logs = [name for name in os.listdir(self.nav_conf.log_dir)
AttributeError: 'NoneType' object has no attribute 'log_dir'
[02/Sep/2019 11:31:55 +0000] 4044 Metadata-Plugin navigator_plugin INFO     Pipelines updated for Metadata Plugin: []

Redeploying the Spark config should solve this:

Execute DeployClusterClientConfig for {yarn,solr,hbase,kafka,hdfs,hive,spark_on_yarn} in parallel.

We can only surmise what happened here: configuration updates were apparently being made while an earlier config deployment was still running, corrupting the setup.  This may not solve it for you, however.  YMMV.  

In all likelihood, your free license has expired.  In that case navigate to Cloudera Management Service then turn off / uncheck the following:

Navigator Audit Server Role Health Test

But that wasn't it either.  Finally, remove the Navigator Audit Server from Cloudera Management Services instances since no valid license exists.

Cheers,
TK

How do I connect to HiveServer2 (HS2) through beeline

How do I connect to HiveServer2 (HS2) through beeline:

beeline> !connect jdbc:hive2://cm-r01en01.mws.mds.xyz:10000/default;principal=hive/cm-r01en01.mws.mds.xyz@MWS.MDS.XYZ
Connecting to jdbc:hive2://cm-r01en01.mws.mds.xyz:10000/default;principal=hive/cm-r01en01.mws.mds.xyz@MWS.MDS.XYZ
Connected to: Apache Hive (version 2.1.1-cdh6.3.0)
Driver: Hive JDBC (version 2.1.1-cdh6.3.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://cm-r01en01.mws.mds.xyz:10000/>
0: jdbc:hive2://cm-r01en01.mws.mds.xyz:10000/>
0: jdbc:hive2://cm-r01en01.mws.mds.xyz:10000/> show tables;
INFO  : Compiling command(queryId=hive_20190902102937_4ef97c5b-19ff-47b8-be81-dacb2edeece0): show tables
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hive_20190902102937_4ef97c5b-19ff-47b8-be81-dacb2edeece0); Time taken: 2.67 seconds
INFO  : Executing command(queryId=hive_20190902102937_4ef97c5b-19ff-47b8-be81-dacb2edeece0): show tables
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20190902102937_4ef97c5b-19ff-47b8-be81-dacb2edeece0); Time taken: 1.333 seconds
INFO  : OK
+-----------+
| tab_name  |
+-----------+
+-----------+
No rows selected (5.048 seconds)
0: jdbc:hive2://cm-r01en01.mws.mds.xyz:10000/> show databases;
INFO  : Compiling command(queryId=hive_20190902103034_18cb927f-0ab7-4a2d-b311-206b6ebb2cc2): show databases
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hive_20190902103034_18cb927f-0ab7-4a2d-b311-206b6ebb2cc2); Time taken: 0.059 seconds
INFO  : Executing command(queryId=hive_20190902103034_18cb927f-0ab7-4a2d-b311-206b6ebb2cc2): show databases
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20190902103034_18cb927f-0ab7-4a2d-b311-206b6ebb2cc2); Time taken: 0.039 seconds
INFO  : OK
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
1 row selected (0.362 seconds)
0: jdbc:hive2://cm-r01en01.mws.mds.xyz:10000/>
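The same connection can also be made non-interactively; a sketch using the same host and principal, with -u taking the JDBC URL and -e a statement to run:

```shell
beeline -u "jdbc:hive2://cm-r01en01.mws.mds.xyz:10000/default;principal=hive/cm-r01en01.mws.mds.xyz@MWS.MDS.XYZ" \
        -e "show databases;"
```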

 

Cheers,
TK

Required executor memory (1024), overhead (384 MB), and PySpark memory (0 MB) is above the max threshold (1024 MB) of this cluster!

Getting this?

java.lang.IllegalArgumentException: Required executor memory (1024), overhead (384 MB), and PySpark memory (0 MB) is above the max threshold (1024 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.

Either increase the Container Memory:

yarn.nodemanager.resource.memory-mb to 2GB from 1GB

Or reduce the maximum container memory:

yarn.scheduler.maximum-allocation-mb from 1GB to 512MB

in the YARN configuration settings.  However, you may get this error:

Service ResourceManager failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Invalid resource scheduler memory allocation configuration: yarn.scheduler.minimum-allocation-mb=1024, yarn.scheduler.maximum-allocation-mb=512.  Both values must be greater than or equal to 0and the maximum allocation value must be greater than or equal tothe minimum allocation value.

So we can try setting the minimum container memory to half the max:

yarn.scheduler.minimum-allocation-mb from 1G to 256MB

But that didn't work.  Ultimately setting this to 2GB worked:

yarn.scheduler.maximum-allocation-mb
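The arithmetic behind the error explains why: Spark asks YARN for the executor memory plus an overhead of max(384 MB, 10% of executor memory), so a 1024 MB executor actually requests 1408 MB, which exceeds a 1024 MB maximum allocation. A quick check:

```shell
# Spark-on-YARN container request = executor memory + overhead,
# where overhead = max(384 MB, 10% of executor memory).
exec_mb=1024
tenth=$(( exec_mb / 10 ))
overhead_mb=$(( tenth > 384 ? tenth : 384 ))
echo "container request: $(( exec_mb + overhead_mb )) MB"
# → container request: 1408 MB  (over a 1024 MB max-allocation, hence 2GB works)
```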

Cheers,
TK

 

Cloudera Clusters: Running kinit on a Kerberos principal returns a 1969 date.

Are you getting this result?

[root@cm-r01en01 process]# systemctl restart cloudera-scm-agent
[root@cm-r01en01 process]#
[root@cm-r01en01 process]#
[root@cm-r01en01 process]#
[root@cm-r01en01 process]# kinit -kt ./1401-hdfs-NFSGATEWAY/hdfs.keytab hdfs/cm-r01en01.mws.mds.xyz@MWS.MDS.XYZ
[root@cm-r01en01 process]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: host/cm-r01en01.mws.mds.xyz@MWS.MDS.XYZ

Valid starting       Expires              Service principal
12/31/1969 19:00:00  12/31/1969 19:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
[root@cm-r01en01 process]#

 

Solve it by restarting nfs-ganesha on the NFS server!   I know what you're saying but keep reading.

Turns out our NFS server was stuck on one of the three cluster nodes making up our NFS cluster.  This affected the cloudera-scm-agent (CMA): it couldn't properly communicate with or report back to the CMS (Cloudera Manager Server).  The CMA performs filesystem checks, and it got stuck trying to read the NFS mount.  

Restarting nfs-ganesha allowed NFS reads again.  The CMA could then do its FS checks and report back.  Renewing the ticket after that fixed the above issue:

[root@cm-r01en01 process]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hdfs/cm-r01en01.mws.mds.xyz@MWS.MDS.XYZ

Valid starting       Expires              Service principal
08/31/2019 07:30:14  09/01/2019 07:30:14  krbtgt/MWS.MDS.XYZ@MWS.MDS.XYZ
        renew until 09/07/2019 07:30:14
[root@cm-r01en01 process]#
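A quick way to spot a wedged NFS mount from a client is to bound a directory read with a timeout; a sketch, where the mount point is an example path you should substitute with your own:

```shell
# If the listing doesn't return within 5 seconds (or fails), the mount is suspect.
MOUNT=${MOUNT:-/n/mds.xyz}
if ! timeout 5 ls "$MOUNT" >/dev/null 2>&1; then
  echo "NFS mount $MOUNT unreadable; consider restarting nfs-ganesha on the server"
fi
```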

Cheers,
TK

org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]

Getting this?

[root@cm-r01en01 ~]# hdfs dfs -ls /
19/08/25 22:43:19 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
ls: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "cm-r01en01.mws.mds.xyz/192.168.0.140"; destination host is: "cm-r01nn02.mws.mds.xyz":8020;
[root@cm-r01en01 ~]#

Fix it by doing the following:

  • Ensure the following setting is commented out and following two settings exist in the krb5.conf:

    # default_ccache_name = KEYRING:persistent:%{uid}

    renew_lifetime = 7d
    forwardable = true

  • Stop the cluster and CM.

  • Regenerate the Cluster Kerberos Credentials in Administration -> Security.

  • Start CM and CDH services.

  • Try the procedure again.
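The krb5.conf edit from the first bullet, shown in context; a sketch of the relevant part of /etc/krb5.conf:

```
[libdefaults]
  # default_ccache_name = KEYRING:persistent:%{uid}
  renew_lifetime = 7d
  forwardable = true
```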

Trying the operation again:

[root@cm-r01en01 ~]# ls -altri /var/run/cloudera-scm-agent/process/*hdfs*/hdfs.keytab
   40996 -rw-------. 1 hdfs hdfs 534 Aug 25 01:58 /var/run/cloudera-scm-agent/process/1016-hdfs-NFSGATEWAY/hdfs.keytab
   57096 -rw-------. 1 hdfs hdfs 534 Aug 25 02:01 /var/run/cloudera-scm-agent/process/1089-hdfs-NFSGATEWAY/hdfs.keytab
17388393 -rw-------. 1 hdfs hdfs 534 Aug 25 08:48 /var/run/cloudera-scm-agent/process/1174-hdfs-NFSGATEWAY/hdfs.keytab
17814727 -rw-------. 1 hdfs hdfs 534 Aug 25 20:16 /var/run/cloudera-scm-agent/process/1249-hdfs-NFSGATEWAY/hdfs.keytab
17871689 -rw-------. 1 hdfs hdfs 534 Aug 25 21:29 /var/run/cloudera-scm-agent/process/1329-hdfs-NFSGATEWAY/hdfs.keytab
[root@cm-r01en01 ~]# kinit -kt /var/run/cloudera-scm-agent/process/1329-hdfs-NFSGATEWAY/hdfs.keytab hdfs/cm-r01en01.mws.mds.xyz@MWS.MDS.XYZ
[root@cm-r01en01 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hdfs/cm-r01en01.mws.mds.xyz@MWS.MDS.XYZ

Valid starting       Expires              Service principal
08/25/2019 22:44:06  08/26/2019 22:44:06  krbtgt/MWS.MDS.XYZ@MWS.MDS.XYZ
        renew until 09/01/2019 22:44:06
[root@cm-r01en01 ~]#
[root@cm-r01en01 ~]#
[root@cm-r01en01 ~]#
[root@cm-r01en01 ~]# hdfs dfs -ls /
Found 4 items
drwxr-xr-x   – hbase hbase               0 2019-08-25 21:30 /hbase
drwxrwxr-x   – solr  solr                0 2019-08-13 00:41 /solr
drwxrwxrwt   – hdfs  supergroup          0 2019-08-17 21:28 /tmp
drwxr-xr-x   – hdfs  supergroup          0 2019-08-17 22:38 /user
[root@cm-r01en01 ~]# klist -fe
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hdfs/cm-r01en01.mws.mds.xyz@MWS.MDS.XYZ

Valid starting       Expires              Service principal
08/25/2019 22:44:06  08/26/2019 22:44:06  krbtgt/MWS.MDS.XYZ@MWS.MDS.XYZ
        renew until 09/01/2019 22:44:06, Flags: FRIA
        Etype (skey, tkt): aes256-cts-hmac-sha1-96, aes256-cts-hmac-sha1-96
[root@cm-r01en01 ~]#

Note the flags above: FRIA.  The R stands for renewable, a requirement for Cloudera.

UNVERIFIED

An alternate solution could be to set the following to privacy; however, we never tried it.  

hadoop.rpc.protection
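Unverified by us, but that setting would go in core-site.xml (or the equivalent CM safety valve) along these lines:

```xml
<property>
  <name>hadoop.rpc.protection</name>
  <!-- authentication | integrity | privacy -->
  <value>privacy</value>
</property>
```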

Cheers,
TK

The readiness of the Impala Daemon to process queries is not known.

Getting this from the Impala Daemon?

The readiness of the Impala Daemon to process queries is not known.

Investigation reveals:

Aug 25 09:16:39 cm-r01wn01 kernel: FINAL_REJECT: IN=eth0 OUT= MAC=00:50:56:86:7e:b7:00:50:56:86:79:2a:08:00 SRC=192.168.0.132 DST=192.168.0.160 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=6227 DF PROTO=TCP SPT=59318 DPT=23000 WINDOW=29200 RES=0x00 SYN URGP=0

Solve it by adding the port to the firewall configuration and distributing it:

[root@awx01 ansible]# grep -Ei 23000 adhoc/public.xml
  <port protocol="tcp" port="23000"/>
  <port protocol="udp" port="23000"/>
[root@awx01 ansible]# ansible cm* -i infra -m copy -a 'src=adhoc/public.xml dest=/etc/firewalld/zones/public.xml'
[root@awx01 ansible]# cd /ansible && ansible 'cm*' -m shell -a 'systemctl restart firewalld';
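If you'd rather not template the zone file, the equivalent one-off commands with firewall-cmd, run on each cluster node (assuming the default public zone, as above):

```shell
firewall-cmd --permanent --zone=public --add-port=23000/tcp
firewall-cmd --permanent --zone=public --add-port=23000/udp
firewall-cmd --reload
```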

Cheers,
TK


     
  Copyright © 2003 - 2013 Tom Kacperski (microdevsys.com). All rights reserved.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License