“Big Data is Huge (No Pun Intended)!” By Rizwan Ali

Rizwan Ali is a Senior Data Warehouse Architect with deep experience in database design and big data. As data collection and analysis have become increasingly important to most technology organizations, we asked Rizwan to share some of his expertise. Below he discusses managing and customizing Cloudera Manager, in addition to providing a few tips to save time while working in a Hadoop environment.

Fun Fact: The “My” in MySQL refers to My, the daughter of Michael Widenius, a co-founder of the technology.

Managing Cloudera Manager

Big data is huge (no pun intended)! There are tons of moving pieces: a vast array of technologies such as the Hadoop Distributed File System (HDFS), MapReduce, Hive, HBase, ZooKeeper (and many others) come together in a cluster to deliver the tools and frameworks required to extract rich analytics from our big data. Managing and maintaining a multi-node Hadoop cluster is not an easy task, but tools like Cloudera Manager (CM) and Cloudera’s Distribution Including Apache Hadoop (CDH) make life a lot easier.

When you install Cloudera Manager for the first time, it asks for either the root password or the private keys for the nodes. I highly recommend using private keys to connect to the nodes, because if you later change the root password you will not have to update CM. Second, you should use a configuration manager such as Puppet, Chef or Etch to keep users, groups, and their UIDs and GIDs in sync across all of the nodes, which prevents CM from complaining about mismatched users and groups.
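To illustrate the kind of mismatch a configuration manager is meant to prevent, here is a minimal sketch that compares `/etc/passwd`-style content collected from several nodes and reports users whose UID differs or is missing somewhere. The function names, node names and sample accounts are hypothetical, and how the file contents get collected from each node is left out; this is not part of CM or of any of the tools named above.

```python
def parse_passwd(text):
    """Map each user name to its numeric UID from /etc/passwd-style text."""
    uids = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) >= 3:
            # Field 0 is the user name, field 2 is the numeric UID.
            uids[fields[0]] = int(fields[2])
    return uids


def find_uid_mismatches(passwd_by_node):
    """Return {user: {node: uid_or_None}} for users that are not identical on every node.

    passwd_by_node maps a node name to the text of that node's passwd file.
    A value of None means the user does not exist on that node.
    """
    parsed = {node: parse_passwd(text) for node, text in passwd_by_node.items()}
    all_users = set()
    for uids in parsed.values():
        all_users.update(uids)

    mismatches = {}
    for user in sorted(all_users):
        per_node = {node: uids.get(user) for node, uids in parsed.items()}
        if len(set(per_node.values())) > 1:
            mismatches[user] = per_node
    return mismatches


if __name__ == "__main__":
    # Hypothetical sample data: the hive user has a different UID on node2.
    sample = {
        "node1": "hdfs:x:496:496::/var/lib/hadoop-hdfs:/bin/bash\n"
                 "hive:x:495:495::/var/lib/hive:/bin/false\n",
        "node2": "hdfs:x:496:496::/var/lib/hadoop-hdfs:/bin/bash\n"
                 "hive:x:501:501::/var/lib/hive:/bin/false\n",
    }
    for user, per_node in find_uid_mismatches(sample).items():
        print(user, per_node)
```

The same approach works for `/etc/group` and GIDs, since the group name and numeric ID sit in the same field positions.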

Starting with CM 4.6, Cloudera included activity monitoring called Cloudera Management Services (CMS), which records activities, stats and events from all nodes. CMS puts a moderate amount of load on the server, so in my experience installing CMS on the secondary name node instead of a data node results in better performance. If you are managing a single cluster, using the embedded PostgreSQL DB for CM’s internal data yields higher performance than using a dedicated MySQL DB. However, MySQL provides greater flexibility when you manage more than one cluster.

I cringe every time I see the dreaded “Bad Health” or exclamation mark next to a node, which you may see frequently if you are using the default rules and thresholds for alerts. I usually set the Data Expiration Period for monitoring services to 72 hours instead of the default 168 hours to prevent the logs from filling up the disks. On development clusters I also disable the HDFS Canary Health Check, which verifies that a client can create, read, write, and delete files. By default, CM sends all alerts to [email protected]; make sure to change it to an appropriate address and configure the SMTP settings on the Alerts Publisher configuration page.

All of the Hadoop nodes constantly communicate with each other. Creating a separate VLAN for the Hadoop cluster isolates that chatter from the rest of the network. CM will complain if it cannot communicate with any of the Hadoop nodes, and this usually happens when it cannot resolve a hostname. An easy solution is to update the hosts file on every node so that it maps each hostname in the cluster to its IP address.
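As a rough sketch of that last step, the snippet below renders hosts-file lines from a mapping of IP addresses to hostnames, so the same block can be pushed to every node. The function name, the addresses and the `hadoop.example.com` hostnames are made up for illustration; distributing the result (for example through your configuration manager) is out of scope here.

```python
import ipaddress


def render_hosts_entries(hosts):
    """Render hosts-file lines ("IP<TAB>fqdn shortname") from {ip: fqdn}.

    Raises ValueError if any key is not a valid IP address.
    """
    # Parsing validates each address and lets us sort numerically
    # rather than as strings (so 10.0.0.2 comes before 10.0.0.12).
    parsed = [(ipaddress.ip_address(ip), fqdn) for ip, fqdn in hosts.items()]
    parsed.sort(key=lambda pair: (pair[0].version, int(pair[0])))

    lines = []
    for ip, fqdn in parsed:
        short = fqdn.split(".")[0]
        # List the short name as an alias only when it differs from the FQDN.
        names = fqdn if short == fqdn else f"{fqdn} {short}"
        lines.append(f"{ip}\t{names}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical cluster layout.
    cluster = {
        "10.0.0.12": "dn2.hadoop.example.com",
        "10.0.0.2": "nn1.hadoop.example.com",
        "10.0.0.11": "dn1.hadoop.example.com",
    }
    print(render_hosts_entries(cluster), end="")
```

Generating the block from one source of truth keeps every node's hosts file identical, which is the point of the fix described above.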

What would you do if you accidentally deleted all the data in your Hadoop cluster? It wouldn’t be a pretty scenario, but luckily Hadoop comes with a Trash feature, which saves deleted files and directories in a special .Trash directory for easy recovery. If enabled, CM configures Hadoop to empty the trash every 24 hours, which is ideal for normal operations, but you can increase the interval depending on the disk capacity of your Hadoop cluster. The settings for Hadoop, Hive, MapReduce and other services are stored in CM, and you can easily add or override any setting by supplying your own XML properties in the Configuration Safety Valve section of the service.
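For a sense of what a safety valve entry looks like, the sketch below builds a Hadoop-style `<property>` element. It uses `fs.trash.interval`, the standard Hadoop property for trash retention, which is expressed in minutes. The helper names are hypothetical, and CM may well expose the trash interval as a regular setting of its own; the property is used here only to show the XML shape you would paste into a safety valve.

```python
import xml.etree.ElementTree as ET


def safety_valve_property(name, value):
    """Return a Hadoop-style <property> XML snippet as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(prop, encoding="unicode")


def trash_interval_property(hours):
    """Build an fs.trash.interval property; Hadoop expects the value in minutes."""
    if hours < 0:
        raise ValueError("hours must not be negative")
    return safety_valve_property("fs.trash.interval", int(hours * 60))


if __name__ == "__main__":
    # Keep deleted files for 48 hours instead of 24.
    print(trash_interval_property(48))
```

Building the snippet with an XML library rather than string formatting ensures that characters such as `&` or `<` in a value are escaped correctly.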

Finally, the HBase service should go on its own set of dedicated nodes instead of the Hadoop data nodes, since HBase requires tons of memory and CPU for processing. The HDFS service is not required to depend on ZooKeeper; however, it is highly recommended that the Hive Metastore service depend on ZooKeeper to prevent corruption of Hive Metastore data in concurrency scenarios. By default, CM makes the Hive Metastore service depend on ZooKeeper, and if the ZooKeeper service is not running, Hive will silently hang while executing any Hive query. Also, if you are not planning to use Impala, you should think about bypassing the Hive Metastore service: Hive clients will then talk directly to the Metastore DB, resulting in faster queries.

Big data might be big, but Cloudera Manager is huge in its own right when it comes to configuration and performance tuning. There are tons of other settings worth discussing that are beyond the scope of this article. Feel free to tweak and test your Cloudera Manager settings to fit your Hadoop environment.

Feel free to contact Rizwan at linkedin.com/in/wrist1ali.

