Hadoop Cluster Setup, SSH Key Authentication
So you have spent your time in pseudo-distributed mode and have finally started moving to your own cluster? Perhaps you just jumped right into the cluster setup? In any case, a distributed Hadoop cluster setup requires your “master” node [name node & job tracker] to be able to SSH (without requiring a password, so key based authentication) to all of the other “slave” nodes (e.g. data nodes).
Key based authentication is required so that the master node can log in to the slave nodes (and the secondary name node) to start/stop them, etc. It also needs to be set up for the secondary name node (which is listed in your masters file), presuming it is running on another machine (which is a VERY good idea for a production cluster), so that it can be started from your name node with ./start-dfs.sh, just as the task trackers are started from your job tracker node with ./start-mapred.sh.
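For reference, the masters and slaves files in your Hadoop conf directory are just plain lists of hostnames, one per line. The hostnames below are only placeholders to illustrate the layout; use your own machines.
$ cat conf/masters
secondarynamenode1
$ cat conf/slaves
slave1
slave2
slave3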
Make sure you are the hadoop user for all of these commands. If you have not yet installed Hadoop and/or created the hadoop user you should do that first. Depending on your distribution (please follow its directions for setup) this will be slightly different (e.g. Cloudera creates the hadoop user for you when going through the RPM install).
First from your “master” node check that you can ssh to the localhost without a passphrase:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
On your master node try to ssh again (as the hadoop user) to your localhost; if you are still getting a password prompt, fix the permissions and ownership:
$ chmod go-w $HOME $HOME/.ssh
$ chmod 600 $HOME/.ssh/authorized_keys
$ chown `whoami` $HOME/.ssh/authorized_keys
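If you want to double check what those permissions and ownership ended up as, a quick look (output will vary by machine) is:
$ ls -ld $HOME $HOME/.ssh $HOME/.ssh/authorized_keys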
Now you need to copy your public key (however you want to do this, please go ahead) to all of your “slave” machines (don’t forget your secondary name node). It is possible (depending on whether these are new machines) that the slave’s hadoop user does not have a .ssh directory yet; if not, you should create it ($ mkdir ~/.ssh).
$ scp ~/.ssh/id_dsa.pub slave1:~/.ssh/master.pub
Now log in (as the hadoop user) to your slave machine. While on the slave machine, add your master machine’s hadoop user’s public key to the slave machine’s hadoop authorized key store.
$ cat ~/.ssh/master.pub >> ~/.ssh/authorized_keys
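As an aside, if your machines have ssh-copy-id available, it can replace the scp and cat steps above in one go (it will even create the slave’s ~/.ssh directory for you); slave1 is just the example hostname used here:
$ ssh-copy-id -i ~/.ssh/id_dsa.pub slave1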
Now, from the master node try to ssh to the slave.
$ ssh slave1
If you are still prompted for a password (which is quite likely), it is very often just a simple permission issue. Go back to your slave node and, as the hadoop user, run:
$ chmod go-w $HOME $HOME/.ssh
$ chmod 600 $HOME/.ssh/authorized_keys
$ chown `whoami` $HOME/.ssh/authorized_keys
Try again from your master node.
$ ssh slave1
And you should be good to go. Repeat for all Hadoop Cluster Nodes.
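One quick way to confirm everything (just a sketch, assuming the conf/masters and conf/slaves files shown earlier and that you run it from your Hadoop install directory) is to loop over all of the hostnames with BatchMode turned on, which makes ssh fail with a permission error instead of prompting for a password:
$ for host in `cat conf/masters conf/slaves`; do ssh -o BatchMode=yes $host hostname; done
Each node that is set up correctly prints its hostname back; any node that still needs attention shows up as a “Permission denied” instead.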
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/
I’m being asked to set up Hadoop in an environment where key-based SSH logins aren’t allowed for application accounts for audit reasons. Is there any way around this requirement?
If you have not already, you should post to the user group: http://hadoop.apache.org/mailing_lists.html