Install Hadoop on Ubuntu 12.04

The prerequisites for learning Hadoop are a good knowledge of Java and the basics of Linux commands.

Many of us face problems while installing Hadoop; I was one of them. Here is a step-by-step procedure to get an error-free Hadoop environment on Ubuntu 12.04.

Install Java in your system:

  • Download the Oracle JDK from the Oracle download page (I am using jdk-7u25-linux-i586.tar.gz).
  • Open a Terminal (Ctrl+Alt+T) and go to the folder where you downloaded Java.
  • Extract the archive; you will get a folder like jdk1.7.0_xx (in my case jdk1.7.0_25). Use the following command to extract it:

                     tar -xvf jdk-7u25-linux-i586.tar.gz

  • Make a directory /usr/lib/jvm

                     sudo mkdir -p /usr/lib/jvm

  • Move the extracted Java folder to the directory created above

                     sudo mv jdk1.7.0_25 /usr/lib/jvm

  • Register java, javac, javaws and jps with Ubuntu's alternatives system

                     sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.7.0_25/bin/java" 1

                     sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.7.0_25/bin/javac" 1

                     sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.7.0_25/bin/javaws" 1

                     sudo update-alternatives --install "/usr/bin/jps" "jps" "/usr/lib/jvm/jdk1.7.0_25/bin/jps" 1

Java is now successfully configured on the system.
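
To double-check, you can ask the newly registered binaries for their versions; both commands below should report 1.7.0_25 (adjust for the JDK build you downloaded).

                     java -version

                     javac -version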

Prerequisites for Hadoop Installation:

  • A Hadoop cluster manages its nodes over SSH, so we install SSH and create password-less sessions.

                     sudo apt-get install ssh

  • Add a dedicated user and group for Hadoop-related operations. Let's use a user called hduser and a group called hadoop; you can use any other names in their place.

                     sudo addgroup hadoop                    # creates the hadoop group

                     sudo adduser --ingroup hadoop hduser    # creates hduser inside the hadoop group

  • Log in to the hduser account in the Terminal. It asks for a password; give the one you set for hduser in the step above.

                     su hduser

  • To create a password-less SSH connection, create an RSA key for hduser. When it asks for the "file in which to save the key", just hit Enter.

                     ssh-keygen -t rsa -P ""

  • Enable SSH access to your local machine with the newly created key

                     cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
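
  • (Optional) If SSH still prompts for a password later on, sshd may be rejecting the key because of loose permissions; tightening them as below usually fixes it. This is a precaution and is not always required.

                     chmod 700 $HOME/.ssh

                     chmod 600 $HOME/.ssh/authorized_keys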

  • Let's test the SSH setup by connecting to the local machine as hduser. The first time, it asks for confirmation; type yes to continue.

                     ssh localhost
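
  • Once the login succeeds without a password prompt, you are in a new shell on the same machine; the command below closes the test session and returns you to your previous shell.

                     exit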

Now we are set to install Hadoop; if you have reached this point, the rest of the configuration is straightforward.

Hadoop Installation:

  • Make sure to exit from the hduser account, because from now on we need sudo access, which is nothing but admin rights. You can enable the root account by typing "sudo passwd" in the Terminal, which asks you to set a password; once it is set, you can log in as root with "su root".
  • Download Hadoop from the Apache site; use a stable version (in my case, hadoop-1.1.2).
  • Extract the archive, move it to the /usr/local/ folder, and make hduser the owner of the folder and everything in it.

                     tar -xvf hadoop-1.1.2.tar.gz

                     sudo mv hadoop-1.1.2 /usr/local/

                     sudo chown -R hduser:hadoop /usr/local/hadoop-1.1.2
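
  • (Optional) A quick sanity check: the listing below should show hduser and hadoop as the owner and group of the Hadoop folder.

                     ls -ld /usr/local/hadoop-1.1.2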

  • Make a temporary directory for the local file system and the Hadoop Distributed File System (HDFS), and make hduser its owner.

                     sudo mkdir -p /app/hadoop/tmp/

                     sudo chown -R hduser:hadoop /app

                     sudo chmod -R 750 /app

  • Now let's begin configuring the Hadoop files.
  • Add the Java and Hadoop folders to hduser's environment via its home folder. This is a very important step; don't skip it. Log in as root ("su root"); every step below is performed as the root user.

                      gedit /home/hduser/.bashrc

  • Copy the text below to the end of the file opened above

                     # Set Hadoop-related environment variables
                     export HADOOP_PREFIX=/usr/local/hadoop-1.1.2
                     export HADOOP_HOME=/usr/local/hadoop-1.1.2

                     # Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
                     export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_25/

                     # Some convenient aliases and functions for running Hadoop-related commands
                     unalias fs &> /dev/null
                     alias fs="hadoop fs"
                     unalias hls &> /dev/null
                     alias hls="fs -ls"

                     # If you have LZO compression enabled in your Hadoop cluster and
                     # compress job outputs with LZOP (not covered in this tutorial):
                     # Conveniently inspect an LZOP compressed file from the command
                     # line; run via:
                     #
                     # $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
                     #
                     # Requires the 'lzop' command to be installed.
                     #
                     lzohead () {
                         hadoop fs -cat $1 | lzop -dc | head -1000 | less
                     }

                     # Add Hadoop bin/ directory to PATH
                     export PATH=$PATH:$HADOOP_PREFIX/bin
                     export PATH=$PATH:$JAVA_HOME/bin
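
  • The new variables take effect at hduser's next login; to apply them to a current hduser shell right away, reload the file.

                      source /home/hduser/.bashrc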

  • We need to configure the JAVA_HOME path in the hadoop-env.sh file, and also to disable IPv6 for Hadoop, because Hadoop has various network-related issues when it tries to bind to 0.0.0.0 on IPv6-enabled systems.

                      cd /usr/local/hadoop-1.1.2/conf/

                      gedit hadoop-env.sh

  • In the above file, search for HADOOP_OPTS and replace the whole line with the one below to disable IPv6

                      export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

  • Now search for JAVA_HOME and replace the whole line with the one below to configure the Java home path for Hadoop's use.

                      export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_25/

  • Save and exit this file
  • Disable IPv6 system-wide by editing the following file

                      gedit /etc/sysctl.conf

  • Append the text below to the file opened above

                     # disable ipv6
                     net.ipv6.conf.all.disable_ipv6 = 1
                     net.ipv6.conf.default.disable_ipv6 = 1
                     net.ipv6.conf.lo.disable_ipv6 = 1

Please restart the system for the changes to take effect.
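
  • If you prefer not to reboot right away, you can usually apply the same settings immediately by reloading sysctl (an alternative to restarting).

                      sudo sysctl -p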

  • Check whether IPv6 is disabled. If the command below returns "1", it is disabled.

                      $ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

  • Open the core-site.xml file

                      gedit /usr/local/hadoop-1.1.2/conf/core-site.xml

and add the text below within the <configuration></configuration> tags. After that, save and exit the file.

                     <property>
                         <name>hadoop.tmp.dir</name>
                         <value>/app/hadoop/tmp</value>
                         <description>Base folder for other temporary directories.</description>
                     </property>
                     <property>
                         <name>fs.default.name</name>
                         <value>hdfs://localhost:54310</value>
                         <description>The name of the default file system.</description>
                     </property>

  • Open the mapred-site.xml file

                      gedit /usr/local/hadoop-1.1.2/conf/mapred-site.xml

and add the text below within the <configuration></configuration> tags. After that, save and exit the file.

                     <property>
                         <name>mapred.job.tracker</name>
                         <value>localhost:54311</value>
                         <description>The host and port that the MapReduce job tracker runs at.</description>
                     </property>

  • Open the hdfs-site.xml file

                      gedit /usr/local/hadoop-1.1.2/conf/hdfs-site.xml

and add the text below within the <configuration></configuration> tags. After that, save and exit the file as before.

                     <property>
                         <name>dfs.replication</name>
                         <value>1</value>
                         <description>Default replication for data blocks.</description>
                     </property>

  • It's time to format the HDFS filesystem via the NameNode. This is a one-time activity; you only need to do it the first time you set up a Hadoop cluster.

                      cd /usr/local/hadoop-1.1.2/bin

                      ./hadoop namenode -format

  • You should see a line like the one below somewhere in the output; "successfully formatted" is the part that matters.

31/03/13 12:10:21 INFO common.Storage: Storage directory …/hadoop-hduser/dfs/name has been successfully formatted.

  • Bingo, done!
  • The command below will start the NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker.

                      /usr/local/hadoop-1.1.2/bin/start-all.sh

  • Verify that all services are up. Execute the command below in the Terminal and you should see all of the daemons listed with their PIDs, as in the sample output that follows.

                      jps
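
  • The output should look roughly like the listing below; the PIDs on the left will differ on your machine.

                      2287 NameNode
                      2422 DataNode
                      2569 SecondaryNameNode
                      2654 JobTracker
                      2788 TaskTracker
                      2850 Jps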

  • To stop the Hadoop services, run the command below

                      /usr/local/hadoop-1.1.2/bin/stop-all.sh
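
  • (Optional) While the daemons are up, you can confirm the cluster is healthy by running one of the example jobs bundled with the release; the one below estimates pi with 2 map tasks of 10 samples each. The jar name matches the 1.1.2 release layout and may differ for other versions.

                      cd /usr/local/hadoop-1.1.2

                      bin/hadoop jar hadoop-examples-1.1.2.jar pi 2 10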

Web Interfaces:

Once the daemons are running, Hadoop's built-in web UIs are available at the default ports: the NameNode at http://localhost:50070/, the JobTracker at http://localhost:50030/ and the TaskTracker at http://localhost:50060/.

I got a lot of help from Michael Noll's Hadoop installation tutorial.
