Hadoop, HDFS and Hbase on Ubuntu & Macintosh Leopard

UPDATE: This has been replaced by a newer post, Experience installing Hbase 0.20.0 Cluster on Ubuntu 9.04 and EC2. I found that using the pre-built distributions of Hadoop and HBase works much better than trying to build from source; I need more Java/Ant-fu to do the build from scratch. The HBase-0.20.0 Release Candidates are really great, and it seems easier to get a cluster going with them than with previous releases.

Introduction

Hadoop and Map/Reduce are all the rage nowadays, so we figured we should be using them too.

Hbase is an implementation of Google’s Bigtable. It’s built on top of the Hadoop Distributed File System (HDFS).

It’s trivial to install as a standalone on top of a local filesystem, but I had some difficulty getting it working on top of HDFS in the “Pseudo-Distributed” mode.

Follow the Instructions

I set up Hadoop with no problems following the instructions on the Hadoop site for Pseudo-Distributed Operation, a mode that lets Hbase run on top of HDFS while everything runs on one server (i.e., it’s configured pretty much like a cluster, but all the pieces are on the same server). Another helpful set of instructions is at Running Hadoop On Ubuntu Linux (Single-Node Cluster).

I also followed the HBase installation instructions for Pseudo-Distributed Operation.

A few things to be aware of:

  • Make sure that the Hadoop and Hbase major version numbers are the same
    (I used Hadoop 0.18.2 and Hbase 0.18.1, which are both 0.18 releases)
  • Make sure that the Hadoop and Hbase trees, as well as the directories and files that hold the HDFS filesystem, are owned by hadoop:hadoop (you have to create the user and group)
  • There is no need to disable IPv6, as some sites suggest
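
The version rule in the first bullet can be checked mechanically. The following is a minimal sketch, assuming "major version" means the first two components of the version string (as the 0.18.2 / 0.18.1 pairing suggests); the function names are made up for this example.

```shell
# Print the "major.minor" prefix of a version string, e.g. 0.18.2 -> 0.18
major_minor() {
    echo "$1" | cut -d. -f1,2
}

# Succeed (exit 0) when two versions share the same major.minor prefix.
versions_compatible() {
    [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}

# Example with the versions used in this post:
if versions_compatible 0.18.2 0.18.1; then
    echo "compatible"
else
    echo "mismatch"
fi
```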

You can download the Hadoop tar file from http://www.apache.org/dyn/closer.cgi/hadoop/core/ and the Hbase tar file from http://www.apache.org/dyn/closer.cgi/hadoop/hbase/
They are also available as git repositories via:

git clone git://git.apache.org/hadoop.git
git clone git://git.apache.org/hbase.git

You can track a particular release with the following commands (we’re stuck at hadoop 0.19.1 / hbase 0.19.0):

cd hadoop
git branch --track release-0.19.1 origin/tags/release-0.19.1
git checkout release-0.19.1
cd ../hbase
git branch --track 0.19.0 origin/tags/0.19.0
git checkout 0.19.0

Then build in each directory. As far as I can tell you just need the default ant build, but you can also build the jar explicitly:

cd ../hadoop
ant
ant jar
cd ../hbase
ant
ant jar

Biggest Problem I Had

The thing that took the longest to get right was accessing Hbase from other hosts. You would think you could put the DNS Fully Qualified Domain Name (FQDN) in the config file. It turns out that, by default, the Hadoop tools don’t seem to use the host’s DNS resolver and only use what is in /etc/hosts (as far as I can tell). So you have to use the IP address in the config file.

I believe there are ways to configure around this, but I haven’t found one yet.
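
Since /etc/hosts appears to be what the tools consult, one thing worth checking is which address that file assigns to the machine's own name. The following is a rough sketch, not a fix: the function names are invented for this example, and it only recognizes IPv4 loopback addresses (127.x.x.x).

```shell
# Print the addresses that a hosts-format file assigns to a host name.
# $1 = host name to look for, $2 = hosts file (defaults to /etc/hosts)
hosts_entries_for() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                 # drop comments
        { for (i = 2; i <= NF; i++)        # fields 2..NF are names/aliases
              if ($i == name) print $1 }   # field 1 is the address
    ' "${2:-/etc/hosts}"
}

# Succeed when any of those addresses is an IPv4 loopback address.
maps_to_loopback() {
    hosts_entries_for "$1" "$2" | grep -q '^127\.'
}

# Usage (commented out so that pasting the block has no side effects):
# maps_to_loopback "$(hostname)" && echo "hostname maps to loopback in /etc/hosts"
```

If the host name does map to a loopback address there, remote clients that are handed that address will not be able to connect, which would be consistent with the symptom described above.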

Configuration Examples

File System Layout

I untarred the distributions into /usr/local/pkgs, made symbolic links to them at /usr/local/hadoop and /usr/local/hbase, and created the directory that Hadoop/HDFS will use for storage. The hadoop user and group have to exist first.

For Ubuntu:

sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop

For Mac:

Create a home directory:

mkdir /Users/_hadoop

Find an unused groupid by seeing what ids are already in use:

sudo dscl . -list /Groups PrimaryGroupID | cut -c 32-34 | sort -rn

Then find an unused userid by seeing what userids are in use:

sudo dscl . -list /Users UniqueID | cut -c 20-22 | sort -rn
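
The two listings can also be fed to a small helper that picks the first free number. This is only a sketch: first_free_id is a hypothetical helper, the starting value of 400 is an arbitrary assumption, and the dscl pipeline in the comment takes the last column with awk rather than relying on fixed character positions.

```shell
# Read numeric IDs (one per line) on stdin and print the smallest ID
# that is >= $1 and does not appear in the list.
first_free_id() {
    sort -n | uniq | awk -v want="$1" '
        $1 + 0 == want { want++ }   # ID taken: try the next one
        END { print want }
    '
}

# Usage on the Mac (commented out; combines group and user IDs so the
# result is free in both lists):
# { dscl . -list /Groups PrimaryGroupID; dscl . -list /Users UniqueID; } \
#     | awk '{ print $NF }' | first_free_id 400
```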

Pick a number that is in neither list. In our case we will use 402 for both the userid and groupid for _hadoop (Mac OS X puts an underscore in front of daemon user/group names). We will also add hadoop, without the underscore, as an alias via RecordName:

sudo dscl . -create /Groups/_hadoop PrimaryGroupID 402
sudo dscl . -append /Groups/_hadoop RecordName hadoop

Use the same PrimaryGroupID you gave the group, 402 in this case, as the groupid in the following commands:

sudo dscl . -create /Users/_hadoop UniqueID 402
sudo dscl . -create /Users/_hadoop RealName "Hadoop Service"
sudo dscl . -create /Users/_hadoop PrimaryGroupID 402
sudo dscl . -create /Users/_hadoop NFSHomeDirectory /Users/_hadoop
sudo dscl . -append /Users/_hadoop RecordName hadoop

For both Ubuntu and Mac (note that on the Mac the user and group are named _hadoop, with hadoop as an alias):

cd /usr/local/pkgs
tar xzf hadoop-0.18.2.tar.gz
tar xzf hbase-0.18.1.tar.gz

cd ..
ln -s /usr/local/pkgs/hadoop-0.18.2 hadoop
ln -s /usr/local/pkgs/hbase-0.18.1 hbase
mkdir /var/hadoop_datastore
chown -R hadoop:hadoop hadoop/ hbase/ /var/hadoop_datastore /Users/_hadoop
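
After the chown it can be useful to confirm that nothing was missed. A minimal sketch, using a made-up helper name; it prints nothing when everything under the given paths belongs to the expected owner.

```shell
# List everything under the given paths that is NOT owned by the given user.
# $1 = user name or numeric uid, remaining arguments = paths to scan
not_owned_by() {
    owner=$1
    shift
    find "$@" ! -user "$owner" -print
}

# Usage (the trailing slashes make find descend through the symlinks):
# not_owned_by hadoop /usr/local/hadoop/ /usr/local/hbase/ /var/hadoop_datastore
```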

Hadoop Config files

The following are all in /usr/local/hadoop/conf

hadoop-env.sh

You need to set the JAVA_HOME variable. I installed Java 6 via Synaptic. You can also install it with:

apt-get install sun-java6-jdk

The Macintosh is easy if you have an Intel Core 2 Duo (the Intel Core Duo doesn’t count): Apple only supports Java 1.6 on its 64-bit processors. If you have a 32-bit processor, like the first-generation 17″ MacBook Pro or the first-generation Mac mini, or you have a PPC, see Tech Tip: How to Set Up JDK 6 and JavaFX on 32-bit Intel Macs.

So my config is (only the things I changed, the rest was left as is):

...
# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-6-sun
...

For the Macintosh:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/Current

hadoop-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/hadoop_datastore/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
<!-- As per note in http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200810.mbox/<C20126171.post@talk.nabble.com> -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>

<property>
   <name>dfs.datanode.max.xcievers</name>
   <value>1023</value>
</property>
</configuration>

HBase Config Files

The following are all in /usr/local/hbase/conf

hbase-env.sh

Again, just need to set up JAVA_HOME:

...
# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-6-sun
...

For the Macintosh:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/Current

hbase-site.xml

Here is where I wanted to give a FQDN for the host that is the hbase.master, but had to use an IP address instead.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:54310/hbase</value>
    <description>The directory shared by region servers.
    Should be fully-qualified to include the filesystem to use.
    E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
    </description>
  </property>

  <property>
    <name>hbase.master</name>
    <value>192.168.10.50:60000</value>
    <description>The host and port that the HBase master runs at.
    </description>
  </property>
</configuration>
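
In this setup hbase.rootdir and fs.default.name both point at the same NameNode address (hdfs://localhost:54310), so a mismatch between the two files is an easy mistake to make. The following sketch checks for that; the helper names are invented here, and the parsing assumes each <name> and <value> element sits on its own line, as in the files above. It is not a general XML parser.

```shell
# Print the <value> of a named property from a Hadoop-style *-site.xml.
# $1 = property name, $2 = config file
site_property() {
    awk -v name="$1" '
        index($0, "<name>" name "</name>") { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>/, ""); sub(/<\/value>.*/, "")
            print; exit
        }
    ' "$2"
}

# Succeed when hbase.rootdir lives on the file system named by fs.default.name.
# $1 = hadoop-site.xml, $2 = hbase-site.xml
rootdir_matches_fs() {
    fs=$(site_property fs.default.name "$1")
    root=$(site_property hbase.rootdir "$2")
    [ -n "$fs" ] || return 1
    case "$root" in
        "$fs"/*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Usage:
# rootdir_matches_fs /usr/local/hadoop/conf/hadoop-site.xml \
#                    /usr/local/hbase/conf/hbase-site.xml || echo "mismatch" >&2
```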

Formatting the Name Node

You must do this as the same user that will be running the daemon (hadoop). Note the quotes: without them, only the first word after -c is treated as the command.

su hadoop -s /bin/sh -c "/usr/local/hadoop/bin/hadoop namenode -format"

On the Mac:

/usr/bin/su _hadoop /usr/local/hadoop/bin/hadoop namenode -format

Set up passphraseless ssh

Now check that you can ssh to the localhost without a passphrase:

su - hadoop
ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands (as hadoop):

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Ubuntu /etc/init.d style startup scripts

I scoured the InterTubes for example hadoop/hbase startup scripts and found absolutely none! I ended up creating a minimal one that is so far only suited for the Pseudo-Distributed Operation mode as it just calls the start-all / stop-all scripts.

/etc/init.d/hadoop

Create the place it will put its startup logs

mkdir /var/log/hadoop

Create /etc/init.d/hadoop with the following:

#!/bin/sh
### BEGIN INIT INFO
# Provides:          hadoop services
# Required-Start:    $network
# Required-Stop:     $network
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Description:       Hadoop services
# Short-Description: Enable Hadoop services including hdfs
### END INIT INFO
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
HADOOP_BIN=/usr/local/hadoop/bin
NAME=hadoop
DESC=hadoop
USER=hadoop
ROTATE_SUFFIX=
test -x $HADOOP_BIN || exit 0
RETVAL=0
set -e
cd /

start_hadoop () {
    set +e
    su $USER -s /bin/sh -c $HADOOP_BIN/start-all.sh > /var/log/hadoop/startup_log
    case "$?" in
      0)
        echo SUCCESS
        RETVAL=0
        ;;
      1)
        echo TIMEOUT - check /var/log/hadoop/startup_log
        RETVAL=1
        ;;
      *)
        echo FAILED - check /var/log/hadoop/startup_log
        RETVAL=1
        ;;
    esac
    set -e
}

stop_hadoop () {
    set +e
    if [ $RETVAL = 0 ] ; then
        su $USER -s /bin/sh -c $HADOOP_BIN/stop-all.sh > /var/log/hadoop/shutdown_log
        RETVAL=$?
        if [ $RETVAL != 0 ] ; then
            echo FAILED - check /var/log/hadoop/shutdown_log
        fi
    else
        echo No nodes running
        RETVAL=0
    fi
    set -e
}

restart_hadoop() {
    stop_hadoop
    start_hadoop
}

case "$1" in
    start)
        echo -n "Starting $DESC: "
        start_hadoop
        echo "$NAME."
        ;;
    stop)
        echo -n "Stopping $DESC: "
        stop_hadoop
        echo "$NAME."
        ;;
    force-reload|restart)
        echo -n "Restarting $DESC: "
        restart_hadoop
        echo "$NAME."
        ;;
    *)
        echo "Usage: $0 {start|stop|restart|force-reload}" >&2
        RETVAL=1
        ;;
esac
exit $RETVAL

/etc/init.d/hbase

Create the place it will put its startup logs

mkdir /var/log/hbase

Create /etc/init.d/hbase with the following:

#!/bin/sh
### BEGIN INIT INFO
# Provides:          hbase services
# Required-Start:    $network
# Required-Stop:     $network
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Description:       Hbase services
# Short-Description: Enable Hbase services including hdfs
### END INIT INFO

PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
HBASE_BIN=/usr/local/hbase/bin
NAME=hbase
DESC=hbase
USER=hadoop
ROTATE_SUFFIX=
test -x $HBASE_BIN || exit 0
RETVAL=0
set -e
cd /

start_hbase () {
    set +e
    su $USER -s /bin/sh -c $HBASE_BIN/start-hbase.sh > /var/log/hbase/startup_log
    case "$?" in
      0)
        echo SUCCESS
        RETVAL=0
        ;;
      1)
        echo TIMEOUT - check /var/log/hbase/startup_log
        RETVAL=1
        ;;
      *)
        echo FAILED - check /var/log/hbase/startup_log
        RETVAL=1
        ;;
    esac
    set -e
}

stop_hbase () {
    set +e
    if [ $RETVAL = 0 ] ; then
        su $USER -s /bin/sh -c $HBASE_BIN/stop-hbase.sh > /var/log/hbase/shutdown_log
        RETVAL=$?
        if [ $RETVAL != 0 ] ; then
            echo FAILED - check /var/log/hbase/shutdown_log
        fi
    else
        echo No nodes running
        RETVAL=0
    fi
    set -e
}

restart_hbase() {
    stop_hbase
    start_hbase
}

case "$1" in
    start)
        echo -n "Starting $DESC: "
        start_hbase
        echo "$NAME."
        ;;
    stop)
        echo -n "Stopping $DESC: "
        stop_hbase
        echo "$NAME."
        ;;
    force-reload|restart)
        echo -n "Restarting $DESC: "
        restart_hbase
        echo "$NAME."
        ;;
    *)
        echo "Usage: $0 {start|stop|restart|force-reload}" >&2
        RETVAL=1
        ;;
esac
exit $RETVAL

Set up the init system

This assumes you put the above init files in /etc/init.d

chmod +x /etc/init.d/{hbase,hadoop}
update-rc.d hadoop defaults
update-rc.d hbase defaults 25

You can now start / stop hadoop by saying:

/etc/init.d/hadoop start
/etc/init.d/hadoop stop

And similarly with hbase

/etc/init.d/hbase start
/etc/init.d/hbase stop

Make sure you start hadoop before hbase, and stop hbase before you stop hadoop.
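
The ordering rule can be captured in a pair of wrapper functions so it does not have to be remembered. This is a sketch only: stack_start, stack_stop and the INIT_DIR variable are names invented for this example, and INIT_DIR exists just so the functions can be tried against stand-in scripts.

```shell
# Directory holding the init scripts (overridable for experimentation).
INIT_DIR=${INIT_DIR:-/etc/init.d}

# Start Hadoop first, then HBase; skip HBase if Hadoop fails to start.
stack_start() {
    "$INIT_DIR/hadoop" start && "$INIT_DIR/hbase" start
}

# Stop in the reverse order: HBase first, then Hadoop.
stack_stop() {
    "$INIT_DIR/hbase" stop
    "$INIT_DIR/hadoop" stop
}
```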

Macintosh launchd style startup

Starting processes on Macintosh Leopard is pretty easy with launchd/launchctl.

For hadoop, create a file /Library/LaunchAgents/com.yourdomain.hadoop.plist with the following content (replace yourdomain with the domain you want to use for this class of apps):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>GroupName</key>
    <string>_hadoop</string>
    <key>KeepAlive</key>
    <true/>
    <key>Label</key>
    <string>com.yourdomain.hadoop</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/hadoop/bin/start-all.sh</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>ServiceDescription</key>
    <string>Hadoop Process</string>
    <key>UserName</key>
    <string>_hadoop</string>
</dict>
</plist>

And for hbase, /Library/LaunchAgents/com.yourdomain.hbase.plist:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>GroupName</key>
	<string>_hadoop</string>
	<key>KeepAlive</key>
	<true/>
	<key>Label</key>
	<string>com.yourdomain.hbase</string>
	<key>ProgramArguments</key>
	<array>
		<string>/usr/local/hbase/bin/start-hbase.sh</string>
	</array>
	<key>RunAtLoad</key>
	<true/>
	<key>UserName</key>
	<string>_hadoop</string>
</dict>
</plist>

Set the owner to root and the mode to 644:

chown root /Library/LaunchAgents/com.yourdomain.hadoop.plist /Library/LaunchAgents/com.yourdomain.hbase.plist
chmod 644 /Library/LaunchAgents/com.yourdomain.hadoop.plist /Library/LaunchAgents/com.yourdomain.hbase.plist

The next time you restart, it should start hbase and hadoop. You can also start them manually with the commands:

sudo launchctl load /Library/LaunchAgents/com.yourdomain.hadoop.plist
sudo launchctl load /Library/LaunchAgents/com.yourdomain.hbase.plist

Conclusion

You should now be able to see the HBase web interface at http://<your domain name>:60010

If you have problems, check /var/log/{hbase,hadoop}/startup_log as well as /usr/local/hadoop/logs/hadoop-hadoop-namenode-yourhostname.log and /usr/local/hbase/logs/hbase-hadoop-master-yourhostname.log.

The error messages are pretty poor (i.e., useless, as far as I could tell, when tracking down the FQDN/IP address problem), but they are better than nothing.

I will post an update when I deploy a Full Cluster.


8 comments to Hadoop, HDFS and Hbase on Ubuntu & Macintosh Leopard

  • Charlie M

    Thanks for the post; it was very useful. I had the same issue. It doesn’t seem to like IP addresses in the root path either.

    It does definitely do full DNS look up as I resorted to adding a sub-domain on our public DNS server for it. It returns a private IP so it will still only be available internally.

  • Yeah, I don’t think I fully understand what’s going on with it. It’s even worse when I try to deploy it to Amazon EC2, where it resolves the DNS to the NAT’d local address even if you specify the DNS FQDN of the public IP address!

    When I figure it out, I’ll publish it here and make a comment. I think you get an automatic notification…

  • Added info on getting hadoop and hbase via git

  • Minor addition on how to build using ant if you installed from git. I’m still having a bit of trouble understanding what the right way is to build from source and then use though. Should I really be doing ant package and then use the package?

  • Mathiasdm

    It’s a very useful post indeed, thanks!

    Robert, I think I might have a solution for your DNS problem.
    Ubuntu, by default, adds a line ’127.0.1.1 somehostname’ to your /etc/hosts.
    Remove this line, and your problems might just be fixed.

  • Windows has vpn capabilities built into the software. You will need a dynamic DNS acount. http://www.dyndns.org I have included a link with step by step instructions.

  • Bayu (Hbase error)

    Thanks for the post. I have finished installing hadoop & hbase, but I have a problem when using hbase. I created a database named fire, and that succeeded. But when I try to create a database with the same name again, I get an error and I don’t know where the fault is.

    hbase(main):003:0> create "fire", "firework"
    0 row(s) in 1.1460 seconds
    hbase(main):004:0> create "fire", "one"
    NativeException: org.apache.hadoop.hbase.TableExistsException: org.apache.hadoop.hbase.TableExistsException: fire
    at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:798)
    at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:762)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

  • swati verma

    I am new to HBASE, and while trying to install the same on Ubuntu system, I am facing some problem.

    Below is the error log from Zookeeper log file

    2014-01-18 06:10:51,392 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
    EndOfStreamException: Unable to read additional data from client sessionid 0x143a5b052980000, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:744)
    2014-01-18 06:10:51,394 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:56671 which had sessionid 0x143a5b052980000

    Below is error log from master log:

    2014-01-18 06:10:51,381 INFO org.apache.zookeeper.ZooKeeper: Session: 0x143a5b052980000 closed
    2014-01-18 06:10:51,381 INFO org.apache.hadoop.hbase.master.HMaster: HMaster main thread exiting
    2014-01-18 06:10:51,381 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
    java.lang.RuntimeException: HMaster Aborted
        at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:160)
        at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:104)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2120)

    Please note, I am able to start Hbase successfully: after starting Hbase, I can see HMaster running using the jps command. But as soon as I try to go to the Hbase shell, this issue arises, and when I then run jps, HMaster is no longer in the list.

    Please help me with this issue. I have been trying to solve it myself for days, but no luck.
