Hadoop and MapReduce are all the rage nowadays, so we figured we should be using them too.
HBase is an implementation of Google’s Bigtable. It’s built on top of the Hadoop Distributed File System (HDFS).
It’s trivial to install as a standalone instance on top of a local filesystem, but I had some difficulty getting it working on top of HDFS in “Pseudo-Distributed” mode.
Follow the Instructions
I set up Hadoop with no problems by following the instructions on the Hadoop site for Pseudo-Distributed Operation. In this mode HBase runs on top of HDFS, but everything runs on one server (i.e. it’s configured pretty much like a cluster, but all the pieces are on the same server). Another helpful set of instructions is at Running Hadoop On Ubuntu Linux (Single-Node Cluster).
I also followed the HBase installation instructions for Pseudo-Distributed Operation.
A few things to be aware of:
- Make sure that the Hadoop and HBase major version numbers are the same
(I used Hadoop 0.18.2 and Hbase 0.18.1)
- Make sure that the Hadoop and HBase trees, as well as the directories and files that hold the HDFS filesystem, are owned by hadoop:hadoop (you have to create the user and group)
- There is no need to disable IPv6, as some sites said
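The version rule in the list above can be checked with a small script. This is only a sketch: it assumes that "major version" means the first two components of the version string (0.18 in 0.18.2), which matches the versions I used, and the function names are made up for illustration.

```shell
#!/bin/sh
# Reduce a version such as 0.18.2 to its first two components (0.18).
# Assumption: this "release line" is what has to match between the two.
release_line() {
    echo "$1" | cut -d. -f1,2
}

# Succeed only when both versions share the same release line.
same_release_line() {
    [ "$(release_line "$1")" = "$(release_line "$2")" ]
}

# Example with the versions used in this post.
if same_release_line 0.18.2 0.18.1; then
    echo "Hadoop 0.18.2 and HBase 0.18.1 match"
else
    echo "version mismatch" >&2
fi
```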
You can download the Hadoop tar file from http://www.apache.org/dyn/closer.cgi/hadoop/core/ and the Hbase tar file from http://www.apache.org/dyn/closer.cgi/hadoop/hbase/
They are also available as git repositories via:
git clone git://git.apache.org/hadoop.git
git clone git://git.apache.org/hbase.git
You can track a particular release with the following commands (we’re stuck at Hadoop 0.19.1 / HBase 0.19.0):
cd hadoop
git branch --track release-0.19.1 origin/tags/release-0.19.1
git checkout release-0.19.1
cd ../hbase
git branch --track 0.19.0 origin/tags/0.19.0
git checkout 0.19.0
Then build in each directory. As far as I can tell, you just need the default ant build, but you can also build the jar explicitly:
cd ../hadoop
ant
ant jar
cd ../hbase
ant
ant jar
Biggest Problem I Had
The thing that took the longest to get right was accessing HBase from other hosts. You would think you could put the DNS Fully Qualified Domain Name (FQDN) in the config file. It turns out that, by default, the Hadoop tools don’t seem to use the host’s DNS resolver and use only what is in /etc/hosts (as far as I can tell). So you have to use the IP address in the config file.
I believe there are ways to configure around this, but I haven’t found them yet.
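Since /etc/hosts seems to be what matters, it helps to see exactly which address a name maps to there. The following is a rough sketch, not something from the Hadoop tools: lookup_in_hosts is a hypothetical helper that scans a hosts-format file and prints the first IP address listed for a name.

```shell
#!/bin/sh
# Print the first IP address mapped to a hostname in a hosts-format file.
# $1 = hostname to look for, $2 = file to scan (defaults to /etc/hosts).
lookup_in_hosts() {
    awk -v n="$1" '
        {
            sub(/#.*/, "")              # drop comments
            for (i = 2; i <= NF; i++)   # field 1 is the IP, the rest are names
                if ($i == n) { print $1; exit }
        }
    ' "${2:-/etc/hosts}"
}
```

Running lookup_in_hosts with your FQDN prints nothing if the name is missing from /etc/hosts, which would be consistent with the behaviour I saw.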
Configuration Examples
File System Layout
I untarred the distributions into /usr/local/pkgs, made symbolic links to them at /usr/local/hadoop and /usr/local/hbase, and created the directory that Hadoop/HDFS will use for storage. First, create the hadoop user and group.
For Ubuntu:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
For Mac:
Create a Home Directory
mkdir /Users/_hadoop
Find an unused groupid by seeing what ids are already in use:
sudo dscl . -list /Groups PrimaryGroupID | cut -c 32-34 | sort -rn
Then find an unused userid by seeing which userids are in use:
sudo dscl . -list /Users UniqueID | cut -c 20-22 | sort -rn
Pick a number that is in neither list. In our case we will use 402 for both the userid and groupid for _hadoop (Mac OS X puts an underscore in front of daemon user/group names). We will also add hadoop as an alias record name for the group and the user:
sudo dscl . -create /Groups/_hadoop PrimaryGroupID 402
sudo dscl . -append /Groups/_hadoop RecordName hadoop
Use the same groupid (402 in this case) as the PrimaryGroupID when creating the user:
sudo dscl . -create /Users/_hadoop UniqueID 402
sudo dscl . -create /Users/_hadoop RealName "Hadoop Service"
sudo dscl . -create /Users/_hadoop PrimaryGroupID 402
sudo dscl . -create /Users/_hadoop NFSHomeDirectory /Users/_hadoop
sudo dscl . -append /Users/_hadoop RecordName hadoop
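Picking the unused ID by eye from the two dscl listings works, but it can also be scripted. This is a sketch under my own assumptions: first_unused_id is a hypothetical helper that reads one numeric ID per line on standard input and prints the first free number at or above a starting value.

```shell
#!/bin/sh
# Print the first ID >= $1 that does not appear on standard input.
# Input: one numeric ID per line.
first_unused_id() {
    used=$(cat)
    id=$1
    while printf '%s\n' "$used" | grep -qx "$id"; do
        id=$((id + 1))
    done
    echo "$id"
}

# Possible use on the Mac (not run here): feed it both listings so the
# result is free as a userid and as a groupid.
#   { dscl . -list /Groups PrimaryGroupID; dscl . -list /Users UniqueID; } \
#       | awk '{ print $NF }' | first_unused_id 400
```

Taking the last field with awk avoids depending on the fixed column positions used by the cut commands above.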
For both Ubuntu and Mac (note that on the Mac the user and group end up being named _hadoop):
cd /usr/local/pkgs
tar xzf hadoop-0.18.2.tar.gz
tar xzf hbase-0.18.1.tar.gz
cd ..
ln -s /usr/local/pkgs/hadoop-0.18.2 hadoop
ln -s /usr/local/pkgs/hbase-0.18.1 hbase
mkdir /var/hadoop_datastore
chown -R hadoop:hadoop hadoop/ hbase/ /var/hadoop_datastore /Users/_hadoop
Hadoop Config files
The following are all in /usr/local/hadoop/conf
hadoop-env.sh
You need to set the JAVA_HOME variable. I installed Java 6 via Synaptic. You can also install it with:
apt-get install sun-java6-jdk
The Macintosh is easy if you have an Intel Core 2 Duo (the Intel Core Duo doesn’t count). Apple only supports Java 1.6 on its 64-bit processors. If you have a 32-bit processor, like the first-generation 17″ MacBook Pro or first-generation Mac mini, or you have a PPC, see Tech Tip: How to Set Up JDK 6 and JavaFX on 32-bit Intel Macs.
So my config is (only the things I changed, the rest was left as is):
...
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-6-sun
...
For the Macintosh:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/Current
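A wrong JAVA_HOME only shows up when the daemons fail to start, so it can be worth checking the value first. A minimal sketch, with check_java_home being a name I made up: it only tests that an executable bin/java exists under the given directory and says nothing about the Java version.

```shell
#!/bin/sh
# Succeed if $1 looks like a usable JAVA_HOME, i.e. it contains an
# executable bin/java. This does not check which Java version it is.
check_java_home() {
    if [ -x "$1/bin/java" ]; then
        echo "ok: $1"
    else
        echo "no executable bin/java under $1" >&2
        return 1
    fi
}
```

For example, check_java_home /usr/lib/jvm/java-6-sun on Ubuntu, or the JavaVM.framework path above on the Mac.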
hadoop-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/hadoop_datastore/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<!-- As per note in http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200810.mbox/<C20126171.post@talk.nabble.com> -->
<property>
<name>dfs.datanode.socket.write.timeout</name>
<value>0</value>
</property>
<property>
<name>dfs.datanode.max.xcievers</name>
<value>1023</value>
</property>
</configuration>
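To double-check what a config file actually sets, you can pull a single value out of it. This is a rough sketch rather than a real XML parser: get_property is a hypothetical helper that assumes the layout used above, with the value element on the same line as the name element or on a line after it.

```shell
#!/bin/sh
# Print the <value> that follows a given <name> in a Hadoop-style
# site file. $1 = property name, $2 = path to the XML file.
# Assumes <name> and <value> each sit on a single line.
get_property() {
    awk -v want="$1" '
        index($0, "<name>" want "</name>") { found = 1 }
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$2"
}
```

For example, get_property fs.default.name /usr/local/hadoop/conf/hadoop-site.xml should print the hdfs:// URI configured above.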
HBase Config Files
The following are all in /usr/local/hbase/conf
hbase-env.sh
Again, just need to set up JAVA_HOME:
...
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-6-sun
...
For the Macintosh:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/Current
hbase-site.xml
This is where I wanted to give an FQDN for the host that is the hbase.master, but had to use an IP address instead.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:54310/hbase</value>
<description>The directory shared by region servers.
Should be fully-qualified to include the filesystem to use.
E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
</description>
</property>
<property>
<name>hbase.master</name>
<value>192.168.10.50:60000</value>
<description>The host and port that the HBase master runs at.
</description>
</property>
</configuration>
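One thing that is easy to get wrong is letting hbase.rootdir point at a different host or port than fs.default.name in hadoop-site.xml. The check below is a sketch of that comparison; rootdir_on_fs is a hypothetical helper name, and it only does a string prefix comparison of the two URIs.

```shell
#!/bin/sh
# Succeed if the HBase root directory ($2) lives under the default
# Hadoop filesystem URI ($1). Pure string comparison, no network access.
rootdir_on_fs() {
    case "$2" in
        "${1%/}"/*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Example with the values from the two config files in this post.
if rootdir_on_fs hdfs://localhost:54310 hdfs://localhost:54310/hbase; then
    echo "hbase.rootdir is on the configured HDFS"
else
    echo "hbase.rootdir does not match fs.default.name" >&2
fi
```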
Formatting the Name Node
You must do this as the same user that will be running the daemon (hadoop). The command has to be quoted so that su passes it to the shell as a single -c argument:
su hadoop -s /bin/sh -c "/usr/local/hadoop/bin/hadoop namenode -format"
On the Mac:
/usr/bin/su _hadoop /usr/local/hadoop/bin/hadoop namenode -format
Setup passphraseless ssh
Now check that you can ssh to the localhost without a passphrase:
su - hadoop
ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands (as hadoop):
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Ubuntu /etc/init.d style startup scripts
I scoured the InterTubes for example hadoop/hbase startup scripts and found absolutely none! I ended up creating a minimal one that is so far only suited for the Pseudo-Distributed Operation mode as it just calls the start-all / stop-all scripts.
/etc/init.d/hadoop
Create the directory where it will put its startup logs:
mkdir /var/log/hadoop
Create /etc/init.d/hadoop with the following:
#!/bin/sh
### BEGIN INIT INFO
# Provides: hadoop services
# Required-Start: $network
# Required-Stop: $network
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Description: Hadoop services
# Short-Description: Enable Hadoop services including hdfs
### END INIT INFO
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
HADOOP_BIN=/usr/local/hadoop/bin
NAME=hadoop
DESC=hadoop
USER=hadoop
ROTATE_SUFFIX=
test -x $HADOOP_BIN || exit 0
RETVAL=0
set -e
cd /
start_hadoop () {
set +e
su $USER -s /bin/sh -c $HADOOP_BIN/start-all.sh > /var/log/hadoop/startup_log
case "$?" in
0)
echo SUCCESS
RETVAL=0
;;
1)
echo TIMEOUT - check /var/log/hadoop/startup_log
RETVAL=1
;;
*)
echo FAILED - check /var/log/hadoop/startup_log
RETVAL=1
;;
esac
set -e
}
stop_hadoop () {
set +e
if [ $RETVAL = 0 ] ; then
su $USER -s /bin/sh -c $HADOOP_BIN/stop-all.sh > /var/log/hadoop/shutdown_log
RETVAL=$?
if [ $RETVAL != 0 ] ; then
echo FAILED - check /var/log/hadoop/shutdown_log
fi
else
echo No nodes running
RETVAL=0
fi
set -e
}
restart_hadoop() {
stop_hadoop
start_hadoop
}
case "$1" in
start)
echo -n "Starting $DESC: "
start_hadoop
echo "$NAME."
;;
stop)
echo -n "Stopping $DESC: "
stop_hadoop
echo "$NAME."
;;
force-reload|restart)
echo -n "Restarting $DESC: "
restart_hadoop
echo "$NAME."
;;
*)
echo "Usage: $0 {start|stop|restart|force-reload}" >&2
RETVAL=1
;;
esac
exit $RETVAL
/etc/init.d/hbase
Create the directory where it will put its startup logs:
mkdir /var/log/hbase
Create /etc/init.d/hbase with the following:
#!/bin/sh
### BEGIN INIT INFO
# Provides: hbase services
# Required-Start: $network
# Required-Stop: $network
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Description: Hbase services
# Short-Description: Enable Hbase services
### END INIT INFO
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
HBASE_BIN=/usr/local/hbase/bin
NAME=hbase
DESC=hbase
USER=hadoop
ROTATE_SUFFIX=
test -x $HBASE_BIN || exit 0
RETVAL=0
set -e
cd /
start_hbase () {
set +e
su $USER -s /bin/sh -c $HBASE_BIN/start-hbase.sh > /var/log/hbase/startup_log
case "$?" in
0)
echo SUCCESS
RETVAL=0
;;
1)
echo TIMEOUT - check /var/log/hbase/startup_log
RETVAL=1
;;
*)
echo FAILED - check /var/log/hbase/startup_log
RETVAL=1
;;
esac
set -e
}
stop_hbase () {
set +e
if [ $RETVAL = 0 ] ; then
su $USER -s /bin/sh -c $HBASE_BIN/stop-hbase.sh > /var/log/hbase/shutdown_log
RETVAL=$?
if [ $RETVAL != 0 ] ; then
echo FAILED - check /var/log/hbase/shutdown_log
fi
else
echo No nodes running
RETVAL=0
fi
set -e
}
restart_hbase() {
stop_hbase
start_hbase
}
case "$1" in
start)
echo -n "Starting $DESC: "
start_hbase
echo "$NAME."
;;
stop)
echo -n "Stopping $DESC: "
stop_hbase
echo "$NAME."
;;
force-reload|restart)
echo -n "Restarting $DESC: "
restart_hbase
echo "$NAME."
;;
*)
echo "Usage: $0 {start|stop|restart|force-reload}" >&2
RETVAL=1
;;
esac
exit $RETVAL
Set up the init system
This assumes you put the above init files in /etc/init.d
chmod +x /etc/init.d/{hbase,hadoop}
update-rc.d hadoop defaults
update-rc.d hbase defaults 25
You can now start / stop hadoop by saying:
/etc/init.d/hadoop start
/etc/init.d/hadoop stop
And similarly with hbase:
/etc/init.d/hbase start
/etc/init.d/hbase stop
Make sure you start hadoop before hbase, and stop hbase before you stop hadoop.
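If you start and stop things by hand a lot, a small wrapper keeps the ordering straight. This is just a sketch: start_stack, stop_stack and the INIT_DIR variable are names I made up, and the functions simply call the two init scripts in the right order.

```shell
#!/bin/sh
# Directory holding the init scripts; override INIT_DIR for testing.
INIT_DIR=${INIT_DIR:-/etc/init.d}

# Start hadoop first, and only start hbase if hadoop came up.
start_stack() {
    "$INIT_DIR/hadoop" start && "$INIT_DIR/hbase" start
}

# Stop in the reverse order: hbase first, then hadoop.
stop_stack() {
    "$INIT_DIR/hbase" stop
    "$INIT_DIR/hadoop" stop
}
```

Call start_stack or stop_stack as root after sourcing the file.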
Macintosh launchd style startup
Starting processes on Macintosh Leopard is pretty easy with launchd/launchctl.
For hadoop, create a file /Library/LaunchAgents/com.yourdomain.hadoop.plist with the following content (replace yourdomain with the domain you want to use for this class of apps):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>GroupName</key>
<string>_hadoop</string>
<key>KeepAlive</key>
<true/>
<key>Label</key>
<string>com.yourdomain.hadoop</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/hadoop/bin/start-all.sh</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>ServiceDescription</key>
<string>Hadoop Process</string>
<key>UserName</key>
<string>_hadoop</string>
</dict>
</plist>
And for hbase, /Library/LaunchAgents/com.yourdomain.hbase.plist:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>GroupName</key>
<string>_hadoop</string>
<key>KeepAlive</key>
<true/>
<key>Label</key>
<string>com.yourdomain.hbase</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/hbase/bin/start-hbase.sh</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>UserName</key>
<string>_hadoop</string>
</dict>
</plist>
Set the owner to root and the mode to 644:
chown root /Library/LaunchAgents/com.yourdomain.hadoop.plist /Library/LaunchAgents/com.yourdomain.hbase.plist
chmod 644 /Library/LaunchAgents/com.yourdomain.hadoop.plist /Library/LaunchAgents/com.yourdomain.hbase.plist
The next time you restart the machine, both hadoop and hbase should start. You can also start them manually with the commands:
sudo launchctl load /Library/LaunchAgents/com.yourdomain.hadoop.plist
sudo launchctl load /Library/LaunchAgents/com.yourdomain.hbase.plist
Conclusion
You should now be able to see the HBase web interface at http://<your domain name>:60010
If you have problems, check /var/log/{hbase,hadoop}/startup_log as well as /usr/local/hadoop/logs/hadoop-hadoop-namenode-yourhostname.log and /usr/local/hbase/logs/hbase-hadoop-master-yourhostname.log
The error messages are pretty poor (i.e. useless, as far as I could tell, when tracking down the FQDN/IP address problem), but better than nothing.
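To avoid paging through each log by hand, something like the following can pull out the suspicious lines. It is only a sketch: show_errors is a hypothetical helper, and the ERROR, FATAL and Exception patterns are my guess at what is worth looking for, not an official list.

```shell
#!/bin/sh
# For each readable log file given, print its name followed by the last
# 20 lines that mention ERROR, FATAL or Exception.
show_errors() {
    for f in "$@"; do
        [ -r "$f" ] || continue
        echo "== $f"
        grep -E 'ERROR|FATAL|Exception' "$f" | tail -n 20
    done
    return 0
}
```

For example: show_errors /var/log/hadoop/startup_log /var/log/hbase/startup_log /usr/local/hadoop/logs/*.log /usr/local/hbase/logs/*.log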
I will post an update when I deploy a Full Cluster.