I wanted to experiment with various tools for working with Hadoop and HBase and didn’t want to bother making them work with our cluster in the cloud. I just wanted a simple experimental environment on my MacBook Pro running Mac OS X Snow Leopard.
So I thought it was time to revisit installing Hadoop and HBase on the Mac using the latest versions of everything. This will be deployed in Pseudo-Distributed mode, native to Mac OS X. Some folks actually create a set of Linux VMs with a full Hadoop/HBase stack and run that on the Mac, but that is a bit of overkill for now.
These instructions mainly follow the standard instructions for Apache Hadoop and Apache HBase.
Prerequisites
Mac OS X Xcode developer tools, which include Java 1.6.x. You can get these for free from the Apple Mac Dev Center. You have to become a member, but there is a free membership available.
Download and Unpack Latest Distros
You can get a link to a mirror for Hadoop via the Hadoop Apache Mirror link and for HBase via the HBase Apache Mirror link. Each of those links brings you to a suggested mirror for Hadoop or HBase. Clicking the suggested link brings you to a mirror with the recent releases. From there, the stable link leads to a directory that has the latest stable Hadoop (as of this writing: hadoop-0.20.2.tar.gz) or HBase (as of this writing: hbase-0.20.3.tar.gz). Click on those tar.gz files to download them.
I keep the distros in ~/work/pkgs: I unpack the tar files there as numbered versions and then create symbolic links to them in ~/work. You can do this in any directory that you control. The commands below assume the downloaded tar.gz files end up in ~/work/pkgs; adjust the tar paths if your downloads are somewhere else:
cd ~/work
mkdir -p pkgs
cd pkgs
tar xvzf hadoop-0.20.2.tar.gz
tar xvzf hbase-0.20.3.tar.gz
cd ..
ln -s pkgs/hadoop-0.20.2 hadoop
ln -s pkgs/hbase-0.20.3 hbase
mkdir -p hadoop/logs
mkdir -p hbase/logs
Now all your tools can access ~/work/hadoop or ~/work/hbase without caring what version it is. You can update to a later version just by downloading and untarring the new distro and then changing the symbolic links.
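The link-switching step can be sketched as a small shell function. This is only an illustration: the function name switch_version and the WORK_DIR variable are hypothetical, and it assumes the layout used above (unpacked distros in ~/work/pkgs, links in ~/work).

```shell
# Hypothetical helper: repoint $WORK_DIR/<name> at pkgs/<name>-<version>.
# Assumes the distro has already been unpacked into $WORK_DIR/pkgs.
WORK_DIR="${WORK_DIR:-$HOME/work}"

switch_version() {
    name="$1"
    version="$2"
    target="pkgs/$name-$version"

    # Refuse to create a dangling link if the unpacked distro is missing.
    if [ ! -d "$WORK_DIR/$target" ]; then
        echo "no such directory: $WORK_DIR/$target" >&2
        return 1
    fi

    # -n replaces the link itself instead of following it into the old target.
    (cd "$WORK_DIR" && ln -sfn "$target" "$name")
}

# Example: switch_version hadoop 0.20.2
```

Because the link is relative (pkgs/...), the whole ~/work directory can be moved without breaking it. Remember that a freshly unpacked version will also need its logs directory and your edited conf files.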
Configure Hadoop
All the configuration files mentioned here will be in ~/work/hadoop/conf. In this example we are assuming that the Hadoop servers will only be accessed from this localhost. If you need to make them accessible from other hosts or VMs on your LAN that support Bonjour, you could use the Bonjour name (i.e. the name of your Mac followed by .local, such as mymac.local) instead of localhost in the following Hadoop and HBase configurations.
hadoop-env.sh
Mainly you need to tell Hadoop where your JAVA_HOME is.
Add the following line below the commented-out JAVA_HOME line in hadoop-env.sh:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home
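Before going further it can be worth confirming that the JAVA_HOME path really contains a Java runtime on your machine. A minimal sketch, where check_java_home is a hypothetical helper name and the only assumption is that a usable JAVA_HOME has an executable bin/java under it:

```shell
# Hypothetical check: succeed only if the directory contains an executable bin/java.
check_java_home() {
    java_home="$1"
    if [ -x "$java_home/bin/java" ]; then
        echo "ok: $java_home"
    else
        echo "no executable java under: $java_home" >&2
        return 1
    fi
}

# Example, using the path set in hadoop-env.sh:
# check_java_home /System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home
```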
core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
Make sure you can ssh without a password to the hostname used in the configs
The Hadoop and HBase start/stop scripts use ssh to access the various servers. In Pseudo-Distributed mode everything is running on the localhost, but we still need to allow the scripts to ssh to the localhost.
Check that you can ssh to the localhost (or whatever hostname you used in the above configs).
We’re assuming that we’ll be running the Hadoop/HBase servers as the same user as our login. You can set things up to run as the hadoop user, but it’s kind of complicated on Mac OS X. See the section File System Layout in an earlier post, Hadoop, HDFS and Hbase on Ubuntu & Macintosh Leopard. That section and a few other points throughout that post describe how to create and use a hadoop user to run the Hadoop and HBase servers.
Back to just doing this as our own user. Test that you can ssh to the localhost without a password:
ssh localhost
If you see something like the following paragraph that ends up with a password prompt, then you need to add a key to your ssh setup that does not need a password (you may need to say yes if you are asked if you want to continue connecting).
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 3c:5d:6a:39:64:78:02:9d:a3:c9:69:68:50:23:71:eb.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Password:
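To create a passwordless key and add it to the set of authorized keys that can access your host, do the following (as yourself, not as root). The key file name can be arbitrary, but ssh only tries its default key files (such as ~/.ssh/id_dsa) on its own; with a custom name like the one used here you may also need an IdentityFile line for it in ~/.ssh/config. The commands are: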
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa_for_hadoop
cat ~/.ssh/id_dsa_for_hadoop.pub >> ~/.ssh/authorized_keys
If you have strong alternative opinions on how to set up your own keys to accomplish the same thing please do it your own way. This is just the basic way of doing a passwordless ssh. You may want to use a key you already have lying around or some other mechanism.
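Whichever way you set up the key, you can confirm the result without being fooled by a password prompt. The sketch below is one way to do it; can_ssh_without_password is a hypothetical helper name. BatchMode=yes is a standard ssh option that makes ssh fail rather than ask for a password, and -i names the key file explicitly.

```shell
# Hypothetical helper: returns 0 only if ssh can log in to the host without prompting.
# BatchMode=yes makes ssh fail instead of asking for a password.
can_ssh_without_password() {
    host="$1"
    key="${2:-}"   # optional: path to a private key with a non-default name

    if [ -n "$key" ]; then
        ssh -i "$key" -o BatchMode=yes -o ConnectTimeout=5 "$host" true
    else
        ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true
    fi
}

# Example:
# can_ssh_without_password localhost ~/.ssh/id_dsa_for_hadoop && echo "passwordless ssh works"
```

If the check fails even though the key is in authorized_keys, file permissions are a common culprit: sshd normally insists that ~/.ssh and ~/.ssh/authorized_keys are not writable by other users (chmod 700 ~/.ssh and chmod 600 ~/.ssh/authorized_keys).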
Start Hadoop
One time format of Hadoop File System
Only once, before the first time you use Hadoop, you have to create a formatted Hadoop file system. Don’t do this again once you have data in your Hadoop file system, as it will erase anything you have saved there. You may have to run this command again if you somehow corrupt your file system, but it’s not something to do lightly the second time.
~/work/hadoop/bin/hadoop namenode -format
If all goes well, you should see something like:
10/05/02 18:45:04 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = Psion.local/192.168.50.16
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/02 18:45:04 INFO namenode.FSNamesystem: fsOwner=rberger,rberger,admin,com.apple.access_screensharing,_developer,_lpoperator,_lpadmin,_appserveradm,_appserverusr,localaccounts,everyone,com.apple.sharepoint.group.2,com.apple.sharepoint.group.3,dev,com.apple.sharepoint.group.1,workgroup
10/05/02 18:45:04 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/02 18:45:04 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/02 18:45:04 INFO common.Storage: Image file of size 97 saved in 0 seconds.
10/05/02 18:45:04 INFO common.Storage: Storage directory /tmp/hadoop-rberger/dfs/name has been successfully formatted.
10/05/02 18:45:04 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Psion.local/192.168.50.16
************************************************************/
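Since formatting a second time destroys data, a small guard around the command can help. The following is a sketch, not part of the Hadoop distro: format_once, HADOOP_DIR and NAME_DIR are hypothetical names, and NAME_DIR assumes the default storage location that appears in the log output above (/tmp/hadoop-<user>/dfs/name).

```shell
# Hypothetical guard around the one-time format step.
# NAME_DIR assumes the default storage directory shown in the format log;
# adjust it if you have configured a different location.
HADOOP_DIR="${HADOOP_DIR:-$HOME/work/hadoop}"
NAME_DIR="${NAME_DIR:-/tmp/hadoop-$(id -un)/dfs/name}"

format_once() {
    # A formatted name directory contains a "current" subdirectory.
    if [ -d "$NAME_DIR/current" ]; then
        echo "refusing to format: $NAME_DIR already looks formatted" >&2
        return 1
    fi
    "$HADOOP_DIR/bin/hadoop" namenode -format
}

# Example: format_once
```

Note that this default location is under /tmp, so the file system may not survive anything that cleans out /tmp.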
Starting and stopping Hadoop
Now you can start Hadoop. You will use this command to start Hadoop in general:
~/work/hadoop/bin/start-all.sh
You can stop Hadoop with the command:
~/work/hadoop/bin/stop-all.sh
But remember if you are running HBase, stop that first, then stop Hadoop.
Making sure Hadoop is working
You can see the Hadoop logs in ~/work/hadoop/logs
You should be able to see the Hadoop NameNode web interface at http://localhost:50070/ and the JobTracker web interface at http://localhost:50030/. If not, check that you have 5 java processes running, each with one of the following class names as the last item on its command line (as seen from a ps ax | grep hadoop command):
org.apache.hadoop.mapred.JobTracker
org.apache.hadoop.hdfs.server.namenode.NameNode
org.apache.hadoop.mapred.TaskTracker
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
org.apache.hadoop.hdfs.server.datanode.DataNode
If you do not see these 5 processes, check the logs in ~/work/hadoop/logs/*.{out,log} for messages that might give you a hint as to what went wrong.
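Checking for the five processes by eye gets tedious, so here is a sketch of a helper that reports which ones are missing. The name missing_hadoop_daemons is hypothetical; the class names are the five listed above.

```shell
# Hypothetical helper: read `ps ax` output on stdin and print every expected
# Hadoop daemon class that does not appear in it.
# Prints nothing when all five are present.
missing_hadoop_daemons() {
    ps_output="$(cat)"
    for class in \
        org.apache.hadoop.mapred.JobTracker \
        org.apache.hadoop.hdfs.server.namenode.NameNode \
        org.apache.hadoop.mapred.TaskTracker \
        org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode \
        org.apache.hadoop.hdfs.server.datanode.DataNode
    do
        # -F treats the class name as a literal string, not a regular expression.
        if ! printf '%s\n' "$ps_output" | grep -qF "$class"; then
            echo "$class"
        fi
    done
}

# Example: ps ax | missing_hadoop_daemons
```

Any class name it prints tells you which log file in ~/work/hadoop/logs to read first.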
Run some example map/reduce jobs
The Hadoop distro comes with some example / test map / reduce jobs. Here we’ll run them and make sure things are working end to end.
cd ~/work/hadoop
# Copy the input files into the distributed filesystem
# (there will be no output visible from the command):
bin/hadoop fs -put conf input
# Run some of the examples provided:
# (there will be a large amount of INFO statements as output)
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
# Examine the output files:
bin/hadoop fs -cat output/part-00000
The resulting output should be something like:
3 dfs.class
2 dfs.period
1 dfs.file
1 dfs.replication
1 dfs.servers
1 dfsadmin
1 dfsmetrics.log
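If you run the example a second time it will fail, because MapReduce refuses to start a job whose output directory already exists. A sketch of a re-run wrapper follows; rerun_grep_example and HADOOP_DIR are hypothetical names, and the commands inside are the ones used above plus fs -rmr to remove the old output.

```shell
# Hypothetical wrapper for re-running the grep example.
# The old "output" directory in HDFS is removed first, because a job
# will not start if its output directory already exists.
HADOOP_DIR="${HADOOP_DIR:-$HOME/work/hadoop}"

# The body runs in a subshell so the cd does not affect the calling shell.
rerun_grep_example() (
    cd "$HADOOP_DIR" || exit 1
    # Ignore the error from -rmr when there is no previous output.
    bin/hadoop fs -rmr output 2>/dev/null || true
    bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+' || exit 1
    bin/hadoop fs -cat output/part-00000
)

# Example: rerun_grep_example
```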
Configuring HBase
The following config files all reside in ~/work/hbase/conf. As mentioned earlier, use a FQDN or a Bonjour name instead of localhost if you need remote clients to access HBase. But if you don’t use localhost here, make sure you do the same in the Hadoop config.
hbase-env.sh
Add the following line below the commented-out JAVA_HOME line in hbase-env.sh:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home
Add the following line below the commented-out HBASE_CLASSPATH= line:
export HBASE_CLASSPATH=${HOME}/work/hadoop/conf
hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
<description>The directory shared by region servers.
</description>
</property>
</configuration>
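Because hbase.rootdir has to point at the same host and port as fs.default.name in the Hadoop config, a quick consistency check can save some head-scratching. The helper below is a sketch: site_property is a hypothetical name, and it relies on the name and value elements sitting on separate lines as they do in the files above. It is not a general XML parser.

```shell
# Hypothetical helper: print the <value> that follows a given <name> in a
# Hadoop-style *-site.xml file. Assumes <name> and <value> are on separate
# lines, with the value on a single line.
site_property() {
    file="$1"
    name="$2"
    awk -v want="<name>$name</name>" '
        index($0, want) { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$file"
}

# Example: the HBase root directory should start with the Hadoop file system URL.
# fs=$(site_property ~/work/hadoop/conf/core-site.xml fs.default.name)
# root=$(site_property ~/work/hbase/conf/hbase-site.xml hbase.rootdir)
# case "$root" in "$fs"/*) echo "consistent" ;; *) echo "mismatch: $fs vs $root" ;; esac
```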
Making Sure HBase is Working
First start HBase (Hadoop must already be running):
~/work/hbase/bin/start-hbase.sh
If you then do a ps ax | grep hbase, you should see two java processes. One should end with:
org.apache.hadoop.hbase.zookeeper.HQuorumPeer start
And the other should end with:
org.apache.hadoop.hbase.master.HMaster start
Since we are running in Pseudo-Distributed mode, there will not be any explicit regionservers running. If you have problems, check the logs in ~/work/hbase/logs/*.{out,log}.
Testing HBase using the HBase Shell
From the unix prompt give the following command:
~/work/hbase/bin/hbase shell
Here are some example commands from the Apache HBase Installation Instructions:
hbase> # Type "help" to see shell help screen
hbase> help
hbase> # To create a table named "mylittletable" with a column family of "mylittlecolumnfamily", type
hbase> create "mylittletable", "mylittlecolumnfamily"
hbase> # To see the schema for the "mylittletable" table you just created and its single "mylittlecolumnfamily", type
hbase> describe "mylittletable"
hbase> # To add a row whose id is "myrow", to the column "mylittlecolumnfamily:x" with a value of 'v', do
hbase> put "mylittletable", "myrow", "mylittlecolumnfamily:x", "v"
hbase> # To get the cell just added, do
hbase> get "mylittletable", "myrow"
hbase> # To scan your new table, do
hbase> scan "mylittletable"
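When you are done experimenting you will probably want to remove the test table. In the HBase shell a table has to be disabled before it can be dropped. The sketch below generates those commands from the unix prompt; drop_table_commands is a hypothetical helper name, and piping the result into the shell assumes that hbase shell accepts commands on standard input.

```shell
# Hypothetical helper: print the HBase shell commands that remove a table.
# A table has to be disabled before it can be dropped.
drop_table_commands() {
    table="$1"
    printf 'disable "%s"\n' "$table"
    printf 'drop "%s"\n' "$table"
    printf 'exit\n'
}

# Example (assumes the shell reads commands from standard input):
# drop_table_commands mylittletable | ~/work/hbase/bin/hbase shell
```

You can equally type the disable and drop lines directly at the hbase> prompt.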
You can stop HBase with the command:
~/work/hbase/bin/stop-hbase.sh
Once that has stopped you can stop Hadoop:
~/work/hadoop/bin/stop-all.sh
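Since the order matters (Hadoop up before HBase, HBase down before Hadoop), it can be handy to wrap the four scripts in two functions. This is a sketch: start_stack, stop_stack, HADOOP_DIR and HBASE_DIR are hypothetical names, and the error handling assumes the start/stop scripts exit with a non-zero status on failure, which is not confirmed here.

```shell
# Hypothetical wrappers that keep the start/stop order straight.
HADOOP_DIR="${HADOOP_DIR:-$HOME/work/hadoop}"
HBASE_DIR="${HBASE_DIR:-$HOME/work/hbase}"

# Start Hadoop first; HBase needs HDFS to be available.
start_stack() {
    "$HADOOP_DIR/bin/start-all.sh" || return 1
    "$HBASE_DIR/bin/start-hbase.sh"
}

# Stop HBase first, and stop Hadoop only if that succeeded.
stop_stack() {
    "$HBASE_DIR/bin/stop-hbase.sh" || {
        echo "HBase did not stop cleanly; leaving Hadoop running" >&2
        return 1
    }
    "$HADOOP_DIR/bin/stop-all.sh"
}

# Example: start_stack, do your experiments, then stop_stack
```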
Conclusion
You should now have a fully working Pseudo-Distributed Hadoop/HBase setup on your Mac. This is not suitable for any kind of large data or production project. In fact it will probably fail if you try to do anything with lots of data or high volumes of I/O. HBase seems not to work well until you have 4 to 5 regionservers.
But this Pseudo-Distributed version should be fine for doing experiments with tools and small data sets.
Now I can get on with playing with Cascading-Clojure and Cascalog!