HBase/Hadoop on Mac OS X (Pseudo-Distributed)

I wanted to experiment with various tools for working with Hadoop and HBase and didn’t want to bother making them work with our Cluster in the Cloud. I just wanted a simple experimental environment on my MacBook Pro running Mac OS X Snow Leopard.

So I thought it was time to revisit installing Hadoop and HBase on the Mac using the latest versions of everything. This will be deployed in Pseudo-Distributed mode, native to Mac OS X. Some folks actually create a set of Linux VMs with a full Hadoop/HBase stack and run that on the Mac, but that is a bit of overkill for now.

These instructions mainly follow the standard instructions for Apache Hadoop and Apache HBase.

Prerequisites

The Mac OS X Xcode developer tools, which include Java 1.6.x. You can get them for free from the Apple Mac Dev Center. You have to become a member, but there is a free membership available.

Download and Unpack Latest Distros

You can get a link to a mirror for Hadoop via the Hadoop Apache Mirror link and for HBase via the HBase Apache Mirror link. Each of those links brings you to a suggested mirror for Hadoop or HBase. Clicking the suggested link takes you to a mirror listing the recent releases. From there, the stable link leads to a directory with the latest stable Hadoop (as of this writing: hadoop-0.20.2.tar.gz) or HBase (as of this writing: hbase-0.20.3.tar.gz). Click on those tar.gz files to download them.

I keep the distros in ~/work/pkgs: I unpack the tar files there as numbered versions and then create symbolic links to them in ~/work. You can do all of this in any directory that you control:

mkdir -p ~/work/pkgs
cd ~/work/pkgs
# Move the two downloaded tar.gz files into this directory before unpacking
tar xvzf hadoop-0.20.2.tar.gz
tar xvzf hbase-0.20.3.tar.gz
cd ..
ln -s pkgs/hadoop-0.20.2 hadoop
ln -s pkgs/hbase-0.20.3 hbase
mkdir -p hadoop/logs
mkdir -p hbase/logs

Now your tools can all access ~/work/hadoop or ~/work/hbase without caring which version it is. You can update to a later version just by downloading and untarring the distro and then changing the symbolic links.
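The upgrade step can be sketched as a small shell function. This is only an illustration under the ~/work/pkgs layout used here: switch_version is a hypothetical helper name, not something shipped with Hadoop or HBase.

```shell
# Hypothetical helper (not part of Hadoop or HBase): repoint the symlink
# <work_dir>/<name> at an unpacked release in <work_dir>/pkgs/<release>.
switch_version() {
  work_dir="$1"   # e.g. ~/work
  name="$2"       # e.g. hadoop or hbase
  release="$3"    # e.g. hadoop-0.20.2

  if [ ! -d "$work_dir/pkgs/$release" ]; then
    echo "no such release: $work_dir/pkgs/$release" >&2
    return 1
  fi
  # -n replaces the link itself instead of creating a new link inside
  # the directory that the old link points to.
  ln -sfn "pkgs/$release" "$work_dir/$name"
}

# Example: switch_version ~/work hadoop hadoop-0.20.2
```

Stop the servers before switching, and remember that the conf and logs directories live inside each unpacked release, so configuration changes have to be carried over to the new version.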

Configure Hadoop

All the configuration files mentioned here are in ~/work/hadoop/conf. In this example we assume that the Hadoop servers will only be accessed from this machine, as localhost. If you need to make them accessible from other hosts or VMs on your LAN that support Bonjour, you could use the Bonjour name (i.e. the name of your Mac followed by .local, such as mymac.local) instead of localhost in the following Hadoop and HBase configurations.

hadoop-env.sh

Hadoop mainly needs to know where your JAVA_HOME is.

Add the following line below the commented-out JAVA_HOME line in hadoop-env.sh:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home
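Before going further it can be worth confirming that the path really contains a Java launcher. The following is a minimal sketch; check_java_home is a hypothetical name chosen for this example.

```shell
# Hypothetical helper: succeed only if <dir>/bin/java exists and is executable.
check_java_home() {
  if [ -x "$1/bin/java" ]; then
    echo "JAVA_HOME looks usable: $1"
  else
    echo "no executable bin/java under: $1" >&2
    return 1
  fi
}

# Example, using the path from hadoop-env.sh:
# check_java_home /System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home
```

If the check fails, the JDK may live elsewhere on your OS X release; running /usr/libexec/java_home prints the location of the current one.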

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
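All three files share the same name/value layout, so a quick way to double-check what you typed is to read a value back out. The sketch below is not a real XML parser: it assumes each <name> and <value> element sits on its own line, as in the listings above, and site_property is a hypothetical helper name.

```shell
# Hypothetical helper: print the <value> that follows a given <name> in a
# Hadoop-style *-site.xml file. Assumes one <name> or <value> per line.
site_property() {
  file="$1"   # e.g. ~/work/hadoop/conf/core-site.xml
  want="$2"   # e.g. fs.default.name
  awk -v want="$want" '
    /<name>/ {
      n = $0; sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n)
    }
    /<value>/ {
      v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
      if (n == want) { print v; exit }
    }
  ' "$file"
}

# Example: site_property ~/work/hadoop/conf/core-site.xml fs.default.name
```

The value printed for fs.default.name is the one that the HBase configuration has to agree with later on.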

Make sure you can ssh without a password to the hostname used in the configs

The Hadoop and HBase start/stop scripts use ssh to access the various servers. In Pseudo-Distributed mode everything runs on the localhost, but we still need to allow the scripts to ssh to the localhost.

Check that you can ssh to the localhost (or whatever hostname you used in the above configs)

We’re assuming that we’ll be running the Hadoop/HBase servers as the same user as our login. You can set things up to run as a hadoop user, but it’s kind of complicated on Mac OS X. See the section File System Layout in an earlier post, Hadoop, HDFS and Hbase on Ubuntu & Macintosh Leopard. That section and a few other points throughout that post describe how to create and use a hadoop user to run the Hadoop and HBase servers.

Back to just doing this as our own user. Test that you can ssh to the localhost without a password:

ssh localhost

If you see something like the following paragraph that ends up with a password prompt, then you need to add a key to your ssh setup that does not need a password (you may need to say yes if you are asked if you want to continue connecting).

The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 3c:5d:6a:39:64:78:02:9d:a3:c9:69:68:50:23:71:eb.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Password:

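To create a passwordless key and add it to your set of authorized keys, do the following as yourself, not as root. The key file name can be arbitrary, but ssh only tries the default names (such as ~/.ssh/id_dsa) on its own. If you keep a non-default name like id_dsa_for_hadoop, also add the line IdentityFile ~/.ssh/id_dsa_for_hadoop to ~/.ssh/config so that ssh actually offers the key. Alternatively, setting export HADOOP_SSH_OPTS="-i $HOME/.ssh/id_dsa_for_hadoop" in hadoop-env.sh covers the Hadoop scripts, though not a plain ssh localhost test: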

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa_for_hadoop
cat ~/.ssh/id_dsa_for_hadoop.pub >> ~/.ssh/authorized_keys

If you have strong alternative opinions on how to set up your own keys to accomplish the same thing please do it your own way. This is just the basic way of doing a passwordless ssh. You may want to use a key you already have lying around or some other mechanism.
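Because ssh only tries default key names such as ~/.ssh/id_dsa on its own, a key called id_dsa_for_hadoop has to be named in the ssh client configuration before it is offered. A minimal sketch of doing that safely, so that repeated runs do not pile up duplicate lines, follows; add_identity_file is a hypothetical helper name.

```shell
# Hypothetical helper: append an IdentityFile line to an ssh client config
# unless that exact line is already present.
add_identity_file() {
  config="$1"   # e.g. ~/.ssh/config
  key="$2"      # e.g. ~/.ssh/id_dsa_for_hadoop
  line="IdentityFile $key"

  mkdir -p "$(dirname "$config")"
  touch "$config"
  # -x matches whole lines only, -F treats the pattern as a fixed string
  grep -qxF "$line" "$config" || printf '%s\n' "$line" >> "$config"
  chmod 600 "$config"
}

# Example: add_identity_file ~/.ssh/config '~/.ssh/id_dsa_for_hadoop'
```

Afterwards, ssh -o BatchMode=yes localhost true is a convenient test: BatchMode makes ssh fail instead of asking for a password, so a zero exit status means the key is being used.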

Start Hadoop

One-time format of the Hadoop File System

Only once, before the first time you use Hadoop, you have to create a formatted Hadoop File System. Don’t do this again once you have data in your Hadoop file system, as it will erase anything you have saved there. You may have to run this command again if you somehow corrupt your file system, but it’s not something to do lightly the second time.

~/work/hadoop/bin/hadoop namenode -format

If all goes well, you should see something like:

10/05/02 18:45:04 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = Psion.local/192.168.50.16
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/02 18:45:04 INFO namenode.FSNamesystem: fsOwner=rberger,rberger,admin,com.apple.access_screensharing,_developer,_lpoperator,_lpadmin,_appserveradm,_appserverusr,localaccounts,everyone,com.apple.sharepoint.group.2,com.apple.sharepoint.group.3,dev,com.apple.sharepoint.group.1,workgroup
10/05/02 18:45:04 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/02 18:45:04 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/02 18:45:04 INFO common.Storage: Image file of size 97 saved in 0 seconds.
10/05/02 18:45:04 INFO common.Storage: Storage directory /tmp/hadoop-rberger/dfs/name has been successfully formatted.
10/05/02 18:45:04 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Psion.local/192.168.50.16
************************************************************/
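Since formatting a second time wipes the file system, a small guard around the command can help. The sketch below assumes that a formatted NameNode storage directory contains a current subdirectory, and that the storage directory is the one reported in the log output (by default under /tmp/hadoop-<your user>/dfs/name). format_namenode_once is a hypothetical name.

```shell
# Hypothetical guard: refuse to format when the NameNode storage directory
# already holds a formatted image (assumed to live in <name_dir>/current).
format_namenode_once() {
  hadoop_home="$1"   # e.g. ~/work/hadoop
  name_dir="$2"      # e.g. /tmp/hadoop-$USER/dfs/name

  if [ -d "$name_dir/current" ]; then
    echo "already formatted, not touching: $name_dir" >&2
    return 1
  fi
  "$hadoop_home/bin/hadoop" namenode -format
}

# Example: format_namenode_once ~/work/hadoop "/tmp/hadoop-$USER/dfs/name"
```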

Starting and stopping Hadoop

Now you can start Hadoop. You will use this command to start Hadoop in general:

~/work/hadoop/bin/start-all.sh

You can stop Hadoop with the command

~/work/hadoop/bin/stop-all.sh

But remember if you are running HBase, stop that first, then stop Hadoop.

Making sure Hadoop is working

You can see the Hadoop logs in ~/work/hadoop/logs

You should be able to see the Hadoop NameNode web interface at http://localhost:50070/ and the JobTracker web interface at http://localhost:50030/. If not, check that you have 5 java processes running, each with one of the following class names at the end of its command line (as seen from a ps ax | grep hadoop command):

org.apache.hadoop.mapred.JobTracker
org.apache.hadoop.hdfs.server.namenode.NameNode
org.apache.hadoop.mapred.TaskTracker
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
org.apache.hadoop.hdfs.server.datanode.DataNode

If you do not see these 5 processes, check the logs in ~/work/hadoop/logs/*.{out,log} for messages that might give you a hint as to what went wrong.
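Checking for the five daemons by eye gets tedious, so here is a sketch that reads a process listing on standard input and prints whichever of the five class names is missing. The function name missing_hadoop_daemons is made up for this example.

```shell
# Hypothetical helper: read a "ps" listing on stdin and print the class name
# of every expected Hadoop daemon that does not appear in it.
# Returns 0 when all five are present, 1 otherwise.
missing_hadoop_daemons() {
  listing="$(cat)"
  status=0
  for class in \
    org.apache.hadoop.mapred.JobTracker \
    org.apache.hadoop.hdfs.server.namenode.NameNode \
    org.apache.hadoop.mapred.TaskTracker \
    org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode \
    org.apache.hadoop.hdfs.server.datanode.DataNode
  do
    # Full class names are matched as fixed strings, so NameNode and
    # SecondaryNameNode cannot be mistaken for each other.
    if ! printf '%s\n' "$listing" | grep -qF "$class"; then
      echo "$class"
      status=1
    fi
  done
  return $status
}

# Example: ps axww | missing_hadoop_daemons
```

The ww flags ask ps not to truncate long command lines, which matters here because the class name comes at the very end.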

Run some example map/reduce jobs

The Hadoop distro comes with some example / test map / reduce jobs. Here we’ll run them and make sure things are working end to end.

cd ~/work/hadoop
# Copy the input files into the distributed filesystem
# (there will be no output visible from the command):
bin/hadoop fs -put conf input
# Run some of the examples provided:
# (there will be a large amount of INFO statements as output)
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
# Examine the output files:
bin/hadoop fs -cat output/part-00000

The resulting output should be something like:

3	dfs.class
2	dfs.period
1	dfs.file
1	dfs.replication
1	dfs.servers
1	dfsadmin
1	dfsmetrics.log

Configuring HBase

The following config files all reside in ~/work/hbase/conf. As mentioned earlier, use a FQDN or a Bonjour name instead of localhost if you need remote clients to access HBase. But if you don’t use localhost here, make sure you do the same in the Hadoop config.

hbase-env.sh

Add the following line below the commented-out JAVA_HOME line in hbase-env.sh:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home

Add the following line below the commented out HBASE_CLASSPATH= line

export HBASE_CLASSPATH=${HOME}/work/hadoop/conf

hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
    <description>The directory shared by region servers.
    </description>
  </property>
</configuration>
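An XML file may carry only one <?xml ...?> declaration, and it has to come first. A stray second copy makes the parser reject hbase-site.xml with an error like: The processing instruction target matching "[xX][mM][lL]" is not allowed. A quick sanity check, sketched here with a hypothetical function name, is to count the declarations:

```shell
# Hypothetical check: count "<?xml " declarations in a file. A well-formed
# config has exactly one. The trailing space in the pattern keeps
# "<?xml-stylesheet" from being counted.
xml_declaration_count() {
  grep -oF '<?xml ' "$1" | wc -l | tr -d ' '
}

# Example: xml_declaration_count ~/work/hbase/conf/hbase-site.xml
```

Anything other than 1 means the header needs cleaning up before HBase will start.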

Making Sure HBase is Working

First start HBase (Hadoop must already be running):

~/work/hbase/bin/start-hbase.sh

If you then do a ps ax | grep hbase you should see two java processes. One should end with:
org.apache.hadoop.hbase.zookeeper.HQuorumPeer start
And the other should end with:
org.apache.hadoop.hbase.master.HMaster start
Since we are running in Pseudo-Distributed mode, there will not be any separate regionserver processes running. If you have problems, check the logs in ~/work/hbase/logs/*.{out,log}

Testing HBase using the HBase Shell

From the unix prompt give the following command:

~/work/hbase/bin/hbase shell

Here are some example commands from the Apache HBase Installation Instructions:

hbase> # Type "help" to see shell help screen
hbase> help
hbase> # To create a table named "mylittletable" with a column family of "mylittlecolumnfamily", type
hbase> create "mylittletable", "mylittlecolumnfamily"
hbase> # To see the schema for the "mylittletable" table you just created and its single "mylittlecolumnfamily", type
hbase> describe "mylittletable"
hbase> # To add a row whose id is "myrow", to the column "mylittlecolumnfamily:x" with a value of 'v', do
hbase> put "mylittletable", "myrow", "mylittlecolumnfamily:x", "v"
hbase> # To get the cell just added, do
hbase> get "mylittletable", "myrow"
hbase> # To scan your new table, do
hbase> scan "mylittletable"

You can stop hbase with the command:

~/work/hbase/bin/stop-hbase.sh

Once that has stopped you can stop hadoop:

~/work/hadoop/bin/stop-all.sh
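The ordering matters in both directions: Hadoop has to be up before HBase starts, and HBase has to be down before Hadoop stops. A pair of wrapper functions makes it hard to get wrong. This is just a sketch; start_stack and stop_stack are hypothetical names, and the default of ~/work matches the layout used in this post.

```shell
# Hypothetical wrappers that enforce the start/stop ordering.
# The optional argument is the directory holding the hadoop and hbase links.

start_stack() {
  work="${1:-$HOME/work}"
  # Hadoop first; HBase is only started if Hadoop's script succeeded.
  "$work/hadoop/bin/start-all.sh" && "$work/hbase/bin/start-hbase.sh"
}

stop_stack() {
  work="${1:-$HOME/work}"
  # HBase first; Hadoop is only stopped once HBase's script succeeded.
  "$work/hbase/bin/stop-hbase.sh" && "$work/hadoop/bin/stop-all.sh"
}

# Example: start_stack   (later)   stop_stack
```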

Conclusion

You should now have a fully working Pseudo-Distributed Hadoop / HBase setup on your Mac. This is not suitable for any kind of large data or production project. In fact it will probably fail if you try to do anything with lots of data or high volumes of I/O. HBase does not seem to work well until you have 4 – 5 regionservers.

But this Pseudo-Distributed version should be fine for doing experiments with tools and small data sets.

Now I can get on with playing with Cascading-Clojure and Cascalog!


25 comments to HBase/Hadoop on Mac OS X (Pseudo-Distributed)

  • jgc

    Despite these very helpful instructions, I can’t get the passwordless ssh to work. I still get asked for a password. (I’m using the standard user account on Snow Leopard.) Are there any other settings in the System Preferences (for example Sharing) that are important to have set correctly?

    Thanks.

  • jgc

    Having played with the passwordless ssh problem some more, I realise what to change here to get it to work. The comment “The id_dsa file name can be arbitrary” is wrong, at least for me on Mac OS X 10.6.3. When I change it from id_dsa_for_hadoop to just id_dsa it then works.

  • sorry to nitpick, but you left the ./start-hbase.sh step out! Otherwise, excellent tutorial!

  • bob dobbs

    as a further expansion to jgc’s self-discovered solution, the id_dsa file name _can_ be arbitrary, you just need to tell hadoop what the new arbitrary name is (~/.ssh/id_dsa is ssh’s default). You can use the name suggested in the main article by adding the following to the hadoop-env.sh config file:

    export HADOOP_SSH_OPTS="-i /Users/$USER/.ssh/id_dsa_for_hadoop"

    It might seem like extra work, but using the non-default key location is very useful if you’re already using key-based authentication with ssh, or have multiple identities that you don’t want mixed (i.e. you don’t want hadoop using your main personal private key for a test setup).

  • Shradds

    Excellent tutorial!!

    But theres a typo in the modifications to be made to hbase-site.xml file. Header appears twice.

    ……

    Sorry its kinda a small thing to point out.

  • This tutorial is great ! Thank you.
    Small mistake though :
    In hbase-site.xml, line 3, you don’t need the following :

    Makes everything crash (for me at least)

  • Sherin

    Hi,

    I configured my hadoop and hbase on my Mac os x system as explained above.
    1st thing i noticed that i m getting these errors -:

    2011-03-04 17:01:19,831 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /Users/sherin/Desktop/hadoop-0.20.2/tmp/dfs/data: namenode namespaceID = 251315710; datanode namespaceID = 341783780

    when i logging into hbase shell and creating a table create “mylittletable”, “mylittlecolumnfamily”, I m getting this error -:

    NativeException: org.apache.hadoop.hbase.MasterNotRunningException: null

    in logs i saw that 7:30:35,726 INFO org.apache.hadoop.hbase.util.FSUtils: Waiting for dfs to come up…

    Please tell me why it is happenning like this.

    thanks

  • I don’t know what is happening offhand. If you don’t really have any important data in your store, I would suggest stopping all the hadoop/hbase processes (and make sure they are all stopped by doing a ps). Then delete everything below the hadoop/dfs directory (assuming your dfs data and name directories are there). Then make sure there really is nothing in the hadoop/dfs directory. Then try the hadoop namenode format command again: hadoop/bin/hadoop namenode -format. See if that helps. It might have been that somehow the namenode format was done improperly.

    Obviously adjust the paths for these commands and directories to match your layout. I would also make sure that the dfs directory does match the path in your hadoop/conf/hdfs-site.xml file.

    This is all assuming that accessing hdfs/hbase never worked for you.

    But more likely if nothing ever worked, something is wrong with the setup in general. This tutorial is a bit dated now. There may be easier/better ways to install Hadoop and HBase on the Mac. One option is to use VirtualBox, run Linux in the virtualbox and then use the Cloudera CDH3 packages to install. This assumes you are just playing around with Hadoop and HBase which would be true for doing it directly on the Mac as well. If you want to do anything serious you really need at least 3 regionserver/datanode instances and at least one hbase master/hdfs namenode/secondarynamenode zookeeper. And if you are REALLY doing something you need even more than that. We found (this was about a year ago or more) that we needed at least 6 regionserver/datanodes for HBase to be really happy. I think that is not so much an issue now. But again, HBase/Hadoop really is all about distributed parallel processing and needs enough hardware to be distributed on to be worth it.

    The use of the Mac or a VM on your laptop or whatever is useful for learning and doing development with the assumption you will not do any real work with that kind of setup. I recommend that you check out Cloudera and the Hadoop or HBase IRC channels for more assistance.

  • jazeps basko

    jgc, for passwordless login to work you should make your ssh client to use that newly created private key file.

    Create file ~/.ssh/config

    Add this line:
    IdentityFile ~/.ssh/id_dsa_for_hadoop

  • How about on FreeBSD? I think there will be no problem.

  • There are probably other tutorials that are more appropriate for xBSD. Also this tutorial is pretty much out of date now…

  • Kx

    I receive “Unable to load realm info from SCDynamicStore” error while trying to run the Hadoop examples and “[Fatal Error] hbase-site.xml:3:6: The processing instruction target matching “[xX][mM][lL]” is not allowed.” when I try to test Hbase using the Hbase shell. Any ideas what causes these errors? I am running Hadoop and Hbase on macbook pro with OS Lion

  • Kx

    Fixed the “[xX][mM][lL]” is not allowed.” error. Your example code for hbase-site.xml had 2 copies of the line:

    You need to delete one of the lines

  • david gerber

    I am getting the same error Unable to load realm info from SCDynamicStore when running examples. Will this interfere with running HBase

  • Best place to ask such questions is the HBase list

  • david gerber

    I decided to run hbase as standalone because I could not get it to work as pseudo distributed. I know you mentioned hbase list and I will definitely look into that but just by chance maybe somebody or yourself have encountered this error.

    ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately. This could be a sign that the server has too many connections (30 is the default). Consider inspecting your ZK server logs for that error and then make sure you are reusing HBaseConfiguration as often as you can. See HTable’s javadoc for more information

    this happens when I run the create ‘table_name’,’value’ in the shell

    help would be appreciated if you can.

    thanks

  • Thank you, wonderful tutorial!

  • Jignesh

    David,
    I am having same problem and when I tried to look for log I got following error.

    ERROR org.apache.hadoop.hbase.HServerAddress: Could not resolve the DNS name of unknowne4ce8f3898fc
    FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
    java.lang.IllegalArgumentException: hostname can’t be null

    I have asked HBase forum and they said fix your DNS resolving. But when I go home and trying to work it works but not in the office network. Any idea for resolving this.

    And off course this is a great tutorial and still not out of date.

  • whether text distribution among nodes
    is possible in hadoop 1.0.0 version

  • iowissen

    very helpful tutorial! i followed it also successfully on FreeBSD 9.0-RELEASE with diablo-jdk1.6.0 (ported together with openjdk7).

  • Good tutorial. I am wondering how HBase works in a Pseudo-Distributed environment without doing set up for zookeeper. We have set up HBase on a unix machine in standalone mode. But as far as I know, in distributed mode we need to have a zookeeper quorum running. Please suggest.

  • Manish: I’m pretty sure you need zookeeper for any distributed mode (Pseudo or full). The big thing to remember though is HBase is pretty much useless for anything other than very basic API testing/development in Pseudo-Distributed mode. And it’s funky until you have at least 5 or 6 regionservers. Also if you plan to do any Map/Reduce you really want the parallelism of multiple machines.

    HBase only makes sense if you can start at that kind of scale. You can get away with a single node being the zookeeper, hbase master, namenode, secondarynamenode, jobtracker, etc and then have the regionservers, hdfs slaves, tasktracker etc on all the other machines. Though you really want to quickly get to the point where the secondarynamenode is on a machine different than the namenode and to have at least 3 zookeepers. But that is more about robustness and handling some class of failures than basic functionality.

    If you only have the scale or ability to use a single machine you should consider some other kind of store.

  • Oh and this tutorial is now pretty out of date. We’ve switched to using the Cloudera CDH3 distribution. I haven’t tried to get CDH3 to work on the Mac. You need to do it from their tarballs, though I believe Homebrew now has formulas for building Hadoop and HBase on a Mac.

  • patrick

    hi,
    how do you limit the hdfs filesystem format to just a portion of the disk? i don’t want hadoop to eat up all the storage on my imac.

    thanks in advance,
    patrick

  • [...] Read more: HBase/Hadoop on Mac OS X (Pseudo-Distributed) « Cognizant Transmutaion [...]
