Ask ten DBAs for a definition of ‘Big Data’ and you will get more than ten replies, and the majority of those replies will lead you to Hadoop. Hadoop has been the most prominent of the big data frameworks in the open source world. Over 80% of the Hadoop instances in the world feed their data from MySQL1. But Hadoop is made up of many parts, some confusing and many that do not play nicely with each other. It is analogous to being given a pile of automotive parts from different models and trying to come up with a car at the end of the day. So what do you do if you want to copy some of your relational data into Hadoop and avoid the equivalent of scraped knuckles? The answer is Bigtop, and what follows is a way to get a one-node, does-it-all system running so you can experiment with Hadoop, Map/Reduce, Hive, and all the other parts.
Bigtop is an Apache Project self described as:
The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.
So much the same way you can use a Linux distribution such as Ubuntu instead of rolling your own, you can get all the Hadoop parts you need to start from Bigtop. At OSCON this year, Cloudera’s Mark Grover had a fantastic presentation on Bigtop. Mr. Grover includes some dummy data based on US Postal Service ZIP Codes and income information, plus a setup script for Ubuntu users. How To Install Hadoop distribution from Bigtop 0.6.0 details the installation instructions for a few more Linux distributions. So grab the latest Bigtop, install the software as per the directions, and get ready to enter Hadoop-ville.
Next you will need the latest Java Development Kit (JDK) installed. Then the name node is initialized with
/etc/init.d/hadoop-hdfs-namenode init. Start the hadoop-hdfs-namenode and hadoop-hdfs-datanode services with
service hadoop-hdfs-namenode start; service hadoop-hdfs-datanode start. Run the
/usr/lib/hadoop/libexec/init-hdfs.sh script to initialize the file system (aren’t you glad this is the simplified way?), and finally start the YARN daemons with
service hadoop-yarn-resourcemanager start; service hadoop-yarn-nodemanager start.
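The steps above can be collected into one short script. This is a sketch, assuming the Bigtop packages put their init scripts and helpers in the default locations shown here; adjust the paths if your distribution differs, and run it as root:

```shell
#!/bin/sh
# One-time bring-up for a single-node Bigtop Hadoop install.

# Initialize the name node (do this only once -- it resets HDFS metadata)
/etc/init.d/hadoop-hdfs-namenode init

# Bring up HDFS
service hadoop-hdfs-namenode start
service hadoop-hdfs-datanode start

# Populate the initial HDFS directory layout (/tmp, /user, and friends)
/usr/lib/hadoop/libexec/init-hdfs.sh

# Bring up the YARN daemons
service hadoop-yarn-resourcemanager start
service hadoop-yarn-nodemanager start
```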
Now you can test with a provided example program:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples*.jar pi 10 1000, which estimates the value of Pi.
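If the pi job refuses to run, a quick sanity check is to confirm the daemons are actually up and HDFS is answering. A sketch (jps ships with the JDK; the listing just needs to come back without an error):

```shell
# List the running Hadoop JVMs -- you should see NameNode, DataNode,
# ResourceManager, and NodeManager among them
jps

# Ask HDFS for a root directory listing to confirm the file system is live
hadoop fs -ls /
```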
Mr. Grover provides the test data set that was mentioned earlier. You will have to use Hive to create the table
hive -e "CREATE TABLE zipcode_incomes (id STRING, zip STRING, description1 STRING, description2 STRING, income INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
(very SQL DDL like, eh?) and then comes the loading of that demo data
hive -e "LOAD DATA LOCAL INPATH 'workspace/dataset/DEC_00_SF3_P077_with_ann.csv' OVERWRITE INTO TABLE zipcode_incomes"2. Gee, not all that different than what most MySQL DBAs deal with daily. Start hive and enter the oddly familiar select max(income), min(income), avg(income) from zipcode_incomes;.
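If your copy of the data lives in MySQL rather than in a CSV file, the SELECT INTO OUTFILE route looks like this. A sketch, not a recommendation: the MySQL table and column names here simply mirror the Hive table and are assumptions, and the server's secure_file_priv setting must allow writing to /tmp:

```shell
# Dump the MySQL copy of the table to a comma-delimited file
# (table/column names are assumed to match the Hive table -- adjust
# to your schema; -p will prompt for the password)
mysql -u root -p test -e "SELECT id, zip, description1, description2, income
  FROM zipcode_incomes
  INTO OUTFILE '/tmp/zipcode_incomes.csv'
  FIELDS TERMINATED BY ','"

# ...then load that file into the Hive table
hive -e "LOAD DATA LOCAL INPATH '/tmp/zipcode_incomes.csv'
  OVERWRITE INTO TABLE zipcode_incomes"
```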
‘But there has to be a better way to move data from MySQL into Hadoop than SELECT INTO OUTFILE?’ I hear you cry. There is, and in the next installment we will look at the MySQL Applier for Hadoop!
- My suspicion is that the popularity of MySQL in the Hadoopverse is partly due to the popularity of MySQL itself, and that Cloudera and Hortonworks hired many former MySQL Training Team members who have proven to be quite influential.
- This uses the data file loaded into MySQL earlier in Mr. Grover’s script. Of course we could have dumped the data into another file and loaded that, but that has been skipped for clarity.