Copying MySQL Data to Hadoop with Minimal Loss of Blood Part 1

Ask ten DBAs for a definition of ‘Big Data’ and you well get more than ten replies. And the majority of those replies will lead you to Hadoop. Hadoop has been the most prominent of the big data frameworks in the open source world. Over 80% of the Hadoop instances in the world are feed their data from MySQL1. But Hadoop is made up of many parts, some confusing and many that do not play nicely with each other. It is analogous to being given a pile of automotive parts from different models and tyring to come up with a car at the end of the day. So what if you do if you are wanting to copy some of your relational data into Hadoop and want to avoid the equivilent of scraped knuckles? The answer is Bigtop and what follows is a way to get a one node does all system running so you can experiement with Hadoop, Map/Reduce, Hive, and all the other parts.

Bigtop is an Apache Project self described as:

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.

So much the same way you can use a Linux distribution over rolling your own by using Ubuntu or similar, you can get all the Hadoop parts you need to start from Bigtop. At OSCON this year, Cloudera’s Mark Grover had a fantastic presentation on Bigtop. Mr. Grover includes some dummy data based on US Postal Service Zip Codes and income information and a setup script for Unbunu users. How To Install Hadoop distribution from Bigtop 0.6.0 details the installation instruction for a few more Linux distrinutions. So grab the lastest Bigtop, install the software as per the directions, and get ready to enter Hadoop-ville.

Next you will need the latest Java Developers Kit (JDK) installed. Then the name node is initialized /etc/init.d/hadoop-hdfs-namenode init, start the hadoop-hdfs-namenode & hadoop-hdfs-datanode services service start hadoop-hdfs-namenode start; service hadoop-hdfs-datanode start, run the /usr/lib/hadoop/libexec/ script to initialize the file system, (aren’t you glad this is the simplified way), and finally start the Yarn daemons service hadoop-yarn-resourcemanager start;service hadoop-yarn-nodemanager start.

Now you can test with an provided example program hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples*.jar pi 10 1000 for counting Pi.

Mr, Grover provides a test data set

Using hive to find, MIN, MAX, and AVG incomes. Looks very SQL-like

Using hive to find, MIN, MAX, and AVG incomes. Looks very SQL-like

that was mentioned earlier. You will have to use hive to create the table hive -e "CREATE TABLE zipcode_incomes(id STRING, zip STRING, description1 STRING, description2 STRING, income INT)
(very SQL DDL like, eh?) and then comes the loading of that demo data hive -e "LOAD DATA LOCAL INPATH 'workspace/dataset/DEC_00_SF3_P077_with_ann.csv' OVERWRITE INTO TABLE zipcode_incomes"2. Gee, not all that different than what most MySQL DBAs deal with daily. Start hive and enter the oddly familiar select max(income), min(income), avg(income) from zipcode_incomes;.

‘But there has to be a better way to move data from MySQL into Hadoop other than SELECT INTO OUTFILE?’ I hear you cry. There is and in the next installment we will look at the MySQL Applier for Hadoop!


  1. My suspicion is that the popularity of MySQL in the Hadoopverse is partly due to the popularity of MySQL itself and that Cloudera and Horton Works hired many former MySQL Training Team members who have proven to be quite influential.
  2. This is using the data file loaded into MySQL earilier in Mr. Grover Script. Of course we could have dumped data into another file and loaded that but that has been skipped for clarity.

2 Responses to Copying MySQL Data to Hadoop with Minimal Loss of Blood Part 1

  1. tutor75 says:

    What do you think about sqoop ?

    “Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.”

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: