Using Pentaho Spoon to load data


Pentaho’s Spoon is an open source ETL (Extract, Transform, and Load) tool that makes it easy to load data from various formats into a MySQL server. Spoon also lets you check the data for problems and correct them before they are inserted. It does a lot of things very well to make a DBA’s life easier. For example, if people are entering their state of residence as ‘CA’, ‘Cal’, ‘Cal.’, and ‘California’, Spoon can clean up the data to the single value you need.

What follows is an example of transforming a CSV file into a series of INSERT statements. It is a very simple transformation, but it is a good initial exposure to how Spoon works. After that, it covers how to use Spoon with a bulk loading program.

There have been questions on the Calpont InfiniDB forums (http://www.infinidb.org/forums) about using Spoon to drive the cpimport bulk loader program. LOAD DATA INFILE can be slow (10,000 records a second versus 1.5 million a second for cpimport), and for the large data sets used with Calpont’s InfiniDB storage engine, Spoon can set up a job that becomes a ‘one button push’ data import or that can be scheduled to run at a predetermined time.
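As a rough illustration (not taken from the original walkthrough; the file names and the /usr/local/Calpont paths are assumptions based on a default install), the two load paths look like this from a shell prompt:

# Slower path: load through the MySQL front end
mysql test -e "LOAD DATA INFILE '/tmp/x.csv' INTO TABLE x FIELDS TERMINATED BY ',';"

# Faster path: build a bulk load job and run cpimport
/usr/local/Calpont/bin/colxml -j 101 -l x.tbl -t x test
/usr/local/Calpont/bin/cpimport -j 101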

This is the Welcome screen for Spoon
File -> New -> Transformation
The big canvas in the middle is for the layout of the transformation
From the Steps column on the left, expand the Input folder and drag CSV File Input to the canvas
A simple two-column comma separated file was created, and its location is entered on the Filename line
1,100
2,200
3,300
4,400
5,500
6,601
7,702
8,803
9,904

The contents of the CSV file
CREATE TABLE x (x INT, y INT) ENGINE=InfiniDB;

The DDL to create the target table. This table was created in the test database.
Select Preview and note that Spoon sees the two desired columns
Now a second step is needed, along with a hop to connect it to the first. From the Output folder under Steps, drag the SQL File Output icon to the canvas. Hold down the SHIFT key and drag the mouse from the CSV file input icon to the SQL File Output icon to designate the flow of data between the steps.
Double click on the SQL File Output icon and configure the output from this step. Select New to the right of the Connection line.
The connection is configured for an InfiniDB instance on the local system. Select the Test button
The connection test must report OK before proceeding
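If the test fails, the same connection details can be checked from a shell prompt; the host, port, and user below are placeholders rather than values from the screenshots:

mysql -h 127.0.0.1 -P 3306 -u root -p test -e 'SELECT 1;'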
Select the green arrow icon to run the transformation, then select Launch on the next screen
Under Execution Results, the transformation can be monitored. Here all 9 lines from the CSV file are processed by both steps.
INSERT INTO test.x(Field_000, Field_001) VALUES (1,100);
INSERT INTO test.x(Field_000, Field_001) VALUES (2,200);
INSERT INTO test.x(Field_000, Field_001) VALUES (3,300);
INSERT INTO test.x(Field_000, Field_001) VALUES (4,400);
INSERT INTO test.x(Field_000, Field_001) VALUES (5,500);
INSERT INTO test.x(Field_000, Field_001) VALUES (6,601);
INSERT INTO test.x(Field_000, Field_001) VALUES (7,702);
INSERT INTO test.x(Field_000, Field_001) VALUES (8,803);
INSERT INTO test.x(Field_000, Field_001) VALUES (9,904);
This file can now be sourced from the MySQL command line. However, for InfiniDB this is VERY slow for large data sets.
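For reference, sourcing the generated file from a shell looks roughly like this (x.sql is a placeholder for whatever Filename was set in the SQL File Output step):

mysql test < /tmp/x.sql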
To take advantage of the cpimport bulk loader, create a new transformation and select Text File Output from the Output folder. Hold the shift key and drag the mouse from the CSV File Input icon to the Text file output icon to set the data flow.
Double click the Text file output icon. The transformed data needs to be written to a file named x in the InfiniDB bulk import directory, /usr/local/Calpont/data/bulk/data/import. Be sure to set the Extension to tbl.
Under the Content tab, set the Separator to a pipe character (|).
Under the Field tab, select Get Fields. Then select OK
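Given the pipe separator and the sample CSV above, the x.tbl file written by this step should contain something like the following (assuming the header row option is turned off under the Content tab):

1|100
2|200
3|300
4|400
5|500
6|601
7|702
8|803
9|904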
The parameters for the bulk import are set with the colxml utility. An unused job number was selected (101). The other arguments declare that data from the file x.tbl will be loaded into the table x in the test database. Enter this command at a shell prompt:

/usr/local/Calpont/bin/colxml -j 101 -l x.tbl -t x test
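Before wiring cpimport into the transformation, the job can also be run by hand from a shell prompt to confirm that colxml produced a usable job file:

/usr/local/Calpont/bin/cpimport -j 101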
Another step and hop are established, following the same procedure as for the previous hop.
Double click on Text file output2. Set Filename to /usr/local/Calpont/bin/cpimport -j 101, click ‘Run this as a script instead?‘, and then click OK
Execute the transformation and the CSV file will be loaded into the InfiniDB instance.
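As a quick sanity check (a suggestion, not part of the original walkthrough), count the loaded rows from the mysql client; it should report 9 for the sample file above:

mysql test -e 'SELECT COUNT(*) FROM x;'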

2 Responses to Using Pentaho Spoon to load data

  1. Have you tested using this method with more than 10,000 records? I’ve tried it and when the first batch of 10,000 comes over, the bulk loader runs and then Kettle sits and spins. Have you seen this?

    • OpenSorceDBA says:

      Check your steps to see if you limit the number of records read or written to 10,000. I remember seeing such a limit but am away from my Spoon system right now and cannot check it.
