Pentaho’s Spoon is an open source ETL (Extract, Transform and Load) tool that makes it easy to load data from various formats into a MySQL server. Spoon also lets you check the data for problems and correct them before they are inserted, and it does a lot of things very well to make life easier for a DBA. So if people are entering their state of residence as ‘CA’, ‘Cal’, ‘Cal.’ and ‘California’, Spoon can clean up the data to the single value you need.
What follows is an example of transforming a CSV file into a series of INSERT statements. It is a very simple transformation but a good initial exposure to how Spoon functions. It then covers how to use Spoon with a bulk loading program.
There have been questions on the Calpont InfiniDB forums (http://www.infinidb.org/forums) about using Spoon to drive the cpimport bulk loader program. LOAD DATA INFILE can be slow (10,000 records a second versus 1.5 million a second for cpimport), and for the large data sets used with Calpont’s InfiniDB storage engine, Spoon can set up a job that becomes a ‘one button’ data import or can be scheduled to run at a predetermined time.
|This is the Welcome screen for Spoon|
|File -> New -> Transformation|
|The big canvas in the middle is for the layout of the transformation|
|From the Steps column on the left, expand the Input folder and drag CSV File Input to the canvas|
|A simple two column comma separated file was created and its location is entered on the Filename line|
|The contents of the CSV file|
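The screenshot of the file is not reproduced here, but its contents can be inferred from the INSERT statements the transformation generates later. A minimal Python sketch that creates an equivalent two column file (the filename x.csv is an assumption):

```python
# Write a small two-column CSV matching the sample data used in this
# transformation (values inferred from the generated INSERT statements).
rows = [(1, 100), (2, 200), (3, 300), (4, 400), (5, 500),
        (6, 601), (7, 702), (8, 803), (9, 904)]

with open("x.csv", "w") as f:
    for x, y in rows:
        f.write(f"{x},{y}\n")
```

Note there is no header row; Spoon assigns the generated field names Field_000 and Field_001 itself.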
|CREATE TABLE x (x INT, y INT) ENGINE=InfiniDB;|
|The DDL to create the target table. This table was created in the test database.|
|Select Preview and note that Spoon is seeing the two desired columns|
|Now a second step is needed. From the Output folder under Steps, drag the SQL File Output icon to the canvas. Hold down the SHIFT key and drag the mouse from the CSV file input icon to the SQL File Output icon to create a hop designating the flow of data between the steps.|
|Double click on the SQL File Output icon and configure the output from this step. Select New to the right of the Connection line.|
|The connection is configured for an InfiniDB instance on the local system. Select the Test button|
|The connection test should report OK|
|Select the green arrow icon to run the transformation. Select Launch on the next screen|
|Under Execution Results, the transformation can be monitored. Here all 9 lines from the CSV file are processed by the two steps.|
|INSERT INTO test.x(Field_000, Field_001) VALUES (1,100);
INSERT INTO test.x(Field_000, Field_001) VALUES (2,200);
INSERT INTO test.x(Field_000, Field_001) VALUES (3,300);
INSERT INTO test.x(Field_000, Field_001) VALUES (4,400);
INSERT INTO test.x(Field_000, Field_001) VALUES (5,500);
INSERT INTO test.x(Field_000, Field_001) VALUES (6,601);
INSERT INTO test.x(Field_000, Field_001) VALUES (7,702);
INSERT INTO test.x(Field_000, Field_001) VALUES (8,803);
INSERT INTO test.x(Field_000, Field_001) VALUES (9,904);|
|This file can now be sourced from the MySQL command line. However, for InfiniDB this is VERY slow for large data sets.|
|To take advantage of the cpimport bulk loader, create a new transformation and select Text File Output from the Output folder. Hold the shift key and drag the mouse from the CSV File Input icon to the Text file output icon to set the data flow.|
|Double click the Text file output icon. The transformed data needs to be written to a file named x in the InfiniDB bulk import directory, /usr/local/Calpont/data/bulk/data/import. Be sure to set the Extension to tbl.|
|Under the Content tab, set the Separator to a pipe character (|).|
|Under the Field tab, select Get Fields. Then select OK|
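What this step produces is simply the CSV data rewritten with a pipe separator. A self-contained Python sketch of the equivalent conversion (the sample rows and output path here are illustrative; Spoon writes to the bulk import directory mentioned above):

```python
import csv

# Sample rows as the CSV file input step would deliver them.
rows = [["1", "100"], ["2", "200"], ["3", "300"]]

# The Text file output step rewrites the rows with a pipe separator,
# producing the .tbl file that cpimport consumes.
with open("x.tbl", "w", newline="") as dst:
    writer = csv.writer(dst, delimiter="|")
    writer.writerows(rows)
```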
|/usr/local/Calpont/bin/colxml -j 101 -l x.tbl -t x test|
|The parameters for the bulk import are set with the colxml utility. An unused job number was selected (101). The other arguments declare that data from the file x.tbl will be loaded into the table x in the test database. Enter this command at a shell prompt.|
|Another hop is established, using the same procedure as for the previous hop.|
|Double click on Text file output2. Set Filename to /usr/local/Calpont/bin/cpimport -j 101, click ‘Run this as a script instead?’, and then click OK|
|Execute the transformation and the CSV file will be loaded into the InfiniDB instance.|
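The script step above amounts to invoking cpimport with the job number produced by colxml. As a hedged sketch, the same invocation could be driven from Python (the path and job number follow the example above; the subprocess call is commented out since it requires an InfiniDB install):

```python
import subprocess

def cpimport_command(job_id):
    """Build the cpimport invocation used as the script Filename above."""
    return ["/usr/local/Calpont/bin/cpimport", "-j", str(job_id)]

cmd = cpimport_command(101)
# On a system with InfiniDB installed, this would run the bulk load:
# subprocess.run(cmd, check=True)
```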