Move Salesforce Files out using Pentaho DI
We have developed our own internal CRM system and decided to stop using Salesforce. However, some data migration needs to be done first. Among the tasks, file migration is the most challenging one, since Salesforce and our CRM use two different file storage strategies.
The source is Salesforce Attachments: 30GB of files stored as blob data (base64 encoded) in Salesforce.
The target is our internal CRM system. There, the files themselves are stored in the file system, and a MySQL table named files records each file's information (name, full path, relative path, timestamps, etc.).
It's a one-time data integration task: move the files out of Salesforce into our file system, and record the file info in the files table.
The graph below shows the scenario:
We use the Pentaho Data Integration tool (called PDI below) to do the task. The rough steps are:
extract the Salesforce attachment data, transform it, and load it into a tmp table that stores all raw Salesforce attachment data plus some extra computed info (the destination name, relative path, and full path, generated in advance);
based on the tmp table records, generate the files in the file system and record the files' info in the destination table;
we don't fetch the Salesforce data in one go, but in multiple loops, and in each loop we fetch only a limited number of attachments (we set the row limit to 1000);
finally, look up the Lead/Account/Opportunity tables to generate the links between those objects and the files, and record the relations in the corresponding tables.
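The batched extraction loop described above can be sketched in plain JavaScript. Note this is an illustrative model, not the actual job: runExtraction, fetchBatch, and upsertIntoTmp are hypothetical names standing in for PDI job entries, and the Set stands in for the tmp table's unique key on salesforce_id.

```javascript
const BATCH_SIZE = 1000; // the row limit we set per loop

// Sketch of the batched extraction loop; the real work is done by PDI
// job entries, not by this function.
function runExtraction(fetchBatch, upsertIntoTmp) {
  let beginDate = new Date(0); // full mode: start from the epoch
  const seen = new Set();      // stands in for the UNIQUE KEY on salesforce_id
  while (true) {
    // fetch at most BATCH_SIZE attachments with SystemModstamp >= beginDate
    const batch = fetchBatch(beginDate, BATCH_SIZE);
    // the >= comparison re-fetches the boundary row; dedupe as the unique key would
    const fresh = batch.filter(r => !seen.has(r.id));
    if (batch.length > 0 && fresh.length === 0) break; // no progress: stop
    fresh.forEach(r => seen.add(r.id));
    upsertIntoTmp(fresh);
    if (batch.length < BATCH_SIZE) break; // last page
    beginDate = batch[batch.length - 1].systemModStamp; // advance the start point
  }
  return seen.size;
}
```

The dedupe step matters because resuming from the last SystemModstamp with a `>=` filter fetches the boundary row again; in the real job the unique key on salesforce_id absorbs that duplicate.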
The job receives a user-passed argument p_use_local_tmp, which tells the program either to generate files from the existing tmp table data as-is, or to load the tmp data first and then generate the files.
The loading path first tries to create the tmp table (J_create_salesforce_tmp_table), then loads the tmp data in multiple loops (with the Salesforce attachment row limit set to 1000) through T_imp_attachments_tmp.
tmp table:
CREATE TABLE `tmp_salesforce_files` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`salesforce_id` varchar(191) NOT NULL,
`is_deleted` tinyint(1) NOT NULL DEFAULT 0,
`parent_id` varchar(191) DEFAULT NULL,
`name` varchar(260) NOT NULL,
`is_private` tinyint(1) NOT NULL DEFAULT 0,
`content_type` varchar(191) DEFAULT NULL,
`body_length` int(11) DEFAULT NULL,
`body` longtext DEFAULT NULL,
`owner_id` varchar(191) DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`created_by` varchar(191) DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
`updated_by` varchar(191) DEFAULT NULL,
`system_mod_stamp` datetime DEFAULT NULL,
`description` varchar(500) DEFAULT NULL,
`full_path` varchar(500) DEFAULT NULL,
`db_path` varchar(300) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `tmp_salesforce_files_unique1` (`salesforce_id`),
KEY `tmp_salesforce_files_idx1` (`system_mod_stamp`)
);
We provide two modes for the loops.
Incremental mode: first fetch the max timestamp among the Salesforce files already in the tmp table, and use it as the start point:
SELECT max(system_mod_stamp) as max_mod_date FROM tmp_salesforce_files
Full mode: just fetch all data from Salesforce.
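The choice of start date can be expressed as a small function. This is a sketch under assumed names (resolveBeginDate is mine); in the real job the max timestamp comes from a PDI Table input step feeding a variable.

```javascript
// Pick the SystemModstamp lower bound for the extraction loop.
// maxModDateFromTmp is the result of:
//   SELECT max(system_mod_stamp) FROM tmp_salesforce_files
function resolveBeginDate(mode, maxModDateFromTmp) {
  if (mode === 'full' || maxModDateFromTmp == null) {
    // full mode, or an empty tmp table: start from the epoch to fetch everything
    return new Date(0);
  }
  // incremental mode: resume from the newest timestamp already loaded
  return maxModDateFromTmp;
}
```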
Now we come to the core transformation, T_imp_attachments_tmp.
Salesforce Input is the step holding the query that extracts the Salesforce file data:
SELECT Id, IsDeleted, ParentId, Name, IsPrivate,
       ContentType, BodyLength, OwnerId, CreatedDate,
       CreatedById, LastModifiedDate, LastModifiedById,
       SystemModstamp, Description, Body
FROM Attachment
WHERE SystemModstamp >= ${p_begin_date}
ORDER BY SystemModstamp
There are several basic transformations; here I only show the important ones:
change name path full_path
This step generates the final name, path, and full_path of each file. It's a JavaScript script:
// determine the file extension: derive it from the content type when the
// name has none, otherwise take it from the name itself
if (name.indexOf('.') == -1) {
    var typeTmp = content_type;
    try {
        var fileExtension = (typeTmp.split('/'))[1];
    }
    catch (e) {
        var fileExtension = 'dat';
    }
    name = name + '.' + fileExtension;
}
else {
    var fileExtension = name.split('.').pop();
}
// build the relative path (grouped by creation date) and the absolute destination path
var path = 'salesforce/' + date2str(created_at, "yyyyMMdd") + '/' + attachment_id + '.' + fileExtension;
var full_path1 = getVariable('PROJECT_DIR', '') + '/tmp/' + path;
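Outside PDI, the same naming logic can be checked as a plain function. This is an illustrative standalone version: destinationPath is my name, and PDI's date2str and row fields are replaced with standard JavaScript equivalents.

```javascript
// Standalone sketch of the naming step above, for illustration only.
function destinationPath(name, contentType, attachmentId, createdAt, projectDir) {
  var fileExtension;
  if (name.indexOf('.') === -1) {
    // no extension in the name: derive one from the MIME type, e.g. "image/png" -> "png"
    var parts = (contentType || '').split('/');
    fileExtension = parts.length > 1 ? parts[1] : 'dat';
  } else {
    fileExtension = name.split('.').pop();
  }
  // yyyyMMdd, like PDI's date2str(created_at, "yyyyMMdd")
  var ymd = createdAt.toISOString().slice(0, 10).replace(/-/g, '');
  var path = 'salesforce/' + ymd + '/' + attachmentId + '.' + fileExtension;
  return projectDir + '/tmp/' + path;
}
```

Note that the destination filename is the attachment ID plus extension, not the original name; this avoids collisions when two attachments share a name.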
write file
This step writes the file to its physical location. It's also a JavaScript script:
// pseudo-imports of the Java classes used below
var File = java.io.File;
var FileUtils = Packages.org.apache.commons.io.FileUtils;
var Base64 = Packages.org.apache.commons.codec.binary.Base64;
// output fields
var success;
var message;
// decode the base64 body into raw bytes and write them to the destination
var outBytes = Base64.decodeBase64(Body);
try {
    FileUtils.writeByteArrayToFile(new File(full_path1), outBytes);
    success = 'Y';
    message = 'OK';
}
catch (e) {
    success = 'N';
    message = e.javaException.toString();
}
The base64 decode gives us the binary data, which we then write to the full path.
Now one issue remains: we need to recover the relationships between files and leads/accounts/opportunities (which file belongs to which object).
We use a transformation to achieve it.
The idea is: the first 3 characters of a Salesforce ID encode the object type, so the attachment's parent ID hints at which type of object it belongs to:

source prefix    target
001              accounts
00Q              leads
006              opportunities
Then we use the attachment's parent_id (the ID of the object that owns the file) to look up the exact CRM object, and record the file-object relationship in the corresponding DB table.
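The prefix lookup can be expressed as a small mapping function (the prefixes and target tables are from the table above; targetTableFor is a name of my choosing):

```javascript
// Map a Salesforce parent ID to the CRM table its attachment belongs to,
// using the 3-character key prefix of the ID.
const PREFIX_TO_TABLE = {
  '001': 'accounts',      // Account
  '00Q': 'leads',         // Lead
  '006': 'opportunities', // Opportunity
};

function targetTableFor(parentId) {
  return PREFIX_TO_TABLE[(parentId || '').slice(0, 3)] || null;
}
```

Returning null for an unknown prefix lets the job skip attachments whose parent is an object type we don't migrate.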