Move Salesforce Files out using Pentaho DI
We have developed our own internal CRM system and decided to stop using Salesforce. However, some data migration needs to be done first. Among the tasks, file migration is the most challenging one, since Salesforce and our CRM use two different file storage strategies.
The source is Salesforce Attachments: 30GB of files stored as blob data (base64 encoded) in Salesforce.
The target is our internal CRM system. There, the files themselves are stored in the file system, and a MySQL table named files records each file's information (name, full path, relative path, timestamps, etc.).
It's a one-time data integration task: move the files out of Salesforce into our file system, and record the file info in the files table.
The graph below shows the scenario:
We use the Pentaho Data Integration tool (called PDI below) to do the task. The rough steps are:
extract the Salesforce attachment data, transform it, and load it into a tmp table that stores all raw Salesforce attachment data plus some extra computed info (the destination name, relative path, and full path, generated in advance);
based on the tmp table records, generate the files in the file system and record the files' info in the destination table;
we don't fetch the Salesforce data in one go, but in multiple loops, and in each loop we fetch only a limited number of attachments (we set the row limit to 1000);
finally, look up the Lead/Account/Opportunity tables to generate the links between those objects and the files, and record the relations in the corresponding tables.
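The batched extraction loop described above can be sketched in plain JavaScript. Note this is an illustrative model, not the actual job: runExtraction, fetchBatch, and upsertIntoTmp are hypothetical names standing in for PDI job entries, and the Set stands in for the tmp table's unique key on salesforce_id.

```javascript
const BATCH_SIZE = 1000; // the row limit we set per loop

// Sketch of the batched extraction loop; the real work is done by PDI
// job entries, not by this function.
function runExtraction(fetchBatch, upsertIntoTmp) {
  let beginDate = new Date(0); // full mode: start from the epoch
  const seen = new Set();      // stands in for the UNIQUE KEY on salesforce_id
  while (true) {
    // fetch at most BATCH_SIZE attachments with SystemModstamp >= beginDate
    const batch = fetchBatch(beginDate, BATCH_SIZE);
    // the >= comparison re-fetches the boundary row; dedupe as the unique key would
    const fresh = batch.filter(r => !seen.has(r.id));
    if (batch.length > 0 && fresh.length === 0) break; // no progress: stop
    fresh.forEach(r => seen.add(r.id));
    upsertIntoTmp(fresh);
    if (batch.length < BATCH_SIZE) break; // last page
    beginDate = batch[batch.length - 1].systemModStamp; // advance the start point
  }
  return seen.size;
}
```

The dedupe step matters because resuming from the last SystemModstamp with a `>=` filter fetches the boundary row again; in the real job the unique key on salesforce_id absorbs that duplicate.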
The job receives a user-passed argument p_use_local_tmp, which tells the program either to generate files from the existing tmp table data as-is, or to load the tmp data first and then generate the files.
The loading path first tries to create the tmp table (J_create_salesforce_tmp_table), then loads the tmp data in multiple loops (with the Salesforce attachment row limit set to 1000) through T_imp_attachments_tmp.
tmp table:
CREATE TABLE `tmp_salesforce_files` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`salesforce_id` varchar(191) NOT NULL,
`is_deleted` tinyint(1) NOT NULL DEFAULT 0,
`parent_id` varchar(191) DEFAULT NULL,
`name` varchar(260) NOT NULL,
`is_private` tinyint(1) NOT NULL DEFAULT 0,
`content_type` varchar(191) DEFAULT NULL,
`body_length` int(11) DEFAULT NULL,
`body` longtext DEFAULT NULL,
`owner_id` varchar(191) DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`created_by` varchar(191) DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
`updated_by` varchar(191) DEFAULT NULL,
`system_mod_stamp` datetime DEFAULT NULL,
`description` varchar(500) DEFAULT NULL,
`full_path` varchar(500) DEFAULT NULL,
`db_path` varchar(300) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `tmp_salesforce_files_unique1` (`salesforce_id`),
KEY `tmp_salesforce_files_idx1` (`system_mod_stamp`)
);
We provide two modes for the loops.
Incremental mode: first fetch the max timestamp among the Salesforce files already in the tmp table, and use it as the start point:
SELECT max(system_mod_stamp) as max_mod_date FROM tmp_salesforce_files
Full mode: just fetch all data from Salesforce.
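The choice of start date can be expressed as a small function. This is a sketch under assumed names (resolveBeginDate is mine); in the real job the max timestamp comes from a PDI Table input step feeding a variable.

```javascript
// Pick the SystemModstamp lower bound for the extraction loop.
// maxModDateFromTmp is the result of:
//   SELECT max(system_mod_stamp) FROM tmp_salesforce_files
function resolveBeginDate(mode, maxModDateFromTmp) {
  if (mode === 'full' || maxModDateFromTmp == null) {
    // full mode, or an empty tmp table: start from the epoch to fetch everything
    return new Date(0);
  }
  // incremental mode: resume from the newest timestamp already loaded
  return maxModDateFromTmp;
}
```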
Now we come to the core transformation, T_imp_attachments_tmp.
Salesforce Input is the step holding the query that extracts the Salesforce file data:
SELECT Id, IsDeleted, ParentId, Name, IsPrivate,
       ContentType, BodyLength, OwnerId, CreatedDate,
       CreatedById, LastModifiedDate, LastModifiedById,
       SystemModstamp, Description, Body
FROM Attachment
WHERE SystemModstamp >= ${p_begin_date}
ORDER BY SystemModstamp
There are several basic transformations; here I only show the important ones:
change name path full_path
This step generates the final name, path, and full_path of each file. It's a JavaScript script:
// determine the file extension: derive it from the content type when the
// name has none, otherwise take it from the name itself
if (name.indexOf('.') == -1) {
    var typeTmp = content_type;
    try {
        var fileExtension = (typeTmp.split('/'))[1];
    }
    catch (e) {
        var fileExtension = 'dat';
    }
    name = name + '.' + fileExtension;
}
else {
    var fileExtension = name.split('.').pop();
}
// build the relative path (grouped by creation date) and the absolute destination path
var path = 'salesforce/' + date2str(created_at, "yyyyMMdd") + '/' + attachment_id + '.' + fileExtension;
var full_path1 = getVariable('PROJECT_DIR', '') + '/tmp/' + path;
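Outside PDI, the same naming logic can be checked as a plain function. This is an illustrative standalone version: destinationPath is my name, and PDI's date2str and row fields are replaced with standard JavaScript equivalents.

```javascript
// Standalone sketch of the naming step above, for illustration only.
function destinationPath(name, contentType, attachmentId, createdAt, projectDir) {
  var fileExtension;
  if (name.indexOf('.') === -1) {
    // no extension in the name: derive one from the MIME type, e.g. "image/png" -> "png"
    var parts = (contentType || '').split('/');
    fileExtension = parts.length > 1 ? parts[1] : 'dat';
  } else {
    fileExtension = name.split('.').pop();
  }
  // yyyyMMdd, like PDI's date2str(created_at, "yyyyMMdd")
  var ymd = createdAt.toISOString().slice(0, 10).replace(/-/g, '');
  var path = 'salesforce/' + ymd + '/' + attachmentId + '.' + fileExtension;
  return projectDir + '/tmp/' + path;
}
```

Note that the destination filename is the attachment ID plus extension, not the original name; this avoids collisions when two attachments share a name.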
write file
This step writes the file to its physical location. It's also a JavaScript script:
// pseudo-imports of the Java classes used below
var File = java.io.File;
var FileUtils = Packages.org.apache.commons.io.FileUtils;
var Base64 = Packages.org.apache.commons.codec.binary.Base64;
// output fields
var success;
var message;
// decode the base64 body into raw bytes and write them to the destination
var outBytes = Base64.decodeBase64(Body);
try {
    FileUtils.writeByteArrayToFile(new File(full_path1), outBytes);
    success = 'Y';
    message = 'OK';
}
catch (e) {
    success = 'N';
    message = e.javaException.toString();
}
The base64 decode gives us the binary data, which we then write to the full path.
Now one issue remains: we need to recover the relationships between files and leads/accounts/opportunities (which file belongs to which object).
We use a transformation to achieve it.
The idea is: the first 3 characters of a Salesforce ID encode the object type, so the attachment's parent ID hints at which type of object it belongs to:

source prefix    target
001              accounts
00Q              leads
006              opportunities
Then we use the attachment's parent_id (the ID of the object that owns the file) to look up the exact CRM object, and record the file-object relationship in the corresponding DB table.
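The prefix lookup can be expressed as a small mapping function (the prefixes and target tables are from the table above; targetTableFor is a name of my choosing):

```javascript
// Map a Salesforce parent ID to the CRM table its attachment belongs to,
// using the 3-character key prefix of the ID.
const PREFIX_TO_TABLE = {
  '001': 'accounts',      // Account
  '00Q': 'leads',         // Lead
  '006': 'opportunities', // Opportunity
};

function targetTableFor(parentId) {
  return PREFIX_TO_TABLE[(parentId || '').slice(0, 3)] || null;
}
```

Returning null for an unknown prefix lets the job skip attachments whose parent is an object type we don't migrate.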