Thursday, May 9, 2013

Data migration is hard to do

Data migration is hard to do

Throughout the past decade, I've dealt with innumerable data munging (a bona fide technical term, folks) tasks. Whether it's migrating an ancient database into a modern database or dragging massive data sets through new processing tools or a myriad other tasks, taking data in one form and turning it into data in a new form happens with massive frequency, every single day. It's still magic to the vast majority of folks, IT included.

Let's take a few minutes and try to explain how this whole thing works, at least on a high level. You may find that referencing this helps communicate delicate back-end processes to those who are lightly technical.

[ Keep up with the latest approaches to managing information overload and staying compliant in InfoWorld's Enterprise Data Explosion newsletter and interactive Data Explosion iGuide. ]

It all started with Excel
Let's pick an unfortunately common situation: the dreaded Excel spreadsheet. Some time ago in a company far, far away, it was determined that data needed to be collected pertaining to a business process -- inventory, sales, customers, what have you. Lacking the proper tools, someone cooked up an Excel spreadsheet and went nuts with it. As time passed and thousands of records were amassed, the spreadsheet became less and less functional, and the decision was finally made to turn this intos a real database. A consulting group was hired or an internal resource defined, and someone was handed the task.

The first order of business is to inspect the cleanliness of the data itself. In a perfect world, the spreadsheet is laid out sort of like a database, with each common field in its own column -- say, First Name, Last Name, Street, City, and so forth. However, that's not always the way it works. There might be a top-down layout with the information contained in individual rows under the same column, such as a column Contact with separate rows for full name, company, address, phone number, and so forth. The next column over might be Last Order or Sales 2012 or other data. This presents a much more challenging problem.

Let's look at the first situation because it's definitely the easiest. The data is relatively clean and well organized, and it can be exported to CSV and run through a custom parser to translate it into a database. A good CSV parser will pull all of those records into an array that can be peeled apart record by record and inserted into the new database. Within that process, checks of the data can be made, and the data can be modified to better fit the format of the new database.

For instance, we might run regexes over a Phone Number field to turn various phone number formats into one standard. This requires tossing all the special characters and reformatting the resulting string before it gets inserted into the new database. The operation will turn numbers such as (212) 555-1212, 212-555-1212, 2125551212, 212 555 1212, or 212.555.1212 into the likes of (212) 555-1212 for each record, which will help with readability and searching later on.


View the original article here

No comments:

Free Facebook Likes