TL;DR: Go here for a code sample.
This is a long-overdue post – I’ve been too busy lately with contract work and other commitments.
By the nature of my work, I have to do data migrations on a regular basis. Some of the migrations are from one database into another, but most are from Excel spreadsheets into a database.
A database-to-database migration is somewhat nerve-racking – usually the structure is broken, the data is dirty, fields are re-purposed, and all the other sorts of terrible things you read about on The Daily WTF. The best I’ve seen so far are along these lines: people’s middle names stored in a field called CompanyName2; employee information stored in a table called tbl_Jobs and information about actual jobs in a table called tbl_Job – see the difference? That is all to avoid confusion!
Database-to-database migrations have issues of their own kind, but I’ve found a lot in common across exports from Excel into SQL Server. So I’ll write these things down here – for my own future reference, and you, my dear reader, might find them useful as well.
Make data conversion repeatable
Depending on your situation, you might have to re-do the data migration with a new version of the spreadsheet: e.g. the first time around I was given an outdated version of the spreadsheet; or the first import went into a UAT environment, and a month later the same migration had to go into production with the latest data.
Anyway, my main rule is to make as few modifications to the spreadsheet as possible – you can’t (easily) script these changes, so next time you do the migration, you’ll have to repeat them by hand. And you don’t want to repeat your tasks.
So I rarely do data clean-up in Excel. But if something is genuinely easier to do in the spreadsheet, I write a VBA macro and save it in source control.
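Most clean-up, though, I’d rather script against the staging table so it re-runs with the next spreadsheet. A minimal sketch, assuming the #people staging table introduced later in this post (the exact clean-up rules here are just an illustration – adjust them to your data):

```sql
-- Trim stray whitespace and normalise obvious "no value" markers,
-- so the clean-up is repeatable instead of done by hand in Excel.
update #people
set Firstname = ltrim(rtrim(Firstname)),
    Surname   = ltrim(rtrim(Surname)),
    Dob       = case when ltrim(rtrim(Dob)) in ('', 'N/A', '-')
                     then null
                     else ltrim(rtrim(Dob))
                end;
```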
Save your migration script into version control
Chances are that if you re-do the migration, you’ll get a different data set with a different set of issues, so you’ll need to modify your migration script. And any piece of code that is modified over time should be in source control. Also, if you work in a team, your team mates will be able to pull down your script.
Basic Migration Script Parts
Always set date format
SET DATEFORMAT dmy;
By default SQL Server works in the US date format, which is great if you are in the US. I’m not, so I always specify what format I use for dates, to avoid confusing the server. And don’t get me started on the US date format!
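To see what difference this setting makes (the literal date here is just an example):

```sql
set dateformat dmy;
select convert(datetime, '25/06/2015');
-- parsed as 25 June 2015

set dateformat mdy;
select convert(datetime, '25/06/2015');
-- fails: 25 is treated as a month, giving an out-of-range error
```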
Always wrap your script in a transaction
    begin transaction tran1

    -- do your migration script here

    rollback transaction tran1
I always work in SQL Server Management Studio, where you can select a SQL command and hit F5 to execute only the selected code. I use this feature all the time: write a query in a file, execute just that query; adjust the query; save the whole file, move on. Sometimes I forget to select a specific query and hit F5 – executing the entire script in my file by mistake. And usually my scripts are written quickly and are not meant to be executed twice, or run in one go, so in these cases I’d have to clean up and start the migration again. Wrapping the entire script in a transaction helps to prevent such disasters, as long as you roll the transaction back.
Be careful with this one, though: I’ve seen people accidentally start a transaction but never commit it or roll it back, then start another transaction… So if you think you’ve made a change to the data, but can’t see the change in other select queries, most likely you have an open transaction (or a few nested ones). Just execute rollback transaction tran1 a few times until you see an error message.
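A quicker way to check, rather than rolling back until an error appears, is @@TRANCOUNT, which returns the number of open transactions on the current connection:

```sql
select @@TRANCOUNT;  -- 0 means no open transaction on this connection

-- note: a plain ROLLBACK unwinds all nested transactions in one go
if @@TRANCOUNT > 0
    rollback transaction;
```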
Use #temp tables to import data from Excel
Temp tables are almost like real tables, only temporary, and they are a good fit for migration work.
I always create temp tables with all fields as nvarchar(max). It is usually not worth trying to match the data types to what you have in Excel – most likely, in a column where you always see a date, one of the rows will contain something like ‘N/A’ that will throw an error when you attempt to convert it to datetime:
    create table #people (
        Firstname nvarchar(max) collate SQL_Latin1_General_CP1_CI_AS,
        Surname nvarchar(max) collate SQL_Latin1_General_CP1_CI_AS,
        Dob nvarchar(max) collate SQL_Latin1_General_CP1_CI_AS,
        DateOfJoin nvarchar(max) collate SQL_Latin1_General_CP1_CI_AS
    )
And include your target database’s collation when you create the table. If the collation of the DB you are migrating into does not match your temp table’s collation, you won’t be able to do text matching or comparison between the two. On an attempt to join or compare on a text field, you might see this error:
Cannot resolve the collation conflict between "Latin1_General_CI_AS" and "SQL_Latin1_General_CP1_CI_AS" in the equal to operation.
If you get this, drop your temp table and re-create it with the required collation specified.
Save Excel as a tab-delimited-values file
For some reason the CSV file format is the community favourite. But as soon as you have commas in the actual values, your process is going to be all screwed up. Instead, save your spreadsheets as tab-delimited: File => Save As => set the “Save as type” dropdown to Text (Tab delimited). This way you’ll save yourself headaches when there are commas in your data.
Migrating tab-delimited file into SQL Server takes the same effort as working with CSV:
    BULK INSERT #people
    FROM 'd:\MyFiles\Path\Personnel_Details.txt'
    WITH (
        FIELDTERMINATOR = '\t',
        ROWTERMINATOR = '\n'
    )
    GO
Convert columns into CSV list
Use regular expressions to convert columns into a comma-separated list. A lot of the time I find it useful to check data integrity with simple queries like
select * from companies where name not in ('CompanyA', 'CompanyB', 'CompanyC')
'CompanyA', 'CompanyB', 'CompanyC' are stored in the XLS as a column. So I copy-paste the column from the XLS into Notepad++ and apply a regular-expression replace: find (.+), replace with '\1',. This will convert this:

    CompanyA
    CompanyB
    CompanyC
    CompanyD

into:

    'CompanyA',
    'CompanyB',
    'CompanyC',
    'CompanyD',
You’ll only need to remove the final comma before pasting this into your SQL script.
Create insert SQL statements from columns
For simple XLS tables like lookups (Countries, Types, Product Categories), sometimes it is easier to generate a bunch of insert into statements in Notepad++. Copy-paste your column from Excel into Notepad++:
    Great Ltd
    BT
    Tesco
and use a regex replace – find (.+), replace with insert into Companies \(Name\) values \('\1'\) – to convert your column into:
    insert into Companies (Name) values ('Great Ltd')
    insert into Companies (Name) values ('BT')
    insert into Companies (Name) values ('Tesco')
Beware that this is a manual, non-repeatable step that you’ll have to re-do when (and if) you re-do the migration. The chance of this type of data changing is usually slim, but it exists, so you’ll need to decide whether this approach is worth it. It is advisable to save your regex in the same migration file as the rest of your SQL – for future reference.
Convert your stringy dates to datetime
Use the convert SQL function to change dates from text to datetime:

    select
        convert(datetime, Dob, 103) as DOB,
        convert(datetime, DateOfJoin, 103) as DateOfJoin
    from #people
This conversion works with the UK ‘dd/mm/yyyy’ format. Go here for more date formats available in SQL Server.
Excel being flexible is great, and that’s why people love it. But in a column of dates you’ll always find somebody putting “last year, around March”, which can never be converted to a datetime. And if you have enough rows in the table, it’ll be difficult to spot these by scrolling and eyeballing.
For these cases IsDate is very useful. Here is the documentation – there are examples, go check it out. So to spot a bad date entry, you’d run:
    SELECT * FROM #people
    WHERE ISDATE(DateOfJoin) = 0 and DateOfJoin is not null
This will give you a list of all records with an incorrect entry for DateOfJoin.
The same goes for numeric values. Use IsNumeric (documentation) to check whether any fields hold values that are not valid numbers:
    SELECT * FROM #people WHERE ISNUMERIC(SomeNumber) <> 1
Move data from temp tables
Once you are done with the clean-up of the temp tables, you’ll need to migrate the data into the real tables. I’m sure you know how to do this, but for reference I’ll add it here:
    insert into people (
        FirstName,
        Surname,
        DateOfBirth,
        DateOfJoin)
    select
        Firstname,
        Surname,
        convert(datetime, Dob, 103) as DateOfBirth,
        convert(datetime, DateOfJoin, 103) as DateOfJoin
    from #people
You may notice I have put aliases on the fields in the select part of the statement. These don’t do anything for the insert statement. But when you have more than 10 fields to insert and not all the field names match (and they rarely do!), you’ll be grateful to yourself in a month – it’ll be much easier to see which field is inserted where.
In many cases your database structure is much more rigid than the Excel you’re given. Your application might expect a date of birth to always be present for an employee, but the Excel does not always have that. In these cases you usually go back to the client and say “yo, dude! I need all the data”, and your client will be like “wha? Bro, it’ll take time. Can you deal with this without a DOB, bro? I know you can!”. And you have to deal with this crap. So you go ahead and use isnull(value, substitution) – a very useful function that checks whether the value is null and, if so, replaces it with the substitution. I believe this is widely known and a lot of you already use it:
select convert(datetime, isnull(Dob, '01/01/1900'), 103) as DateOfBirth from #people
In the opposite case, sometimes I get only whitespace in cells that should actually contain nothing (nulls). And you really don’t want to copy that noise into your clean new database. So use nullif(value, comparison) – a function that returns null if the two arguments are equal, and the first value otherwise:
select nullif(LeaveReason, '') as LeaveReason from #people
If statements are not supported inside select statements in SQL Server, so you need to use a case expression instead:
    select
        case
            when Sex = 'Male' then 0
            when Sex = 'Female' then 1
            else null
        end as Gender
    from #people
I’ve created a sample Excel file and migrated it into a sample database with all the above tricks applied. Go to GitHub for the repository.