Overview
CSV is a widely supported format, and as an integrator you will often find yourself loading a CSV extract into SQL so that it can be processed, integrated or presented in some way. This guide walks through a real-world example of loading a CSV export from a legacy system into Azure SQL.
Create a Staging Table to Receive the Data
When importing CSV into SQL, your options for how to represent the data are limited. The best approach is to load everything as-is into a temporary “staging” table, then use SQL to transform the data and move it to its final destination.
Importing CSV works best when the first row is the column headings. Open your CSV file and extract the column names for use in your staging table.
- Create a staging table to receive this data. This table MUST contain every column from your CSV file, and each column name must match the corresponding header in the CSV file. It is best to make each column a text type such as varchar and correct the types later. Don’t add a primary key or any constraints either: a constrained table rejects duplicate or bad rows at load time, which makes duplicates in the source data almost impossible to identify.
--create the stage schema if you don't have it
CREATE SCHEMA stage;
GO
--create the stage table
CREATE TABLE stage.CaseHistory (
    [Case No.] varchar(20),
    [History date] varchar(20),
    [Time Created] varchar(20),
    [Action type] varchar(20),
    [Action code] varchar(50),
    [Corr Type] varchar(10),
    [Details] varchar(max),
    [Typist code] varchar(50),
    [Handler] varchar(50),
    [Document Code] varchar(20),
    [No of units] varchar(20),
    [Cost value] varchar(20)
);
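Because the staging table has no constraints, duplicates in the source data are easy to surface once the file has been loaded. A sketch, assuming [Case No.], [History date] and [Time Created] together ought to be unique:
--find source rows that appear more than once (run after the load step later in this guide)
SELECT [Case No.], [History date], [Time Created], COUNT(*) AS Occurrences
FROM stage.CaseHistory
GROUP BY [Case No.], [History date], [Time Created]
HAVING COUNT(*) > 1;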
Connect Azure SQL to Blob Storage
In a real-world environment one of your colleagues may already have connected Azure SQL to a blob storage account, in which case you may want to skip ahead. Here’s a simple query that will check:
SELECT * FROM sys.external_data_sources
WHERE type_desc = 'BLOB_STORAGE'
Create a Database Master Key If You Need One
You will need to store a SAS token for blob storage in SQL, and that requires a database master key so the token can be encrypted. This snippet checks whether a key already exists before creating one. Be sure to replace ‘LONG_UNIQUE_STRING_HERE’ with a strong password of your own.
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys WHERE name LIKE '%DatabaseMasterKey%')
BEGIN
    SELECT 'DMK does not exist';
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'LONG_UNIQUE_STRING_HERE';
END
ELSE
BEGIN
    SELECT 'DMK exists';
END
Create a SAS Token in the Azure Portal
A SAS token grants SQL permission to access the container in your blob storage account; read-only access is all SQL needs. I would generally use the Azure CLI for this, but in my testing the generated tokens didn’t work, so I recommend using the ‘Shared access signature’ blade in the Azure Portal to generate a SAS with an expiry far enough in the future for your needs.
Save the SAS Token for Blob Storage in Azure SQL
You need to store the generated SAS token so that Azure SQL can use it. Copy the token from the Azure Portal and replace “SAS_TOKEN_HERE” in this script, removing the leading ? so that the secret starts with sv=
IF NOT EXISTS (SELECT * FROM sys.database_scoped_credentials WHERE name = 'BlobCredential')
BEGIN
    SELECT 'Credential does not exist';
    CREATE DATABASE SCOPED CREDENTIAL BlobCredential
        WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
        SECRET = 'sv=SAS_TOKEN_HERE';
END
ELSE
BEGIN
    SELECT 'Credential already exists';
END
Register Blob Storage as a Data Source in Azure SQL
Next we need to register the blob container as an external data source in Azure SQL. This requires CONTROL permission on the database.
CREATE EXTERNAL DATA SOURCE DataLoadInput
WITH (
TYPE = BLOB_STORAGE,
LOCATION = 'https://myblobaccount.blob.core.windows.net/sql-data-load',
CREDENTIAL = BlobCredential
);
Useful Debugging Commands
--list the database scoped credentials
SELECT * FROM sys.database_scoped_credentials;

--list the external data sources
SELECT * FROM sys.external_data_sources
WHERE type_desc = 'BLOB_STORAGE';

--grant the CONTROL permission required to create the external data source
GRANT CONTROL ON DATABASE::[my-database] TO [username];
Load the CSV File From Blob Storage
Now that the data source is registered we can load the CSV into the staging table.
First, upload your CSV file to blob storage. You can do this using the excellent Azure Storage Explorer or azcopy.
Next we load the data into the staging table. We are using some dynamic SQL here so that we can parameterize the filename for future reuse.
DECLARE @FileName varchar(300) = 'Case History.csv';
--FIRSTROW = 2 skips the header row
DECLARE @bulksql NVARCHAR(4000) = 'BULK INSERT stage.[CaseHistory] FROM ''' + @FileName + ''' WITH ( DATA_SOURCE = ''DataLoadInput'', FORMAT = ''CSV'', FIRSTROW = 2)';
EXEC(@bulksql);
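Assuming the load succeeds, a quick sanity check is worthwhile before moving on, for example:
--eyeball a few rows and confirm the row count roughly matches the source file
SELECT TOP (10) * FROM stage.CaseHistory;
SELECT COUNT(*) AS RowsLoaded FROM stage.CaseHistory;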
If you have any issues here, the most common problems are:
- differences in column names
- columns longer than expected (increase the varchar length)
- CSV encoding issues
CSV encoding issues are all too common. If you are able to correct the data at the source, that is the best bet; however, this is not always possible, in which case you will need to clean the file up yourself. It’s usually a case of adding or removing double quotes.
Here is the message you will receive if you have an encoding issue to solve. It gives you the position of the issue so you can go and correct it.
Msg 4879, Level 16, State 1, Line 1
Bulk load failed due to invalid column value in CSV data file Case History.csv in row 16403, column 7.
Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.
Msg 7330, Level 16, State 2, Line 1
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)"
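If the offending rows look fine in the file, it can also be worth experimenting with the optional CSV settings of BULK INSERT before hand-editing the data. A sketch only, assuming the file is UTF-8 and quoted with double quotes:
BULK INSERT stage.[CaseHistory]
FROM 'Case History.csv'
WITH (
    DATA_SOURCE = 'DataLoadInput',
    FORMAT = 'CSV',
    FIELDQUOTE = '"',   --explicit quote character (double quote is the default)
    CODEPAGE = '65001', --treat the file as UTF-8
    FIRSTROW = 2
);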
Create Destination Table
Next we create the destination table. This differs from the staging table: the columns can be named to your own standard and stored as the correct types instead of text, and you can specify a primary key. I’ve added a Line field to act as part of a compound key with CaseRef, as there is no key in the file. I’ve also added a HistoryDate field to hold a combination of the [History date] and [Time Created] fields.
CREATE TABLE [dbo].[CaseHistory](
[CaseRef] [varchar](20) NOT NULL,
[Line] [int] NOT NULL,
[HistoryDate] [datetime] NULL,
[ActionType] [varchar](2) NULL,
[ActionCode] [varchar](20) NULL,
[CorrType] [varchar](8) NULL,
[Details] [varchar](max) NULL,
[Typist] [varchar](50) NULL,
[Handler] [varchar](50) NULL,
[DocumentCode] [varchar](20) NULL,
[NoOfUnits] [decimal](19, 4) NULL,
[CostValue] [money] NULL,
CONSTRAINT [PK_CaseHistory] PRIMARY KEY CLUSTERED
(
[CaseRef] ASC,
[Line] ASC
))
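Before writing the full procedure it is worth illustrating the HistoryDate combination in isolation. This assumes, as the conversion in the procedure below does, that [History date] arrives as dd/mm/yyyy (hence style 103) and [Time Created] is a count of seconds since midnight; the sample values here are made up:
--combine a sample date and time: 34200 seconds after midnight is 09:30:00
SELECT DATEADD(ss, CONVERT(int, '34200'), CONVERT(datetime, '25/03/2021', 103)) AS HistoryDate;
--returns 2021-03-25 09:30:00.000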
Copy from Stage to Destination
The final step is to copy the staged data to the production table. I’ve combined this with the previous steps into a single stored procedure:
--exec dbo.RefreshCaseHistory 'case-history.csv'
CREATE PROCEDURE [dbo].[RefreshCaseHistory]
    @FileName varchar(200)
AS
BEGIN
    DECLARE @bulksql NVARCHAR(4000) = 'BULK INSERT stage.[CaseHistory] FROM ''' + @FileName + ''' WITH ( DATA_SOURCE = ''DataLoadInput'', FORMAT = ''CSV'', FIRSTROW = 2)';
    BEGIN TRY
        BEGIN TRANSACTION;

        --load into staging table
        TRUNCATE TABLE stage.[CaseHistory];
        EXEC(@bulksql);

        --transform from stage -> production ------------------
        TRUNCATE TABLE dbo.[CaseHistory];
        INSERT INTO dbo.CaseHistory
        SELECT
            [Case No.] AS CaseRef,
            ROW_NUMBER() OVER (PARTITION BY [Case No.] ORDER BY calc.HistoryDate) AS Line, --assign a line number to act as key
            calc.HistoryDate, --combined history date and time created
            [Action type] AS ActionType,
            [Action code] AS ActionCode,
            [Corr Type] AS CorrType,
            [Details],
            [Typist code] AS Typist,
            [Handler],
            [Document Code] AS DocumentCode,
            CONVERT(decimal(19,4), [No of units]) AS NoOfUnits,
            CONVERT(money, [Cost value]) AS CostValue
        FROM
            [stage].[CaseHistory] AS ch
            --convert and combine the date & time fields; a CROSS APPLY means the calculation isn't repeated inside the ROW_NUMBER
            CROSS APPLY
            (SELECT DATEADD(ss, CONVERT(int, ch.[Time Created]), CONVERT(datetime, ch.[History date], 103))) AS calc(HistoryDate);

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
        THROW;
    END CATCH;
END
GO
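With the procedure in place, a typical refresh is a single call followed by a quick comparison of row counts, for example:
EXEC dbo.RefreshCaseHistory 'Case History.csv';

--compare the staging and production row counts after the refresh
SELECT (SELECT COUNT(*) FROM stage.CaseHistory) AS StageRows,
       (SELECT COUNT(*) FROM dbo.CaseHistory) AS ProductionRows;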
Regenerating the SAS Key
At some point the SAS token will expire. When it does, create a new SAS token as described earlier, then recreate the credential and the external data source.
DROP EXTERNAL DATA SOURCE DataLoadInput;
DROP DATABASE SCOPED CREDENTIAL BlobCredential;

CREATE DATABASE SCOPED CREDENTIAL BlobCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'sv=NEW_SAS_TOKEN_HERE';
CREATE EXTERNAL DATA SOURCE DataLoadInput
WITH
(
TYPE = BLOB_STORAGE,
LOCATION = 'https://myblobaccount.blob.core.windows.net/sql-data-load',
CREDENTIAL = BlobCredential
);