Category: Azure

Bulk Load Data into Azure SQL Using CSV and Blob Storage

Overview

CSV is a widely supported format, and as an integrator you often find yourself having to load a CSV extract into SQL so that it can be processed, integrated or presented in some way. This guide will take you through a real-world example of loading a CSV export from a legacy system into Azure SQL.

Create a Staging Table to Receive the Data

When importing CSV into SQL, your options for representing the data are very limited. The best approach is to load it all as-is into a temporary “staging” table. Once the staging table is loaded, use SQL to transform and move the data to its final destination.

Importing CSV works best when the first row contains the column headings. Open your CSV file and extract the column names for use in your staging table.

  • Create a staging table to receive this data. This table must contain every column from your CSV file, in the same order; keeping the column names the same as the CSV headers makes the mapping obvious. It is best to make each column a text column such as varchar and correct the types later. Don’t add a primary key or any constraints either, as they will cause the load to fail on duplicate rows in the source data instead of letting you find and deal with them.
--create the stage schema if you don't have it
IF SCHEMA_ID('stage') IS NULL
    EXEC ('CREATE SCHEMA stage;');

--create the stage table
CREATE TABLE stage.CaseHistory (
    [Case No.]      varchar(20),
    [History date]  varchar(20),
    [Time Created]  varchar(20),
    [Action type]   varchar(20),
    [Action code]   varchar(50),
    [Corr Type]     varchar(10),
    [Details]       varchar(max),
    [Typist code]   varchar(50),
    [Handler]       varchar(50),
    [Document Code] varchar(20),
    [No of units]   varchar(20),
    [Cost value]    varchar(20)
);

Connect Azure SQL to Blob Storage

In a real-world environment one of your colleagues may already have connected Azure SQL to a blob storage account, in which case you may want to skip ahead. Here’s a simple query that will check:

SELECT * FROM sys.external_data_sources
WHERE type_desc = 'BLOB_STORAGE'

Create a Database Master Key If You Need One

You will need to store a SAS token for blob storage in SQL, and a database master key is required so that the credential can be encrypted. This snippet checks whether you already have a key before creating one. Be sure to replace ‘LONG_UNIQUE_STRING_HERE’ with a strong, unique password.

IF NOT EXISTS (SELECT * FROM sys.symmetric_keys WHERE name LIKE '%DatabaseMasterKey%')
BEGIN
    SELECT 'DMK does not exist'
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'LONG_UNIQUE_STRING_HERE'
END
ELSE
BEGIN
    SELECT 'DMK exists'
END

Create a SAS Token in the Azure Portal

A SAS token grants SQL permission to access the container in your blob storage account; read-only access is enough. I would generally use the Azure CLI for this, but in my testing the generated tokens didn’t work, so I recommend using the ‘Shared access signature’ blade in the Azure portal to generate a SAS with an expiry far enough in the future for your needs.

Save the SAS Token for Blob Storage in Azure SQL

You need to store the generated SAS token so that Azure SQL can use it. Copy the token from the Azure portal and replace “SAS_TOKEN_HERE” in this script, removing the leading ? so that the value starts with sv=.

IF NOT EXISTS (SELECT * FROM sys.database_scoped_credentials WHERE name = 'BlobCredential')
BEGIN
   SELECT 'Credential does not exist'
   CREATE DATABASE SCOPED CREDENTIAL BlobCredential
   WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
   SECRET = 'sv=SAS_TOKEN_HERE';
END
ELSE
BEGIN
   SELECT 'Credential already exists'
END

Register Blob Storage as a Data Source in Azure SQL

Next we need to register the blob container as an external data source. This requires CONTROL permission on the database.

CREATE EXTERNAL DATA SOURCE DataLoadInput
WITH ( 
    TYPE       = BLOB_STORAGE,
    LOCATION   = 'https://myblobaccount.blob.core.windows.net/sql-data-load',
    CREDENTIAL = BlobCredential
);

Useful debugging commands

-- credentials
SELECT * FROM sys.database_scoped_credentials;

-- external data sources registered for blob storage
SELECT * FROM sys.external_data_sources
WHERE type_desc = 'BLOB_STORAGE';

-- grant the CONTROL permission required to create the external data source
GRANT CONTROL ON DATABASE::[my-database] TO [username];

Load the CSV File from Blob Storage

Now that we have the data source registered, we can go ahead and try to load the CSV into the staging table.

First, upload your CSV file to blob storage. You can do this using the excellent Azure Storage Explorer or azcopy.
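Before running the full load, it is worth a quick smoke test that the SAS token and data source are wired up correctly. Here is a minimal sketch that simply reads the raw file back through the registered data source, reusing the file name and data source from above:

-- smoke test: read the whole file as a single value through the external data source
-- if this fails, fix the SAS token / container URL before debugging the BULK INSERT itself
SELECT BulkColumn
FROM OPENROWSET(
    BULK 'Case History.csv',
    DATA_SOURCE = 'DataLoadInput',
    SINGLE_CLOB) AS RawFile;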

Next we load the data into the staging table. We are using some dynamic SQL here so that we can parameterize the filename for future reuse.

DECLARE @filename varchar(300) = 'Case History.csv';

--FIRSTROW = 2 skips the header row so only data rows land in the staging table
DECLARE @bulksql NVARCHAR(4000) = 'BULK INSERT stage.[CaseHistory] FROM ''' + @filename + ''' WITH ( DATA_SOURCE = ''DataLoadInput'', FORMAT = ''CSV'', FIRSTROW = 2 )';
EXEC (@bulksql);

If you have any issues here, the most common problems are:

  • differences in column names
  • columns longer than expected (increase the varchar length)
  • CSV encoding issues

CSV encoding issues are all too common. If you are able to correct the data at the source, that is the best bet; however, this is not always possible, in which case you will need to clean the file up yourself. It’s usually a case of adding or removing double quotes.

Here is the message you will receive if you have an encoding issue to solve. It tells you the row and column of the problem so you can go and correct it.

Msg 4879, Level 16, State 1, Line 1
Bulk load failed due to invalid column value in CSV data file Case History.csv in row 16403, column 7.
Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.
Msg 7330, Level 16, State 2, Line 1
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)"
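If you want to gauge how widespread the bad rows are before fixing the file, BULK INSERT's MAXERRORS option lets the load carry on past a limited number of failures rather than stopping at the first one. A sketch reusing the staging table and data source from above (the limit of 50 is arbitrary, and rows that fail are simply skipped, so treat this as an investigation tool rather than the real load):

-- investigation only: tolerate up to 50 bad rows so more than the first failure is reported
BULK INSERT stage.CaseHistory
FROM 'Case History.csv'
WITH ( DATA_SOURCE = 'DataLoadInput', FORMAT = 'CSV', FIRSTROW = 2, MAXERRORS = 50 );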

Create Destination Table

Next we create the destination table. Unlike the staging table, the columns can be named to your own standard, stored with the correct types instead of text, and you can specify a primary key. I’ve added a Line field to act as part of a compound key with CaseRef, as there is no key in the file, and a HistoryDate field to hold a combination of the [History date] and [Time Created] fields.

CREATE TABLE [dbo].[CaseHistory](
	[CaseRef] [varchar](20) NOT NULL,
	[Line] [int] NOT NULL,
	[HistoryDate] [datetime] NULL,
	[ActionType] [varchar](2) NULL,
	[ActionCode] [varchar](20) NULL,
	[CorrType] [varchar](8) NULL,
	[Details] [varchar](max) NULL,
	[Typist] [varchar](50) NULL,
	[Handler] [varchar](50) NULL,
	[DocumentCode] [varchar](20) NULL,
	[NoOfUnits] [decimal](19, 4) NULL,
	[CostValue] [money] NULL,
 CONSTRAINT [PK_CaseHistory] PRIMARY KEY CLUSTERED 
(
	[CaseRef] ASC,
	[Line] ASC
))
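Because the staging table deliberately has no key, any duplicate rows in the extract survive the load, and the generated Line number is the only thing that will distinguish them in the destination table. A quick sketch to see whether the extract contains repeats before you rely on the generated line numbers; the grouping columns are just a reasonable guess at what ought to be unique:

-- list case/date/time combinations that appear more than once in the staged extract
SELECT [Case No.], [History date], [Time Created], COUNT(*) AS Occurrences
FROM stage.CaseHistory
GROUP BY [Case No.], [History date], [Time Created]
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;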

Copy from Stage to Destination

The final step is to copy the staging table into the production table. I’ve combined this with the previous steps into a single stored procedure.


--exec dbo.RefreshCaseHistory 'case-history.csv'
CREATE PROCEDURE [dbo].[RefreshCaseHistory]
	@Filename varchar(200)
AS

BEGIN


    DECLARE @bulksql NVARCHAR(4000) = 'BULK INSERT stage.[CaseHistory] FROM ''' + @Filename + ''' WITH ( DATA_SOURCE = ''DataLoadInput'', FORMAT = ''CSV'', FIRSTROW = 2 )';


	BEGIN TRY
		BEGIN TRANSACTION;
		
			--load into staging table
			truncate table stage.[CaseHistory] 
		    EXEC(@bulksql);
			
			--transform from stage -> production ------------------
			truncate table dbo.[CaseHistory];

			insert into dbo.CaseHistory
			SELECT  
				[Case No.]  as CaseRef,
				Row_number() OVER(partition BY [Case No.] ORDER BY calc.historyDate) as Line,  --assign a line number to act as key		
				calc.HistoryDate, -- combined history date and time created
				[Action type] as ActionType,
				[Action code] as ActionCode,
				[Corr Type] as CorrType,
				[Details] ,
				[Typist code] as Typist,
				[Handler],
				[Document Code] as DocumentCode,
				Convert(Decimal(19,4), [No of units]) NoOfUnits,
				Convert(money, [Cost value]) CostValue
			  FROM 
				[stage].[CaseHistory] as ch
 
			 --convert and combine the date & time fields.  use a cross apply so we don't need to repeat the calculation in the row_number.
			 CROSS APPLY 	
				(SELECT DATEADD(ss,convert(int, ch.[Time Created]),Convert(datetime, ch.[History date], 103))) AS calc(HistoryDate) 
			

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;

		THROW;
END CATCH;

END
GO
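After a run it is worth a quick sanity check that every staged row made it across. A minimal sketch comparing the two tables created above:

-- the two counts should match: every staged row should now exist in the destination table
SELECT
    (SELECT COUNT(*) FROM stage.CaseHistory) AS StagedRows,
    (SELECT COUNT(*) FROM dbo.CaseHistory)   AS DestinationRows;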

Regenerating the SAS Key

At some point the SAS key will expire on you. When it does, create a new SAS token as described earlier, then drop and recreate both the scoped credential and the external data source.

   DROP EXTERNAL DATA SOURCE DataLoadInput;

   DROP DATABASE SCOPED CREDENTIAL BlobCredential;

   CREATE DATABASE SCOPED CREDENTIAL BlobCredential
   WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
   SECRET = 'sv=NEW_SAS_TOKEN_HERE';
 

   CREATE EXTERNAL DATA SOURCE DataLoadInput
   WITH 
	( 
    TYPE       = BLOB_STORAGE,
    LOCATION   = 'https://myblobaccount.blob.core.windows.net/sql-data-load',
    CREDENTIAL = BlobCredential
	);

Copying Files to Azure Storage With AzCopy and a System Assigned Managed Identity

Often we need to automate moving files between Blob Storage in Azure and the disks on our servers, and managing credentials in this scenario can be a challenge. In this post I will show you how to overcome this with azcopy and a system-assigned managed identity.

Download AzCopy from here:

https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10

Unzip it and add it to your path.

Assign an Identity

Enabling a system-assigned managed identity is much simpler than the name would suggest. Simply open your VM in the Azure portal, click Identity on the left-hand side, toggle the Status switch to On, then click Save.

Assign Roles

Go to the storage account in the Azure portal.

Go into Access Control (IAM) and click Add > Add role assignment.

Then give the “Storage Blob Data Contributor” role to the virtual machine and click Save.

Use the Managed Identity with AZCopy

First, log in to azcopy using the system-assigned managed identity:

azcopy login --identity

This may not work right away, as it takes several minutes for the role assignment to ripple through the system before the login request will succeed. In my case it took 5 minutes. The easiest way to test is to use azcopy to list some files. Note that we do not need to pass any credentials:

azcopy list https://mystorageaccount.blob.core.windows.net/mycontainer

Each Windows user will need to log in to azcopy with the identity before using it for the first time; after that it is not necessary. If you are writing a script that will be executed by a service account, include the login command at the start of the script to ensure it gets called.

Once you are logged in, it’s easy to copy files to Blob Storage without passing credentials each time:

azcopy copy c:\export\sql-data-load\*.csv https://mystorageaccount.blob.core.windows.net/sql-data-load

Examples written using AzCopy v10.

Azure Functions – Timer Trigger Cheat Sheet

This is a cheat sheet for the CRON expressions used by the TimerTrigger binding in Azure Functions.

Basic Syntax

The structure of the CRON expressions in Azure is:
{second} {minute} {hour} {day} {month} {day of the week}

Using Configuration

Instead of specifying the CRON pattern directly in the [TimerTrigger] attribute, you can give it the name of a config setting wrapped in % signs. This is great for testing locally at a higher frequency than the deployed schedule.

[TimerTrigger("%account-number-clean-schedule%")]TimerInfo myTimer,

Cheat Sheet – Common Trigger Patterns

Description                                              Syntax
Every minute (when the seconds reach zero)               0 * * * * *
Every minute at 30 seconds                               30 * * * * *
Every 2 minutes                                          0 */2 * * * *
Every hour at 5 past (seconds zero, minutes 5)           0 5 * * * *
Every 15 minutes, 9 to 5, Monday to Friday               0 */15 9-17 * * MON-FRI
Daily (midnight)                                         0 0 0 * * *
Daily at 3:30 am                                         0 30 3 * * *
First day of each month                                  0 0 0 1 * *
Every 2 hours at 10 past, 7am to 7pm, Monday to Friday   0 10 7-19/2 * * 1-5