Tip #116: Create Rerunnable SQL deployment scripts

I’m a big fan of SQL deployment scripts. I normally create deployment scripts for environments where I need to sync my SQL code and structures myself or pass along scripts to DevOps who can deploy on my behalf.
The reason for using scripts instead of a long-winded manual procedure is consistency – ensuring all deployment steps are followed and eliminating the human error and fatigue associated with late-night deployments.

There are two types of versioning related to managing SQL changes: state vs migrations.

Migrations refer to incremental changes applied since the last deployment, whereas state refers to the full, cumulative definition of the objects as they should exist today. Both approaches have pros and cons. Migration-style scripts are shorter, deploy faster, and their deployment code is easier to review. State-style scripts are longer and take longer to execute, but they are complete and independent of previous deployments.
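As a quick illustration, here is the same change expressed both ways (the table and column names are hypothetical):

/* migration style: only the incremental change since the last deployment */
alter table dbo.Orders add BatchID int null;

/* state style: the complete, current definition of the object
   (the deployment tooling reconciles this with what already exists) */
create table dbo.Orders(
	OrderID int not null primary key,
	OrderDate datetime not null,
	BatchID int null
);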

This article shows examples of SQL change scripts for SQL Server (on-prem and Azure SQL Database) that are designed for migration-style deployments.

Design principles
All examples are rerunnable and customizable. The first four examples are straightforward: they use the metadata in the database to determine whether an object exists, and for code objects (views, functions, and stored procedures) they create a stub if the object does not exist and then ALTER it. This preserves preexisting security settings on the object that a DROP operation would otherwise remove.

/* Example 1: update a function */
if not exists(select * from sys.objects where [name]='fn_example' and [schema_id]=1)
BEGIN
	exec ('CREATE FUNCTION dbo.fn_example() 	RETURNS TABLE AS RETURN (select GETUTCDATE() as col1)')
END
go

alter FUNCTION dbo.fn_example() 	
RETURNS TABLE AS RETURN (
select GETUTCDATE() as col1
);
go

/* Example 2: update a stored proc */
if object_id('dbo.[pr_RefreshDependentView]','P') is null execute('create proc dbo.[pr_RefreshDependentView] as select (1) as t')
go

alter proc dbo.[pr_RefreshDependentView] 
as 
select (1) as t
go

/* Example 3: check if column exists in table otherwise add column */
IF COLUMNPROPERTY(OBJECT_ID('dbo.table1'), 'fieldToAdd', 'ColumnId') IS NULL
	BEGIN
		alter table [dbo].[table1] ADD [fieldToAdd] nchar(3) NULL;
	END


/* Example 4: creates a stub view if not exists */

if object_id('dbo.[myView]','V') is null execute('create view dbo.[myView] as select (1) as t')
go

alter view dbo.[myView] as select (1) as t

go

The trickiest part is changing table-valued types, which are useful during ADF (Azure Data Factory) development when you need to enhance an existing dataset with additional fields, such as a batchID or a data source name, that are not included in the source data.

Since a type cannot be altered in place, the algorithm is: rename the old object, create the new one, refresh the dependencies, and drop the old object.

/* Example 5: replace a table-valued type and refresh its dependencies */
if exists(select top 1 1 from sys.types where name='myType' and [schema_id]=1)
BEGIN
	EXEC sys.sp_rename 'dbo.myType', 'zMyType';

	CREATE TYPE [dbo].myType AS TABLE(
	[Shift] [varchar](8000) NULL,
	[downtime] [varchar](8000) NULL,
	[duration] [varchar](8000) NULL,
	[code] [varchar](8000) NULL, /* placeholder column name */
	[codeDesc] [varchar](8000) NULL,
	[description] [varchar](8000) NULL,
	[eqpNumber] [varchar](8000) NULL,
	[componentCode] [varchar](8000) NULL,
	[MyTimeDuration] varchar(8000) NULL, 
	[CustomerTimeDuration] varchar(8000) NULL
	);

	DECLARE @Name NVARCHAR(776);

	DECLARE REF_CURSOR CURSOR FOR
	SELECT referencing_schema_name + '.' + referencing_entity_name
	FROM sys.dm_sql_referencing_entities('dbo.myType', 'TYPE');

	OPEN REF_CURSOR;

	FETCH NEXT FROM REF_CURSOR INTO @Name;
	WHILE (@@FETCH_STATUS = 0)
	BEGIN
		EXEC sys.sp_refreshsqlmodule @name = @Name;
		FETCH NEXT FROM REF_CURSOR INTO @Name;
	END;

	CLOSE REF_CURSOR;
	DEALLOCATE REF_CURSOR;

	EXEC ('DROP TYPE dbo.zMyType');
END

Tip #113: Nibbling Delete Construct

This tip is for dealing with archiving and/or updating large volumes of data from a single table in a database while the database is still in use by other users.

Imagine a scenario where you have a requirement to delete financial data older than 7 years. But the table contains both current and historical data that users access frequently.

The problem with large DELETE transactions is that the DBMS can lock the table, blocking read requests until the transaction is committed. A lack of indexes or partitioning can also make a DELETE transaction take a long time.

The Nibbling Delete construct resolves this issue by breaking a large DELETE transaction into multiple smaller DELETE transactions with a short delay in between so that read transactions can continue. This assumes that the remaining records to be deleted are not accidentally read in the meantime.

I’ve added a condition to the sample code for situations where you don’t want processing to continue past a specific time, such as when business hours start. This is entirely optional for your situation.

Another option is to modify the algorithm to perform a nibbling INSERT or a nibbling UPDATE (see the sketch after the delete example below).

declare @batchsize int, @numYearsToRetain int, @rowsDeleted int
declare @cutoffTime varchar(10), @currentTime varchar(10)

set @batchsize=100000
set @numYearsToRetain=-7
set @cutoffTime='08:00:00'

while (1=1)
BEGIN
	DELETE top (@batchsize)
	from [Sales].[SalesOrderDetail]
	where ModifiedDate < dateadd(yy, @numYearsToRetain, getutcdate());

	/* capture the rowcount before any other statement resets @@ROWCOUNT */
	set @rowsDeleted = @@rowcount;

	select @currentTime=convert(varchar(8),getutcdate(),8)
	/* use timecheck as a failsafe if you need to stop processing during a critical time */
	if @rowsDeleted < @batchsize or @currentTime>=@cutoffTime
	BEGIN
		break;
	END

	waitfor delay '00:00:00.010';
END
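
For reference, here is a minimal sketch of the same pattern applied as a nibbling UPDATE, assuming a hypothetical IsArchived flag on the same table:

declare @batchsize int, @rowsUpdated int
set @batchsize = 100000

while (1=1)
BEGIN
	/* flag a batch of historical rows instead of deleting them;
	   IsArchived is a hypothetical column used for illustration */
	UPDATE top (@batchsize) [Sales].[SalesOrderDetail]
	set IsArchived = 1
	where IsArchived = 0
	and ModifiedDate < dateadd(yy, -7, getutcdate());

	set @rowsUpdated = @@rowcount;

	if @rowsUpdated < @batchsize
	BEGIN
		break;
	END

	waitfor delay '00:00:00.010';
END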

Case Study: Data Profiling for Data Breaches

Usually, when you think of data profiling, you think of running queries against databases for building reports. But I would like to present another practical example of using data profiling to investigate a potential data breach. The example given in this blog post is a fictitious example based on real-life scenarios using randomized data to illustrate concepts.  

Data breaches are publicized more frequently than ever and are a real concern for both organizations and individuals. A data breach is an unauthorized release of private or confidential data, which can also involve data tampering or the theft of intellectual property. In many countries, regulations require companies to disclose data breaches and pay hefty fines based on the number of users affected.

For this fictitious example, there was a “security incident” involving an ex-sales employee possibly accessing a highly critical internal system that contains sensitive personal data. Luckily, I’m familiar with internal processes and all data is stored centrally in a database. But my report will need to be objective and presented in a meaningful way.

Key questions we would likely want to know are:

  1. What evidence proves whether the ex-employee has gained access to systems after his termination date?
  2. What systems were accessed or at risk for the duration of the unauthorized access?
  3. What corrective actions can be done to prevent similar security incidents from reoccurring?

Here are the facts about the system:

  • Accessible via the internet and supports many modules including a scheduler and billing system
  • Billing system requires a user login whereas the scheduler can be accessed via a logged-in or anonymous user
  • Fictitious ex-employee is William Thomas Smith whose termination date is Sept 23, 2019.

For the sake of simplicity, we will use the following tables for our data profiling exercise.

Figure 1.1 Sample System tables

Q1. What evidence proves whether the ex-employee has gained access to systems after his termination date?
Best case would be a perfect match on the login used by William Smith. But in reality, I’d likely need to do a fuzzy match across multiple fields (first/last names, email, etc.) to account for variability in spellings. In my example, I was unable to match on “William Smith”, but I was able to fuzzy match on “Bill Smith” for the name, along with a variation of the email address and the IsInternal flag. With this combination, I can conclude with high probability that Bill T Smith is most likely our ex-employee. However, I’m also suspicious that multiple accounts were created with the same password, and by Brian, who is not a team lead or a manager. I’ll address that later.

Figure 1.2 Sample query result set
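
For illustration, here is a minimal sketch of the kind of fuzzy-match query used above, assuming a hypothetical Users table and column names:

/* fuzzy match on name variants, an email pattern, and the IsInternal flag;
   the table and column names are hypothetical */
select UserID, FirstName, LastName, Email, IsInternal, CreatedBy, CreatedDate
from dbo.Users
where (FirstName like '%Will%' or FirstName like '%Bill%')
and LastName like '%Smith%'
and Email like '%smith%'
and IsInternal = 1;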

Q2. What systems were accessed or at risk for the duration of the unauthorized access?

Querying the userSession table shows which modules were accessed, and only the scheduler module shows activity. Querying ModifiedDate and ModifiedBy across all tables also does not show that the user changed any data (not shown). However, it is still possible that William could have altered data in the window between his termination date and the timestamp just before a current employee modified it; the system does not support that level of logging, and the goal is to find evidence that William changed data. Therefore, I have to conclude that William’s login was used to gain access after his termination date, but we cannot prove that he modified any data. I also learned that William resigned from Sales and likely went to work for a competitor, so he might be motivated to steal design information rather than cause sabotage. However, the data in the userSession table reflects a user activity pattern that ventures into only one part of the application, which is already publicly accessible.

Figure 1.3 Sample session info
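
A minimal sketch of that session check, assuming hypothetical userSession columns:

declare @suspectUserID int = 12345; /* hypothetical ID of the matched login */

/* summarize which modules the suspect login touched after the termination date */
select ModuleName, count(*) as SessionCount,
	min(LoginTime) as FirstAccess, max(LoginTime) as LastAccess
from dbo.userSession
where UserID = @suspectUserID
and LoginTime > '2019-09-23'
group by ModuleName;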

Q3. What corrective actions can be done to prevent similar security incidents from reoccurring?

Although some user activity was found, no data was changed and the activity was limited to publicly viewable areas of the application. I’m not 100% convinced that William made the logins because the password is too common. I later confirmed my suspicion with the sales lead that password sharing is a common practice within the sales team, which points to security flaws in both technical and business processes.

Final recommendations
So based on the results from data profiling, I’m able to determine that the login likely used by the ex-employee William Smith had activity after his termination date. The login activity only showed read-only access to non-sensitive information. But it seems more likely that someone on the Sales team, not William, used the login.

But to avoid a potentially more disastrous scenario, I would make the following recommendations to address security issues (task owner in brackets):

  1. Mitigate repudiation problem – implement strong password policies/ two-factor authentication, restrict back-end access to DB systems (IT)
  2. Update the security policy and communicate security changes – update the acceptable use policy which includes no password sharing and training staff on security practices (management)
  3. Update HR offboarding process to revoke access – revoke user permissions during HR offboarding process and the opposite for HR onboarding process. (HR)
  4. Implement security change process – implementing system controls/changes need to be requested, approved and documented. (management)

TIP #111 – SQL Data scraping

In the past, I’ve done data scraping on human-readable reports to solve data challenges related to payment, non-financial, and internal system integrations. Typically in integration work, there is a handshake protocol: the sender sends a message to the recipient, and the recipient sends back an acknowledgment (file) to confirm receipt. Integration reports then need to reconcile sent and received messages in order to track which messages were delivered successfully and flag those that failed, which can later be used to implement retry logic and alerting. However, some integrations do not return the acknowledgment in a convenient flat-file format but instead as a human-readable report, so we need to resort to data scraping to extract the information required to match against the sent messages.

There are multiple ways to achieve this: C#, SSIS packages, PowerShell, etc. For this demo, I’m using SQL because the report is fairly short and I wanted to demonstrate an alternative that allows quicker iterative changes than solutions like SSIS. The demo uses a hypothetical payment integration with a fictitious bank, where we need to scrape an acknowledgment file for payment info to match against a payment file sent on the same day.

The payment fields required to match are:

  • value date – date of transaction
  • account number – bank account of fictitious bank
  • item count – count of items found in the sender message
  • item value – the total value of items found in the sender message

The sample report shown below has an identifier to indicate the type and style of the report. Since each line has different content and there is no consistent formatting across the lines, I parse each line with its own custom logic.

RPT101															Sept 7, 2019
				      ABC Transaction Report Summary
				
						Value Date: Sep 7, 2019	
						account 123456	
						Item Count	10
						Item Value	$100.00

The algorithm for this style of processing is:

  1. Create temp tables for processing
  2. Load file contents into the temp table (any method will do but I chose TSQL for ease of use)
  3. Apply special formatting on a per-line basis
  4. Store results into a table or a CTE for ease of downstream processing

declare @fileName1 varchar(255) , @sql_stmt varchar(8000)
declare @RptType varchar(10)

set @fileName1='C:\sampleReport.txt'

/* step1: Create temporary objects */
if object_id('tempdb..#DataBuffer') is not null drop table #DataBuffer

create table #DataBuffer(
	LineValue varchar(max)
)

if object_id('tempdb..#ReportCopy') is not null drop table #ReportCopy

create table #ReportCopy(
	RowID int identity(1,1),
	LineValue varchar(max)
)

/* step2: Load data into the temp tables. The sample uses the BULK INSERT method but can be adapted to a variety of loading methods. */

set @sql_stmt= replace('BULK INSERT #DataBuffer FROM ''$filename$''','$filename$',@fileName1)

exec(@sql_stmt)


insert into #ReportCopy with (tablock) (LineValue) select LineValue from #DataBuffer with (nolock)

select @RptType =ltrim(rtrim(replace(left(LineValue,10),char(9),'')))
from #ReportCopy
where RowID=1

/* step3: we scan for the report type to ensure that the rest of the report can be processed. */
if @RptType ='RPT101'
BEGIN

	/* compare the results of the raw report import vs the scraped data */
	select * from #ReportCopy;

	with cte_ReportValues
	as (
		select
		@RptType rptType,
		(select top 1 convert(varchar(10),try_cast(ltrim(rtrim(replace(replace(LineValue,'Value Date: ',''),char(9),''))) as datetime),23) from #ReportCopy where RowID=4) ValueDate,
		(select top 1 ltrim(rtrim(replace(replace(LineValue,'account ',''),char(9),''))) from #ReportCopy where RowID=5) AccountNumber,
		(select top 1 replace(replace(replace(replace(LineValue,'Item Count',''),' ',''),' ',''),char(9),'') from #ReportCopy where RowID=6)  ItemCount,
		(select top 1 replace(replace(replace(LineValue,'Item Value',''),char(9),''),'$','') from #ReportCopy where RowID=7)  ItemValue
	)
	select *
	from cte_ReportValues
END




Results after running sample code

One of the key features of this code is that you can also use the RowID to process sections of your report in a loop, as in the case of a detail-listing report, or simply to track your changes during iterative development as I have done here. With the query results exposed through a CTE, you can persist the data into a permanent table; in some implementations I simply called that table Acknowledgements (see the sketch below). This technique has many applications. Happy coding!
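
As a minimal sketch, persisting the scraped values could look like the following, assuming a hypothetical dbo.Acknowledgements landing table (the literal values below stand in for the cte_ReportValues output from the sample code above):

/* hypothetical landing table for scraped acknowledgment values */
if object_id('dbo.Acknowledgements','U') is null
	create table dbo.Acknowledgements(
		AckID int identity(1,1) primary key,
		RptType varchar(10),
		ValueDate date,
		AccountNumber varchar(50),
		ItemCount int,
		ItemValue decimal(18,2),
		LoadedAtUtc datetime2 default sysutcdatetime()
	);

;with cte_ReportValues as (
	/* stand-in for the CTE built in the sample code */
	select 'RPT101' rptType, '2019-09-07' ValueDate, '123456' AccountNumber, '10' ItemCount, '100.00' ItemValue
)
insert into dbo.Acknowledgements(RptType, ValueDate, AccountNumber, ItemCount, ItemValue)
select rptType, try_cast(ValueDate as date), AccountNumber,
	try_cast(ItemCount as int), try_cast(ItemValue as decimal(18,2))
from cte_ReportValues;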

Tip #110 – Profiling data dynamically via metadata

In a perfect world, the data you are searching for is easily searchable in a single database and documentation is always current. However, you often have a legacy system with fields that have been repurposed to store different types of data. For example, descriptor or memo fields can contain important business information or links to other side systems. Another common issue is that database fields are not consistently named – for example, OrderNo, OrderID, and OrderNumber. This is often the result of siloed development or of the legacy system being extended beyond what it was originally designed to do.

Based on a sample search term, I wanted a systematic way to discover which primary data entities are associated with a specific business process, as well as the entity relationships. The solution also needed to account for scenarios where data exists under mismatched field names or is embedded within a larger value in a single database field. In my specific case, the database was a SQL Server database with thousands of columns across many tables, supporting a well-known ERP system with multiple application modules.

One way to handle this is to build SELECT queries from the metadata catalog, using the search term’s value and length to improve the probability of a match. This method works best with strings that are intuitively unique. For example, a search term of “CAF-0006099” as an order number works better than “6099”.

Below is the sample code I’ve written to aggregate the search results of dynamically generated metadata search queries. The IsFound column flags the first wildcard match, which is better for scanning large tables. This method of searching will not be 100% accurate, since a wildcard search makes a positive match on both “CAF-0006099” and “CAF-0006099Z”. But as a first pass, the code is still useful for shortlisting a handful of candidate columns from potentially thousands across many tables. From there, you can do a more detailed level of querying.

declare @search_term as varchar(255), @sql_template varchar(8000)
declare @max_rows int, @row_ctr int, @sql_stmt varchar(8000)

set @search_term ='CAF-0006099';

if object_id('tempdb..#SearchResults') is not null drop table #SearchResults

create table #SearchResults(
  IsFound int,
  ColName varchar(8000)
 );

if object_id('tempdb..#cmdlist') is not null drop table #cmdlist

create table #cmdlist(
    RowID int identity(1,1),
    cmd varchar(8000)
);

set @sql_template ='insert into #SearchResults(IsFound, ColName) '
+ 'select top 1 1 IsFound, ''[$schemaName$].[$tableName$].[$colName$]'' ColName '
+ 'from [$schemaName$].[$tableName$] with (Nolock) where [$colName$] like ''%$searchTerm$%'';'


insert into #cmdlist(cmd )
select
replace(replace(replace(replace(
    @sql_template ,'$schemaName$',c.Table_schema)
    ,'$tableName$',c.Table_name)
    ,'$colName$',c.Column_name)
	,'$searchTerm$',@search_term)
as SqlStmt
from information_schema.columns c
where c.DATA_TYPE like '%char%'
and c.CHARACTER_MAXIMUM_LENGTH >= len(@search_term) /* note: (max) columns report -1 and are skipped */

set @max_rows =@@rowcount
set @row_ctr =0

while @row_ctr < @max_rows
BEGIN
    select @sql_stmt = cmd from #cmdlist where RowID= (@row_ctr +1)
    exec (@sql_stmt);

    set @row_ctr=@row_ctr +1;
END

select ColName, IsFound
from #SearchResults;