The goal of this article is to provide an example of creating a custom Python object from a complex JSON structure. A custom Python object accounts for additional processing/formatting from an API service that could evolve over time, and it provides a better interface. It also gives better control for decoupling and information hiding.
With the example provided, I’m choosing a JSON structure that mixes strings, structures, and arrays, which better represents common JSON message structures.
I’m a big fan of SQL deployment scripts. I normally create deployment scripts for environments where I need to sync my SQL code and structures myself, or to pass along to DevOps who can deploy on my behalf. The reason for using scripts instead of a long-winded manual procedure is consistency – ensuring all deployment steps are followed and eliminating the x-factor of human error and fatigue associated with late-night deployments.
There are two types of versioning for managing SQL changes: state and migrations.
Migrations refer to incremental changes since the last deployment, whereas state refers to the full set of changes accumulated to date. Both approaches have pros and cons. Migration-style scripts are shorter, faster to deploy, and easier to review. State-style scripts are longer and take longer to execute, but they are complete and independent of previous deployments.
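To make the contrast concrete, here is a minimal sketch; the table and column names are made up for illustration. The migration script carries only the delta since the last release, while the state script restates the full expected definition so it can be rerun against any prior version.
/* migration style: only the delta since the last deployment */
alter table dbo.Customer ADD LoyaltyTier nchar(3) NULL;
go
/* state style: restates the complete expected definition and is rerunnable */
if object_id('dbo.Customer', 'U') is null
BEGIN
CREATE TABLE dbo.Customer(CustomerID int NOT NULL PRIMARY KEY, CustomerName nvarchar(200) NOT NULL);
END
IF COLUMNPROPERTY(OBJECT_ID('dbo.Customer'), 'LoyaltyTier', 'ColumnId') IS NULL
BEGIN
alter table dbo.Customer ADD LoyaltyTier nchar(3) NULL;
END
go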
This article shows examples of SQL change scripts specifically for SQL Server (on-prem and Azure SQL Database), designed for migration-style deployments.
Design principles
All examples are rerunnable and customizable. The first five examples are straightforward and use the metadata in the database to determine whether an object exists. For SQL changes to code (views, functions, and stored procedures), the pattern is to create a stub if the object does not exist and then alter it. This accommodates preexisting security changes (such as grants) that you don’t want to lose during a DROP operation.
/* Example 1: update a function */
if not exists(select * from sys.objects where [name]='fn_example' and [schema_id]=1)
BEGIN
exec ('CREATE FUNCTION dbo.fn_example() RETURNS TABLE AS RETURN (select GETUTCDATE() as col1)')
END
go
alter FUNCTION dbo.fn_example()
RETURNS TABLE AS RETURN (
select GETUTCDATE() as col1
);
go
/* Example 2: update a stored proc */
if object_id('dbo.[pr_RefreshDependentView]','P') is null execute('create proc dbo.[pr_RefreshDependentView] as select (1) as t')
go
alter proc dbo.[pr_RefreshDependentView]
as
select (1) as t
go
/* Example 3: check if column exists in table otherwise add column */
IF COLUMNPROPERTY(OBJECT_ID('dbo.table1'), 'fieldToAdd', 'ColumnId') IS NULL
BEGIN
alter table [dbo].[table1] ADD [fieldToAdd] nchar(3) NULL;
END
/* Example 4: creates a stub view if not exists */
if object_id('dbo.[myView]','V') is null execute('create view dbo.[myView] as select (1) as t')
go
alter view dbo.[myView] as select (1) as t
go
The trickiest part is changing table-valued types, which are useful during ADF (Azure Data Factory) development when you need to enhance an existing dataset with additional fields, like a batchID or a data source name, that are not included in the source data.
The algorithm is to rename the old type, create the new one, refresh the dependencies, and drop the old type.
/* Example 5: replace a table-valued type and refresh its dependencies */
if exists(select top 1 1 from sys.types where name='myType' and [schema_id]=1)
BEGIN
EXEC sys.sp_rename 'dbo.myType', 'zMyType', 'USERDATATYPE';
CREATE TYPE [dbo].myType AS TABLE(
[Shift] [varchar](8000) NULL,
[downtime] [varchar](8000) NULL,
[duration] [varchar](8000) NULL,
[code] [varchar](8000) NULL, /* assumed column name */
[codeDesc] [varchar](8000) NULL,
[description] [varchar](8000) NULL,
[eqpNumber] [varchar](8000) NULL,
[componentCode] [varchar](8000) NULL,
[MyTimeDuration] varchar(8000) NULL,
[CustomerTimeDuration] varchar(8000) NULL
);
DECLARE @Name NVARCHAR(776);
DECLARE REF_CURSOR CURSOR FOR
SELECT referencing_schema_name + '.' + referencing_entity_name
FROM sys.dm_sql_referencing_entities('dbo.myType', 'TYPE');
OPEN REF_CURSOR;
FETCH NEXT FROM REF_CURSOR INTO @Name;
WHILE (@@FETCH_STATUS = 0)
BEGIN
EXEC sys.sp_refreshsqlmodule @name = @Name;
FETCH NEXT FROM REF_CURSOR INTO @Name;
END;
CLOSE REF_CURSOR;
DEALLOCATE REF_CURSOR;
EXEC ('DROP TYPE dbo.zMyType');
END
Consider two examples. An e-commerce site was built 100% to spec, but the order fulfillment process was flawed and a famous toy company was unable to deliver presents before Christmas.
A customer support system was designed for one region and one dispatcher, then rescoped for multiple regions without considering call volume and the required number of dispatchers. This resulted in dropped calls from frustrated customers and a broken triage process.
In both instances, it doesn’t matter how great and shiny the system design is. If the system depends on weak processes, the business capability (i.e., ordering products online or getting customer support) fails, and ultimately the IT project to improve the great and shiny also fails.
The problems above could have been prevented with time invested in process analysis: model the process for the current and target state, then address the variances through system and business process changes – also known as gap analysis.
Wait…Isn’t this Product Management’s job?
But I’m a DEV. Shouldn’t this be Product Management’s job? After all, we did everything the customer asked for, right?
If building software were like building cars, would we ship a car with no brakes just because the customer never specifically asked for them? I would hope the answer is no, and I would argue that the developer still needs to consider risks not explicitly defined in a user story (in this case, customer safety) because the customer might not know what they want. Just as QA tests DEV’s work, DEV should confirm Product Management’s requirements to ensure completeness.
I’m not proposing a big effort; in a team, we need to respect everyone’s role. However, preparing a “picture” of the problem provides a better platform for team discussion and for proposing changes. The process modeling technique is simple and can be applied to internal and external processes of varying complexity.
Use BPMN for process modeling
For process modeling, I use BPMN (Business Process Modeling Notation) symbol conventions, but you can easily swap the modeling convention for UML or any flowchart style that is consistent and that your audience will understand. For the modeling software, I use Excel, which is pretty standard in the workplace and does not impose any additional license restrictions on stakeholders opening the documents.
Common BPMN conventions:
Begin terminator – solid dot (with M for a manual trigger, T for a timed trigger)
End terminator – circle
Reference to another process – circle with arrowhead
Processes – rounded boxes
Process flow – arrows
Subsystems – boxes used as swimlanes
Conditional flows – diamond boxes
How to build a process modeling diagram
Step 1: Itemize a list of processes
Step 2: Order the list chronologically (post-it notes are great for this)
Step 3: Translate the processes into process symbols and lines in Excel*
*Most “special” symbols like the arrowhead or email icons can be typed in using the Marlett font.
Example
This process flow documents the HR onboarding process for granting access to a new employee. For this simple example, the process begins with a manual trigger (i.e., an HR clerk) granting the employee access in a personnel system by activating an employee record for a specific profile. The system then triggers updates against three systems – sales, finance, and operations – so the employee gets access to those systems via “configure user” tasks that run automatically and independently in each subsystem.
Now, if we want to propose a change such as an email notification, we just add another process box (in a different color) and connecting lines, so we can show the proposed changes pictorially and get faster feedback before proceeding.
References
BPMN 2.0 – Business Process Model and Notation
http://www.bpmb.de/images/BPMN2_0_Poster_EN.pdf
Some time ago, I ran into a situation where I needed to consume JSON data from an API service and store the data in a relational database. During development, I discovered that the JSON data was inconsistent in several ways:
JSON structures representing different records can have the same nodes defined as different data types (i.e. string vs struct)
JSON structures can have a varying number of substructures that are not defined as an array nor described in any technical documentation
The JSON is delivered as a single line, so the usual methods such as parsing with a dataframe won’t work: no standard schema can be applied (implicit or explicit), and inconsistencies would cause the entire single-line JSON to be nulled
The environment used Azure Data Factory, with Databricks assisting in the transformations. The solution used Scala and the Jackson libraries for node handling. For privacy, I will use fictitious data to show the concept of how the solution applies, and the code provided consists of excerpts of what the full source code could look like.
The core logic relies on two techniques to solve this problem:
Short-circuit logic to simplify conditional logic and improve readability
Helper methods that flag node types to assist with conditional formatting rules
I will be using the sample data shown at the end of this post, which can be easily reformatted using JSON Editor Online (https://jsoneditoronline.org/).
One observation is that each report contains a varying number of vehicles, and we may not know all the combinations of vehicles that can exist on a report; however, each vehicle node has a Top and Bottom structure. Second, we note that we want to genericize the handling of vehicle objects since they can have the same contents.
So for my data access objects, I would define these methods for resolving this specific issue:
IsNodeVehicle(JsonNode) – returns true if a node is really a vehicle object, abstracting away the complexity of that check
ProcessJsonObject(SrcFileName) – processes the JSON file; demonstrates the short-circuit logic
import com.fasterxml.jackson.databind.JsonNode

def IsNodeVehicle(ptrNode: JsonNode): Boolean = {
  // a vehicle node is identified by a Top or Bottom structure (matching the sample data)
  ptrNode.has("Top") || ptrNode.has("Bottom")
}
def ProcessJsonObject(SrcFileName: String): Unit = {
  /*
  some code: read SrcFileName and parse it into dataNodeElement (a JsonNode)
  */
  // short-circuit logic: the second check only runs when the first is true
  if (dataNodeElement.has("data") && dataNodeElement.get("data").has("report")) {
    var ptrNode = dataNodeElement.get("data").get("report")
    val fieldNames = ptrNode.fieldNames
    while (fieldNames.hasNext) {
      var fieldName: String = fieldNames.next()
      if (IsNodeVehicle(ptrNode.get(fieldName))) {
        /*
        processing logic
        */
      }
    }
  }
}
The IsNodeVehicle method is very easy to define. We use the rule that a vehicle object must have either a Top or Bottom structure. We implement this as a helper method to reduce repetitive checks throughout the code and to future-proof the code in case the definition of a vehicle object needs to change.
Short-circuit logic, which is built into Scala, replaces nested conditional logic with a single line, which improves readability and ongoing maintenance. With a chain of AND operations, evaluation stops at the first FALSE sub-condition; otherwise it continues until every sub-condition has evaluated to TRUE.
The full source code contains more logic than what’s shown in this blog post, but these two techniques greatly reduced the complexity of dealing with a potentially large number of objects, and they future-proof the code if more vehicles are added to the source JSON.
Sample JSON data
{"count":"2","Data":{"0":{"id":"de8f85bf-1e0a-da5a-5852-1ab021eccdcb","report":{"serialNumber":123,"car":{"Top":{"wheelstyle":""},"Bottom":{"wheelstyle":""}},"truck":{"Top":{"wheelstyle":""},"Bottom":{"wheelstyle":""}},"summary":{"note":"this is a test"},"Section1":{"field1":""},"Section2":{"field1":""}}},"1":{"id":"1a01e465-e698-a69d-f072-20e7db3ad203","report":{"serialNumber":456,"motorcycle":{"Top":{"wheelstyle":""},"Bottom":{"wheelstyle":""}},"flatbedTruck":{"Top":{"wheelstyle":""},"Bottom":{"wheelstyle":""}},"summary":{"note":"this is a test"},"Section1":{"field1":""},"Section2":{"field1":""}}}}}
This tip is for dealing with archiving and/or updating large volumes of data from a single table in a database while the database is still in use by other users.
Imagine a scenario where you have a requirement to delete financial data older than 7 years. But the table contains both current and historical data that users access frequently.
The problem with large DELETE transactions is that the DBMS can lock the table, preventing read requests until the transaction is committed. A lack of indexes or partitions can also make a DELETE transaction take a long time.
The Nibbling Delete construct resolves this issue by breaking down a large DELETE transaction into multiple smaller DELETE transactions with a short delay in between to allow read transactions to continue. This assumes that the remaining records to be deleted are not accidentally read in the meantime.
I’ve added a condition to the sample code for situations where you don’t want the code to keep processing after a specific time, such as when business hours start. This is entirely optional for your situation.
The algorithm can also be adapted to perform a nibbling INSERT or a nibbling UPDATE; a sketch of the UPDATE variant follows the sample code below.
declare @batchsize int, @numYearsToRetain int, @rowsDeleted int
declare @cutoffTime varchar(10), @currentTime varchar(10)
set @batchsize=100000
set @numYearsToRetain=-7
set @cutoffTime='08:00:00'
while (1=1)
BEGIN
DELETE top (@batchsize)
from [Sales].[SalesOrderDetail]
where ModifiedDate < dateadd(yy, @numYearsToRetain, getutcdate());
/* capture the delete count before any other statement resets @@rowcount */
set @rowsDeleted = @@rowcount;
select @currentTime=convert(varchar(8),getutcdate(),8)
/* use the time check as a failsafe if you need to stop processing during a critical time */
if @rowsDeleted < @batchsize or @currentTime>=@cutoffTime
BEGIN
break;
END
waitfor delay '00:00:00.010';
END
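The UPDATE variant follows the same pattern. The sketch below is hypothetical – it backfills NULL tracking numbers in [Sales].[SalesOrderDetail] in batches – but the loop, rowcount check, and delay work exactly as in the DELETE above.
declare @batchsize int, @rowsUpdated int
set @batchsize=100000
while (1=1)
BEGIN
/* hypothetical backfill: update only rows that still need the change, @batchsize at a time */
UPDATE top (@batchsize) sod
set sod.CarrierTrackingNumber = 'N/A'
from [Sales].[SalesOrderDetail] sod
where sod.CarrierTrackingNumber is null;
/* capture the update count before any other statement resets @@rowcount */
set @rowsUpdated = @@rowcount;
if @rowsUpdated < @batchsize
BEGIN
break;
END
waitfor delay '00:00:00.010';
END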
Usually, when you think of data profiling, you think of running queries against databases to build reports. But I would like to present another practical use of data profiling: investigating a potential data breach. The example in this blog post is fictitious, based on real-life scenarios, and uses randomized data to illustrate the concepts.
Data breaches are publicized more and more frequently and are a real concern for both organizations and individuals. A data breach is an unauthorized release of private/confidential data, which can also involve tampering or the theft of intellectual property. In many countries, regulations require companies to disclose data breaches and pay hefty fines based on the number of users affected.
For this fictitious example, there was a “security incident” involving an ex-sales employee who may have accessed a highly critical internal system containing sensitive personal data. Luckily, I’m familiar with the internal processes, and all data is stored centrally in a database. But my report needs to be objective and presented in a meaningful way.
Key questions we would likely want to know are:
What evidence proves whether the ex-employee has gained access to systems after his termination date?
What systems were accessed or at risk for the duration of the unauthorized access?
What corrective actions can be done to prevent similar security incidents from reoccurring?
Here are the facts about the system:
Accessible via the internet and supports many modules including a scheduler and billing system
Billing system requires a user login whereas the scheduler can be accessed via a logged-in or anonymous user
The fictitious ex-employee is William Thomas Smith, whose termination date is Sept 23, 2019.
For the simplicity of the blog post, we will use the following tables to do our data profiling exercise.
Figure 1.1 Sample System tables
Q1. What evidence proves whether the ex-employee gained access to systems after his termination date?
The best case would be a perfect match on the login used by William Smith. In reality, I’d likely need to do a fuzzy match across multiple fields (first/last name, email, etc.) to account for variability in spellings. In my example, I was unable to match on “William Smith”, but I was able to fuzzy match on “Bill Smith” for the name, together with a variation of the email address and the IsInternal flag (a sketch of such a query follows Figure 1.2). With this combination, I can conclude with high probability that Bill T Smith is our ex-employee. However, I’m also suspicious that multiple accounts were created with the same password, and by Brian, who is not a team lead or a manager. I’ll address that later.
Figure 1.2 Sample query result set
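A profiling query for this kind of fuzzy match could look something like the sketch below. The table and column names (userAccount, userID, firstName, lastName, email, IsInternal, createdBy) are assumptions for this fictitious schema rather than the actual system.
/* shortlist candidate accounts that loosely match the ex-employee;
   table and column names are assumed for this fictitious example */
select ua.userID, ua.firstName, ua.lastName, ua.email, ua.IsInternal, ua.createdBy
from dbo.userAccount ua
where (ua.lastName like 'Smith%' and ua.firstName in ('William','Will','Bill','Billy'))
   or ua.email like '%w%smith%'  /* catches variations like wsmith or will.smith */
   or ua.email like '%b%smith%'  /* catches variations like bsmith or bill.t.smith */
order by ua.lastName, ua.firstName;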
Q2. What systems were accessed or at risk for the duration of the unauthorized access?
The userSession table shows which modules were accessed, and only the scheduler module was accessed (a sketch of this query follows Figure 1.3). Querying ModifiedDate and ModifiedBy across all tables also does not show that the user changed any information (not shown). It is still possible that William altered data in the period between his termination date and the timestamp just before a current employee modified it, but the system does not support that level of logging, and the goal is to find evidence that William changed data. Therefore, I have to conclude that William’s login was used to gain access after his termination date, but we cannot prove that he modified any data. I also learned that William resigned from Sales and likely went to work for a competitor, so he might be motivated to steal design information rather than cause sabotage. However, the data in the userSession table reflects an activity pattern limited to one part of the application that is already publicly accessible.
Figure 1.3 Sample session info
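The module check could be a simple aggregation over userSession. Again, the table and column names (userSession, userID, moduleName, sessionStart) and the variable values below are assumptions for this fictitious example.
/* what did the suspect login touch after the termination date? (fictitious schema) */
declare @suspectUserID int = 42                /* assumed ID shortlisted in the previous query */
declare @terminationDate date = '2019-09-23'   /* termination date from the scenario */
select us.moduleName,
       count(*) as sessionCount,
       min(us.sessionStart) as firstAccess,
       max(us.sessionStart) as lastAccess
from dbo.userSession us
where us.userID = @suspectUserID
  and us.sessionStart > @terminationDate
group by us.moduleName
order by sessionCount desc;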
Q3. What corrective actions can be done to prevent similar security incidents from reoccurring?
Although some user activity was found, no data was changed and the activity was limited to publicly viewable areas of the application. I’m not 100% convinced that William made the login because the password is too common. I later confirmed with the sales lead that password sharing is a common practice within the sales team, which points to security flaws in both technical and business processes.
Final recommendations
Based on the results of the data profiling, I’m able to determine that the login likely belonging to the ex-employee William Smith had activity after his termination date. The login activity showed only read-only access to non-sensitive information, and it seems more likely that someone on the Sales team, not William, used the login.
But to avoid a potentially more disastrous scenario, I would make the following recommendations to address the security issues (task owner in parentheses):
Mitigate the repudiation problem – implement strong password policies and two-factor authentication, and restrict back-end access to DB systems (IT)
Update the security policy and communicate the changes – update the acceptable use policy to prohibit password sharing, and train staff on security practices (management)
Update the HR offboarding process to revoke access – revoke user permissions during HR offboarding, and grant them during HR onboarding (HR)
Implement a security change process – system controls/changes need to be requested, approved, and documented (management)
In the past, I’ve done data scraping on human-readable reports to solve data challenges related to payment integrations, non-financial integrations, and internal system integrations. Typically in integration work, there is a handshake protocol: a sender sends a message to the recipient, and the recipient sends back an acknowledgment (file) to confirm receipt. Integration reports then need to reconcile sent and received messages in order to track messages that were sent successfully and flag those that failed, which can later be used to implement retry logic and alerting. However, some integrations do not send the acknowledgment in a convenient flat-file format but instead as a human-readable report, so we need to resort to data scraping methods to extract the information needed to match against sent messages.
There are multiple ways to achieve this: C#, SSIS packages, PowerShell, etc. For this demo, I’m using SQL because the report is fairly short and I wanted to demonstrate an alternative that allows quicker iterative changes than solutions like SSIS. The demo uses a hypothetical payment integration with a fictitious bank, where we need to scrape an acknowledgment file for payment info to match against a payment file sent on the same day.
The payment fields required to match are:
value date – date of transaction
account number – bank account of fictitious bank
item count – count of items found in the sender message
item value – the total value of items found in the sender message
The sample report shown below has an identifier that indicates the type and style of the report. Since each line has different content and there is no consistent formatting across the lines, I implement custom formatting for each line. The approach is:
Load file contents into the temp table (any method will do but I chose TSQL for ease of use)
Apply special formatting on a per-line basis
Store results into a table or a CTE for ease of downstream processing
declare @fileName1 varchar(255) , @sql_stmt varchar(8000)
declare @RptType varchar(10)
set @fileName1='C:\sampleReport.txt'
/* step1: Create temporary objects */
if object_id('tempdb..#DataBuffer') is not null drop table #DataBuffer
create table #DataBuffer(
LineValue varchar(max)
)
if object_id('tempdb..#ReportCopy') is not null drop table #ReportCopy
create table #ReportCopy(
RowID int identity(1,1),
LineValue varchar(max)
)
/* step2: Load data into the temp tables. This sample uses BULK INSERT but can be adapted to a variety of load methods. */
set @sql_stmt= replace('BULK INSERT #DataBuffer FROM ''$filename$''','$filename$',@fileName1)
exec(@sql_stmt)
insert into #ReportCopy with (tablock) (LineValue) select LineValue from #DataBuffer with (nolock)
select @RptType =ltrim(rtrim(replace(left(LineValue,10),char(9),'')))
from #ReportCopy
where RowID=1
/* step3: we scan for the report type to ensure that the rest of the report can be processed. */
if @RptType ='RPT101'
BEGIN
/* compare results of the raw import of the report vs the results of the scraped data */
select * from #ReportCopy;
with cte_ReportValues
as (
select
@RptType rptType,
(select top 1 convert(varchar(10),try_cast(ltrim(rtrim(replace(replace(LineValue,'Value Date: ',''),char(9),''))) as datetime),23) from #ReportCopy where RowID=4) ValueDate,
(select top 1 ltrim(rtrim(replace(replace(LineValue,'account ',''),char(9),''))) from #ReportCopy where RowID=5) AccountNumber,
(select top 1 replace(replace(replace(replace(LineValue,'Item Count',''),' ',''),' ',''),char(9),'') from #ReportCopy where RowID=6) ItemCount,
(select top 1 replace(replace(replace(LineValue,'Item Value',''),char(9),''),'$','') from #ReportCopy where RowID=7) ItemValue
)
select *
from cte_ReportValues
END
Results after running sample code
One of the key features of this code is that you can also use the RowID to process sections of your report in a loop, as in the case of a detail-listing report, or simply to track your changes during iterative development as I have done here. With the query results exposed through a CTE, you can persist the data into a permanent table; in some implementations, I simply called that table Acknowledgements. But this technique has many applications. Happy coding!
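As a postscript on the RowID idea, a detail-listing loop could look something like the sketch below. It assumes, hypothetically, that detail lines occupy a contiguous RowID range in the #ReportCopy table populated by the script above.
/* walk a hypothetical detail section of the report line by line;
   assumes #ReportCopy from the script above is already populated */
declare @detailStart int, @detailEnd int, @rowID int, @line varchar(max)
set @detailStart=10   /* hypothetical first RowID of the detail section */
set @detailEnd=50     /* hypothetical last RowID of the detail section */
set @rowID=@detailStart
while @rowID <= @detailEnd
BEGIN
select @line = ltrim(rtrim(replace(LineValue,char(9),' ')))
from #ReportCopy
where RowID = @rowID;
/* per-line parsing and inserts for the detail listing would go here */
set @rowID = @rowID + 1;
END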
In a perfect world, the data you are searching for is easily searchable in a single database and the documentation is always current. However, you often have a legacy system with fields that have been repurposed to store different types of data. For example, descriptors or memo fields can contain important business information or links to other side systems. Another commonly encountered issue is that database fields are not consistently named – for example, OrderNo, OrderID, and OrderNumber. This is often the result of siloed development or of the legacy system being extended beyond what it was originally designed to do.
Based on a sample search term, I wanted a systematic way to discover which primary data entities are associated with a specific business process, as well as the entity relationships. The solution also needed to account for scenarios where data exists under mismatched field names or is buried within a larger value in a single database field. In my specific case, the database was a SQL Server database with thousands of columns across many tables, supporting a well-known ERP system with multiple application modules.
One way to handle this is to build SELECT queries from the metadata catalog, using the search term’s value and length to improve the probability of a match. This method works best with strings that are intuitively unique. For example, the search term “CAF-0006099” as an order number works better than “6099”.
Below is the sample code I’ve written to aggregate the results of dynamically generated metadata search queries. The IsFound column flags the first wildcard match, which is better for scanning large tables. This method of searching will not be 100% accurate, since a wildcard search would match both “CAF-0006099” and “CAF-0006099Z”. But as a first pass, this code is still useful for shortlisting a handful of columns out of potentially thousands across many tables. From there, you can do a more detailed level of querying (an example of that second pass follows the code below).
declare @search_term as varchar(255), @sql_template varchar(8000)
declare @max_rows int, @row_ctr int, @sql_stmt varchar(8000)
set @search_term ='CAF-0006099';
if object_id('tempdb..#SearchResults') is not null drop table #SearchResults
create table #SearchResults(
IsFound int,
ColName varchar(8000)
);
if object_id('tempdb..#cmdlist') is not null drop table #cmdlist
create table #cmdlist(
RowID int identity(1,1),
cmd varchar(8000)
);
set @sql_template ='insert into #SearchResults(IsFound, ColName) '
+ 'select top 1 1 IsFound, ''[$schemaName$].[$tableName$].[$colName$] as ColName'' ColName '
+ 'from [$schemaName$].[$tableName$] with (Nolock) where [$colName$] like ''%$searchTerm$%'';'
insert into #cmdlist(cmd )
select
replace(replace(replace(replace(
@sql_template ,'$schemaName$',c.Table_schema)
,'$tableName$',c.Table_name)
,'$colName$',c.Column_name)
,'$searchTerm$',@search_term)
as SqlStmt
from information_schema.columns c
where c.DATA_TYPE like '%char%' and c.CHARACTER_MAXIMUM_LENGTH>= len(@search_term )
set @max_rows =@@rowcount
set @row_ctr =0
while @row_ctr < @max_rows
BEGIN
select @sql_stmt = cmd from #cmdlist where RowID= (@row_ctr +1)
exec (@sql_stmt);
set @row_ctr=@row_ctr +1;
END
select ColName, IsFound
from #SearchResults;
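As an example of that second pass, once #SearchResults has shortlisted a column, you could query it directly with an exact match. The table and column below are hypothetical stand-ins for whatever the shortlist surfaces.
/* second pass over a shortlisted column: exact match instead of wildcard;
   table and column names are hypothetical */
select *
from [dbo].[SalesOrderHeader] with (nolock)
where [OrderNo] = 'CAF-0006099';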