Tip #114: Handling inconsistent JSON data

Some time ago, I ran into a situation where I needed to consume JSON data from an API service and store data into a relational database. During development, I discovered that the JSON data was inconsistent such that

  1. JSON structures representing different records can have the same nodes defined as different data types (i.e. string vs struct)
  2. JSON structures can have a varied number of substructures that was not defined as an array nor defined in any technical documentation
  3. JSON structure is defined as a single line JSON so usual methods like using a dataframe for parsing won’t work because no standard schema can be applied (implicit or explicit) and inconsistencies would cause the entire single line JSON to be nulled.

The environment was using azure data factory and using databricks to assist in the transformations. The solution involved using Scala and Jackson libraries for node handling. For privacy, I will use fictitious data to show the concept of how the solution would apply. And the code provided are excerpts of what the full source code could look like.

Core logic will rely on two techniques to solve this problem.

  1. Short circuit logic to simplify conditional logic and improve readability
  2. Apply helper methods to flag node types assist with conditional formatting rules

For the sample data, I would be using this sample data which can be easily reformatted using Json editor online (https://jsoneditoronline.org/).

One observation we can see is that for each report there is a varied number of vehicles that we may not know all the combinations of vehicles that can exist on a report, but we observe that each vehicle node has a top and bottom structure. Second, we take note that for vehicle objects we would want to genericize object handling as they could have the same contents.

So for my data access objects, I would define these methods for resolving this specific issue:
IsNodeVehicle (JsonNode) – returns true and abstracts complexity to define whether a node is really a vehicle object
ProcessJsonObject (JsonNode) – method to process the JSON file; will show short circuit logic

  def IsNodeVehicle(ptrNode: JsonNode): Boolean={
    if (ptrNode.has("top") || ptrNode.has("bottom")) {
      true
    }
    else {false}
  }

  def ProcessJsonObject(SrcFileName: String) {
/*
 some code 
*/

// short circuit logic
if (dataNodeElement.has("data") && dataNodeElement.get("data").has("report")){

}
var ptrNode=dataNodeElement.get("data").get("report");
          val fieldNames = ptrNode.fieldNames;
          while (fieldNames.hasNext) {
            var fieldName:String = fieldNames.next();
            if(IsNodeVehicle(ptrNode.get(fieldName))){
/*
processing logic
*/
}

The IsNodeVehicle method is very easy to define. We use the rule that a vehicle object must have either a top or bottom structure. We implement this as a helper method to reduce repetitious checks throughout the code and to future-proof our code in case the definition of the vehicle object needs to change.

Short circuit logic which is built into Scala takes advantage of replacing nested conditional logic as a single line which will improve readability and ongoing maintenance. The way it works is that the chained logic for AND operations will stop processing as soon as it encounters the first FALSE sub-condition or if all sub-conditions are TRUE whichever scenario happens first.

The full source code contains more logic than what’s shown in this blog post. But using these two techniques greatly reduced the complexity of the issue dealing with a potentially large number of objects and is future-proofed if more vehicles are added to the source JSON.

Sample Json data
{“count”:”2″,”Data”:{“0”:{“id”:”de8f85bf-1e0a-da5a-5852-1ab021eccdcb”,”report”:{“serialNumber”:123,”car”:{“Top”:{“wheelstyle”:””},”Bottom”:{“wheelstyle”:””}},”truck”:{“Top”:{“wheelstyle”:””},”Bottom”:{“wheelstyle”:””}},”summary”:{“note”:”this is a test”},”Section1″:{“field1″:””},”Section2″:{“field1″:””}}},”1″:{“id”:”1a01e465-e698-a69d-f072-20e7db3ad203″,”report”:{“serialNumber”:456,”motorcycle”:{“Top”:{“wheelstyle”:””},”Bottom”:{“wheelstyle”:””}},”flatbedTruck”:{“Top”:{“wheelstyle”:””},”Bottom”:{“wheelstyle”:””}},”summary”:{“note”:”this is a test”},”Section1″:{“field1″:””},”Section2″:{“field1″:””}}}}}

References:
https://github.com/FasterXML/jackson
https://jsoneditoronline.org/