Flattening Nested JSON
Motivation
Once you start functional programming, you become alert to problems that lend themselves to a recursive solution. In my case the programming language is Elixir and the problem in question was posed to my friend at his job interview. He was asked to flatten a heavily nested JSON object. In Python, his language of choice, heavily nested dictionary.
The Problem
Flatten a nested JSON object. But first, a paragraph on JSON.
JSON is a data interchange format, popularized by the proliferation of APIs (thank you REST). Part of its popularity is also due to the fact that it’s text-based, human-readable, and with an incredibly simple specification which could be read and understood at the water cooler.
Here’s an example valid JSON:
'{"versions":[1],"name":"json","text":true,"creator":{"name":"Doug", "address":{"country":"USA"}}}'
Pretty printed:
{
"creator": {
"name": "Doug",
"address": {
"country": "USA"
}
}
"name": "json",
"versions": [1],
"text": true,
}
Our goal is to get rid of the nesting. For the above example, if we do a good job, we should have this result:
{
"creator.name": "Doug"
"creator.address.country": "USA",
"name": "json",
"versions": [1],
"text": true,
}
To be fair, this is not an exciting problem. It’s not one of those designed to make you look stupid at the white board. It feels like something you’d actually do on the job.
The Recursive Approach
This is the function we’re going to fill up as we tease apart the problem using well-crafted input.
def flatten(json):
pass
Input Type One (Empty or One-Level Deep)
{"a": "A"}
The simplest input we could be given is a JSON object that’s either empty or only one-level deep. In that case we just return the input as the output. Or we could build a new object and return that instead. Not the most optimal solution, as we’re unnecessarily duplicating the object. But it helps keep the implementation consistent as we improve it so let’s do that.
1def flatten(json):
2 acc = {}
3 for k, v in json.items():
4 acc[k] = v
5
6 return acc
We initialize an empty object as the
accumulator, which is
populated as we iterate over the key-value pairs
of the JSON object, generated by the items
method. At the end we return our newly-built
object. It’s a duplicate of the original input,
but it’s different. It’s not affected by changes
to the original.
Input Type Two (Our First Encounter With Nesting)
{"a": {"b": "c"}, "b": "d"}
The end result of flattening the above input
should be as below. I chose .
as the separator
but it can be anything that makes sense, really.
{"a.b": "c", "b": "d"}
Code solution:
1def flatten(json, acc, prefix):
2 for k, v in json.items():
3 prefixed_k = (prefix + '.' + k) if prefix else k
4 if type(v) is dict:
5 flatten(v, acc, prefixed_k)
6 else:
7 acc[prefixed_k] = v
8
9 return acc
It’s quite a jump from handling the simple input to the slightly more sophisticated one. In the process our function has gained 2 more arguments (one of which we’ve met before) and 3 more lines. A couple of things, though, should remain familiar. For example, we still iterate over the items of the object to build our new object, the accumulator.
The changes which introduce the new behavior have been highlighted.
First the prefix. Since we’re flattening an object, it’s no longer safe to use their original key in the accumulator as it could overwrite an existing pair or be overwritten by a later member. Prefixing the current key with all their ancestor keys prevents this data loss.
On line 4 we do a type test of the value under
consideration, if it’s a dict
, we
flatten
it. The arguments to this call are the
value under consideration, the original accumulator,
and a prefix that contains its key (and all
ancestor objects). It is this call to flatten
that makes the solution recursive. As it tries to
flatten the nested JSON, if it encounters further
objects on its way, it pauses to resolve them. And
then continues.
The code above is all that was needed at my friend’s interview. Well, not quite. The interviewer wanted a function that flattens a JSON object but now we demand of them to pass 2 other arguments to the function. It wasn’t a disgraceful challenge so we should be nice to them. Also, as the author of a post complaining bitterly about N-parameter functions I owe it to you to do the right thing. So we’ll bring it back to a single-parameter function by taking advantage of one of the many niceties of Python.
In Python, a function can be defined in another function. Just like a variable. Even better, it can be called (of course what’s the point of defining a function if it can’t be used), just like an initialized variable can be used. Let’s use this technique to bring our function’s arity down to one:
def flatten(json)
def _flatten(json, acc, prefix):
for k, v in json.items():
prefixed_k = (prefix + '.' + k) if prefix
if type(v) is dict:
_flatten(v, acc, prefixed_k)
else:
acc[prefixed_k] = v
return acc
return _flatten(json, {}, '')
This should please the interviewer. This new
implementation also allows us to accept and return
real JSON strings (the single-quoted stuff). To do
that we do some pre-processing before calling
_flatten
and post-process its return value.
Here’s the updated and final code:
from json import loads, dumps
def flatten(json):
def _flatten(json, acc, prefix):
for k, v in json.items():
prefixed_k = (prefix + '.' + k) if prefix else k
if type(v) is dict:
_flatten(v, acc, prefixed_k)
else:
acc[prefixed_k] = v
return acc
json_obj = loads(json)
return dumps(_flatten(json_obj, {}, ''))
Caveat
Imperative languages don’t do recursion well. And
Python is no exception. In fact it has a hard
limit on how much you can recurse. By default it’s
a 1000. (See $ python -c "import sys;print(sys.getrecursionlimit())"
).
It can be moved, of course, but the limit is there
for a reason: to not crash Python. So, if you’re
going to move it around, please do so gently.
You know what has no limits on it? The iterative solution. Its implementation is left as exercise to the reader.
Got comments or corrections for factual errors? There’s a Hacker News thread for that.