1. 程式人生 > >Extract Nested Data From Complex JSON Trees

Extract Nested Data From Complex JSON Trees

Extract Nested Data From Complex JSON Trees

The Pythonic way of working with JSON objects

We’re all data people here, so you already know the scenario: it happens perhaps once a day, perhaps 5, or even more. There’s an API you’re working with, and it’s great. It contains all the information you’re looking for, but there’s just one problem: the complexity of nested JSON objects is endless, and suddenly the job you love needs to be put on hold to painstakingly retrieve the data you actually want, and it’s 5 levels deep in a nested JSON hell. Nobody feels like much of a “scientist” or an “engineer” when half their day becomes dealing with key-value errors.

Luckily, we code in Python (okay fine, that doesn’t make much of a difference here but it felt like a rallying call for a brief moment there).

Real World Example

To visualize the problem, let’s take an example somebody might actually want to use. I think the Google Maps API is a good candidate to fit the bill here.

While Google Maps is actually a collection of APIs, the Google Maps Distance Matrix. The idea is that with a single API call, a user can calculate the distance and time traveled between an origin and an infinite number of destinations. It’s a great full-featured API, but as you might imagine the resulting JSON for calculating commute time between where you stand and every location in the conceivable universe

makes an awfully complex JSON structure.

Real quick, here’s an example of the types of parameters this request accepts:

import API_KEY
def __init__():    params = {       'units': 'imperial',       'key': API_KEY,       'origins': 'New York City, NY',       'destinations': 'Philadelphia,PA',       'transit_mode': 'car'    }

One origin, one destination. Pretty simple response:

{    "destination_addresses": [        "Philadelphia, PA, USA"    ],    "origin_addresses": [        "New York, NY, USA"    ],    "rows": [        {            "elements": [                {                    "distance": {                        "text": "94.6 mi",                        "value": 152193                    },                    "duration": {                        "text": "1 hour 44 mins",                        "value": 6227                    },                    "status": "OK"                }            ]        }    ],    "status": "OK"}

So for each destination, we’re getting two significant kinds of data: the commute distance, and estimated duration.

Now let’s make things interesting by adding a few more stops on our trip:

import API_KEY
def __init__():    params = {       'units': 'imperial',       'key': API_KEY,       'origins': 'New York City, NY',       'destinations': 'Washington,DC|Philadelphia,PA|Santa Barbara,CA|Miami,FL|Austin,TX|Napa County,CA',       'transit_mode': 'car'    }

Oh boy:

{    "destination_addresses": [        "Washington, DC, USA",        "Philadelphia, PA, USA",        "Santa Barbara, CA, USA",        "Miami, FL, USA",        "Austin, TX, USA",        "Napa County, CA, USA"    ],    "origin_addresses": [        "New York, NY, USA"    ],    "rows": [        {            "elements": [                {                    "distance": {                        "text": "227 mi",                        "value": 365468                    },                    "duration": {                        "text": "3 hours 54 mins",                        "value": 14064                    },                    "status": "OK"                },                {                    "distance": {                        "text": "94.6 mi",                        "value": 152193                    },                    "duration": {                        "text": "1 hour 44 mins",                        "value": 6227                    },                    "status": "OK"                },                {                    "distance": {                        "text": "2,878 mi",                        "value": 4632197                    },                    "duration": {                        "text": "1 day 18 hours",                        "value": 151772                    },                    "status": "OK"                },                {                    "distance": {                        "text": "1,286 mi",                        "value": 2069031                    },                    "duration": {                        "text": "18 hours 43 mins",                        "value": 67405                    },                    "status": "OK"                },                {                    "distance": {                        "text": "1,742 mi",                        "value": 2802972                    },                    "duration": {                        "text": "1 day 2 hours",                        "value": 93070                    },                    "status": "OK"                },                {                    "distance": {                        "text": "2,871 mi",                        "value": 4620514                    },                    "duration": {                        "text": "1 day 18 hours",                        "value": 152913                    },                    "status": "OK"                }            ]        }    ],    "status": "OK"}

There’s a lot happening here. There are objects. There are lists. There are lists of objects which are part of an object. The last thing I’d want to deal with is trying to parse this data only to accidentally get a useless key:value pair like “status”: “OK”.

Obligatory Code Snippet Solution

Let’s say we only want the human-readable data from this JSON, which is labeled “text” for both distance and duration. The function below extract_values can help us here. We'll pass three parameters:

  • The API request response JSON object
  • An array to dump the values
  • The key we’re looking to extract

Regardless of where the key “text” lives in the JSON, this function will return every value for the instance of “key”.

arr = []def extract_values(obj, arr, key):    """Recursively pull values of specified key from nested JSON."""    if isinstance(obj, dict):        for k, v in obj.items():            if isinstance(v, (dict, list)):                extract_values(v, arr, key)            elif k == key:               acc.append(v)    elif isinstance(obj, list):          for item in obj:              extract_values(item, arr, key)
extract_values(GoogleAPIRequest, arr, "text")

So a script like this:

import requestsimport API_KEY
import API_KEY
def __init__():    params = {       'units': 'imperial',       'key': "AIzaSyAhgdmWDWEqukxL4hVq3wB00VMr5wVOPfs",       'origins': "New York City,NY",       'destinations': "Washington,DC|Philadelphia,PA|Santa Barbara,CA|Miami,FL|Austin,TX|Napa Valley,CA",       'transit_mode': 'car',       }
r = requests.get(endpoint, params=params)
arr = []    def extract_values(obj, arr, key):        """Recursively pull values of specified key from nested JSON."""        if isinstance(obj, dict):            for k, v in obj.items():                if isinstance(v, (dict, list)):                    extract_values(v, arr, key)                elif k == key:                   arr.append(v)        elif isinstance(obj, list):              for item in obj:                  extract_values(item, arr, key)
extract_values(r.json(), arr, "text")    print(arr)

Results in this:

['227 mi', '3 hours 54 mins', '94.6 mi', '1 hour 44 mins', '2,878 mi', '1 day 18 hours', '1,286 mi', '18 hours 43 mins', '1,742 mi', '1 day 2 hours', '2,871 mi', '1 day 18 hours']

Alas, an array of only the things we want. Maybe we should separate this into two arrays, where each array contains the same type of information:

extract_values(r.json(), arr, "text")print('durations = ', arr[1::2])print('distances = ', arr[2::1])
# -----------------------------------# Output:# -----------------------------------durations = ['3 hours 54 mins', '1 hour 44 mins', '1 day 18 hours', '18 hours 43 mins', '1 day 2 hours', '1 day 18 hours']
distances = ['94.6 mi', '1 hour 44 mins', '2,878 mi', '1 day 18 hours', '1,286 mi', '18 hours 43 mins', '1,742 mi', '1 day 2 hours', '2,871 mi', '1 day 18 hours']

Feel Free to Improve

While it’s nice to have a function that seems to solve all your problems at the moment, there’s obvious room for improvement here. I’ll leave that in your hands.