Declarative Data Transformations

A while back I needed to tap into third-party JSON data and convert it for use within my own application, ETL-style.

I naively expected that ingesting that data would just mean writing a few lines of JavaScript, but quickly realized that data in the real world is messy: Due to the JSON being entered manually, without much in the way of automated validation, records don’t adhere to a strict schema. Consequently their fields might show up inconsistently, meaning the conceptual model only reveals itself when analyzing the entire data set to detect commonalities. Similarly, values might be assigned arbitrarily (e.g. blank values could be null, empty strings or simply omitted). Because the data set continues to evolve, its shape might shift over time as entities and attributes are added, removed or even reinterpreted – often after months of inactivity.

Thus my initial, naive implementation quickly ballooned into a complex web of imperative validation. This prompted me to look into tools like joi, but that didn’t lead to significant improvements in terms of complexity and maintainability. What I really want is declarative, easy-to-scan definitions of both the entities within that JSON data and the corresponding transformations into my own model.

Let’s assume we want to collate various book databases, then a definition for one imaginary (and comparatively simple) third-party data source might be sketched out like this:

sketch illustrating source data as well as target model

This lays out our expectations regarding incoming data on the left, both in terms of structure and value types, along with the corresponding target model on the right. Due to the aforementioned inconsistencies, we have to account for varying value representations and recognize certain fields as being optional (dashed lines) on the left-hand side, but can consolidate values within our target model: A number might also be wrapped in a string ("123"), blank (null or an empty string) or missing entirely, but always ends up either as a number or a null value on our end. Similarly, a blank author field ends up as an empty authors array.

Thus the target model is a simplified, normalized projection of the original data, which means we don’t have to worry about gross inconsistencies in further processing. Because we’re not currently interested in the editions field, we omit it from the target model. It might seem odd that we’re expressing expectations on data we end up discarding, but that’s on purpose here: It’s quite possible that future additions will contain information we’re interested in, so we want to be notified when that happens.

It would be nice if we could easily recognize those structures in our code as well – indeed, it’s kind of essential for the long-term viability of a project that we only return to a few times per year. This is how declepticon evolved from my initial implementation:

let descriptor = {
    name: "Book",
    fields: {
        title: nonBlankString,
        year: optional(integer, integerString, ""),
        author: optional(csv, nonBlankString),
        publisher: optional(value => publishers.has(value)),
        editions: arrayOf(struct("BookEdition", {
            year: optional(integer, null),
            desc: optional(nonBlankString, "", null)
        }))
    },
    slots: {
        title: eager,
        year: ({ year }) => asInteger(year) || null,
        authors: ({ author }) => author ? parseCSV(author) : [],
        publisher: ({ publisher }) => publisher || null
    },
    stringify: ({ title }) => `"${title}"`
};

Here fields describe the third-party JSON data and associated assumptions on our part while slots defines corresponding transformations into our own model. (eager ensures that a slot is populated early on so it can be used within the respective string representation; see warning messages below). We can then apply this descriptor to validate incoming data and turn it into model instances:

let transform = transformation(descriptor);

let records = require("/path/to/books.json");
records.map(record => {
    return transform(record);
});

Any discrepancies will be reported:

[WARNING] <Book "ETL for Dummies">: invalid `author`: `""`
[WARNING] <Book "ETL for Dummies">: spurious entries `price`, `description`

This tells us that we should update our descriptor and perhaps even incorporate those newly added fields into our own model.

So declepticon essentially provides a primitive type system (alternately called a rich data-driven specification language), enforced at runtime. I wish there was an existing solution I could have used, but the alternatives I’d found proved unsuitable for this particular objective.

If you happen to have use for it, please let me know how declepticon works for you – either down in the comments or on GitHub.

Declarative Data Transformations

Validating and Transforming Inconsistently Structured Data

TAGS