…where it all comes together (sort of)

This puff tries to unravel the general problem of ingesting and making sense of data from the "outside" (and sending it back). In my experience, the process goes roughly like this:

parse the data into some sort of typed object and validate it, all in one go; if that fails, throw an error or relax the schema.

In my opinion, doing it in smaller steps instead provides some value at each stage.

Essentially the dataflow looks like this:

raw₁ -> untyped₁  -> typed -> validated -> (untyped₂) -> raw₂
  1. you parse the raw data that comes in, transforming it into some sort of untyped structure;
    • think of it as JSON with all values being String, or a CSV parsed into a vector of vectors of Strings.
    • untyped structured data can often already be loaded into a DB, queried, and analysed, and thus inform the later steps.
  2. you type the untyped values by trying to coerce them into a typed structure;
    • this and the previous step are often done together, but, as described above, there is value in being able to query and fiddle with untyped data;
    • similarly to the above, you can ingest typed data and analyse it before it is validated;
  3. you validate the typed data, making it validated
    • note that there could be different ways of validating it, especially at the beginning, when a coherent, universal set of validation rules hasn't emerged yet. For example, one team might not care whether a field is positive or negative, while another will.
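
As a rough sketch of the inbound half in Clojure (parse, coerce and validate are hypothetical names for the per-stage functions sketched in the sections below, not a prescribed API):

(declare parse coerce validate)   ; per-stage functions, sketched below

(defn ingest
  "raw₁ -> untyped₁ -> typed -> validated"
  [raw]
  (-> raw
      parse      ; 1. raw -> untyped
      coerce     ; 2. untyped -> typed
      validate)) ; 3. typed -> validated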

1. Parsing / Deserialising

raw -> untyped

Data from the outside world arrives in a number of raw formats. Parsing is the process of getting it into EDN with mostly String values (some may be Numeric, e.g., when parsing binary).

Source Format    | How to parse it into EDN
binary           | binparse
csv/tsv/fixed    | vsc
EDIFACT et al.   | parseq
json             | whatever
avro             | whatever
...              | whatever
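
For instance, a deliberately naive csv sketch that yields the "vector of vectors of Strings" shape mentioned above (a real parser would deal with quoting, separators inside fields, encodings, etc.):

(require '[clojure.string :as str])

(defn parse-csv
  "raw csv text -> untyped: a vector of vectors of Strings"
  [raw]
  (mapv #(str/split % #",")
        (str/split-lines raw)))

(parse-csv "id,qty,status,when\n42,7,shipped,2021-03-01")
;; => [["id" "qty" "status" "when"] ["42" "7" "shipped" "2021-03-01"]]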

2. Typing

untyped -> typed

A.k.a. coercing untyped data (i.e., Strings and possibly numbers) into fully typed data (e.g., enums, Instants, LocalDates, Sets).

The discovery of the coercing functions can be:
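
However they are discovered, the coercing functions themselves tend to be small. A minimal sketch, with made-up field names and a map-shaped record:

(import 'java.time.LocalDate)

(def coercions
  "hypothetical field -> coercing fn"
  {:id     #(Long/parseLong %)
   :qty    #(Long/parseLong %)
   :status keyword                ; "shipped" -> :shipped (enum-ish)
   :when   #(LocalDate/parse %)})

(defn coerce
  "untyped map of Strings -> typed map; unknown fields pass through unchanged"
  [untyped]
  (reduce-kv (fn [m k v] (assoc m k ((coercions k identity) v)))
             {}
             untyped))

(coerce {:id "42" :qty "7" :status "shipped" :when "2021-03-01"})
;; => {:id 42, :qty 7, :status :shipped, :when (a java.time.LocalDate)}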

3. (Logical) Validation

typed -> validated
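
One possible sketch, using clojure.spec (any validation mechanism would do, and the field names are still made up); a second team would simply register different specs over the same typed data:

(require '[clojure.spec.alpha :as s])

;; one team's notion of "valid"; another team may e.g. not require positivity
(s/def ::id pos-int?)
(s/def ::qty pos-int?)
(s/def ::order (s/keys :req-un [::id ::qty]))

(defn validate
  "typed -> validated: return the data unchanged, or throw with an explanation"
  [typed]
  (if (s/valid? ::order typed)
    typed
    (throw (ex-info "invalid order" (s/explain-data ::order typed)))))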

⚗️ Distilled Data

where the many validated data are sent around, stored, queried, analysed, machine-learned, put on the blockchain, sliced and reassembled to solve the Business Problem™

4. (Logical) Validation (again, possibly?)

typed -> typed

5. Untyping

typed -> untyped
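
A minimal sketch, with the same made-up fields as before: turn each typed value back into a String.

(defn untype
  "typed map -> untyped map of Strings"
  [typed]
  (reduce-kv (fn [m k v] (assoc m k (if (keyword? v) (name v) (str v))))
             {}
             typed))

(untype {:id 42 :qty 7 :status :shipped
         :when (java.time.LocalDate/parse "2021-03-01")})
;; => {:id "42", :qty "7", :status "shipped", :when "2021-03-01"}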

6. Serialising

untyped -> raw

Raw data is then ready to be sent back to the world in the format it came in.
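
And a sketch of serialising the untyped map back into a raw csv line (reusing untype from the previous section, and assuming the column order is known):

(require '[clojure.string :as str])

(defn serialise
  "untyped map -> raw csv line, given a column order"
  [columns untyped]
  (str/join "," (map untyped columns)))

(serialise [:id :qty :status :when]
           (untype {:id 42 :qty 7 :status :shipped
                    :when (java.time.LocalDate/parse "2021-03-01")}))
;; => "42,7,shipped,2021-03-01"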

Testing the pipeline

We've seen that the dataflow pipeline goes something like this:

raw₁ -> untyped₁  -> typed -> validated -> (untyped₂) -> raw₂

You should be able to round-trip some sample data and:

assert(raw₁ == raw₂)

If that is not possible, perhaps you can test that:

assert(untyped₁ == untyped₂)
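
Reusing the hypothetical per-stage functions sketched above, a round-trip check with clojure.test might look like:

(require '[clojure.test :refer [deftest is]])

(deftest csv-round-trip
  (let [raw     "42,7,shipped,2021-03-01"
        columns [:id :qty :status :when]]
    (is (= raw
           (->> raw
                parse-csv                ; raw₁ -> untyped₁
                first                    ; a single row in this sample
                (zipmap columns)         ; row of Strings -> untyped map
                coerce                   ; untyped -> typed
                validate                 ; typed -> validated
                untype                   ; -> untyped₂
                (serialise columns)))))) ; untyped₂ -> raw₂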

Misc

You will generate lots of tables/Kafka topics. In my experience it pays to include the stage ({raw, untyped, typed, validated, validated.xyz}) in the name and to version them so they can evolve (at some point you may be publishing to multiple _vs of a topic to give people time to upgrade). E.g.,
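
(illustrative names only; nothing prescribes this exact scheme)

orders.raw_v1
orders.untyped_v1
orders.typed_v2
orders.validated_v2
orders.validated.finance_v1    (a team-specific set of validation rules)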

You may then use a similar naming convention for the schemas/tables in your DBs.

Something like topicmop should be built to help.

🎨 Prior art