....where it all comes together (sort-of)
This puff tries to unravel the general problem of ingesting and making sense of data from the "outside" (and sending it back). In my experience, the process goes roughly like this:
parse the data into some sort of typed object and validate it; else throw an error or relax the schema.
In my opinion, doing it in smaller steps provides some value at each stage.
Essentially the dataflow looks like this:
`raw₁ -> untyped₁ -> typed -> validated -> (untyped₂) -> raw₂`
- you parse the `raw` data that comes in, transforming it into some sort of `untyped` structure;
  - think of it as JSON with all values being `String`, or a CSV parsed into a vector of vectors of `String`s;
  - `untyped` structured data can often already be loaded into a DB, queried, and analysed, and thus inform the later steps.
- you type the `untyped` values by trying to coerce them into a `typed` structure;
  - this and the previous step are often done together, but, as described above, there is value in being able to query and fiddle with `untyped` data;
  - similarly to the above, you can ingest `typed` data and analyse it before it is `validated`.
- you validate the `typed` data, making it `validated` (a sketch of a record at each of these stages follows after this list);
  - note that there could be different ways of validating it, especially at the beginning, when a coherent, universal set of validation rules hasn't emerged yet. For example, one team might not care whether a field is positive or negative, while another will.
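
To make the stages concrete, here is a hypothetical record (the field names are made up) as EDN at each point of the flow:

```clojure
;; untyped – straight out of the parser, every value is a String
{:trade-id "42", :side "BUY", :ts "2021-03-01T10:15:30Z", :amount "100.5"}

;; typed – the same record after coercion
{:trade-id 42, :side :buy, :ts #inst "2021-03-01T10:15:30Z", :amount 100.5M}

;; validated – same shape as typed, but now known to satisfy the rules
;; (e.g. :amount is positive, :side is one of #{:buy :sell})
```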
1. Parsing / Deserialising
`raw` -> `untyped`
Data coming from the outside world comes in a number of `raw` formats. This is the process of getting it into EDN with mostly `String` values (some may be `Numeric`, e.g., when parsing binary).
| Source Format | How to parse it into EDN |
|---|---|
| binary | binparse |
| csv/tsv/fixed | vsc |
| EDIFACT et al. | parseq |
| json | whatever |
| avro | whatever |
| ... | whatever |
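
As a minimal sketch of this step (assuming a naive CSV with no quoting/escaping, and the hypothetical field names from above), plain Clojure is enough to get to `untyped`:

```clojure
(require '[clojure.string :as str])

(defn csv->untyped
  "Parses a naive CSV string into a vector of row maps; every value stays a String."
  [raw]
  (let [[header & rows] (map #(str/split % #",") (str/split-lines raw))]
    (mapv #(zipmap (map keyword header) %) rows)))

(csv->untyped "trade-id,side,amount\n42,BUY,100.5")
;; => [{:trade-id "42", :side "BUY", :amount "100.5"}]
```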
2. Typing
`untyped` -> `typed`
A.k.a. coercing untyped data (i.e., `String`s and possibly numbers) into fully typed data (e.g., `enum`s, `Instant`s, `LocalDate`s, `Set`s).
The discovery of the coercing functions can be:

- Fully manual (a sketch follows after this list):
  - hard to get right on first try;
  - repetitive;
  - tools:
- Computer aided:
  - faster, more iterative approach;
  - `typeguess` + Trifacta-like UI, but for nested data;
- Inference driven:
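
A sketch of the fully manual approach (field names and rules invented for illustration): one coercing function per field, applied over the `untyped` map.

```clojure
(require '[clojure.string :as str])
(import '(java.time Instant))

(def coercers
  {:trade-id #(Long/parseLong %)
   :side     #(keyword (str/lower-case %)) ; "BUY" -> :buy
   :ts       #(Instant/parse %)
   :amount   bigdec})

(defn untyped->typed
  "Coerces an untyped (all-String) map into a typed one, field by field."
  [m]
  (reduce-kv (fn [acc k v] (assoc acc k ((get coercers k identity) v))) {} m))

(untyped->typed {:trade-id "42" :side "BUY" :ts "2021-03-01T10:15:30Z" :amount "100.5"})
;; => a map with a Long, a keyword, an Instant and a BigDecimal
```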
3. (Logical) Validation
`typed` -> `validated`
- Fully manual:
  - custom code
  - malli (see the sketch below)
  - clojure spec coercers
  - cue
  - javax.validation
- Computer aided:
  - could infer some simple rules (e.g., +ve/-ve/enum) like spec-provider does?
TODO: must be able to provide human-readable error messages.
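
A sketch of the manual route using malli, one of the tools listed above (the schema itself is made up); `malli.error/humanize` also covers the human-readable messages the TODO asks for:

```clojure
(require '[malli.core :as m]
         '[malli.error :as me])

(def Trade
  [:map
   [:trade-id pos-int?]
   [:side [:enum :buy :sell]]
   [:amount [:> 0]]])

(m/validate Trade {:trade-id 42 :side :buy :amount 100.5M})
;; => true

(-> Trade
    (m/explain {:trade-id 42 :side :hold :amount -1M})
    me/humanize)
;; => roughly {:side ["should be either :buy or :sell"], :amount ["should be larger than 0"]}
```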
⚗️ Distilled Data
where the many `validated` data are sent around, stored, queried, analysed, machine-learned, put on the blockchain, sliced and reassembled to solve the Business Problem™:

- Kafka Streams (onion) to filter and transform data;
- Kafka Connect (onion again?) to ingest it into specific DBs where it can be analysed and spat back out;
- Ad-hoc services to do ad-hoc stuff.
4. (Logical) Validation (again, possibly?)
`typed` -> `typed`
5. Untyping
`typed` -> `untyped`
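
A minimal sketch, reusing the hypothetical record from above: turn every value back into a `String`. Exact round-tripping may need per-field formatters (e.g., upper-casing `:side` again).

```clojure
(defn typed->untyped
  "Turns a typed map back into an all-String one."
  [m]
  (reduce-kv (fn [acc k v]
               (assoc acc k (if (keyword? v) (name v) (str v))))
             {} m))

(typed->untyped {:trade-id 42 :side :buy :amount 100.5M})
;; => {:trade-id "42", :side "buy", :amount "100.5"}
```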
6. Serializing
`untyped` -> `raw`
`raw` data is ready to be sent back to the world in the format it came in.
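
As a sketch, this is the inverse of the naive CSV parser from step 1 (same assumptions: no quoting/escaping, hypothetical field names):

```clojure
(require '[clojure.string :as str])

(defn untyped->csv
  "Serialises a vector of all-String row maps back into a naive CSV string."
  [rows]
  (let [header (keys (first rows))]
    (str/join "\n"
              (cons (str/join "," (map name header))
                    (map (fn [row] (str/join "," (map row header))) rows)))))

(untyped->csv [{:trade-id "42" :side "BUY" :amount "100.5"}])
;; => "trade-id,side,amount\n42,BUY,100.5"
```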
Testing the pipeline
We've seen that the dataflow pipeline goes something like this:
`raw₁ -> untyped₁ -> typed -> validated -> (untyped₂) -> raw₂`
You should be able to round-trip some sample data and:
`assert(raw₁ == raw₂)`
If that is not possible, perhaps you can test that:
`assert(untyped₁ == untyped₂)`
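
With the hypothetical helpers sketched above, the round-trip check could be a plain `clojure.test` test:

```clojure
(require '[clojure.test :refer [deftest is]])

(deftest csv-round-trip
  (let [raw-1     "trade-id,side,amount\n42,BUY,100.5"
        untyped-1 (csv->untyped raw-1)
        raw-2     (untyped->csv untyped-1)]
    ;; ideally the raw data survives byte-for-byte...
    (is (= raw-1 raw-2))
    ;; ...otherwise fall back to comparing the untyped representations
    (is (= untyped-1 (csv->untyped raw-2)))))
```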
Misc
You will generate lots of tables/kafka topics. In my experience it pays to include the stage ({`raw`, `untyped`, `typed`, `validated`, `validated.xyz`}) in the name and to version them so they can evolve (at some point you may be publishing to multiple `_v`s of a topic to give people time to upgrade).
E.g., `data_D0918_untyped_v1`, `data_D0918_typed_v3`.
You may then have a similar naming scheme for your schemas/tables in your DBs.
Something like topicmop should be built to help.
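
topicmop doesn't exist (yet); a first, hypothetical helper for the naming scheme above might be no more than:

```clojure
(defn topic-name
  "Builds a topic name like \"data_D0918_untyped_v1\" from dataset, stage and version."
  [dataset stage version]
  (format "%s_%s_v%d" dataset (name stage) version))

(topic-name "data_D0918" :untyped 1)
;; => "data_D0918_untyped_v1"
```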