Overview
The concept of lineage
Lineage
Data entities (like tables, or table columns) have lineage which indicates where the data came from, and what is derived from this data. We use the terms
'upstream' to indicate some data that this data entity was derived from,
and 'downstream' data entities that are derived from this one.
The concept of lineage is well known, and supported more or less by most data processing platforms and data catalogs. Here's an example of a lineage view in BigQuery, taken from its lineage tutorial
Lineage in PACE
PACE aims to
provide data entity lineage information of the various connected platforms in a unified manner. This information is available for all data entities in the underlying platform, with or without a connected PACE data policy.
provide guidance to indicate which data entities might need a PACE data policy, because they're in the lineage chain of a data entity that is under control of a PACE data policy.
Lineage information of data entities.
We've executed the above mentioned BigQuery tutorial on the stream-machine-development
Google Cloud project, with BigQuery access configured in PACE under the id
bigquery-dev
.
The responses are the platform fully-qualified names that we use in the data policy source
references.
Here we show an annotated and somewhat simplified execution of the get lineage
command.
Responds with:
Scanning lineage
This process aims to find data entities that might need a PACE data policy that don't yet have one. Scanning lineage works by
scanning through the stored data policies (within PACE), and finding the lineage information from the underlying processing platform data entity.
Then, for both upstream and downstream dependencies, PACE determines if those are under control of a data policy.
We have set up a data policy on a table named cdc_diabetes
(from UCI). This table has various downstream dependencies in our environment, one of which is controlled by a data policy (stream-machine-development.dynamic_views.diabetes_demo
). Other up- and downstream data entities are not provided with a PACE data policy.
Note: we have shortened the string stream-machine-development
to dev
to keep the tabular output narrower.
The ✓ and ✗ symbols are just representations of the true/false values in the PACE api response; use --output yaml
(or json
) to get the full api response.
The goal is obviously to build a graph like display that will show all lineage, and highlight data entities that probably should be controlled by a PACE data policy.
Last updated