Aggregation

Aggregate sensitive data

This tutorial assumes that you have completed the quickstart section of the docs. The prerequisites for this tutorial are the same as mentioned there.

The goal of this tutorial is to aggregate sensitive data for different principals. Some data may be considered sensitive in itself, but when aggregated it loses that sensitivity while preserving its statistical value.

Context

Consider the following snippet of a salary database.

     employee     |   city    |   country   | salary
------------------+-----------+-------------+--------
 Adrian Shaw      | Rotterdam | Netherlands |  61560
 Anaïs Lejeune    | London    | UK          |  69575
 Angela Mueller   | Amsterdam | Netherlands |  81765
 Benjamin Wilson  | Aberdeen  | UK          |  75709
 Bernhard Fechner | Singapore | Singapore   |  74503

These salaries could be considered sensitive information in themselves. But if we present specific principals with an aggregation of the salary field, such as the average within a country, the sensitivity is considerably reduced.
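
For intuition, this is the kind of aggregate that could be computed directly in SQL. The query below is purely illustrative; later in this tutorial PACE will generate a view that applies such aggregations per principal.

-- Illustrative only: a plain aggregate over the salary table.
select country, round(avg(salary)) as average_salary
from public.salary
group by country;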

This is where PACE's aggregation field transform comes in. In this tutorial we will show how to enforce this in a data policy.

File and directory setup

Clone the repository from GitHub, if you haven't already done so. The command below uses HTTPS, but feel free to use SSH instead.

git clone https://github.com/getstrm/pace.git

Now navigate to the aggregation directory inside the pace repo:

cd pace/examples/aggregation

Next, let's have a look at the contents of these files.

The compose file is set up without any persistence of data across restarts of the services. Keep in mind that any changes to the data persist only for as long as you keep the services running.

docker-compose.yaml

The compose file defines three services:

  • pace_app with the ports for all different interfaces exposed to the host:

    • 9090 -> Envoy JSON / gRPC REST Transcoding proxy.

    • 50051 -> gRPC.

    • 8080 -> Spring Boot Actuator.

  • postgres_pace acts as the persistent layer for PACE to store its Data Policies.

    • Available under localhost:5432 on your machine.

  • postgres_processing_platform is the pre-populated database.

    • Available under localhost:5431 on your machine.

data.sql

The PostgreSQL initialization SQL script that is run on startup of the postgres_processing_platform container. The database is configured to use the public schema (the default), with one table, public.salary, containing employee salary data.

The script also creates a few roles and corresponding users, which we will see later on.
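
For reference, here is a minimal sketch of the kind of statements such an initialization script contains, based on the table schema and the roles and users described in this tutorial; the actual data.sql in the repository is the source of truth.

-- Sketch only: the real data.sql may differ in details; the sample rows are omitted here.
create table public.salary (
  employee varchar not null,
  city     varchar not null,
  country  varchar not null,
  salary   integer not null
);

-- User group principals are implemented as database roles.
create role finance;
create role analytics;
create role uk_manager;

-- Users assigned to those roles, plus one user without any role.
create user fin   with password 'fin'   in role finance;
create user anna  with password 'anna'  in role analytics;
create user ukm   with password 'ukm'   in role uk_manager;
create user other with password 'other';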

config/application.yaml

This is the Spring Boot application configuration, which specifies the PACE database connection, and Processing Platform.

spring:
  datasource:
    url: jdbc:postgresql://postgres_pace:5432/pace
    hikari:
      username: pace
      password: pace
      schema: public

app:
  processing-platforms:
    postgres:
      - id: "aggregation-transforms-sample-connection"
        host-name: "postgres_processing_platform"
        port: 5432
        user-name: "aggregation"
        password: "aggregation"
        database: "aggregation"

Here, PACE is configured to connect to the postgres_pace host, which is the service name under which the PACE database container is reachable on the Docker network shared by the pace_app and postgres_pace containers.

Furthermore, one Processing Platform of type postgres is configured, named aggregation-transforms-sample-connection.
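
If you want to verify these connection details yourself, the pre-populated database is reachable from your machine on the host port 5431 mapped in the compose file, using the same credentials, for example:

psql postgresql://aggregation:aggregation@localhost:5431/aggregation -c "select 1"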

data-policy.yaml

This is the Data Policy we'll be creating in this tutorial. It implements the desired sensitivity measures, which we will see below.

Creating the policy

Running PACE

To start the containers, execute the following command from the aggregation directory:

docker compose up --pull always

There should be quite a bit of logging, ending with the startup logs of the PACE app.

If all went well, the pace list tables CLI command should list the salary table:

pace list tables --processing-platform aggregation-transforms-sample-connection \
  --database aggregation --schema public  --output table
 ID       NAME     TAGS

 salary   salary

There should be no existing data policies:

pace list data-policies  --output table
No entities of this resource type exist.

Available DB roles and users

Before we dive into the data policy definition, let's have a look at the user group principals (implemented as DB roles) and corresponding users that are configured on the sample database:

  • finance is a role for the corresponding team that tracks company spending. A DB user named fin with password fin has been assigned this role.

  • analytics is a role for the team responsible for providing statistics about the company. A DB user named anna with password anna has been assigned this role.

  • uk_manager is a role, specifically for any manager located in the UK. A DB user named ukm with password ukm has been assigned this role.

  • A DB user named other with password other is also available and hasn't been assigned any of these roles.

We can connect to the sample database on localhost port 5431, with aggregation as the database name, as any of the above users. None of these users is able to read data from the public.salary table directly. Feel free to connect to the database using your favourite SQL client. If you do want to see the data in this table, connect as the aggregation user with the password aggregation.

psql examples

To illustrate, we can try to retrieve salary records as the fin user:

psql postgresql://fin:fin@localhost:5431/aggregation -c "select * from public.salary order by employee limit 5"

Which results in the following output:

ERROR:  permission denied for table salary

And using aggregation instead:

psql postgresql://aggregation:aggregation@localhost:5431/aggregation -c "select * from public.salary order by employee limit 5"

Which results in the following output:

     employee     |   city    |   country   | salary
------------------+-----------+-------------+--------
 Adrian Shaw      | Rotterdam | Netherlands |  61560
 Anaïs Lejeune    | London    | UK          |  69575
 Angela Mueller   | Amsterdam | Netherlands |  81765
 Benjamin Wilson  | Aberdeen  | UK          |  75709
 Bernhard Fechner | Singapore | Singapore   |  74503

Fetching a blueprint Data Policy

We can use the PACE CLI to retrieve a blueprint policy for the salary table as follows:

pace get data-policy --blueprint \
    --processing-platform aggregation-transforms-sample-connection \
    --database aggregation \
    --schema public \
    salary 

This returns the following data policy definition in YAML, without any field transforms or filters yet:

metadata:
  description: ""
  title: public.salary
source:
  fields:
  - name_parts:
    - employee
    required: true
    type: varchar
  - name_parts:
    - city
    required: true
    type: varchar
  - name_parts:
    - country
    required: true
    type: varchar
  - name_parts:
    - salary
    required: true
    type: integer
  ref: 
    integration_fqn: public.salary
    platform:
      id: aggregation-transforms-sample-connection
      platform_type: POSTGRES

This definition essentially contains the reference to and schema of the source table. If desired, we can change the title and provide a description, such as:

metadata:
  description: "This is policy restricts and aggregates sensitive salary data."
  title: "Employee salary"

We can start filling in the policy by adding a rule_sets section:

[...]
rule_sets:
  - target:
      ref: 
        integration_fqn: public.salary_view

Here we specify the fully qualified name of the view that will be created by the policy. Let's start with a filter that makes sure the uk_manager role can only see UK data:

[...]
rule_sets:
  - target:
      ref: 
        integration_fqn: public.salary_view
    filters:
      - generic_filter:
          conditions:
            - principals: [ { group: uk_manager } ]
              condition: "country = 'UK'"
            - principals: [ ]
              condition: "true"

The condition is defined in (standard) SQL, compatible with the target platform (our Postgres database in this case). Next, we can codify the remaining requirements as field transforms, resulting in the below rule set specification. See our overview of transforms for more details on their configuration.

[...]
rule_sets:
  - target:
      ref: 
        integration_fqn: public.salary_view
    filters:
      - generic_filter:
          conditions:
            - principals: [ { group: uk_manager } ]
              condition: "country = 'UK'"
            - principals: [ ]
              condition: "true"
    field_transforms:
      - field:
          name_parts: [ salary ]
        transforms:
          - principals: [ { group: administrator } ]
            identity: { }
          - principals: [ { group: finance } ]
            aggregation:
              sum: { }
          - principals: [ { group: uk_manager } ]
            aggregation:
              partition_by:
                - name_parts: [ city ]
              avg:
                precision: 0
                cast_to: "int4"
          - principals: [ { group: analytics } ]
            numeric_rounding:
              round:
                precision: -4
          - principals: [ ]
            nullify: {}

Let's break it down.

  • administrator: can see any salary as is, due to the identity transform.

  • finance: is only allowed to see the sum of all salaries.

  • uk_manager: can only see UK salaries. The salaries are averaged for each city and rounded to the nearest integer.

  • analytics: can see any salary, but the salaries are rounded to the nearest 10,000.

  • everyone else (such as the other user): can see all rows, but the salaries are all nullified.
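
To make these rules concrete, below is a rough sketch of the shape of view such a rule set could translate to on PostgreSQL, assuming group membership is resolved through database roles (checked here with pg_has_role). It is purely illustrative; the SQL that PACE actually generates may differ.

-- Sketch only: illustrates the shape of the resulting view, not PACE's exact output.
-- The administrator branch is left out here, since that role is not created in the sample database.
create or replace view public.salary_view as
select
  employee,
  city,
  country,
  case
    when pg_has_role('finance', 'member')
      then sum(salary) over ()                                          -- one grand total for everyone
    when pg_has_role('uk_manager', 'member')
      then cast(round(avg(salary) over (partition by city), 0) as int4) -- per-city average, rounded
    when pg_has_role('analytics', 'member')
      then round(salary, -4)                                            -- nearest 10,000
    else null                                                           -- everyone else: nullified
  end as salary
from public.salary
where case
    when pg_has_role('uk_manager', 'member') then country = 'UK'
    else true
  end;

The idea is that every principal queries the same view, and the combination of filters and transforms determines what they are allowed to see.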

The data-policy.yaml file in the aggregation directory contains the same policy. To apply it and create the salary_view, we can use the PACE CLI again:

pace upsert data-policy data-policy.yaml --apply

The pace list data-policies command will now return it.

Querying the view

Now that the public.salary_view has been created, we can compare the query results that the various users get (see also the psql example queries further below). First, a few of the original values again:

     employee     |   city    |   country   | salary
------------------+-----------+-------------+--------
 Adrian Shaw      | Rotterdam | Netherlands |  61560
 Anaïs Lejeune    | London    | UK          |  69575
 Angela Mueller   | Amsterdam | Netherlands |  81765
 Benjamin Wilson  | Aberdeen  | UK          |  75709
 Bernhard Fechner | Singapore | Singapore   |  74503

Compared to these original values, each user sees something different when querying public.salary_view: the fin user (finance) sees the same grand total in the salary column of every row, the ukm user (uk_manager) only sees UK rows, with the salary replaced by the rounded per-city average, the anna user (analytics) sees salaries rounded to the nearest 10,000, and the other user sees all rows with a nullified salary.

psql examples

You could use the following psql commands to query the view.

finance

psql postgresql://fin:fin@localhost:5431/aggregation -c "select * from public.salary_view order by employee limit 5"

uk_manager

psql postgresql://ukm:ukm@localhost:5431/aggregation -c "select * from public.salary_view order by employee limit 5"

analytics

psql postgresql://anna:anna@localhost:5431/aggregation -c "select * from public.salary_view order by employee limit 5"

other

psql postgresql://other:other@localhost:5431/aggregation -c "select * from public.salary_view order by employee limit 5"
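
Beyond these example queries, the view supports further analysis. For instance, the anna user (analytics) could compute per-country statistics over the rounded salaries exposed by the view; the exact numbers depend on the sample data:

psql postgresql://anna:anna@localhost:5431/aggregation -c "select country, round(avg(salary)) as avg_salary from public.salary_view group by country order by country"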

Cleanup

That wraps up this aggregation transforms tutorial. To clean up all resources, run the following command after stopping the currently running process with ctrl+C.

docker compose down

Any questions or comments? Please ask them on Slack.
