This tutorial assumes that you have completed the quickstart section of the docs. The prerequisites for this tutorial are the same as mentioned there.
The goal of this tutorial is to be able to aggregate data based on simple prerequisites. Some data may be considered sensitive in itself, but when aggregated lose the sensitivity while preserving statistical value.
Context
Consider the following snippet of a salary database.
employee | city | country | salary
------------------+-----------+-------------+--------
Adrian Shaw | Rotterdam | Netherlands | 61560
Anaïs Lejeune | London | UK | 69575
Angela Mueller | Amsterdam | Netherlands | 81765
Benjamin Wilson | Aberdeen | UK | 75709
Bernhard Fechner | Singapore | Singapore | 74503
These salaries could be considered sensitive information in itself. But if we were able to present specific principals with an aggregation of the salary field, like the average within a country, then the sensitivity will considerably reduced.
This is where PACE's aggregation field transform comes in. In this tutorial we will show how to enforce this in a data policy.
File and directory setup
Clone the repository from GitHub, if you haven't already done so. This command assumes you're not using SSH, but feel free to do so.
gitclonehttps://github.com/getstrm/pace.git
Create a directory aggregation with the following directory tree structure:
Now navigate to the aggregation directory inside the pace repo:
cdpace/examples/aggregation
Next, let's have a look at the contents of these files.
The compose file is set up without any persistence of data across different startups of the services. Keep in mind that any changes to the data will be persisted for as long as you keep the services running.
docker-compose.yaml
The compose file defines three services:
pace_app with the ports for all different interfaces exposed to the host:
9090 -> Envoy JSON / gRPC REST Transcoding proxy.
50051 -> gRPC.
8080 -> Spring Boot Actuator.
postgres_pace acts as the persistent layer for PACE to store its Data Policies.
Available under localhost:5432 on your machine.
postgres_processing_platform is the pre-populated database.
Available under localhost:5431 on your machine.
data.sql
The PostgreSQL initialization SQL script that is run on startup for the postgres_processing_platform container. The database is configured to use the public schema (the default), with one table: public.salary, containing employee salary data
The script also creates a few roles and corresponding users, which we will see later on.
config/application.yaml
This is the Spring Boot application configuration, which specifies the PACE database connection, and Processing Platform.
Here, PACE is configured to connect to the postgres_pace host, which is the name of the Docker network that both the pace_app and the postgres_pace containers are configured with.
Furthermore, one Processing Platform of type postgres is configured, named aggregation-transforms-example-connection.
data-policy.yaml
This is the Data Policy we'll be creating in this tutorial. It implements the desired sensitivity measures, which we will see below.
Creating the policy
Running PACE
To start the containers, execute the following command from the aggregation directory:
dockercomposeup--pullalways
There should be quite a bit of logging, ending with the startup logs of the PACE app.
If all went well, the pace list tables CLI command should return two tables:
Before we dive into the data policy definition, let's have a look at the user group principals (implemented as DB roles) and corresponding users that are configured on the sample database:
finance is a role for the corresponding team that tracks company spending. A DB user named fin with password fin has been assigned this role.
analytics is a role for the team responsible for providing statistics about the company. A DB user named anna with password anna has been assigned this role.
uk_manager is a role, specifically for any manager located in the UK. A DB user named ukm with password ukm has been assigned this role.
A DB user named other with password other is also available and hasn't been assigned either of the roles.
We can connect with the sample database on port 5431 on localhost, with aggregation as database name, with either of the above users. None of these users will be able to read data from the public.salary table. Feel free to connect with the database using your favourite SQL client. If you do want to see the data in these tables, you can connect with the user aggregation using the password aggregation.
psql examples
To illustrate, we can try to retrieve transactions with the fin user:
psqlpostgresql://fin:fin@localhost:5431/aggregation-c"select * from public.salary order by employee limit 5"
Which results in the following output:
ERROR: permission denied for table salary
And using aggregation instead:
psqlpostgresql://aggregation:aggregation@localhost:5431/aggregation-c"select * from public.salary order by employee limit 5"
Which results in the following output:
employee | city | country | salary
------------------+-----------+-------------+--------
Adrian Shaw | Rotterdam | Netherlands | 61560
Anaïs Lejeune | London | UK | 69575
Angela Mueller | Amsterdam | Netherlands | 81765
Benjamin Wilson | Aberdeen | UK | 75709
Bernhard Fechner | Singapore | Singapore | 74503
Fetching a blueprint Data Policy
We can use the PACE CLI to retrieve a blueprint policy for the transactions table as follows:
This definition essentially contains the reference to and schema of the source table. If desired, we can change the title and provide a description, such as:
metadata:description:"This is policy restricts and aggregates sensitive salary data."title:"Employee salary"
We can start filling in the policy by adding a rule_sets section:
Here we specify the full name of the view that will be created by the policy. Let's start with a filter, that makes sure that the uk_manager role can only see uk data:
The condition is defined in (standard) SQL, compatible with the target platform (our Postgres database in this case). Next, we can codify the remaining requirements as field transforms, resulting in the below rule set specification. See our overview of transforms for more details on their configuration.
administrator: can see any salary as is, due to identity transform.
finance: is only allowed to see the sum of all salaries.
uk_manager: can only see UK salaries. The salaries are averaged for each city and rounded to the nearest integer.
analytics: can see any salary, but the salaries are rounded to the nearest 10k
other: can see all rows, but the salaries are all nullified.
The data-policy.yaml file in the aggregation-transforms directory contains the same policy. To apply it and create the salary_view, we can use the PACE CLI again:
paceupsertdata-policydata-policy.yaml--apply
The pace list data-policies command will now return it.
Querying the view
Now that the public.salary_view has been created, we can compare the query results the various users get (see also the psql example queries further below). First a few of the original values again:
employee | city | country | salary
------------------+-----------+-------------+--------
Adrian Shaw | Rotterdam | Netherlands | 61560
Anaïs Lejeune | London | UK | 69575
Angela Mueller | Amsterdam | Netherlands | 81765
Benjamin Wilson | Aberdeen | UK | 75709
Bernhard Fechner | Singapore | Singapore | 74503
Compared to the different users:
employee | city | country | salary
------------------+-----------+-------------+--------
Adrian Shaw | Rotterdam | Netherlands | 61560
Anaïs Lejeune | London | UK | 69575
Angela Mueller | Amsterdam | Netherlands | 81765
Benjamin Wilson | Aberdeen | UK | 75709
Bernhard Fechner | Singapore | Singapore | 74503
employee | city | country | salary
------------------+-----------+-------------+---------
Adrian Shaw | Rotterdam | Netherlands | 7526410
Anaïs Lejeune | London | UK | 7526410
Angela Mueller | Amsterdam | Netherlands | 7526410
Benjamin Wilson | Aberdeen | UK | 7526410
Bernhard Fechner | Singapore | Singapore | 7526410
employee | city | country | salary
--------------------+----------+---------+--------
Anaïs Lejeune | London | UK | 74382
Benjamin Wilson | Aberdeen | UK | 74159
Charles Renault | London | UK | 74382
Christina Copeland | Aberdeen | UK | 74159
Christine King | Aberdeen | UK | 74159
employee | city | country | salary
------------------+-----------+-------------+--------
Adrian Shaw | Rotterdam | Netherlands | 60000
Anaïs Lejeune | London | UK | 70000
Angela Mueller | Amsterdam | Netherlands | 80000
Benjamin Wilson | Aberdeen | UK | 80000
Bernhard Fechner | Singapore | Singapore | 70000
employee | city | country | salary
------------------+-----------+-------------+--------
Adrian Shaw | Rotterdam | Netherlands |
Anaïs Lejeune | London | UK |
Angela Mueller | Amsterdam | Netherlands |
Benjamin Wilson | Aberdeen | UK |
Bernhard Fechner | Singapore | Singapore |
psql examples
You could use the following psql commands to query the view.
finance
psqlpostgresql://fin:fin@localhost:5431/aggregation-c"select * from public.salary_view order by employee limit 5"
uk_manager
psqlpostgresql://ukm:ukm@localhost:5431/aggregation-c"select * from public.salary_view order by employee limit 5"
analytics
psqlpostgresql://anna:anna@localhost:5431/aggregation-c"select * from public.salary_view order by employee limit 5"
other
psqlpostgresql://other:other@localhost:5431/aggregation-c"select * from public.salary_view order by employee limit 5"
Cleanup
That wraps up this aggregation transforms tutorial. To clean up all resources, run the following command after stopping the currently running process with ctrl+C.
dockercomposedown
Any questions or comments? Please ask them on Slack.