Data Policy Generation

Generate a Data Policy using the OpenAI plugin

Extensibility is an important aspect of PACE. Functionality can be added through plugins, the first type of which is the OpenAI plugin. More detail on creating your own plugins will follow soon. In this tutorial, we cover our OpenAI Data Policy Generator implementation.

pace list plugins
plugins:
- actions:
  - invokable: true
    type: GENERATE_DATA_POLICY
  - invokable: true
    type: GENERATE_SAMPLE_DATA
  id: openai
  implementation: com.getstrm.pace.plugins.builtin.openai.OpenAIPlugin

This plugin has two actions. In this tutorial, we'll explore the GENERATE_DATA_POLICY action.

The OpenAI Data Policy Generator uses the OpenAI Chat API to generate a Rule Set for a given blueprint Data Policy, based on a textual description of filters and field transforms.

An OpenAI API key is required for this tutorial. You can generate one in the OpenAI platform at https://platform.openai.com/api-keys. We recommend creating a new API key for this PACE plugin.

If you are not using an enterprise OpenAI API key (i.e. a paid subscription), be careful with any (sensitive) data you share with OpenAI, as you are not opted out of having that data used for training.

File and directory setup

We provide an example setup in our GitHub repository, as explained below. If you already have a running instance of PACE, you may skip this setup and simply add the OpenAI API key to the PACE application configuration. See the config/application.yaml section below.

Clone the repository from GitHub, if you haven't already done so. This command assumes you're not using SSH, but feel free to do so.

git clone https://github.com/getstrm/pace.git

Now navigate to the data-policy-generator directory inside the pace repo:

cd pace/examples/data-policy-generator

Next, let's have a look at the files in this directory: docker-compose.yaml, config/application.yaml and openai-plugin.yaml.

docker-compose.yaml

The compose file defines two services:

  • pace_app with the ports for all different interfaces exposed to the host:

    • 9090 -> Envoy JSON / gRPC REST Transcoding proxy.

    • 50051 -> gRPC.

    • 8080 -> Spring Boot Actuator.

  • postgres_pace acts as the persistence layer in which PACE stores its Data Policies.

    • Available under localhost:5432 on your machine.

config/application.yaml

This is the Spring Boot application configuration, which specifies the PACE database connection and the OpenAI plugin settings (API key and model).

spring:
  datasource:
    url: jdbc:postgresql://postgres_pace:5432/pace
    hikari:
      username: pace
      password: pace
      schema: public

app:
  plugins:
    openai:
      enabled: true
      api-key: "put-your-api-key-here"
      # use a gpt-4 model to follow along with the documentation
      # if you don't have access, use gpt-3.5-turbo, but your results will be poorer
      model: "gpt-4-1106-preview"

Make sure to set a valid API key, which you can generate at https://platform.openai.com/api-keys.
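
A small pre-flight check can catch a forgotten key before the containers are started. The following is only a sketch and not part of PACE: the function name api_key_is_set is hypothetical, and it merely looks for the literal placeholder from the example configuration above. It does not verify that OpenAI accepts the key.

```shell
# Hypothetical helper (not part of PACE): succeeds only if the placeholder
# API key from the example configuration is no longer present in the file.
api_key_is_set() {
  config_file="${1:-config/application.yaml}"
  if [ ! -f "$config_file" ]; then
    echo "missing config file: $config_file" >&2
    return 2
  fi
  if grep -q 'put-your-api-key-here' "$config_file"; then
    echo "placeholder API key still present in $config_file" >&2
    return 1
  fi
  return 0
}

# Example:
#   api_key_is_set config/application.yaml && echo "key looks configured"
```

A return code of 1 means the placeholder is still there; 2 means the file could not be found at the given path.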

openai-plugin.yaml

This file contains the blueprint Data Policy and the textual instructions we'll use to generate a Rule Set using the OpenAI Data Policy Generator plugin. Feel free to modify it to your own liking.

Generating the Data Policy

Running PACE

Make sure your current working directory is the same as the directory you've set up in the previous section. Start the containers by running:

docker compose up

There should be quite a bit of logging, ending in the banner of the PACE app booting.
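
If you would rather run the containers in the background (docker compose up -d), you can poll until PACE responds instead of watching the logs. This is a minimal sketch under stated assumptions: retry is a hypothetical helper, and the health URL assumes that Spring Boot Actuator's default /actuator/health endpoint is enabled on the exposed port 8080, which this tutorial does not confirm.

```shell
# Hypothetical helper: run a command until it succeeds, trying at most $1
# times and sleeping $2 seconds between attempts.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$n" -ge "$attempts" ]; then
      break
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  echo "gave up after $attempts attempts: $*" >&2
  return 1
}

# Example (assumed endpoint, see the note above):
#   docker compose up -d
#   retry 30 2 curl -fsS -o /dev/null http://localhost:8080/actuator/health
```

Keeping the probe command as an argument means the same helper works if the health check turns out to live at a different path or port.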

Invoking the plugin

In the same directory, execute the following PACE CLI command:

pace invoke plugin openai GENERATE_DATA_POLICY --payload openai-plugin.yaml

This will take a little while (around 20 seconds during our testing). If OpenAI replied with a valid Data Policy, it will be printed to your terminal. The output should look similar to this:

metadata:
  description: Users of the data policy generator example
  title: generator.users
  version: 3
rule_sets:
  - field_transforms:
      - field:
          name_parts:
            - username
        transforms:
          - fixed:
              value: omitted
            principals:
              - group: administrators
          - identity: {}
            principals:
              - group: analytics
          - regexp:
              regexp: .*@(.*)
              replacement: $1
    filters:
      - retention_filter:
          conditions:
            - period:
                days: "30"
          field:
            name_parts:
              - date
      - generic_filter:
          conditions:
            - condition: "TRUE"
              principals:
                - group: administrators
            - condition: age > 18
              principals:
                - group: analytics
            - condition: email LIKE '%@google.com'
    target:
      ref: 
        integration_fqn: filtered_view
      type: SQL_VIEW
source:
  fields:
    - name_parts:
        - email
      required: true
      type: varchar
    - name_parts:
        - username
      required: true
      type: varchar
    - name_parts:
        - organization
      required: true
      type: varchar
    - name_parts:
        - age
      required: true
      type: int
    - name_parts:
        - date
      required: true
      type: timestamp
  ref: 
    integration_fqn: generator.users
    platform:
      id: data-policy-generator-sample-connection
      platform_type: POSTGRES
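
Because the generated policy is only printed, it can be handy to save it to a file and review it before using it. The sketch below rests on two assumptions that this tutorial does not confirm: that the CLI writes the Data Policy to standard output, and that group principals appear as "- group: <name>" lines, as in the sample above. generate_policy and list_groups are hypothetical helper names.

```shell
# Hypothetical wrapper: invoke the plugin and keep the result in a file.
generate_policy() {
  payload="${1:-openai-plugin.yaml}"
  output="${2:-generated-policy.yaml}"
  # Assumption: the generated Data Policy is written to standard output.
  if ! pace invoke plugin openai GENERATE_DATA_POLICY --payload "$payload" > "$output"; then
    echo "plugin invocation failed" >&2
    return 1
  fi
  # The sample output has a top-level rule_sets key; treat its absence as a failure.
  if ! grep -q '^rule_sets:' "$output"; then
    echo "no rule_sets section found in $output" >&2
    return 1
  fi
  echo "wrote $output"
}

# Hypothetical helper: print the distinct principal groups a policy file refers to.
list_groups() {
  sed -n 's/^ *- group: *//p' "$1" | sort -u
}
```

For the sample output above, running generate_policy openai-plugin.yaml generated-policy.yaml followed by list_groups generated-policy.yaml would list administrators and analytics, which gives a quick way to check that the generated rules only mention the groups named in your instructions.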

This is all still quite experimental, so not all instructions may work equally well. Let us know if you encounter any issues, and we will further explore this thing called GenAI!