Field Transform

Transform values on a field level

Introduction

To be able to transform the input data on a field-level, we define Field Transforms. Specific values can be transformed, masked or completely nullified. If no transform is defined for a field, the field will be included in the resulting view as-is.

This is useful if data needs to be presented differently for different groups of data users. That is, for fraud detection you can usually access a lot more data than you can for analysis or marketing purposes.

The Field Transform consists of a Field and a list of Transforms. The Field is a reference to the corresponding field in the source fields.

In order to be able to change the on-view results for different users with different roles, each entry in the list of transforms specifies how to transform the specific field for a set of Principals. Principals can be any set of users as defined in the processing platform for the target view. Every Field Transform for one Field should at least contain one Transform without any Principal, defined as last item in the list. This Transform acts as the default or fallback transform for the specified field.

Transforms

1. Regex

The syntax for the replacement string also follows the target platform's syntax, with one important difference: capturing group backreferences always use the dollar notation. In other words, you can use dollar-escaped digits (e.g. $1 or $2) within the replacement string to insert text matching the corresponding parenthesized group in the regexp pattern. Use $0 to refer to the entire matching text. It is quite common for platforms to limit this to 9 subgroups (i.e. $1-$9).

The below example replaces the local part of an email address with **** but preserves the domain part using a capturing group backreference.

regexp:
  regexp: "^.*(@.*)$"
  replacement: "****$1"
before
after

local-part@domain.com

****@domain.com

This transform is not available for Azure Synapse Analytics, as Regular Expression (Replacements) are not supported natively in Synapse

2. Identity

Return the original value.

identity: {}
before
after

local-part@domain.com

local-part@domain.com

3. Fixed Value

Replace a field with a fixed value

fixed:  
  value: "****"
before
after

local-part@domain.com

****

4. Hash

Hash a field using an optional seed. The hashing algorithm depends on the processing platform of your choice.

hash:
  seed: "1234"
before
after

local-part@domain.com

-1230500920091472191

5. SQL Statement

Execute a SQL statement to transform the field value. The exact syntax is platform-specific.

sql_statement:
  statement: "CASE WHEN brand = 'MacBook' THEN 'Apple' ELSE 'Other' END"
before
after

MacBook

Apple

Lenovo

Other

HP

Other

Acer

Other

We can use User Defined Functions, using SQL or other languages like Python (depending on the processing platform). We explore a trivial Python UDF on the Databricks platform in this tutorials page.

6. Nullify

Make the field value null.

nullify: {}
before
after

local-part@domain.com

null

7. Detokenize

If you have tokenized data and the processing platform principal can query the token table, you can detokenize the data.

detokenize:
  token_source_ref: my_schema.token_table
  token_field:
    name_parts:
      - token
  value_field:
    name_parts:
      - value
before
after

TOKEN-123

5425233430109903

While multiple detokenize transforms are allowed in a single rule set, each must use a different token source.

8. Numeric rounding

For numeric fields, the option exists to round the data. The following rounding options exist:

  1. Ceil: Apply the ceiling function to the data.

  2. Floor: Apply the floor function to the data.

  3. Round: Apply logical rounding to the data.

numeric_rounding:
  floor:
    divisor: 20
numeric_rounding:
  ceil:
    divisor: 20
numeric_rounding:
  round:
    precision: 0

Examples of rounding:

Rounding type
Parameters
Input
Output

Ceil

Divisor = 10

33

40

Ceil

Divisor = 0.1

33.3

34

Ceil

Divisor = -10

33

Floor

Divisor = 10

33

30

Floor

Divisor = 0.1

33.3

33

Floor

Divisor = -10

33

Round

Precision = 1

33.345

33.3

Round

Precision = 0

33.345

33

Round

Precision = -

33.345

30

Round

Precision = -

33.345

0

Note that ceiling and floor functions with negative numbers may behave differently based on the processing platform.

9. Aggregation

Some data contains sensitive information, but the aggregation might not. Aggregate a column as the min, max, average or sum, partitioning by zero or more fields. For the average, optionally specify the precision and the output type.

avg

aggregation:
  partition_by:
    - name_parts: [ country ]
  avg:
    precision: 0
    cast_to: "int4"

sum, min, max

aggregation:
  sum: {}

The example below considers the avg aggregation from above.

userId
country
salary

1

UK

65124

2

UK

83412

3

Netherlands

42256

4

Netherlands

39964

5

UK

46532

For a more extensive example of the aggregations see the Tutorials.

Example Field Transform

Below you will find an example of a set of Field Transforms. Note that for each set of Transforms the last one always is without defined principals.

Note that the order of the Field Transform in the policy matters. That is, if you are a member of multiple Principal groups, for each Field Transform, the transform with the first intersection with your Principal groups will be applied.

In this example, we want to transform the userID for three groups: Marketing, Fraud and Risk, everyone else. For Marketing, we nullify the the ID. For Fraud and Risk, we need to retain IDs and do not touch them. For everyone else, we hash the email so they can still be used as keys, but remain unidentified. By default, card numbers are tokenized, but for Fraud and Risk they are detokenized, i.e. replaced with their actual values.

field_transforms:
  - field:
      name_parts: [ userid ]
      type: "string"
      required: true
    transforms:
      - principals:
        - group: "F&R"
        identity: {}
      - principals:
        - group: "MKTNG"
        nullify: {}
      - principals: []
        hash:
          seed: "1234"
  - field:
      name_parts: [ card_number ]
      type: "string"
      required: true
    transforms:
      - principals:
        - group: "F&R"
      detokenize:
        token_source_ref: my_schema.token_table
        token_field:
          name_parts:
            - token
        value_field:
          name_parts:
            - value
      - principals: []
        identity: {}
  - field:
      name_parts: [ email ]
      type: "string"
      required: true
    transforms:
      - principals:
        - group: "MKTNG"
        - group: "COMP"
        regexp:
          regexp: "^.*(@.*)$"
          replacement: "****$1"
      - principals:
        - group: "F&R"
        identity: {}
      - principals: []
        fixed:
          value: "****"
  - field:
      name_parts:
        - age
    transforms:
      - principals:
        - group: "MKTNG"
        identity: { }
      - principals:
        - group: "COMP"
        numeric_rounding:
          floor:
            divisor: 20
      - principals: [ ]
        numeric_rounding:
          round:
            precision: -1
  - field:
      name_parts: [ brand ]
      type: "string"
      required: true
    transforms:
      - principals: []
        sql_statement:
          statement: "CASE WHEN brand = 'MacBook' THEN 'Apple' ELSE 'Other' END"

Example Results

Below you will find raw data and sample outputs for different (sets of) principals.

userId
card_number
email
age
brand

123

TOKEN_123

alice@store.com

33

MacBook

456

TOKEN_456

bob@company.com

26

Lenovo

789

TOKEN_789

carol@domain.com

11

HP

userId
card_number
email
age
brand

123

5425233430109903

alice@store.com

30

Apple

456

2223000048410010

bob@company.com

30

Other

789

2222420000001113

carol@domain.com

10

Other

Last updated