Field Transform
Transform values on a field level
Last updated
Transform values on a field level
Last updated
To be able to transform the input data on a field-level, we define Field Transforms
. Specific values can be transformed, masked or completely nullified. If no transform is defined for a field, the field will be included in the resulting view as-is.
This is useful if data needs to be presented differently for different groups of data users. That is, for fraud detection you can usually access a lot more data than you can for analysis or marketing purposes.
The Field Transform
consists of a Field
and a list of Transforms
. The Field
is a reference to the corresponding field in the source fields.
In order to be able to change the on-view results for different users with different roles, each entry in the list of transforms specifies how to transform the specific field for a set of Principals
. Principals can be any set of users as defined in the processing platform for the target view. Every Field Transform
for one Field
should at least contain one Transform
without any Principal
, defined as last item in the list. This Transform
acts as the default or fallback transform for the specified field.
The syntax for the replacement string also follows the target platform's syntax, with one important difference: capturing group backreferences always use the dollar notation. In other words, you can use dollar-escaped digits (e.g. $1
or $2
) within the replacement
string to insert text matching the corresponding parenthesized group in the regexp
pattern. Use $0
to refer to the entire matching text. It is quite common for platforms to limit this to 9 subgroups (i.e. $1
-$9
).
The below example replaces the local part of an email address with ****
but preserves the domain part using a capturing group backreference.
before | after |
---|---|
This transform is not available for Azure Synapse Analytics, as Regular Expression (Replacements) are not supported natively in Synapse
Return the original value.
Replace a field with a fixed value
Hash a field using an optional seed. The hashing algorithm depends on the processing platform of your choice.
Execute a SQL statement to transform the field value. The exact syntax is platform-specific.
We can use User Defined Functions, using SQL or other languages like Python (depending on the processing platform). We explore a trivial Python UDF on the Databricks platform in this tutorials page.
Make the field value null.
If you have tokenized data and the processing platform principal can query the token table, you can detokenize the data.
While multiple detokenize transforms are allowed in a single rule set, each must use a different token source.
For numeric fields, the option exists to round the data. The following rounding options exist:
Ceil: Apply the ceiling function to the data.
Floor: Apply the floor function to the data.
Round: Apply logical rounding to the data.
Examples of rounding:
Note that ceiling and floor functions with negative numbers may behave differently based on the processing platform.
Some data contains sensitive information, but the aggregation might not. Aggregate a column as the min
, max
, average
or sum
, partitioning by zero or more fields. For the average, optionally specify the precision and the output type.
avg
sum
, min
, max
The example below considers the avg
aggregation from above.
For a more extensive example of the aggregations see the Tutorials.
Below you will find an example of a set of Field Transforms
. Note that for each set of Transforms
the last one always is without defined principals.
Note that the order of the Field Transform
in the policy matters. That is, if you are a member of multiple Principal
groups, for each Field Transform
, the transform with the first intersection with your Principal
groups will be applied.
In this example, we want to transform the userID for three groups: Marketing, Fraud and Risk, everyone else. For Marketing, we nullify the the ID. For Fraud and Risk, we need to retain IDs and do not touch them. For everyone else, we hash the email so they can still be used as keys, but remain unidentified. By default, card numbers are tokenized, but for Fraud and Risk they are detokenized, i.e. replaced with their actual values.
Below you will find raw data and sample outputs for different (sets of) principals.
before | after |
---|---|
before | after |
---|---|
before | after |
---|---|
before | after |
---|---|
before | after |
---|---|
before | after |
---|---|
Rounding type | Parameters | Input | Output |
---|---|---|---|
userId | country | salary |
---|---|---|
userId | country | salary |
---|---|---|
userId | card_number | age | brand | |
---|---|---|---|---|
token | value |
---|---|
userId | card_number | age | brand | |
---|---|---|---|---|
userId | card_number | age | brand | |
---|---|---|---|---|
userId | card_number | age | brand | |
---|---|---|---|---|
userId | card_number | age | brand | |
---|---|---|---|---|