DBmasker ANO syntax

Creating the ANO file


The DBmasker service takes one input: the ANO file, which contains the data schema information and the rules you write. In the following is the ANO syntax described in a simplified EBNF based railroad diagrams. Three symbols are used:

images/download/thumbnails/40534018/terminal.png

A terminal representing plain text written as given in the box.

images/download/thumbnails/40534018/definition.png

A box with small letters. A nonterminal representing a reference to another syntax diagam
with name as given in the box.

images/download/thumbnails/40534018/value.png

A box with only capital letters. A nonterminal representing parameters/values as defined by the name.
I.E. COLUMN is a database column name.

The model


The ANO file contains three sections:

  • database schema information

  • user defined classes

  • tasks and rules

model

images/download/attachments/40534018/model.png

The schema section


The schema section defines the database structure in a simplified syntax:

A sample schema section:

Sample schema section
table CUSTOMER
column integer CUSTOMERNO
column text NAME size 40
column text CREDITCARD
column text EMAIL
column text PASSWORD
column text PHONE
column integer CUSTODIAN
  primary-key CUSTOMERNO
 unique NAMEINDEX NAME
 
table INVOICE
column integer INVOICENO
column text INVOICETEXT
column decimal VALUE size 20 scale 2
column integer CUSTOMER_CUSTOMERNO
primary-key INVOICENO
 
foreign-key
INVOICE CUSTOMER_CUSTOMERNO
CUSTOMER CUSTOMERNO

The schema part is simplified SQL statements. You may convert SQL schema files to the ANO form by using the DBano service, which takes SQL as input and delivers an equivalent ANO file. The syntax is described as follows:

schema

images/download/attachments/40534018/schema.png

The TABLE and COLUMN non-terminals represent a table and a column name.

type

images/download/attachments/40534018/type.png

The TYPE non-terminal represents some type value not covered by standard types.

size

images/download/attachments/40534018/size.png

The column size is given by the size POSINT value and the scale on decimal numbers is given by the scale POSINT where POSINT is a positive integer.

pk

images/download/attachments/40534018/pk.png

unique

images/download/attachments/40534018/unique.png

INDEX-NAME is the name of the index.

fk

images/download/attachments/40534018/fk.png

The setnull terminal indicates that the association represented by the foreign key, may exist.

columns

images/download/attachments/40534018/columns.png

to/from

images/download/attachments/40534018/to.png

The user defined classes section


Some masking parts use functions to manipulate values and add rows to the database and you may create user defined specialized classes (All interfaces and abstract classes are explained in the javadoc):

Package

Description

conversions

Package containing custom conversion classes. Defined for randomized or masked columns using a column input source. Converts input strings into another string format or can convert string into another data type for randomized columns. Implements no.esito.anonymizer.IConversion interface.

Built in conversions used for converting string input to various other data types:

  • ParseDigits - remove all non-digits

  • String2Date - convert to java.time.LocalDate

  • String2DateTime - convert to java.time.LocalDateTime

  • String2Decimal - convert to Double

  • String2Integer - convert to Integer

  • String2Time - convert to java.time.LocalTime

distributions

Package containing custom distribution classes. Defined for dependent tables when creating data for tables. Used for determining the distribution of foreign keys between the parent and child tables. Implements no.esito.anonymizer.IDistribution interface. Distributions may have two parameter definitions:

  • parameter defined and connected to parent tables

  • parameter defined on distribution method and connected to parent groups/tables

Built in distributions:

  • AllCombinations - all combinations of parents are satified at least once

  • EvenWithDeviation -

  • SimpleSpread - foreign key columns are randomly assigned values from possible primary key values

transformations

Package containing custom transformation classes. Defined for masked columns. Used to transform column value before being written to the database. (See ReplaceDigits.java in AnonymizerHotel example). Implements no.esito.anonymizer.ITransformation interface.

Built in transformations:

  • Email - Translates space, hyphen and underscore resutling in a valid email address

  • CreditCard - Adjust last digit for MOD10 validation

The built-in classes have no.esito.anonymizer.conversions, no.esito.anonymizer.distributions and no.esito.anonymizer.transformations packages. All built-ins are implicitly defined.

Sample classes part:

User defined classes
conversion example.anonymizer.conversions.ParseDigits
transformation example.anonymizer.transformations.PostCodeGeneralization
distribution example.anonymizer.distributions.MinPerParent

userdefined

images/download/attachments/40534018/userdefined.png

The CLASS represents a user defined class with full java package specification.

The tasks and rules section


A Task is either a TaskGroup or a WorkTask. TaskGroups can have infinite child tasks of both kinds while WorkTasks only contain rules. A WorkTask is one of update, create, delete, erase or sar. Naming tasks follow these rules:

  • Group tasks names must be unique while work task names must be unique within a group task

  • Group tasks on the same level in the task hierarchy must have unique names

  • Work tasks on the same level in the task hierarchy must have unique names

  • Group tasks cannot have the same name as any of the work tasks it contains. If the work task does not have an explicit defined name it is the default name that is forbidden, i.e. the work task Delete HOTEL gets by default the name Delete_HOTEL

  • The name comparison is case insensitive to prevent file name collisions on case insensitive file systems

Running TaskGroups, will run all child tasks in defined sequence. Running WorkTasks, will run all tasks with same name regardless of which TaskGroup they belong and in the sequence they are defined.

We will often use the common phrase task both for TaskGroup and WorkTask.

A TaskGroup contains tasks which can be both TaskGroups and WorkTasks. The purpose is to group tasks as an entity.

A WorkTask is one of

  • update - mask and make data unrecognizable

  • delete - sub setting database to reduce size

  • create - create records for testing

  • erase - remove obsolete data

  • sar - create reports from database

A Rule defines anonymizations/masking on data and consists of

  • mask - anonymize data

  • randomize - add noise to numbers

  • shuffle - reorder data sequence

WorkTasks are detailed by Rules.

Through out of this documentation, we will be using the phrases taskgroup and worktask, or the common task.

The example below consists of two tasks: Anonymize (taskgroup) and update_CUSTOMER (worktask) and two masking rules for NAME and EMAIL.

Mask
// Pure Anonymizations
task Anonymize {
// Anonymize - Mask various Customer fields
update CUSTOMER
// Create random name from list of firstnames and lastnames
mask NAME
format "%s %s"
file src/main/resources/firstname.txt random-order
file src/main/resources/lastname.txt random-order
// Create email based on the newly created name
mask EMAIL
format %s@%s
transform Email // Use built-in transformation class
unique
column NAME
file src/main/resources/email.txt random-order
}

Creating Tasks

A task is one of

task

A combination of tasks, both taskgroups and worktasks.

update

Updates or Anonymizations are organized by table and performed on one or more of its columns and on columns from dependent tables (i.e. foreign keys). The table anonymizations are organized by task and the table anonymization node contains separate child nodes for each column that is to be anonymized.

create

In test and development environments it is often necessary to work with various test data for developing and testing different scenarios. With Create data for Table, you can insert rows into database tables with sample data and also create records for child tables that are linked together by foreign keys.

It is important to define the creation of data for parent tables before defining the creation of data for the child tables and then ensure that the relation via the foreign keys are defined. You can use the masking to assign data to the columns and if you are using an external file for the data source and you don't select the random order option, the tasks can be constructed so that the database contains the exact same data each time the tasks are run.

You can also specify the distribution of foreign keys used in child records by specifying one of the various distribution algorithms supplied or you can create a customized distribution for your own purposes. If a parent table is referenced by many child foreign keys, a distribution algorithm may be defined for a group of child tables.

delete

When creating a test or development environment, it is often necessary to perform a full copy of the production database in order to acquire a proper data set. This may result in a data set that is too large and unmanageable, often requiring an analysis of the data content and structure to delete data in order to shrink the database size. This is often called database subsetting.

When determining the records that need to be deleted, it is important to understand the structure so that the correct table data deletion sequence is followed. However, if the database has proper foreign key definitions, dbMasker can facilitate the process of establishing all dependent tables. Start with tables that have no parent tables and decide on which data in the tables to delete. All dependent data in dependent tables following the foreign key structure, will also be automatically marked for deletion. You may remove any dependency or manually add dependencies on other tables.

erase

The erase task allows you to combine anonymizations (masking) and deletions.This may be used to anonymize or remove obsolete data. It may also be used to support GDPR §17 The right to Erasure or Right to be forgotten in GDPR or other privacy regulations. It may also help with consent withdrawal related to GDPR §7 .

sar

A subject access request (SAR) is a concept in GDPR that describes requests by individuals to an organization to see copies of all information that relates to them. This includes any personal data and reasons for its use.The sar task allows you to output information stored about an individual into multiple source files supporting a JavaScript Object Notation (JSON) format and XML format. These source files can be used by various readers to provide the requested information. This gives the user a way to verify data correctness and if necessary request corrections or deletions of personal information.

task

images/download/attachments/40534018/task.png

The TASK represents a task name and sql represents SQL statements run before and after the task defined operations.

The update, create, delete, erase and sar are all worktasks and may be run as a task with the implicit name given by the task type and the TABLE name as in "update_CUSTOMER". A NAME may be given and it will override the implicit name.

sql

images/download/attachments/40534018/sql.png

The SQL represents complete SQL statements. The statement must be written inside double quotes ("statement”) and any occurrences of double quotes or backslash must be escaped with a backslash: \" and \\. There will be no check of SQL syntax and a statement cannot be divided on serveral lines. Multiple SQL statements may be written divided by newline .

update

images/download/attachments/40534018/update.png

Used for anonymizing/updating one or more columns for a specified table. Additional anonymization rules based on columns of the selected table may be added. See Creating masking/anonymization rules.

update ARTIST sql-before "delete from ARTIST where NAME is null" selection-key NAME
or
update ARTIST myTaskName sql-before "delete from ARTIST where NAME is null" selection-key NAME

The first statement has no task name specification, giving the taskname update_ARTIST. The second sample specifies a task name, giving it the task name myTaskName.

create

images/download/attachments/40534018/create.png

In test and development environments it is often necessary to work with test data for developing and testing different scenarios. With create TABLE, you can insert rows into database tables with sample data and also create records for child tables that are linked together by foreign keys.

It is important to define the creation of data for parent tables before defining the creation of data for child tables and then ensure that the relation via the foreign keys are defined. Additional anonymization mask rules to assign data to the columns can be added and if you are using an external file for the data source and you don't select the random order option, the tasks can be constructed so that the database contains the exact same data each time the tasks are run.

You can also specify the distribution of foreign keys used in child records by specifying one of the various distribution algorithms supplied or you can create a customized distribution for your own purposes. If a parent table is referenced by many child foreign keys, a distribution algorithm may be defined for a group of child tables.

The minimum-rows POSINT value ensures that the table will have at least minimum-rows records inserted.

distribute

images/download/attachments/40534018/distribute.png

The DISTRIBUTION denotes an algorithm that ensures consistency between parent and child tables via foreign keys. The class must implement the IDistribution interface.

The DISTRIBUTION class can be one of the built-in classes or defined in the user defined distribution classes section. The sample below uses the built-in class AllCombinations. It ensures that all possible combinations of ROOM rows connected to HOTEL and ROOMCATEGORY are created.

create ROOM minimum-rows 50
mask ROOMNO
format %d
random-integer 101 399
mask BALCONY
format %d
random-integer 0 1
mask ID
format %d
unique
sequence -1 1
// Every hotel should have a room of each category
distribute AllCombinations "" //Use built-in distribution class
table ROOMCATEGORY ""
child CATEGORY_ID
parent ID
table HOTEL ""
child HOTEL_ID
parent ID

delete

images/download/attachments/40534018/delete.png

Used for deleting a sub-set of data from a specified table and records from referenced tables. Cascading deletes of dependent tables is supported and shown in the sample code below.

The 3 cascading delete methods have different performance of deleting a hierarchy of records. The characteristics are:

  • cascading

    • Children are deleted first, and the parent last

    • Generally works without disabling constraints

    • Slow performance (one by one)

  • not-exists

    • Parent is deleted first

      • DELETE FROM parent WHERE XX

    • Thereafter all orphans are deleted using

      • DELETE FROM child WHERE NOT EXISTS (SELECT * FROM parent WHERE parent.id = child.parent_id)

    • Fastest, but constraints must be disabled

  • not-in

    • Parent is deleted first

    • Thereafter all orphans are deleted using

      • WHERE child.parent_id NOT IN (SELECT DISTINCT parent.id FROM parent)

    • Fast, but constraints must be disabled

cascade

images/download/attachments/40534018/cascade.png

delete HOTELCHAIN where "ID = 0" {
cascade HOTEL parent ID child CHAIN_ID {
cascade BOOKING parent ID child HOTEL_ID {
cascade STAY parent ID child BOOKING_ID {
cascade NIGHT parent SERIALNO, CUSTOMER_CUSTOMERNO
child STAY_SERIALNO, STAY_CUSTOMER_CUSTOMERNO
}
}
}
}

erase

images/download/attachments/40534018/erase.png

Used for erasing records in a table by anonymizing specified columns and deleting records from it's referenced tables, or masking specified columns in the referenced tables. It contains a parameterized condition bases on the selected tables primary key.

eraseTable

images/download/attachments/40534018/eraseTable.png

erase CUSTOMER
where "CUSTOMERNO = %PARAMETER%"
mask NAME format "firstname lastname"
mask EMAIL format epost@email.com
{
// Anonymize identifiable columns
cascade ADDRESS
parent CUSTOMERNO
child CUSTOMER_CUSTOMERNO
mask HOMEADDRESS format "Home address"
mask POSTALCODE format 1234
}

simpleMask

images/download/attachments/40534018/simpleMask.png

sar

images/download/attachments/40534018/sar.png

Used for creating data for Subject Access Requests reports. The exported data comes in the form of a xml or json file. It contains a parameterized condition bases on the selected tables primary key.

sarTable

images/download/attachments/40534018/sarTable.png

sar CUSTOMER
where "CUSTOMERNO = %PARAMETER%"
mask CREDITCARD
mask CUSTOMERNO
mask EMAIL
mask NAME
{
cascade STAY parent CUSTOMERNO child CUSTOMER_CUSTOMERNO
mask SERIALNO
mask BOOKING_ID
mask INVOICE_INVOICENO
mask LOCATION_ID
{
cascade NIGHT parent SERIALNO, CUSTOMER_CUSTOMERNO
child STAY_SERIALNO, STAY_CUSTOMER_CUSTOMERNO
mask ID
mask DATE
mask PRICE
mask ROOM_ID
}
cascade BOOKING parent CUSTOMERNO child CUSTOMER_CUSTOMERNO
mask FROMDATE
mask TODATE
mask ID
mask HOTEL_ID
}

selectionKey

images/download/thumbnails/40534018/selectionKey.png

Possible selection keys are the primary key and unique indexes. The default selection key is the primary key defined for the table.

where

images/download/thumbnails/40534018/where.png

The EXPRESSION represents a logical expression. The expression must be written inside double quotes ("expression”) and any occurrences of double quotes or backslash must be escaped with a backslash: \" and \\.

The worktasks update, delete, erase and sar can all have a where clause defined to filter the data rows.

The worktasks delete, erase and sar can take values entered in a command or the API. This is defined in the where clause with the %PARAMETERn% notation. Note: You can use %PARAMETER% instead of %PARAMETER1%.

%PARAMETER% use
erase CUSTOMER
where "CUSTOMERNO = %PARAMETER%"
mask NAME format "firstname lastname"
mask EMAIL format epost@email.com
 
Command use: erase customer 1000234
%PARAMETERn% use
erase COUNTRY
where City = %PARAMETER1% AND country_id = (SELECT country_id FROM country WHERE country = %PARAMETER2%)
 
Command use: erase country Paris France

child

images/download/thumbnails/40534018/child.png

parent

images/download/thumbnails/40534018/parent.png

Creating masking/anonymization rules

Anonymizations are organized by table and performed on one or more of its columns and on columns from dependent tables (i.e. foreign keys). The table anonymizations are organized by worktasks and the table anonymization statement contains separate statements for each column that is to be anonymized.

An anonymization rule is one of

mask

The mask function masks the values of columns by replacing all or part of the data with generated values taken from various input sources. A mask may have several parameters based on one or more of the following:

column

The column input source is used to insert the specified column value into the masked parent column. You can only specify a column from the mask table.

file

The file input source is used to insert a string value taken from the file defined in the file property. It picks a line from the file sequentially or randomly depending on the random-order property.

random-[type]

The random-[type] input source is used to insert a value drawn from a random sequence as defined by the from and to properties. The type is one of integer, decimal, time, date, datetime and is used as in random-integer.

sequence

The sequence input source is used to insert a sequence of integers into the column records starting with the start property value and incremented by the increment property value.

randomize

The randomize function is used to apply noise on numeric and date column values. The algorithm is useful for hiding transactional data, whereby it maintains approximate values but changes it significantly so as to make it hard to recognize.

shuffle

The shuffle function reorders the column values in a random order and ensures that no record retains the original column value.

anonymization

images/download/attachments/40534018/anonymization.png

The anonymization may get values from a mapping file, produce mapping to a file or both. A temporary value may be defined for columns that are primary keys or unique columns. It allows you to define a temporary key that is used to temporarily set as the value before the system reads through all records.

map

images/download/attachments/40534018/map.png

The map input function uses a mapping file containing key/value pairs (Key=Value) to update columns using the mapped value. The system reads the column value and checks if the mapping file has a corresponding key. If there is a key found, the column value is replaced with the mapped value. The mapping file is typically produced from another anonymizing rule using map output and could have originated in another database. With this function, it is possible to store the mapping values from one database, and apply those values to similar columns in another database. This ensures that both anonymized databases contain the same values.

The map output function produces a mapping file containing key/value pairs (Key=Value) from the anonymization rule.

The map input-output function uses the mapping file and produces the mapping file with key/value pairs (Key=Value), if there a new or missing values in the map file.

FILE is the file name and the content may be encrypted.

// Update primary key
update COMPANY
// Reuse masking from another database
mask NAME map company_name.txt input

propagate

images/download/attachments/40534018/propagate.png

// Update primary key
update INVOICE
//Save the real and masked values on a file with encryption
mask INVOICENO map "invoicemask.txt" output encrypted
format %d
unique
sequence -1 1
propagate STAY.INVOICE_INVOICENO //Update foreign key also

Use propagate to update foreign keys with the same value as the primary key mask and use map to update similar values in another database.

mask

images/download/attachments/40534018/mask.png

The mask function masks the values of columns by replacing all or part of the data with generated values taken from various input sources.

The FORMAT denotes the masking result and may be a combination of literal string values and format strings using java.util.Formatter. Use "" if there are spaces.

Format

Input

Description

%s

String

Use with File and Column input

%d

Integer

Use with Sequence and Random of Integer type

%f

Decimal

Use with Randomization and Random of Decimal type

%tF

Date

ISO 8601 formatted date "yyyy-MM-dd"

%tT

Time

24-hour time format "HH:mm:ss"

%1$tF %1$tT

DateTime

Database Timestamp format " yyyy-MM-dd HH:mm:ss "

Without any format specifications the value is regarded as a constant. The text constant NULL is regarded as SQL NULL. Each format parameter must be mached with one of sourceRandom, sourceFile, sourceColumn or sourceSequence.

See more examples at https://dzone.com/articles/java-string-format-examples.

The transform TRANSFORMATION class is one of the classes defined in the user defined transformation classes section. The transform() method takes String as input and delivers a String object. The class must implement the ITransformation interface. See Email and CreditCard as samples below.

sourceSequence

images/download/attachments/40534018/sourceSequence.png

The sequence input source is used to insert a sequence of integers into the column starting with the first INTEGER value and incremented by the second INTEGER value. If -1 is selected, the sequence starts with the highest value stored in the column plus the increment.

sourceColumn

images/download/attachments/40534018/sourceColumn.png

The column input source is used to insert the specified column value into the masked parent column. You can only specify a column from the same table. If the column being anonymized is not a string, you can use one of the conversions defined to convert the data type. See convert.

sourceFile

images/download/attachments/40534018/file.png

The file input source is used to insert a string value taken from the file defined in the FILE terminal. It picks a line from the file sequentially or randomly depending on the random-order property. A conversion class can be selected to manipulate the string selected from the text file. See convert.

sourceRandom

images/download/thumbnails/40534018/sourceRandom.png

randomInteger

images/download/attachments/40534018/randomInteger.png

randomDecimal

images/download/attachments/40534018/randomDecimal.png

randomTime

images/download/attachments/40534018/randomTime.png

randomDate

images/download/attachments/40534018/randomDate.png

randomDateTime

images/download/attachments/40534018/randomDateTime.png

All the random-[type] input sources is used to insert a value drawn from a random sequence as defined by the first from value and the second to value.

The sample below shows anonymizations of a CUSTOMER table using all mask variants.

update CUSTOMER
// Create random norwegian phone number
mask PHONE
format "+47 %d"
random-integer 10001000 99909990
// Create random name from list of firstnames and lastnames
mask NAME
format "%s %s"
file src/main/resources/firstname.txt random-order
file src/main/resources/lastname.txt random-order
// Create email based on the newly created name
mask EMAIL
format %s@%s
transform Email
unique
column NAME
file src/main/resources/email.txt random-order
// Create random creditcard with checksum that validates
mask CREDITCARD
format 41428340%d
transform CreditCard //Use the creditcard transformation class
random-integer 10001000 99919991

randomize

images/download/attachments/40534018/randomize.png

The randomize function is used to apply noise on numeric and date column values. The algorithm is useful for hiding transactional data, whereby it maintains approximate values but changes it significantly to make it hard to recognize. Supported types are Integer, Decimal, Date, Time and DateTime.

random

images/download/attachments/40534018/random.png

All input to the randomize function is read as a string. You must therefore specify a function to convert from the String type to the input type, defined by the type property:

Type

Function

Return type

Expected format on string input

Date

String2Date

java.time.LocalDate

yyyy-MM-dd

DateTime

String2DateTime

java.time.LocalDateTime

yyyy-MM-dd HH:mm:ss

Time

String2Time

java.time.Time

HH:mm:ss

Integer

String2Integer

java.lang.Integer

Decimal integer with or without sign

Decimal

String2Decimal

java.lang.Double

Decimal number as accepted by the valueOf method in class Double

Anonymization for types Integer, Decimal:

FORMAT

Printf format for the selected type: %d or %f

Noise addition

Grouping of various attributes that adds noise. Gaussian formula:
gaussion-distribution * (flat-noise + column value * percentage-noise) + offset

flat-noise

Adds a +/- Gaussian distributed noise with a deviation

percentage-noise

Adds a +/- Gaussian distributed noise with the magnitude as a percentage of the original value

offset

Adds a fixed offset to the value.

Anonymization for types Date, DateTime

FORMAT

Printf format for the selected type: %tF or '%1$tF %1$tT'

Noise addition

Grouping of various attributes that adds noise. Gaussian formula: gaussian-distribution * flat-noise + offset

flat-noise (days)

Adds a +/- Gaussian distributed noise with a deviation

offset (days)

Adds a fixed number of days to the date value

Anonymization for Type Time

FORMAT

Printf format for the selected type: %tT

Noise addition

Grouping of various attributes that adds noise. Gaussian formula: gaussian-distribution * flat-noise + offset

flat-noise (seconds)

Adds a +/- Gaussian distributed noise with a deviation

offset (seconds)

Adds a fixed number of seconds to the date value

convert

images/download/thumbnails/40534018/convert.png

The convert CONVERSION class is one of the built-in classes or defined in the user defined conversion classes section. The convert() method takes String as input and delivers an Object. The class must implement the IConversion interface. See String2Decimal sample below.

update ROOMCATEGORY
// Add 1% gaussian noise to hide the value from search
randomize INITIALPRICE decimal
format %.2f
convert String2Decimal
offset 0.0
flat-noise 0.0
percentage-noise 1.0

shuffle

images/download/attachments/40534018/shuffle.png

The shuffle function reorders the column values in a random order and ensures that no record retains the original column value.

update HOTEL
// Output results to an encrypted file
shuffle NAME
map hotelmap.txt output encrypted //Write the mapping to a file, use encryption