Perturbation Algorithms

No Perturbation

If you do not want to apply any perturbation, you should use a pipeline based on the sql_tbe_pipeline, but without the TBE perturbation transform.

TBE Perturbation

Thompson, Broadfoot and Elazar described a perturbation algorithm in the paper "Methodology for the Automatic Confidentialisation of Statistical Outputs from Remote Servers at the Australian Bureau of Statistics" (UNECE Work Session on Statistical Data Confidentiality). We call it the "TBE" perturbation algorithm.

In order to use this algorithm with a SQL database, your settings file must include sql_tbe_pipeline.

An example of the dataset configuration required to use the TBE perturbation algorithm is:

  "transform_definitions": {
    "tbe_perturb": {
      "parameters": {
        "record_key_name": "record_key",
        "value_tiebreaker_name": "unit_id",
        "bigN": 1766976779,
        "m": [0.5, 0.3, 0.15, 0.05],
        "p_filename": "pmatrix1",
        "s_filename": "smatrix1",
        "smallC": 5,
        "smallN": 8
      }
    }
  }

The API looks for the perturbation matrix files named by p_filename and s_filename in the directory given by the path parameter. The path parameter can be specified here as well, but is usually placed in the global settings file. The files must have the extension .csv.

The field named by record_key_name (the record key) must be an integer field.

The above paper gives a full explanation of the parameters; a short description of some of them follows.

Deviations from the paper

One adjustment has been made to the published algorithm: if the perturbed count is zero, Protari suppresses the mean and sum. (As written, the paper would show the mean even if the perturbed count is zero.)

Additionally, the API rounds off weighted counts to the nearest integer.
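
The two adjustments above can be sketched in Python as follows. This is a minimal illustration, not Protari's implementation: the function names and the returned dict layout are hypothetical, and the tie-breaking rule used for rounding (halves round up) is an assumption, since the text only says "nearest integer".

```python
import math
from typing import Optional


def round_weighted_count(weighted_count: float) -> int:
    """Round a non-negative weighted count to the nearest integer.

    Assumption: ties round up (2.5 -> 3); the tie rule is not specified.
    """
    return math.floor(weighted_count + 0.5)


def suppress_if_empty(perturbed_count: int,
                      perturbed_sum: Optional[float],
                      perturbed_mean: Optional[float]) -> dict:
    """Hide the sum and mean whenever the perturbed count is zero."""
    if perturbed_count == 0:
        return {"count": 0, "sum": None, "mean": None}
    return {"count": perturbed_count,
            "sum": perturbed_sum,
            "mean": perturbed_mean}


print(suppress_if_empty(0, 12.0, 3.0))
print(suppress_if_empty(4, 12.0, 3.0))
```

The first call reports a zero count with no sum or mean; the second passes all three values through unchanged.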

Means and sums

At a high level, the algorithm perturbs means and sums as follows:

Clearly the scale of the m-vector is not independent of the scale of the noise used.

For the sample datasets in protari-sample, the noise values lie between -1 and 1, and m = [0.5, 0.3, 0.15, 0.05], which sums to 1.

Scope-based perturbation

Protari implements a scope-based perturbation based on Section 12 of the TBE paper, with some modifications and extensions.

To use it, include the filename containing your primitive scope key declarations in your dataset configuration file, eg:

      "parameters": {
        ...
        "primitive_scope_key_declarations_filename": "psk_declarations"
      }

Protari adds a .csv extension to this name (do not include the extension in the config) and looks for the file in the tbe_perturb.parameters.path directory defined in your settings file, ie. the same directory where the p matrix and s matrix are stored.

This file must be a CSV file without a header, with each row containing the field name, the value (or range) name and the primitive scope key. The file must provide a primitive scope key for every value (or component range) of every field in the dataset, including sentinel values. For this reason, no field can have allow_any_value set to true when using scope-based perturbation. An example snippet (without longitudinal values) is:

BEDROOMS,<0,465455415
BEDROOMS,0,195061038
BEDROOMS,1,851514097
BEDROOMS,2,1272994473
BEDROOMS,3,862498105
BEDROOMS,4,458228099
BEDROOMS,5,1554797114
BEDROOMS,6,513897192
BEDROOMS,7,1111749691
BEDROOMS,8,1091567262
BEDROOMS,,457121410
DWELLING,001,183810198
DWELLING,002,1197121074
DWELLING,003,488711548
DWELLING,999,1664310739
PET,dog,782519905
PET,cat,1609262920
PET,fish,1336253152
PET,,1572894361
AMOUNT,<0,992579458
AMOUNT,0-19.99,1170927592
AMOUNT,20-39.99,561902045
AMOUNT,40-59.99,1280419541
AMOUNT,60-79.99,118733751
AMOUNT,80-100.00,507714107
AMOUNT,>=100.01,668653844

As described in the paper, the sum of the primitive scope keys for a field must be 1 modulo big N.
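
A declarations file can be checked against this rule before it is deployed. The following is a minimal sketch, not part of Protari: the function name check_primitive_scope_keys is hypothetical, the rows are a subset of the sample declarations above, and big N is taken from the bigN value in the example configuration.

```python
import csv
import io
from collections import defaultdict


def check_primitive_scope_keys(csv_text: str, big_n: int) -> dict:
    """Return {field: (sum of its primitive scope keys) mod big_n}.

    Each row is: field name, value (or range) name, primitive scope key.
    """
    totals = defaultdict(int)
    for field, _value, key in csv.reader(io.StringIO(csv_text)):
        totals[field] += int(key)
    return {field: total % big_n for field, total in totals.items()}


BIG_N = 1766976779  # bigN from the example configuration

DECLARATIONS = """\
DWELLING,001,183810198
DWELLING,002,1197121074
DWELLING,003,488711548
DWELLING,999,1664310739
PET,dog,782519905
PET,cat,1609262920
PET,fish,1336253152
PET,,1572894361
"""

residues = check_primitive_scope_keys(DECLARATIONS, BIG_N)
invalid = [field for field, residue in residues.items() if residue != 1]
print(residues)  # a valid file maps every field to 1
print("invalid fields:", invalid)
```

For these rows both fields give a residue of 1, so the list of invalid fields is empty.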

A scope key is calculated for each field that appears in a condition or group-by (including custom group-bys) as the sum of the primitive scope keys of the selected values, eg. with where PET=cat;dog, the primitive scope keys for cat and dog are added. The resulting scope keys are then multiplied across fields.

Eg. a query with where=PET=cat;dog,DWELLING=001;999 and group_by=BEDROOMS will have the output rows:

BEDROOMS,perturbed count
<0,10
0,15
1,44
...

With the primitive scope keys above, these rows are perturbed with scope keys of:

(1609262920 + 782519905) * (183810198 + 1664310739) * 465455415
(1609262920 + 782519905) * (183810198 + 1664310739) * 195061038
(1609262920 + 782519905) * (183810198 + 1664310739) * 851514097

respectively.
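
The same calculation can be sketched in Python. This is an illustration only: the scope_key function and the dictionary layout are hypothetical, and the keys are copied from the sample declarations above. The product is left unreduced, as in the expressions above.

```python
from math import prod

# Primitive scope keys copied from the sample declarations.
PRIMITIVE_SCOPE_KEYS = {
    ("PET", "cat"): 1609262920,
    ("PET", "dog"): 782519905,
    ("DWELLING", "001"): 183810198,
    ("DWELLING", "999"): 1664310739,
    ("BEDROOMS", "<0"): 465455415,
    ("BEDROOMS", "0"): 195061038,
    ("BEDROOMS", "1"): 851514097,
}


def scope_key(selection: dict) -> int:
    """Sum the primitive scope keys within each field, then multiply across fields.

    `selection` maps a field name to the list of values selected for it.
    """
    return prod(
        sum(PRIMITIVE_SCOPE_KEYS[(field, value)] for value in values)
        for field, values in selection.items()
    )


where = {"PET": ["cat", "dog"], "DWELLING": ["001", "999"]}
for bedrooms in ["<0", "0", "1"]:
    # Each output row adds its own group-by value to the shared conditions.
    row_selection = {**where, "BEDROOMS": [bedrooms]}
    print(bedrooms, scope_key(row_selection))
```

Each printed value equals the corresponding product written out above.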

Deviations from the paper are described below.

1. Total key calculation

In equation (29), Protari calculates the total key as:

total key = (cell key + scope key - 1) mod (big N)

In particular, the modulus is taken after the addition (in agreement with how (29) is used); and the "scope key adjustment" mentioned in the paper is always taken to be 1. An adjustment of 1 means the grand total count of all records is the same regardless of whether scope-based perturbation is applied.
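
The effect of the adjustment can be checked with a short Python sketch. The total_key function is a direct transcription of the formula above; the cell key value is hypothetical and chosen only for illustration, and big N is taken from the bigN value in the example configuration.

```python
BIG_N = 1766976779  # bigN from the example configuration


def total_key(cell_key: int, scope_key: int, big_n: int = BIG_N) -> int:
    """Combine the cell key and scope key; the modulus is taken after the addition."""
    return (cell_key + scope_key - 1) % big_n


# Selecting every value of PET gives a scope key congruent to 1 modulo big N,
# so the total key reduces to the cell key.
all_pets = 782519905 + 1609262920 + 1336253152 + 1572894361
cell_key = 123456789  # hypothetical cell key, for illustration only

print(all_pets % BIG_N)               # 1
print(total_key(cell_key, all_pets))  # 123456789
```

Because the primitive scope keys of a field sum to 1 modulo big N, a query that does not restrict the field leaves the total key equal to the cell key.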

2. No "continuous" scope keys

Because numeric data fields can only be queried in well-defined ranges, continuous scope keys (as described in section 12.2) are not required in Protari. Instead, primitive scope keys are defined for each of the minimal allowed ranges on numeric fields (as well as any sentinel values) – see the sample declarations for AMOUNT and BEDROOMS above.

3. Additive scope perturbation (optional)

Protari also implements an alternative algorithm which adds the scope-based perturbation to the record-key-based perturbation. The second perturbation is drawn from the p-matrix column calculated in the same way as for the first, but using the record-key-based perturbed count in place of the original count. Typically this perturbation is drawn from the same p-matrix as before, but the option is available to draw it from a different p-matrix (via scope_p_filename below).

With this approach, formula (7) for the perturbation is replaced with:

p0 = pTable[p_row_index(cell_key), p_col_index(n)]
if p0 ≠ 0:
  p = p0 + scope_pTable[p_row_index(cell_key, scope_key), p_col_index(n + p0)]
else:
  p = 0

You can switch on this behaviour by including in the tbe_perturb dataset configuration parameters:

      "parameters": {
        ...
        "use_additive_scope_perturbation": true,
        "scope_p_filename": "scope_pmatrix1"  // optional
      }

By default, the use_additive_scope_perturbation flag is false. If no scope_p_filename is provided, Protari uses the same p-matrix (given by p_filename) for scope-based perturbation.
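
The additive formula and the p-matrix fallback can be sketched in Python as below. This is a hedged illustration, not Protari's code: additive_perturbation is a hypothetical name, and the index functions and table are toy stand-ins passed in as arguments. They are NOT the paper's definitions of p_row_index and p_col_index.

```python
from typing import Callable, Optional, Sequence

Table = Sequence[Sequence[int]]


def additive_perturbation(
    cell_key: int,
    scope_key: int,
    n: int,
    p_table: Table,
    p_row_index: Callable[..., int],
    p_col_index: Callable[[int], int],
    scope_p_table: Optional[Table] = None,
) -> int:
    """Record-key-based perturbation plus a scope-based perturbation."""
    if scope_p_table is None:
        # No separate scope p-matrix supplied: reuse the main p-matrix.
        scope_p_table = p_table
    p0 = p_table[p_row_index(cell_key)][p_col_index(n)]
    if p0 == 0:
        return 0
    # The second lookup uses the already-perturbed count n + p0.
    row = p_row_index(cell_key, scope_key)
    return p0 + scope_p_table[row][p_col_index(n + p0)]


# Toy stand-ins, for illustration only (not the paper's index functions).
def toy_row_index(cell_key, scope_key=None):
    if scope_key is None:
        return cell_key % 2
    return (cell_key + scope_key - 1) % 2


def toy_col_index(n):
    return min(n, 2)


toy_table = [[0, 1, -1], [0, -1, 2]]
print(additive_perturbation(2, 4, 1, toy_table, toy_row_index, toy_col_index))
```

Passing the index functions in as arguments keeps the sketch independent of how the rows and columns are actually derived from the keys and counts.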

4. Longitudinal values

The paper does not explicitly describe how to handle longitudinal values, eg. for a query such as count by (occupation) and (postcode in 2005), which would return a list with the following headings:

year, occupation, postcode in 2005, perturbed count

Protari implements scope keys for this by requiring a primitive scope key for each longitudinal value, eg.

SEX@2000,F,1630316312
SEX@2000,M,1132901257
SEX@2000,,770735990
SEX@2005,F,1690210264
SEX@2005,M,1187297248
SEX@2005,,656446047

If there is more than one longitudinal field, eg. years and simulation iterations, then primitive scope keys need to be defined for SEX@2000@1, SEX@2000@2, etc.

Laplace Perturbation

Protari comes with an experimental implementation of Laplace-distributed noise, with a single scale parameter for all queries. It should not be used in production at this stage.

In order to use this algorithm with a SQL database, your settings file must include sql_laplace_pipeline.

An example of the dataset configuration required to use the Laplace perturbation algorithm is:

  "transform_definitions": {
    "laplace_perturb": {
      "parameters": {
        "scale": 1,
        "key": "secret100"
      }
    }
  }