Configuring Multi-Table Datasets

Introduction

It is now possible to use Protari with a dataset stored as a SQL database consisting of several tables. This document describes the configuration for such a deployment. It should be read in conjunction with the general documentation on configuring a dataset here.

Protari joins data resources using inner joins, eg. a query such as group_by=buyer_transactions.type should be interpreted as: how many people had at least one transaction of each type, in which they were the buyer?

Protari does not currently support using more than one relationship to the same resource within one query. Attempts to submit queries with this structure will return an error.

In order to use multi-table datasets, your settings.yaml file must include sql_tbe_multi_table_pipeline (see global settings).

Data Resources

In a multi-table dataset, there are several sets of fields, one for each table.

In order to accommodate this, a resources array has been added to the dataset configuration. Each data resource element within this array contains the metadata for one table in the dataset, including how that table is related to the others.

(This format is an extension of the Data Package specification. Each data resource is itself based on an extension of the Data Resource specification, using a schema that conforms to an extension of the Table Schema specification.)

Each data resource includes these additional fields:

There is also an additional field at the top level of dataset_config:

Relationships

Each data resource's foreign_keys are used to establish the relationships between all the resources, including the path back to the main resource. The foreign key's relationship_names are used to specify fields when querying the dataset, by prefixing them to the data resource's field names, eg. buyer_transactions.item.

The following examples should make this clearer. They are taken from the sample_joined dataset, whose complete configuration file is provided at the end.

A "One-to-Many" Example

The example below demonstrates a case where each unit on the main resource (eg. each person) has many entries in the joined resource (eg. one person can have many transactions).

The foreign_keys are always defined on the data resource which can have many entries per unit. In this example, that's the transaction data resource.

This example defines two relationships back to the main person data resource. The first links the person_id column in the transaction table to the _id column in the person table, and makes this link accessible through the API as buyer_transactions. The second links the provider_id column in the transaction table to the _id column in the person table, and makes this link accessible through the API as seller_transactions.

This means that the type field of the transaction data resource is accessible as either buyer_transactions.type or seller_transactions.type, depending on the relationship of interest.

{
  "resources": [
    {
      "name": "transaction",
      "foreign_keys": [
        {
          "fields": [
            "person_id"
          ],
          "forward_relationship_name": "buyer",
          "reference": {
            "fields": [
              "_id"
            ],
            "resource": "person",
            "reverse_relationship_name": "buyer_transactions",
            "reverse_relationship_title": "Buyer Transactions"
          }
        },
        {
          "fields": [
            "provider_id"
          ],
          "forward_relationship_name": "seller",
          "reference": {
            "fields": [
              "_id"
            ],
            "resource": "person",
            "reverse_relationship_name": "seller_transactions",
            "reverse_relationship_title": "Seller Transactions"
          }
        },
        ...
      ],
    ...
    },
    ...
  ],
  ...
}

In the above example, queries use the "reverse" relationship names to refer to the transaction table, because when moving from the main resource ('person') to the transaction resource, you are moving against the direction of the foreign key (which is defined on the transaction table).

The "forward" relationship names, buyer and seller, are not used in this example. See below for an example that uses the forward names.

A "Many-to-One" Foreign Key Example

The example below demonstrates a case where multiple units on the main resource (eg. people) can have a single entry in the joined resource (eg. many people can live in one household). (This example is taken from the sample_joined dataset.)

As noted before, the foreign_keys are always defined on the data resource which can have many entries per unit. In this example, that's the main person data resource.

This example defines a single relationship to the household data resource, which links the household_id column in the person table to the _id column in the household table, and makes this link accessible through the API as household.

This means that the postcode field of the household data resource is accessible as household.postcode.

  "resources": [
    {
      "name": "person",
      "foreign_keys": [
        {
          "fields": [
            "household_id"
          ],
          "forward_relationship_name": "household",
          "reference": {
            "fields": [
              "_id"
            ],
            "resource": "household",
            "reverse_relationship_name": "people"
          }
        }
      ],
    ...
    },
    ...
  ],
  ...
}

In this case, it is the forward_relationship_name (household) that is used to refer to the relationship, because the foreign key is defined on people (the main resource) and points towards household.

Specifying Sensitive Resource Information

To prevent sensitive information from being disclosed when providing access to a multi-table dataset, sensitive data resources can be specified. These sensitive data resources are data resources at a higher level than the main resource and for which there may be a risk of disclosure as a result of queries against the unit of the main resource. For example, household may be considered to be a sensitive data resource for a dataset that has person as its main data resource. If sensitive resources are specified, an extended perturbation algorithm will be employed.

To specify a sensitive data resource for a multi-table dataset you first need to have a dataset configured with the same set of data resources as the dataset on which you are specifying the sensitive data resource but with the sensitive resource as its main resource. You then need to provide 2 pieces of information when configuring the dataset for which you wish to specify a sensitive data resource: a relationship path from the current main data resource to the sensitive data resource; and the name of the dataset with the same data resources as the current dataset and for which the sensitive data resource is the main resource.

The following is an example of how a household data resource may be specified as sensitive. The relationship_path of household indicates that fields on the household data resource can be specified using a prefix of household.. The dataset with the name joined_household would have the same data resources as the current dataset but would have household as its main data resource.

"sensitive_resource_information": [
  {
    "relationship_path": ["household"],
    "dataset_name": "joined_household"
  }
],

A Sample Multi-Table Configuration

The below provides a complete example of how a multiple table dataset might be configured. In this case, there are 4 data resources, corresponding to 4 tables, in the dataset: person, household, transaction and item. The main resource is person which means that all queries should be interpreted in terms of the unit of that data resource, people.

Imagine this as:

The foreign_keys effectively encode the clauses in the above list.

{
  "description": "A sample person-level dataset with joined tables",
  "main_resource_name": "person",
  "sensitive_resource_information": [
    {
      "relationship_path": ["household"],
      "dataset_name": "joined_household"
    }
  ],
  "name": "joined_person",
  "query_class_definitions": {
    "aggregation": {
      "allowed_functions": [
        "count",
        "mean",
        "sum"
      ],
      "run": "sql_tbe_multi_table_pipeline",
      "transform_definitions": {
        "check_field_exclusions": {
          "parameters": {
            "field_exclusion_rules": [
              {
                "limit": 4
              }
            ]
          }
        },
        "get_sql_top_rows": {
          "parameters": {
            "parameters": {
              "number": 4,
              "record_key_name": "{tbe_perturb.record_key_name}",
              "value_tiebreaker_name": "{tbe_perturb.record_key_name}"
            }
          }
        },
        "tbe_perturb": {
          "parameters": {
            "bigN": 1766976779,
            "m": [
              0.5,
              0.3,
              0.15,
              0.05
            ],
            "p_filename": "pmatrix2",
            "record_key_name": "RECORD_KEY",
            "s_filename": "smatrix2",
            "smallC": 7,
            "smallN": 60
          }
        }
      }
    }
  },
  "requires_permission_to_query": false,
  "requires_permission_to_see": false,
  "resources": [
    {
      "description": "Some detail about the transaction level data",
      "fields": [
        {
          "name": "type",
          "title": "Transaction Type",
          "type": "string",
          "values": [
            {
              "name": "A",
              "title": "avocado"
            },
            {
              "name": "B",
              "title": "banana"
            },
            {
              "name": "C",
              "title": "cucumber"
            },
            {
              "name": "D",
              "title": "dragonfruit"
            },
            {
              "name": "E",
              "title": "eggplant"
            },
            {
              "name": "F",
              "title": "fig"
            },
            {
              "name": "G",
              "title": "ginger"
            }
          ]
        },
        {
          "can_be_above_maximum": false,
          "can_be_below_minimum": false,
          "decimal_places": 2,
          "exclusive_maximum": 150,
          "interval_size": 5,
          "minimum": 0,
          "name": "value",
          "title": "Transaction Value",
          "type": "number",
          "values": [
            {
              "is_sentinel": true,
              "name": null,
              "title": "unknown"
            }
          ]
        }
      ],
      "foreign_keys": [
        {
          "fields": [
            "person_id"
          ],
          "forward_relationship_name": "buyer",
          "reference": {
            "fields": [
              "_id"
            ],
            "resource": "person",
            "reverse_relationship_name": "buyer_transactions",
            "reverse_relationship_title": "Buyer Transactions"
          }
        },
        {
          "fields": [
            "provider_id"
          ],
          "forward_relationship_name": "seller",
          "reference": {
            "fields": [
              "_id"
            ],
            "resource": "person",
            "reverse_relationship_name": "seller_transactions",
            "reverse_relationship_title": "Seller Transactions"
          }
        },
        {
          "fields": [
            "item_id"
          ],
          "forward_relationship_name": "item",
          "reference": {
            "fields": [
              "_id"
            ],
            "resource": "item",
            "reverse_relationship_name": "transactions"
          }
        }
      ],
      "name": "transaction",
      "notes": "And some notes about the transaction level data",
      "table_name": "transaction",
      "title": "Transaction",
      "unit": {
        "plural": "transactions",
        "singular": "transaction"
      },
      "unit_id_name": "_id"
    },
    {
      "description": "Some detail about the person level data",
      "fields": [
        {
          "name": "SEX",
          "title": "Sex",
          "type": "string",
          "values": [
            {
              "name": "F",
              "title": "Female"
            },
            {
              "name": "M",
              "title": "Male"
            },
            {
              "is_sentinel": true,
              "name": "",
              "title": "Not stated"
            }
          ]
        },
        {
          "name": "COUNTRY",
          "title": "Country of Birth",
          "type": "integer",
          "values": [
            {
              "name": 1000,
              "title": "Oceania",
              "values": [
                {
                  "name": 1100,
                  "title": "Australia (including External Territories)",
                  "values": [
                    {
                      "name": 1101,
                      "title": "Australia"
                    },
                    {
                      "name": 1102,
                      "title": "Norfolk Island"
                    },
                    {
                      "name": 1199,
                      "title": "Australian External Territories"
                    }
                  ]
                },
                {
                  "name": 1201,
                  "title": "New Zealand"
                },
                {
                  "name": 1300,
                  "title": "Melanesia",
                  "values": [
                    {
                      "name": 1301,
                      "title": "New Caledonia"
                    },
                    {
                      "name": 1302,
                      "title": "Papua New Guinea"
                    },
                    {
                      "name": 1303,
                      "title": "Solomon Islands"
                    },
                    {
                      "name": 1304,
                      "title": "Vanuatu"
                    }
                  ]
                }
              ]
            },
            {
              "name": 9999,
              "title": "Other"
            },
            {
              "is_sentinel": true,
              "name": null,
              "title": "Not stated"
            }
          ]
        },
        {
          "can_be_above_maximum": true,
          "can_be_below_minimum": false,
          "exclusive_maximum": 100,
          "interval_size": 10,
          "minimum": 0,
          "name": "INCOME",
          "precision": 0.001,
          "title": "Income ($'000 per year)",
          "type": "number"
        }
      ],
      "foreign_keys": [
        {
          "fields": [
            "household_id"
          ],
          "forward_relationship_name": "household",
          "reference": {
            "fields": [
              "_id"
            ],
            "resource": "household",
            "reverse_relationship_name": "people"
          }
        }
      ],
      "name": "person",
      "table_name": "joined_person",
      "title": "Person",
      "unit": {
        "plural": "people",
        "singular": "person"
      },
      "unit_id_name": "_id"
    },
    {
      "description": "Some detail about the item level data",
      "fields": [
        {
          "description": "Name of the Item",
          "name": "item_name",
          "type": "string",
          "values": [
            {
              "name": "A",
              "title": "Accordion"
            },
            {
              "name": "B",
              "title": "Bugle"
            },
            {
              "name": "C",
              "title": "Clarinet"
            },
            {
              "name": "D",
              "title": "Didgeridoo"
            },
            {
              "is_sentinel": true,
              "name": "NF",
              "title": "Not Found"
            },
            {
              "is_sentinel": true,
              "name": "NP",
              "title": "Not Provided"
            }
          ]
        }
      ],
      "name": "item",
      "table_name": "item",
      "title": "Item",
      "unit": {
        "plural": "items",
        "singular": "item"
      },
      "unit_id_name": "_id"
    },
    {
      "description": "Detail about [household](http://www.example.com/household) data",
      "fields": [
        {
          "fields": [
            {
              "description": "Type of dwelling of usual residence",
              "name": "DWELL",
              "title": "Dwelling Type",
              "type": "string",
              "values": [
                {
                  "name": "001",
                  "title": "House"
                },
                {
                  "name": "002",
                  "title": "Apartment"
                },
                {
                  "name": "003",
                  "title": "Other"
                }
              ]
            },
            {
              "can_be_below_minimum": false,
              "exclusive_maximum": 8,
              "minimum": 0,
              "name": "ROOMS",
              "title": "Number of Bedrooms",
              "type": "integer",
              "values": [
                {
                  "name": 0,
                  "title": "None (including bedsits)"
                },
                {
                  "is_sentinel": true,
                  "name": null,
                  "title": "Not stated"
                },
                {
                  "is_sentinel": true,
                  "name": 999,
                  "title": "Not applicable"
                }
              ]
            }
          ],
          "title": "Dwelling"
        },
        {
          "description": "Postcode of usual residence",
          "name": "POSTCODE",
          "region_type": "poa",
          "title": "Postcode",
          "type": "string",
          "values": [
            {
              "name": "6722"
            },
            {
              "name": "6725"
            },
            {
              "name": "0872"
            },
            {
              "name": "6762"
            },
            {
              "name": "6642"
            },
            {
              "name": "6710"
            },
            {
              "name": "6429"
            }
          ]
        },
        {
          "can_be_above_maximum": true,
          "can_be_below_minimum": false,
          "decimal_places": -1,
          "exclusive_maximum": 8000,
          "interval_size": 1000,
          "minimum": 0,
          "name": "RENT",
          "title": "Weekly rent in dollars",
          "type": "number"
        }
      ],
      "name": "household",
      "table_name": "joined_household",
      "title": "Household",
      "unit": {
        "plural": "households",
        "singular": "household"
      },
      "unit_id_name": "_id"
    }
  ],
  "title": "Sample person dataset with joined data"
}