Configuring New Datasets

Data and Metadata Configuration

After the API reads in the global settings file, it reads in the dataset configuration files. These configuration files specify how to connect to the available datasets, their metadata, and some internal parameters, eg. for perturbing and weighting the results.

To add a new table to the API, you need to put its data into a database (if you are using the SQL data interface), and add its dataset configuration file.

Dataset Configuration

The dataset configuration files contain all the information needed by the API to work with a particular table's data. You need one json file for each table you wish to make available to users of the API.

The complete specification is given by the schema protari/dataset_config/dataset_config_schema.yaml in the code. The dataset configuration files are validated against this schema whenever the API is started.

Each file is structured as follows:
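The full set of top-level properties is defined by the schema above; as a rough sketch limited to the properties discussed on this page (with most values elided), a file looks something like:

{
  "requires_permission_to_query": false,
  "resources": [
    {
      "name": "...",
      "table_name": "...",
      "unit": {"singular": "person", "plural": "people"},
      "fields": [...]
    }
  ],
  "query_class_definitions": {...}
}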

While you are getting started, you will probably want to set "requires_permission_to_query": false so that you can query your dataset without authentication.

For a complete example, see the protari-sample repository.

Multi-Table Datasets

For details of how to configure a multi-table dataset, see the separate documentation on multi-table datasets.

Resources

As noted above, each data resource in the resources array represents the metadata for a table in the dataset.

Eg.

  "resources": [{
    "name": "The name of the data resource.",
    "table_name": "The name of the database table for which this data resource specifies metadata.",
    "unit_id_name": "The name of the field containing the unit record id  for this data resource.",
    "weight_name": "The name of the field containing the weights for this data resource (if any)."
    "unit": {
      "singular": "person",
      "plural": "people"
    },
    "fields": [
      {
        "name": "SEX"
      },
      {
        "name": "COUNTRY"
      },
      {
        "name": "POSTCODE"
      }
    ]
  }]

Of course, such minimal field metadata will not give your users much guidance. Use the following sections to provide more information.

For examples of these features, see dataset_config files for the sample datasets in the protari-sample repository.

Unit

Specifying the unit of the unit record data allows downstream applications to present queries and results in a user-friendly way, and helps ensure they do not combine results for different units (eg. people, households or businesses).

Field Metadata

For each accessible field, supply metadata such as its name, a human-readable title, a description and its data type; the full set of properties is given by the schema, and most of them correspond to the csv columns shown under Alternative Metadata Formats below.

Range data fields (ie. date fields and fields with numeric data) require careful configuration - see the separate section on range data fields.

Fields that are not listed will not be accessible through the API.
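For example, the DWELL field from the sample csv files below might be described by an entry along the following lines (illustrative only; the exact structure is produced by the metadata script described under Alternative Metadata Formats and is given by the schema):

      {
        "name": "DWELL",
        "title": "Dwelling Type",
        "description": "Type of dwelling of usual residence",
        "type": "string"
      }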

Hierarchical fields

The fields can be structured as a hierarchy, which can help the user interface of downstream applications when there are large numbers of fields. The hierarchy of fields is completely independent of the representation in the database, i.e. "parent" fields in the hierarchy are never requested from the database. Accordingly, a field hierarchy has no effect on the API's aggregation results, because users cannot perform aggregations on the "parent" fields.
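Purely as an illustration of the idea (the actual structure is generated from the parent_titles csv column by the script described under Alternative Metadata Formats, and may differ from this sketch), a "Dwelling" parent field grouping the DWELL and ROOMS fields might look something like:

      {
        "title": "Dwelling",
        "fields": [
          {
            "name": "DWELL"
          },
          {
            "name": "ROOMS"
          }
        ]
      }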

Longitudinal fields

If your data has one row for each "unit" (eg. person or household), then you do not have any longitudinal fields. However, if you have one row for each unit for each time period, then the time period is your longitudinal field. In that case, you must tell the API by setting the field's is_longitudinal property to true.

Typically there is at most one longitudinal field (eg. "year"), but the API can handle multiple (eg. with simulated data, the simulation number could be a second longitudinal field).

If there is a longitudinal field, you must also specify which field contains the unit's id in unit_id_name in the metadata.
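Eg. assuming a hypothetical PERSON_ID id column and a YEAR time-period field, the relevant parts of the data resource might look like:

    "unit_id_name": "PERSON_ID",
    "fields": [
      {
        "name": "YEAR",
        "is_longitudinal": true
      }
    ]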

Value Metadata

For each value, the metadata consists of its name (the value as it appears in the data), a human-readable title (optional), and optionally an is_sentinel flag and a parent value (see Hierarchical values below).
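Eg. matching the first row of the sample value metadata csv below:

      {
        "name": "F",
        "title": "Female"
      }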

Sentinel values

Sometimes "sentinel" values (eg. null, 9999 or -1) are used to represent special cases, such as missing data, unlinked data, inapplicable questions, or outliers. (If present, null must be a sentinel, and in fact defaults to is_sentinel: true.)

For numeric fields, when calculating functions like means and sums, it is important that sentinel values are not included. Configure this by explicitly listing the sentinel values of each field, and setting is_sentinel to true for them.

Eg.

      {
        "name": 9999,
        "is_sentinel": true,
        "title": "Not stated"
      }

You should also specify sentinel values for string fields, so users can count records with non-sentinel values. Values with the name null cannot be used on string fields - use an empty string instead. Empty strings are not automatically assumed to be sentinel values.
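Eg. for the string field SEX in the sample data, the "Not stated" sentinel is configured with an empty string as its name:

      {
        "name": "",
        "is_sentinel": true,
        "title": "Not stated"
      }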

Hierarchical values

The values can also be hierarchical, again to help the user interface of downstream applications when the user may need to select from a large number of values. Unlike hierarchical fields, hierarchical values do have an effect on the API's aggregation capability. For example, for a "country of birth" field, you might place regions at the top level, followed by countries. Users can then query how many people are born in each region, not just in each country.

The API does this by replacing the region name in the query with the names of all its child values. It is therefore important that the region names do not appear in the data, because they will not be accessible directly. This is checked by the SQL validation script when the API is started.

Hierarchical values are not allowed on fields containing range data. Eg. an integer "age" field could not define a value (eg. "40", "900", or "elderly") that contains other values.

Alternative Metadata Formats

A script is available in the protari-setup repository to generate the field and value metadata from a pair of csv files. Some sample csv files provided in that repository are:

name,parent_titles,title,description,type,minimum,exclusive_maximum,precision,interval_size,can_be_below_minimum,region_type
SEX,,Sex,,string,,,,,,
COUNTRY,,Country of Birth,,integer,,,,,,
DWELL,Dwelling,Dwelling Type,Type of dwelling of usual residence,string,,,,,,
ROOMS,Dwelling,Number of Bedrooms,,integer,0,8,,,false,
POSTCODE,,Postcode,Postcode of usual residence,string,,,,,,poa
AMOUNT,,Dollar Amount,,number,0,100,0.01,10,,
EXPIRY,,Expiry date,,date,1970-01-01,2021-01-01,P1Y,P5Y,false,

and

field_name,is_sentinel,name,parent_name,title
SEX,,F,,Female
SEX,,M,,Male
SEX,TRUE,,,Not stated
COUNTRY,,1000,,Oceania
COUNTRY,,1100,1000,Australia (including External Territories)
COUNTRY,,1101,1100,Australia
COUNTRY,,1102,1100,Norfolk Island
COUNTRY,,1199,1100,Australian External Territories
COUNTRY,,1201,1000,New Zealand
COUNTRY,,1300,1000,Melanesia
COUNTRY,,1301,1300,New Caledonia
COUNTRY,,1302,1300,Papua New Guinea
COUNTRY,,1303,1300,Solomon Islands
COUNTRY,,1304,1300,Vanuatu
COUNTRY,,9999,,Other
COUNTRY,TRUE,,,Not stated
DWELL,,001,,House
DWELL,,002,,Apartment
DWELL,,003,,Other
ROOMS,,0,,None (including bedsits)
ROOMS,TRUE,,,Not stated
ROOMS,TRUE,999,,Not applicable
POSTCODE,,6722,,
POSTCODE,,6725,,
POSTCODE,,0872,,
POSTCODE,,6762,,
POSTCODE,,6642,,
POSTCODE,,6710,,
POSTCODE,,6429,,
EXPIRY,TRUE,,,Unknown

These files will generate a dataset configuration file containing the following hierarchy:
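In outline, derived directly from the csv files above (field titles are shown; the COUNTRY codes illustrate the value hierarchy, the Dwelling group illustrates the field hierarchy):

Sex
Country of Birth
    1000 Oceania
        1100 Australia (including External Territories)
            1101 Australia
            1102 Norfolk Island
            1199 Australian External Territories
        1201 New Zealand
        1300 Melanesia
            1301 New Caledonia
            1302 Papua New Guinea
            1303 Solomon Islands
            1304 Vanuatu
    9999 Other
Dwelling
    Dwelling Type
    Number of Bedrooms
Postcode
Dollar Amount
Expiry date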

The protari-setup repository also contains csv files and full dataset configuration files for a sample dataset that includes a longitudinal (ie. date or time) field.

QueryClass Definitions

query_class_definitions is a mapping from query class names (eg. "aggregation") to objects describing how to configure each query class. The key properties are transform_definitions, which sets the dataset-specific parameters for each transform, and run, which specifies which transform to execute for this query class.

Here is an example which uses the standard sql_tbe pipeline:

  "query_class_definitions": {
    "aggregation": {
      "allowed_functions": [
        "count",
        "mean",
        "sum"
      ],
      "run": "sql_tbe_pipeline",
      "transform_definitions": {
        "check_field_exclusions": {
          "parameters": {
            "field_exclusion_rules": [
              {
                "limit": 4
              }
            ]
          }
        },
        "check_query_size": {
          "parameters": {
            "maximum_output_values_length": 500000
          }
        },
        "tbe_perturb": {
          "parameters": {
            "record_key_name": "RECORD_KEY",
            "bigN": 1766976779,
            "m": [
              0.5,
              0.3,
              0.15,
              0.05
            ],
            "p_filename": "pmatrix1",
            "s_filename": "smatrix1",
            "smallC": 5,
            "smallN": 7
          }
        },
        "get_sql_top_rows": {
          "parameters": {
            "parameters": {
              "number": 4,
              "record_key_name": "{tbe_perturb.record_key_name}",
              "value_tiebreaker_name": "{tbe_perturb.record_key_name}"
            }
          }
        },
        "check_sparsity": {
          "parameters": {
            "a": 0.5,
            "b": 0.75
          }
        }
      }
    }
  }

Allowed functions

An optional list of the allowed functions for this query class. By default, only count is allowed.

Run

The name of the transform to apply for this query class. Must be defined in the settings file. Typically, this is a transform containing a "pipeline" of other transformations.

Transform Definitions

check_field_exclusions

Optional field exclusion rules let you restrict the combinations of fields that can be accessed together, whether as "group by", "where" or "function" (eg. mean or sum) clauses.

Eg. the following configuration would prevent a query which groups by POSTCODE, SEX and ROOMS (the second rule allows at most 2 of the listed fields, and this query uses 3 of them). It would also prevent a query for the mean AMOUNT grouped by DWELL, SEX and COUNTRY (the first rule allows at most 3 fields in total per query, and this query uses 4).

    "field_exclusion_rules": [
      {
        "limit": 3
      },
      {
        "limit": 2,
        "fields": [
          {
            "name": "AMOUNT"
          },
          {
            "name": "SEX"
          },
          {
            "name": "ROOMS"
          },
          {
            "name": "POSTCODE"
          }
        ]
      }
    ]

Longitudinal fields are always excluded from the field exclusion rules, ie. they do not count towards the limit of 3 in the above example.

Note that, unlike all other transforms at the time of writing, the field exclusion rules are exposed to users, so that downstream apps can use them to adjust the options they display.

check_query_size

Using this transform you can specify the maximum number of potential output values allowed per query. This counts every potential cell in the output table, including zeros and suppressed values.

Because it calculates the number of potential output values by multiplying together the numbers of possible values of all requested fields, it will raise an error on startup if a limit is set and any field has allow_any_value set to true.

The example below prevents any query which could return over 500,000 output values from being executed (eg. three group-by fields which each have 100 potential values could return 1 million values, so such a query would be prevented).

        "check_query_size": {
          "parameters": {
            "maximum_output_values_length": 500000
          }
        },

By default, no size limit is applied.

get_aggregated_sql_data

This transform gets the aggregated data from a SQL database.

The SQL database connection string is usually specified in the global settings file, but it can be overridden here too.

You can also specify which validation checks should be performed for this dataset when the API is started up, eg. for very large datasets you may wish to switch off some validation checks. Pass the additional parameter validations, eg:

        "parameters": {
            "validations": ["types", "config-values", "db-values", "db-group-values"]
        }

The list above is the full list, which is the default. If the empty array [] is passed, no validations are performed.

tbe_perturb

See the Perturbation Algorithms section for an explanation of the available perturbation algorithms, and how to configure them.

get_sql_top_rows

The TBE perturbation algorithm needs to know the top contributing rows in order to perturb means and sums. This is achieved by the get_sql_top_rows transform. It takes three parameters: number, record_key_name and value_tiebreaker_name, as shown in the example above.

The standard SQL TBE pipeline sets the record key field to match the TBE transform, and also defaults the tiebreaker field to be the record key field. Either or both of these can be overridden if desired.

check_sparsity

The sparsity restriction algorithm is also described in the TBE paper. It consists simply of two thresholds on the ratio of low-count values, above which no results should be returned. If not provided, no threshold is applied.

Eg.

    "sparsity_thresholds": {
      "a": 0.5,
      "b": 0.75
    }

Data Formats

The data must be stored in a single SQL database table. A simple csv-formatted example that would work with the above metadata files, assuming no perturbation parameters are required, is:

SEX,COUNTRY,DWELL,ROOMS
F,1201,001,4
,1302,003,1
F,1301,002,1
M,1101,002,8

(This is a truncated version of the sample data provided in the protari-setup repository.)

Optionally, you may also need additional fields, eg. a record key field for perturbation (referenced by record_key_name in the tbe_perturb parameters), a weights field (referenced by weight_name in the resource metadata), and a unit id field for longitudinal data (referenced by unit_id_name).
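For example, adding hypothetical RECORD_KEY and WEIGHT columns to the first rows of the sample data (the key and weight values here are made up purely for illustration):

SEX,COUNTRY,DWELL,ROOMS,RECORD_KEY,WEIGHT
F,1201,001,4,103756,1.0
,1302,003,1,550214,0.8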

You might consider creating indices on all fields in the table, to improve the speed of where and group by queries.

If needed, some data cleaning and processing scripts are provided in protari-setup. See its README for details.

Validation checks

When the API is started, the settings and dataset configuration files are validated against their schemas, and any errors reported, as are any conflicting dataset names.

(This validation step can be turned off in the global settings.)

Further, the configuration files are validated against the data in the SQL database – in particular:

"types":

"config-values":

"db-values":

"db-group-values":

"db-relationships":

Which validations are performed can be controlled by the get_aggregated_sql_data parameters for each dataset, with the valid names being the headings in the list above.

Suggested directory structure

I recommend creating a new directory following the structure of the protari-sample repository:

settings.yaml
inputs/
    dataset_name_1/
        base_config.json    # the non-field config for dataset 1
        field_metadata.csv
        value_metadata.csv
        data.csv
    dataset_name_2/
    dataset_name_3/
processed_data/
    dataset_name_1/
    dataset_name_2/
    dataset_name_3/
perturbation_data/
    pmatrix_1.csv
    pmatrix_2.csv
    smatrix_1.csv
    smatrix_2.csv
dataset_config/
    dataset_name_1.json
    dataset_name_2.json
    dataset_name_3.json
scripts/
    your_custom_processing_scripts.py
    your_custom_build.sh

Adjust this as needed if your data or metadata starts in a different format.

You can copy across the scripts and settings.yaml from protari-sample and modify them as needed for your situation. If used unchanged, build.sh will: