Configuring New Datasets
Data and Metadata Configuration
After the API reads in the global settings file, it reads in the dataset configuration files. These configuration files specify how to connect to the available datasets, their metadata, and some internal parameters, eg. for perturbing and weighting the results.
To add a new table to the API, you need to put its data into a database (if you are using the SQL data interface), and add its dataset configuration file.
Dataset Configuration
The dataset configuration files contain all the information needed by the API to work with a particular table's data.
You need one json
file for each table you wish to make available to users of the API.
The complete specification is given by the schema protari/dataset_config/dataset_config_schema.yaml
in the code.
The dataset configuration files are validated against this schema whenever the API is started.
Each file is structured as follows:
- name: The name used to identify the dataset in the API. The maximum length is 50 characters.
- title: An optional human readable title for the dataset.
- description: A description of the dataset, if desired.
- notes: More detailed notes about the dataset, if desired.
- resources: An array of data resource specifications (each data resource represents a table in the database) containing metadata about the data resources that make up the dataset, i.e.:
  - name (required) and title (optional) of the resource.
  - Field and value metadata.
  - The data resource's unit (eg. person, household or business, with both singular and plural specified).
  - description and notes about the resource specifically.
  - weight_name if the data resource has a weight field (the name of the field containing the weights).
  - unit_id_name: the name of the field containing the unit record id for the data resource.
- main_resource_name: The name of the main data resource in the dataset. This is required even if there is only one resource.
- query_class_definitions: Configuration for this dataset that applies to each available query class (eg. "aggregation"):
  - Which functions are allowed on this dataset (eg. count, mean or sum).
  - Which transform to execute for this query class (eg. sql_tbe_pipeline).
  - Dataset-specific transform parameters, eg:
    - A list of rules restricting the combinations of fields that can be accessed together, either as group-by or where clauses.
    - Perturbation parameters.
    - Sparsity thresholds, if required.
- Authorization requirements:
  - requires_permission_to_see: Set this to true if you do not want unauthorized users to know that the dataset exists. In that case, it will not appear under /datasets/ or /datasets/{name}. Default false. (If true, requires_permission_to_query must also be true.)
  - requires_permission_to_query: Set this to false if you want to allow unauthorized users to access any query (eg. /dataset/{name}/aggregation) on the dataset. Default true, ie. only authorized users can query the dataset.
While you are getting started, you will probably want to set "requires_permission_to_query": false
so that you
can query your dataset without authentication.
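To give a feel for how these pieces fit together, here is a minimal sketch of the overall structure. All names are purely illustrative, some required properties and all transform parameters are omitted, and the schema remains the authoritative reference:
{
  "name": "my_dataset",
  "title": "My Dataset",
  "requires_permission_to_query": false,
  "main_resource_name": "my_dataset",
  "resources": [{
    "name": "my_dataset",
    "table_name": "my_dataset_table",
    "unit": {"singular": "person", "plural": "people"},
    "fields": [
      {"name": "SEX", "type": "string", "values": [{"name": "F"}, {"name": "M"}]}
    ]
  }],
  "query_class_definitions": {
    "aggregation": {
      "allowed_functions": ["count"],
      "run": "sql_tbe_pipeline"
    }
  }
}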
Here is an example from the protari-sample
repository.
Multi-Table Datasets
For details of how to configure a multi-table dataset, see the documentation here.
Resources
As noted above, each data resource
in the resources
array represents the metadata for a table in the dataset.
Eg.
"resources": [{
"name": "The name of the data resource.",
"table_name": "The name of the database table for which this data resource specifies metadata.",
"description": "A description of this specific table.",
"unit_id_name": "The name of the field containing the unit record id for this data resource.",
"weight_name": "The name of the field containing the weights for this data resource (if any)."
"unit": {
"singular": "person",
"plural": "people"
}
"fields": [
{
"name": "SEX"
},
{
"name": "COUNTRY"
},
{
"name": "POSTCODE"
}
]
}]
Of course, such minimal field metadata will not give your users much guidance. Use the following sections to provide more information.
For examples of these features, see dataset_config
files for the sample datasets in the protari-sample
repository.
Unit
Specifying the unit
of the unit record data allows downstream applications to present queries and results in a
user-friendly way, and helps make sure they do not add results on different units (eg. people, households or
businesses).
Field Metadata
For each accessible field, supply the following metadata:
- name: The field name (as used in the API and in the underlying data). The maximum length is 50 characters (25 if using Oracle).
- title: An optional human-readable title (defaulting to the name if not present).
- description: An optional description.
- type: The type of the data as it is stored. This is required unless the field has fields. Must be one of:
  - string
  - integer
  - number (for floating point numbers)
  - date (for dates; date-times and times are not currently supported)
- is_longitudinal: An optional boolean flag for whether it is a longitudinal field (default false).
- region_type: The sort of region the field represents, if any (default none), eg. a csv-geo-au region type.
- values: For categorical fields, metadata describing all of the possible values (see below). Note this includes sentinel values, which are ignored in mean and sum calculations (eg. 9999 may mean "missing data").
- fields: An optional hierarchy of fields beneath this one, in which case this field is not actually queryable and should not be present in the data - see below.
Range data fields (ie. date fields and fields with numeric data) require careful configuration - see the separate section on range data fields.
Fields that are not listed will not be accessible through the API.
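For instance, a simple categorical field could be described along these lines (the names and titles below are taken from the sample metadata later in this page and are only illustrative):
{
  "name": "DWELL",
  "title": "Dwelling Type",
  "description": "Type of dwelling of usual residence",
  "type": "string",
  "values": [
    {"name": "001", "title": "House"},
    {"name": "002", "title": "Apartment"},
    {"name": "003", "title": "Other"}
  ]
}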
Hierarchical fields
The fields can be structured as a hierarchy, which can help the user interface of downstream applications when there are large numbers of fields. The hierarchy of fields is completely independent of the representation in the database, i.e. "parent" fields in the hierarchy are never requested from the database. Accordingly, a field hierarchy has no effect on the API's aggregation results, because users cannot perform aggregations on the "parent" fields.
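As a sketch, the Dwelling grouping from the sample metadata below could be expressed as a parent field whose fields array holds the real, queryable fields. The parent's name, and the omission of value and range metadata, are illustrative only; the exact shape generated for the sample datasets may differ:
{
  "name": "Dwelling",
  "fields": [
    {"name": "DWELL", "title": "Dwelling Type", "type": "string"},
    {"name": "ROOMS", "title": "Number of Bedrooms", "type": "integer"}
  ]
}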
Longitudinal fields
If your data has one row for each "unit" (eg. person or household), then you do not have any longitudinal fields.
However, if you have one row for each unit for each time period, then the time period is your longitudinal field.
In that case, you must tell the API by setting the field's is_longitudinal
property to true
.
Typically there is at most one longitudinal field (eg. "year"), but the API can handle multiple (eg. with simulated data, the simulation number could be a second longitudinal field).
If there is a longitudinal field, you must also specify which field contains the unit's id in unit_id_name
in the metadata.
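A sketch of the relevant parts of a resource's metadata, assuming a YEAR column holds the time period and a PERSON_ID column holds the unit record id (both names and the integer type are illustrative; other resource metadata is omitted):
"resources": [{
  "name": "people_by_year",
  "table_name": "people_by_year",
  "unit_id_name": "PERSON_ID",
  "fields": [
    {"name": "YEAR", "type": "integer", "is_longitudinal": true}
  ]
}]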
Value Metadata
For each value, the metadata is:
- name: The "name" of the value, ie. how it is represented in the data and in the API. This can be a string, or in the case of numeric fields, a number (but a string representation of the number is fine too). The maximum string length is 50 characters.
- title: A human-readable title (defaults to the name if not present).
- is_sentinel: A boolean flag indicating whether this is a sentinel value (default false) - see below.
Sentinel values
Sometimes "sentinel" values (eg. null
, 9999 or -1) are used to represent special cases,
such as missing data, unlinked data, inapplicable questions, or outliers.
(If present, null
must be a sentinel, and in fact defaults to is_sentinel: true
.)
For numeric fields, when calculating functions like means and sums, it is important that
sentinel values are not included. Configure this by explicitly listing the sentinel values of
each field, and setting is_sentinel
to true for them.
Eg.
{
  "name": 9999,
  "is_sentinel": true,
  "title": "Not stated"
}
You should also specify sentinel values for string fields,
so users can count records with non-sentinel values.
Values with the name null
cannot be used on string fields - use an empty string instead.
Empty strings are not automatically assumed to be sentinel values.
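For example, the SEX field in the sample metadata below uses an empty string as its sentinel, which could be configured as:
{
  "name": "",
  "is_sentinel": true,
  "title": "Not stated"
}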
Hierarchical values
The values can also be hierarchical, again to help the user interface of downstream applications when the user may need to select from a large number of values. Unlike hierarchical fields, hierarchical values do have an effect on the API's aggregation capability. For example, for a "country of birth" field, you might place regions at the top level, followed by countries. Users can then query how many people are born in each region, not just in each country.
The API does this by replacing the region name in the query with the names of all its child values. It is therefore important that the region names do not appear in the data, because they will not be accessible directly. This is checked by the SQL validation script when the API is started.
Hierarchical values are not allowed on fields containing range data. Eg. an integer "age" field could not define any values (eg. "40", "900", or "elderly") to contain other values.
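As an illustration using the COUNTRY values from the sample metadata below, a parent value and its children might be configured along these lines. This sketch assumes child values nest under a values property of the parent; check the schema and the generated sample configs for the exact structure:
{
  "name": 1300,
  "title": "Melanesia",
  "values": [
    {"name": 1301, "title": "New Caledonia"},
    {"name": 1302, "title": "Papua New Guinea"},
    {"name": 1303, "title": "Solomon Islands"},
    {"name": 1304, "title": "Vanuatu"}
  ]
}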
Alternative Metadata Formats
A script is available in the protari-setup
repository to generate the field and value metadata from a pair of csv files.
Some sample csv files provided in that repository are:
name,parent_titles,title,description,type,minimum,exclusive_maximum,precision,interval_size,can_be_below_minimum,region_type
SEX,,Sex,,string,,,,,,
COUNTRY,,Country of Birth,,integer,,,,,,
DWELL,Dwelling,Dwelling Type,Type of dwelling of usual residence,string,,,,,,
ROOMS,Dwelling,Number of Bedrooms,,integer,0,8,,,false,
POSTCODE,,Postcode,Postcode of usual residence,string,,,,,,poa
AMOUNT,,Dollar Amount,,number,0,100,0.01,10,,
EXPIRY,,Expiry date,,date,1970-01-01,2021-01-01,P1Y,P5Y,false,
and
field_name,is_sentinel,name,parent_name,title
SEX,,F,,Female
SEX,,M,,Male
SEX,TRUE,,,Not stated
COUNTRY,,1000,,Oceania
COUNTRY,,1100,1000,Australia (including External Territories)
COUNTRY,,1101,1100,Australia
COUNTRY,,1102,1100,Norfolk Island
COUNTRY,,1199,1100,Australian External Territories
COUNTRY,,1201,1000,New Zealand
COUNTRY,,1300,1000,Melanesia
COUNTRY,,1301,1300,New Caledonia
COUNTRY,,1302,1300,Papua New Guinea
COUNTRY,,1303,1300,Solomon Islands
COUNTRY,,1304,1300,Vanuatu
COUNTRY,,9999,,Other
COUNTRY,TRUE,,,Not stated
DWELL,,001,,House
DWELL,,002,,Apartment
DWELL,,003,,Other
ROOMS,,0,,None (including bedsits)
ROOMS,TRUE,,,Not stated
ROOMS,TRUE,999,,Not applicable
POSTCODE,,6722,,
POSTCODE,,6725,,
POSTCODE,,0872,,
POSTCODE,,6762,,
POSTCODE,,6642,,
POSTCODE,,6710,,
POSTCODE,,6429,,
EXPIRY,TRUE,,,Unknown
These files will generate a dataset configuration file containing the following hierarchy:
- SEX field, a string field which can take "F", "M" or "" (a sentinel value for "not stated").
- COUNTRY field, with a hierarchy of integer categorical values. When a parent value (1000, 1100, 1300) is used in the API, it is replaced with all its child values. The data should not contain the parent values.
- Dwelling parent field, containing:
  - DWELL field, which is a string field that can take the values 001, 002 or 003.
  - ROOMS field, which is an integer with a minimum of 0 and an exclusive maximum of 8.
- POSTCODE field, a string corresponding to a geographical region, with region_type "poa". It is recommended that the region_type conform to csv-geo-au where possible, for interoperability with NationalMap and other TerriaJS-based sites.
- AMOUNT field, a floating point field between 0 and 100 with 2 decimal places.
- EXPIRY field, a date field which can be queried in 5-year ranges.
The protari-setup
repository also contains csv files and full dataset configuration files for a sample dataset that
includes a longitudinal (ie. date or time) field.
QueryClass Definitions
query_class_definitions is a mapping from query class names (eg. "aggregation") to the configuration for that query class. The key properties are transform_definitions, which sets the dataset-specific parameters for each transform, and run, which specifies which transform to execute for this query class.
Here is an example which uses the standard sql_tbe pipeline:
"query_class_definitions": {
"aggregation": {
"allowed_functions": [
"count",
"mean",
"sum"
],
"run": "sql_tbe_pipeline",
"transform_definitions": {
"check_field_exclusions": {
"parameters": {
"field_exclusion_rules": [
{
"limit": 4
}
]
}
},
"check_query_size": {
"parameters": {
"maximum_output_values_length": 500000
}
},
"tbe_perturb": {
"parameters": {
"record_key_name": "RECORD_KEY",
"bigN": 1766976779,
"m": [
0.5,
0.3,
0.15,
0.05
],
"p_filename": "pmatrix1",
"s_filename": "smatrix1",
"smallC": 5,
"smallN": 7
}
},
"get_sql_top_rows": {
"parameters": {
"parameters": {
"number": 4,
"record_key_name": "{tbe_perturb.record_key_name}",
"value_tiebreaker_name": "{tbe_perturb.record_key_name}"
}
}
},
"check_sparsity": {
"parameters": {
"a": 0.5,
"b": 0.75
}
}
}
}
}
Allowed functions
An optional list of the allowed functions for this query class. By default, only count
is allowed.
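For example, to allow means as well as counts (but not sums) on a dataset:
"allowed_functions": ["count", "mean"]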
Run
The name of the transform to apply for this query class. Must be defined in the settings file. Typically, this is a transform containing a "pipeline" of other transformations.
Transform Definitions
check_field_exclusions
Optional field exclusion rules let you restrict the combinations of fields that can be accessed together, whether as "group by", "where" or "function" (eg. mean or sum) clauses.
Eg. The following configuration would prevent a query which groups by POSTCODE, SEX and ROOMS (due to the second rule, which limits queries to at most 2 of the listed fields). It would also prevent a query for the mean AMOUNT grouped by DWELL, SEX and COUNTRY (due to the first rule, which allows at most 3 fields in total per query).
"field_exclusion_rules": [
{
"limit": 3
},
{
"limit": 2,
"fields": [
{
"name": "AMOUNT"
},
{
"name": "SEX"
},
{
"name": "ROOMS"
},
{
"name": "POSTCODE"
}
]
}
Longitudinal fields are always excluded from the field exclusion rules, ie. they do not count towards the limit of 3 in the above example.
Note that, unlike all other transforms at the time of writing, the field exclusion rules are exposed to API users, so that downstream apps can use them to adjust the options they display.
check_query_size
Using this transform you can specify the maximum number of potential output values allowed per query. This counts every potential cell in the output table, including zeros and suppressed values.
The example below will prevent any query which could return over 500,000 values from being executed (eg. three group-by fields which each have 100 potential values could return 1 million values, so such a query would be prevented).
"check_query_size": {
"parameters": {
"maximum_output_values_length": 500000
}
},
By default, no size limit is applied.
get_aggregated_sql_data
This transform gets the aggregated data from a SQL database.
The SQL database connection string is usually specified in the global settings file, but it can be overridden here too.
You can also specify which validation checks should be performed for this dataset
when the API is started up, eg. for very large datasets you may wish to switch off some validation checks.
Pass the additional parameter validations
, eg:
"parameters": {
"validations": ["types", "config-values", "db-values", "db-group-values"]
}
The list above is the full list, which is the default.
If the empty array []
is passed, no validations are performed.
tbe_perturb
See the Perturbation Algorithms section for an explanation of the available perturbation algorithms, and how to configure them.
get_sql_top_rows
The TBE perturbation algorithm needs to know the top contributing rows in order to perturb means and sums.
This is achieved by the get_sql_top_rows
transform. It takes three parameters:
- The number of top rows required (this must match the length of the TBE m vector - in the example, 4).
- The name of the record key field.
- The name of the field to use as a tiebreaker.
The standard SQL TBE pipeline sets the record key field to match the TBE transform, and also defaults the tiebreaker field to be the record key field. Either or both of these can be overridden if desired.
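For instance, a sketch of overriding just the tiebreaker field (the field name TIEBREAK_ID is purely illustrative):
"get_sql_top_rows": {
  "parameters": {
    "number": 4,
    "value_tiebreaker_name": "TIEBREAK_ID"
  }
}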
check_sparsity
The sparsity restriction algorithm is also described in the TBE paper. It consists simply of two thresholds on the ratio of low-count values, above which no results should be returned. If not provided, no threshold is applied.
Eg.
"sparsity_thresholds": {
"a": 0.5,
"b": 0.75
}
Data Formats
The data must be stored in a single SQL database table. A simple csv-formatted example that would work with the above metadata files, assuming no perturbation parameters are required, is:
SEX,COUNTRY,DWELL,ROOMS
F,1201,001,4
,1302,003,1
F,1301,002,1
M,1101,002,8
(This is a truncated version of the sample data provided in the protari-setup
repository.)
Optionally, you may need the following additional fields (a sketch follows the list):
- The "TBE" perturbation algorithm requires a unique integer record key column.
- If a longitudinal field is present (eg. "year"), a unit id column is required.
- A weight column.
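For illustration, the sample rows above might be extended with record key and weight columns like this. The column names RECORD_KEY and WEIGHT are illustrative and must match the record_key_name and weight_name settings in your configuration, and the weight values are made up:
SEX,COUNTRY,DWELL,ROOMS,RECORD_KEY,WEIGHT
F,1201,001,4,1,1.0
,1302,003,1,2,0.8
F,1301,002,1,3,1.2
M,1101,002,8,4,1.0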
You might consider creating indices on all fields in the table, to improve the speed of where
and group by
queries.
If needed, some data cleaning and processing scripts are provided in protari-setup
. See its README for details.
Validation checks
When the API is started, the settings and dataset configuration files are validated against their schemas, and any errors reported, as are any conflicting dataset names.
(This validation step can be turned off in the global settings.)
Further, the configuration files are validated against the data in the SQL database – in particular:
"types":
- If using the TBE perturbation algorithm, the record_key field is an integer.
- The weight field, if present, is either a float or an integer.
- The dataset config's types match those in the database.
"config-values":
- String fields have some values specified in the config.
- String fields only have string values in the config.
- String fields have no NULL values in the config.
- String fields do not have minimum or maximum defined in the config.
"db-values":
- String fields have no NULL values in the database.
- Range data field values are all inside their minimum/maximum range, if one or both of can_be_below_minimum and can_be_above_maximum is false.
"db-group-values":
- No group-level field values occur in the database. (This can be slow on large datasets.)
"db-relationships":
- The columns for fields in each data resource's foreign keys exist.
Which validations are performed can be controlled by the get_aggregated_sql_data parameters for each dataset, with the valid names being the headings in the list above.
Suggested directory structure
I recommend creating a new directory following the structure of the protari-sample
repository:
settings.yaml
inputs/
  dataset_name_1/
    base_config.json  # the non-field config for dataset 1
    field_metadata.csv
    value_metadata.csv
    data.csv
  dataset_name_2/
  dataset_name_3/
processed_data/
  dataset_name_1/
  dataset_name_2/
  dataset_name_3/
perturbation_data/
  pmatrix_1.csv
  pmatrix_2.csv
  smatrix_1.csv
  smatrix_2.csv
dataset_config/
  dataset_name_1.json
  dataset_name_2.json
  dataset_name_3.json
scripts/
  your_custom_processing_scripts.py
  your_custom_build.sh
Adjust this as needed if your data or metadata starts in a different format.
You can copy across the scripts
and settings.yaml
from protari-sample
and modify them as needed for your situation.
If used unchanged, build.sh
will:
- Read dataset metadata and data from the inputs directory
- Output processed data files in processed_data
- Output the dataset configuration files in dataset_config
- Create a Postgres database called protari_demo containing each of the datasets, each populated from the data.csv files
- Add an auth_keys table to the same database, with the dummy keys natmapkey, superkey and abc123.