# Global Settings
When the API starts, it reads a global settings file from `./settings.yaml`.
You can override this location by setting the environment variable `PROTARI_SETTINGS_PATH`.

The settings file tells the API where to find the data and the dataset configuration files, which parameters to use for perturbations, and how to authorize users. The file may be in either yaml or json format.
An example is:
```yaml
dataset_config_path: ./dataset_config
logging_config_path: ./logging_config.yaml
validate_on_startup: true
log_attributes:
  - remote_addr
log_headers:
  - X-Forwarded-For
base_url: https://protari.example.com/api/v1
organization:
  name: Sample
  title: Sample organization
  description: A description to include in every json-formatted API output.
  terms: Optional terms and conditions to include in every json-formatted API output.
query_class_definitions:
  aggregation:
    parameters:
      max_functions: 1
      allowed_functions:
        - sum
        - mean
        - count
transform_definitions:
  get_aggregated_sql_data:
    parameters:
      url: DBTYPE://USERNAME:PASSWORD@HOSTNAME/DB_NAME
  tbe_perturb:
    parameters:
      path: ./perturbation_data
auth_interface:
  reference: protari.auth_interface.db_auth.DatabaseAuthInterface
  parameters:
    url: DBTYPE://USERNAME:PASSWORD@HOSTNAME/DB_NAME
    qualifier_mapping:
      limit: limit
    permissions_post_processor:
      reference: protari_api.sql_query_limiter.SqlQueryLimiter
      parameters:
        url: DBTYPE://USERNAME:PASSWORD@HOSTNAME/DB_NAME
        table_name: query_limit
rate_limits:
  limits:
    aggregation:
      default:
        - 10/minute
includes:
  - sql_tbe_pipeline
  - sql_laplace_pipeline
```
The schema for the settings file is provided in the source code.
## Environment variables
Settings values can reference environment variables as `${ENV_VAR}`. A typical use-case is to include passwords without putting them directly in the file, eg:
```yaml
transform_definitions:
  get_aggregated_sql_data:
    parameters:
      url: DBTYPE://USERNAME:${DATABASE_PWD}@HOSTNAME/DB_NAME
```
together with the environment variable `DATABASE_PWD`.
You could alternatively put the entire database URL into an environment variable, eg:

```yaml
url: ${DATABASE_URL}
```
## dataset_config_path
The path to your dataset configuration files. All json files found under this path will be read by the API and interpreted as dataset configuration files. This path is relative to the location of the settings file.
## logging_config_path
The path to your logging configuration file. This path is relative to the location of the settings file.
## validate_on_startup
If true (the default), all datasets will be put through a number of validation checks when the API starts up.
## validate_responses
If true (the default is false), all output is validated explicitly against the API's swagger spec.
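For example, to keep the startup checks and also switch on response validation (both lines simply restate the documented defaults, with `validate_responses` flipped on):

```yaml
validate_on_startup: true   # the default
validate_responses: true    # default is false
```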
## log_attributes
An optional list of attributes of Flask's request object that you wish to include in the API's logs. For example, to log the user's IP address (when it is not obscured by a proxy layer), include `remote_addr`.
## log_headers
An optional list of headers you wish to include in the API's logs, eg. `User-Agent` or `X-Forwarded-For`. The `X-Forwarded-For` header can be useful when a user's IP address is not available via the `remote_addr` attribute.
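For example, to log both the direct remote address and any forwarded address (repeating the relevant lines of the main example above):

```yaml
log_attributes:
  - remote_addr
log_headers:
  - X-Forwarded-For
```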
## base_url
The API's base url, eg. `https://protari.example.com/api/v1`. It must include the swagger basePath. If provided, this is returned by the API in the aggregation response; it is not used internally by Protari. To avoid confusing your users, this base url should match the one actually served by the API, which is set independently in `protari-api`'s `application.py`, eg:

```python
app.add_api(swagger_spec, base_path='/api/v1', ...)
```
## maximum_authorization_header_length
If provided, the Protari API raises an authentication error if the length of the Authorization header (measured in characters) exceeds the specified value.
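For example (the value 4096 here is purely illustrative; the source does not suggest a default):

```yaml
maximum_authorization_header_length: 4096
```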
## organization
This is information for the user, and is returned in the relevant API outputs.
### description
This is information for the user, and is returned in all json-formatted API output (not sdmx-json).
### terms
This is intended for user interfaces to display terms and conditions of using the API. It is returned in all json-formatted API output (not sdmx-json).
## query_class_definitions
By default only the 'aggregation' query class is defined, but you have the flexibility to add your own.
### allowed_functions
The names of the functions allowed for this query class, eg.

```yaml
allowed_functions:
  - sum
  - mean
  - count
```

By default, only 'count' is allowed.
### max_functions
The maximum number of functions that a user can request in a single query. Default 1.
### function_types
References to custom function types defined in python can be added here, eg.

```yaml
function_types:
  - my_module.directory.module_name.neural_net
```
## transform_definitions
All the operations that are needed to take a query from the user and return output to them are called "transforms". They include, but are not limited to:
- Data aggregators
- Perturbation algorithms
- Rounding algorithms
- Field exclusion rules
- Sparsity restriction checks
- Replacing functions with new ones
- Supplementing functions with derived data
- Removing hidden fields and data
The available transforms must be listed in the settings file. In the example above, the included `sql_tbe_pipeline` does this: it provides the references to the functions which perform each of the above tasks, and some sensible default parameters for them. These parameters and references can be supplemented (and/or overridden) in the settings file, and also in each dataset configuration file.

For example, if all your datasets' data sit on the same SQL database, it makes sense to provide a single global connection URL to this database in the settings file, rather than repeat it for each dataset. On the other hand, perturbation parameters may differ between datasets, so should be defined there.
You will need to provide additional settings to use these transforms defined in the SQL TBE pipeline:
- `get_aggregated_sql_data`
- `tbe_perturb`
While it is possible to define the numerical TBE perturbation parameters in the settings file too, don't do this in `settings.yaml`! That's because the yaml and json file readers treat floating point numbers slightly differently: eg. the json reader treats `0.55` as the exact decimal 0.55, while the yaml reader represents it to machine precision, which is accurate only to about 10^-17. For the same reason, yaml dataset configuration files are not supported.
### get_aggregated_sql_data
Requires a database `url`, which you should set to the connection string of the database containing the data. This transform uses SQLAlchemy to connect to the SQL database, which works with most SQL dialects, including Oracle, MS-SQL, MySQL, sqlite and Postgres.

The API issues SQL SELECT statements to read from the database. It requires the FROM, AS, WHERE, GROUP BY, and ORDER BY clauses, and the COUNT and SUM functions. For mean and sum perturbations using the TBE algorithm, it also requires ROW_NUMBER, PARTITION BY and OVER, which are available in Postgres and Oracle, but not sqlite.

You can also define `max_custom_groups` to override the default of 160; see range data fields for more information.
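For example, a sketch combining the settings described above (the connection string placeholders follow the main example; the value 200 is illustrative, not a recommendation):

```yaml
transform_definitions:
  get_aggregated_sql_data:
    parameters:
      url: postgresql://USERNAME:${DATABASE_PWD}@HOSTNAME/DB_NAME
      max_custom_groups: 200  # override the default of 160
```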
### tbe_perturb
The TBE perturbation algorithm requires large matrices to perform its perturbation, which should be stored as csv files. Set `path` to the directory containing these matrix files.
## auth_interface
There are three auth interfaces provided by Protari.
### NoAuthInterface
`NoAuthInterface` is the default interface, used if none is provided in the settings file. It treats all non-empty Authorization headers and `key` URL parameters as invalid.
### DatabaseAuthInterface
This is a simple database key lookup, with reference `protari.auth_interface.db_auth.DatabaseAuthInterface`. To use it, specify the database connection string in `url`, using the same approach as for the SQL interface.
In addition, you can optionally specify the following parameters:
- `dataset_name` (default "auth_keys")
- `key_field_name` (default "key")
- `permission_field_name` (default "perm")
- `dataset_field_name` (default "dataset")
- `qualifier_mapping` (default None). The values of this object are taken as column names which are passed to the permissions post-processor, using the keys provided. Eg. `{"limit": "max_queries"}` passes the value of the column named "max_queries" to the post-processor, eg. as `{"limit": 1000}`.
- `engine_parameters` (allowing additional parameters for debugging).
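Putting these together, a sketch of a customized configuration (the column name `max_queries` echoes the example above; the other values restate the documented defaults):

```yaml
auth_interface:
  reference: protari.auth_interface.db_auth.DatabaseAuthInterface
  parameters:
    url: DBTYPE://USERNAME:${DATABASE_PWD}@HOSTNAME/DB_NAME
    dataset_name: auth_keys        # the default table name
    key_field_name: key            # the default
    permission_field_name: perm    # the default
    dataset_field_name: dataset    # the default
    qualifier_mapping:
      limit: max_queries           # pass column "max_queries" to the post-processor as "limit"
```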
The API user supplies an `Authorization` header with `Protari key=` preceding the key, eg.

```sh
curl -X GET --header 'Accept: application/json' --header 'Authorization: Protari key=abc123' 'http://localhost:8080/v1/datasets/'
```
Currently, you can alternatively provide the auth key using the `key` query parameter, eg. `?key=abc123`, but this is deprecated and may be removed in a future version.

Note that Protari does not provide any mechanisms for generating, maintaining or updating auth keys. Using this method in production is not recommended.
### JWTAuthInterface
This verifies a signed JSON web token (eg. see this introduction), using the OpenID Connect standard. The auth interface has reference `protari.auth_interface.jwt_auth.JWTAuthInterface`.

You should specify the following parameters:
- `algorithms`. An optional list of algorithms you would permit when validating the JWT. By default, this class allows only the asymmetric (public-private) algorithm RS256. If you know that your server only uses HMAC (and you know the secret key), you can change this.
- `issuer`. Identifies the principal that issued the JWT.
- `audience`. Identifies the recipients that the JWT is intended for.
- `key`. The public key for public-private RSA and EC algorithms, or the secret key for HMAC.
- `json_web_key_set`. An alternative to `key`, if you have a static json web key set ("jwks"; see below).
- `jwks_url`. It is common practice to provide the json web key set at an online url, to allow for key rotation. If this is provided, the contents of the url are cached at startup. Whenever a token's "kid" field cannot be matched against the cached jwks, the url is loaded and cached again. Note that this is currently done synchronously, ie. it holds up execution until it receives a response.
- `jwks_timeout`. The attempt to access `jwks_url` times out after this many seconds. Only used if `jwks_url` is present. Defaults to 2 seconds, ie. give up after 2 seconds. This should be kept low because the load holds up execution of the API.
- `jwks_wait_seconds`. The `jwks_url` is reloaded at most once per wait period. Only used if `jwks_url` is present. Defaults to 180 seconds, ie. at most one reload every 3 minutes.
- `permissions_name`. The Public (Collision-Resistant) Name used in the payload to provide the permissions. If not present, no special permissions are assumed. See below for more explanation.
- `permission_string_keys`. By default, permissions of the form "aggregation:secret" are interpreted as `{"op": "aggregation", "dataset": "secret"}`, with "op" being shorthand for the query class. Override `permission_string_keys` to add custom fields to be passed to the permissions post-processor, eg. `permission_string_keys: ["op", "dataset", "limit"]`.
You should configure your OpenID Connect identity provider to return permissions in the claim with the name given in `permissions_name`. These permissions should be a list, with each element being either:
- A colon-separated string, eg. `<op>:<dataset>:<limit>`, with the meaning of each part given by `permission_string_keys` (see above). Eg. `*:secret:50`.
- An object with at least `op` and `dataset` keys, and optional keyword arguments for the permissions post-processor, eg. `{"op": "*", "dataset": "secret", "limit": 50}`.
In each case, `op` gives the query class (eg. `aggregation`), `*` to match all query classes, or an empty string `""` to match only metadata queries.
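For illustration, a decoded token payload might then carry a permissions claim like the following (shown as yaml for readability; the claim name matches the example further below, the string form repeats the example above, and the object-form values are hypothetical):

```yaml
permissions.protari.d61.io:
  - "*:secret:50"          # string form: <op>:<dataset>:<limit>
  - op: aggregation        # object form
    dataset: sample
    limit: 100
```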
A sample value for `json_web_key_set` would be (with the `<...>`s replaced with long strings):

```yaml
json_web_key_set:
  keys:
    - alg: RS256
      kty: RSA
      use: sig
      n: <...>
      e: AQAB
      kid: <...>
```
Here is an example that uses a `jwks_url`, a permissions post-processor to limit the number of queries each user can ever make, and a `permission_string_keys` setting so that these limits can be given in the token in the string format `<op>:<dataset>:<limit>`:
```yaml
auth_interface:
  reference: protari.auth_interface.jwt_auth.JWTAuthInterface
  parameters:
    jwks_url: https://example.com/jwks
    issuer: https://example.com
    audience: https://protari.d61.io
    permissions_name: permissions.protari.d61.io
    additional_names:
      - email.protari.d61.io
    algorithms:
      - RS256
    permission_string_keys:
      - op
      - dataset
      - limit
    permissions_post_processors:
      - reference: protari_api.sql_query_limiter.SqlQueryLimiter
        parameters:
          url: postgresql:///protari_demo
          table_name: query_limit
```
For more details see the JWT standard.
## SqlQueryLimiter
The `SqlQueryLimiter` permissions post-processor can take the following parameters:
```yaml
permissions_post_processors:
  - reference: protari_api.sql_query_limiter.SqlQueryLimiter
    parameters:
      url: postgresql:///protari_demo
      exceeded_limit_message: You have reached the maximum number of queries allowed on this dataset  # the default
      column_name_mapping:  # the default is shown here; the values of each property are the database table columns
        user_id: user
        query_class_name: query_class
        dataset_name: dataset
        value: value
      table_name: query_limit
      engine_parameters:  # extra parameters to send to the sql engine
        echo: false
```
The SQL query limiter requires a writable SQL database table to store query limit counts. The database connection string is provided as `url` above, and the name of the table as `table_name`.

The values on the right hand side of `column_name_mapping` are the names of the relevant columns in the table. In the example above, they are:
- `user`: string; in the case of OpenID Connect authentication, this is the token's "subject" (`sub` claim)
- `query_class`: string; eg. "aggregation"
- `dataset`: string; the name of the dataset being queried, eg. "sample"
- `value`: integer; the number of times this query class has been accessed by this user on this dataset

The possible engine parameters are described in the SqlAlchemy documentation.
## rate_limits
The API's aggregation-related endpoints can be rate limited per authenticated user, per dataset. Datasets that can be queried without authentication cannot be rate limited in this way, as they do not require users to identify themselves.
The format for this specification is:
```yaml
rate_limits:
  storage_uri: "memory://"  # the default
  headers_enabled: false  # the default
  strategy: fixed-window  # the default
  limits:
    aggregation:
      public:
        - 100/second
      default:
        - 5/60 seconds
        - 8/2 minutes
      semicolon:
        - 3/60 seconds
```
Rate limiting is handled by the flask-limiter library; see its documentation for further information. In particular, redis (`redis://host:port`) or memcached (`memcached://host:port`) can be used to store usage data.
The limits under `limits.aggregation.public` are applied to datasets that do not require permission to query. If no public limit is specified, Protari uses 1 million queries per second as the limit. This limit applies across all unauthenticated users (and per user to authenticated users).
All aggregation queries that require authentication are rate-limited by the list under `default`. The limit applies per user for each dataset, and applies across all the aggregation endpoints (ie. `/aggregation`, `/aggregation/csv` and `/aggregation/sdmx-json`).
In addition, any query containing a semi-colon is further rate limited by the limits listed under `semicolon`. Note: `semicolon` is an experimental option and may be refined in a future release.
## includes
The example above finishes by including settings from other files, namely `sql_tbe_pipeline` and `sql_laplace_pipeline`. If no path is provided, these are assumed to be yaml files located in Protari's protari/settings/includes/ directory.
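For example (the second entry is hypothetical, assuming an explicit relative path is resolved like the other paths in this file):

```yaml
includes:
  - sql_tbe_pipeline            # bundled: protari/settings/includes/sql_tbe_pipeline.yaml
  - ./my_local_overrides.yaml   # hypothetical include given by an explicit path
```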
## Using the Protari library without the API
The above description applies to settings files used by the Protari API. The underlying Protari library only recognizes `query_class_definitions`, `transform_definitions` and `includes`.