Global Settings

When the API starts, it first reads a global settings file from ./settings.yaml. You can override this location by setting the environment variable PROTARI_SETTINGS_PATH.
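
For example, to point the API at a settings file outside the working directory (the path here is hypothetical):

export PROTARI_SETTINGS_PATH=/etc/protari/settings.yaml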

The settings file tells the API where to find the data and the dataset configuration files, which parameters to use for perturbations, and how to authorize users. The file may be in either yaml or json format.

An example is:

dataset_config_path: ./dataset_config
logging_config_path: ./logging_config.yaml
validate_on_startup: true
log_attributes:
  - remote_addr
log_headers:
  - X-Forwarded-For
base_url: https://protari.example.com/api/v1
organization:
  name: Sample
  title: Sample organization
description: A description to include in every json-formatted API output.
terms: Optional terms and conditions to include in every json-formatted API output.
query_class_definitions:
  aggregation:
    parameters:
      max_functions: 1
      allowed_functions:
      - sum
      - mean
      - count
transform_definitions:
  get_aggregated_sql_data:
    parameters:
      url: DBTYPE://USERNAME:PASSWORD@HOSTNAME/DB_NAME
  tbe_perturb:
    parameters:
      path: ./perturbation_data
auth_interface:
  reference: protari.auth_interface.db_auth.DatabaseAuthInterface
  parameters:
    url: DBTYPE://USERNAME:PASSWORD@HOSTNAME/DB_NAME
    qualifier_mapping:
      limit: limit
    permissions_post_processor:
      reference: protari_api.sql_query_limiter.SqlQueryLimiter
      parameters:
        url: DBTYPE://USERNAME:PASSWORD@HOSTNAME/DB_NAME
        table_name: query_limit
rate_limits:
  limits:
    aggregation:
      default:
        - 10/minute
includes:
- sql_tbe_pipeline
- sql_laplace_pipeline

The schema for the settings file is provided in the source code.

Environment variables

Settings values can reference environment variables as ${ENV_VAR}. A typical use case is to include passwords without putting them directly in the file, eg:

transform_definitions:
  get_aggregated_sql_data:
    parameters:
      url: DBTYPE://USERNAME:${DATABASE_PWD}@HOSTNAME/DB_NAME

together with the environment variable DATABASE_PWD.

You could alternatively put the entire database URL into an environment variable, eg:

      url: ${DATABASE_URL}

dataset_config_path

The path to your dataset configuration files. All json files found under this path will be read by the API and interpreted as dataset configuration files. This path is relative to the location of the settings file.
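
For example, with dataset_config_path: ./dataset_config, a layout like the following would define three datasets (the file names are hypothetical; subdirectories are assumed to be searched, since all json files "under" the path are read):

dataset_config/
  census.json
  health.json
  business/
    payroll.json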

logging_config_path

The path to your logging configuration file. This path is relative to the location of the settings file.

validate_on_startup

If true (the default), all datasets will be put through a number of validation checks when the API starts up.

validate_responses

If true (the default is false), all output is validated explicitly against the API's swagger spec.
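
For example, to state both settings explicitly (these are the defaults):

validate_on_startup: true
validate_responses: false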

log_attributes

An optional list of attributes of Flask's request object that you wish to include in the API's logs, eg. to log the user's IP address (when it is not obscured by a proxy layer), include remote_addr.

log_headers

An optional list of headers you wish to include in the API's logs, eg. User-Agent or X-Forwarded-For. The X-Forwarded-For header can be useful when a user's IP address is not available via the attribute remote_addr.
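
For example, to log the request method and IP address along with two common headers (remote_addr and method are both attributes of Flask's request object):

log_attributes:
  - remote_addr
  - method
log_headers:
  - User-Agent
  - X-Forwarded-For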

base_url

The API's base url, eg. https://protari.example.com/api/v1. Must include the swagger basePath. If provided, this is returned by the API in the aggregation response. It is not used internally by Protari.

To avoid confusing your users, this base url should match the one actually served by the API, which is set independently in protari-api's application.py, eg:

app.add_api(swagger_spec, base_path='/api/v1', ...

maximum_authorization_header_length

If provided, the Protari API raises an authentication error if the length of the Authorization header (measured in characters) exceeds the specified value.
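
For example (the value 8192 is an arbitrary illustration, not a recommendation):

maximum_authorization_header_length: 8192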

organization

This is information for the user, and is returned in the relevant API outputs.

description

This is information for the user, and is returned in all json-formatted API output (not sdmx-json).

terms

This is intended for user interfaces to display terms and conditions of using the API. It is returned in all json-formatted API output (not sdmx-json).

query_class_definitions

By default only the 'aggregation' query class is defined, but there is the flexibility to add your own.

allowed_functions

The names of the functions allowed for this query class, eg.

        allowed_functions:
        - sum
        - mean
        - count

By default, only 'count' is allowed.

max_functions

The maximum number of functions that a user can request in a single query. Default 1.

function_types

References to custom function types defined in python can be added here, eg.

        function_types:
        - my_module.directory.module_name.neural_net

transform_definitions

All the operations needed to take a query from the user and return output to them are called "transforms". They include, but are not limited to, fetching aggregated data from the source database (eg. get_aggregated_sql_data) and perturbing the results (eg. tbe_perturb).

The available transforms must be listed in the settings file. In the example above, the included sql_tbe_pipeline does this: it provides the references to the functions which perform each of these tasks, along with some sensible default parameters for them. These parameters and references can be supplemented (and/or overridden) in the settings file, and also in each dataset configuration file. For example, if all your datasets' data sit on the same SQL database, it makes sense to provide a single global connection URL in the settings file, rather than repeat it for each dataset. On the other hand, perturbation parameters may differ between datasets, so they should be defined in the dataset configuration files.

You will need to provide additional settings to use the transforms defined in the SQL TBE pipeline, as described in the following subsections.

While it is possible to define the numerical TBE perturbation parameters in the settings file too, don't do this in settings.yaml! That's because the yaml and json file readers treat floating point numbers slightly differently: eg. the json reader treats 0.55 as the exact decimal 0.55, while the yaml reader represents it to machine precision, which is accurate to about 10^-17. For this reason, yaml dataset configuration files are not supported either.

get_aggregated_sql_data

Requires a database "url", which you should set to the connection string of the database containing the data.

This transform uses SQLAlchemy to communicate with the SQL database, which works with most SQL dialects, including Oracle, MS-SQL, MySQL, sqlite and Postgres.

The API issues SQL "SELECT" statements to read from the database. It requires the database to support the FROM, AS, WHERE, GROUP BY, and ORDER BY clauses, and the COUNT and SUM functions. For mean and sum perturbations using the TBE algorithm, it also requires ROW_NUMBER, PARTITION BY and OVER, which are available in Postgres and Oracle, but not sqlite.

You can also define max_custom_groups to override the default of 160; see range data fields for more information.
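
A minimal sketch, assuming max_custom_groups sits alongside url under parameters (the connection string is a placeholder):

transform_definitions:
  get_aggregated_sql_data:
    parameters:
      url: postgresql://USERNAME:${DATABASE_PWD}@HOSTNAME/DB_NAME
      max_custom_groups: 200  # override the default of 160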

tbe_perturb

The TBE perturbation algorithm requires large matrices to perform its perturbation, which should be stored as csv files. Set path to the path of the directory containing these matrix files.

auth_interface

There are three auth interfaces provided by Protari.

NoAuthInterface

NoAuthInterface is the default interface, used if none is provided in the settings file.

It treats all non-empty Authorization headers and key URL parameters as invalid.

DatabaseAuthInterface

This is a simple database key lookup, with reference protari.auth_interface.db_auth.DatabaseAuthInterface. To use this, specify the database connection string in url, using the same approach as for the SQL interface.

In addition, you can optionally specify further parameters such as qualifier_mapping and a permissions_post_processor, as shown in the example settings file at the top of this page.

The API user supplies an Authorization header with Protari key= preceding the key, eg.

curl -X GET --header 'Accept: application/json' --header 'Authorization: Protari key=abc123' 'http://localhost:8080/v1/datasets/'

Currently, you can alternatively provide the auth key using the key query parameter, eg. ?key=abc123, but this is deprecated and may be removed in a future version.
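
For example, the deprecated query-parameter form of the request above would be:

curl -X GET --header 'Accept: application/json' 'http://localhost:8080/v1/datasets/?key=abc123'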

Note that Protari does not provide any mechanisms for generating, maintaining or updating auth keys.

It is not recommended to use this method in production.

JWTAuthInterface

This verifies a signed JSON web token (JWT), using the OpenID Connect standard. The auth interface has reference protari.auth_interface.jwt_auth.JWTAuthInterface.

You should specify parameters including jwks_url (or a json_web_key_set), issuer, audience, permissions_name and algorithms; see the example below.

You should configure your OpenID Connect identity provider to return permissions in the claim with the name given in permissions_name. These permissions should be a list, with each element being either:

  1. A colon-separated string, eg. <op>:<dataset>:<limit>, with the meaning of each part given by permission_string_keys (see the example below). Eg. *:secret:50.
  2. An object with op and dataset keys at least, and optional keyword arguments for the permissions post-processor, eg. {"op": "*", "dataset": "secret", "limit": 50}.

In each case, op gives the query class (eg. aggregation), or * to match all query classes, or an empty string "" to match only metadata queries.
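
For illustration, the decoded payload of a token might then contain (a hypothetical example, using the permissions_name from the settings example below, and showing both the string and object forms):

    {
      "permissions.protari.d61.io": [
        "aggregation:census:100",
        {"op": "*", "dataset": "public"}
      ]
    }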

A sample value for json_web_key_set would be (with the <...>s replaced with long strings):

    json_web_key_set:
      keys:
        - alg: RS256
          kty: RSA
          use: sig
          n: <...>
          e: AQAB
          kid: <...>

Here is an example that uses a jwks_url, a permissions post-processor to limit the number of queries each user can ever make, and a permission_string_keys setting so that these limits can be given in the token in the string format <op>:<dataset>:<limit>:

auth_interface:
  reference: protari.auth_interface.jwt_auth.JWTAuthInterface
  parameters:
    jwks_url: https://example.com/jwks
    issuer: https://example.com
    audience: https://protari.d61.io
    permissions_name: permissions.protari.d61.io
    additional_names:
    - email.protari.d61.io
    algorithms:
    - RS256
    permission_string_keys:
    - op
    - dataset
    - limit
    permissions_post_processors:
      - reference: protari_api.sql_query_limiter.SqlQueryLimiter
        parameters:
          url: postgresql:///protari_demo
          table_name: query_limit

For more details, see the JWT standard (RFC 7519).

SqlQueryLimiter

The SqlQueryLimiter permissions post-processor can take the following parameters:

    permissions_post_processors:
      - reference: protari_api.sql_query_limiter.SqlQueryLimiter
        parameters:
          url: postgresql:///protari_demo
          exceeded_limit_message: You have reached the maximum number of queries allowed on this dataset  # the default
          column_name_mapping:  # the default is shown here; the values of each property are the database table columns.
            user_id: user
            query_class_name: query_class
            dataset_name: dataset
            value: value
          table_name: query_limit
          engine_parameters:  # extra parameters to send to the sql engine
            echo: false

The SQL query limiter requires a writable SQL database table to store query limit counts. The database connection string is provided as url above, and the name of the table as table_name. The values on the right hand side of column_name_mapping are the names of the relevant columns in the table – in the example above, they are user, query_class, dataset and value.
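
As a rough sketch, such a table might be created in Postgres as follows (the column types are assumptions; Protari's exact requirements are not specified here):

    CREATE TABLE query_limit (
        "user" VARCHAR NOT NULL,       -- "user" is a reserved word in Postgres, so it is quoted
        query_class VARCHAR NOT NULL,
        dataset VARCHAR NOT NULL,
        value INTEGER NOT NULL
    );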

The possible engine parameters are described in the SQLAlchemy documentation.

rate_limits

The API's aggregation-related endpoints can be rate limited per authenticated user, per dataset. Datasets that can be queried without authentication cannot be rate limited in this way, as they do not require users to identify themselves.

The format for this specification is:

rate_limits:
  storage_uri: "memory://"  # the default
  headers_enabled: false    # the default
  strategy: fixed-window    # the default
  limits:
    aggregation:
      public:
        - 100/second
      default:
        - 5/60 seconds
        - 8/2 minutes
      semicolon:
        - 3/60 seconds

Rate limiting is handled by the flask-limiter library – see its documentation for further information. In particular, redis (redis://host:port) or memcached (memcached://host:port) can be used to store usage data.
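
For example, to share usage data across several API processes via redis (the host and port are placeholders):

rate_limits:
  storage_uri: redis://localhost:6379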

The limits under limits.aggregation.public are applied to datasets that do not require permission to query. If no public limit is specified, Protari uses 1 million queries per second as the limit. This limit applies across all unauthenticated users (and per user to authenticated users).

All aggregation queries that require authentication are rate-limited by the list under default. The limit applies per user for each dataset, and applies across all the aggregation endpoints (ie. /aggregation, /aggregation/csv and /aggregation/sdmx-json). In addition, any query containing a semi-colon is further rate limited by those listed under semicolon.

Note: semicolon is an experimental option and may be refined in a future release.

includes

The example above finishes by including settings from other files, namely, sql_tbe_pipeline and sql_laplace_pipeline. If no path is provided, these are assumed to be yaml files located in Protari's protari/settings/includes/ directory.
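
For example (the second entry is hypothetical, and assumes that an explicit relative path is resolved from the settings file's location, like the other path settings):

includes:
- sql_tbe_pipeline           # found in protari/settings/includes/sql_tbe_pipeline.yaml
- ./my_custom_pipeline.yaml  # a hypothetical explicit path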

Using the Protari library without the API

The above description applies to settings files used by the Protari API. The underlying Protari library only recognizes query_class_definitions, transform_definitions and includes.