Internal data format


Internally, datasets have the bare minimum of information needed for successful linking. Most information from ecospold2 files is not read, as it is not needed during Ocelot runs and would needlessly consume resources to manage.

Ocelot uses the voluptuous validation library to make sure extracted datasets are formatted the way that Ocelot expects. The schema is restrictive: only the listed keys are allowed, and each value must match the type or set of values given for its key.


Here is the validation schema for an activity dataset:

dataset_schema = Schema({
    "exchanges": [Any(elementary_exchange_schema, activity_exchange_schema)],
    "parameters": [valid_parameter],
    # ecospold2 field 3550: accessRestrictedTo.
    # Currently ecoinvent releases do not use this field.
    'access restricted': valid_access_restriction,
    'economic scenario': str, # ecospold2 field 700: macroEconomicScenarioId
    'end date': str, # Starting and ending dates for dataset validity, in format '2015-12-31'
    'filepath': str,
    'id': str,  # Imported UUID. May not be unique due to allocation.
    # Guaranteed unique hash code based on dataset attributes like name, location, type, etc.
    Optional("code"): str,
    'location': str, # ecospold2 field 410: shortname
    'name': str, # ecospold2 field 100: activityName
    'start date': str,
    'technology level': valid_technology_levels, # ecospold2 field 500
    # The activity types used in ecoinvent are:
    # "transforming activity", "market activity", and "market group"
    'type': valid_activity_types, # ecospold2 field 115: specialActivityType
    'dataset author': str,
    'data entry': str,
    'ISIC classification': str,
    # Allocation method used. Added by a transformation function, should be
    # removed after allocation.
    Optional('allocation method'): valid_allocation_method,
    # Name of the reference product. Added by a transformation function.
    Optional('reference product'): str,
    # Temporary data - references to technosphere exchanges which supply a market
    Optional('suppliers'): list,
}, required=True)

Technosphere exchanges (activity_exchange_schema)

activity_exchange_schema = Schema({
    'amount': float, # ecospold2 field 1020: amount
    'id': str, # ecospold2 field 1005: id
    'name': str, # ecospold2 field 1000: flow name
    'tag': 'intermediateExchange',
    'type': Any('from technosphere', 'reference product', 'byproduct', 'dropped product'),
    'unit': str, # ecospold2 field 1035: unitName
    Optional('activity link'): str, # ecospold2 field 1520: activityLinkId
    Optional('byproduct classification'): valid_byproducts, # ecospold2 field 310: classificationValue, if classificationSystem is 'By-product classification'.
    Optional('conditional exchange'): bool,
    Optional('formula'): str, # ecospold2 field 1060: mathematicalRelation
    Optional('production volume'): valid_production_volume, # Only when needed for multioutput
    Optional('properties'): [valid_property],
    Optional('uncertainty'): valid_uncertainty,
    Optional('variable'): str, # ecospold2_meta field 1040: variableName
}, required=True)

Biosphere exchanges (elementary_exchange_schema)

elementary_exchange_schema = Schema({
    'amount': float, # ecospold2 field 1020: amount
    'compartment': str,
    'id': str, # ecospold2 field 1005: id
    'name': str, # ecospold2 field 1000: flow name
    'subcompartment': str,
    'tag': 'elementaryExchange',
    'type': Any('from environment', 'to environment'),
    'unit': str, # ecospold2 field 1035: unitName
    Optional('formula'): str, # ecospold2 field 1060: mathematicalRelation
    Optional('properties'): [valid_property],
    Optional('uncertainty'): valid_uncertainty,
    Optional('variable'): str, # ecospold2_meta field 1040: variableName
}, required=True)
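Because the "exchanges" list of a dataset mixes both exchange kinds, the 'tag' field is a convenient way to tell them apart. The following is a small sketch, not part of Ocelot itself; the helper name `split_exchanges` and the sample data are made up for illustration.

```python
def split_exchanges(dataset):
    """Return (technosphere, biosphere) exchange lists for one dataset."""
    technosphere = [exc for exc in dataset['exchanges']
                    if exc['tag'] == 'intermediateExchange']
    biosphere = [exc for exc in dataset['exchanges']
                 if exc['tag'] == 'elementaryExchange']
    return technosphere, biosphere


# Illustrative data only; names, amounts, and ids are invented
dataset = {'exchanges': [
    {'tag': 'intermediateExchange', 'type': 'reference product',
     'name': 'electricity', 'amount': 1.0, 'unit': 'kWh', 'id': 'a'},
    {'tag': 'elementaryExchange', 'type': 'to environment',
     'name': 'Carbon dioxide', 'amount': 0.5, 'unit': 'kg', 'id': 'b',
     'compartment': 'air', 'subcompartment': 'unspecified'},
]}
technosphere, biosphere = split_exchanges(dataset)
```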


valid_parameter = Schema({
    "unit": str,
    'amount': float, # ecospold2 field 1710: amount
    'id': str,
    'name': str, # ecospold2 field 1700: name
    Optional('formula'): str, # ecospold2 field 1720: mathematicalRelation
    Optional('uncertainty'): valid_uncertainty,
    Optional('variable'): str, # ecospold2 field 1715: variableName
}, required=True)


valid_property = Schema({
    'amount': float, # ecospold2 field 2330: amount
    'id': str, # ecospold2 field 2300: propertyId
    'name': str,
    'unit': str, # ecospold2 field 2324: unitName
    Optional('formula'): str, # ecospold2 field 2340: mathematicalRelation
    Optional('uncertainty'): valid_uncertainty,
    Optional('variable'): str, # ecospold2 field 2350: variableName
}, required=True)

Production volume

valid_production_volume = Schema({
    'amount': float, # ecospold2 field 1530: productionVolumeAmount
    Optional('formula'): str, # ecospold2 field 1534: productionVolumeMathematicalRelation
    Optional('uncertainty'): valid_uncertainty,  # ecospold2 field 1539: productionVolumeUncertainty
    Optional('variable'): str, # ecospold2 field 1532: productionVolumeVariableName
}, required=True)


Some fields can only take certain values. The activity dataset, for example, refers to valid_activity_types and valid_access_restriction. Here are the lists of possible values used in the format definition:

valid_access_restriction = Any('public', 'licensees', 'results only', 'restricted')

valid_activity_types = Any("transforming activity", "market activity", "market group",
                           "IO activity", "residual activity", "production mix",
                           "import activity", "supply mix", "export activity",
                           "re-export activity", "correction activity")

valid_allocation_method = Any(
    'combined production with byproduct',
    'combined production without byproduct',
    'constrained market',
    'economic allocation',
    'no allocation',
    'recycling activity',
    'true value allocation',
    'waste treatment',
)

valid_byproducts = Any('allocatable product', 'waste', 'recyclable')

valid_technology_levels = Any("undefined", "new", "modern",
                              "current", "old", "outdated")


Eight uncertainty distributions can be extracted in Ocelot, though some, such as the gamma and binomial, are not currently used in ecoinvent and are therefore not currently supported. An uncertainty distribution can be any of the following:

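valid_uncertainty = Any(
    valid_lognormal, valid_normal, valid_uniform, valid_triangular,
    valid_binomial, valid_beta, valid_gamma, valid_undefined,
)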

The uncertainty distributions themselves have distribution-specific fields:

valid_lognormal = Schema({
    'mean': float,
    'pedigree matrix': valid_pedigree_matrix,
    'type': 'lognormal',
    'variance with pedigree uncertainty': float,
    Optional('mu'): float,  # Somehow this is optional (/missing) in some ecospold2 datasets
    Optional('variance'): float,
}, required=True)

valid_normal = Schema({
    'mean': float,
    'pedigree matrix': valid_pedigree_matrix,
    'type': 'normal',
    'variance with pedigree uncertainty': float,
    Optional('variance'): float,
}, required=True)

valid_uniform = Schema({
    'maximum': float,
    'minimum': float,
    'pedigree matrix': valid_pedigree_matrix,
    'type': 'uniform',
}, required=True)

valid_triangular = Schema({
    'maximum': float,
    'minimum': float,
    'mode': float,
    'pedigree matrix': valid_pedigree_matrix,
    'type': 'triangular',
}, required=True)

valid_binomial = Schema({
    'n': float,
    'p': float,
    'pedigree matrix': valid_pedigree_matrix,
    'type': 'binomial',
}, required=True)

valid_beta = Schema({
    'maximum': float,
    'minimum': float,
    'mode': float,
    'pedigree matrix': valid_pedigree_matrix,
    'type': 'beta',
}, required=True)

valid_gamma = Schema({
    'pedigree matrix': valid_pedigree_matrix,
    'scale': float,
    'shape': float,
    'type': 'gamma',
}, required=True)

valid_undefined = Schema({
    'maximum': float,
    'minimum': float,
    'pedigree matrix': valid_pedigree_matrix,
    'standard deviation 95%': float,
    'type': 'undefined',
}, required=True)
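The distribution-specific required fields can be summarized in a lookup table keyed on 'type'. The sketch below is only a reading aid transcribed from the schemas above; Ocelot itself validates with voluptuous, and the names `REQUIRED_FIELDS`, `COMMON_FIELDS`, and `missing_fields` are hypothetical.

```python
# Required keys per distribution, transcribed from the schemas above
REQUIRED_FIELDS = {
    'lognormal': {'mean', 'variance with pedigree uncertainty'},
    'normal': {'mean', 'variance with pedigree uncertainty'},
    'uniform': {'minimum', 'maximum'},
    'triangular': {'minimum', 'maximum', 'mode'},
    'binomial': {'n', 'p'},
    'beta': {'minimum', 'maximum', 'mode'},
    'gamma': {'scale', 'shape'},
    'undefined': {'minimum', 'maximum', 'standard deviation 95%'},
}
# Keys required by every distribution
COMMON_FIELDS = {'type', 'pedigree matrix'}


def missing_fields(uncertainty):
    """Return the sorted required keys absent from an uncertainty dict."""
    required = COMMON_FIELDS | REQUIRED_FIELDS[uncertainty['type']]
    return sorted(required - set(uncertainty))


# Illustrative data only: a triangular distribution without its mode
example = {'type': 'triangular', 'minimum': 0.5, 'maximum': 2.0,
           'pedigree matrix': {}}
missing = missing_fields(example)
```

For this example, `missing` contains only 'mode'.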

The pedigree matrix is a dictionary:

valid_pedigree_matrix = Any(
    {
        'reliability': int,
        'completeness': int,
        'temporal correlation': int,
        'geographical correlation': int,
        'further technology correlation': int,
    },
    {}  # Empty dictionary is also allowed
)

Logging format

The Logging class will generate the following types of messages. Each message is JSON-encoded and written on a separate line.
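Since each line is an independent JSON document, a log can be read back with the standard library alone. The following is a minimal sketch; the helper name `read_messages` and the sample log lines are invented for illustration.

```python
import json
from collections import Counter


def read_messages(lines):
    """Yield one decoded message per non-empty log line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Illustrative log content; the values are made up
log_lines = [
    '{"type": "report start", "time": 1000.0, "count": 2, "uuid": "abc123"}',
    '{"type": "report end", "time": 1010.0}',
]
messages = list(read_messages(log_lines))
type_counts = Counter(msg['type'] for msg in messages)
```

An open file object can be passed in place of `log_lines`, because iterating over a text file also yields one line at a time.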

Report start

    type: 'report start',
    time: time at report start, UNIX timestamp,
    count: int, number of raw datasets,
    uuid: UUID of current report, hex-encoded

Report end

    type: 'report end',
    time: time at report end, UNIX timestamp

Function start

    type: 'function start',
    time: time at function start, UNIX timestamp,
    count: current number of datasets,
    index: int, function index,
    name: name of function,
    description: description of function from function docstring,
    table: list of columns to be formatted into a table, or null

Function end

    type: 'function end',
    time: time at function end, UNIX timestamp,
    count: current number of datasets,
    index: int, function index,
    name: name of function,
    description: description of function from function docstring,
    table: list of columns to be formatted into a table, or null
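Start and end messages share the same index, so they can be paired to see how long each function ran and how the number of datasets changed. This sketch assumes that function end messages carry the type 'function end'; the helper name `summarize_functions` and the sample messages are hypothetical.

```python
def summarize_functions(messages):
    """Pair start and end messages by index; return one summary per function."""
    starts = {}
    summaries = []
    for msg in messages:
        if msg['type'] == 'function start':
            starts[msg['index']] = msg
        # Assumption: end messages use the type 'function end'
        elif msg['type'] == 'function end' and msg['index'] in starts:
            start = starts.pop(msg['index'])
            summaries.append({
                'name': msg['name'],
                'seconds': msg['time'] - start['time'],
                'dataset change': msg['count'] - start['count'],
            })
    return summaries


# Illustrative messages; the values are made up
messages = [
    {'type': 'function start', 'time': 100.0, 'count': 10, 'index': 0,
     'name': 'example_function', 'description': '', 'table': None},
    {'type': 'function end', 'time': 102.5, 'count': 12, 'index': 0,
     'name': 'example_function', 'description': '', 'table': None},
]
summaries = summarize_functions(messages)
```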

Function data

Functions can also write log messages about individual changes. These messages have no particular format, but messages that provide data to be formatted into a table later look like:

    type: 'table element',
    data: list of data elements in same order as columns

If the logging information is better represented as a list, each message looks like:

    type: 'list element',
    data: HTML string
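Table rows can be reassembled by combining 'table element' messages with the column list from a function start message. The sketch below assumes that a function's table elements are logged between its start and end messages, which the format description does not state explicitly. The helper name `collect_tables` and the sample messages are hypothetical.

```python
def collect_tables(messages):
    """Group 'table element' rows under the function that logged them."""
    tables = {}
    current = None  # Name of the function whose table is being filled
    for msg in messages:
        if msg['type'] == 'function start':
            current = msg['name'] if msg.get('table') else None
            if current:
                tables[current] = {'columns': msg['table'], 'rows': []}
        elif msg['type'] == 'table element' and current:
            tables[current]['rows'].append(msg['data'])
        elif msg['type'].endswith('end'):
            # Any end message closes the current table
            current = None
    return tables


# Illustrative messages, reduced to the fields this sketch needs
messages = [
    {'type': 'function start', 'name': 'example_function',
     'table': ['Dataset', 'Exchanges dropped']},
    {'type': 'table element', 'data': ['dataset A', 3]},
    {'type': 'table element', 'data': ['dataset B', 1]},
    {'type': 'function end', 'name': 'example_function'},
]
tables = collect_tables(messages)
```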