pymerit

Important

This library is still under development. Please wait for the release before relying on it.

This library helps standardize the file metadata generated by various data applications so that downstream analytical and data management applications can be built on top of it.

The problem it solves is that, as the number of pipelines and applications grows, it becomes hard to know which dataset was produced by whom, when, where, and why. Even if this information is captured, it is of little value unless it is machine readable and standardized.

See documentation for interface details.

Why Pymerit

Applications instantiate metadata objects, add data elements, and eventually serialize them. There is a corresponding deserialization process.

The default approach would have been to use Data Packages, for both the structure and the processing libraries, but we were looking for something a bit different. We wanted a library that will:

  1. Isolate namespaces of projects, runs and instances - Many data applications will use this in various contexts. The metadata for all of them has to be systematically isolated.
  2. Support extensible application-specific context - Every application captures different bits of information depending on the need.
  3. Validate structure of metadata to ensure consumability - We need to enforce organization-specific standards to make sure that the metadata is discoverable.
  4. Capture dependencies - We need to be able to reconstruct the lineage for any data file. Dependencies could themselves be organization-specific and could possibly include policy documents.

The current structure has:

  • Namespace: A globally unique name for a set of files
  • Path: Path within the namespace
  • Name: Human readable name for this output
  • Description: Human readable description for this output
  • Contexts: Information about the data generation process including platform, application etc.
  • Resources: Information about artifacts generated such as files
  • Dependencies: Information about other metadata elements

Each context, resource, and dependency has its own schema allowing the user to define elements that meet their need.

The combination (namespace, path) is assumed to be unique across the entire organization.
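The uniqueness assumption can be illustrated with a minimal sketch in plain Python. It does not use pymerit: the `MetadataCatalog` class and its `register` and `lookup` methods are hypothetical names introduced only for this example, and the record is an ordinary dict carrying the fields listed above.

```python
# Hypothetical in-memory catalog; NOT part of pymerit.
class MetadataCatalog:
    def __init__(self):
        # Records are indexed by the (namespace, path) pair.
        self._records = {}

    def register(self, record):
        key = (record["namespace"], record["path"])
        if key in self._records:
            # Two records with the same pair would violate the assumption.
            raise ValueError("duplicate metadata key: {}/{}".format(*key))
        self._records[key] = record
        return key

    def lookup(self, namespace, path):
        # Returns None when no record has been registered for the pair.
        return self._records.get((namespace, path))


catalog = MetadataCatalog()
record = {
    "schema": "global:default:v1",
    "namespace": "dev.scribbledata.io",
    "path": "n=project-25/run-56",
    "name": "SimpleRun",
    "description": "Test generate simple metadata",
    "contexts": [],
    "resources": [],
    "dependencies": [],  # assumed key name, mirroring the list above
}
key = catalog.register(record)
```

Registering the same record a second time raises `ValueError`, which is one way an organization-wide index could enforce the assumption.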

Requirements

  • Python 3.5 or later, or PyPy 2.4.0 or later

Features

  • Simple - Minimal structure
  • Predictable - Metadata has a fixed and validated structure
  • Extensible - Add new context and resource types
  • Namespaces - For metadata files and schemas
  • Custom Load/Store - Override the process at an element level
  • A few built-in context types

Setup

(venv)$ pip3 install pymerit

Usage

API

$ python3
>>> import pymerit
>>> m = pymerit.MeritDefault()
>>> m.name="SimpleRun"
>>> m.description="Test generate simple metadata"
>>> m.namespace = "dev.scribbledata.io"
>>> m.path = "n=project-25/run-56"
>>> with open("metadata.json", "w") as fd:
...     fd.write(m.dumps())

$ cat metadata.json
{
    "schema": "global:default:v1",
    "namespace": "dev.scribbledata.io",
    "path": "n=project-25/run-56",
    "name": "SimpleRun",
    "description": "Test generate simple metadata",
    "contexts": [
        {
            "schema": "context:platform:v1",
            "name": "PlatformContext",
            "description": "Host on which the execution took place",
            "platform": "Linux-4.15.0-42-generic-x86_64-with-Ubuntu-16.04-xenial",
            "node": "whale",
            "python": "3.5.2"
        },
        {
            "schema": "context:process:v1",
            "name": "ProcessContext",
            "description": "Process generating this metadata",
            "cmdline": [
                ""
            ],
            "pid": 5529,
            "ppid": 14929
        }
    ],
    "resources": []
}
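Because the serialized form is plain JSON, a downstream consumer can inspect it with the standard library alone. The sketch below does not use pymerit's own deserialization (its interface is not shown here). The helper name `contexts_by_schema` is hypothetical, and the sample is a trimmed copy of the output above.

```python
import json

# Trimmed copy of the metadata shown above, embedded so the sketch is
# self-contained; a real consumer would read metadata.json instead.
SAMPLE = """
{
    "schema": "global:default:v1",
    "namespace": "dev.scribbledata.io",
    "path": "n=project-25/run-56",
    "name": "SimpleRun",
    "contexts": [
        {"schema": "context:platform:v1", "name": "PlatformContext", "node": "whale"},
        {"schema": "context:process:v1", "name": "ProcessContext", "pid": 5529}
    ],
    "resources": []
}
"""


def contexts_by_schema(text):
    """Index the contexts of a serialized metadata document by schema id."""
    doc = json.loads(text)
    return {ctx["schema"]: ctx for ctx in doc.get("contexts", [])}


contexts = contexts_by_schema(SAMPLE)
platform_node = contexts["context:platform:v1"]["node"]
process_pid = contexts["context:process:v1"]["pid"]
```

Indexing by the `schema` field lets a consumer pick out the context types it understands and ignore application-specific ones it does not.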

>>> import hashlib
>>> r = pymerit.MeritResourceFile()
>>> r.name = "runlog"
>>> r.description = "Run log from execution"
>>> r.path = ".../log.json"
>>> attributes = {
...   'sha256sum': hashlib.sha256(open(r.path,'rb').read()).hexdigest()
... }
>>> r.attributes = attributes
>>> print(r.dumps())
...
   "resources": [
      {
          "schema": "resource:filebase:v1",
          "name": "runlog",
          "description": "Run log from execution",
          "path": ".../log.json",
          "attributes": {
              "sha256sum": "185f8db32271fe25f561a6fc938b2e264306ec304eda518007d1764826381969"
          }
      }
  ]
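The checksum above reads the whole file into memory in one call. For large artifacts, a chunked variant may be preferable. The following is a minimal sketch using only the standard library: the helper name `sha256sum` and the chunk size are assumptions made for this example, not part of pymerit.

```python
import hashlib
import os
import tempfile


def sha256sum(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fd:
        # iter() with a sentinel keeps calling fd.read() until it returns b"".
        for chunk in iter(lambda: fd.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file; a real run would pass the log path.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example log contents\n")
demo_path = tmp.name
attributes = {"sha256sum": sha256sum(demo_path)}
os.remove(demo_path)
```

The resulting dict has the same shape as the `attributes` value assigned to the resource above, and the `with` block closes the file handle promptly.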

CLI

# List the supported schemas
$ merit schema list
+---------------------+----------------------+---------------------------------+
|       Schema        |        Class         |             Module              |
+=====================+======================+=================================+
| context:base:v1     | MeritContextBase     | ....python/lib/.../pymerit/pyme |
|                     |                      | rit/base.py                     |
+---------------------+----------------------+---------------------------------+
| resource:base:v1    | MeritResourceBase    | ....python/lib/.../pymerit/pyme |
|                     |                      | rit/base.py                     |
+---------------------+----------------------+---------------------------------+
| global:base:v1      | MeritGlobalBase      | ....python/lib/.../pymerit/pyme |
|                     |                      | rit/base.py                     |
+---------------------+----------------------+---------------------------------+
| context:platform:v1 | MeritContextPlatform | ....python/lib/.../pymerit/pyme |
|                     |                      | rit/contrib.py                  |
+---------------------+----------------------+---------------------------------+
| context:process:v1  | MeritContextProcess  | ....python/lib/.../pymerit/pyme |
|                     |                      | rit/contrib.py                  |
+---------------------+----------------------+---------------------------------+
| global:default:v1   | MeritDefault         | ....python/lib/.../pymerit/pyme |
|                     |                      | rit/contrib.py                  |
+---------------------+----------------------+---------------------------------+

$ merit metadata show metadata.json
+-------------+--------------------------------------------------------------+
|  Dimension  |                           Summary                            |
+=============+==============================================================+
| schema      | global:default:v1                                            |
+-------------+--------------------------------------------------------------+
| namespace   | dev.scribbledata.io                                          |
+-------------+--------------------------------------------------------------+
| path        | n=project-25/run-56                                          |
+-------------+--------------------------------------------------------------+
| name        | SimpleRun                                                    |
+-------------+--------------------------------------------------------------+
| description | Test generate simple metadata                                |
+-------------+--------------------------------------------------------------+
| contexts    | +-------------+--------------------------------------------+ |
|             | |  Dimension  |                  Summary                   | |
|             | +=============+============================================+ |
|             | | schema      | context:platform:v1                        | |
|             | +-------------+--------------------------------------------+ |
|             | | name        | PlatformContext                            | |
|             | +-------------+--------------------------------------------+ |
|             | | description | Host on which the execution took place     | |
|             | +-------------+--------------------------------------------+ |
|             | | node        | whale                                      | |
|             | +-------------+--------------------------------------------+ |
|             | | platform    | Linux-4.15.0-42-generic-x86_64-with-       | |
|             | |             | Ubuntu-16.04-xenial                        | |
|             | +-------------+--------------------------------------------+ |
|             | | python      | 3.5.2                                      | |
|             | +-------------+--------------------------------------------+ |
|             | +-------------+----------------------------------+           |
|             | |  Dimension  |             Summary              |           |
|             | +=============+==================================+           |
|             | | schema      | context:process:v1               |           |
|             | +-------------+----------------------------------+           |
|             | | name        | ProcessContext                   |           |
|             | +-------------+----------------------------------+           |
|             | | description | Process generating this metadata |           |
|             | +-------------+----------------------------------+           |
|             | | cmdline     |                                  |           |
|             | |             |                                  |           |
|             | +-------------+----------------------------------+           |
|             | | pid         | 5529                             |           |
|             | +-------------+----------------------------------+           |
|             | | ppid        | 14929                            |           |
|             | +-------------+----------------------------------+           |
|             |                                                              |
+-------------+--------------------------------------------------------------+
| resources   | +-------------+--------------------------------------------+ |
|             | |  Dimension  |                  Summary                   | |
|             | +=============+============================================+ |
|             | | schema      | resource:filebase:v1                       | |
|             | +-------------+--------------------------------------------+ |
|             | | name        | runlog                                     | |
|             | +-------------+--------------------------------------------+ |
|             | | description | Run log from execution                     | |
|             | +-------------+--------------------------------------------+ |
|             | | path        | ..../log.json                              | |
|             | +-------------+--------------------------------------------+ |
|             | | attributes  | {'sha256sum': '185f8db32271fe25f561a6fc938 | |
|             | |             | b2e264306ec304eda518007d1764826381969'}    | |
|             | +-------------+--------------------------------------------+ |
|             |                                                              |
+-------------+--------------------------------------------------------------+