Datatractor Beam: Reference implementation of the Datatractor API

Documentation of beam (repo), the reference implementation of the Datatractor API, available at yard (site).

Datatractor Beam contains beam, a draft Python 3.10 package, which can be used to:

  • query the registry of Extractors for extractors that support a given file type,

  • install those extractors in a fresh Python virtual environment via pip,

  • invoke the extractor either in Python or at the CLI, producing Python objects or files on disk.

Installation

git clone git@github.com:datatractor/beam.git
cd beam
pip install .

Usage

Currently, you can use the extract function from the beam module inside your own Python code:

from beam import extract

# extract(<input_path>, <input_type>)
data = extract("./example.mpr", "biologic-mpr")

This example will install the first compatible biologic-mpr extractor it finds in the registry into a fresh virtualenv, and then execute it on the file at example.mpr.
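The "fresh virtualenv plus pip" step described above can be illustrated with the standard library alone. This is a minimal sketch and not beam's actual implementation: the helper names venv_python, pip_install_command and install_in_fresh_env are hypothetical, and the requirement string would in practice come from the extractor's registry entry.

```python
import subprocess
import sys
import venv
from pathlib import Path


def venv_python(env_dir):
    """Return the path of the Python interpreter inside a virtual environment."""
    if sys.platform == "win32":
        return Path(env_dir) / "Scripts" / "python.exe"
    return Path(env_dir) / "bin" / "python"


def pip_install_command(env_dir, requirement):
    """Build the command that installs `requirement` into the environment."""
    return [str(venv_python(env_dir)), "-m", "pip", "install", requirement]


def install_in_fresh_env(env_dir, requirement):
    """Create a new virtual environment and install one requirement into it.

    Hypothetical helper for illustration; beam may do this differently.
    """
    venv.create(env_dir, with_pip=True)
    subprocess.run(pip_install_command(env_dir, requirement), check=True)
```

Calling the environment's own interpreter with -m pip, rather than a bare pip, makes sure the package lands in the new environment and not in the one running the script.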

By default, the extract function will attempt to use the extractor’s Python-based invocation (equivalent to passing the optional preferred_mode="python" argument). This means the extractor will be executed from within Python, and the returned data object will be a Python object as defined (and supported) by the extractor. This may require additional packages to be installed, for example pandas or xarray; both are covered by the formats extra, which can be installed by extending the installation command above to pip install .[formats]. If you encounter the following traceback, a missing “format” (such as xarray here) is the likely reason:

Traceback (most recent call last):
    [...]
    data = pickle.loads(shm.buf)
ModuleNotFoundError: No module named 'xarray'
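To find out ahead of time whether such a package is missing, the current environment can be checked with the standard library. The sketch below is not part of beam: missing_formats is a hypothetical helper, and the package list only contains the two names mentioned above, which may not match the full contents of the formats extra.

```python
import importlib.util

# Names taken from the text above; the real contents of the "formats"
# extra are an assumption.
OPTIONAL_FORMATS = ("pandas", "xarray")


def missing_formats(packages=OPTIONAL_FORMATS):
    """Return the names of packages that cannot be imported here."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_formats()
    if missing:
        print("Missing optional packages:", ", ".join(missing))
    else:
        print("All optional format packages are importable.")
```

An empty list means that every listed package can be imported, so a ModuleNotFoundError like the one above would then point to a different missing dependency.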

Alternatively, if the preferred_mode="cli" argument is specified, the extractor will be executed using its command-line invocation. This means the output of the extractor will most likely be a file, whose location can be specified using the output_path argument:

from beam import extract
ret = extract("example.mpr", "biologic-mpr", output_path="output.nc", preferred_mode="cli")

In this case, ret will be empty bytes, and the output of the extractor should appear in the output.nc file.
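Because the return value carries no data in CLI mode, it can be useful to confirm that the output file was actually written. The following is a minimal sketch: extract_to_file is a hypothetical wrapper, and the extract function is passed in as an argument (normally beam.extract) so that the helper can be tried out without beam installed. It assumes only the keyword arguments shown in the example above.

```python
from pathlib import Path


def extract_to_file(extract_func, input_path, input_type, output_path):
    """Run an extractor in CLI mode and confirm the output file exists.

    `extract_func` is expected to behave like beam's `extract`; this
    wrapper is illustrative and not part of beam.
    """
    extract_func(
        input_path,
        input_type,
        output_path=output_path,
        preferred_mode="cli",
    )
    out = Path(output_path)
    if not out.is_file():
        raise FileNotFoundError(f"Extractor did not produce {out}")
    return out
```

With beam installed, this could be called as extract_to_file(extract, "example.mpr", "biologic-mpr", "output.nc").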