Datatractor Beam: Reference implementation of the Datatractor API
Documentation of Datatractor Beam, the reference implementation of the Datatractor API,
available at
.
Datatractor Beam contains beam, a draft Python 3.10 package, which can be used to:
query the registry of Extractors for extractors that support a given file type,
install those extractors in a fresh Python virtual environment via pip,
invoke the extractor either in Python or at the CLI, producing Python objects or files on disk.
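The "fresh virtual environment" step can be pictured with the standard library alone. The following is a minimal sketch of that general technique, not the code beam itself runs: the helper names create_fresh_env and pip_install_command are hypothetical, and the package name in the usage comment is a placeholder.

```python
import sys
import venv
from pathlib import Path


def create_fresh_env(env_dir, with_pip=True):
    """Create an isolated virtual environment and return the path to its Python.

    Hypothetical helper for illustration; beam's own implementation may differ.
    """
    # clear=True wipes any existing directory so the environment starts empty.
    venv.EnvBuilder(with_pip=with_pip, clear=True).create(env_dir)
    if sys.platform == "win32":
        return Path(env_dir) / "Scripts" / "python.exe"
    return Path(env_dir) / "bin" / "python"


def pip_install_command(env_python, requirement):
    """Build (but do not run) the pip command that installs into that environment."""
    # Calling pip through the environment's own interpreter keeps the
    # installation out of the caller's site-packages.
    return [str(env_python), "-m", "pip", "install", requirement]


# Example usage (the package name is a placeholder, not a real extractor):
#   env_python = create_fresh_env(Path("extractor-env"))
#   subprocess.run(pip_install_command(env_python, "some-extractor"), check=True)
```

Keeping each extractor in its own environment means its dependencies cannot conflict with those of the calling code, which is presumably why a fresh environment is used here.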
Installation
git clone git@github.com:datatractor/beam.git
cd beam
pip install .
Usage
Currently, you can use the extract function from the beam module inside your own Python code:
from beam import extract
# extract(<input_path>, <input_type>)
data = extract("./example.mpr", "biologic-mpr")
This example will install the first compatible biologic-mpr extractor it finds in the registry into a fresh virtualenv, and then execute it on the file at example.mpr.
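When several files of the same type need processing, the call can be wrapped in a loop. The sketch below is one possible pattern, not part of beam: extract_many is a hypothetical helper, and the extraction function is passed in as an argument (for instance beam's extract) so that the helper only assumes the two positional arguments used in the example above.

```python
def extract_many(paths, input_type, extract_fn):
    """Call extract_fn(path, input_type) for every path.

    Returns two dicts keyed by path: successful results and raised exceptions.
    Hypothetical helper; extract_fn is expected to behave like beam's extract.
    """
    results, errors = {}, {}
    for path in paths:
        key = str(path)
        try:
            results[key] = extract_fn(key, input_type)
        except Exception as exc:
            # Record the failure and carry on with the remaining files.
            errors[key] = exc
    return results, errors


# Example usage, assuming beam is installed:
#   from beam import extract
#   results, errors = extract_many(["a.mpr", "b.mpr"], "biologic-mpr", extract)
```

Collecting errors instead of stopping at the first one is a design choice that suits batch runs, where one unreadable file should not block the rest.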
By default, the extract function will attempt to use the extractor’s Python-based invocation (equivalent to passing the optional preferred_mode="python" argument). This means the extractor will be executed from within Python, and the returned data object will be a Python object as defined (and supported) by the extractor. This may require additional packages to be installed, for example pandas or xarray, both of which are supported via the installation command pip install .[formats] (used in place of the plain pip install . shown above). If you encounter the following traceback, a missing “format” (such as xarray here) is the likely reason:
Traceback (most recent call last):
[...]
data = pickle.loads(shm.buf)
ModuleNotFoundError: No module named 'xarray'
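A missing format can be detected before any extraction is attempted. The sketch below uses only importlib from the standard library; missing_modules is a hypothetical helper name, and the list of modules to check is an assumption based on the two packages named above.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be imported here.

    Hypothetical helper; it only looks the modules up, it does not import them.
    """
    return [name for name in names if find_spec(name) is None]


# Example usage, checking the two formats mentioned in the text:
#   missing = missing_modules(["pandas", "xarray"])
#   if missing:
#       print("Install the missing formats first:", ", ".join(missing))
```

The traceback suggests the result is unpickled in the calling process, so the check has to run in the environment that calls extract, not in the extractor's own virtualenv.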
Alternatively, if the preferred_mode="cli" argument is specified, the extractor will be executed using its command-line invocation. This means the output of the extractor will most likely be a file, which can be further specified using the output_path and output_type arguments:
from beam import extract
ret = extract("example.mpr", "biologic-mpr", output_path="output.nc", preferred_mode="cli")
In this case, ret will be empty bytes, and the output of the extractor should appear in the output.nc file.
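Because the return value carries no data in CLI mode, it can be useful to confirm that the output file was actually written. The following is a small sketch of such a check; cli_output_written is a hypothetical helper, and treating a zero-byte file as a failure is an assumption rather than documented beam behaviour.

```python
from pathlib import Path


def cli_output_written(ret, output_path):
    """Return True if ret is empty bytes and a non-empty output file exists.

    Hypothetical helper for checking the result of a CLI-mode extraction.
    """
    path = Path(output_path)
    return ret == b"" and path.is_file() and path.stat().st_size > 0


# Example usage, following the call shown above:
#   ret = extract("example.mpr", "biologic-mpr", output_path="output.nc", preferred_mode="cli")
#   if not cli_output_written(ret, "output.nc"):
#       raise RuntimeError("extractor did not produce output.nc")
```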