Improved provenance tracking mechanism for uplifting
In !32 (merged) I introduced a generic mechanism to track the provenance of computations for different datatypes, along with a simple toy example for tracking int computations (see here).
Main Features:
- the provenance trace is pydantic-enabled and supports (de-)serialization
- the decorator does not fix the location of the provenance-tracked argument; it is controlled at the usage site
- the decorator takes care of cloning the input value and adding the computation result to the prov tracking context
- decorated functions can be used normally just as before if tracking is not needed
- tracking is accessed via a new `.with_prov` method that the decorator attaches to wrapped objects
- tracking functions are compositional (prov-tracked computations can be nested)
- patch creation can be disabled (e.g. for performance) in a tracking-enabled computation chain by setting a flag; no code needs to change
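A minimal sketch of how this could look (all identifiers here, `ProvEntry`, `Tracked`, `track_prov`, are illustrative rather than the actual names from !32, and the sketch fixes the tracked argument to the first position for brevity, whereas the real decorator leaves that to the usage site):

```python
import copy
from typing import Any, Callable, List, Optional, TypeVar

from pydantic import BaseModel

T = TypeVar("T")


class ProvEntry(BaseModel):
    """One step of the provenance trace; pydantic gives us (de-)serialization."""

    function: str
    inputs: List[Any]
    output: Any


class Tracked:
    """Wrapper holding a value together with its accumulated provenance trace."""

    def __init__(self, value: Any, trace: Optional[List[ProvEntry]] = None):
        self.value = value
        self.trace = trace if trace is not None else []


def track_prov(func: Callable[..., T]) -> Callable[..., T]:
    """Attach a `.with_prov` entry point to `func`; the plain call stays unchanged."""

    def with_prov(tracked: Tracked, *args: Any, record: bool = True, **kwargs: Any) -> Tracked:
        # clone the input so the original value is never mutated
        cloned = copy.deepcopy(tracked.value)
        result = func(cloned, *args, **kwargs)
        trace = list(tracked.trace)
        if record:  # entry/patch creation can be switched off, e.g. for performance
            trace.append(
                ProvEntry(function=func.__name__, inputs=[tracked.value, *args], output=result)
            )
        return Tracked(result, trace)

    func.with_prov = with_prov  # tracking is opt-in via the attached method
    return func


@track_prov
def double(x: int) -> int:
    return x * 2


assert double(3) == 6                   # normal, untracked usage
tracked = double.with_prov(Tracked(3))  # tracked usage
tracked = double.with_prov(tracked)     # compositional: the trace now has two entries
```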
Further Generalization Ideas:
Should the need arise, this could be extended further into a more powerful provenance tracking mechanism.
- each tracked entity could get a UUID (generated on creation, restored from deserialization)
- tracked entities could keep references to other tracked entities that were passed into a decorated function
- the decorator would allow passing multiple tracked entities and would add an entry to a subset of them, cross-referencing each other's ids (unwrapped arguments could either be ignored or be auto-wrapped in a one-use entity); see the sketch below
But this might be overengineering at this point; before doing it, existing lightweight Python provenance tracking solutions (such as the FZJ alpaca tool?) should probably be evaluated.
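Purely as an illustration of the cross-referencing idea above (nothing here is implemented and the model names are made up), the data side could look roughly like this:

```python
import uuid
from typing import List

from pydantic import BaseModel, Field


class LinkedProvEntry(BaseModel):
    """Trace entry that cross-references the ids of the other tracked inputs."""

    function: str
    input_ids: List[uuid.UUID]
    output_id: uuid.UUID


class TrackedEntity(BaseModel):
    """Entity with a UUID generated on creation and restored on deserialization."""

    id: uuid.UUID = Field(default_factory=uuid.uuid4)
    trace: List[LinkedProvEntry] = Field(default_factory=list)
```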
TODO:
- implement a subclass for `rdflib.Graph` (based on the current `rdfpatch.py` implementation, should be straightforward)
- substitute the old decorator in the codebase with the new one and adapt the respective usage locations
- add tests for the new decorator (generic stuff and graph-specific)
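For the first TODO item, a rough sketch of what such a subclass could look like (the class name and the patch representation are placeholders; the real version would reuse the existing `rdfpatch.py` logic):

```python
from rdflib import Graph


class PatchRecordingGraph(Graph):
    """Graph that remembers added/removed triples so they can be turned into a patch."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.added = []    # triples added since creation
        self.removed = []  # triples removed since creation

    def add(self, triple):
        self.added.append(triple)
        return super().add(triple)

    def remove(self, triple):
        self.removed.append(triple)
        return super().remove(triple)
```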