2014 Provenance Reconstruction Challenge (machine-generated)

The aim of the 2014 Provenance Reconstruction Challenge was to help spur research into the reconstruction of provenance by providing a common task and datasets for experimentation.

Challenge participants received an open data set and the corresponding provenance graphs (in W3C PROV format). They could then work with the data to try to reconstruct the provenance graphs from the open data set. The data consists of two distinct sets: one machine-generated and one human-generated. This makes it possible to evaluate reconstruction accuracy both for provenance that was collected automatically from observations and for provenance that was generated from information provided by humans and could not be captured automatically. For each dataset, we provide the raw data and the ground truth provenance serialized in PROV-O.

The machine-generated dataset is available at: [1].

The ground truth (groundtruth.ttl) for the first dataset was generated from a number of GitHub repositories using the Git2PROV tool. As raw data, it includes every version of each file that was ever present in the repository (including deleted files). However, the filenames are randomized, to simulate a scenario where all provenance was lost. Due to these randomized filenames, the timing metadata associated with the files may differ from the original. The correct timings can be found in the ground truth provenance (see the prov:atTime property of the qualified generations).

The main goal is to reconstruct the derivation graph of the original files, serialized as PROV-O. Participants were encouraged to make their generated provenance as complete as possible to obtain the best result. In particular, it is advised to elaborate on relations such as prov:wasDerivedFrom, prov:wasGeneratedBy, etc., by also providing their qualified forms, i.e., prov:qualifiedDerivation, prov:qualifiedGeneration, etc.
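As an illustration of what "also providing the qualified forms" can look like, the following minimal sketch builds Turtle snippets that state a derivation and a generation in both their unqualified and qualified PROV-O forms. The `ex:` namespace, the function names, and the sample identifiers are hypothetical and not part of the challenge data; only the `prov:` terms come from PROV-O.

```python
# Hypothetical prefixes; "ex:" is a placeholder namespace for this sketch.
PREFIXES = (
    "@prefix prov: <http://www.w3.org/ns/prov#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
    "@prefix ex: <http://example.org/> .\n"
)


def derivation_turtle(derived, source):
    """State that ex:<derived> was derived from ex:<source>,
    using both prov:wasDerivedFrom and prov:qualifiedDerivation."""
    return (
        f"ex:{derived} a prov:Entity ;\n"
        f"    prov:wasDerivedFrom ex:{source} ;\n"
        f"    prov:qualifiedDerivation [\n"
        f"        a prov:Derivation ;\n"
        f"        prov:entity ex:{source}\n"
        f"    ] .\n"
    )


def generation_turtle(entity, activity, at_time):
    """State that ex:<entity> was generated by ex:<activity> at a given time,
    using both prov:wasGeneratedBy and prov:qualifiedGeneration."""
    return (
        f"ex:{entity} prov:wasGeneratedBy ex:{activity} ;\n"
        f"    prov:qualifiedGeneration [\n"
        f"        a prov:Generation ;\n"
        f"        prov:activity ex:{activity} ;\n"
        f'        prov:atTime "{at_time}"^^xsd:dateTime\n'
        f"    ] .\n"
    )


if __name__ == "__main__":
    # Sample identifiers are made up for demonstration only.
    print(PREFIXES)
    print(derivation_turtle("file_b", "file_a"))
    print(generation_turtle("file_b", "commit_1", "2014-01-01T00:00:00Z"))
```

String templating keeps the sketch dependency-free; an RDF library would be a more robust choice for real submissions, since it handles escaping and serialization details.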

To execute an approach, any information embedded in the files or any external information may be used, except for the ground truth and the GitHub repositories themselves. For example, crawling repository hosting websites such as GitHub would not qualify as a valid approach. It is assumed that the timing information of the raw data has also been lost and needs to be reconstructed. However, if an approach relies heavily on correct timing information, the prov:atTime properties of the qualified generations in the ground truth can be used. Naturally, if this is the case, it needs to be explicitly mentioned when describing the results.
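To make "information embedded in the files" concrete, here is a minimal sketch of one possible baseline: ranking pairs of files by textual similarity as candidate derivations. This is an illustrative heuristic, not a method prescribed by the challenge; the function name, the threshold value, and the sample file contents are assumptions. It uses no timing information, so it cannot tell which file in a pair is the earlier version.

```python
from difflib import SequenceMatcher
from itertools import combinations


def candidate_derivations(contents, threshold=0.6):
    """Return (name_a, name_b, ratio) for every pair of files whose text
    similarity is at least `threshold`, most similar pairs first.

    `contents` maps a (randomized) file name to its text. The direction of
    the derivation is NOT determined here; that would need timing or other
    evidence. The threshold of 0.6 is an arbitrary assumption.
    """
    pairs = []
    for (a, text_a), (b, text_b) in combinations(sorted(contents.items()), 2):
        ratio = SequenceMatcher(None, text_a, text_b).ratio()
        if ratio >= threshold:
            pairs.append((a, b, ratio))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    # Made-up contents standing in for files with randomized names.
    files = {
        "x1": "def f():\n    return 1\n",
        "x2": "def f():\n    return 2\n",
        "x3": "completely unrelated text about cats",
    }
    for a, b, ratio in candidate_derivations(files):
        print(f"{a} <-> {b}: {ratio:.2f}")
```

Comparing all pairs is quadratic in the number of files, so a real approach on a large repository dump would likely need some form of pre-filtering.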

Results using the dataset should report, at a minimum, two types of evaluation criteria:

  • derivation recall: precision/recall of the prov:wasDerivedFrom relations;
  • overall recall: precision/recall of all provenance relations mentioned in the ground truth.

Also, when reporting results, it is advised to state clearly whether timing information was used.
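The two evaluation criteria above can be sketched as set comparisons between predicted and ground-truth relations. The sketch below assumes each relation is represented as a (subject, predicate, object) tuple and that a prediction counts as correct only on an exact match; the page does not specify the matching procedure, so that is an assumption, as are the function names and sample triples.

```python
DERIVED = "prov:wasDerivedFrom"


def precision_recall(predicted, truth):
    """Precision and recall of `predicted` against `truth`, both treated
    as sets of exact-match items. Empty inputs yield 0.0 by convention."""
    predicted, truth = set(predicted), set(truth)
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


def evaluate(predicted, truth):
    """Report both criteria: derivations only, and all relations."""
    pred_derivations = {t for t in predicted if t[1] == DERIVED}
    true_derivations = {t for t in truth if t[1] == DERIVED}
    return {
        "derivation": precision_recall(pred_derivations, true_derivations),
        "overall": precision_recall(predicted, truth),
    }


if __name__ == "__main__":
    # Made-up triples for demonstration only.
    truth = {
        ("b", DERIVED, "a"),
        ("c", DERIVED, "b"),
        ("b", "prov:wasGeneratedBy", "commit1"),
    }
    predicted = {
        ("b", DERIVED, "a"),
        ("c", DERIVED, "a"),
    }
    for name, (p, r) in evaluate(predicted, truth).items():
        print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

Qualified forms typically involve blank nodes, which exact tuple matching does not handle; comparing them would need an additional normalization step that is left out of this sketch.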