2014 Provenance Reconstruction Challenge (human-generated)

The aim of the 2014 Provenance Reconstruction Challenge was to help spur research into the reconstruction of provenance by providing a common task and datasets for experimentation.

Challenge participants received an open dataset and the corresponding provenance graphs (in W3C PROV format). They could then work with the data to try to reconstruct the provenance graphs from the open dataset. The data consists of two distinct sets: one machine-generated and one human-generated. This allows us to evaluate reconstruction accuracy both for provenance that was collected automatically from observations, and for provenance that was generated from information provided by humans and could not be captured automatically. For each dataset, we provide the raw data and the ground truth provenance serialized in PROV-O; a sketch of loading such a serialization follows.
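As a rough sketch of how the ground truth might be consumed, the snippet below loads a PROV-O serialization with Python's rdflib and lists all prov:hadPrimarySource pairs. The file name ground_truth.ttl and the Turtle serialization are assumptions; substitute whatever the actual download provides.

    from rdflib import Graph, Namespace

    PROV = Namespace("http://www.w3.org/ns/prov#")

    # Load the ground-truth PROV-O graph.
    # "ground_truth.ttl" and the Turtle format are assumptions;
    # use the actual file and serialization from the download.
    g = Graph()
    g.parse("ground_truth.ttl", format="turtle")

    # Each (article, source) link is one prov:hadPrimarySource triple.
    for article, source in g.subject_objects(PROV.hadPrimarySource):
        print(article, "->", source)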

The human-generated dataset is available for download at [1].

The ground truth for this dataset was created using the sources mentioned in news articles from WikiNews. The link between a news article and its sources is modeled using the prov:hadPrimarySource relation. The raw data consists of the entire HTML of the WikiNews articles, with the sources removed, and a list of source URIs (human_sources.txt). The task, in other words, is to match each source URI from this list to the correct WikiNews article, and so to reconstruct the derivation graph of the original files, serialized as PROV-O; a sketch of the expected output is shown below.
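As an illustration of the expected output, this minimal sketch builds a PROV-O graph with rdflib that links one WikiNews article to one reconstructed source via prov:hadPrimarySource. Both URIs are hypothetical placeholders, not entries from the dataset.

    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import RDF

    PROV = Namespace("http://www.w3.org/ns/prov#")

    g = Graph()
    g.bind("prov", PROV)

    # Hypothetical URIs, for illustration only.
    article = URIRef("https://en.wikinews.org/wiki/Example_article")
    source = URIRef("http://example.org/some_source")

    # Both resources are PROV entities; the article has the
    # source as its primary source.
    g.add((article, RDF.type, PROV.Entity))
    g.add((source, RDF.type, PROV.Entity))
    g.add((article, PROV.hadPrimarySource, source))

    print(g.serialize(format="turtle"))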

Approaches may use any information embedded in the files, as well as external information, as participants see fit, with the exception of the ground truth and WikiNews itself, for obvious reasons. Evaluations should report, at a minimum, the precision and recall of the reconstructed prov:hadPrimarySource relations; a sketch of such an evaluation follows.
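A minimal sketch of such an evaluation, assuming the reconstructed and ground-truth prov:hadPrimarySource relations have both been reduced to sets of (article, source) pairs; all pairs below are hypothetical:

    def precision_recall(predicted, ground_truth):
        # Precision and recall over sets of (article, source) pairs.
        true_positives = len(predicted & ground_truth)
        precision = true_positives / len(predicted) if predicted else 0.0
        recall = true_positives / len(ground_truth) if ground_truth else 0.0
        return precision, recall

    # Hypothetical example: two correct matches out of three predictions,
    # against a ground truth of four pairs.
    truth = {("article1", "srcA"), ("article1", "srcB"),
             ("article2", "srcC"), ("article3", "srcD")}
    predicted = {("article1", "srcA"), ("article2", "srcC"),
                 ("article2", "srcD")}
    p, r = precision_recall(predicted, truth)
    print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50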