Human-generated dataset similar to 2014 Provenance Reconstruction Challenge

From Provenance Reconstruction Wiki

We created an additional human-generated dataset similar to those provided at the 2014 Provenance Reconstruction Challenge. The dataset was built by scraping 20 randomly selected news articles from the Wikinews website [1]. The articles are stored as HTML files in a single folder. In addition, the dataset contains the following:

  • a list of the URLs of the source articles
  • a ground truth file (in Turtle notation)
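
As a rough illustration of working with this layout, the following Python sketch lists the downloaded HTML files and reads the URL list. The names `summarize_dataset`, `articles` and `urls.txt` are hypothetical placeholders chosen for this example; the actual folder and file names in the dataset may differ.

```python
from pathlib import Path


def summarize_dataset(root, articles_dir="articles", urls_file="urls.txt"):
    """Return (sorted HTML file names, list of source URLs) for a dataset folder.

    The default directory and file names are assumptions made for this
    sketch, not the names used in the actual dataset.
    """
    root = Path(root)
    # Collect the names of the downloaded Wikinews articles (HTML files).
    html_files = sorted(p.name for p in (root / articles_dir).glob("*.html"))
    # Read the URL list, one URL per line, skipping blank lines.
    urls = [
        line.strip()
        for line in (root / urls_file).read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    return html_files, urls
```

A simple sanity check would be to confirm that both returned lists contain 20 entries. The ground truth file is not read here; since it is in Turtle notation, it could be loaded with an RDF library.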