Human-generated dataset 2 similar to 2014 Provenance Reconstruction Challenge

From Provenance Reconstruction Wiki

We created additional datasets similar to those provided at the 2014 Provenance Reconstruction Challenge (human-generated). These datasets were created by scraping randomly selected news articles from the Wikinews website [1]. The Wikinews articles are downloaded as HTML files into a single folder. In addition, each dataset contains the following:

  • a list of URLs of the source article files
  • the downloaded source article files (in HTML)
  • a ground truth file (in Turtle notation)

The datasets are provided in five sizes: 10, 20, 50, 100, and 200 articles.
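
As a rough illustration of the layout described above, the following Python sketch takes an inventory of one unpacked dataset folder. The file names and extensions are assumptions made for the example, not something this page specifies: it assumes articles end in `.html`, the ground truth file ends in `.ttl`, and the URL list is a plain-text file with the hypothetical name `source_urls.txt` containing one URL per line. Adjust these to match the actual files.

```python
from pathlib import Path


def inventory_dataset(root, url_list_name="source_urls.txt"):
    """Summarize one dataset folder.

    Assumptions (not confirmed by the dataset description):
    - article files use the .html extension
    - the ground truth file uses the .ttl extension
    - url_list_name is a hypothetical file name, one URL per line
    """
    root = Path(root)

    # Collect article and ground truth files anywhere under the folder.
    html_files = sorted(p.name for p in root.rglob("*.html"))
    turtle_files = sorted(p.name for p in root.rglob("*.ttl"))

    # Read the URL list if it exists, skipping blank lines.
    urls = []
    url_list = root / url_list_name
    if url_list.exists():
        text = url_list.read_text(encoding="utf-8")
        urls = [line.strip() for line in text.splitlines() if line.strip()]

    return {"html": html_files, "turtle": turtle_files, "urls": urls}
```

Calling `inventory_dataset("path/to/dataset")` returns a dictionary whose `html`, `turtle`, and `urls` entries can be compared against the expected dataset size, for example to check that no article failed to download.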