[PDF]
[URL]
[Features]
[Talk]
Dataset
The collected documents are selected sections from the Wikipedia’s featured articles collection. This is a continuously growing dataset, that at the time of collection (October 2009) had 2,669 articles spread over 29 categories. Some of the categories are very scarce, therefore we considered only the 10 most populated ones. The articles generally have multiple sections and pictures. We have split them into sections based on section headings, and assign each image to the section in which it was placed by the author(s). Then this dataset was prunned to keep only sections that contained a single image and at least 70 words. The final corpus contains 2,866 multimedia documents. The median text length is 200 words.