biomedical and healthCAre Data Discovery Index Ecosystem
Title: 2013 ImageCLEF WEBUPV Collection
dateReleased:
01-27-2017
privacy:
information not available
aggregation:
instance of dataset
dateCreated:
01-24-2017
refinement:
raw
ID:
doi:10.5281/ZENODO.257722
creators:
Villegas, Mauricio
Paredes, Roberto
availability:
available
types:
other
description:
This document describes the WEBUPV dataset compiled for the ImageCLEF
2013 Scalable Concept Image Annotation task. The data listed here is
what is currently ready for download. However, upon request or
depending on feedback from the participants, additional data may be
released.

The directory structure of the collection is given under "Directory
structure", and below it there is a brief description of what each
compressed file contains. The MD5 checksums of the files shown (for
verifying a correct download) can be found in md5sums.txt.

Any publication in which this data has been used is required to cite
the following paper:

@inproceedings{Villegas13_CLEF,
  author = {Mauricio Villegas and Roberto Paredes and Bart Thomee},
  title = {{O}verview of the {ImageCLEF} 2013 {S}calable {C}oncept {I}mage {A}nnotation {S}ubtask},
  booktitle = {CLEF 2013 Evaluation Labs and Workshop, Online Working Notes},
  year = {2013},
  month = {September 23-26},
  address = {Valencia, Spain},
  isbn = {978-88-904810-5-5},
  issn = {2038-4963},
}

If the 'hsvcolorhist' and/or the 'lbpcenter' visual features are used,
then it is also required to cite:

@inproceedings{SanchezOro13_CLEF,
  author = {Jes\'us S\'anchez-Oro and Soto Montalvo and Antonio S. Montemayor and Juan J. Pantrigo and Abraham Duarte and V\'ictor Fresno and Raquel Mart\'inez},
  title = {{URJC\&UNED} at {ImageCLEF} 2013 {P}hoto {A}nnotation {T}ask},
  booktitle = {CLEF 2013 Evaluation Labs and Workshop, Online Working Notes},
  year = {2013},
  month = {September 23-26},
  address = {Valencia, Spain},
  isbn = {978-88-904810-5-5},
  issn = {2038-4963},
}

Directory structure
-------------------

.
|
|--- README.txt
|--- md5sums.txt
|--- webupv13_train_lists.zip
|--- webupv13_devel_lists.zip
|--- webupv13_test_lists.zip
|--- webupv13_baseline.zip
|
|--- feats_textual/
|      |
|      |--- webupv13_train_textual_pages.zip
|      |--- webupv13_train_textual.scofeat.gz
|      |--- webupv13_train_textual.keywords.gz
|
|--- feats_visual/
       |
       |--- webupv13_{train|devel|test}_visual_images.zip
       |--- webupv13_{train|devel|test}_visual_gist.feat.gz
       |--- webupv13_{train|devel|test}_visual_sift_1000.feat.gz
       |--- webupv13_{train|devel|test}_visual_csift_1000.feat.gz
       |--- webupv13_{train|devel|test}_visual_rgbsift_1000.feat.gz
       |--- webupv13_{train|devel|test}_visual_opponentsift_1000.feat.gz
       |--- webupv13_{train|devel|test}_visual_colorhist.feat.gz
       |--- webupv13_{train|devel|test}_visual_getlf.feat.gz
       |--- webupv13_{train|devel|test}_visual_hsvcolorhist.feat.gz
       |--- webupv13_{train|devel|test}_visual_lbpcenter.feat.gz

Contents of files
-----------------

* webupv13_train_lists.zip

  -> train_iids.txt : IDs of the images (IIDs) in the training set
                      (250000).
  -> train_rids.txt : IDs of the webpages (RIDs) in the training set
                      (262526).
  -> train_*urls.txt : The original URLs from where the images (iurls)
       and the webpages (rurls) were downloaded. Each line in the file
       corresponds to an image, starting with the IID and followed
       by one or more URLs.
  -> train_rimgsrc.txt : The URLs of the images as referenced in each
       of the webpages. Each line of the file is of the form: IID RID
       URL1 [URL2 ...]. This information is necessary to locate the
       images within the webpages and it can also be useful as a
       textual feature.

* webupv13_devel_lists.zip

  -> devel_iids.txt : IDs of the images in the development set (1000).
  -> devel_*urls.txt : The original URLs from where the images (iurls)
       and the webpages (rurls) were downloaded. Each line in the file
       corresponds to an image, starting with the IID and followed
       by one or more URLs.
       Note: These are included only to acknowledge the source of the
       data, not to be used as input to the annotation systems.
  -> devel_concepts.txt : List of concepts for the development set.
  -> devel_gnd.txt : Ground truth concepts for the development set
                     images.

  The concepts are defined by one or more WordNet synsets, which is
  intended to make it possible to easily obtain more information about
  the concepts, e.g. synonyms. In the concept list, the first column
  (which is the name of the concept) indicates the word to search in
  WordNet, the second column the synset type (either noun or
  adjective), the third column is the sense number and the fourth
  column is the WordNet offset (although this cannot be trusted since
  it changes between WordNet versions). For most of the concepts there
  is a fifth column which is a Wikipedia article related to the
  concept.

* webupv13_test_lists.zip

  -> test_iids.txt : IDs of the images in the test set (2000).
  -> test_*urls.txt : The original URLs from where the images (iurls)
       and the webpages (rurls) were downloaded. Each line in the file
       corresponds to an image, starting with the IID and followed
       by one or more URLs.
       Note: These are included only to acknowledge the source of the
       data, not to be used as input to the annotation systems.
  -> test_concepts.txt : List of concepts for the test set.
  -> test_gnd.txt : Ground truth concepts for the test set images.

  The definition of the concepts is the same as for
  devel_concepts.txt. Note that the concepts are not the same as for
  the development set.

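The column layout of the concept lists (devel_concepts.txt and test_concepts.txt) can be illustrated with a small parser. This is only a sketch under assumptions that the description does not confirm: columns are taken to be separated by whitespace, each line is taken to hold a single synset, and the names Concept and parse_concept_line are hypothetical. The sample line in the usage note is made up for illustration and is not copied from the dataset.

```python
from typing import NamedTuple, Optional


class Concept(NamedTuple):
    name: str                 # column 1: word to search in WordNet
    synset_type: str          # column 2: "noun" or "adjective"
    sense: int                # column 3: sense number
    offset: str               # column 4: WordNet offset (version dependent)
    wikipedia: Optional[str]  # column 5: related article, if present


def parse_concept_line(line: str) -> Concept:
    """Parse one concept line, assuming whitespace-separated columns."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 columns, got {len(fields)}")
    name, synset_type, sense, offset = fields[:4]
    # The fifth column is present only for most, not all, concepts.
    wikipedia = fields[4] if len(fields) > 4 else None
    # The offset is kept as a string so that leading zeros survive.
    return Concept(name, synset_type, int(sense), offset, wikipedia)
```

For example, parse_concept_line("airplane noun 1 02691156 http://en.wikipedia.org/wiki/Airplane") yields a Concept whose wikipedia field is set, while a four-column line yields wikipedia=None. Since the offset changes between WordNet versions, looking concepts up by name, synset type and sense number is the more robust choice.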
* webupv13_baseline.zip

  An archive that includes code for computing the evaluation measures
  for two baseline techniques. See the included README.txt for
  details.

* feats_textual/webupv13_train_textual_pages.zip

  Contains all of the webpages which referenced the images in the
  training set after being converted to valid xml. In total there are
  262588 files, since each image can appear in more than one page, and
  there can be several versions of the same page which differ by the
  method of conversion to xml. To avoid having too many files in a
  single directory (which is an issue for some types of partitions),
  the files are found in subdirectories named using the first two
  characters of the RID, thus the paths of the files after extraction
  are of the form:

    ./WEBUPV/pages/{RID:0:2}/{RID}.{CONVM}.xml.gz

  To be able to locate the training images within the webpages, the
  URLs of the images as referenced are provided in the file
  train_rimgsrc.txt.

* feats_textual/webupv13_train_textual.scofeat.gz

  The processed text extracted from the webpages near where the images
  appeared. Each line corresponds to one image, having the same order
  as the train_iids.txt list. The lines start with the image ID,
  followed by the number of extracted unique words and the
  corresponding word-score pairs. The scores were derived taking into
  account 1) the term frequency (TF), 2) the document object model
  (DOM) attributes, and 3) the word distance to the image. The scores
  are all integers and for each image the sum of scores is always
  <=100000 (i.e. it is normalized).

* feats_textual/webupv13_train_textual.keywords.gz

  The words used to find the images when querying image search
  engines. Each line corresponds to an image (in the same order as in
  train_iids.txt).
  The lines are composed of triplets:

    [keyword] [rank] [search_engine]

  where [keyword] is the word used to find the image, [rank] is the
  position given to the image in the query, and [search_engine] is a
  single character indicating in which search engine it was found
  ('g':google, 'b':bing, 'y':yahoo).

* feats_visual/webupv13_*_images.zip

  Contains thumbnails (maximum 640 pixels of either width or height)
  of the images in jpeg format. To avoid having too many files in a
  single directory (which is an issue for some types of partitions),
  the files are found in subdirectories named using the first two
  characters of the IID, thus the paths of the files after extraction
  are of the form:

    ./WEBUPV/images/{IID:0:2}/{IID}.jpg

* feats_visual/webupv13_*.feat.gz

  The visual features in a simple ASCII text sparse format. The first
  line of the file indicates the number of vectors (N) and the
  dimensionality (DIMS). Then each line corresponds to one vector,
  starting with the number of non-zero elements and followed by pairs
  of dimension-value, where the first dimension is numbered 0. In
  summary the file format is:

    N DIMS
    nz1 Dim(1,1) Val(1,1) ... Dim(1,nz1) Val(1,nz1)
    nz2 Dim(2,1) Val(2,1) ... Dim(2,nz2) Val(2,nz2)
    ...
    nzN Dim(N,1) Val(N,1) ... Dim(N,nzN) Val(N,nzN)

  The order of the features is the same as in the lists
  devel_iids.txt, test_iids.txt and train_iids.txt.

  The procedure to extract the SIFT based features in this
  subdirectory was conducted as follows. Using the ImageMagick
  software, the images were first rescaled to a maximum of 240
  pixels in both width and height, while preserving the original
  aspect ratio, employing the command:

    convert {IMGIN}.jpg -resize '240>x240>' {IMGOUT}.jpg

  Then the SIFT features were extracted using the ColorDescriptor
  software from Koen van de Sande
  (http://koen.me/research/colordescriptors).
  As configuration we used the 'densesampling' detector with default
  parameters, and a hard assignment codebook using a spatial pyramid
  as 'pyramid-1x1-2x2'. The number in the file name indicates the size
  of the codebook. All of the vectors of the spatial pyramid are given
  in the same line, thus keeping only the first 1/5th of the
  dimensions would be like not using the spatial pyramid. The codebook
  was generated using 1.25 million randomly selected features and the
  k-means algorithm.

  The GIST features were extracted using the LabelMe Toolbox. The
  images were first resized to 256x256, ignoring the original aspect
  ratio, and the features were computed using 5 scales, 6 orientations
  and 4 blocks.

  The other features, colorhist and getlf, are both color histogram
  based and were extracted using our own implementation.
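The sparse ASCII format of the *.feat.gz files, and the remark about keeping only the first 1/5th of the dimensions, can be sketched in Python as follows. This is a minimal illustration written from the format description alone, not code shipped with the dataset. The function names parse_feats, read_feat_file and whole_image_only are hypothetical, and the whole file is loaded into memory for simplicity.

```python
import gzip
from typing import Dict, List, TextIO, Tuple

SparseVector = Dict[int, float]


def parse_feats(stream: TextIO) -> Tuple[int, List[SparseVector]]:
    """Parse the 'N DIMS' header and the N sparse vectors that follow it."""
    n, dims = (int(x) for x in stream.readline().split())
    vectors: List[SparseVector] = []
    for _ in range(n):
        tokens = stream.readline().split()
        if not tokens:
            raise ValueError("file ended before all vectors were read")
        nz = int(tokens[0])
        if len(tokens) != 1 + 2 * nz:
            raise ValueError("non-zero count does not match the pairs")
        # Tokens after the count alternate: dimension (0-based), value.
        vectors.append(
            {int(tokens[i]): float(tokens[i + 1])
             for i in range(1, len(tokens), 2)}
        )
    return dims, vectors


def read_feat_file(path: str) -> Tuple[int, List[SparseVector]]:
    """Open a gzip-compressed feature file as text and parse it."""
    with gzip.open(path, "rt", encoding="ascii") as stream:
        return parse_feats(stream)


def whole_image_only(vec: SparseVector, dims: int) -> SparseVector:
    """Keep the first 1/5th of the dimensions of a pyramid-1x1-2x2 vector,
    which the description equates with not using the spatial pyramid."""
    base = dims // 5
    return {d: v for d, v in vec.items() if d < base}
```

Because the vectors follow the order of the corresponding *_iids.txt list, the i-th parsed vector belongs to the i-th IID in that list. For the full training set a streaming variant that yields one vector at a time would use less memory than the list returned here.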
accessURL: https://doi.org/10.5281/ZENODO.257722
storedIn:
Zenodo
qualifier:
not compressed
format:
HTML
accessType:
landing page
authentication:
none
authorization:
none
abbreviation:
ZENODO
homePage: https://zenodo.org/
ID:
SCR:004129
name:
ZENODO
