Structured Facts 380K: Structured COCO, Structured Flickr30K Entities

This dataset has two parts; use whichever you need:

A) Structured COCO Facts

B) Structured Flickr30K Facts

For instance, these data can be used as automatically augmented facts for the MS COCO dataset, and hence provide access to fact-level annotations that can help improve captioning.

Goal: The space of visual actions and interactions is huge, so we propose to collect such data guided by captions.

For details on how the method works, please refer to the paper.
Motivated by the application of fact-level image understanding, we present an automatic method for collecting structured visual facts from images with captions. Example structured facts include attributed objects (e.g., <flower, red>), actions (e.g., <baby, smile>), interactions (e.g., <man, walking, dog>), and positional information (e.g., <vase, on, table>). The collected annotations take the form of fact-image pairs (e.g., <man, walking, dog> and an image region containing this fact). Using a language approach, the proposed method is able to collect hundreds of thousands of visual fact annotations with an accuracy of 83% according to human judgment. In total, it automatically collected more than 380,000 visual fact annotations and more than 110,000 unique visual facts from images with captions, and localized them in the images, in less than one day of processing time on standard CPU platforms.
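
To make the tuple types concrete, here is a minimal MATLAB sketch of how the example facts above map onto the S/SP/SPO types used in the annotations. The variable layout is purely illustrative (my reading of the examples, not the released format); the actual .mat structure is documented in the README below.

    % Illustrative only: the example facts as <subject, predicate, object> tuples.
    facts(1) = struct('type', {'SP'},  'tuple', {{'flower', 'red', ''}});      % attributed object
    facts(2) = struct('type', {'SP'},  'tuple', {{'baby', 'smile', ''}});      % action
    facts(3) = struct('type', {'SPO'}, 'tuple', {{'man', 'walking', 'dog'}});  % interaction
    facts(4) = struct('type', {'SPO'}, 'tuple', {{'vase', 'on', 'table'}});    % positional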

Download Data Collected by the Method from the MS COCO and Flickr30K Datasets

Version 1 (Light) 

Version 1 (More Detailed)

The annotations are hosted on Dropbox (a few hundred megabytes in total) and are described below; feel free to reach out with any questions. Subjects/objects are grounded when they correspond to either a COCO object or one of the SUN dataset scene categories.

Train 2014
https://dl.dropboxusercontent.com/u/33950950/sherlock_project/detailed_auto_data/coco_train2014/AllGroundedData_with_GroundingInfo.mat

Val 2014
https://dl.dropboxusercontent.com/u/33950950/sherlock_project/detailed_auto_data/coco_val2014/AllGroundedData_with_GroundingInfo.mat

COCO Senses
https://dl.dropboxusercontent.com/u/33950950/sherlock_project/detailed_auto_data/coco_senses.mat

SUN Scene Names
https://dl.dropboxusercontent.com/u/33950950/sherlock_project/detailed_auto_data/scene_names.mat
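
Before parsing, it may help to inspect what each downloaded file contains. A minimal MATLAB check, assuming the files above have been saved locally under the same names:

    % List the variables stored in each downloaded .mat file.
    whos('-file', 'AllGroundedData_with_GroundingInfo.mat')
    whos('-file', 'coco_senses.mat')
    whos('-file', 'scene_names.mat')

    % Load the annotations; the README below describes the AllGroundedData layout.
    S = load('AllGroundedData_with_GroundingInfo.mat');
    data = S.AllGroundedData;   % one entry per COCO image
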
README for Version 1 (More Detailed)

AllGroundedData{i} holds the annotations for the i-th image in COCO:

AllGroundedData{i}.imId is the COCO image id.

AllGroundedData{i}.UncleanedGroundedData is the list of grounding data for this image.

Each element of AllGroundedData{i}.UncleanedGroundedData is a grounded example with the following information:

tuple: <subject, predicate, object>

1) statement_ID: id of the statement from which this tuple was extracted

2) type: S/SP/SPO

3) GroundingInfo: contains the properties detailed below.

   i) tuple_bbox: the box associated with the tuple
   ii) object_bboxes: candidate boxes in the image that match the object type
   iii) subject_bboxes: candidate boxes in the image that match the subject type
   iv) is_subject_scene: 1 if the subject is a scene
   v) is_object_scene: 1 if the object is a scene
   vi) linkingData
      a. subjectInfo
         i. linkedObjectsIndeces: [1] if the subject is a person; a number between 2 and 80 for other COCO objects.
         ii. linkedSceneIndex: if the subject is a SUN scene, this contains the scene index.
         iii. isPlural: whether the subject is plural.
         iv. assigned: whether the subject is grounded (i.e., some annotation in COCO matches its type).
      b. objectInfo
         i. linkedObjectsIndeces: [1] if the object is a person; a number between 2 and 80 for other COCO objects.
         ii. linkedSceneIndex: if the object is a SUN scene, this contains the scene index.
         iii. isPlural: whether the object is plural.
         iv. assigned: whether the object is grounded (i.e., some annotation in COCO matches its type).
   vii) complete_assignment: 1 if we managed to assign all elements.
   viii) TH: all annotations with TH = -1 can be used directly. Sometimes, when one of the interacting subjects/objects cannot be located, the tuple_bbox in (i) above is assigned to the whole image; in that case TH holds the ratio between the area of the subject/object that was found and the total image area (i.e., TH = object_found_area / (W*H)). In practice, tuple_bbox works well as an annotation for the whole interaction when TH > 0.3. The boxes that contain only the subject/object are still available. A consumption sketch follows this list.
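
Below is a minimal MATLAB sketch of how these fields can be consumed with the TH rule from (viii). Field names follow this README, but whether AllGroundedData and UncleanedGroundedData load as cell arrays or struct arrays may differ, so treat the indexing as an assumption rather than reference code.

    % Sketch: gather usable fact annotations across all images.
    % Assumes cell-array storage; switch {} to () if these load as struct arrays.
    S = load('AllGroundedData_with_GroundingInfo.mat');
    kept = {};
    for i = 1:numel(S.AllGroundedData)
        entry = S.AllGroundedData{i};
        for j = 1:numel(entry.UncleanedGroundedData)
            g = entry.UncleanedGroundedData{j};
            th = g.GroundingInfo.TH;
            % TH == -1: exact grounding; TH > 0.3: whole-image box that the
            % located subject/object covers well enough (rule viii above).
            if th == -1 || th > 0.3
                kept{end+1} = struct('imId', {entry.imId}, 'type', {g.type}, ...
                    'tuple', {g.tuple}, 'bbox', {g.GroundingInfo.tuple_bbox}); %#ok<AGROW>
            end
        end
    end
    fprintf('Kept %d fact annotations.\n', numel(kept));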





Papers / Citation

If you find this data useful, please consider citing:

@inproceedings{elhoseiny2016automatic,
  title={Automatic Annotation of Structured Facts in Images},
  author={Elhoseiny, Mohamed and Cohen, Scott and Chang, Walter and Price, Brian and Elgammal, Ahmed},
  booktitle={ACL 2016 (Vision \& Language long-paper workshop proceedings)},
  year={2016}
}

@inproceedings{elhoseiny2017sherlock,
  title={Sherlock: Scalable Fact Learning in Images},
  author={Elhoseiny, Mohamed and Cohen, Scott and Chang, Walter and Price, Brian and Elgammal, Ahmed},
  booktitle={AAAI},
  year={2017}
}