Metadata
- Name
- MUHACU: A Benchmark Dataset for Multi-modal Human Activity Understanding
- Repository
- ZENODO
- Identifier
- doi:10.5281/zenodo.4968721
- Description
- Abstract:
Video understanding extends beyond the level of temporal action recognition. Taking a video containing rich human actions as an example, we can reason about and predict future actions based on the first several actions in the stream. For a machine, however, it can still be difficult to forecast and plan based on the video features of these daily human actions. We formalize the task as Multi-modal Human Activity Understanding. Given a small fraction of the original video clip and a set of candidate action sequences, a machine should be able to find the action sequence in the set that best represents the future actions of the observed video frames. We design the task with two settings: one relies entirely on understanding the initial video frames; the other provides both the initial state (video frames) and the goal state (high-level intent). We call them Human Action Forecasting and Human Action Planning, respectively. We then propose the fully annotated benchmark MUHACU (MUlti-modal Human ACtivity Understanding), consisting of 2.9k videos and 157 action classes drawn from the original Charades [1] videos. We refine the original Charades video labels and add further features to support the task. In addition, we provide two strong baseline systems from two directions, information retrieval and end-to-end training, sharing insights on potential solutions to this task.
Introduction:
We have tailored and refined the original annotations in the Charades dataset by selecting 2.9k videos and crowdsourcing the corresponding intent for each video. To meet the design of the initial state, we generally take the first 20% of each video's length as the initial state. Along with the dataset, a multi-modal knowledge base is crafted semi-automatically. Containing temporal action relationships, visual and textual features of atomic actions, and action sequences with high-level intents, the knowledge base serves the goal of generalization well. We demonstrate that the Multi-modal Human Activity Understanding (MUHACU) task is challenging for machines by evaluating a strong hybrid end-to-end framework in the format of a multi-modal cloze task.
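As a rough illustration of the initial-state design above, the following minimal sketch carves out the first 20% of a video's frames as the observed initial state; the function name and the list-of-frames representation are our own assumptions for illustration, not part of the released toolkit.

    # Minimal sketch: keep the first 20% of a video's frames as the initial state.
    # The function name and plain-list frame representation are illustrative
    # assumptions; the released data may organize frames differently.
    def initial_state(frames, ratio=0.2):
        n_observed = max(1, int(len(frames) * ratio))
        return frames[:n_observed]

    # Example: a 100-frame clip yields a 20-frame observed initial state.
    assert len(initial_state(list(range(100)))) == 20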
In summary, MUHACU facilitates multi-modal learning systems that observe through visual features and forecast and plan in language in real-world environments. Our contributions, in brief, are: (1) we propose the first multi-modal knowledge base for temporal activity understanding; (2) we propose baselines demonstrating the effectiveness of the knowledge base; (3) we propose a novel multi-modal benchmark, backed by the knowledge base and dataset, for evaluating models.
MUHACU contains the following fields:
KB: 2402 videos
________________________________________________________
KB
________________________________________________________
# of action-level entities                    157
# of activity video entities                  2402
# of intents per video                        2
# of action video entities                    12118
# of action sequences (non-repeating)         2402 (1969)
# of action state templates                   27
avg. action sequence length                   5.04
________________________________________________________
Features in KB:
________________________________________________________________________
feature                            num             size
________________________________________________________________________
action visual prototype feat       157             [1024,]
action textual prototype feat      157             [768,]
intent feat                        2402 * 2        [768,]
video-level visual feat            2402 + 12118    [1024,]
snippet-level visual feat          2402 + 12118    [frames//8, 1024]
________________________________________________________________________
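As a loading sketch, the feature shapes listed above could be checked roughly as follows; the file names and the NumPy storage format are assumptions for illustration only (see README.txt for the actual file organization).

    import numpy as np

    # Hypothetical file names; the real layout is described in README.txt.
    action_visual = np.load("action_visual_prototype.npy")    # expected shape (157, 1024)
    action_textual = np.load("action_textual_prototype.npy")  # expected shape (157, 768)

    assert action_visual.shape == (157, 1024)
    assert action_textual.shape == (157, 768)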
evaluation task: 510 videos for human action planning and human action forecasting
_________________________________________________________________________________________
                                   human action planning    human action forecasting
_________________________________________________________________________________________
# of videos (action sequences)     510                       510
avg. # of observed acts            2.79                      2.79
avg. # of predicted acts           2.40                      2.40
avg. # of total acts               5.19                      5.19
# of choices                       6                         6
# of answers                       1 (435) / 2 (75)          1
# of intents                       0                         1
_________________________________________________________________________________________
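For orientation, one evaluation instance can be pictured as a multiple-choice record along these lines; all field names and example strings below are illustrative assumptions, not the released schema.

    # Illustrative record: an observed action prefix, six candidate future
    # action sequences, the index/indices of the correct sequence(s), and a
    # high-level intent in the setting that provides the goal state.
    example = {
        "video_id": "hypothetical_id",
        "observed_actions": ["open a door", "walk through the doorway"],
        "intent": "tidy up the living room",   # only in the intent-provided setting
        "choices": [
            ["put down a bag", "sit on the sofa"],
            # ... five more candidate sequences
        ],
        "answers": [0],
    }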
training dataset split: We also provide a dataset split for training the baseline models to learn the future ground-truth sequences. The initial 2402 KB videos are divided by a standard 8:2 split into training (1921 videos) and validation (481 videos) sets.
_____________________________________________________________
train      validation      test
_____________________________________________________________
1921       481             510
_____________________________________________________________
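The split sizes above are consistent with an 8:2 division of the 2402 KB videos; a minimal arithmetic check (over a hypothetical list of video IDs, not the released split itself):

    video_ids = [f"vid_{i:04d}" for i in range(2402)]   # hypothetical IDs
    n_train = int(len(video_ids) * 0.8)                 # 1921
    train_ids, val_ids = video_ids[:n_train], video_ids[n_train:]
    assert (len(train_ids), len(val_ids)) == (1921, 481)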
More details about the dataset are provided in README.txt.
Availability:
Our dataset and knowledge base are available online at https://zenodo.org/deposit/4968721 in order to support sustainability. The resource is maintained under the Creative Commons Attribution 4.0 International license, implying re-usability. We follow the widely used FAIR Data principles, which are designed to make resources findable, accessible, interoperable, and re-usable. The GitHub repository at https://github.com/MUHACU/MUHACU contains the complete source code and checkpoints for the baseline systems.
[1] Sigurdsson, Gunnar A., et al. "Hollywood in homes: Crowdsourcing data collection for activity understanding." European Conference on Computer Vision. Springer, Cham, 2016.
- Data or Study Types
- multiple
- Source Organization
- Unknown
- Access Conditions
- available
- Year
- 2021
- Access Hyperlink
- https://doi.org/10.5281/zenodo.4968721
Distributions
- Encoding Format: HTML; URL: https://doi.org/10.5281/zenodo.4968721