Hi there! We are three students from the CMU MCDS program. Our capstone project is on Multimodal Question Answering.

Left to right: Kunal Dhawan, Wenxing Deng, Manoj Ghuhan, and our mentor, Prof. Eric Nyberg

Multimodal Question Answering (MQA) is a rapidly growing area of research that aims to build intelligent systems capable of answering user queries by reasoning over information from multiple modalities. Such systems try to emulate humans, who also rely on cross-modal reasoning to answer the questions put to them. Current MQA approaches suffer from several drawbacks: they are trained on biased datasets, they struggle with even simple counting-based questions, and they tend to learn surface-level relationships rather than to reason. In this work, we aim to overcome these limitations and propose a new end-to-end MQA system. The major contributions of our work are:

  • Curation of an MQA dataset that contains a diverse set of question types capturing complex interactions and relationships between different objects in the images, and that is devoid of inherent biases
  • Improved feature extraction module that can handle, and even generate, scene graphs from input images
  • Instance segmentation module to improve MQA system performance on counting-related questions
  • End-to-end trainable MQA pipeline that outperforms the current state of the art