Visual Dialogue for Open Tasks

Abstract

Visual Dialogue is a task in which an AI agent must hold a dialogue with humans, in natural conversational language, about visual content. It is challenging because it requires a high level of understanding of both the visual world and natural language, and the open-ended nature of conversational agents further increases its complexity. The task brings together two major fields of AI, computer vision and natural language processing, and, being sufficiently detached from typical downstream tasks, serves as a general test of machine intelligence. Beyond the technical challenge, it is also an impactful application of AI, since it can help users interact with systems and improve their experience. In this work, we propose to enrich the multimodal capabilities of a task assistant in two ways: 1) Dialogue Video Moment Retrieval: we will allow users to navigate through videos by voice. To do so, we will extract each video’s most relevant frames, generate data describing these frames, and index that data so it can be retrieved later; 2) Task-Grounded Image Sequence Synthesis: we will use image synthesis models to illustrate task steps, with an emphasis on sequence coherence.

João Bordalo
PhD Student

PhD student at NOVA School of Science and Technology.