In this work, we investigate how text, visual, and task attention models interact during the learning and execution of structured tasks expressed in natural language. To this end, we propose an architecture that combines different attention models at multiple levels. First, a multi-modal attention mechanism enables the agent to map objects in the environment to the words of the natural-language mission, so that it can perform the required task effectively. Second, an additional attention mechanism directs the agent's textual attention to the parts of the sentence relevant to the subtasks yet to be completed. The agent is trained in MiniGrid environments using the Proximal Policy Optimization algorithm, and the proposed architecture is evaluated against a baseline that excludes the attention mechanisms. In addition, an ablation study is conducted on the task attention module.

Integrating Text-Visual and Task Attention for Language-Guided Robot Learning / Rauso, G.; Caccavale, R.; Lippiello, V.; Finzi, A. - 3956 (2025), pp. 42-48. (11th Italian Workshop on Artificial Intelligence and Robotics, AIRO 2024, Italy, 2024).

Integrating Text-Visual and Task Attention for Language-Guided Robot Learning

Rauso G.; Caccavale R.; Lippiello V.; Finzi A.
2025


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/1016846