
One-shot learning for rapid generation of structured robotic manipulation tasks from 3D video demonstrations / Duque-Domingo, Jaime; Caccavale, Riccardo; Finzi, Alberto; Zalama, Eduardo; Gómez-García-Bermejo, Jaime. - In: JOURNAL OF INTELLIGENT MANUFACTURING. - ISSN 0956-5515. - (2025). [10.1007/s10845-025-02673-7]

One-shot learning for rapid generation of structured robotic manipulation tasks from 3D video demonstrations

Caccavale, Riccardo; Finzi, Alberto
2025

Abstract

We present a framework that enables a collaborative robot to rapidly replicate structured manipulation tasks demonstrated by a human operator through a single 3D video recording. The system combines object segmentation with hand and gaze tracking to analyze and interpret the video demonstrations. The manipulation task is decomposed into primitive actions by leveraging 3D features, including the proximity of the hand trajectory to objects, the speed of the trajectory, and the user's gaze. In line with the one-shot learning paradigm, we introduce a novel object segmentation method, SAM+CP-CVV, which ensures that objects appearing in the demonstration need to be labeled only once. Segmented manipulation primitives are also associated with object-related data, facilitating the implementation of the corresponding robotic actions. Once these action primitives have been extracted and recorded, they can be recombined into a structured robotic task ready for execution. This framework is particularly well suited to flexible manufacturing environments, where operators can rapidly and incrementally instruct collaborative robots through video-demonstrated tasks. We discuss the approach as applied to heterogeneous manipulation tasks and show that the proposed method transfers to different types of robots and manipulation scenarios.
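To make the segmentation step concrete, the sketch below shows one plausible way to split a recorded 3D hand trajectory into candidate action primitives using two of the cues named in the abstract: hand-to-object proximity and hand speed. This is an illustrative assumption, not the paper's implementation; the threshold values, array layout, and the function name `segment_primitives` are hypothetical, and the gaze cue is omitted for brevity.

```python
# Illustrative sketch (not the authors' method): segment a 3D hand trajectory
# into candidate primitives where the hand is both near a known object and
# moving slowly. Thresholds and data layout are assumptions for this example.
import numpy as np

PROXIMITY_THRESH = 0.05  # metres; hypothetical cutoff for "hand near an object"
SPEED_THRESH = 0.02      # metres per frame; hypothetical cutoff for "hand slowing down"

def segment_primitives(hand_xyz, object_centers):
    """Split a (T, 3) hand trajectory into (start_frame, end_frame, object_id) segments.

    hand_xyz: (T, 3) NumPy array of 3D hand positions, one row per frame.
    object_centers: dict mapping object_id -> length-3 centroid, e.g. from segmentation.
    """
    # Per-frame displacement magnitude serves as a simple speed estimate.
    speeds = np.linalg.norm(np.diff(hand_xyz, axis=0), axis=1)
    segments, active = [], None
    for t in range(len(speeds)):
        # Find the segmented object closest to the hand at frame t.
        obj, dist = min(
            ((oid, np.linalg.norm(hand_xyz[t] - np.asarray(c)))
             for oid, c in object_centers.items()),
            key=lambda pair: pair[1],
        )
        near_and_slow = dist < PROXIMITY_THRESH and speeds[t] < SPEED_THRESH
        if near_and_slow and active is None:
            active = (t, obj)  # open a candidate primitive (e.g. grasp or place)
        elif not near_and_slow and active is not None:
            segments.append((active[0], t, active[1]))  # close the primitive
            active = None
    if active is not None:  # trajectory ended while a primitive was still open
        segments.append((active[0], len(speeds), active[1]))
    return segments
```

Each resulting (start_frame, end_frame, object_id) segment could then be mapped to a robotic action parameterized by the associated object data, mirroring the recombination step described in the abstract.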
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/1016839
Citations
  • PMC: not available
  • Scopus: 0
  • Web of Science (ISI): not available