Doctoral project
On the Recognition of Human Activities and the Evaluation of Its Imitation by Robotic Systems
Name
Raphael Memmesheimer
Status
Completed
Completion of the doctorate
First supervisor
Prof. Dr.-Ing. Dietrich Paulus
Second reviewer
Prof. Dr. Hildegard Kühne
This thesis addresses the problem of action recognition by the analysis of human motion and the benchmarking of its imitation by robotic systems.
For our action recognition approaches, we focus on methods that generalize well across different sensor modalities. We transform multivariate signal streams, originating from various sensors, into a common image representation. The action recognition problem on sequential multivariate signal streams can then be reduced to an image classification task, for which we utilize recent advances in machine learning. We demonstrate the wide applicability of our approaches on supervised action classification, semi-supervised one-shot action recognition, modality fusion, and temporal action segmentation.
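To make the idea concrete, the following is a minimal sketch of a signal-to-image encoding of this kind, written in Python with NumPy. The function name, the per-channel min-max normalization, and the fixed 224x224 output size are illustrative assumptions, not the exact representation used in the thesis.

    import numpy as np

    def signal_to_image(signal, height=224, width=224):
        """Encode a multivariate signal of shape (T, C) as an 8-bit image.

        Each channel is normalized to [0, 1] independently and the time axis
        is resampled to a fixed width, so sequences of different lengths map
        to images of the same size. Illustrative encoding only.
        """
        signal = np.asarray(signal, dtype=np.float32)            # (T, C)
        mins = signal.min(axis=0, keepdims=True)
        maxs = signal.max(axis=0, keepdims=True)
        normed = (signal - mins) / (maxs - mins + 1e-8)          # per-channel [0, 1]
        # Resample the time axis to a fixed number of image columns.
        t_idx = np.linspace(0, len(signal) - 1, width)
        resampled = np.stack(
            [np.interp(t_idx, np.arange(len(signal)), normed[:, c])
             for c in range(signal.shape[1])], axis=0)           # (C, width)
        # Stretch the channel axis to the desired image height.
        row_idx = np.linspace(0, signal.shape[1] - 1, height).round().astype(int)
        return (resampled[row_idx] * 255).astype(np.uint8)       # (height, width)

A recording from any sensor setup thus maps to a fixed-size image that an off-the-shelf image classifier can consume.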
For action classification, we use an EfficientNet Convolutional Neural Network (CNN) model to classify the image representations of the various data modalities. Further, we present approaches for filtering and for fusing the modalities at the representation level. We extend the approach to semi-supervised classification and train a metric-learning model that encodes action similarity. During training, the encoder optimizes the distances in embedding space for self-, positive- and negative-pair similarities. The resulting encoder allows estimating action similarity by computing distances in embedding space. At training time, no action classes from the test set are used.
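The metric-learning stage can be sketched with a standard triplet objective in PyTorch. The tiny encoder below stands in for the EfficientNet backbone; all names, layer sizes, and the margin value are illustrative assumptions rather than the exact training setup of the thesis.

    import torch
    import torch.nn as nn

    # Stand-in encoder: maps a motion image (3 x 224 x 224) to a 128-D embedding.
    encoder = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(16, 128),
    )
    triplet_loss = nn.TripletMarginLoss(margin=1.0)

    def training_step(anchor_img, positive_img, negative_img):
        # Embed the anchor (self), a sample of the same action (positive)
        # and a sample of a different action (negative).
        a = encoder(anchor_img)
        p = encoder(positive_img)
        n = encoder(negative_img)
        # Pull anchor/positive together, push anchor/negative apart.
        return triplet_loss(a, p, n)

    def action_distance(img_a, img_b):
        # At test time, action similarity is a distance in embedding space.
        with torch.no_grad():
            return torch.norm(encoder(img_a) - encoder(img_b), dim=-1)

Because the loss only depends on relative distances, the encoder can score similarity for action classes it never saw during training, which is what enables the one-shot setting.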
Graph Convolutional Networks (GCNs) generalize the concept of CNNs to non-Euclidean data structures and have shown great success for action recognition by operating directly on spatio-temporal sequences such as skeleton sequences. GCNs have recently shown state-of-the-art performance for skeleton-based action recognition, but are currently widely neglected as a foundation for the fusion of multiple sensor modalities. We incorporate additional modalities, such as inertial measurements or RGB features, into the skeleton graph by proposing fusion on two different dimensionality levels. On the channel dimension, modalities are fused by introducing additional node attributes. On the spatial dimension, additional nodes are incorporated into the skeleton graph.
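The two fusion levels can be illustrated on the node-feature and adjacency matrices of a skeleton graph. The shapes, the broadcast of the inertial vector, and the random projection below are illustrative assumptions, not the exact graph construction of the thesis.

    import numpy as np

    # Hypothetical sizes: J skeleton joints with C-dimensional features per frame,
    # plus one extra modality (e.g. an IMU reading) with M dimensions.
    J, C, M = 25, 3, 6
    joints = np.random.rand(J, C)        # per-joint features (e.g. 3D coordinates)
    imu = np.random.rand(M)              # one inertial measurement vector
    adjacency = np.eye(J)                # skeleton connectivity (self-loops only here)

    # Channel-level fusion: the extra modality becomes additional node attributes,
    # broadcast to every joint; the graph topology stays unchanged.
    channel_fused = np.concatenate([joints, np.tile(imu, (J, 1))], axis=1)  # (J, C + M)

    # Spatial-level fusion: the extra modality becomes an additional node whose
    # features are (illustratively) projected to the joint feature dimension C,
    # and the adjacency matrix grows by one row and column.
    project = np.random.rand(M, C)
    extra_node = (imu @ project)[None, :]                                    # (1, C)
    spatial_fused = np.concatenate([joints, extra_node], axis=0)             # (J + 1, C)
    grown_adjacency = np.zeros((J + 1, J + 1))
    grown_adjacency[:J, :J] = adjacency
    grown_adjacency[J, :] = 1.0          # connect the new node to every joint
    grown_adjacency[:, J] = 1.0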
Transformer models have shown great performance for the analysis of sequential data. We formulate the temporal action segmentation task as an object detection task and use a detection transformer model on our proposed motion image representations. Experiments for our action recognition approaches are conducted on large-scale, publicly available datasets. Our approaches for action recognition across various modalities, action recognition by fusion of multiple modalities, and one-shot action recognition achieve state-of-the-art results on some of these datasets.
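One way to read the detection formulation: since the horizontal axis of a motion image corresponds to time, a box predicted by a detection model can be mapped back to a temporal segment. The sketch below shows only that mapping, under the assumption of a pixel-proportional time axis, and is not the detection pipeline of the thesis.

    def box_to_segment(box, image_width, num_frames):
        """Map a detected box (x_min, y_min, x_max, y_max) on a motion image,
        whose horizontal axis encodes time, to a (start_frame, end_frame) segment."""
        x_min, _, x_max, _ = box
        start = int(round(x_min / image_width * (num_frames - 1)))
        end = int(round(x_max / image_width * (num_frames - 1)))
        return start, end

    # Example: a box spanning pixels 56..112 of a 224-wide image over a 300-frame clip.
    print(box_to_segment((56, 0, 112, 224), image_width=224, num_frames=300))  # (75, 150)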
Finally, we present a hybrid imitation learning benchmark. The benchmark consists of a dataset, metrics, and an integration into a simulator. The dataset contains RGB-D image sequences of humans performing movements and executing manipulation tasks, together with the corresponding ground truth. The RGB-D camera is calibrated against a motion capture system, and the resulting sequences serve as input for imitation learning approaches. The resulting policy is then executed in the simulated environment on different robots. We propose two metrics to assess the quality of the imitation: the trajectory metric gives insight into how closely the execution follows the demonstration, and the effect metric describes how closely the final state of the demonstration is reached. We believe that the Simitate benchmark can improve the comparability of imitation learning approaches.
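In the spirit of these two metrics, an illustrative sketch might compare resampled 3D trajectories and final states as below; the exact definitions used in Simitate may differ, and all function names here are assumptions.

    import numpy as np

    def trajectory_error(demo, execution, num_samples=100):
        """Mean Euclidean distance between two 3D trajectories of shape (T, 3),
        after resampling both to a common number of time steps."""
        def resample(traj):
            traj = np.asarray(traj, dtype=np.float64)
            t_old = np.linspace(0.0, 1.0, len(traj))
            t_new = np.linspace(0.0, 1.0, num_samples)
            return np.stack([np.interp(t_new, t_old, traj[:, d]) for d in range(3)], axis=1)
        d, e = resample(demo), resample(execution)
        return float(np.mean(np.linalg.norm(d - e, axis=1)))

    def effect_error(demo_final_state, exec_final_state):
        """Distance between the final states (e.g. object positions) of
        demonstration and execution."""
        return float(np.linalg.norm(np.asarray(demo_final_state) - np.asarray(exec_final_state)))

A low trajectory error indicates that the robot moved like the demonstrator, while a low effect error indicates that it achieved the same outcome, even if along a different path.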