Generative model‐enhanced human motion prediction

Abstract The task of predicting human motion is complicated by the natural heterogeneity and compositionality of actions, necessitating robustness to distributional shift up to and including out-of-distribution (OoD) samples. Here, we formulate a new OoD benchmark based on the Human3.6M and Carnegie Mellon University (CMU) motion capture datasets, and introduce a hybrid framework for hardening discriminative architectures to OoD failure by augmenting them with a generative model. When applied to current state-of-the-art discriminative models, we show that the proposed approach improves OoD robustness without sacrificing in-distribution performance, and can theoretically facilitate model interpretability. We suggest human motion predictors ought to be constructed with OoD challenges in mind, and provide an extensible general framework for hardening diverse discriminative architectures to extreme distributional shift. The code is available at: https://github.com/bouracha/OoDMotion.

The favoured approach to predicting movements over time has been purely inductive, relying on the history of a specific class of movement to predict its future. For example, state space models [Koller and Friedman, 2009] enjoyed early success for simple, common or cyclic motions [Taylor et al., 2007, Sutskever et al., 2009, Lehrmann et al., 2014]. The range, diversity and complexity of human motion has encouraged a shift to more expressive, deep neural network architectures [Fragkiadaki et al., 2015, Butepage et al., 2017, Martinez et al., 2017, Li et al., 2018, Mao et al., 2019, Li et al., 2020b, Cai et al., 2020], but still within a simple inductive framework. This approach would be adequate were actions both sharply distinct and highly stereotyped. But their complex, compositional nature means that within one category of action the kinematics may vary substantially, while between two categories they may barely differ. Moreover, few real-world tasks restrict the plausible repertoire to a small number of classes, distinct or otherwise, that could be explicitly learnt. Rather, any action may be drawn from a great diversity of possibilities, both kinematic and teleological, that shape the characteristics of the underlying movements.

This has two crucial implications. First, any modelling approach that lacks awareness of the full space of motion possibilities will be vulnerable to poor generalisation and brittle performance in the face of kinematic anomalies. Second, the very notion of In-Distribution (ID) testing becomes moot, for the relations between different actions and their kinematic signatures are plausibly determinable only across the entire domain of action. A test here arguably needs to be Out-of-Distribution (OoD) if it is to be considered a robust test at all. These considerations are amplified by the nature of real-world applications of kinematic modelling, such as anticipating arbitrary deviations from expected motor behaviour early enough for an automatic intervention to mitigate them. Most urgent in the domain of autonomous driving [Bhattacharyya et al., 2018, Wang et al., 2019], such safety concerns are of the highest importance, and are best addressed within the fundamental modelling framework. Indeed, Amodei et al. [2016] cite the ability to recognise our own ignorance as a safety mechanism that must be a core component of safe AI. Nonetheless, to our knowledge, current predictive models of human kinematics neither quantify OoD performance nor are designed with it in mind.

There is therefore a need for two frameworks, applicable across the domain of action modelling: one for hardening a predictive model to anomalous cases, and another for quantifying OoD performance with established benchmark datasets. General frameworks are here preferable to new models, for the field is evolving so rapidly that greater impact can be achieved by introducing mechanisms applicable to a breadth of candidate architectures, even if they are demonstrated in only a subset. Our approach is founded on combining a latent variable generative model with a standard predictive model, illustrated with the current state-of-the-art discriminative architecture [Mao et al., 2019, Wei et al., 2020]. Myronenko [2018] takes an analogous approach, regularising an encoder-decoder model for brain tumour segmentation on magnetic resonance images by simultaneously modelling the distribution of the data with a variational autoencoder (VAE) [Kingma and Welling, 2013]. There, the aim is to achieve robust performance within a low-data regime, which coincides with the demand for OoD generalisation.
In short, our contributions to the problem of achieving robustness to distributional shift in human motion prediction are as follows: 1. We provide a framework to benchmark OoD performance on the most widely used open-source motion capture datasets, Human3.6M [Ionescu et al., 2013] and CMU-Mocap, and evaluate state-of-the-art models on it.
2. We present a framework for hardening deep feed-forward models to OoD samples. We show that the hardened models are fast to train, and exhibit substantially improved OoD performance with minimal impact on ID performance.
We begin in section 2 with a brief review of human motion prediction with deep neural networks, and of OoD generalisation using generative models. In section 3, we define a framework for benchmarking OoD performance using open-source multi-action datasets. In section 4 we introduce the discriminative models that we harden using a generative branch to achieve a state-of-the-art (SOTA) OoD benchmark. We then turn in section 5 to the architecture of the generative model and the overall objective function. Section 6 presents our experiments and results. We conclude in section 7 with a summary of our results, current limitations and caveats, and future directions for developing robust and reliable OoD performance and a quantifiable awareness of unfamiliar behaviour.

Related Work
Deep-network based human motion prediction. Historically, sequence-to-sequence prediction using Recurrent Neural Networks (RNNs) has been the de facto standard for human motion prediction [Fragkiadaki et al., 2015, Jain et al., 2016, Martinez et al., 2017, Guo and Choi, 2019, Gopalakrishnan et al., 2019, Li et al., 2020b]. Currently, the SOTA is dominated by feed-forward models [Butepage et al., 2017, Li et al., 2018, Mao et al., 2019, Wei et al., 2020], which are inherently faster and easier to train than RNNs. The jury is still out, however, on the optimal way to handle temporality for human motion prediction. Meanwhile, recent trends have overwhelmingly shown that graph-based approaches are an effective means of encoding the spatial dependencies between joints [Mao et al., 2019, Wei et al., 2020] or sets of joints [Li et al., 2020b]. In this study, we consider the SOTA models that combine graph-based approaches with a feed-forward mechanism, as presented by Mao et al. [2019], and the subsequent extension that leverages motion attention, Wei et al. [2020]. We show that these may be augmented to improve robustness to OoD samples.
Generative models for Out-of-Distribution prediction and detection. Despite the power of deep neural networks for prediction in complex domains [LeCun et al., 2015], they face several challenges that limit their suitability for safety-critical applications. Amodei et al. [2016] list robustness to distributional shift as one of the five major challenges to AI safety. Deep generative models have been used extensively for detection of OoD inputs and have been shown to generalise well in such scenarios [Hendrycks and Gimpel, 2016, Liang et al., 2017, Hendrycks et al., 2018]. While recent work has shown some failures in simple OoD detection using density estimates from deep generative models [Nalisnick et al., 2018, Daxberger and Hernández-Lobato, 2019], they remain a prime candidate for anomaly detection [Kendall and Gal, 2017, Grathwohl et al., 2019, Daxberger and Hernández-Lobato, 2019]. Myronenko [2018] uses a Variational Autoencoder (VAE) [Kingma and Welling, 2013] to regularise an encoder-decoder architecture with the specific aim of better generalisation. By simultaneously using the encoder as the recognition model of the VAE, the model is encouraged to base its segmentations on a complete picture of the data, rather than on a reductive representation that is more likely to be overfitted to the training data. Furthermore, the original loss and the VAE's loss are combined as a weighted sum such that the discriminator's objective still dominates. Further work may also reveal useful interpretability of behaviour (via visualisation of the latent space, as in Bourached and Nachev [2019]), generation of novel motion [Motegi et al., 2018], or reconstruction of missing joints, as in Chen et al. [2015].

Quantifying out-of-distribution performance of human motion predictors
Even a very compact representation of the human body, such as OpenPose's 17-joint parameterisation [Cao et al., 2018], explodes to unmanageable complexity once a temporal dimension is introduced at the scale and granularity necessary to distinguish between different kinds of action: typically many seconds, sampled at hundredths of a second. Moreover, though there are anatomical and physiological constraints on the space of licit joint configurations and their trajectories, the repertoire of possibility remains vast, and the kinematic demarcations of teleologically different actions remain indistinct. Thus, no practically obtainable dataset may realistically represent the possible distance between instances. To simulate OoD data we first need ID data that is as small in quantity, and as narrow in domain, as possible. For this reason we propose to define OoD on multi-action motion capture datasets as the scenario where only a single action, the smallest labelled subset, is available for training and hyperparameter search. In appendix A, to show that the motion categories we have chosen can actually be distinguished at the time scales on which our trajectories are encoded, we train a simple classifier and show that it can separate the selected ID action from the others with high accuracy (100% precision and recall for the CMU dataset). In this way OoD performance may be considered over the remaining set of actions.

Background
Here we describe the current SOTA model proposed by Mao et al. [2019] (GCN). We then describe the extension by Wei et al. [2020] (attention-GCN), which precedes the GCN prediction model with a motion attention mechanism.

Problem Formulation
We are given a motion sequence $\mathbf{X}_{1:N} = (\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_N)$ consisting of $N$ consecutive human poses, where $\mathbf{x}_i \in \mathbb{R}^K$, with $K$ the number of parameters describing each pose. The goal is to predict the poses $\mathbf{X}_{N+1:N+T}$ for the subsequent $T$ time steps.

DCT-based Temporal Encoding
The input is transformed using the Discrete Cosine Transform (DCT). In this way each resulting coefficient encodes information about the entire sequence at a particular temporal frequency. Furthermore, the option to remove high or low frequencies is provided. Given a joint $k$, the position of $k$ over $N$ time steps is given by the trajectory vector $\mathbf{x}_k = [x_{k,1}, \ldots, x_{k,N}]$, which we convert to a DCT vector of the form $\mathbf{C}_k = [C_{k,1}, \ldots, C_{k,N}]$. These coefficients may be computed as

$$C_{k,l} = \sqrt{\frac{2}{N}} \sum_{n=1}^{N} x_{k,n} \frac{1}{\sqrt{1 + \delta_{l1}}} \cos\left( \frac{\pi}{2N}(2n - 1)(l - 1) \right), \quad (1)$$

where $\delta_{ij}$ denotes the Kronecker delta. If no frequencies are cropped, the DCT is invertible via the Inverse Discrete Cosine Transform (IDCT):

$$x_{k,n} = \sqrt{\frac{2}{N}} \sum_{l=1}^{N} C_{k,l} \frac{1}{\sqrt{1 + \delta_{l1}}} \cos\left( \frac{\pi}{2N}(2n - 1)(l - 1) \right). \quad (2)$$

The target is now simply the ground truth $\mathbf{x}_k$.
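As a concrete illustration, the following is a minimal sketch of this encoding using SciPy's orthonormal DCT-II, which corresponds to equations (1) and (2); the padding of the observed sequence with replicas of the last frame follows the input convention described in the appendix. Function names here are ours, not those of the released code.

```python
import numpy as np
from scipy.fft import dct, idct

def dct_encode(x_obs, T):
    """Encode an observed trajectory as DCT coefficients (equation 1).

    x_obs: (K, N) array of K pose parameters over N observed frames.
    T: number of future frames; the last observed frame is replicated
       T times, following the input convention described in the appendix.
    Returns a (K, N + T) array with one row of coefficients per joint.
    """
    pad = np.repeat(x_obs[:, -1:], T, axis=1)    # replicas of x_{k,N}
    x = np.concatenate([x_obs, pad], axis=1)     # (K, N + T)
    return dct(x, type=2, norm="ortho", axis=1)

def dct_decode(C):
    """Invert via the IDCT (equation 2); exact if no frequencies are cropped."""
    return idct(C, type=2, norm="ortho", axis=1)

x = np.random.randn(48, 10)                      # e.g. K = 48 joints, N = 10
C = dct_encode(x, T=10)
assert np.allclose(dct_decode(C)[:, :10], x)     # round trip recovers the input
```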

Graph Convolutional Network
Suppose $\mathbf{C} \in \mathbb{R}^{K \times (N+T)}$ is defined on a graph with $K$ nodes and $N+T$ dimensions; we then define a graph convolutional network to respect this structure. First we define a Graph Convolutional Layer (GCL) that takes as input the activation of the previous layer, $\mathbf{A}^{[l-1]}$, where $l$ is the current layer:

$$\mathrm{GCL}(\mathbf{A}^{[l-1]}) = \mathbf{S}^{[l]} \mathbf{A}^{[l-1]} \mathbf{W}^{[l]} + \mathbf{b}^{[l]},$$

where $\mathbf{A}^{[0]} = \mathbf{C} \in \mathbb{R}^{K \times (N+T)}$, and $\mathbf{S}^{[l]} \in \mathbb{R}^{K \times K}$ is a layer-specific learnable normalised graph Laplacian that represents connections between joints. $\mathbf{W}^{[l]} \in \mathbb{R}^{n^{[l-1]} \times n^{[l]}}$ are the learnable inter-layer weightings and $\mathbf{b}^{[l]} \in \mathbb{R}^{K \times n^{[l]}}$ are the learnable biases, where $n^{[l]}$ is the number of hidden units in layer $l$.
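A minimal PyTorch sketch of such a layer follows; the initialisation choices here are our assumptions, not those of the released implementation.

```python
import torch
import torch.nn as nn

class GCL(nn.Module):
    """Graph convolutional layer: S A W + b, with a learnable dense
    graph Laplacian S over the K joint nodes (no activation here; tanh
    and batch normalisation are applied at the block level)."""
    def __init__(self, K, n_in, n_out):
        super().__init__()
        self.S = nn.Parameter(torch.eye(K))              # K x K connectivity
        self.W = nn.Parameter(torch.empty(n_in, n_out))  # inter-layer weights
        self.b = nn.Parameter(torch.zeros(K, n_out))
        nn.init.xavier_uniform_(self.W)

    def forward(self, A):                 # A: (batch, K, n_in)
        return self.S @ (A @ self.W) + self.b

layer = GCL(K=48, n_in=20, n_out=256)     # e.g. DCT inputs with N + T = 20
out = layer(torch.randn(16, 48, 20))      # -> (16, 48, 256)
```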

Network Structure and Loss
The network consists of 12 Graph Convolutional Blocks (GCBs), each containing 2 GCLs with skip (or residual) connections; see figure 5. Additionally, there is one GCL at the beginning of the network and one at the end, with $n^{[l]} = 256$ for each layer $l$. There is one final skip connection from the DCT inputs to the DCT outputs, which greatly reduces training time. The model has around 2.6M parameters. Hyperbolic tangent functions are used as the activation function, with batch normalisation applied before each activation.
The outputs are converted back to their original coordinate system using the IDCT (equation 2) to be compared to the ground truth. The loss used for joint angles is the average $\ell_1$ distance between the ground-truth joint angles and the predicted ones. Thus, the joint angle loss is

$$\ell_a = \frac{1}{K(N+T)} \sum_{n=1}^{N+T} \sum_{k=1}^{K} \left| \hat{x}_{k,n} - x_{k,n} \right|,$$

where $\hat{x}_{k,n}$ is the predicted $k$th joint at timestep $n$ and $x_{k,n}$ is the corresponding ground truth. The model is separately trained on 3D joint coordinate prediction making use of the Mean Per Joint Position Error (MPJPE), as proposed in Ionescu et al. [2013] and used in Mao et al. [2019], Wei et al. [2020]. This is defined, for each training example, as

$$\ell_m = \frac{1}{J(N+T)} \sum_{n=1}^{N+T} \sum_{j=1}^{J} \left\| \hat{\mathbf{p}}_{j,n} - \mathbf{p}_{j,n} \right\|_2,$$

where $\hat{\mathbf{p}}_{j,n} \in \mathbb{R}^3$ denotes the predicted position of joint $j$ in frame $n$, $\mathbf{p}_{j,n}$ is the corresponding ground truth, and $J$ is the number of joints in the skeleton.
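Both losses are straightforward to express in PyTorch; a sketch, with tensor shapes as our assumption:

```python
import torch

def angle_l1_loss(pred, gt):
    """Average l1 distance over joint angles; pred, gt: (batch, N+T, K)."""
    return (pred - gt).abs().mean()

def mpjpe(pred, gt):
    """Mean Per Joint Position Error; pred, gt: (batch, N+T, J, 3)."""
    return torch.linalg.norm(pred - gt, dim=-1).mean()
```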

Motion attention extension
Wei et al. [2020] extend this model by summing multiple DCT transformations from different sections of the motion history, with weightings learned via an attention mechanism. For this extension, the above model (the GCN), along with the anteceding motion attention, is trained end-to-end. We refer to this as the attention-GCN.
Our Approach

Myronenko [2018] augments an encoder-decoder discriminative model by using the encoder as the recognition model of a Variational Autoencoder (VAE) [Kingma and Welling, 2013, Rezende et al., 2014], and shows this to be a very effective regulariser. Here, for conjugacy with the discriminator, we consider the Variational Graph Autoencoder (VGAE), proposed by Kipf and Welling [2016] as a framework for unsupervised learning on graph-structured data.
The generative model gives precedence to information that can be modelled causally, while leaving elements of the discriminative machinery, such as skip connections, to capture correlations that remain useful for prediction but are not necessarily pursuant to the objective of the generative model. In addition to providing regularisation in general, we show that we gain robustness to distributional shift across similar, but different, actions that are likely to share generative properties. The architecture may be visualised with the aid of figure 1.

Variational Graph Autoencoder (VGAE) Branch and Loss
Here we define the first 6 GCB blocks as our VGAE recognition model, with a latent variable $\mathbf{z} \in \mathbb{R}^{K \times n_z}$, $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}_z, \boldsymbol{\sigma}_z)$, where $\boldsymbol{\mu}_z \in \mathbb{R}^{K \times n_z}$ and $\boldsymbol{\sigma}_z \in \mathbb{R}^{K \times n_z}$. Here $n_z = 8$ or $32$, depending on training stability.
The KL divergence between the latent space distribution and a spherical Gaussian $\mathcal{N}(0, I)$ is given by

$$\mathcal{L}_{\mathrm{KL}} = \frac{1}{2} \sum_{k=1}^{K} \sum_{i=1}^{n_z} \left( \mu_{k,i}^2 + \sigma_{k,i}^2 - \log\left(\sigma_{k,i}^2\right) - 1 \right).$$

The decoder part of the VGAE has the same structure as the discriminative branch: 6 GCBs. We parametrise the output neurons as $\boldsymbol{\mu} \in \mathbb{R}^{K \times (N+T)}$ and $\log(\boldsymbol{\sigma}^2) \in \mathbb{R}^{K \times (N+T)}$. We can now model the reconstruction of the inputs as maximum likelihood under a Gaussian distribution, which constitutes the second term of the negative Variational Lower Bound (VLB) of the VGAE:

$$\mathcal{L}_{\mathrm{rec}} = \frac{1}{2} \sum_{k=1}^{K} \sum_{l=1}^{N+T} \left( \log \sigma_{k,l}^2 + \frac{\left( C_{k,l} - \mu_{k,l} \right)^2}{\sigma_{k,l}^2} \right),$$

where $C_{k,l}$ are the DCT coefficients of the ground truth.
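Both terms of the negative VLB take standard forms; a PyTorch sketch, with names and shapes as our assumptions:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma) || N(0, I) ) for a diagonal Gaussian.
    mu, logvar: (batch, K, n_z)."""
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)

def gaussian_nll(mu, logvar, target):
    """Negative log-likelihood of the ground-truth DCT coefficients under
    the decoder's diagonal Gaussian, up to an additive constant.
    mu, logvar, target: (batch, K, N + T)."""
    return 0.5 * torch.sum(logvar + (target - mu).pow(2) / logvar.exp())
```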

Training
We train the entire network together with the addition of the negative VLB:

$$\mathcal{L} = \ell + \lambda \left( \mathcal{L}_{\mathrm{KL}} + \mathcal{L}_{\mathrm{rec}} \right),$$

where $\ell$ is the discriminative loss ($\ell_a$ or $\ell_m$) and $\lambda$ is a hyperparameter of the model. The overall network has approximately 3.4M parameters. The number of parameters varies slightly with the number of joints, $K$, since this is reflected in the size of the graph in each layer ($K = 48$ for H3.6M, $K = 64$ for CMU joint angles, and $K = 75$ for CMU Cartesian coordinates). Furthermore, once trained, the generative model is not required for prediction, and hence for this purpose the hardened model is as compact as the original.
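Assembling the overall objective might then look as follows, reusing the two terms sketched above (a sketch under our naming assumptions, not the released code):

```python
def total_loss(pred_loss, mu_z, logvar_z, mu_c, logvar_c, C_target, lam):
    """Discriminative loss plus the lambda-weighted negative VLB."""
    return pred_loss + lam * (kl_to_standard_normal(mu_z, logvar_z)
                              + gaussian_nll(mu_c, logvar_c, C_target))
```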

Datasets and Experimental Setup
Human3.6M (H3.6M). The H3.6M dataset [Ionescu et al., 2011, 2013], so called as it contains a selection of 3.6 million 3D human poses and corresponding images, consists of seven actors each performing 15 actions, such as walking, eating, discussion, sitting, and talking on the phone. Martinez et al. [2017], Mao et al. [2019], Li et al. [2020b] all follow the same training and evaluation procedure: training their motion prediction model on 6 of the actors (5 for training and 1 for cross-validation), for each action, and evaluating metrics on the final actor, subject 5. For easy comparison to these ID baselines, we maintain the same train, cross-validation, and test splits. However, we use the single most well-defined action (see appendix A), walking, for training and cross-validation, and report test error on all the remaining actions from subject 5. In this way we conduct all parameter selection based on ID performance.
CMU motion capture (CMU-mocap). The CMU dataset consists of 5 general classes of actions. Similarly to [Li et al., 2018, 2020a, Mao et al., 2019], we use 8 detailed actions from these classes: 'basketball', 'basketball signal', 'directing traffic', 'jumping', 'running', 'soccer', 'walking', and 'window washing'. We use two representations: a 64-dimensional vector that gives an exponential map representation [Grassia, 1998] of the joint angles, and a 75-dimensional vector that gives the 3D Cartesian coordinates of 25 joints. We do not tune any hyperparameters on this dataset, and use only a train and test set with the same split as is common in the literature [Martinez et al., 2017, Mao et al., 2019].

Model configuration
We implemented the model in PyTorch [Paszke et al., 2017] using the ADAM optimiser [Kingma and Ba, 2014]. The learning rate was set to 0.0005 for all experiments; unlike Mao et al. [2019], Wei et al. [2020], we did not decay the learning rate, as it was hypothesised that the dynamic relationship between the discriminative and generative losses would make this redundant. The batch size was 16. For numerical stability, gradients were clipped to a maximum 2-norm of 1, and log(σ²) values were clamped between −20 and 3.
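As an illustration of these settings, a minimal training-step sketch (the model and loss here are stand-ins, not the released implementation):

```python
import torch
import torch.nn as nn

model = nn.Linear(960, 960)   # stand-in for the GCN plus generative branch
optimiser = torch.optim.Adam(model.parameters(), lr=5e-4)  # no lr decay

for step in range(100):
    x = torch.randn(16, 960)                    # batch size 16
    optimiser.zero_grad()
    loss = (model(x) - x).abs().mean()          # stand-in loss
    loss.backward()
    # clip gradients to a maximum 2-norm of 1 for numerical stability
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimiser.step()

# within the generative head, log-variances are kept in a stable range:
# logvar = logvar.clamp(-20.0, 3.0)
```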

                 Walking        Eating         Smoking        Discussion     Average
milliseconds     560    1000    560    1000    560    1000    560    1000    560    1000
GCN (OoD)        0.80   0.80    0.89   1.20    1.26   1.85    1.45   1.88    1.10   1.43
ours (OoD)       0.66   0.72    0.90   1.19    1.17   1.78    1.44   1.90    1.04   1.40

Table 2: Long-term prediction of Euclidean distance between predicted and ground truth joint angles on H3.6M.

Baseline comparison. Both Mao et al. [2019] (GCN) and Wei et al. [2020] (attention-GCN) use the same Graph Convolutional Network (GCN) architecture with DCT inputs. In particular, Wei et al. [2020] increase the amount of history accounted for by the GCN by adding a motion attention mechanism that weights the DCT coefficients from different sections of the history before they are input to the GCN. We compare against both of these baselines on OoD actions. For attention-GCN we leave the attention mechanism preceding the GCN unchanged, such that the generative branch of the model reconstructs the weighted DCT inputs to the GCN and the whole network remains end-to-end differentiable.
Hyperparameter search. Since a new term has been introduced to the loss function, it was necessary to determine a sensible weighting between the discriminative and generative models. In Myronenko [2018], this weighting was arbitrarily set to 0.1. It is natural that the optimum value here will relate to the other regularisation parameters of the model. Thus, we conducted random hyperparameter search for p_drop and λ, in the ranges p_drop ∈ [0, 0.5] on a linear scale and λ ∈ [0.00001, 10] on a logarithmic scale. For fair comparison we also conducted hyperparameter search on GCN, for values of the dropout probability (p_drop) between 0.1 and 0.9. For each model, 25 experiments were run and the optimum values were selected by lowest ID validation error. The hyperparameter search was conducted only for the GCN model on short-term predictions for the H3.6M dataset, and the chosen values were used for all subsequent experiments, demonstrating the generalisability of the architecture.
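A sketch of this random search procedure (our paraphrase of the stated ranges):

```python
import math
import random

def sample_hyperparameters():
    """One draw of the random search over p_drop (linear) and lambda (log)."""
    p_drop = random.uniform(0.0, 0.5)
    lam = 10 ** random.uniform(math.log10(1e-5), math.log10(10.0))
    return p_drop, lam

trials = [sample_hyperparameters() for _ in range(25)]
# train one model per draw; keep the one with the lowest ID validation error
```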

Results
Consistent with the literature, we report short-term (< 500ms) and long-term (> 500ms) predictions. In comparison to GCN, we take short-term history into account (10 frames, 400ms) for both datasets to predict both short- and long-term motion. In comparison to attention-GCN, we take long-term history (50 frames, 2 seconds) to predict the next 10 frames, and predict further into the future by recursively feeding the predictions back as input, as in Wei et al. [2020].

Conclusion
We draw attention to the need for robustness to distributional shift in predicting human motion, and propose a framework for its evaluation based on major open-source datasets. We demonstrate that state-of-the-art discriminative architectures can be hardened to extreme distributional shifts by augmentation with a generative model, combining low in-distribution predictive error with maximal generalisability. The introduction of a surveyable latent space further provides a mechanism for model perspicuity and interpretability, and explicit estimates of uncertainty facilitate the detection of anomalies: both characteristics are of substantial value in emerging applications of motion prediction, such as autonomous driving, where safety is paramount. Our investigation argues for wider use of generative models in behavioural modelling, and shows it can be done with minimal or no performance penalty, within hybrid architectures of potentially diverse constitution.

A Action classification

The general increase in distinguishability that can be seen in figure 3b increases the demand for robust handling of distributional shifts, as the distributions of values that represent different actions only become more pronounced as the time scale is increased. This is true even with the naïve DCT transformation, which captures longer time scales without increasing vector size.

As we can see from the confusion matrix in figure 3c, the actions in the CMU dataset are even more easily separable. In particular, our selected ID action in the paper, basketball, can be identified with 100% precision and recall on the test set.

B Latent space of the VGAE
One of the advantages of having a generative model is that we have a latent variable which represents a distribution over deterministic encodings of the data. We considered the question of whether or not the VGAE was learning anything interpretable with its latent variable, as was the case in Kipf and Welling [2016].
The purpose of this investigation was two-fold. First, to determine whether the generative model was learning a comprehensive internal state, or just a non-linear average state, as is common in the training of VAE-like architectures; the result should suggest a key direction for future work. Second, an interpretable latent space may be of paramount usefulness for future applications of human motion prediction. Namely, if dimensionality reduction of the latent space to an inspectable number of dimensions yields actions or behaviours that are close together when kinematically or teleologically similar, as in Bourached and Nachev [2019], then human experts may find unbounded potential application for an interpretation that is both quantifiable and qualitatively comparable to all other classes within their domain of interest. For example, a medical doctor may consider a patient to have unusual symptoms for condition, say, A. It may be useful to know that the patient's deviation from a classical case of A is in the direction of condition, say, B.
We trained the augmented GCN model discussed in the main text with all actions, for both datasets. We use Uniform Manifold Approximation and Projection (UMAP) [McInnes et al., 2018] to project the latent space of the trained GCN models onto 2 dimensions, for all samples in each dataset independently. From figure 4 we can see that for both models the 2D projection relatively closely resembles a spherical Gaussian. Further, we can see from figure 4b that the action walking does not occupy a discernible domain of the latent space. This result is further verified by using the same classifier as in appendix A, which achieved no better than chance when using the latent variables as input rather than the raw data. This implies that the benefit of the generative model observed in the main text is significant even though the generative model itself performs poorly; in this case we can be sure that the reconstructions are at least not good enough to distinguish between actions. It is hence natural for future work to investigate whether the improvement in OoD performance is greater when the model is trained in such a way as to ensure that the generative model performs well. There are multiple avenues through which such an objective might be achieved; pre-training the generative model is one of the salient candidates.
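The projection step is brief with the umap-learn package; a sketch, with a random stand-in for the latent means:

```python
import numpy as np
import umap  # pip install umap-learn

Z = np.random.randn(5000, 384)   # stand-in: H3.6M latents, K * n_z = 48 * 8
embedding = umap.UMAP().fit_transform(Z)   # defaults project to 2 dimensions
```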
Mao et al. [2019] use the DCT transform with a graph convolutional network architecture to predict the output sequence. This is achieved by having an equal-length input-output sequence, where the input is the DCT transformation of $\mathbf{x}_k = [x_{k,1}, \ldots, x_{k,N}, x_{k,N+1}, \ldots, x_{k,N+T}]$; here $[x_{k,1}, \ldots, x_{k,N}]$ is the observed sequence and $[x_{k,N+1}, \ldots, x_{k,N+T}]$ are replicas of $x_{k,N}$ (i.e. $x_{k,n} = x_{k,N}$ for $n \geq N$).
(a) Distribution of short-term training instances for actions in H3.6M. (b) Distribution of training instances for actions in CMU.

Figure 3: Confusion matrices for a multi-class classifier for action labels. In each case we use the same input convention $\mathbf{x}_k = [x_{k,1}, \ldots, x_{k,N}, x_{k,N+1}, \ldots, x_{k,N+T}]$, where $x_{k,n} = x_{k,N}$ for $n \geq N$, such that the input to the classifier is $48 \times 20 = 960$ dimensional. The classifier has 4 fully connected layers. Layer 1: input dimensions × 1024; layer 2: 1024 × 512; layer 3: 512 × 128; layer 4: 128 × 15 (or 128 × 8 for CMU), where the final layer uses a softmax to predict the class label. Cross-entropy loss is used for training, with ReLU activations and a dropout probability of 0.5. We used a batch size of 2048 and a learning rate of 0.00001.
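A PyTorch sketch of this classifier, reconstructed from the caption (the H3.6M variant with 15 output classes):

```python
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(960, 1024), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(128, 15),    # 15 H3.6M actions; 8 for the CMU variant
)
# trained with nn.CrossEntropyLoss (softmax is applied internally),
# batch size 2048, learning rate 1e-5
```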

Figure 4: Latent embedding of the trained model on both the H3.6M and the CMU datasets, independently projected into 2D using UMAP from 384 dimensions for H3.6M and 512 dimensions for CMU, with default UMAP hyperparameters.

Figure 5: Network architecture: 12 Graph Convolutional Blocks (GCBs), each containing 2 GCLs with skip connections.

Table 1: Short-term prediction of Euclidean distance between predicted and ground truth joint angles on H3.6M.

Code for all experiments is available at the following link: https://github.com/bouracha/OoDMotion