Vecerik, Mel; (2025) Vision-Based Spatial Representations for Robot Manipulation. Doctoral thesis (Ph.D), UCL (University College London).
Text: Vecerik_10203467_Thesis.pdf (35MB)
Abstract
Robotics focuses on how to control complex mechanical machines to perform useful behaviours. The range of behaviours that machines can exhibit is limited not only by their physical capabilities but also by their ability to perceive the environment. Cheap RGB cameras are readily available, yet using their data for robotic applications remains difficult. To confront this challenge, we leverage progress in deep learning and computer vision, employing spatial representations. In this thesis we show how to effectively extract information from visual data in the form of keypoints, and demonstrate how to use it as a compact and meaningful representation. In the first part of this work we focus on detecting a small number of keypoints from dozens of human annotations. Training on this data alone would not yield a robust detection model, so we introduce a novel way to use unlabelled multi-view data. We show that this gives us a representation which is useful not only for human-defined motions but also for learned agents. A downside of this approach is that it requires retraining the model whenever the desired points change. We address this in the second part of the thesis, where we present a method for learning a latent space of point identities without prior human annotations. We build upon this in the concluding section, where we move towards generalisation to novel, unseen objects. We show how, using a point-tracking model such as TAPIR, we can extract task information from a few demonstrations and then reproduce the motion autonomously. This enables programming robots to solve long-horizon visuo-motor tasks, such as gluing or block insertion. Notably, it works with unseen objects, requires no annotations, and is robust to background changes.
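To make the last idea above more concrete: once corresponding points on an object have been tracked in a demonstration and in the live scene (for example with a point tracker such as TAPIR), a demonstrated motion can be retargeted by estimating the rigid transform between the two point sets. The sketch below is only an illustration of that general idea, not the pipeline used in the thesis; it assumes 3D keypoint positions are already available, and all function and variable names are hypothetical.

```python
# Illustrative sketch only: retargeting a demonstrated motion by aligning
# live keypoints to demonstration keypoints with a Kabsch-style fit.
# Assumes 3D keypoint positions are already available (e.g. tracked points
# lifted with depth); names here are hypothetical, not from the thesis.
import numpy as np

def estimate_rigid_transform(demo_pts: np.ndarray, live_pts: np.ndarray):
    """Estimate (R, t) so that live_pts ~= demo_pts @ R.T + t.

    demo_pts, live_pts: (N, 3) arrays of corresponding keypoint positions.
    """
    demo_c = demo_pts.mean(axis=0)
    live_c = live_pts.mean(axis=0)
    H = (demo_pts - demo_c).T @ (live_pts - live_c)   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))            # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = live_c - R @ demo_c
    return R, t

def retarget_waypoints(demo_waypoints: np.ndarray, R: np.ndarray, t: np.ndarray):
    """Map demonstrated end-effector waypoints into the live object's frame."""
    return demo_waypoints @ R.T + t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_kps = rng.normal(size=(8, 3))                # keypoints seen in the demo
    # Simulate the same object observed in a new pose at execution time.
    angle = np.pi / 6
    R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                       [np.sin(angle),  np.cos(angle), 0.0],
                       [0.0, 0.0, 1.0]])
    t_true = np.array([0.2, -0.1, 0.05])
    live_kps = demo_kps @ R_true.T + t_true
    R, t = estimate_rigid_transform(demo_kps, live_kps)
    demo_traj = rng.normal(size=(5, 3))               # demonstrated waypoints
    new_traj = retarget_waypoints(demo_traj, R, t)
    print(np.allclose(new_traj, demo_traj @ R_true.T + t_true))  # True
```

The sign flip on the smallest singular direction keeps the estimate a proper rotation, which matters when the tracked points are noisy or nearly planar.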
| Type: | Thesis (Doctoral) |
|---|---|
| Qualification: | Ph.D |
| Title: | Vision-Based Spatial Representations for Robot Manipulation |
| Open access status: | An open access version is available from UCL Discovery |
| Language: | English |
| Additional information: | Copyright © The Author 2025. Original content in this thesis is licensed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) Licence (https://creativecommons.org/licenses/by-nc-nd/4.0/). Any third-party copyright material present remains the property of its respective owner(s) and is licensed under its existing terms. Access may initially be restricted at the author’s request. |
| UCL classification: | UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science |
| URI: | https://discovery.ucl.ac.uk/id/eprint/10203467 |



