
Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions

Villegas, R; Moraldo, H; Castro, S; Babaeizadeh, M; Zhang, H; Kunze, J; Kindermans, PJ; ... Erhan, D (2023) Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions. In: 11th International Conference on Learning Representations, ICLR 2023. International Conference on Learning Representations (ICLR): Kigali, Rwanda. Green open access.

Abstract

We present Phenaki, a model capable of realistic video synthesis given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, the limited quantity of high-quality text-video data, and the variable length of videos. To address these issues, we introduce a new model for learning video representations that compresses a video into a small set of discrete tokens. This tokenizer uses causal attention in time, which allows it to work with variable-length videos. To generate video tokens from text, we use a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to create the actual video. To address data issues, we demonstrate how joint training on a large corpus of image-text pairs together with a smaller number of video-text examples can result in generalization beyond what is available in the video datasets. Compared to previous video generation methods, Phenaki can generate arbitrarily long videos conditioned on a sequence of prompts (i.e., time-variable text or a story) in an open domain. To the best of our knowledge, this is the first paper to study generating videos from open-domain, time-variable prompts. In addition, compared to per-frame baselines, the proposed video encoder-decoder computes fewer tokens per video but yields better spatio-temporal consistency.
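
The abstract's point that causal attention in time allows variable-length input can be illustrated with a small sketch. This is not the paper's tokenizer: it is a minimal, weight-free, single-head attention over per-frame feature vectors, and the names `causal_mask` and `causal_temporal_attention` are hypothetical. It only shows the property the abstract relies on, namely that when each frame attends to itself and earlier frames, the output for the first k frames does not change if more frames are appended.

```python
import math


def causal_mask(num_frames):
    """mask[t][s] is True when frame t may attend to frame s (s <= t)."""
    return [[s <= t for s in range(num_frames)] for t in range(num_frames)]


def causal_temporal_attention(frames):
    """Dot-product attention over time with a causal mask.

    `frames` is a list of per-frame feature vectors (lists of floats).
    Assumption for illustration: queries, keys and values are the raw
    features themselves, with no learned projections.
    """
    n = len(frames)
    dim = len(frames[0])
    mask = causal_mask(n)
    outputs = []
    for t in range(n):
        # Scores only for the frames that frame t is allowed to see (0..t).
        scores = [
            sum(q * k for q, k in zip(frames[t], frames[s])) / math.sqrt(dim)
            for s in range(n)
            if mask[t][s]
        ]
        # Numerically stable softmax over the allowed frames.
        peak = max(scores)
        weights = [math.exp(x - peak) for x in scores]
        total = sum(weights)
        # Weighted average of the visible frames' features.
        outputs.append([
            sum(w * frames[s][d] for s, w in enumerate(weights)) / total
            for d in range(dim)
        ])
    return outputs


short_clip = [[1.0, 0.0], [0.0, 1.0]]
long_clip = short_clip + [[1.0, 1.0], [0.5, -0.5]]

out_short = causal_temporal_attention(short_clip)
out_long = causal_temporal_attention(long_clip)

# The first two outputs are unaffected by the frames appended afterwards.
print(out_long[:2] == out_short)
```

With a non-causal (fully bidirectional) mask, the same check would generally fail, because every frame's output would depend on the total clip length.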

Type: Proceedings paper
Title: Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions
Event: 11th International Conference on Learning Representations, ICLR 2023
Open access status: An open access version is available from UCL Discovery
Publisher version: https://openreview.net/forum?id=vOEXS39nOF
Language: English
Additional information: This version is the version of record. For information on re-use, please refer to the publisher’s terms and conditions.
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery.ucl.ac.uk/id/eprint/10196597