Leveraging Automated Unit Tests for Unsupervised Code Translation

Zhang, Jie; Harman, Mark (2022) Leveraging Automated Unit Tests for Unsupervised Code Translation. In: Proceedings of The Tenth International Conference on Learning Representations: ICLR 2022. ICLR: Virtual conference. Green open access

Abstract

With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated, comfortably outperforming the state-of-the-art for all language pairs studied. In particular, for Java→Python and Python→C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%.
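
The filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `passes_tests` and `build_parallel_corpus`, the toy `add` function, and the hand-written test cases are all hypothetical, and the abstract does not specify which testing system generates the tests or how candidates are executed. The sketch only shows the general idea of keeping a candidate translation when it reproduces the expected output on every test.

```python
from typing import Iterable, List, Tuple

# One test case: (positional arguments, expected return value).
TestCase = Tuple[tuple, object]


def passes_tests(candidate_src: str, func_name: str,
                 tests: Iterable[TestCase]) -> bool:
    """Return True if the candidate defines func_name and matches every test.

    Assumption: candidates are Python source strings. exec() is used only
    for illustration; untrusted code should run in a sandbox with a timeout.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # syntax errors reject the candidate
        func = namespace[func_name]     # a missing function rejects it too
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False


def build_parallel_corpus(source_src: str, candidates: List[str],
                          func_name: str,
                          tests: List[TestCase]) -> List[Tuple[str, str]]:
    """Pair the source function with each candidate that passes all tests."""
    return [(source_src, c) for c in candidates
            if passes_tests(c, func_name, tests)]


# Hypothetical source function and three candidate translations.
java_src = "static int add(int a, int b) { return a + b; }"
candidates = [
    "def add(a, b):\n    return a + b\n",  # correct translation
    "def add(a, b):\n    return a - b\n",  # one wrong token
    "def add(a, b) return a + b",          # does not parse
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

corpus = build_parallel_corpus(java_src, candidates, "add", tests)
print(len(corpus))
```

Only the first candidate survives here: the second differs by a single token and fails a test, and the third is rejected because it does not parse. This mirrors the abstract's point that small token-level errors in code are detectable by execution, which is what makes a tested parallel corpus possible.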

Type:	Proceedings paper
Title:	Leveraging Automated Unit Tests for Unsupervised Code Translation
Event:	The Tenth International Conference on Learning Representations
Dates:	25 Apr 2022 - 29 Apr 2022
Open access status:	An open access version is available from UCL Discovery
Publisher version:	https://openreview.net/pdf?id=cmt-6KtR4c4
Language:	English
Additional information:	This version is the version of record. For information on re-use, please refer to the publisher’s terms and conditions.
Keywords:	unsupervised, translation, code, self-training, pseudo-labelling, unit tests, programming languages, deep learning, transformer
UCL classification:	UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
	UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
	UCL > Provost and Vice Provost Offices > UCL BEAMS
	UCL
URI:	https://discovery.ucl.ac.uk/id/eprint/10149501
