POSIT: Simultaneously Tagging Natural and Programming Languages

Partachi, P-P; Treude, C; Dash, SK; Barr, ET; (2020) POSIT: Simultaneously Tagging Natural and Programming Languages. In: Proceedings of the 42nd International Conference on Software Engineering (ICSE '20). (pp. 1348-1358). ACM: Seoul, Republic of Korea. Green open access

Abstract

Software developers use a mix of source code and natural language text to communicate with each other: Stack Overflow and developer mailing lists abound with this mixed text. Tagging this mixed text is essential for making progress on two seminal software engineering problems: traceability, and reuse via precise extraction of code snippets from mixed text. In this paper, we borrow code-switching techniques from Natural Language Processing and adapt them to mixed text to solve two problems: language identification and token tagging. Our technique, POSIT, simultaneously provides abstract syntax tree tags for source code tokens and part-of-speech tags for natural language words, and predicts the source language of each token in mixed text. To realize POSIT, we trained a biLSTM network with a Conditional Random Field output layer using abstract syntax tree tags from the CLANG compiler and part-of-speech tags from the Standard Stanford part-of-speech tagger. POSIT improves the state-of-the-art on language identification by 10.6% and PoS/AST tagging by 23.7% in accuracy.

Type: Proceedings paper
Title: POSIT: Simultaneously Tagging Natural and Programming Languages
Event: 42nd International Conference on Software Engineering (ICSE '20)
Location: Seoul, Republic of Korea
Dates: 23 May 2020 - 29 May 2020
ISBN-13: 978-1-4503-7121-6
Open access status: An open access version is available from UCL Discovery
DOI: 10.1145/3377811.3380440
Publisher version: https://doi.org/10.1145/3377811.3380440
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
Keywords: Part-of-Speech Tagging, Mixed-Code, Code-Switching, Language Identification
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery.ucl.ac.uk/id/eprint/10091199