Ragkhitwetsagul, Chaiyong;
(2018)
Code similarity and clone search in large-scale source code data.
Doctoral thesis (Ph.D), UCL (University College London).
Preview |
Text
Chaiyong_Thesis_Minor_Corrections.pdf - Accepted Version Download (5MB) | Preview |
Abstract
Software development is tremendously benefited from the Internet by having online code corpora that enable instant sharing of source code and online developer's guides and documentation. Nowadays, duplicated code (i.e., code clones) not only exists within or across software projects but also between online code repositories and websites. We call them "online code clones."' They can lead to license violations, bug propagation, and re-use of outdated code similar to classic code clones between software systems. Unfortunately, they are difficult to locate and fix since the search space in online code corpora is large and no longer confined to a local repository. This thesis presents a combined study of code similarity and online code clones. We empirically show that many code snippets on Stack Overflow are cloned from open source projects. Several of them become outdated or violate their original license and are possibly harmful to reuse. To develop a solution for finding online code clones, we study various code similarity techniques to gain insights into their strengths and weaknesses. A framework, called OCD, for evaluating code similarity and clone search tools is introduced and used to compare 34 state-of-the-art techniques on pervasively modified code and boiler-plate code. We also found that clone detection techniques can be enhanced by compilation and decompilation. Using the knowledge from the comparison of code similarity analysers, we create and evaluate Siamese, a scalable token-based clone search technique via multiple code representations. Our evaluation shows that Siamese scales to large-scale source code data of 365 million lines of code and offers high search precision and recall. Its clone search precision is comparable to seven state-of-the-art clone detection tools on the OCD framework. Finally, we demonstrate the usefulness of Siamese by applying the tool to find online code clones, automatically analyse clone licenses, and recommend tests for reuse.
Type: | Thesis (Doctoral) |
---|---|
Qualification: | Ph.D |
Title: | Code similarity and clone search in large-scale source code data |
Event: | UCL (University College London) |
Open access status: | An open access version is available from UCL Discovery |
Language: | English |
Keywords: | code clones, software engineering |
UCL classification: | UCL UCL > Provost and Vice Provost Offices UCL > Provost and Vice Provost Offices > UCL BEAMS UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science |
URI: | https://discovery.ucl.ac.uk/id/eprint/10057538 |
Archive Staff Only
View Item |