CCT-Code: Cross-Consistency Training for Multilingual Clone Detection and Code Search (2305.11626v2)

Published 19 May 2023 in cs.CL and cs.SE

Abstract: We consider the well-known and important tasks of clone detection and information retrieval for source code. The standard setup is to search for clones among code snippets in the same language, but it is also useful to find code snippets with identical behaviour across different programming languages. Nevertheless, multi- and cross-lingual clone detection has been little studied in the literature. We present a novel training procedure, cross-consistency training (CCT), which leverages cross-lingual similarity and which we apply to train LLMs on source code in various programming languages. We show that this training is effective for both encoder- and decoder-based models. The trained encoder-based CCT-LM model achieves a new state of the art on POJ-104 (a monolingual C++ clone detection benchmark) with 96.73% MAP and on AdvTest (a monolingual Python code search benchmark) with 47.18% MRR. The decoder-based CCT-LM model shows comparable performance on these tasks. In addition, we formulate the multi- and cross-lingual clone detection problem and present XCD, a new benchmark dataset produced from CodeForces submissions.
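The abstract does not spell out the CCT objective, so the sketch below is a rough illustration only: a minimal PyTorch implementation assuming a symmetric InfoNCE-style contrastive loss over a batch of snippet pairs that solve the same problem in two different programming languages. The function name, batch layout, and temperature value are illustrative assumptions, not the authors' exact formulation.

```python
# Hypothetical sketch of a cross-lingual consistency loss in the spirit of CCT.
# Assumptions (not from the paper): a symmetric InfoNCE objective over a batch
# of (snippet_a, snippet_b) pairs, where each pair implements the same behaviour
# in two different languages; any code LM producing one pooled embedding per
# snippet can supply emb_a and emb_b.
import torch
import torch.nn.functional as F

def cross_consistency_loss(emb_a: torch.Tensor,
                           emb_b: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    """emb_a, emb_b: [batch, dim] embeddings of paired snippets in two languages.
    Pair (i, i) is a positive; every (i, j) with j != i is an in-batch negative."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature  # cosine-similarity matrix, shape [batch, batch]
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy: align both the a->b and b->a retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

With an encoder trained under an objective like this, clone detection and cross-lingual code search both reduce to nearest-neighbour lookup in the shared embedding space, which matches how retrieval benchmarks such as POJ-104 (MAP) and AdvTest (MRR) are evaluated.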

Authors (6)
  1. Nikita Sorokin (3 papers)
  2. Dmitry Abulkhanov (7 papers)
  3. Sergey Nikolenko (33 papers)
  4. Valentin Malykh (24 papers)
  5. Anton Tikhonov (2 papers)
  6. Irina Piontkovskaya (24 papers)
Citations (3)
