
How to get better embeddings with code pre-trained models? An empirical study (2311.08066v1)

Published 14 Nov 2023 in cs.SE

Abstract: Pre-trained LLMs have demonstrated powerful capabilities in the field of NLP. Recently, code pre-trained models (PTMs), which draw on experience from the NLP field, have also achieved state-of-the-art results in many software engineering (SE) downstream tasks. These code PTMs take into account the differences between programming languages and natural languages during pre-training and adjust the pre-training tasks and input data accordingly. However, researchers in the SE community still inherit habits from the NLP field when using these code PTMs to generate embeddings for SE downstream classification tasks, such as generating semantic embeddings for code snippets through special tokens and feeding in code and text information in the same way as during pre-training. In this paper, we empirically study five different PTMs (i.e., CodeBERT, CodeT5, PLBART, CodeGPT and CodeGen) with three different architectures (i.e., encoder-only, decoder-only and encoder-decoder) on four SE downstream classification tasks (i.e., code vulnerability detection, code clone detection, just-in-time defect prediction and function docstring mismatch detection) with respect to the two aforementioned aspects. Our experimental results indicate that (1) regardless of the architecture of the code PTM used, embeddings obtained through special tokens do not sufficiently aggregate the semantic information of the entire code snippet; (2) code embeddings obtained by combining code data and text data in the same way as during pre-training are of poor quality and cannot guarantee richer semantic information; (3) when the vector representations of all code tokens are aggregated, decoder-only PTMs can obtain code embeddings whose semantics are as rich as, or even richer than, those obtained from encoder-only and encoder-decoder PTMs.
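To make the contrast in findings (1) and (3) concrete, here is a minimal sketch of the two embedding strategies applied to a toy matrix of per-token vectors. It is an illustration under assumptions, not the paper's implementation. The abstract does not say which aggregation function is used, so mean pooling is assumed here as one common choice. The function names, the toy values, and the use of position 0 as the special token are all hypothetical (a decoder-only model would more typically use its last token).

```python
from typing import List

Vector = List[float]


def special_token_embedding(hidden_states: List[Vector], index: int = 0) -> Vector:
    """Use the vector at a single special-token position as the snippet embedding.

    index=0 mimics a [CLS]-style token at the start of the sequence (an assumption).
    """
    return list(hidden_states[index])


def mean_pool_embedding(hidden_states: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the vectors of all non-padding tokens (one possible aggregation)."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy last-layer outputs: 4 token positions x 2 dimensions (made-up values).
# The final position is padding and is masked out.
hidden_states = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

cls_vec = special_token_embedding(hidden_states)
mean_vec = mean_pool_embedding(hidden_states, attention_mask)

print(cls_vec)   # [1.0, 0.0]
print(mean_vec)  # [2.0, 2.0]
```

With a real PTM, `hidden_states` would be the model's last-layer token representations for one code snippet, and either vector would then be passed to a downstream classifier.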
