Papers
Topics
Authors
Recent
Assistant
AI Research Assistant
Well-researched responses based on relevant abstracts and paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses.
Gemini 2.5 Flash
Gemini 2.5 Flash 134 tok/s
Gemini 2.5 Pro 41 tok/s Pro
GPT-5 Medium 28 tok/s Pro
GPT-5 High 42 tok/s Pro
GPT-4o 92 tok/s Pro
Kimi K2 187 tok/s Pro
GPT OSS 120B 431 tok/s Pro
Claude Sonnet 4.5 37 tok/s Pro
2000 character limit reached

Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus (2201.11313v1)

Published 27 Jan 2022 in cs.CL and cs.IR

Abstract: Semantic code search is the task of retrieving relevant code snippet given a natural language query. Different from typical information retrieval tasks, code search requires to bridge the semantic gap between the programming language and natural language, for better describing intrinsic concepts and semantics. Recently, deep neural network for code search has been a hot research topic. Typical methods for neural code search first represent the code snippet and query text as separate embeddings, and then use vector distance (e.g. dot-product or cosine) to calculate the semantic similarity between them. There exist many different ways for aggregating the variable length of code or query tokens into a learnable embedding, including bi-encoder, cross-encoder, and poly-encoder. The goal of the query encoder and code encoder is to produce embeddings that are close with each other for a related pair of query and the corresponding desired code snippet, in which the choice and design of encoder is very significant. In this paper, we propose a novel deep semantic model which makes use of the utilities of not only the multi-modal sources, but also feature extractors such as self-attention, the aggregated vectors, combination of the intermediate representations. We apply the proposed model to tackle the CodeSearchNet challenge about semantic code search. We align cross-lingual embedding for multi-modality learning with large batches and hard example mining, and combine different learned representations for better enhancing the representation learning. Our model is trained on CodeSearchNet corpus and evaluated on the held-out data, the final model achieves 0.384 NDCG and won the first place in this benchmark. Models and code are available at https://github.com/overwindows/SemanticCodeSearch.git.

Citations (3)

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Lightbulb Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

Authors (2)

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Github Logo Streamline Icon: https://streamlinehq.com

Don't miss out on important new AI/ML research

See which papers are being discussed right now on X, Reddit, and more:

“Emergent Mind helps me see which AI papers have caught fire online.”

Philip

Philip

Creator, AI Explained on YouTube