Code retrieval is a critical yet underexplored area in artificial intelligence. While text retrieval systems have seen remarkable success in natural language processing (NLP) tasks, these approaches often fall short when applied to code. Developers face unique challenges when retrieving code snippets, such as understanding syntax, control flow, and variable dependencies. Enter SFR-Embedding-Code, a family of code embedding models designed to address these challenges and change how we retrieve and generate code. Whether you’re a seasoned programmer or simply curious about the intersection of AI and coding, this blog will walk you through how SFR-Embedding-Code is reshaping the landscape.
Why Do We Need Retrieval for Code Generation?
Developers often require code that adheres to specific project conventions, handles dependencies, or integrates correctly with frameworks. Code generation models may produce generic or incomplete results that fail to address these nuances. Retrieval-augmented generation approaches, by contrast, fetch tested, real-world examples directly from source code repositories, ensuring relevance and adherence to best practices, which makes tasks such as library migrations more robust and efficient.
Figure 1: When migrating from transformers==3.x to transformers==4.x, you often run into code-breaking changes. The figure emphasizes that generating fixes alone is insufficient—you must retrieve information from multiple sources: the library’s GitHub PRs and Release Notes (to learn what changed), your own codebase (to find outdated usage), and community examples (to see the recommended new patterns).
By integrating retrieval with generation, developers significantly reduce time spent debugging or modifying incomplete generated code.
Why Do We Need Code Retrieval Instead of Text Retrieval for Retrieving Code?
Text-based retrievers trained on natural language embeddings often fail to capture the structural and semantic nuances of code. Code retrieval requires embedding models that are trained specifically on code, so they understand programming-specific syntax, structure, and context.
Figure 2: Text-based retrieval fails because it returns code that only works for 2D lists (e.g., [[1, 2], [3, 4]]), not arbitrary nesting (e.g., [[1, [2, 3]], 4]). Code-based retrieval surfaces an implementation that handles arbitrarily nested lists through recursion.
- A text-based retriever focuses on keyword similarity (e.g., “flatten” and “list”) and surface-level matches, overlooking deeper semantic relationships in the code.
- A code-based retriever understands the syntax, logic, and patterns in code, enabling it to retrieve contextually correct and general-purpose implementations.
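To make Figure 2 concrete, here is a minimal sketch (function names are illustrative) of the two behaviors: a shallow version that only flattens 2D lists, and the general-purpose recursive version that a code-aware retriever can surface.

```python
# Shallow version: only handles 2D lists such as [[1, 2], [3, 4]].
def flatten_2d(nested):
    return [item for sublist in nested for item in sublist]

# General-purpose version: handles arbitrary nesting such as [[1, [2, 3]], 4]
# by recursing into any element that is itself a list.
def flatten(nested):
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))  # recurse into nested lists
        else:
            flat.append(item)
    return flat

print(flatten_2d([[1, 2], [3, 4]]))  # [1, 2, 3, 4]
print(flatten([[1, [2, 3]], 4]))     # [1, 2, 3, 4]
```

A keyword-matching retriever tends to return the first snippet because it shares surface terms like “flatten” and “list”; a code-aware retriever can recognize that only the recursive version generalizes to the query’s nested input.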
Meet SFR-Embedding-Code: Salesforce AI Research’s Code Embedding Models
SFR-Embedding-Code introduces a family of large-scale, open-source embedding models with parameter sizes ranging from 400 million to 7 billion. These models redefine the state of the art in code retrieval, outperforming the second-best model by over 20% on the CoIR benchmark.
SFR-Embedding-Code is designed for both code and text retrieval, making it a versatile tool across domains. With its cutting-edge capabilities and broad applicability, SFR-Embedding-Code stands out as a transformative tool in the AI and coding space. Let’s break it down into key highlights:
- Unified Framework: SFR-Embedding-Code converts diverse programming tasks into a common retrieval format. It supports 12 programming languages and five retrieval categories, including code-to-text, text-to-code, and hybrid tasks.
- Performance Boost: Leveraging LLMs with bidirectional attention, the 7B model achieves unprecedented results in code retrieval, setting a new benchmark on CoIR while maintaining competitive performance on BEIR for text retrieval.
- Open Source: Unlike closed models, SFR-Embedding-Code is open to the community, fostering transparency and collaboration.
- Retrieval-Augmented Generation (RAG): Enhanced retrieval improves downstream tasks like code summarization, completion, and issue fixing, making it invaluable for developers.
- Scalable Model Choices: SFR-Embedding-Code provides multiple model sizes, each optimized for different use cases. The 7B model is designed for maximum quality, excelling in performance but requiring more compute resources. The 400M model prioritizes efficiency, offering faster inference, lower cost, and the ability to run on CPUs. The 2B model strikes a balance between quality and efficiency, serving as a practical middle ground for a wide range of applications.
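As a quick illustration of how a checkpoint like this might be used, the sketch below encodes a natural-language query and a couple of code snippets with a Hugging Face model. The repository id, the mean pooling, and the absence of a task-specific prompt are assumptions made for illustration; the released model cards prescribe the official usage.

```python
# Minimal sketch: encode a text query and candidate code snippets with a
# Hugging Face embedding checkpoint. The repo id and mean pooling below are
# illustrative assumptions, not the official recipe.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "Salesforce/SFR-Embedding-Code-400M_R"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    pooled = (hidden * mask).sum(1) / mask.sum(1)      # mean pooling (assumed)
    return F.normalize(pooled, dim=-1)                 # unit-length embeddings

query_emb = embed(["function that flattens an arbitrarily nested list"])
code_embs = embed(["def flatten(x): ...", "def sort_list(x): ..."])
print(query_emb @ code_embs.T)                         # cosine similarities
```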
To build an effective code retrieval system, we leverage existing datasets by structuring them into meaningful (query, document) pairs for contrastive learning.
To enable Text-to-Code retrieval, we leverage Text2SQL and Code Contest tasks, treating queries as textual descriptions and documents as corresponding code solutions. Code-to-Text retrieval relies on Code Summarization, using full code as the query and its concise summary as the document.
For Code-to-Code retrieval, we use:
- Code Translation treats code written in one programming language (PL A) as the query and its equivalent translation in another language (PL B) as the positive document, ensuring bidirectional learning.
- Code Clone Detection identifies functionally identical but syntactically different code snippets and repurposes them into (query, document) pairs, capturing variations in implementation.
- Code Completion uses a given code snippet as the query and the next logical segment as the positive document, aligning with how developers iteratively build code.
Expanding into Hybrid Code Retrieval, we introduce more dynamic query-document structures:
- Code Issue Fixing pairs bug reports (queries) with their corresponding fixes (documents).
- Code Agent Conversations use past conversational turns as queries and the next expected response as the document.
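One way to picture this unification: every source task is mapped onto the same (query, positive document) schema, whether the query is text, code, or a conversation turn. The field names and sample record below are hypothetical, intended only to show the shape of the data.

```python
# Illustrative unification of heterogeneous tasks into (query, positive) pairs.
# Field names and the example record are hypothetical, chosen to show the schema.

def to_pair(task, record):
    if task == "text2sql":           # Text-to-Code
        return record["question"], record["sql"]
    if task == "summarization":      # Code-to-Text
        return record["code"], record["summary"]
    if task == "translation":        # Code-to-Code (used in both directions)
        return record["code_lang_a"], record["code_lang_b"]
    if task == "completion":         # Code-to-Code
        return record["prefix"], record["next_segment"]
    if task == "issue_fixing":       # Hybrid
        return record["bug_report"], record["fix"]
    raise ValueError(f"unknown task: {task}")

query, positive = to_pair(
    "text2sql",
    {"question": "Count users per country",
     "sql": "SELECT country, COUNT(*) FROM users GROUP BY country;"},
)
```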
To further enhance code retrieval capabilities, we incorporate hard negative examples drawn from the same programming language, helping the model generalize and distinguish relevant matches from superficially similar but irrelevant ones. By repurposing these datasets into structured query-document pairs, we create a robust retrieval system tailored for efficient and effective code search.
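For readers curious about the objective itself, below is a generic InfoNCE-style contrastive loss with in-batch negatives plus one mined same-language hard negative per query. It is a standard formulation shown for illustration, not the exact loss or hyperparameters used to train SFR-Embedding-Code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, hard_neg, temperature=0.05):
    """InfoNCE with in-batch negatives plus one same-language hard negative.

    q, pos, hard_neg: (B, H) L2-normalized embeddings of queries, positive
    documents, and mined hard-negative documents. The temperature is illustrative.
    """
    # Similarity of each query to every positive in the batch: (B, B).
    # Off-diagonal entries act as in-batch negatives.
    sim_pos = q @ pos.T / temperature
    # Similarity of each query to its own hard negative: (B, 1).
    sim_hard = (q * hard_neg).sum(dim=-1, keepdim=True) / temperature
    logits = torch.cat([sim_pos, sim_hard], dim=1)      # (B, B + 1)
    labels = torch.arange(q.size(0), device=q.device)   # diagonal = true positives
    return F.cross_entropy(logits, labels)
```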
Real-World Application Results
Code Retrieval
*Numbers marked with an asterisk are reported from the CoIR Leaderboard.*
Averaged over all 10 datasets in the CoIR benchmark, SFR-Embedding-Code 7B is currently the top-ranked model, showcasing the effectiveness of its combined code and text training. The smaller SFR-Embedding-Code 400M and 2B models also outperform comparably sized models, offering efficient alternatives with advantages in latency and cost. Their success underscores the robustness and generalizability of our training methodology.
Bridging Text and Code Retrieval
SFR-Embedding-Code does more than handle code alone. Its text retrieval capabilities make it a versatile, dual-purpose tool for locating documentation for specific code snippets, searching for explanations or tutorials, and integrating text and code insights into cohesive solutions.
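Because text and code share one embedding space, documentation paragraphs and code snippets can sit in the same index and be ranked against a single query. The sketch below assumes an embed() helper like the one sketched earlier; the corpus entries are made up.

```python
import numpy as np

# A mixed corpus: documentation text and code live in the same index.
corpus = [
    ("doc",  "pandas.DataFrame.merge: merge DataFrame objects with a database-style join."),
    ("code", "df_out = left.merge(right, on='user_id', how='inner')"),
    ("doc",  "itertools.chain flattens one level of nesting across iterables."),
]

# Assumes embed(texts) -> (N, H) unit-normalized vectors, e.g. the sketch above.
corpus_embs = embed([text for _, text in corpus]).numpy()

def search(query, top_k=2):
    q = embed([query]).numpy()[0]
    scores = corpus_embs @ q                     # cosine similarity (unit vectors)
    best = np.argsort(-scores)[:top_k]
    return [(corpus[i][0], corpus[i][1], float(scores[i])) for i in best]

for kind, text, score in search("how do I join two DataFrames on a key column?"):
    print(f"[{kind}] {score:.3f} {text}")
```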
SFR-Embedding-Code consistently outperforms the strongest text embedding models, NV-Embed-V2 and SFR-V2, achieving higher average scores across text and code retrieval datasets.
Resolving GitHub Issues
SFR-Embedding-Code was evaluated on SWE-Bench Lite following the CodeRAG-Bench setup, a 300-problem benchmark in which multiple files must be modified to pass test cases. The improved retrieval quality of the SFR-Embedding-Code model family boosted problem-solving accuracy and efficiency, approaching gold-standard retrieval performance in a standardized Docker environment.
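Conceptually, the retrieval-augmented issue-fixing pipeline looks like the sketch below: chunk the repository, rank chunks against the issue text with the embedding model, and hand the top matches to a code generation model as context. The chunking scheme and prompt format here are illustrative, not the SWE-Bench Lite or CodeRAG-Bench protocol, and embed() again refers to the generic helper sketched earlier.

```python
# Generic retrieval-augmented issue fixing: illustrative chunking and prompt,
# not the exact SWE-Bench Lite / CodeRAG-Bench setup.
from pathlib import Path

def chunk_repo(repo_dir, max_lines=40):
    """Split every Python file into fixed-size line chunks (illustrative)."""
    chunks = []
    for path in Path(repo_dir).rglob("*.py"):
        lines = path.read_text(errors="ignore").splitlines()
        for start in range(0, len(lines), max_lines):
            chunks.append((str(path), "\n".join(lines[start:start + max_lines])))
    return chunks

def build_fix_prompt(issue_text, chunks, top_k=5):
    # Rank chunks against the issue with the embedding model (embed() as above).
    chunk_embs = embed([text for _, text in chunks])
    scores = (embed([issue_text]) @ chunk_embs.T)[0]
    top = scores.topk(min(top_k, len(chunks))).indices.tolist()
    context = "\n\n".join(f"# {chunks[i][0]}\n{chunks[i][1]}" for i in top)
    return f"Issue:\n{issue_text}\n\nRelevant code:\n{context}\n\nPropose a patch:"
```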
Conclusion
SFR-Embedding-Code represents a major leap forward in AI-driven code retrieval, bridging the gap between text and code search, enabling higher productivity, and driving innovation. Its open-source nature invites collaboration and transparency, empowering the broader AI and developer communities to build on its foundation.