Basic Information
- Model Name: CodeGen2 (7B)
- Developer/Creator: Salesforce AI Research
- Release Date: 2023
- Version: 2.0
- Model Type: Autoregressive language model
Description
Overview
CodeGen2 (7B) is a 7-billion-parameter autoregressive language model for program synthesis. Developed by Salesforce AI Research, it generates executable code from natural language descriptions and completes partially written code snippets.
Key Features
- Supports code infilling: CodeGen2 (7B) takes your partially completed code and fills in the gaps.
- Trained on a diverse dataset: covering 12 programming languages and popular frameworks, the model adapts to a wide range of coding environments and use cases.
- Capable of multi-turn code generation and completion: you can refine and iterate on generated code over several conversational turns until it meets your specifications.
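As a sketch of how infilling works in practice, the public CodeGen2 model card describes a mask-token prompt format: the known prefix and suffix surround a mask placeholder, and the model generates the masked span after a separator token. The token names below are taken from that card and may differ between releases; buildInfillPrompt is an illustrative helper, not part of any SDK.

```javascript
// Sketch of CodeGen2's infill prompt format, following the public model
// card. Token names (<mask_1>, <sep>, <|endoftext|>) are assumptions
// taken from that card and may differ between model releases.
function buildInfillPrompt(prefix, suffix) {
  // The model is expected to emit the masked span after the final <mask_1>.
  return prefix + '<mask_1>' + suffix + '<|endoftext|>' + '<sep>' + '<mask_1>';
}

// Ask the model to fill in the body of a function.
const prompt = buildInfillPrompt(
  'def hello_world():\n    ',
  '\n    return name\n'
);
console.log(prompt);
```

The generated completion is then spliced back between the original prefix and suffix to produce the finished snippet.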
Intended Use
CodeGen2 (7B) is intended as a general-purpose assistant for program synthesis. Whether you're a seasoned developer streamlining your workflow or an aspiring coder, the model can generate code from natural language descriptions, complete partially written code snippets, and assist with code refactoring and optimization.
Language Support
Supported languages (and frameworks) are as follows: C, C++, C#, Dart, Go, Java, JavaScript, Kotlin, Lua, PHP, Python, Ruby, Rust, Scala, Shell, SQL, Swift, TypeScript, and Vue.
Technical Details
Architecture
CodeGen2 (7B) is built on the transformer architecture popularized by GPT-3, with modifications introduced for program synthesis tasks. The resulting model captures long-range dependencies in the input sequence, helping generated code stay well-structured and semantically coherent.
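To make the autoregressive part concrete, here is a toy decoding loop, not CodeGen2's actual implementation: at each step the growing token sequence is fed back in and one more token is predicted. The nextToken function below is a stand-in for the transformer itself.

```javascript
// Toy sketch of autoregressive decoding. The real model scores the whole
// vocabulary at each step; this stand-in just illustrates the loop.
function nextToken(tokens) {
  // Toy "model": emit last token + 1, and stop once 5 is reached.
  const last = tokens[tokens.length - 1];
  return last >= 5 ? null : last + 1;
}

function generate(prompt, maxNewTokens = 10) {
  const tokens = [...prompt];
  for (let i = 0; i < maxNewTokens; i++) {
    const t = nextToken(tokens); // condition on everything generated so far
    if (t === null) break;       // end-of-sequence
    tokens.push(t);
  }
  return tokens;
}

console.log(generate([1, 2])); // → [ 1, 2, 3, 4, 5 ]
```

Because each new token conditions on all previous ones, the loop is inherently sequential, which is why generation cost grows with output length.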
Training Data
This checkpoint was trained on the stricter permissive-license subset of the deduplicated Stack dataset (v1.1), exposing CodeGen2 (7B) to code ranging from complex algorithms to simple scripts.
Data Source and Size
The training set comprises approximately 1.5 billion tokens of code, curated for quality and relevance to the target programming languages.
Knowledge Cutoff
Like any model, CodeGen2 (7B) has limits to its knowledge: the cutoff is determined by the timestamp of the training data, which was collected up to June 2022.
Diversity and Bias
From niche programming domains to popular use cases, this model has been exposed to a wide range of coding practices and techniques.
Performance Metrics
On the HumanEval benchmark, CodeGen2 (7B) achieved a score of 30.7, outperforming GPT-3. On the MBPP (Mostly Basic Programming Problems) benchmark, it scored 43.1.
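For context on what such scores measure: HumanEval and MBPP results are typically reported as pass@k percentages, where n samples are drawn per problem and c of them pass the unit tests. The standard unbiased estimator can be sketched as follows (general background on the metric, not code from this model card):

```javascript
// Unbiased pass@k estimator, as commonly used for HumanEval/MBPP scores.
// n = samples generated per problem, c = samples passing the unit tests,
// k = the budget being scored (pass@1 means k = 1).
function passAtK(n, c, k) {
  if (n - c < k) return 1.0; // every size-k draw contains a passing sample
  // 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
  let prod = 1.0;
  for (let i = n - c + 1; i <= n; i++) prod *= 1 - k / i;
  return 1 - prod;
}

console.log(passAtK(2, 1, 1)); // → 0.5: one of two samples passes
```

A benchmark score is the average of this quantity over all problems, expressed as a percentage.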
Usage
API Usage Example
const { OpenAI } = require('openai');

const api = new OpenAI({
  baseURL: 'https://api.aimlapi.com/v1',
  apiKey: '<YOUR_API_KEY>',
});

const main = async () => {
  const result = await api.chat.completions.create({
    model: 'Salesforce/codegen2-7B',
    messages: [
      {
        role: 'system',
        content: 'You are an SQL code assistant.',
      },
      {
        role: 'user',
        content: 'Could you please provide me with an example of a database structure that I could use for a project in MySQL?',
      },
    ],
  });

  const message = result.choices[0].message.content;
  console.log(`Assistant: ${message}`);
};

main();
License Type
CodeGen2 (7B) is available under a commercial license. Developers interested in using the model for commercial purposes should contact Salesforce for licensing information and terms of use.