Replit-Code-v1 (3B)

Replit's powerful code completion model, replit-code-v1-3b, supports 20 programming languages.

Basic Information

  • Model Name: replit-code-v1-3b
  • Developer/Creator: Replit, Inc.
  • Release Date: 2023
  • Version: v1 (3B)
  • Model Type: Causal Language Model

Description

Overview

replit-code-v1-3b is a 2.7B-parameter causal language model developed by Replit, Inc. It is focused on code completion and was trained on a diverse dataset covering 20 programming languages, including Markdown, Java, JavaScript, and Python, for a total of 525B tokens.

Key Features

  • Extensive permissively licensed training data
  • State-of-the-art results on the HumanEval and MultiPL-E benchmarks
  • Broad multi-language support covering 20 programming languages
  • Modern training techniques, including Flash Attention, ALiBi positional embeddings, and the LionW optimizer
  • High-quality curated training data with specialized filtering and cleaning

Intended Use

The model is intended to be used by anyone as a foundation for application-specific fine-tuning without strict limitations on commercial use.

Language Support

The model supports 20 different programming languages: Markdown, Java, JavaScript, Python, TypeScript, PHP, SQL, JSX, reStructuredText, Rust, C, CSS, Go, C++, HTML, Vue, Ruby, Jupyter Notebook, R, Shell.

Technical Details

Architecture

The model uses Flash Attention and ALiBi (Attention with Linear Biases) positional embeddings to enable efficient training and inference on long input sequences.
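
For illustration, here is a minimal sketch of the ALiBi bias in PyTorch. This is a simplification for clarity, not Replit's actual implementation: instead of learned positional embeddings, each attention head adds a fixed linear penalty to its scores that grows with query-key distance, which is what lets the model generalize to longer sequences.

    import torch

    def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
        """Additive ALiBi attention bias of shape (n_heads, seq_len, seq_len).

        Head h gets slope m_h = 2 ** (-8 * (h + 1) / n_heads) (the paper's
        slopes for a power-of-two head count); the bias for query i attending
        to key j is m_h * (j - i), a penalty that grows with distance under
        the causal mask (j <= i).
        """
        slopes = torch.tensor(
            [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]
        )
        pos = torch.arange(seq_len)
        distance = pos[None, :] - pos[:, None]  # (L, L); entry [i, j] = j - i
        return slopes[:, None, None] * distance[None, :, :]

    # Usage: scores = q @ k.transpose(-2, -1) / d_head ** 0.5 + alibi_bias(...)
    # followed by the causal mask and softmax, with no positional embeddings added.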

Training Data

  • The model was trained on a subset of the Stack Dedup v1.2 dataset, which contains 175B tokens across 20 programming languages.
  • The training data was repeated over 3 epochs, resulting in a total of 525B tokens used for training.
  • The model's knowledge cutoff date is unknown.

Performance Metrics

  • When fine-tuned on public Replit user code, the model outperforms much larger models, such as Code Llama 7B, on the HumanEval and MultiPL-E benchmarks.
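
Results on HumanEval and MultiPL-E are conventionally reported as pass@k. The snippet below is the standard unbiased pass@k estimator from the original HumanEval paper, shown as a general illustration of how such scores are computed rather than as Replit's exact evaluation harness:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: probability that at least one of k samples,
        drawn from n generations of which c pass the unit tests, is correct."""
        if n - c < k:
            return 1.0  # fewer than k failures, so any k-subset contains a pass
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 200 samples per problem, 37 passing -> pass@1 = 37/200
    print(pass_at_k(200, 37, 1))  # 0.185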

Usage

API Example Usage
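
A minimal completion example using the Hugging Face transformers library. It assumes the checkpoint is published as replit/replit-code-v1-3b and that you accept running the repository's custom model code (trust_remote_code=True); adjust the prompt and sampling parameters for your use case.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # The repository ships custom model and tokenizer code,
    # hence trust_remote_code=True for both loads.
    tokenizer = AutoTokenizer.from_pretrained(
        "replit/replit-code-v1-3b", trust_remote_code=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        "replit/replit-code-v1-3b", trust_remote_code=True
    )

    prompt = "def fibonacci(n):"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=128,
        do_sample=True,
        top_p=0.95,
        temperature=0.2,
        eos_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Because this is a raw completion model rather than an instruction-tuned one, the prompt should look like the beginning of the code you want continued.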

Ethical Guidelines

The model's training data was passed through data-cleansing filters, but the filtering is not exhaustive; users are still advised to exercise reasonable caution, including reviewing generated code for correctness and security, before using the model in production systems.

License Type

The model checkpoint and vocabulary file are licensed under the Creative Commons Attribution-ShareAlike 4.0 license (CC BY-SA 4.0). The source code files are licensed under the Apache 2.0 license.
