One of the most recent advances in natural language processing (NLP) is the emergence of large language models (LLMs) trained on vast datasets. Several LLMs are available, such as Google’s BERT and OpenAI’s GPT-2 and GPT-3; GPT-3, for instance, was trained on roughly 570 gigabytes of text. These models can generate everything from simple essays to working financial models.

AI startups, including OpenAI, Hugging Face, Cohere, and AI21 Labs, are pushing the boundaries of LLMs by training models with billions of parameters.
Here are five AI-based code generators built on large language models that can produce high-quality code:
1. OpenAI Codex
OpenAI Codex is a GPT-3-based model that powers GitHub Copilot – a tool from GitHub (a Microsoft subsidiary) that generates code within the VS Code development environment. OpenAI claims it can write code in at least a dozen languages, including JavaScript, Go, Perl, PHP, Ruby, Swift, TypeScript, and even Bash. The model is trained on billions of lines of publicly available code, such as code in GitHub repositories.

OpenAI made the model available through a private beta to developers and platform companies to build tools and integrations.
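To make the comment-to-code workflow concrete, here is an illustrative sketch: a natural-language prompt a developer might type in the editor, followed by the kind of Python completion a Codex-style model could plausibly return. The completion below is hand-written for illustration, not actual Codex output.

```python
# Prompt a developer might type in the editor:
# "Return the n-th Fibonacci number, computed iteratively."

# The kind of completion a Codex-style model could produce:
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fibonacci(10))  # 55
```

In practice the developer accepts, edits, or rejects the suggestion inline, so the model acts as a drafting aid rather than a replacement for review.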
2. Tabnine
While Tabnine is not an end-to-end code generator, it puts the auto-completion feature of integrated development environments (IDEs) on steroids. Originally developed in Rust by Jacob Jackson while he was a student at the University of Waterloo, Tabnine has evolved into a fully fledged, AI-based code-completion tool.

Tabnine supports more than 20 languages and 15 editors, including popular IDEs such as VS Code, IntelliJ, Android Studio, and even Vim. It is priced at $432 per year for a team of three developers.
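The core idea behind statistical completion can be shown with a toy model: rank candidate tokens that share the typed prefix by how often they appear in a corpus. This is a deliberately simplified stand-in for Tabnine's actual (far more sophisticated, ML-based) approach.

```python
from collections import Counter

def train(corpus_tokens):
    # Count token frequencies in a code corpus (toy stand-in for a learned model).
    return Counter(corpus_tokens)

def complete(model, prefix, k=3):
    # Return the k most frequent tokens starting with `prefix`,
    # breaking frequency ties alphabetically.
    candidates = [(tok, n) for tok, n in model.items() if tok.startswith(prefix)]
    candidates.sort(key=lambda t: (-t[1], t[0]))
    return [tok for tok, _ in candidates[:k]]

model = train(["print", "printf", "println", "print", "parse", "println"])
print(complete(model, "print"))  # ['print', 'println', 'printf']
```

A real completion engine conditions on far more context than the current prefix (surrounding code, types, project conventions), which is where the language model earns its keep.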
3. CodeT5
CodeT5 is an open-source programming language model built by researchers at Salesforce. It is based on Google’s T5 (Text-to-Text Transfer Transformer) framework. To train CodeT5, the team sourced over 8.35 million instances of code, including user comments, from publicly accessible GitHub repositories. Most of these came from the CodeSearchNet dataset, which covers Ruby, JavaScript, Go, Python, PHP, and Java, supplemented by two C and C# datasets from BigQuery.
CodeT5 can potentially bring three capabilities to software programming:
- Text-to-code generation: generate code from a natural language description
- Code autocompletion: complete an entire function given the target function name
- Code summarization: produce a natural language summary of a function
4. PolyCoder
PolyCoder is an open-source alternative to OpenAI’s Codex. Developed by researchers at Carnegie Mellon University, the model is based on OpenAI’s GPT-2 architecture and trained on a 249 GB codebase written in 12 programming languages. According to PolyCoder’s authors, the model writes C with greater accuracy than any other model, including Codex.
While most code generators are not open source, PolyCoder is one of the first open-source code-generation models.
5. Cogram
Cogram, a Y Combinator-backed startup based in Berlin, is a code-generation tool aimed at data scientists and Python programmers working with SQL queries and Jupyter Notebooks. Data scientists can write queries in plain English, which the tool translates into complex SQL queries with joins and grouping. It supports SQLite, PostgreSQL, MySQL, and Amazon Redshift.
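To illustrate the English-to-SQL idea, here is a sketch of the kind of query such a tool might produce (the SQL here is hand-written for illustration, not real Cogram output), executed against an in-memory SQLite database with hypothetical `customers` and `orders` tables:

```python
import sqlite3

# A natural-language request a data scientist might type:
#   "Total order amount per customer, largest first."
# The kind of SQL a Cogram-style tool could translate it into
# (hand-written illustration, with a join and grouping):
SQL = """
SELECT c.name, SUM(o.amount) AS total
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY total DESC
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 20.0), (3, 2, 40.0);
""")
print(conn.execute(SQL).fetchall())  # [('Ada', 50.0), ('Grace', 40.0)]
```

Because the generated SQL is shown to the user before it runs, the workflow keeps a human in the loop for correctness and safety.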

Python and Julia developers can integrate Cogram with Jupyter Notebooks to auto-generate code. The tool can generate contextual code for a specific task based on comments. Data scientists can even generate visualizations using mainstream Python modules such as Matplotlib, Plotly, or Seaborn.