Simplify your Code LLM solutions using CodeLLM DevKit
As our team at IBM Research started to build AI coding assistant capabilities based on large language models for code (which we call CodeLLMs), we had to choose among different static-analysis tools. These tools, such as Tree-sitter and WALA, vary in their support for programming languages, programming-language features, and types of program analysis. Each also requires a deep understanding of what it offers for a given programming language before it can produce meaningful CodeLLM prompt augmentations. As a result, supporting CodeLLM use cases such as test generation, code summarization, and code translation often requires writing a considerable amount of code.
To address these challenges, we developed the CodeLLM Development Kit (CLDK), which abstracts the interaction with the static-analysis tools and enables seamless integration with CodeLLMs.
CLDK is useful in different phases of the CodeLLM lifecycle. Developers can use it to generate instruct datasets that enhance existing code models, create fine-tuning datasets for building new code models, build code-related LLM-assisted solutions, support the evaluation of models, and simplify the development of enterprise use cases such as application modernization and other coding tasks.
At IBM Research, we have used CLDK for several enterprise use cases, including code explanation and automated test generation. We observed a significant boost in both productivity and creativity using this tool.
Our experience showed that CLDK offers several benefits. It abstracts the complex details of various program analysis engines, such as WALA, which runs on Java and typically requires significant expertise to generate code-analysis results, and Tree-sitter, which relies on query-based language processing and requires proper post-processing for extracting relevant information. CLDK also facilitates the creation of Pydantic models for code analysis (such as a Java code analysis model), and enables support for multiple program-analysis backends, including WALA and Tree-sitter. CLDK supports multiple programming languages, including Java and Python, with plans to support additional languages soon.
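To make the idea of a backend abstraction concrete, here is a minimal sketch of how two very different analysis engines can sit behind one interface and return the same typed result. It is not CLDK's API: the names ClassInfo, AstBackend, PatternBackend, and CodeAnalysis are hypothetical, Python's built-in ast and re modules stand in for engines such as WALA and Tree-sitter, and a standard-library dataclass stands in for a Pydantic model.

```python
import ast
import re
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ClassInfo:
    """Backend-independent description of one class declaration.
    (Hypothetical model; a dataclass stands in for a Pydantic model.)"""
    name: str
    start_line: int


class AnalysisBackend(Protocol):
    """What every backend must provide, whatever engine it wraps."""
    def classes(self, source: str) -> list[ClassInfo]: ...


class AstBackend:
    """Parses the source into a syntax tree, then walks the tree."""
    def classes(self, source: str) -> list[ClassInfo]:
        tree = ast.parse(source)
        found = [ClassInfo(node.name, node.lineno)
                 for node in ast.walk(tree)
                 if isinstance(node, ast.ClassDef)]
        return sorted(found, key=lambda info: info.start_line)


class PatternBackend:
    """Matches a textual pattern and post-processes the hits.
    Deliberately naive: it would also match 'class' inside a string."""
    _CLASS = re.compile(r"^[ \t]*class[ \t]+(\w+)", re.MULTILINE)

    def classes(self, source: str) -> list[ClassInfo]:
        result = []
        for match in self._CLASS.finditer(source):
            # Convert the character offset of the name into a 1-based line.
            line = source.count("\n", 0, match.start(1)) + 1
            result.append(ClassInfo(match.group(1), line))
        return result


class CodeAnalysis:
    """Facade: callers never see which engine produced the answer."""
    def __init__(self, backend: AnalysisBackend):
        self._backend = backend

    def class_names(self, source: str) -> list[str]:
        return [info.name for info in self._backend.classes(source)]


SAMPLE = '''class Account:
    def deposit(self, amount):
        self.balance += amount


class Ledger:
    pass
'''

if __name__ == "__main__":
    for backend in (AstBackend(), PatternBackend()):
        print(type(backend).__name__, CodeAnalysis(backend).class_names(SAMPLE))
```

Because both backends return the same ClassInfo objects, code written against the facade does not change when the engine underneath does. The post-processing step in PatternBackend (turning a match offset into a line number) hints at the kind of work a query-based engine leaves to the caller.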
We built the kit to offer different levels of analysis, so users can skip the more intensive analyses, such as those based on call graphs, program dependency graphs, or system dependency graphs, when they are not needed. Sometimes a symbol-table-based analysis suffices, and avoiding the heavier analyses is especially valuable for enterprise-grade projects.
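The difference between analysis levels can be illustrated with a small sketch. It assumes a hypothetical analyze function and AnalysisLevel enum, neither of which is part of CLDK, and uses Python's built-in ast module on Python source as a stand-in for a real analysis engine. The symbol table is always computed; the call graph is built only when the caller asks for it.

```python
import ast
from enum import Enum


class AnalysisLevel(Enum):
    """Hypothetical levels; a real toolkit may name and define them differently."""
    SYMBOL_TABLE = 1
    CALL_GRAPH = 2


def symbol_table(tree: ast.Module) -> dict[str, int]:
    """Map each top-level function name to the line where it is defined."""
    return {node.name: node.lineno
            for node in tree.body
            if isinstance(node, ast.FunctionDef)}


def call_graph(tree: ast.Module, symbols: dict[str, int]) -> dict[str, set[str]]:
    """Record which known functions each top-level function calls directly.
    Only plain-name calls such as f(x) are tracked; method calls are ignored."""
    edges: dict[str, set[str]] = {name: set() for name in symbols}
    for function in tree.body:
        if not isinstance(function, ast.FunctionDef):
            continue
        for node in ast.walk(function):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id in symbols):
                edges[function.name].add(node.func.id)
    return edges


def analyze(source: str, level: AnalysisLevel = AnalysisLevel.SYMBOL_TABLE) -> dict:
    """Run only as much analysis as the requested level needs."""
    tree = ast.parse(source)
    result: dict = {"symbols": symbol_table(tree)}
    if level is AnalysisLevel.CALL_GRAPH:
        # The extra traversal happens only on request.
        result["call_graph"] = call_graph(tree, result["symbols"])
    return result


SOURCE = '''def load(path):
    return open(path).read()

def count_words(path):
    return len(load(path).split())
'''

if __name__ == "__main__":
    print(analyze(SOURCE))
    print(analyze(SOURCE, AnalysisLevel.CALL_GRAPH))
```

In this toy version the extra cost of the call graph is one more tree traversal. With a whole-program engine on a large codebase the gap is far wider, which is why defaulting to the cheaper level and opting in to the expensive one is a reasonable design.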
CLDK makes code analysis significantly easier, and developers can use it without worrying about many of the underlying details. Because we believe that the future of AI relies on the open-source community, we are open-sourcing CLDK on GitHub. Because CLDK is customizable, developers can add their own static-analysis abstractions and support for additional programming languages. Check out our IBM Granite model cookbooks and our complete Python notebook with CLDK examples. To learn how to install and use CLDK, read our how-to on GitHub. We have just started this journey. Join our IBM Granite community today.