Distilling LLM programming techniques
Developers are using language models (LMs) to tackle increasingly complex tasks. This is a big deal – programs that mix non-deterministic LM calls with deterministic control flow represent a new kind of program.
A lot of new terms have emerged to describe how to build this kind of program. Much of the discourse in this space does more harm than good, adding extra cognitive overhead and obscuring fundamentally simple concepts. In this post, we'll distill what's actually going on behind the jargon. Tools and external abstractions can be useful for building fast, but it's important to remember that we're still super early, and none of the tooling is mature.
The 3 most important techniques for writing effective LM programs are:
- Prompting
- Structuring output
- Augmenting context
We’ll walk through each of these techniques, mention some popular tools, and also touch on some accessory tools for orchestration and model routing.
Prompting
Prompt engineering techniques coax a language model to produce a desired output – either using a single prompt, or multiple prompts across several LM calls.
The simplest prompting techniques take place within a single LM call. These include:
- Providing examples of the desired output – a.k.a. few-shot prompting, or in-context learning
- Asking for intermediate reasoning steps – a.k.a. chain of thought
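Neither technique needs a framework. Here's a minimal sketch that combines few-shot examples with a chain-of-thought instruction in a single call, using the OpenAI SDK (the model name, prompt wording, and sentiment task are illustrative assumptions):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Few-shot examples plus a chain-of-thought instruction, all in one prompt.
prompt = """Classify the sentiment of each review as positive or negative.
Reason step by step, then give the final label.

Review: "The battery died after two days."
Reasoning: The reviewer reports a defect and disappointment.
Label: negative

Review: "Setup took thirty seconds and it just works."
Reasoning: The reviewer describes an easy, successful experience.
Label: positive

Review: "{review}"
Reasoning:"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model works here
    messages=[{"role": "user", "content": prompt.format(
        review="Shipping was slow, but the product itself is great.")}],
)
print(response.choices[0].message.content)
```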
More advanced prompting techniques involve multiple LM calls. These include:
- Using another LM call to classify user input – e.g. for content moderation
- Using another LM call to detect hallucinations – e.g. Lynx
- Breaking down a task into iterative steps – e.g. chain of density
- Asking an LM to break down the task for you – e.g. skeleton of thought
- Generating multiple responses and merging them – e.g. self-consistency, mixture of agents, tree of thought, branch-solve-merge
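As a concrete example, self-consistency amounts to sampling several answers and keeping the most common one. A minimal sketch, assuming an OpenAI-style chat API and a task whose answer is a single number:

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

def ask(question: str) -> str:
    """One sampled answer; temperature > 0 so the samples can disagree."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{question}\nAnswer with a single number only."}],
        temperature=0.8,
    )
    return response.choices[0].message.content.strip()

def self_consistent_answer(question: str, n: int = 5) -> str:
    """Sample several answers and keep the most common one (majority vote)."""
    answers = [ask(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistent_answer("A train travels 60 km in 45 minutes. What is its speed in km/h?"))
```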
Relevant tools
You don't really need external tools for these techniques – template strings and functions are quite effective – but a number of tools have emerged.
- LangChain offers components for templating prompts and pipelining LM calls.
- Domain-specific languages like BAML and LMQL provide interfaces for templating prompts.
- DSPy automatically optimizes prompts by injecting examples from input datasets, and includes high-level modules for techniques like chain-of-thought.
Structuring output
Structuring LM outputs simplifies classification and data extraction tasks, and enables integrating LMs with existing software systems.
When structuring output, you provide a JSON schema (or another form of type hint) to your LM call, and the LM's output follows the provided structure. The reliability of structure-following depends on the underlying implementation. Prompt engineering and post-processing can work with 3rd party APIs, but may be less reliable. We recommend constrained decoding as the most reliable approach, but it requires native support from API vendors or running your own inference. Small sketches of both approaches follow below.
With prompt engineering and post-processing
- The OpenAI API has supported JSON mode since Nov 2023, and other providers like Anthropic and Together also support it. We can't be sure how JSON mode is implemented, but unreliable schema-following suggests a prompt engineering based approach.
- Instructor patches the OpenAI SDK to enable structured generation using frameworks like Pydantic and Zod.
- Domain-specific languages like BAML and LMQL provide interfaces for calling LMs as typed functions.
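Here's a minimal sketch of the prompt-engineering-plus-post-processing approach, using the OpenAI SDK's JSON mode and Pydantic validation (the Invoice schema, prompt, and model name are illustrative):

```python
import json
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class Invoice(BaseModel):
    vendor: str
    total: float
    currency: str

text = "Acme Corp billed us $1200.50 USD for consulting in July."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    # JSON mode guarantees valid JSON, but does not enforce our schema.
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": f"Extract the invoice as JSON with keys vendor, total, currency.\nText: {text}",
    }],
)

# Post-processing: parse and validate. If validation fails, retry or repair here.
invoice = Invoice.model_validate(json.loads(response.choices[0].message.content))
print(invoice)
```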
With constrained decoding
- OpenAI now supports reliable structured output (August 2024), and other providers will likely follow their lead. Schema-following is generally robust, but your mileage may vary with more complex schemas.
- Outlines is a constrained decoding library for producing structured output with open-source LMs.
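And a minimal constrained-decoding sketch with Outlines and an open-weights model (the model name and the Outlines 0.x interface shown here are assumptions; check the current docs):

```python
from pydantic import BaseModel
import outlines

class Invoice(BaseModel):
    vendor: str
    total: float
    currency: str

# Load any local Hugging Face model; this model name is just an example.
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Decoding is constrained to tokens that keep the output valid against the Invoice schema.
generator = outlines.generate.json(model, Invoice)

invoice = generator("Extract the invoice from: Acme Corp billed us $1200.50 USD for consulting.")
print(invoice)  # an Invoice instance; the schema holds by construction, not by luck
```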
Augmenting context
Augmenting LM generation is a useful pattern for producing higher quality outputs by "grounding" generation with specific context.
Retrieval-augmented generation (RAG) is the most well-known pattern for augmenting LM generation. RAG sounds fancy, but it's actually quite simple (a minimal sketch follows these steps):
- Retrieve search results – e.g. from a vector database or search engine.
- (Optionally) Use a reranking model to rank the most relevant results.
- Prompt the LM with the search results.
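Here's a minimal sketch of the whole loop, with a naive keyword matcher standing in for a real vector database or search engine (the corpus, model name, and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# A toy corpus; in practice this would live in a vector database or search engine.
documents = [
    "Our refund policy allows returns within 30 days of delivery.",
    "Support is available Monday through Friday, 9am to 5pm CET.",
    "Shipping to the EU takes 3-5 business days.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Stand-in retriever: naive keyword overlap instead of vector search."""
    words = set(query.lower().split())
    return sorted(documents, key=lambda doc: len(words & set(doc.lower().split())), reverse=True)[:k]

def answer(query: str) -> str:
    # Augment the prompt with the retrieved context, then generate.
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer("How long do I have to return an item?"))
```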
Function calling and tool calling are related patterns (RAG is really a subset of function/tool calling). By structuring output, you can coax an LM into calling an external function, and then make another LM call with the results.
- In Function calling, the API returns the function(s) to call
- In Tool calling, the API automatically calls function(s) and proceeds with the next LLM call
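Here's a sketch of the function-calling flow against the OpenAI Chat Completions API: the first call returns which function to invoke, we run it ourselves, and a second call incorporates the result (the get_weather stub and model name are illustrative, and the sketch assumes the model actually chooses to call the tool):

```python
import json
from openai import OpenAI

client = OpenAI()

def get_weather(city: str) -> str:
    """The external function the model can ask us to call (stubbed here)."""
    return f"22°C and sunny in {city}"

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Lisbon?"}]

# First call: the API returns which function to call (assuming the model decides to call one).
first = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]
result = get_weather(**json.loads(call.function.arguments))  # we run the function ourselves

# Second call: append the tool result and let the model produce the final answer.
messages += [first.choices[0].message, {"role": "tool", "tool_call_id": call.id, "content": result}]
second = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
print(second.choices[0].message.content)
```

Note that this is two round trips per tool use, which is part of why the pattern can feel slow in practice.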
Relevant tools
You don't really need external tools for RAG, but some frameworks offer useful components for common tasks like parsing and chunking data. Function/tool calling is neat (using LMs to drive control flow certainly feels futuristic). But in practice, this pattern is unreliable, slow, and adds an opaque layer of indirection to your program (compared to implementing specific steps yourself).
- The OpenAI Chat Completions API supports function calling, and the Assistants API supports tool use; the Assistants API also supports a constrained form of RAG via the File Search tool.
- LangChain, LlamaIndex, and Haystack offer components for RAG.
- Agent frameworks like LangGraph, llama-agents, Ax, ControlFlow, and Crew support Tool use.
References
Andrej Karpathy, LLM OS
(Sep 2023) https://x.com/karpathy/status/1707437820045062561
TLDR looking at LLMs as chatbots is the same as looking at early computers as calculators. We're seeing an emergence of a whole new computing paradigm, and it is very early.
SGLang, LM Programs
(Jun 2024) https://arxiv.org/pdf/2312.07104
There are two common properties of LM programs: (1) LM programs typically contain multiple LLM calls interspersed with control flow. This is needed to complete complex tasks and improve overall quality. (2) LM programs receive structured inputs and produce structured outputs. This is needed to enable the composition of LM programs and to integrate LM programs into existing software systems.
Douglas Crockford, JSON vs XML
(Apr 2023) https://corecursive.com/json-vs-xml-douglas-crockford/#making-json-a-standard
The JavaScript frameworks. They have gotten so big and so weird. People seem to love them. I don’t understand why.
For a long time I was a big advocate of using some kind of JavaScript library, because the browsers were so unreliable, and the web interfaces were so incompetent, and make someone else do that work for you. But since then, the browsers have actually gotten pretty good. The web standards thing have finally worked, and the web API is stable pretty much. Some of it’s still pretty stupid, but it works and it’s reliable.
And so, when I’m writing interactive stuff in browsers now, I’m just using plain old JavaScript. I’m not using any kind of library, and it’s working for me.
BAIR, Compound AI Systems
(Feb 2024) https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems
AI caught everyone’s attention in 2023 with Large Language Models (LLMs) that can be instructed to perform general tasks, such as translation or coding, just by prompting. This naturally led to an intense focus on models as the primary ingredient in AI application development, with everyone wondering what capabilities new LLMs will bring. As more developers begin to build using LLMs, however, we believe that this focus is rapidly changing: state-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models.
The shift to compound systems in Generative AI also matches the industry trends in other AI fields, such as self-driving cars: most of the state-of-the-art implementations are systems with multiple specialized components. For these reasons, we believe compound AI systems will remain a leading paradigm even as models improve.