Using ChatGPT Plugins with LLaMA

Sarmad Qadri
lastmile ai — blog
Mar 26, 2023 · 5 min read


OpenAI just released initial support for plugins in ChatGPT, allowing the language model to act as an agent and interact with the outside world through APIs. Here we show a proof of concept using OpenAI’s chatgpt-retrieval-plugin with Meta’s LLaMA language model.

This is more than just a guide. It is a call to action to build an open protocol for foundation model plugins, one that lets us share plugins across LLMs and govern their interactions.

LLaMA answering a question about the LLaMA paper with the chatgpt-retrieval-plugin. So Meta!

Background

OpenAI’s documentation on plugins explains that a plugin enhances ChatGPT’s capabilities by specifying a manifest and an OpenAPI specification.
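For reference, the manifest (normally a file called ai-plugin.json) is a small JSON document. Here is a rough sketch of its shape, written as a Python dict and based on OpenAI’s published examples; the field names and values below are approximate, not a definitive template:

# Rough sketch of a plugin manifest, shown as a Python dict for illustration.
# Field names follow OpenAI's published examples and may not be exhaustive or exact.
manifest = {
    "schema_version": "v1",
    "name_for_human": "Retrieval Plugin",
    "name_for_model": "retrieval",
    "description_for_human": "Search your own documents.",
    "description_for_model": "Use this plugin to fetch document chunks relevant to a query.",
    "auth": {"type": "user_http", "authorization_type": "bearer"},
    "api": {"type": "openapi", "url": "https://example.com/.well-known/openapi.yaml"},
    "logo_url": "https://example.com/logo.png",
    "contact_email": "hello@example.com",
    "legal_info_url": "https://example.com/legal",
}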

There are few details available about how plugins are wired into ChatGPT, but OpenAI has open-sourced the chatgpt-retrieval-plugin, which performs semantic search and retrieves custom data to use as additional context.

In this guide, we take that retrieval plugin and add a script that integrates it with LLaMA 7B running on your local machine.

The code that glues the plugin to LLaMA is available in this repo (we welcome contributions):

lastmile-ai/llama-retrieval-plugin

Limitations

This approach successfully adds external context to LLaMA, albeit with gaps compared to OpenAI’s plugin approach:

  • Limitations in the underlying model. LLaMA is far from ChatGPT in many ways; it requires significant additional fine-tuning (e.g. instruction tuning along the lines of Alpaca) to follow instructions reliably.
  • Not generalizable to other plugins. The OpenAI documentation suggests ChatGPT can read a plugin’s API schema and dynamically construct the API calls that satisfy the user’s request. By contrast, things did not go well when we asked LLaMA to construct a cURL request given an OpenAPI schema (see the sketch after this list for the kind of prompt we mean). One solution would be fine-tuning a model specifically on OpenAPI schemas.
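To make the second limitation concrete, the experiment we have in mind looks roughly like this: paste a fragment of the plugin’s OpenAPI schema into the prompt and ask the model to produce the corresponding HTTP request. The schema fragment and wording below are illustrative only, not the exact prompt we used:

# Illustrative only: the kind of schema-to-API-call prompt that plain LLaMA 7B
# struggled with in our attempts. The schema fragment and wording are made up.
openapi_fragment = """
paths:
  /query:
    post:
      summary: Query the data store for relevant document chunks
      requestBody:
        content:
          application/json:
            schema:
              properties:
                queries:
                  type: array
"""

prompt = (
    "Below is part of an OpenAPI schema for a plugin:\n"
    f"{openapi_fragment}\n"
    "Write the curl command that queries this API for the title of the LLaMA paper."
)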

Demo

We first set up our data store and upload two PDFs to it — the LLaMA paper and the Conda cheatsheet.

Then we can query this data, with the relevant embeddings pulled in to the prompt as additional context.

Step-by-step guide

Step 0: Clone the llama-retrieval-plugin repo

Step 1: Set up the data store

This step is almost identical to setting up the OpenAI retrieval plugin, but simplified by using conda for environment management and Pinecone as the vector database. Following the quickstart in the repo:

Set up the environment:

conda env create -f environment.yml
conda activate llama-retrieval-plugin
poetry install

Define the environment variables:

# In production use-cases, make sure to set up the bearer token properly
export BEARER_TOKEN=test1234
export OPENAI_API_KEY=my_openai_api_key

# We used pinecone for our vector database, but you can use a different one
export DATASTORE=pinecone
export PINECONE_API_KEY=my_pinecone_api_key
export PINECONE_ENVIRONMENT=us-east1-gcp
export PINECONE_INDEX=my_pinecone_index_name

Start the server:

poetry run start

Step 2: Upload files to the data store

For this step, we used the Swagger UI available locally at http://localhost:8000/docs

Authorize:

Authorize using the value of the BEARER_TOKEN you specified in Step 1

Upsert File:

Specify any PDF file that you would like chunked and embedded into the data store

Query the data store to test:

Take the document id returned by the upsert and construct a query in the Swagger UI to see which chunks are returned for a given prompt:

{
  "queries": [
    {
      "query": "What is the title of the LLaMA paper?",
      "filter": {
        "document_id": "f443884b-d137-421e-aac2-9809113ad53d"
      },
      "top_k": 3
    }
  ]
}
The query API lets you test out the vector store and configure the filters
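If you prefer to script this step instead of using the Swagger UI, the same query can be sent with a few lines of Python. This is a minimal sketch, assuming the server from Step 1 is running locally and BEARER_TOKEN is set; substitute the document_id returned by your own upsert:

# Minimal sketch of querying the data store outside the Swagger UI.
import os
import requests

response = requests.post(
    "http://localhost:8000/query",
    headers={"Authorization": f"Bearer {os.environ['BEARER_TOKEN']}"},
    json={
        "queries": [
            {
                "query": "What is the title of the LLaMA paper?",
                # Use the document_id returned by your own upsert.
                "filter": {"document_id": "f443884b-d137-421e-aac2-9809113ad53d"},
                "top_k": 3,
            }
        ]
    },
)
response.raise_for_status()
print(response.json())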

Step 3: Set up LLaMA

Our repo includes llama.cpp as a submodule, which is what we used to run LLaMA 7B locally.

Follow the llama.cpp readme to get set up

Step 4: Use LLaMA to query your custom data

Open a new Terminal, and navigate to the llama-retrieval-plugin repo.

Activate the Conda environment (from Step 1):

conda activate llama-retrieval-plugin

Define the environment variables:

# Make sure the BEARER_TOKEN is set to the same value as in Step 1
export BEARER_TOKEN=test1234
# Set the URL to the query endpoint that you tested in Step 2
export DATASTORE_QUERY_URL=http://0.0.0.0:8000/query
# Set to the directory where you have LLaMA set up -- such as the root of the llama.cpp repo
export LLAMA_WORKING_DIRECTORY=./llama.cpp

Run the llama_with_retrieval script with the desired prompt:

python3 llama_with_retrieval.py "What is the title of the LLaMA paper?" 

This script takes the prompt, calls the query endpoint to retrieve the most relevant chunks from the data store, and then constructs a prompt for LLaMA that includes those chunks as additional context.
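The heart of the script is a retrieve-then-prompt flow. Here is a minimal sketch of that flow, not the repo’s actual code; it assumes the environment variables from this step are set, that the retrieval server from Step 1 is running, and that llama.cpp’s main binary and a quantized 7B model exist at the placeholder paths in the subprocess call:

# Minimal sketch of the retrieve-then-prompt flow; not the repo's actual script.
# The llama.cpp binary path, model path, and flags below are placeholders.
import os
import subprocess

import requests

def retrieve_context(question: str, top_k: int = 3) -> str:
    # Ask the retrieval plugin's /query endpoint for the most relevant chunks.
    # The response shape may differ slightly; check the server's /docs page.
    response = requests.post(
        os.environ["DATASTORE_QUERY_URL"],
        headers={"Authorization": f"Bearer {os.environ['BEARER_TOKEN']}"},
        json={"queries": [{"query": question, "top_k": top_k}]},
    )
    response.raise_for_status()
    chunks = response.json()["results"][0]["results"]
    return "\n".join(chunk["text"] for chunk in chunks)

def build_prompt(question: str, context: str) -> str:
    # Wrap the retrieved chunks and the user's question in a simple template.
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

if __name__ == "__main__":
    question = "What is the title of the LLaMA paper?"
    prompt = build_prompt(question, retrieve_context(question))
    subprocess.run(
        ["./main", "-m", "models/7B/ggml-model-q4_0.bin", "-n", "256", "-p", prompt],
        cwd=os.environ["LLAMA_WORKING_DIRECTORY"],
        check=True,
    )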

You can read the code here: llama-retrieval-plugin/llama_with_retrieval.py

Step 5: Tweak and experiment

You can modify the llama_with_retrieval script to experiment with different settings that may yield better performance:

  • Change the token limit (e.g. reduce it to leave more room for the model’s response); see the sketch after this list.
  • Change the prompt template and observe model behavior.
  • Change the LLaMA model parameters by modifying the command line. Note: You can also specify a custom LLaMA command line by setting the LLAMA_CMD environment variable.
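As an example of the first two tweaks, a crude token budget and an alternative prompt template might look like this. This is a sketch only; the numbers, names, and wording are illustrative, not the script’s actual defaults:

# Sketch of the tweaks above; numbers and wording are illustrative only.
MAX_CONTEXT_TOKENS = 1024  # shrink this to leave more room for the response

def truncate_context(context: str, max_tokens: int = MAX_CONTEXT_TOKENS) -> str:
    # Very rough budget: treat whitespace-separated words as tokens.
    words = context.split()
    return " ".join(words[:max_tokens])

def build_prompt(question: str, context: str) -> str:
    # Alternative template: tell the model to admit when the answer is missing.
    return (
        "You are a helpful assistant. If the context does not contain the answer, "
        "say that you do not know.\n\n"
        f"Context:\n{truncate_context(context)}\n\n"
        f"Question: {question}\nAnswer:"
    )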

You can also use lastmileai.dev to track your various experiments as you tweak and tune models. For example, here’s a notebook saving some trials using Stable Diffusion.

Protocols over Platforms

We hope this exercise shows the need for standardizing interactions between foundation models and plugins/extensions. We should be able to use a plugin designed for OpenAI models with another large language model, and vice versa. This is only possible with a Foundation Model Plugin Protocol standard.

We are in the early stages of a revolution in computing, powered by the advent of state-of-the-art foundation models. We have an opportunity to define the behaviors that govern our interactions with these models, and to return to the rich legacy of open protocols from the early internet instead of the closed platforms of the modern era.

Foundation Model Plugin Protocol

The lastmile ai team is exploring what it takes to define a plugin protocol, and spur its adoption. We believe the protocol should be:

  • model-agnostic — support GPTx, LLaMA, Bard, and any other foundation model.
  • modal-agnostic — support different types of inputs and outputs, instead of just text.

Our early thinking on this is inspired by SMTP for email, and LSP (Language Server Protocol) for IDEs. We will be sharing what we have in this space in the coming days, and would love to collaborate with you.
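Purely as a thought experiment, and not a spec, the protocol-level surface might be as small as a way for a plugin to describe itself plus a way to invoke it:

# A thought experiment, not a spec: a model-agnostic, modality-agnostic plugin
# could reduce to "describe yourself" plus "handle a call".
from typing import Any, Dict, Protocol

class FoundationModelPlugin(Protocol):
    def manifest(self) -> Dict[str, Any]:
        # Machine-readable description of the plugin's capabilities and API.
        ...

    def invoke(self, request: Dict[str, Any]) -> Dict[str, Any]:
        # Handle a call from any foundation model; inputs and outputs may be
        # text, images, or other modalities, as described in the manifest.
        ...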

Call to action

We are just getting started at lastmile ai and would love to hear from you, especially if you share our vision for an open and interoperable future. You can reach us here:

We would also appreciate your feedback on our initial product offering, available at lastmileai.dev.
