vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving, offering:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels
This notebooks goes over how to use a LLM with langchain and vLLM.
To use, you should have the vllm
python package installed.
%pip install --upgrade --quiet vllm -q
from langchain_community.llms import VLLM
llm = VLLM(
model="mosaicml/mpt-7b",
trust_remote_code=True, # mandatory for hf models
max_new_tokens=128,
top_k=10,
top_p=0.95,
temperature=0.8,
)
print(llm.invoke("What is the capital of France ?"))
INFO 08-06 11:37:33 llm_engine.py:70] Initializing an LLM engine with config: model='mosaicml/mpt-7b', tokenizer='mosaicml/mpt-7b', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 08-06 11:37:41 llm_engine.py:196] # GPU blocks: 861, # CPU blocks: 512
``````output
Processed prompts: 100%|█ █████████| 1/1 [00:00<00:00, 2.00it/s]
``````output
What is the capital of France ? The capital of France is Paris.
Integrate the model in an LLMChain
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "Who was the US president in the year the first Pokemon game was released?"
print(llm_chain.invoke(question))
Processed prompts: 100%|██████████| 1/1 [00:01<00:00, 1.34s/it]
``````output
1. The first Pokemon game was released in 1996.
2. The president was Bill Clinton.
3. Clinton was president from 1993 to 2001.
4. The answer is Clinton.
Distributed Inference
vLLM supports distributed tensor-parallel inference and serving.
To run multi-GPU inference with the LLM class, set the tensor_parallel_size
argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs
from langchain_community.llms import VLLM
llm = VLLM(
model="mosaicml/mpt-30b",
tensor_parallel_size=4,
trust_remote_code=True, # mandatory for hf models
)
llm.invoke("What is the future of AI?")
Quantization
vLLM supports awq
quantization. To enable it, pass quantization
to vllm_kwargs
.
llm_q = VLLM(
model="TheBloke/Llama-2-7b-Chat-AWQ",
trust_remote_code=True,
max_new_tokens=512,
vllm_kwargs={"quantization": "awq"},
)
OpenAI-Compatible Server
vLLM can be deployed as a server that mimics the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API.
This server can be queried in the same format as OpenAI API.
OpenAI-Compatible Completion
from langchain_community.llms import VLLMOpenAI
llm = VLLMOpenAI(
openai_api_key="EMPTY",
openai_api_base="http://localhost:8000/v1",
model_name="tiiuae/falcon-7b",
model_kwargs={"stop": ["."]},
)
print(llm.invoke("Rome is"))
a city that is filled with history, ancient buildings, and art around every corner
LoRA adapter
LoRA adapters can be used with any vLLM model that implements SupportsLoRA
.
from langchain_community.llms import VLLM
from vllm.lora.request import LoRARequest
llm = VLLM(
model="meta-llama/Llama-3.2-3B-Instruct",
max_new_tokens=300,
top_k=1,
top_p=0.90,
temperature=0.1,
vllm_kwargs={
"gpu_memory_utilization": 0.5,
"enable_lora": True,
"max_model_len": 350,
},
)
LoRA_ADAPTER_PATH = "path/to/adapter"
lora_adapter = LoRARequest("lora_adapter", 1, LoRA_ADAPTER_PATH)
print(
llm.invoke("What are some popular Korean street foods?", lora_request=lora_adapter)
)
Related
- LLM conceptual guide
- LLM how-to guides