vLLM
vLLM is a fast, easy-to-use library for LLM inference and serving. This app deploys it with an OpenAI-compatible API, letting you run large language models on your own hardware.
Dependencies
None, but requires a GPU node in your cluster.
Configuration
Key settings configured through your instance's config.yaml:
- model - Hugging Face model to serve (default: Qwen/Qwen2.5-7B-Instruct)
- maxModelLen - Maximum sequence length (default: 8192)
- gpuProduct - Required GPU type (default: RTX 4090)
- gpuCount - Number of GPUs to use (default: 1)
- gpuMemoryUtilization - Fraction of GPU memory to use (default: 0.9)
- domain - Where the API will be accessible (default: vllm.{your-cloud-domain})
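The settings above might be combined in a config.yaml like the following. This is a sketch using the defaults listed here; the flat key layout and the example domain are assumptions, so check your instance's actual schema before copying it:

```yaml
# Hypothetical config.yaml overrides for the vLLM app.
# Key names come from the settings list above; nesting may differ
# in your instance's schema.
model: Qwen/Qwen2.5-7B-Instruct
maxModelLen: 8192
gpuProduct: "RTX 4090"
gpuCount: 1
gpuMemoryUtilization: 0.9
domain: vllm.example.com   # placeholder; use your cloud domain
```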
Access
After deployment, the OpenAI-compatible API will be available at:
https://vllm.{your-cloud-domain}/v1
Other apps on the cluster (such as Open WebUI) can connect internally at http://vllm-service.llm.svc.cluster.local:8000/v1.
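As a sketch of how a client would talk to either endpoint, the snippet below builds a standard OpenAI-style chat completion request. The base URL and model name are placeholders matching the defaults above; send the payload with any HTTP client (or point the official OpenAI SDK's `base_url` at the same address):

```python
import json

# Placeholder endpoint; use https://vllm.{your-cloud-domain}/v1 externally,
# or http://vllm-service.llm.svc.cluster.local:8000/v1 from inside the cluster.
base_url = "https://vllm.example.com/v1"

# Standard OpenAI chat-completions payload; model matches the default above.
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}

body = json.dumps(payload)
url = f"{base_url}/chat/completions"
# POST `body` to `url` with Content-Type: application/json,
# e.g. curl -X POST "$url" -H 'Content-Type: application/json' -d "$body"
```

Because the API is OpenAI-compatible, existing OpenAI client libraries work unchanged once `base_url` is overridden.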
Hardware Requirements
This app requires a GPU node in your cluster. Adjust the gpuProduct, gpuCount, and gpuMemoryUtilization settings to match your available hardware.