# vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving with an OpenAI-compatible API. Use it to run large language models on your own hardware.

## Dependencies

None, but requires a GPU node in your cluster.

## Configuration

Key settings are configured through your instance's `config.yaml`:

- `model` - Hugging Face model to serve (default: `Qwen/Qwen2.5-7B-Instruct`)
- `maxModelLen` - Maximum sequence length (default: `8192`)
- `gpuProduct` - Required GPU type (default: `RTX 4090`)
- `gpuCount` - Number of GPUs to use (default: `1`)
- `gpuMemoryUtilization` - Fraction of GPU memory to use (default: `0.9`)
- `domain` - Where the API will be accessible (default: `vllm.{your-cloud-domain}`)
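Put together, a `config.yaml` using the documented defaults might look like the sketch below. This is illustrative only: the exact schema depends on how the app consumes the file, and `example.com` stands in for your actual cloud domain.

```yaml
# Sketch of a config.yaml with the documented defaults.
model: Qwen/Qwen2.5-7B-Instruct
maxModelLen: 8192
gpuProduct: "RTX 4090"       # must match a GPU type present in your cluster
gpuCount: 1
gpuMemoryUtilization: 0.9    # fraction of each GPU's memory vLLM may claim
domain: vllm.example.com     # replace with vllm.{your-cloud-domain}
```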

## Access

After deployment, the OpenAI-compatible API will be available at:

- `https://vllm.{your-cloud-domain}/v1`

Other apps on the cluster (such as Open WebUI) can connect internally at `http://vllm-service.llm.svc.cluster.local:8000/v1`.
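Because the endpoint is OpenAI-compatible, any OpenAI-style client can talk to it. The stdlib-only sketch below builds a chat-completions request body for the `/v1/chat/completions` route; the base URL is a placeholder for your deployment, and the actual send is shown commented out since it needs a running instance.

```python
import json

# Placeholder for a deployed instance; substitute your own domain.
base_url = "https://vllm.example.com/v1"

# OpenAI-style chat completion request body. The "model" field must
# match the `model` value configured in config.yaml.
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
body = json.dumps(payload).encode()

# To send it against a live deployment:
# import urllib.request
# req = urllib.request.Request(
#     f"{base_url}/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

In-cluster apps would use `http://vllm-service.llm.svc.cluster.local:8000/v1` as the base URL instead.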

## Hardware Requirements

This app requires a GPU node in your cluster. Adjust `gpuProduct`, `gpuCount`, and `gpuMemoryUtilization` to match your available hardware.