# vLLM

vLLM is a fast, easy-to-use library for LLM inference and serving with an OpenAI-compatible API. Use it to run large language models on your own hardware.

## Dependencies

None, but a GPU node is required in your cluster.

## Configuration

Key settings are configured through your instance's `config.yaml`:

- **model** - Hugging Face model to serve (default: `Qwen/Qwen2.5-7B-Instruct`)
- **maxModelLen** - Maximum sequence length (default: `8192`)
- **gpuProduct** - Required GPU type (default: `RTX 4090`)
- **gpuCount** - Number of GPUs to use (default: `1`)
- **gpuMemoryUtilization** - Fraction of GPU memory to use (default: `0.9`)
- **domain** - Where the API will be accessible (default: `vllm.{your-cloud-domain}`)

## Access

After deployment, the OpenAI-compatible API is available at:

- `https://vllm.{your-cloud-domain}/v1`

Other apps on the cluster (such as Open WebUI) can connect internally at `http://vllm-service.llm.svc.cluster.local:8000/v1`.

## Hardware Requirements

This app requires a GPU node in your cluster. Adjust the `gpuProduct`, `gpuCount`, and `gpuMemoryUtilization` settings to match your available hardware.
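As a sketch, an instance `config.yaml` using the defaults listed above might look like the following (the exact schema and the `vllm.example.com` domain are assumptions; adjust to your deployment):

```yaml
# Hypothetical config.yaml for a vLLM instance, using the documented defaults.
model: Qwen/Qwen2.5-7B-Instruct   # Hugging Face model to serve
maxModelLen: 8192                 # maximum sequence length
gpuProduct: "RTX 4090"            # required GPU type
gpuCount: 1                       # number of GPUs
gpuMemoryUtilization: 0.9         # fraction of GPU memory to use
domain: vllm.example.com          # where the API will be exposed
```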
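Because the API is OpenAI-compatible, any OpenAI-style client can talk to it. A minimal sketch using only Python's standard library is shown below; the `vllm.example.com` base URL is a placeholder for your configured `domain`, and the model name assumes the default config:

```python
import json
import urllib.request

# Base URL of the deployed API; replace with your configured `domain`.
base_url = "https://vllm.example.com/v1"

# Standard OpenAI-style chat completion payload.
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
}

# Build the POST request against the chat completions endpoint.
request = urllib.request.Request(
    f"{base_url}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending it would return an OpenAI-style response object, e.g.:
#   with urllib.request.urlopen(request) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Apps inside the cluster would use `http://vllm-service.llm.svc.cluster.local:8000/v1` as the base URL instead.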