# vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving with an OpenAI-compatible API. Use it to run large language models on your own hardware.

## Dependencies

None, but requires a GPU node in your cluster.

## Configuration

Key settings are configured through your instance's `config.yaml`:

- `model` - Hugging Face model to serve (default: `Qwen/Qwen2.5-7B-Instruct`)
- `maxModelLen` - Maximum sequence length (default: `8192`)
- `gpuProduct` - Required GPU type (default: `RTX 4090`)
- `gpuCount` - Number of GPUs to use (default: `1`)
- `gpuMemoryUtilization` - Fraction of GPU memory to use (default: `0.9`)
- `domain` - Where the API will be accessible (default: `vllm.{your-cloud-domain}`)
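Put together, a `config.yaml` using the documented defaults might look like the sketch below. This is illustrative only: the exact schema depends on how the app consumes the file, and `example.com` stands in for your actual cloud domain.

```yaml
# Sketch of a config.yaml with the documented defaults.
model: Qwen/Qwen2.5-7B-Instruct
maxModelLen: 8192
gpuProduct: "RTX 4090"       # must match a GPU type present in your cluster
gpuCount: 1
gpuMemoryUtilization: 0.9    # fraction of each GPU's memory vLLM may claim
domain: vllm.example.com     # replace with vllm.{your-cloud-domain}
```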

## Access

After deployment, the OpenAI-compatible API will be available at:

- `https://vllm.{your-cloud-domain}/v1`

Other apps on the cluster (such as Open WebUI) can connect internally at `http://vllm-service.llm.svc.cluster.local:8000/v1`.
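Because the endpoint is OpenAI-compatible, any OpenAI-style client can talk to it. The stdlib-only sketch below builds a chat-completions request body for the `/v1/chat/completions` route; the base URL is a placeholder for your deployment, and the actual send is shown commented out since it needs a running instance.

```python
import json

# Placeholder for a deployed instance; substitute your own domain.
base_url = "https://vllm.example.com/v1"

# OpenAI-style chat completion request body. The "model" field must
# match the `model` value configured in config.yaml.
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
body = json.dumps(payload).encode()

# To send it against a live deployment:
# import urllib.request
# req = urllib.request.Request(
#     f"{base_url}/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

In-cluster apps would use `http://vllm-service.llm.svc.cluster.local:8000/v1` as the base URL instead.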

## Hardware Requirements

This app requires a GPU node in your cluster. Adjust `gpuProduct`, `gpuCount`, and `gpuMemoryUtilization` to match your available hardware.