中芸汇科技
2026-05-05
Private DeploymentLarge ModelsData Security
Article cover
Article cover

Introduction

Industries such as finance, healthcare, and government impose strict data security requirements that public large model APIs cannot meet. Private deployment of large models is a must for these sectors.

Drawing on our experience delivering private large model deployments for over 10 enterprises, this article systematically walks through the 7 key steps.

Step 1: Model Selection

1.1 Comparison of Mainstream Open-Source Models

ModelParametersChinese CapabilityInference SpeedOpen-Source LicenseRecommended Scenarios
Qwen2.5-72B72B★★★★★ModerateApache 2.0Top choice for general purpose
Qwen2.5-7B7B★★★★FastApache 2.0Lightweight scenarios
DeepSeek-V3671B MoE★★★★★FastMITWhen budget is ample
ChatGLM4-9B9B★★★★FastApache 2.0Conversational scenarios
Llama3.1-70B70B★★★ModerateLlama3English-centric use
Yi-1.5-34B34B★★★★Relatively fastApache 2.0Best cost-performance ratio

1.2 Selection Advice

  • Prioritize general capability: Qwen2.5-72B
  • Limited budget: Yi-1.5-34B or Qwen2.5-7B
  • Inference-heavy scenarios: DeepSeek-V3
  • Resource-constrained: Quantized version of Qwen2.5-7B
  • Step 2: Computing Resource Assessment

    2.1 GPU Requirements Reference

    ModelFP16INT8INT4
    7B1×A100 40G1×A10 24G1×RTX4090 24G
    34B2×A100 80G1×A100 80G1×A100 40G
    72B4×A100 80G2×A100 80G2×A100 40G

    2.2 Cost Estimation

    ConfigurationPurchase CostMonthly RentalSuitable Scenarios
    1×RTX4090¥15,000¥3,0007B model testing
    1×A100 40G¥80,000¥15,0007B-34B models
    2×A100 80G¥250,000¥40,00034B-72B models
    4×A100 80G¥500,000¥80,00072B+ models

    Step 3: Inference Engine Selection

    EngineThroughputLatencyEase of UseRecommended Scenarios
    vLLM★★★★★★★★★★★★★Preferred for production
    TGI★★★★★★★★★★★★When compatibility is key
    TensorRT-LLM★★★★★★★★★★★★Latency-sensitive use
    Ollama★★★★★★★★★★★Local development and testing

    Our recommendation: Use vLLM for production (highest throughput, active community) and Ollama for development/testing (one-click deployment).

    Step 4: Model Quantization

    4.1 Quantization Method Comparison

    MethodAccuracy LossSpeed IncreaseModel Size ReductionUse Case
    FP16→INT8 (AWQ)<1%2x2xGeneral recommendation
    FP16→INT4 (GPTQ)1%-3%3x4xResource-constrained
    FP16→INT4 (GGUF)2%-5%3x4xCPU inference

    4.2 Quantization Effect Reference

    Quantization results for Qwen2.5-72B on Chinese benchmarks:

    Quantization MethodC-EvalInference Speed (Tokens/s)GPU Memory Usage
    FP1683.525144GB
    AWQ-INT882.84872GB
    GPTQ-INT481.27240GB

    Step 5: Containerized Deployment

    ```yaml

    docker-compose.yml example

    services:

    vllm:

    image: vllm/vllm-openai:latest

    deploy:

    resources:

    reservations:

    devices:

  • capabilities: [gpu]
  • count: 2

    command: >

    --model Qwen/Qwen2.5-72B-Instruct-AWQ

    --quantization awq

    --tensor-parallel-size 2

    --max-model-len 8192

    --gpu-memory-utilization 0.9

    ports:

  • "8000:8000"
  • ```

    Step 6: Performance Optimization

    OptimizationMethodEffect
    Continuous BatchingDynamic batching2-3x throughput increase
    PagedAttentionPaged VRAM management40% better VRAM utilization
    Prefix CachingSystem prompt caching50% latency reduction for identical prefixes
    Speculative DecodingSmall model drafts, large model verifies2-3x inference speed boost

    Step 7: Monitoring and Operations

    7.1 Key Monitoring Metrics

    MetricAlert Threshold
    GPU utilization>95% sustained for 5 minutes
    Inference latency P99>5 seconds
    Request failure rate>1%
    VRAM usage>90%
    Model service availability<99.9%

    7.2 Operations Strategies

  • Auto-scaling: Dynamically adjust inference instance count based on request volume
  • Blue-green deployment: Zero-downtime model updates
  • Canary releases: Route 5% of traffic to the new model first for validation
  • Log aggregation: Full-chain request tracing
  • Conclusion

    Private deployment is not simply "buy a server and install a model." Selecting the right model, provisioning sufficient computing power, optimizing inference, and establishing solid operations are what make a private large model truly effective. We recommend starting with a 7B model to quickly validate your business use case, then scaling up to a 72B model once feasibility is confirmed.

    Interested in a private large model deployment solution? Book a free computing resource assessment