LLM / Chat / Embedding, Streaming / Batching, dedicated GPU option

Model Hosting (LLM Inference)

Use open LLMs (e.g., Llama, Mistral, Qwen) directly through an API. A high-performance runtime, autoscaling, streaming, and guardrails/logging come in one package. Pricing is available on request.

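⚡ High-performance inference

Batching, KV caching, and optional speculative decoding keep per-token latency low.

🌊 Streaming

Tokens are streamed over SSE or WebSocket; a stream can be stopped midway and restarted.

🧩 Compatible API

OpenAI-compatible endpoints (/v1/chat/completions and others) plus an Embedding API.

🛡️ Guardrails

Policy filters, banned words, output length limits, and temperature caps, managed together with audit logs.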

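Model/Runtime Options (example)

Item	Option	Description
Supported model families	Llama / Mistral / Qwen / Yi / Gemma, etc.	Decoder-only family, instruction-following (Instruct) checkpoints
Context length	8K to 128K (per model)	Long contexts use more memory; sharding/offloading available
Precision/Quantization	FP16, BF16, INT8/INT4	Choose the memory/performance balance (e.g., AWQ, GPTQ, bitsandbytes)
Runtime	vLLM / TGI / TensorRT-LLM	Batching / prebuilt kernels, CUDA/ROCm
Tokenizer	tiktoken / transformers	Compatible token counter provided
RAG	Vector store integration	Embedding + retrieval → context injection
* The actual combination and versions are finalized after consultation.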
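
To make the memory trade-off between precision and context length concrete, the sketch below estimates weight and KV-cache memory. The model shape used here (8B parameters, 32 layers, 8 KV heads, head dimension 128) is an assumed example, not a description of any hosted configuration, and real runtimes add overhead for activations and fragmentation.

```python
# Rough memory estimate for serving a decoder-only model.
# All model dimensions used below are illustrative assumptions.

BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gib(n_params: float, precision: str) -> float:
    """Memory for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[precision] / 2**30


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, batch_size: int = 1,
                 bytes_per_value: float = 2.0) -> float:
    """KV cache: one key and one value vector per layer, per KV head, per token."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_len * batch_size
    return values * bytes_per_value / 2**30


if __name__ == "__main__":
    for precision in ("fp16", "int8", "int4"):
        print(precision, round(weight_gib(8e9, precision), 1), "GiB weights")
    for ctx in (8_192, 131_072):
        print(ctx, "tokens:", round(kv_cache_gib(32, 8, 128, ctx), 1), "GiB KV cache")
```

Under these assumptions the KV cache grows linearly with context length and batch size, which is why long-context deployments may need sharding or offloading even when the quantized weights fit on a single GPU.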

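Performance/Scale (example)

Item	Value	Notes
GPU tier	L4 / A10 / A100 / H100	24 to 80 GB memory; NVLink (depends on GPU model)
Concurrent requests	Batch-based concurrent processing	Merges heavy LLM workloads
Autoscaling	min to max replicas	Scales on queue length/latency
Cache	KV cache / prompt cache	Helps with repeated prompts
Release	Weighted / canary	Shifts the traffic ratio between versions A and B
* Figures vary with configuration, prompts, and token length.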
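
The autoscaling and canary rows can be illustrated with a small sketch. This is one possible policy, not the platform's actual scaler or router: `target_queue_per_replica`, the `"stable"`/`"canary"` labels, and both function names are hypothetical.

```python
import hashlib
import math


def desired_replicas(queue_length: int, target_queue_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed to keep the per-replica queue at or below the target,
    clamped to the configured min..max range."""
    if target_queue_per_replica <= 0:
        raise ValueError("target_queue_per_replica must be positive")
    needed = math.ceil(queue_length / target_queue_per_replica)
    return max(min_replicas, min(max_replicas, needed))


def pick_version(request_id: str, canary_percent: int) -> str:
    """Deterministically route roughly canary_percent% of request IDs to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"
```

Hashing the request ID instead of drawing a random number keeps routing stable across retries, so a retried request lands on the same version.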

Endpoints

Chat Completions (compatible)

curl -X POST https://api.example.com/v1/chat/completions \
  -H "Authorization: Bearer sk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-instruct",
    "messages": [
      {"role":"system","content":"You are a helpful assistant."},
      {"role":"user","content":"Describe autumn weather in Seoul"}
    ],
    "stream": true,
    "temperature": 0.6,
    "max_tokens": 256
  }'

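JavaScript (fetch, SSE)

const res = await fetch("https://api.example.com/v1/chat/completions", {
  method: "POST",
  headers: {"Authorization": "Bearer sk-...", "Content-Type": "application/json"},
  body: JSON.stringify({ model: "llama-3-instruct", messages: [{role: "user", content: "Summarize in one line"}], stream: true })
});
if (!res.ok) throw new Error(`HTTP ${res.status}`);
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
  const {value, done} = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, {stream: true});
  const lines = buffer.split("\n");
  buffer = lines.pop(); // keep the trailing partial line for the next read
  for (const line of lines) {
    // process Server-Sent Events lines: payload lines start with "data:"
    if (!line.startsWith("data:")) continue;
    const data = line.slice(5).trim();
    if (data === "[DONE]") continue; // end-of-stream marker (assumes OpenAI-style streams)
    console.log(JSON.parse(data).choices?.[0]?.delta?.content ?? ""); // assumes OpenAI-style chunks
  }
}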

Python (requests)

import requests

url = "https://api.example.com/v1/chat/completions"
headers = {"Authorization": "Bearer sk-..."}
payload = {"model": "llama-3-instruct", "messages": [{"role": "user", "content": "Summarize"}]}
# json= serializes the payload and sets the Content-Type header
res = requests.post(url, headers=headers, json=payload, timeout=30)
res.raise_for_status()  # surface 4xx/5xx errors instead of parsing an error body
print(res.json())
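
For streaming from Python, a minimal sketch follows. It assumes the endpoint emits OpenAI-style SSE chunks (`data: {...}` lines with `choices[0].delta.content`, ending in `data: [DONE]`); `iter_stream_text` and `stream_chat` are hypothetical helper names, not part of any SDK.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines ("data: {...}")."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream marker
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


def stream_chat(url: str, api_key: str, payload: dict) -> Iterator[str]:
    """POST a chat request with stream=True and yield text as it arrives."""
    import requests  # third-party; imported here so the parser works without it

    headers = {"Authorization": f"Bearer {api_key}"}
    with requests.post(url, headers=headers, json={**payload, "stream": True},
                       stream=True, timeout=60) as res:
        res.raise_for_status()
        yield from iter_stream_text(res.iter_lines())
```

A caller would loop over `stream_chat(url, "sk-...", {"model": "llama-3-instruct", "messages": [...]})` and print each fragment with `end=""` and `flush=True`.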

Embedding

curl -X POST https://api.example.com/v1/embeddings \
  -H "Authorization: Bearer sk-..." -H "Content-Type: application/json" \
  -d '{"model":"e5-large","input":["sentence 1","sentence 2"]}'
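
Once embeddings come back, the retrieval step of a RAG pipeline typically ranks documents by cosine similarity. The sketch below assumes an OpenAI-style response body (`{"data": [{"index": 0, "embedding": [...]}, ...]}`); the helper names are hypothetical, and a production setup would normally delegate the ranking to a vector store.

```python
import math
from typing import Sequence


def extract_embeddings(response_json: dict) -> list[list[float]]:
    """Return the vectors in input order (assumes an OpenAI-style response body)."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: Sequence[float],
                       docs: Sequence[Sequence[float]]) -> list[int]:
    """Indices of docs, most similar to the query first."""
    return sorted(range(len(docs)),
                  key=lambda i: cosine_similarity(query, docs[i]),
                  reverse=True)
```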

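Security/Governance

Item	Description	Notes
Private endpoint	IP allowlist / VPC peering	Public or private, selectable
Keys/Quotas	API keys / organization keys, RPM/TPM limits	Key roles / rotation
Guardrails	Policy / banned words / output length / temperature cap	Per-endpoint profiles
Audit/Logs	Prompt/response metadata, IP/time	Selectable retention period
Data handling	Not used for training (option)	PII masking (option)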
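
As an illustration of what a per-endpoint guardrail profile might do, the sketch below caps the sampling parameters of a request and scans text for banned words. `GuardrailProfile`, `apply_profile`, and `find_banned` are hypothetical names, and the actual service may enforce these limits quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailProfile:
    """Hypothetical per-endpoint limits."""
    max_temperature: float = 1.0
    max_output_tokens: int = 512
    banned_words: frozenset = frozenset()


def apply_profile(request: dict, profile: GuardrailProfile) -> dict:
    """Return a copy of the request with temperature and max_tokens capped."""
    limited = dict(request)
    limited["temperature"] = min(
        request.get("temperature", profile.max_temperature),
        profile.max_temperature)
    limited["max_tokens"] = min(
        request.get("max_tokens", profile.max_output_tokens),
        profile.max_output_tokens)
    return limited


def find_banned(text: str, profile: GuardrailProfile) -> list[str]:
    """Case-insensitive substring check; returns the banned words found, sorted."""
    lowered = text.lower()
    return sorted(w for w in profile.banned_words if w.lower() in lowered)
```

Substring matching is deliberately naive here; a real policy filter would likely need tokenization or pattern rules to avoid false positives.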

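Monitoring

Metric	Scope	Description
Latency / token rate	Endpoint/model	p50/p95, tokens/sec
Usage	Organization/key	Request count, input/output tokens
Error rate	By class	429/5xx, retry rate
Queue length	Runtime	Autoscaling trigger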
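
For readers who want to reproduce the latency metrics from their own request logs, here is a minimal sketch using the nearest-rank percentile method. The function names are hypothetical, and the platform's dashboards may use a different percentile definition.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def tokens_per_second(output_tokens: int, elapsed_seconds: float) -> float:
    """Generation throughput for a single request."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return output_tokens / elapsed_seconds


if __name__ == "__main__":
    latencies_ms = [120, 80, 200, 95, 150]  # made-up sample data
    print("p50:", percentile(latencies_ms, 50), "ms")
    print("p95:", percentile(latencies_ms, 95), "ms")
    print("rate:", tokens_per_second(256, 4.0), "tokens/sec")
```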

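Pricing

Pricing: contact us

Pricing depends on GPU type and whether the GPU is dedicated, context length, concurrency/quotas, and log retention/guardrail options.

Connect an LLM to your app

Tell us your model, quota, and security requirements and we will set it up right away.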
