API Reference
Apollo RAG provides a REST API for document retrieval and question answering. All endpoints return JSON and support CORS.
Base URL
Local Development:
http://localhost:8000Production:
https://apollo.onyxlab.aiAll code examples in this documentation use localhost:8000 for local development. Replace with your actual deployment URL when using in production.
Authentication
Currently, Apollo runs without authentication for development. For production deployments, implement authentication middleware or use a reverse proxy like Nginx.
Security: Enable rate limiting and authentication before deploying to production. Apollo includes built-in rate limiting (30 requests/minute per IP).
Endpoints
Health Check
GET /health
Check system status and configuration.
Response:
{
"status": "healthy",
"version": "4.1.0",
"gpu_enabled": true,
"gpu_count": 1,
"gpu_name": "NVIDIA RTX 4090",
"vector_store": "chroma",
"document_count": 142589,
"cache_enabled": true,
"conversation_memory_enabled": true
}Response Fields:
| Field | Type | Description |
|---|---|---|
status | string | System health status: healthy or unhealthy |
version | string | Apollo version number |
gpu_enabled | boolean | Whether GPU acceleration is active |
gpu_count | number | Number of available GPUs |
gpu_name | string | GPU model name (if available) |
vector_store | string | Active vector store: chroma or qdrant |
document_count | number | Total indexed documents |
cache_enabled | boolean | Whether caching layer is active |
conversation_memory_enabled | boolean | Whether conversation memory is enabled |
cURL Example:
curl http://localhost:8000/healthQuery
POST /api/query
Process a question using the RAG system.
Request Body:
{
"question": "Can I grow a beard?",
"mode": "simple",
"use_context": false,
"rerank_preset": "balanced"
}Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
question | string | ✅ Yes | - | The question to answer (max 10,000 chars) |
mode | string | ❌ No | "simple" | Retrieval mode: simple or adaptive |
use_context | boolean | ❌ No | false | Use conversation history |
rerank_preset | string | ❌ No | "balanced" | Re-ranking preset: speed, balanced, quality |
Retrieval Modes:
Simple Mode
- Use for: Direct questions, quick lookups
- Speed: 8-15s (GPU) / 30-60s (CPU)
- Strategy: Pure vector search with top-k retrieval
- Accuracy: High for straightforward questions
Example:
{
"question": "What are the fitness standards?",
"mode": "simple"
}Adaptive Mode
- Use for: Complex questions, multi-hop reasoning, research
- Speed: 15-90s (varies by complexity)
- Strategy: Query classification → Hybrid search → Re-ranking
- Accuracy: Superior for complex, multi-faceted questions
Example:
{
"question": "Compare beard policies across different regulations",
"mode": "adaptive"
}Re-ranking Presets:
| Preset | Speed | Accuracy | Best For |
|---|---|---|---|
speed | Fastest | Good | High-volume queries, simple questions |
balanced | Medium | Better | General use (recommended) |
quality | Slowest | Best | Critical accuracy, research tasks |
Response:
{
"answer": "According to AFI 36-2903, Air Force members are generally prohibited from growing beards except for approved medical or religious accommodations...",
"sources": [
{
"content": "3.1.2. Beards. Beards are not authorized except for...",
"metadata": {
"source": "AFI_36-2903_Dress_and_Appearance.pdf",
"page": 12,
"chunk_id": "chunk_1234"
},
"relevance_score": 0.9234
}
],
"metadata": {
"processing_time_ms": 12456,
"mode": "simple",
"cache_hit": false,
"query_type": "simple",
"strategy": "vector_search",
"retrieved_chunks": 7,
"model": "llama3.1:8b",
"gpu_accelerated": true
},
"explanation": null
}Response Fields:
| Field | Type | Description |
|---|---|---|
answer | string | Generated answer to the question |
sources | array | Retrieved source documents with relevance scores |
sources[].content | string | Relevant text excerpt from source document |
sources[].metadata | object | Document metadata (source file, page, chunk ID) |
sources[].relevance_score | number | Relevance score (0-1, higher is more relevant) |
metadata | object | Processing metadata and performance metrics |
metadata.processing_time_ms | number | Total processing time in milliseconds |
metadata.mode | string | Retrieval mode used |
metadata.cache_hit | boolean | Whether answer was served from cache |
metadata.query_type | string | Classified query complexity |
metadata.strategy | string | Retrieval strategy applied |
metadata.retrieved_chunks | number | Number of chunks retrieved |
metadata.model | string | LLM model used for generation |
metadata.gpu_accelerated | boolean | Whether GPU was used |
explanation | string|null | Detailed explanation (adaptive mode only) |
cURL Example:
curl -X POST http://localhost:8000/api/query \
-H "Content-Type: application/json" \
-d '{
"question": "Can I grow a beard?",
"mode": "simple",
"use_context": false
}'Python Example:
import requests
response = requests.post(
"http://localhost:8000/api/query",
json={
"question": "Can I grow a beard?",
"mode": "simple",
"use_context": False,
"rerank_preset": "balanced"
}
)
if response.status_code == 200:
result = response.json()
print(f"Answer: {result['answer']}")
print(f"\nSources ({len(result['sources'])}):")
for i, source in enumerate(result['sources'], 1):
print(f"{i}. {source['metadata']['source']} (score: {source['relevance_score']:.3f})")
print(f"\nProcessing time: {result['metadata']['processing_time_ms']}ms")
print(f"GPU accelerated: {result['metadata']['gpu_accelerated']}")
else:
print(f"Error: {response.status_code} - {response.text}")JavaScript/TypeScript Example:
interface QueryResponse {
answer: string;
sources: Array<{
content: string;
metadata: {
source: string;
page?: number;
chunk_id: string;
};
relevance_score: number;
}>;
metadata: {
processing_time_ms: number;
mode: string;
cache_hit: boolean;
gpu_accelerated: boolean;
};
}
async function queryApollo(question: string): Promise<QueryResponse> {
const response = await fetch('http://localhost:8000/api/query', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
question,
mode: 'simple',
use_context: false
})
});
if (!response.ok) {
throw new Error(`HTTP ${response.status}: ${await response.text()}`);
}
return response.json();
}
// Usage
const result = await queryApollo("Can I grow a beard?");
console.log(`Answer: ${result.answer}`);
console.log(`Processing time: ${result.metadata.processing_time_ms}ms`);Error Responses:
| Status Code | Meaning | Example |
|---|---|---|
400 | Bad Request | Invalid parameters, empty query |
429 | Rate Limit Exceeded | More than 30 requests/minute |
500 | Internal Server Error | RAG engine error, model failure |
503 | Service Unavailable | System not initialized |
Example Error:
{
"detail": "Query cannot be empty or whitespace-only"
}Query Stream
POST /api/query/stream
Process a query with streaming response using Server-Sent Events (SSE).
Request Body: Same as /api/query
Response Format: text/event-stream
Each event is a JSON object:
data: {"type":"token","content":"The"}
data: {"type":"token","content":" Air"}
data: {"type":"token","content":" Force"}
...
data: {"type":"sources","content":[...]}
data: {"type":"metadata","content":{...}}
data: {"type":"done"}Event Types:
| Type | Description | Content |
|---|---|---|
token | Generated text token | String: next word/phrase |
sources | Retrieved sources | Array: source documents |
metadata | Processing metadata | Object: performance metrics |
done | Generation complete | null |
error | Error occurred | String: error message |
cURL Example:
curl -N -X POST http://localhost:8000/api/query/stream \
-H "Content-Type: application/json" \
-d '{
"question": "Explain the chain of command",
"mode": "simple"
}'JavaScript Example (EventSource):
const eventSource = new EventSource(
'http://localhost:8000/api/query/stream?' + new URLSearchParams({
question: 'Explain the chain of command',
mode: 'simple'
})
);
eventSource.onmessage = (event) => {
const data = JSON.parse(event.data);
switch (data.type) {
case 'token':
process.stdout.write(data.content); // Stream text as it arrives
break;
case 'sources':
console.log('\n\nSources:', data.content);
break;
case 'metadata':
console.log('Metadata:', data.content);
break;
case 'done':
eventSource.close();
console.log('\n\nGeneration complete');
break;
case 'error':
console.error('Error:', data.content);
eventSource.close();
break;
}
};Python Example (with requests):
import requests
import json
response = requests.post(
'http://localhost:8000/api/query/stream',
json={
'question': 'Explain the chain of command',
'mode': 'simple'
},
stream=True # Enable streaming
)
for line in response.iter_lines():
if line:
# SSE format: "data: {json}\n\n"
if line.startswith(b'data: '):
event = json.loads(line[6:]) # Skip "data: " prefix
if event['type'] == 'token':
print(event['content'], end='', flush=True)
elif event['type'] == 'sources':
print(f"\n\nSources: {len(event['content'])} documents")
elif event['type'] == 'metadata':
print(f"\nProcessing time: {event['content']['processing_time_ms']}ms")
elif event['type'] == 'done':
print("\n\nComplete")
breakClear Conversation
POST /api/conversation/clear
Clear conversation history and reset context.
Response:
{
"success": true,
"message": "Conversation memory cleared successfully"
}cURL Example:
curl -X POST http://localhost:8000/api/conversation/clearPython Example:
import requests
response = requests.post('http://localhost:8000/api/conversation/clear')
result = response.json()
print(result['message'])Rate Limiting
Apollo enforces rate limiting to prevent abuse:
- Limit: 30 requests per 60 seconds per IP address
- Response when exceeded: HTTP 429 with error message
Rate Limit Headers:
Currently not implemented. Consider adding in production:
X-RateLimit-Limit: 30
X-RateLimit-Remaining: 25
X-RateLimit-Reset: 1640995200Security Features
Prompt Injection Detection
Apollo automatically detects and logs potential prompt injection attempts:
Detected patterns:
ignore previous instructionsreveal your system promptyou are now...- Role markers:
[INST],###system, etc.
Behavior: Logged but not blocked (configurable in production)
Input Sanitization
All queries are sanitized:
- Null bytes removed
- Control characters stripped
- Length validated (max 10,000 characters)
CORS Configuration
For production, configure CORS in docker-compose.yml:
environment:
- CORS_ORIGINS=https://yourdomain.com,https://app.yourdomain.comPerformance Optimization
Caching
Apollo includes multi-layer caching:
- Embedding Cache: Stores vector embeddings for repeated queries
- Response Cache: Caches complete answers for identical questions
- Collection Metadata Cache: Optimized document metadata loading
Cache Hit Example:
{
"metadata": {
"processing_time_ms": 124, // <-- 100x faster!
"cache_hit": true,
"cached_at": "2025-01-15T10:30:00Z"
}
}GPU Acceleration
When GPU is available, Apollo automatically:
- Uses CUDA for embedding generation (10x faster)
- Accelerates vector search operations
- Batches embedding computations
GPU vs CPU Performance:
| Operation | GPU Time | CPU Time | Speedup |
|---|---|---|---|
| Embedding (512 tokens) | 0.5s | 5.2s | 10.4x |
| Vector search (100k docs) | 0.1s | 3.4s | 34x |
| End-to-end query | 8-15s | 30-60s | 3-4x |
Code Examples
Conversation Flow
import requests
BASE_URL = "http://localhost:8000"
# Start conversation with context
response1 = requests.post(
f"{BASE_URL}/api/query",
json={
"question": "What are the beard grooming standards?",
"mode": "simple",
"use_context": True # Enable context
}
)
print(response1.json()['answer'])
# Follow-up question (references previous context)
response2 = requests.post(
f"{BASE_URL}/api/query",
json={
"question": "What about mustaches?", # Implicit reference
"mode": "simple",
"use_context": True # Use previous context
}
)
print(response2.json()['answer'])
# Clear when switching topics
requests.post(f"{BASE_URL}/api/conversation/clear")Batch Queries
questions = [
"Can I grow a beard?",
"What are the fitness standards?",
"Explain the chain of command"
]
results = []
for question in questions:
response = requests.post(
"http://localhost:8000/api/query",
json={"question": question, "mode": "simple"}
)
results.append(response.json())
# Process results
for i, result in enumerate(results):
print(f"\nQ{i+1}: {questions[i]}")
print(f"A: {result['answer'][:200]}...")
print(f"Time: {result['metadata']['processing_time_ms']}ms")Error Handling
import requests
from requests.exceptions import RequestException
def query_apollo_safe(question: str, mode: str = "simple"):
try:
response = requests.post(
"http://localhost:8000/api/query",
json={"question": question, "mode": mode},
timeout=120 # 2 minute timeout
)
response.raise_for_status()
return response.json()
except requests.exceptions.Timeout:
print("Error: Request timed out after 2 minutes")
return None
except requests.exceptions.HTTPError as e:
if e.response.status_code == 429:
print("Error: Rate limit exceeded. Wait before retrying.")
elif e.response.status_code == 503:
print("Error: Service unavailable. System may be starting up.")
else:
print(f"HTTP Error {e.response.status_code}: {e.response.text}")
return None
except RequestException as e:
print(f"Network error: {e}")
return None
# Usage
result = query_apollo_safe("Can I grow a beard?")
if result:
print(result['answer'])Ready to integrate Apollo? Check out the interactive demos to explore Apollo’s capabilities.
Next Steps
- Quick Start - Get Apollo running in 5 minutes
- Architecture - Understand how Apollo works
- Benchmarks - Performance comparisons