2025/05/11

Quivr with local LLM





Instructions for Deploying Quivr with Llama 

Based on the Quivr codebase, Quivr (a RAG system) supports multiple LLM providers, including Meta's Llama models, though it primarily connects to them through API calls rather than through direct local integration.

Understanding Quivr's LLM Support 

Quivr's DefaultModelSuppliers enum includes support for Meta models: config.py:68-74

The codebase also includes configuration for Llama models: config.py:180-193

Deployment Steps 

Here's how to deploy Quivr with Llama:

1. Install Quivr 

First, install the Quivr package:

pip install quivr-core

2. Set Up API Access 

Since Quivr connects to LLMs via API calls, you'll need to set up an API endpoint for Llama. There are two main approaches:

Option A: Use a Hosted Llama API

If you want to use a hosted Llama API (like through Meta's API):

  1. Get API credentials for the Llama model
  2. Set the environment variable:
    export META_API_KEY=your_api_key_here

Option B: Set Up a Local API Server (Recommended for Local Deployment)

Since Quivr doesn't directly support local LLMs, you'll need to set up a local server that exposes an OpenAI-compatible API:

  1. Install and set up Ollama or another local inference server
  2. Pull the Llama model you want to use:
    ollama pull llama3
  3. Start the Ollama server, which exposes an OpenAI-compatible API at http://localhost:11434/v1 (a quick verification sketch follows below):
    ollama serve
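Before wiring Quivr to the local endpoint, it helps to confirm that the OpenAI-compatible API is actually answering. The following is a minimal sanity-check sketch (not part of Quivr), assuming Ollama's default port 11434 and the llama3 model pulled above:

# Minimal check against Ollama's OpenAI-compatible endpoint
# (assumes `ollama serve` is running and `llama3` has been pulled).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # any non-empty string; Ollama ignores it
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(response.choices[0].message.content)

If this prints a reply, the same base_url and model name can be reused in the LLMEndpointConfig in the next step.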

3. Configure Quivr to Use Llama 

Create a Python script to initialize Quivr with Llama:

from quivr_core import Brain  
from quivr_core.rag.entities.config import LLMEndpointConfig, DefaultModelSuppliers  
from uuid import uuid4  
  
# For hosted Llama API  
llm_config = LLMEndpointConfig(  
    supplier=DefaultModelSuppliers.META,  
    model="llama-3",  # Use the appropriate model name  
    temperature=0.7  
)  
  
# For local Llama via Ollama with OpenAI-compatible API  
# llm_config = LLMEndpointConfig(  
#     supplier=DefaultModelSuppliers.OPENAI,  # Use OpenAI supplier for compatibility  
#     model="llama3",  # Match the model name in Ollama  
#     llm_base_url="http://localhost:11434/v1",  # Ollama API endpoint  
# )  
  
# Create a Brain with your documents  
brain = Brain.from_files(  
    name="My Llama Brain",  
    file_paths=["document1.pdf", "document2.txt"],  
    llm_config=llm_config  
)  
  
# Ask a question  
run_id = uuid4()  
answer = brain.ask(  
    run_id=run_id,  
    question="What information is in these documents?"  
)  
  
print(answer)

4. Advanced Configuration 

For more advanced usage, you can customize the RAG workflow:

from quivr_core.rag.entities.config import RetrievalConfig, WorkflowConfig, NodeConfig  
  
# Create a custom workflow config  
workflow_config = WorkflowConfig(  
    name="Llama RAG Workflow",  
    nodes=[  
        NodeConfig(name="START", edges=["filter_history"]),  
        NodeConfig(name="filter_history", edges=["retrieve"]),  
        NodeConfig(name="retrieve", edges=["generate_rag"]),  
        NodeConfig(name="generate_rag", edges=["END"]),  
    ]  
)  
  
# Create retrieval config with the custom workflow  
retrieval_config = RetrievalConfig(  
    workflow_config=workflow_config,  
    k=20,  # Number of chunks to retrieve  
    reranker_config={"top_n": 5, "supplier": "cohere"},  
    llm_config=llm_config  
)  
  
# Use the custom config with the brain  
answer = brain.ask(  
    run_id=uuid4(),  
    question="What are the key points in these documents?",  
    retrieval_config=retrieval_config  
)

Considerations and Limitations 

  1. API-Based Integration: Quivr primarily connects to LLMs through API calls, not direct local integration.

  2. Local Deployment Workaround: For local Llama models, you need to use a compatibility layer like Ollama with an OpenAI-compatible API.

  3. Context Window: Be aware of Llama's context window limitations when configuring your RAG system: config.py:186-189

  4. Tokenizer: Quivr uses specific tokenizers for different models, including Llama: config.py:189

Example Chainlit Interface 

For a more interactive experience, you can use Chainlit to create a chat interface:

  1. Navigate to the examples directory:

    cd examples/chatbot
  2. Modify the main.py file to use the Llama configuration (a sketch of this change follows below)

  3. Run the Chainlit interface:

    rye sync  
    rye run chainlit run chainlit.py

This will give you a web interface to interact with your Llama-powered Quivr brain.
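As a rough illustration of step 2, here is one way the Llama configuration could be wired into a Chainlit app. This is a hedged sketch, not the actual examples/chatbot/main.py from the Quivr repository: the handler layout, the synchronous Brain calls, and the document paths are assumptions.

# Hypothetical Chainlit wiring for a Llama-backed brain (not the repo's main.py).
import chainlit as cl
from uuid import uuid4
from quivr_core import Brain
from quivr_core.rag.entities.config import LLMEndpointConfig, DefaultModelSuppliers

# Local Ollama via its OpenAI-compatible endpoint (swap in the META supplier
# and an API key if you use a hosted Llama API instead).
llm_config = LLMEndpointConfig(
    supplier=DefaultModelSuppliers.OPENAI,
    model="llama3",
    llm_base_url="http://localhost:11434/v1",
)

@cl.on_chat_start
async def on_start():
    # Build the brain once per chat session and keep it in the user session.
    brain = Brain.from_files(
        name="My Llama Brain",
        file_paths=["document1.pdf", "document2.txt"],
        llm_config=llm_config,
    )
    cl.user_session.set("brain", brain)

@cl.on_message
async def on_message(message: cl.Message):
    brain = cl.user_session.get("brain")
    answer = brain.ask(run_id=uuid4(), question=message.content)
    await cl.Message(content=str(answer)).send()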

Notes:

  • The most reliable approach is using a hosted Llama API
  • For local deployment, you'll need to set up an OpenAI-compatible API server
  • Quivr's modular design allows for swapping between different LLM providers




Comparing Llama Integration Options for Quivr 

Based on the codebase, there are two different methods for deploying Quivr with Llama models:

  1. Using a hosted Llama API (like Meta's API service)
  2. Setting up a local OpenAI-compatible API server

Here is a comparison of the two approaches, based on the Quivr codebase.

Quivr's Support for Llama Models 

First, it's important to understand that Quivr has built-in support for Meta's Llama models through its DefaultModelSuppliers enum: config.py:68-74

The codebase also includes specific configurations for different Llama models: config.py:180-193

Option 1: Using a Hosted Llama API 

How It Works 

When using a hosted Llama API with Quivr:

  1. You obtain API credentials from a provider offering Llama models (like Meta)
  2. You configure Quivr to use the META supplier with your API key
  3. Quivr makes direct API calls to the hosted service

Configuration Example 

llm_config = LLMEndpointConfig(  
    supplier=DefaultModelSuppliers.META,  
    model="llama-3",  
    temperature=0.7  
)  
  
brain = Brain.from_files(  
    name="My Llama Brain",  
    file_paths=["document1.pdf", "document2.txt"],  
    llm_config=llm_config  
)

Advantages 

  • Official Support: This approach uses Quivr's built-in support for Meta models
  • Reliability: Hosted services typically offer better uptime and performance
  • Scalability: Can handle multiple concurrent requests without local resource constraints
  • No Local Setup: No need to install or manage the Llama model locally
  • Context Window Support: Access to the full context window capabilities of Llama models (8,192 tokens for Llama-3)

Disadvantages 

  • Cost: Hosted APIs typically charge per token or per request
  • Privacy Concerns: Your data is sent to external servers
  • API Limitations: May have rate limits or other restrictions
  • Internet Dependency: Requires internet connectivity

Option 2: Setting Up an OpenAI-Compatible API Server 

How It Works 

Since Quivr doesn't directly support local LLM integration, this approach involves:

  1. Setting up a local server (like Ollama, LM Studio, or LocalAI) that runs Llama locally
  2. Configuring this server to expose an OpenAI-compatible API
  3. Configuring Quivr to use the OPENAI supplier but pointing to your local server

Configuration Example 

llm_config = LLMEndpointConfig(  
    supplier=DefaultModelSuppliers.OPENAI,  # Using OpenAI supplier for compatibility  
    model="llama3",  # Match the model name in your local server  
    llm_base_url="http://localhost:11434/v1",  # Local API endpoint  
)  
  
brain = Brain.from_files(  
    name="My Llama Brain",  
    file_paths=["document1.pdf", "document2.txt"],  
    llm_config=llm_config  
)

Advantages 

  • Privacy: All data stays on your local machine
  • No Usage Costs: Once set up, there are no per-token or per-request charges
  • No Internet Dependency: Can work offline
  • No Rate Limits: Not subject to API rate limiting
  • Customization: Can fine-tune model parameters and quantization

Disadvantages 

  • Workaround Approach: Uses Quivr's OpenAI integration rather than native Meta support
  • Hardware Requirements: Requires sufficient local hardware (RAM, GPU) to run Llama
  • Setup Complexity: More complex initial setup
  • Performance Variability: Performance depends on your hardware
  • Potential Compatibility Issues: May encounter issues with the OpenAI API compatibility layer

Technical Considerations 

Tokenizer Differences 

Quivr uses specific tokenizers for different Llama models: config.py:186-189

When using the OpenAI-compatible approach, you might encounter tokenization differences if the local server doesn't use the same tokenizer.
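As a rough way to see this drift, the sketch below (independent of Quivr) compares token counts from an OpenAI-style encoding with a Llama tokenizer. The Hugging Face model name is only an example; the gated meta-llama checkpoints require access approval, so substitute any Llama-family tokenizer you can download.

# Illustration of tokenizer drift: the same text yields different token counts
# under an OpenAI-style encoding and a Llama tokenizer.
import tiktoken
from transformers import AutoTokenizer

text = "Quivr retrieves document chunks and budgets tokens before calling the LLM."

openai_ids = tiktoken.get_encoding("cl100k_base").encode(text)
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # example; gated repo
llama_ids = llama_tokenizer.encode(text)

print(f"cl100k_base: {len(openai_ids)} tokens, Llama: {len(llama_ids)} tokens")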

Context Window Management 

Quivr is configured with specific context window limits for Llama models:

  • Llama-3: 8,192 tokens
  • Llama-3.1: 128,000 tokens
  • Code-Llama: 16,384 tokens

When using a local server, you'll need to ensure it's configured with compatible context window settings.
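One practical consequence: the retrieved chunks plus the prompt and the answer all have to fit inside these limits. The back-of-envelope sketch below reuses the k=20 and reranker top_n=5 values from the earlier RetrievalConfig example and assumes an average chunk size of roughly 400 tokens (an assumption for illustration, not a Quivr default):

# Back-of-envelope context budget for Llama-3 (8,192-token window).
# avg_chunk_tokens, prompt_and_history, and answer_budget are assumed values.
context_window = 8_192
avg_chunk_tokens = 400      # assumed average size of a retrieved chunk
prompt_and_history = 1_000  # assumed system prompt + chat history
answer_budget = 1_000       # room reserved for the model's answer

for chunks in (20, 5):      # k=20 without reranking vs. top_n=5 after reranking
    used = chunks * avg_chunk_tokens + prompt_and_history + answer_budget
    status = "over" if used > context_window else "within"
    print(f"{chunks:>2} chunks -> ~{used} tokens ({status} the {context_window}-token window)")

Under these assumptions, passing all 20 retrieved chunks would overflow Llama-3's window, while the reranked top 5 fit comfortably; Llama-3.1's 128,000-token window is far more forgiving.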

Recommendation 

For production use: The hosted Llama API approach is more reliable and uses Quivr's native support for Meta models.

For development, testing, or privacy-sensitive applications: The local OpenAI-compatible server approach gives you more control and privacy.

Hybrid approach: You could use the local server for development and testing, then switch to the hosted API for production by simply changing the configuration.
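A minimal sketch of that switch, assuming a hypothetical QUIVR_LLM_MODE environment variable (an invented name, not a Quivr setting) to choose between the two configurations shown above:

# Hypothetical toggle between hosted (Meta) and local (Ollama) Llama configs.
# QUIVR_LLM_MODE is an invented variable name for this example.
import os
from quivr_core.rag.entities.config import LLMEndpointConfig, DefaultModelSuppliers

if os.getenv("QUIVR_LLM_MODE", "local") == "hosted":
    llm_config = LLMEndpointConfig(
        supplier=DefaultModelSuppliers.META,    # hosted Llama API (requires META_API_KEY)
        model="llama-3",
        temperature=0.7,
    )
else:
    llm_config = LLMEndpointConfig(
        supplier=DefaultModelSuppliers.OPENAI,  # local Ollama, OpenAI-compatible endpoint
        model="llama3",
        llm_base_url="http://localhost:11434/v1",
    )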

Implementation Steps 

For Hosted Llama API: 

  1. Obtain API credentials from Meta or another Llama provider
  2. Set the environment variable: export META_API_KEY=your_api_key_here
  3. Configure Quivr with the META supplier and appropriate model name

For Local OpenAI-Compatible Server: 

  1. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
  2. Pull the Llama model: ollama pull llama3
  3. Start Ollama (it automatically exposes an OpenAI-compatible API)
  4. Configure Quivr with the OPENAI supplier but point to your local Ollama endpoint

Notes:

  • The OpenAI-compatible approach is a workaround since Quivr doesn't directly support local LLMs
  • Both approaches will work with Quivr's RAG system, but with different tradeoffs
  • The tokenizer configuration may need adjustment depending on your specific setup