Implement RAG (Retrieval-Augmented Generation) with App Platform for LKE


This guide extends the LLM (Large Language Model) inference architecture built in our Deploy an LLM for AI Inference with App Platform for LKE guide by deploying a RAG (Retrieval-Augmented Generation) pipeline that indexes a custom data set. RAG is a particular method of context augmentation that attaches relevant data as context when users send queries to an LLM.
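
In practice, "attaching relevant data as context" means retrieving the documents most similar to a query and prepending them to the prompt before it reaches the model. The toy sketch below (not part of the deployment) illustrates that flow using simple keyword overlap in place of embeddings; the pipeline built in this guide performs the same retrieve-then-augment step with LlamaIndex, an embedding model, and a PGvector database.

    # Toy illustration of the RAG pattern: retrieve relevant documents, then
    # attach them as context to the user's query before calling the LLM.
    # Keyword overlap stands in here for the embedding-based similarity search
    # that the real pipeline performs against PGvector.
    documents = {
        "lke.md": "LKE is the managed Kubernetes service from Akamai.",
        "gpu.md": "Akamai offers NVIDIA RTX 4000 Ada GPU plans for AI inference.",
    }

    def retrieve(query: str, top_k: int = 1) -> list[str]:
        """Return the top_k documents sharing the most words with the query."""
        query_words = set(query.lower().split())
        ranked = sorted(
            documents.values(),
            key=lambda text: len(query_words & set(text.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]

    def build_prompt(query: str) -> str:
        """Prepend retrieved context so the LLM answers from the custom data set."""
        context = "\n".join(retrieve(query))
        return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    print(build_prompt("Which GPU plans does Akamai offer?"))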

Follow the steps in this tutorial to enable Kubeflow Pipelines and deploy a RAG pipeline using App Platform for LKE. The data set you use may vary depending on your use case. For example purposes, this guide uses a sample data set from Akamai Techdocs that includes documentation about all Akamai Cloud services.

If you prefer a manual installation rather than one using App Platform for LKE, see our Deploy a Chatbot and RAG Pipeline for AI Inference on LKE guide.

Diagram

Components

Infrastructure

  • Linode GPUs (NVIDIA RTX 4000): Akamai has several high-performance GPU virtual machines available, including the NVIDIA RTX 4000 (used in this tutorial) and Quadro RTX 6000. The NVIDIA Ada Lovelace architecture in the RTX 4000 VMs is adept at many AI tasks, including inference and image generation.

  • Linode Kubernetes Engine (LKE): LKE is Akamai’s managed Kubernetes service, enabling you to deploy containerized applications without needing to build out and maintain your own Kubernetes cluster.

  • App Platform for LKE: A Kubernetes-based platform that combines developer and operations-centric tools, automation, self-service, and management of containerized application workloads. App Platform for LKE streamlines the application lifecycle from development to delivery and connects numerous CNCF (Cloud Native Computing Foundation) technologies in a single environment, allowing you to construct a bespoke Kubernetes architecture.

Additional Software

  • Open WebUI Pipelines: A self-hosted UI-agnostic OpenAI API plugin framework that brings modular, customizable workflows to any UI client supporting OpenAI API specs.

  • PGvector: Vector similarity search for Postgres. This tutorial uses a Postgres database with a vector extension to store embeddings generated by LlamaIndex and make them available to queries sent to the Llama 3.1 8B LLM.

  • KServe: Serves machine learning models. The architecture in this guide uses the Llama 3.1 8B LLM deployed with the Hugging Face runtime server in KServe, which then serves the model to other applications, including the chatbot UI, over an OpenAI-compatible API (see the sketch after this list).

  • intfloat/e5-mistral-7b-instruct LLM: The intfloat/e5-mistral-7b-instruct model is used as the embedding LLM in this guide.

  • Kubeflow Pipelines: Used to deploy pipelines: reusable machine learning workflows built with the Kubeflow Pipelines SDK. In this tutorial, a pipeline processes the data set and stores the resulting embeddings in the PGvector database.
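
Because KServe's Hugging Face runtime exposes the deployed models behind an OpenAI-compatible API, any OpenAI-style client can query them. The hedged sketch below shows what such a call might look like using the openai Python package; the base URL and model name are the values referenced by the agent pipeline later in this guide, and they only resolve from inside the LKE cluster (for example, from a pod in the team-demo namespace).

    # Hedged sketch: query the KServe-hosted Llama 3.1 8B model through its
    # OpenAI-compatible endpoint. The base URL and model id below match the
    # values referenced later in this guide and resolve only inside the cluster.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://llama-3-1-8b.team-models.svc.cluster.local/openai/v1",
        api_key="EMPTY",  # the in-cluster endpoint does not require a real key
    )

    response = client.chat.completions.create(
        model="meta-llama-3-1-8b",
        messages=[{"role": "user", "content": "What is Linode Kubernetes Engine?"}],
        max_tokens=256,
    )
    print(response.choices[0].message.content)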

Prerequisites

  • Complete the deployment in the Deploy an LLM for AI Inference with App Platform for LKE guide. Your LKE cluster should include the following minimum hardware requirements:

    • Three Dedicated 8 GB CPU nodes with autoscaling turned on

    • A second node pool consisting of at least 2 RTX4000 Ada x1 Medium GPU plans

  • Python3 and the venv Python module installed on your local machine

  • Object Storage configured. Make sure to configure Object Storage as described here before Kubeflow Pipelines is enabled.

Set Up Infrastructure

Before continuing, sign in to the App Platform web console as platform-admin or another account that uses the platform-admin role.

Add the hf-e5-mistral-7b-instruct Helm Chart to the Catalog

  1. Click on Catalog in the left menu.

  2. Select Add Helm Chart.

  3. Under Git Repository URL, add the URL to the hf-e5-mistral-7b-instruct Helm chart:

    https://github.com/linode/apl-examples/blob/main/inference/kserve/hf-e5-mistral-7b-instruct/Chart.yaml
  4. Click Get Details to populate the Helm chart details.

  5. Uncheck the Allow teams to use this chart option. In the next step, you’ll configure the RBAC of the catalog to make this Helm chart available to the models team.

  6. Click Add Chart.

Now configure the RBAC of the catalog:

  1. Select view > platform in the top bar.

  2. Select Apps in the left menu.

  3. Click on the Gitea app.

  4. In the list of repositories, click on otomi/charts.

  5. At the bottom, click on the file rbac.yaml.

  6. Change the RBAC for the hf-e5-mistral-7b-instruct Helm chart as shown below:

    hf-e5-mistral-7b-instruct:
      - team-models
    

Create a Workload to Deploy the Model

  1. Select view > team and team > models in the top bar.

  2. Select Catalog from the menu.

  3. Select the hf-e5-mistral-7b-instruct chart.

  4. Click on Values.

  5. Provide a name for the workload. This guide uses the workload name mistral-7b.

  6. Use the default values and click Submit.

Create a Workload to Deploy a PGvector Cluster

  1. Select view > team and team > demo in the top bar.

  2. Select Catalog from the menu.

  3. Select the pgvector-cluster chart.

  4. Click on Values.

  5. Provide a name for the workload. This guide uses the workload name pgvector.

  6. Use the default values and click Submit.

Note that the pgvector-cluster chart also creates a database in the PGvector cluster with the name app.
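
If you want to confirm that the database is reachable before moving on, a quick check like the one below can help. This is a minimal sketch, assuming you port-forward the pgvector-rw service (kubectl port-forward -n team-demo svc/pgvector-rw 5432:5432) and substitute the credentials stored in the pgvector-app secret; both of those names appear again later in this guide.

    # Hedged sketch: verify the "app" database is reachable and that the
    # vector extension is installed. Assumes a local port-forward to the
    # pgvector-rw service and credentials taken from the pgvector-app secret.
    import psycopg2

    conn = psycopg2.connect(
        host="localhost",
        port=5432,
        dbname="app",         # database created by the pgvector-cluster chart
        user="app",           # replace with the username from the pgvector-app secret
        password="changeme",  # replace with the password from the pgvector-app secret
    )
    with conn.cursor() as cur:
        cur.execute("SELECT extname FROM pg_extension WHERE extname = 'vector';")
        print("vector extension installed:", cur.fetchone() is not None)
    conn.close()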

Set Up Kubeflow Pipelines

Enable Kubeflow Pipelines

  1. Select view > platform in the top bar.

  2. Select Apps in the left menu.

  3. Enable the Kubeflow Pipelines app by hovering over the app icon and clicking the power on button. It may take a few minutes for the app to become active.

Generate the Pipeline YAML File

Follow the steps below to create a Kubeflow pipeline file. This YAML file describes each step of the pipeline workflow.

  1. On your local machine, create a virtual environment for Python:

    python3 -m venv .
    source bin/activate
  2. Install the Kubeflow Pipelines package in the virtual environment:

    pip install kfp
  3. Create a file named doc-ingest-pipeline.py with the following contents.

    This script configures the pipeline that downloads the Markdown data set to be ingested, reads the content using LlamaIndex, generates embeddings of the content, and stores the embeddings in the PGvector database.

    from kfp import dsl
    
    @dsl.component(
        base_image='nvcr.io/nvidia/ai-workbench/python-cuda117:1.0.3',
        packages_to_install=['psycopg2-binary', 'llama-index', 'llama-index-vector-stores-postgres',
                            'llama-index-embeddings-openai-like', 'llama-index-llms-openai-like', 'kubernetes']
    )
    def doc_ingest_component(url: str, table_name: str) -> None:
        print(">>> doc_ingest_component")
    
        from urllib.request import urlopen
        from io import BytesIO
        from zipfile import ZipFile
    
        http_response = urlopen(url)
        zipfile = ZipFile(BytesIO(http_response.read()))
        zipfile.extractall(path='./md_docs')
    
        from llama_index.core import SimpleDirectoryReader
    
        # load documents
        documents = SimpleDirectoryReader("./md_docs/", recursive=True, required_exts=[".md"]).load_data()
    
        from llama_index.embeddings.openai_like import OpenAILikeEmbedding
        from llama_index.core import Settings
    
        Settings.embed_model = OpenAILikeEmbedding(
            model_name="mistral-7b-instruct",
            api_base="http://mistral-7b.team-models.svc.cluster.local/openai/v1",
            api_key="EMPTY",
            embed_batch_size=50,
            max_retries=3,
            timeout=180.0
        )
    
        from llama_index.core import VectorStoreIndex, StorageContext
        from llama_index.vector_stores.postgres import PGVectorStore
        import base64
        from kubernetes import client, config
    
        def get_secret_credentials():
            try:
                config.load_incluster_config()  # For in-cluster access
                v1 = client.CoreV1Api()
                secret = v1.read_namespaced_secret(name="pgvector-app", namespace="team-demo")
                password = base64.b64decode(secret.data['password']).decode('utf-8')
                username = base64.b64decode(secret.data['username']).decode('utf-8')
                return username, password
            except Exception as e:
                print(f"Error getting secret: {e}")
                return 'app', 'changeme'
    
        pg_user, pg_password = get_secret_credentials()
    
        vector_store = PGVectorStore.from_params(
            database="app",
            host="pgvector-rw.team-demo.svc.cluster.local",
            port=5432,
            user=pg_user,
            password=pg_password,
            table_name=table_name,
            embed_dim=4096
        )
        storage_context = StorageContext.from_defaults(vector_store=vector_store)
        index = VectorStoreIndex.from_documents(
            documents, storage_context=storage_context
        )
    
    @dsl.pipeline
    def doc_ingest_pipeline(url: str, table_name: str) -> None:
        comp = doc_ingest_component(url=url, table_name=table_name)
    
    from kfp import compiler
    
    compiler.Compiler().compile(doc_ingest_pipeline, 'pipeline.yaml')
  4. Run the script to generate a pipeline YAML file called pipeline.yaml:

    python3 doc-ingest-pipeline.py

    This file is uploaded to Kubeflow Pipelines in the following section. It can also be submitted programmatically, as shown in the sketch after this list.

  5. Exit the Python virtual environment:

    deactivate
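
As an alternative to uploading the file through the web UI in the next section, the compiled pipeline can also be submitted with the Kubeflow Pipelines SDK. The sketch below is a hedged example: the host URL assumes you have port-forwarded the Kubeflow Pipelines API to your local machine (the exact service name and any authentication depend on your App Platform environment), and the argument values match the ones used in the UI-based run below.

    # Hedged sketch: submit pipeline.yaml with the Kubeflow Pipelines SDK
    # instead of the web UI. The host URL assumes a local port-forward to the
    # Kubeflow Pipelines API; adjust it (and any auth) for your environment.
    from kfp import Client

    client = Client(host="http://localhost:8080")  # assumed port-forwarded endpoint

    run = client.create_run_from_pipeline_package(
        "pipeline.yaml",
        arguments={
            "url": "https://github.com/linode/rag-datasets/raw/refs/heads/main/cloud-computing.zip",
            "table_name": "linode_docs",
        },
    )
    print(f"Started run {run.run_id}")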

Run the Pipeline Workflow

  1. Select view > team and team > demo in the top bar.

  2. Select Apps.

  3. Click on the kubeflow-pipelines app.

  4. The UI opens the Pipelines section. Click Upload pipeline.

  5. Under Upload a file, select the pipeline.yaml file created in the previous section, and click Create.

  6. Select Runs from the left menu, and click Create run.

  7. Under Pipeline, choose the pipeline.yaml pipeline you just uploaded.

  8. For Run Type choose One-off.

  9. Use linode_docs for the table_name parameter.

  10. To use the sample Linode Docs data set in this guide, enter the following GitHub URL for the url parameter:

      https://github.com/linode/rag-datasets/raw/refs/heads/main/cloud-computing.zip

  11. Click Start to run the pipeline. When completed, the run is shown with a green checkmark to the left of the run title.

Deploy the AI Agent

The next step is to use Open WebUI pipelines configured with an agent pipeline. This connects the data generated by the Kubeflow pipeline with the LLM deployed in KServe. It also exposes an OpenAI API endpoint to allow for a connection with the chatbot.

The Open WebUI pipeline uses the PGvector database to load context related to the query, then sends both the context and the query to the Llama LLM instance served by KServe. The LLM sends a response back to the chatbot, and your browser displays an answer informed by the custom data set.

Create a ConfigMap with the Agent Pipeline Files

The agent pipeline files in this section are not related to the Kubeflow pipeline created in the previous section. Instead, the agent pipeline instructs the agent how to interact with each component created thus far, including the PGvector data store, the embedding model, and the Llama foundation model.

  1. Select view > team and team > demo in the top bar.

  2. Navigate to the Apps section, and click on Gitea.

  3. In Gitea, navigate to the team-demo-argocd repository on the right.

  4. Click the Add File dropdown, and select New File. Create a file with the name my-agent-pipeline-files.yaml with the following contents:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: my-agent-pipeline
    data:
      agent-pipeline-requirements.txt: |
        psycopg2-binary
        llama-index
        llama-index-vector-stores-postgres
        llama-index-embeddings-openai-like
        llama-index-llms-openai-like
        opencv-python-headless
        kubernetes
      agent-pipeline.py: |
        import base64
        from llama_index.core import Settings, VectorStoreIndex
        from llama_index.core.llms import ChatMessage
        from llama_index.llms.openai_like import OpenAILike
        from llama_index.embeddings.openai_like import OpenAILikeEmbedding
        from llama_index.vector_stores.postgres import PGVectorStore
        from kubernetes import client, config as k8s_config
    
        # LLM configuration
        LLM_MODEL = "meta-llama-3-1-8b"
        LLM_API_BASE = "http://llama-3-1-8b.team-models.svc.cluster.local/openai/v1"
        LLM_API_KEY = "EMPTY"
        LLM_MAX_TOKENS = 512
    
        # Embedding configuration
        EMBEDDING_MODEL = "mistral-7b-instruct"
        EMBEDDING_API_BASE = "http://mistral-7b.team-models.svc.cluster.local/openai/v1"
        EMBED_BATCH_SIZE = 10
        EMBED_DIM = 4096
    
        # Database configuration
        DB_NAME = "app"
        DB_TABLE_NAME = "linode_docs"
        DB_SECRET_NAME = "pgvector-app"
        DB_SECRET_NAMESPACE = "team-demo"
    
        # RAG configuration
        SIMILARITY_TOP_K = 3
        SYSTEM_PROMPT = """You are a helpful AI assistant for Linode."""
    
        class Pipeline:
          def __init__(self):
            self.name = "my-agent"
            self.kb_index = None  # Store the KB index for creating chat engines
            self.system_prompt = SYSTEM_PROMPT  # Store system prompt for LLM-only mode
    
          async def on_startup(self):
            Settings.llm = OpenAILike(
              model=LLM_MODEL,
              api_base=LLM_API_BASE,
              api_key=LLM_API_KEY,
              max_tokens=LLM_MAX_TOKENS,
              is_chat_model=True,
              is_function_calling_model=True
            )
            Settings.embed_model = OpenAILikeEmbedding(
              model_name=EMBEDDING_MODEL,
              api_base=EMBEDDING_API_BASE,
              embed_batch_size=EMBED_BATCH_SIZE,
              max_retries=3,
              timeout=180.0
            )
            self.kb_index = self._build_vector_index()
    
          def _build_vector_index(self):
            """Builds a vector index from database."""
            db_credentials = self._get_db_credentials()
    
            vector_store = PGVectorStore.from_params(
              database=DB_NAME,
              host=db_credentials["host"],
              port=db_credentials["port"],
              user=db_credentials["username"],
              password=db_credentials["password"],
              table_name=DB_TABLE_NAME,
              embed_dim=EMBED_DIM,
            )
            return VectorStoreIndex.from_vector_store(vector_store)
    
          def _get_db_credentials(self):
            """Get database credentials from Kubernetes secret."""
            k8s_config.load_incluster_config()
            v1 = client.CoreV1Api()
            secret = v1.read_namespaced_secret(
              name=DB_SECRET_NAME,
              namespace=DB_SECRET_NAMESPACE,
            )
            return {
              "username": base64.b64decode(secret.data["username"]).decode("utf-8"),
              "password": base64.b64decode(secret.data["password"]).decode("utf-8"),
              "host": base64.b64decode(secret.data["host"]).decode("utf-8"),
              "port": int(base64.b64decode(secret.data["port"]).decode("utf-8")),
            }
    
          def _convert_to_chat_history(self, messages):
            """Convert request messages to ChatMessage objects for chat history.
    
            Args:
              messages: List of message dicts with 'role' and 'content'
    
            Returns:
              List of ChatMessage objects excluding the last message (current message)
            """
            chat_history = []
            if messages and len(messages) > 1:
                for msg in messages[:-1]:  # Exclude current message
                    chat_history.append(ChatMessage(role=msg['role'], content=msg['content']))
            return chat_history
    
          def pipe(self, user_message, model_id, messages, body):
            try:
                if self.kb_index is None:
                  yield "Error: Knowledge base not initialized. Please check system configuration."
                  return
    
                chat_history = self._convert_to_chat_history(messages)
    
                # Create chat engine on-demand (stateless)
                chat_engine = self.kb_index.as_chat_engine(
                    chat_mode="condense_plus_context",
                    streaming=True,
                    similarity_top_k=SIMILARITY_TOP_K,
                    system_prompt=self.system_prompt
                )
                # Get streaming response
                streaming_response = chat_engine.stream_chat(user_message, chat_history=chat_history)
                for token in streaming_response.response_gen:
                    yield token
            except Exception as e:
              print(f"\nDEBUG: Unexpected error: {type(e).__name__}: {str(e)}")
              yield "I apologize, but I encountered an unexpected error while processing your request. Please try again."
              return
  5. Optionally add a title and any notes to the change history, and click Commit Changes.

  6. Go to Apps, and open the Argocd application. Navigate to the team-demo application to check whether the ConfigMap has been created. If it is not shown yet, click Refresh.

Deploy the open-webui Pipeline and Web Interface

Update the Kyverno policy open-webui-policy.yaml created in the previous tutorial (Deploy an LLM for AI Inference with App Platform for LKE) so that it mutates the open-webui pods deployed in this section.

Add the pipelines Helm Chart to the Catalog

  1. Select view > team and team > admin in the top bar.

  2. Click on Catalog in the left menu.

  3. Select Add Helm Chart.

  4. Under Git Repository URL, add the URL to the open-webui pipelines Helm chart:

    https://github.com/open-webui/helm-charts/blob/pipelines-0.4.0/charts/pipelines/Chart.yaml
  5. Click Get Details to populate the pipelines Helm chart details.

  6. Leave Allow teams to use this chart selected.

  7. Click Add Chart.

Create a Workload for the pipelines Helm Chart

  1. Select view > team and team > demo in the top bar.

  2. Select Workloads.

  3. Click on Create Workload.

  4. Select the pipelines Helm chart from the catalog.

  5. Click on Values.

  6. Provide a name for the workload. This guide uses the workload name my-agent.

  7. Add in or change the following chart values. Make sure to set the name of the workload in the nameOverride field.

    You may need to uncomment some fields by removing the # sign in order to make them active. Be mindful of indentation:

    nameOverride: my-agent
    resources:
      requests:
        cpu: "1"
        memory: "512Mi"
      limits:
        cpu: "3"
        memory: "2Gi"
    ingress:
      enabled: false
    extraEnvVars:
      - name: PIPELINES_REQUIREMENTS_PATH
        value: "/opt/agent-pipeline-requirements.txt"
      - name: PIPELINES_URLS
        value: "file:///opt/agent-pipeline.py"
    volumeMounts:
      - name: config-volume
        mountPath: "/opt"
    volumes:
      - name: config-volume
        configMap:
          name: my-agent-pipeline
    
  8. Click Submit.

Add a new Role and a RoleBinding for the Agent

The agent pipeline requires access to the PGvector database. To configure this, the ServiceAccount of the agent needs access to the pgvector-app secret that includes the database credentials. Create the Role and RoleBinding by following the steps below.

  1. Select view > platform in the top bar.

  2. Select Apps in the left menu.

  3. In the Apps section, select the Gitea app.

  4. In Gitea, navigate to the team-demo-argocd repository.

  5. Click the Add File dropdown, and select New File. Create a file named my-agent-rbac.yaml with the following contents:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: pgvector-app-secret-reader
    rules:
      - apiGroups: [""]
        resources: ["secrets"]
        resourceNames: ["pgvector-app"]
        verbs: ["get", "list"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: pgvector-app-secret-reader
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: pgvector-app-secret-reader
    subjects:
      - kind: ServiceAccount
        name: my-agent
        namespace: team-demo

Create a Workload to Install the open-webui Helm Chart

  1. Select view > team and team > demo in the top bar.

  2. Select Workloads.

  3. Click on Create Workload.

  4. Select the open-webui Helm chart from the catalog. This Helm chart should have been added in the previous Deploy an LLM for AI Inference with App Platform for LKE guide.

  5. Click on Values.

  6. Provide a name for the workload. This guide uses the name my-agent-ui.

  7. Edit the chart to include the values below, and set the name of the workload in the nameOverride field.

    nameOverride: my-agent-ui
    ollama:
      enabled: false
    pipelines:
      enabled: false
    persistence:
      enabled: false
    replicaCount: 1
    openaiBaseApiUrl: "http://my-agent.team-demo.svc.cluster.local:9099"
    extraEnvVars:
      - name: WEBUI_AUTH
        value: "false"
      - name: OPENAI_API_KEY
        value: "0p3n-w3bu!"
    
  8. Click Submit.

Publicly expose the my-agent-ui Service

  1. Select Services.

  2. Click Create Service.

  3. In the Service Name dropdown list, select the my-agent-ui service.

  4. Click Create Service.

In the list of available Services, click on the URL of the my-agent-ui service to open the Open WebUI chatbot interface.
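
To confirm that responses are grounded in the indexed documentation, you can also query the agent's OpenAI-compatible endpoint directly rather than going through the chatbot UI. This is a hedged sketch: it assumes you port-forward the my-agent pipelines service (kubectl port-forward -n team-demo svc/my-agent 9099:9099), that the API key matches the OPENAI_API_KEY value set above, and that the model id matches the pipeline name set in agent-pipeline.py.

    # Hedged sketch: query the agent pipeline's OpenAI-compatible endpoint.
    # Assumes a local port-forward of the my-agent service on port 9099; the
    # API key matches the open-webui workload values, and the model id is
    # assumed to match the pipeline name defined in agent-pipeline.py.
    import requests

    response = requests.post(
        "http://localhost:9099/v1/chat/completions",
        headers={"Authorization": "Bearer 0p3n-w3bu!"},
        json={
            "model": "my-agent",
            "messages": [{"role": "user", "content": "How do I resize an LKE node pool?"}],
            "stream": False,
        },
        timeout=120,
    )
    response.raise_for_status()
    print(response.json()["choices"][0]["message"]["content"])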
