AI Inference with App Platform for LKE

LLMs (large language models) generate human-like text and perform language-based tasks by being trained on massive datasets. AI inference is the process of using a trained model to make predictions or decisions on new data outside of those training datasets. The model used in this deployment, Meta AI's Llama 3, is an open-source, pre-trained LLM often used for tasks like answering questions in multiple languages, coding, and advanced reasoning.

KServe is a Model Inference Framework for Kubernetes, built for highly-scalable use cases. KServe comes with multiple Model Serving Runtimes, including the Hugging Face serving runtime. The Hugging Face runtime supports the following machine learning (ML) tasks: text generation, Text2Text generation, token classification, sequence and text classification, and fill mask. KServe is integrated in App Platform for LKE, Akamai’s pre-built Kubernetes developer platform.

App Platform also integrates Istio and Knative, both of which are prerequisites for using KServe. App Platform automates the provisioning process of these applications.

This guide describes the steps required to: install KServe with App Platform, deploy the Meta Llama 3.1 8B model using the Hugging Face runtime server, and deploy a chatbot interface using Open WebUI. Once functional, use our Deploy a RAG Pipeline and Chatbot with App Platform for LKE guide to add and run a RAG pipeline and deploy an AI Agent that exposes an OpenAI compatible API.

If you prefer to manually install an LLM and RAG Pipeline on LKE rather than using Akamai App Platform, see our Deploy a Chatbot and RAG Pipeline for AI Inference on LKE guide.

Diagram

Components

Infrastructure

  • Linode GPUs (NVIDIA RTX 4000): Akamai has several GPU virtual machines available, including NVIDIA RTX 4000 (used in this tutorial) and Quadro RTX 6000. NVIDIA's Ada Lovelace architecture in the RTX 4000 VMs is well suited to many AI tasks, including inference and image generation.

  • Linode Kubernetes Engine (LKE): LKE is Akamai’s managed Kubernetes service, enabling you to deploy containerized applications without needing to build out and maintain your own Kubernetes cluster.

  • App Platform for LKE: A Kubernetes-based platform that combines developer and operations-centric tools, automation, self-service, and management of containerized application workloads. App Platform streamlines the application lifecycle from development to delivery and connects numerous CNCF (Cloud Native Computing Foundation) technologies in a single environment, allowing you to construct a bespoke Kubernetes architecture.

Software

  • Open WebUI: A self-hosted AI chatbot application that’s compatible with LLMs like Llama 3 and includes a built-in inference engine for RAG (Retrieval-Augmented Generation) solutions. Users interact with this interface to query the LLM.

  • Hugging Face: A data science platform and open-source library of data sets and pre-trained AI models. A Hugging Face account and access key is required to access the Llama 3 large language model (LLM) used in this deployment.

  • meta-llama/Llama-3.1-8B-Instruct LLM: The meta-llama/Llama-3.1-8B-Instruct model is used as the foundational LLM in this guide. You must review and agree to the licensing agreement before deploying.

  • KServe: Serves machine learning models. This tutorial installs the Llama 3 LLM to KServe, which then serves it to other applications, such as the chatbot UI.

  • Istio: An open source service mesh used for securing, connecting, and monitoring microservices.

  • Knative: Used for deploying and managing serverless workloads on the Kubernetes platform.

Prerequisites

  • A Cloud Manager account is required to use Akamai’s cloud computing services, including LKE.

  • A Hugging Face account is used for pulling Meta AI’s Llama 3 model.

  • Access granted to Meta AI’s Llama 3 model is required. To request access, navigate to Hugging Face’s Llama 3-8B Instruct LLM link, read and accept the license agreement, and submit your information.

Set Up Infrastructure

Provision an LKE Cluster

We recommend provisioning an LKE cluster with App Platform enabled and the following minimum requirements:

  • A node pool with three 8 GB Dedicated CPU Compute Instances and autoscaling turned on.
  • A second node pool consisting of at least two RTX4000 Ada x1 Medium GPU plans.

Once your LKE cluster is provisioned and the App Platform portal is available, complete the following steps to continue setting up your infrastructure.

Sign into the App Platform web UI using the platform-admin account, or another account that uses the platform-admin role. Instructions for signing into App Platform for the first time can be found in our Getting Started with Akamai App Platform guide.

Enable Knative and KServe

  1. Select view > platform in the top bar.

  2. Select Apps in the left menu.

  3. Enable the Knative and KServe apps by hovering over each app icon and clicking the power on button. It may take a few minutes for the apps to enable.

    Enabled apps move up and appear in color towards the top of the available app list.
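
If you want to confirm that the serving components are running before continuing, you can check their pods from a shell with cluster access (for example, the admin team Shell used later in this guide). This is an optional sketch; it assumes Knative and KServe are installed into the knative-serving and kserve namespaces.

    # Optional check; the namespaces are assumptions based on a default installation:
    kubectl get pods -n knative-serving
    kubectl get pods -n kserve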

Create Teams

Teams are isolated tenants on the platform that support development and DevOps teams, projects, or even DTAP (Development, Testing, Acceptance, Production) environments. A team gets access to the App Platform portal, including access to self-service features and all shared apps available on the platform.

For this guide, you need to create two teams: one that offers access to LLMs as a shared service, and one that consumes those LLMs.

First, create a team to run the LLMs:

  1. Select view > platform.

  2. Select Teams in the left menu.

  3. Click Create Team.

  4. Provide a Name for the team. This guide uses the team name models.

  5. Under Resource Quota, change the Compute Resource Quota to 50 Cores and 64 Gi of memory.

  6. Under Network Policies, disable Egress Control and Ingress Control.

    See Appendix 1 and Appendix 2 for how to proceed if Egress Control and Ingress Control must remain enabled for compliance reasons.

  7. Click Create Team.

Now create a team to run the apps that are to consume the LLMs:

  1. Click Create Team.

  2. Provide a Name for the team. This guide uses the team name demo.

  3. Under Network Policies, disable Egress Control and Ingress Control.

  4. Click Create Team.

Install the NVIDIA GPU Operator

The NVIDIA GPU Operator automates the management of NVIDIA software components needed for provisioning the GPUs, including drivers, the Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, and others.

  1. Select view > team and team > admin in the top bar.

  2. Select Shell in the left menu. Wait for the shell session to load.

  3. In the provided shell session, install the NVIDIA GPU operator using Helm:

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --version=v24.9.1
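
Once the operator pods are up, the GPU worker nodes should advertise an nvidia.com/gpu resource that KServe can request. A quick, optional check from the same shell session (it assumes the shell has permission to read node objects):

    # Confirm the operator pods are running and the GPUs are advertised by the nodes:
    kubectl get pods -n gpu-operator
    kubectl describe nodes | grep "nvidia.com/gpu"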

Add the open-webui Helm Chart to the catalog

  1. Click on Catalog in the left menu.

  2. Select Add Helm Chart.

  3. Under Git Repository URL, add the URL to the open-webui Helm chart:

    https://github.com/open-webui/helm-charts/blob/open-webui-5.20.0/charts/open-webui/Chart.yaml
  4. Click Get Details to populate the open-webui Helm chart details.

  5. Leave the Allow teams to use this chart option selected.

  6. Click Add Chart.

Add the hf-meta-llama-3-1-8b-instruct Helm Chart to the catalog

  1. Click on Catalog in the left menu.

  2. Select Add Helm Chart.

  3. Under Git Repository URL, add the URL to the hf-meta-llama-3-1-8b-instruct Helm chart:

    https://github.com/linode/apl-examples/blob/main/inference/kserve/hf-meta-llama-3-1-8b-instruct/Chart.yaml
  4. Click Get Details to populate the Helm chart details.

  5. Uncheck the Allow teams to use this chart option. In the next step, configure the RBAC of the catalog to make this Helm chart available to the models team.

  6. Click Add Chart.

Now, configure the RBAC of the catalog:

  1. Select view > platform.

  2. Select Apps in the left menu.

  3. Click on the Gitea app.

  4. In the list of repositories, click on otomi/charts.

  5. At the bottom, click on the file rbac.yaml.

  6. Change the RBAC for the hf-meta-llama-3.1-8b-instruct Helm chart as shown below:

    hf-meta-llama-3.1-8b-instruct:
      - team-models
    

Create a Hugging Face Access Token

  1. Navigate to the Hugging Face Access Tokens page.

  2. Click Create new token.

  3. Under Token type, select “Write” access.

  4. Enter a name for your token, and click Create token.

  5. Save your access token information.

See the Hugging Face user documentation on User access tokens for additional information.
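
To confirm the token works before storing it in the platform, you can call Hugging Face's whoami endpoint with curl from any machine. This is an optional sketch; replace HUGGING_FACE_TOKEN with the token you just created.

    # Returns your account details if the token is valid:
    curl -s -H "Authorization: Bearer HUGGING_FACE_TOKEN" https://huggingface.co/api/whoami-v2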

Request Access to Llama 3

If you haven't done so already, request access to the Llama 3 model. To do this, go to Hugging Face's Llama 3-8B Instruct LLM link, read and agree to the license agreement, and submit your information. You must wait for access to be granted before proceeding.

Deploy the Llama Model

Create a Sealed Secret

Sealed Secrets are encrypted Kubernetes secrets stored in the Values repository. When a sealed secret is created in the console, the Kubernetes secret appears in the team’s namespace.

  1. Select view > team and team > models in the top bar.

  2. Select Sealed Secrets from the menu.

  3. Click Create SealedSecret.

  4. Add the name hf-secret.

  5. Select type kubernetes.io/opaque from the type dropdown menu.

  6. Add Key: HF_TOKEN.

  7. In the Value field, add your Hugging Face access token (HUGGING_FACE_TOKEN).

  8. Click Submit. The sealed secret may take a few minutes to become ready.
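
To verify that the underlying Kubernetes secret has been created, you can check the team's namespace from the team Shell. A sketch, assuming the team-models namespace and the hf-secret name used above:

    # The sealed secret is unsealed into a regular Kubernetes secret in team-models:
    kubectl get sealedsecrets -n team-models
    kubectl get secret hf-secret -n team-models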

Create a Workload to Deploy the Model

  1. Select view > team and team > models in the top bar.

  2. Select Catalog from the menu.

  3. Select the hf-meta-llama-3-1-8b-instruct chart.

  4. Click on Values.

  5. Provide a name for the workload. This guide uses the workload name llama-3-1-8b.

  6. Use the default values and click Submit.

Check the Status of Your Workload

  1. It may take a few minutes for the llama-3-1-8b workload to become ready. To check on it, open a shell session by selecting Shell in the left menu, and check the status of the pods with kubectl:

    kubectl get pods -n team-models
    NAME                                                       READY   STATUS    RESTARTS   AGE
    llama-3-1-8b-predictor-00001-deployment-68d58ccfb4-jg6rw   0/3     Pending   0          22s
    tekton-dashboard-5f57787b8c-gswc2                          2/2     Running   0          1h

Wait for the workload to be ready before proceeding.
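
Because the chart deploys the model as a KServe InferenceService, you can also watch its readiness directly. A sketch, assuming the InferenceService takes the workload name llama-3-1-8b:

    # READY shows True once the predictor has downloaded the model and started serving:
    kubectl get inferenceservice llama-3-1-8b -n team-models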

Deploy and Expose the AI Interface

Create a Workload to Deploy the AI Interface

  1. Select view > team and team > demo in the top bar.

  2. Select Catalog from the menu.

  3. Select the open-webui chart.

  4. Click on Values.

  5. Provide a name for the workload. This guide uses the workload name llama3-ui.

  6. Add the following values and change the nameOverride value to the name of your workload, llama3-ui:

    # Change the nameOverride to match the name of the Workload
    nameOverride: llama3-ui
    ollama:
      enabled: "false"
    pipelines:
      enabled: "false"
    replicaCount: "1"
    persistence:
      enabled: "false"
    openaiBaseApiUrl: http://llama-3-1-8b.team-models.svc.cluster.local/openai/v1
    extraEnvVars:
      - name: "WEBUI_AUTH"
        value: "false"
  7. Click Submit.
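
After submitting, the open-webui pods are created in the team's namespace. An optional check from the demo team's Shell (the namespace name team-demo follows the team-<name> pattern used elsewhere in this guide):

    # The llama3-ui pod should reach the Running state once its image is pulled:
    kubectl get pods -n team-demo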

Expose the AI Interface

  1. Select Services from the menu.

  2. Click Create Service.

  3. In the Service Name dropdown menu, select the llama3-ui service.

  4. Click Create Service.

Access the Open Web User Interface

Once the AI user interface is ready, you should be able to access the web UI for the Open WebUI chatbot.

  1. Click on Services in the menu.

  2. In the list of available services, click on the URL for the llama3-ui service. This should bring you to the chatbot user interface.
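
Open WebUI reaches the model through KServe's OpenAI-compatible API, and you can query that endpoint directly from a shell inside the cluster (for example, the team Shell). This is an optional sketch using the in-cluster service URL configured above; the URL only resolves from inside the cluster, and the model name in the request is an assumption, so use a name returned by the models call.

    # List the models exposed by the Hugging Face serving runtime:
    curl -s http://llama-3-1-8b.team-models.svc.cluster.local/openai/v1/models

    # Send a simple chat completion request; set "model" to a name returned above:
    curl -s http://llama-3-1-8b.team-models.svc.cluster.local/openai/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "llama-3-1-8b", "messages": [{"role": "user", "content": "Hello!"}]}'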

Next Steps

See our Deploy a RAG Pipeline and Chatbot with App Platform for LKE guide to expand on the architecture built in this guide. This tutorial deploys a RAG (Retrieval-Augmented Generation) pipeline that indexes a custom data set and attaches relevant data as context when users send the LLM queries.

Appendix 1: Ingress Control

When we created the demo and models teams, we turned off Ingress Control. Ingress Control governs internal access to pods: when it is enabled, pods in the team namespace are not accessible to other pods (in the same team namespace or in other team namespaces). For the simplicity of this guide, Ingress Control was turned off. If you don't want to disable Ingress Control for all the workloads in a team, you can turn Ingress Control on and create Inbound Rules in the team's network policies. Follow these steps to create inbound rules that control access to the models hosted in the models team:

  1. Select view > team and team > models in the top bar.

  2. Select Network Policies in the left menu.

  3. Click Create Inbound Rule.

  4. Add a name for the rule (for example, model-access).

  5. Under Sources, select the workload (in this case, the llama3-ui workload) and select a pod label.

  6. Under Target, select the workload (in this case, the llama-3-1-8b workload) and select a pod label.

  7. Click Create Inbound Rule.

Note that in some cases, the Target pod needs to be restarted if it had already accepted connections before the inbound rule was created.
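
Conceptually, such an inbound rule restricts ingress to the model pods so that only the chatbot workload can reach them. The following is an illustrative sketch of an equivalent standard Kubernetes NetworkPolicy, not the object App Platform actually generates; the label selectors are assumptions for illustration only.

    # Illustrative only: App Platform manages its own policy objects.
    # The labels and names below are assumptions.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: model-access
      namespace: team-models
    spec:
      podSelector:
        matchLabels:
          app.kubernetes.io/instance: llama-3-1-8b
      ingress:
        - from:
            - namespaceSelector:
                matchLabels:
                  kubernetes.io/metadata.name: team-demo
              podSelector:
                matchLabels:
                  app.kubernetes.io/instance: llama3-ui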

Appendix 2: Egress Control

When we created the demo and models teams, we turned off Egress Control. Egress Control is implemented using Istio Service Entries, and Istio sidecar injection is enabled by default. Egress Control restricts pod access to public URLs. Because the Hugging Face models need to be downloaded from an external repository, and open-webui installs multiple binaries from external sources, both the LLM pod and open-webui need access to multiple public URLs. For the simplicity of this guide, we turned Egress Control off. If you don't want to disable Egress Control for all the workloads in a team, you can turn Egress Control on and either create Outbound Rules in the team's network policies or turn off sidecar injection for specific workloads (pods). There are a couple of ways to do the latter:

  • Add the label sidecar.istio.io/inject: "false" to the workload using the Chart Values

  • Enable Kyverno and create a Kyverno Policy that mutates the pod so that it has the sidecar.istio.io/inject: "false" label.

The open-webui Helm chart used in this guide does not support adding additional labels to pods. The following instructions and example show how to use Kyverno to mutate the open-webui pods and add the sidecar.istio.io/inject: "false" label.

  1. Select view > platform in the top bar.

  2. Select Apps in the left menu.

  3. In the Apps section, enable the Kyverno app.

  4. In the Apps section, select the Gitea app.

  5. In Gitea, navigate to the team-demo-argocd repository.

  6. Click the Add File dropdown, and select New File. Create a file named open-webui-policy.yaml with the following contents:

    apiVersion: kyverno.io/v1
    kind: Policy
    metadata:
      name: disable-sidecar-injection
      annotations:
        policies.kyverno.io/title: Disable Istio sidecar injection
    spec:
      rules:
      - name: disable-sidecar-injection
        match:
          any:
          - resources:
              kinds:
              - StatefulSet
              - Deployment
              selector:
                matchLabels:
                  ## change the value to match the name of the Workload
                  app.kubernetes.io/instance: "llama3-ui"
        mutate:
          patchStrategicMerge:
            spec:
              template:
                metadata:
                  labels:
                    sidecar.istio.io/inject: "false"
  7. Optionally add a title and any notes to the change history. Then, click Commit Changes.

  8. Check to see if the policy has been created in Argo CD:

    1. Go to Apps and open the Argocd application.

    2. Using the search feature, go to the team-demo application to see if the policy has been created. If it isn’t there yet, view the team-demo application in the list of Applications, and click Refresh if needed.
