Version: 4.5

Katonic MLOps

Node pool requirements

Katonic requires a minimum of three node pools: one to host the Katonic Platform, one to host Compute workloads, and one for storage. Additional optional pools can be added to provide specialized execution hardware for some Compute workloads.

  1. Master pool requirements

    • Boot Disk: Min 128GB

    • Min Nodes: 1

    • Max Nodes: 3

    • Spec: 2 CPU / 8GB

    • Nodes must support the Advanced Vector Extensions (AVX) instruction set, because some platform services depend on it. Relevant instruction sets: SSE4.2, AVX, AVX2, AVX-512

  2. Platform pool requirements

    • Boot Disk: Min 128GB

    • Min Nodes: 2

    • Max Nodes: 3

    • Spec: 4 CPU / 16GB

    • Taints: katonic.ai/node-pool=platform:NoSchedule

    • Labels: katonic.ai/node-pool=platform

    • Nodes must support the Advanced Vector Extensions (AVX) instruction set, because some platform services depend on it. Relevant instruction sets: SSE4.2, AVX, AVX2, AVX-512

  3. Compute pool requirements

    • Boot Disk: Min 128GB

    • Recommended Min Nodes: 1

    • Max Nodes: Set as necessary to meet demand and resourcing needs

    • Recommended min spec: 8 CPU / 32GB

    • Labels: katonic.ai/node-pool=compute

    • Nodes must support the Advanced Vector Extensions (AVX) instruction set, because some platform services depend on it. Relevant instruction sets: SSE4.2, AVX, AVX2, AVX-512

> **Note**: When **backup_enabled = True**, **compute_nodes.min_count** should be set to **2**.

  4. Deployment pool requirements

    • Boot Disk: Min 128GB

    • Recommended Min Nodes: 1

    • Max Nodes: Set as necessary to meet demand and resourcing needs

    • Recommended min spec: 8 CPU / 32GB

    • Taints: katonic.ai/node-pool=deployment:NoSchedule

    • Labels: katonic.ai/node-pool=deployment

    • Nodes must support the Advanced Vector Extensions (AVX) instruction set, because some platform services depend on it. Relevant instruction sets: SSE4.2, AVX, AVX2, AVX-512

  5. Optional GPU compute pool

    • Boot Disk: recommended 512GB

    • Recommended Min Nodes: 0

    • Max Nodes: Set as necessary to meet demand and resourcing needs

    • Recommended min spec: 8 CPU / 16GB / one or more NVIDIA GPU devices

    • Nodes must be pre-configured with the appropriate NVIDIA driver and nvidia-docker2, and the default Docker runtime must be set to nvidia.

    • Taints: nvidia.com/gpu=gpu-{GPU-type}

    • Labels: katonic.ai/node-pool=gpu-{GPU-type}

    • Nodes must support the Advanced Vector Extensions (AVX) instruction set, because some platform services depend on it. Relevant instruction sets: SSE4.2, AVX, AVX2, AVX-512

Note: Example values for {GPU-type} are v100, A30, and A100.
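Every pool above lists the same CPU instruction-set requirement. As a minimal sketch, the check below could be run on a Linux node before adding it to a pool. The function names `missing_cpu_flags` and `node_cpu_flags` are hypothetical, and the flag spellings (`sse4_2`, `avx`, `avx2`) are the names Linux uses in /proc/cpuinfo. Which of the listed instruction sets are strictly required is not stated here, so the flag list in the example is an assumption to adjust.

```shell
#!/bin/sh
# Print every required CPU flag that is absent from a space-separated flag list.
# Usage: missing_cpu_flags "<flags>" flag1 flag2 ...
# (Hypothetical helper, not part of the katonic-installer.)
missing_cpu_flags() {
    have=" $1 "
    shift
    for flag in "$@"; do
        case "$have" in
            *" $flag "*) ;;              # flag is present
            *) printf '%s\n' "$flag" ;;  # flag is missing
        esac
    done
}

# On a Linux node, read the flags of the first CPU from /proc/cpuinfo.
node_cpu_flags() {
    grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2
}

# Example, run on the node itself (flag list is an assumption):
#   missing_cpu_flags "$(node_cpu_flags)" sse4_2 avx avx2
```

An empty result means every flag that was asked for is present; any printed name is a flag the node does not report.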

Katonic Platform Installation

Installation process

The Katonic platform runs on Kubernetes. To simplify the deployment and configuration of Katonic services, Katonic provides an install automation tool called the katonic-installer that will deploy Katonic into your compatible cluster. The katonic-installer is a Python application delivered in a Docker container, and can be run locally or as a job inside the target cluster.

Prerequisites

The install automation tools are delivered as a Docker image, and must run on an installation workstation that meets the following requirements:

  • Docker

  • Kubectl

  • Access to quay.io and credentials for an installation service account with access to the Katonic installer image and upstream image repositories. Throughout these instructions, these credentials will be referred to as $QUAY_USERNAME and $QUAY_PASSWORD. Contact your Katonic account team if you need new credentials.

The hosting cluster must have access to the following domains through the Internet to retrieve component and dependency images for online installation:

Alternatively, you can configure the katonic-installer to point to a private Docker registry and application registry for offline installation. Please reach out to your account manager if you would like an offline/private installation.
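Before starting, it can help to confirm that the workstation has the required tools on its PATH. The following is a minimal sketch; `missing_tools` is a hypothetical helper name, and it only checks that the commands exist, not their versions or registry access.

```shell
#!/bin/sh
# Print each command from the argument list that is not found on PATH.
# (Hypothetical helper, not part of the katonic-installer.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example:
#   missing_tools docker kubectl
```

No output means both tools were found.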

1. Create a new directory for the installation

mkdir katonic
cd katonic

2. Custom certificates

The Katonic Platform is accessed over HTTPS. To secure it with custom certificates, provide the two files listed below:

  1. A PEM-encoded public key certificate (the file name must end with the .crt extension).

  2. The private key associated with that certificate (the file name must end with the .key extension).

Put these files in the katonic directory.
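A quick sanity check on the certificate files can catch naming mistakes before the installer runs. This is a sketch under stated assumptions: the helper names `count_ext`, `check_cert_files`, and `cert_matches_key` are hypothetical, and requiring exactly one .crt and one .key file in the directory is an assumption, since only the file extensions are specified above. The optional `cert_matches_key` check needs the openssl CLI and an unencrypted private key.

```shell
#!/bin/sh
# Count regular files in directory $1 whose names end in ".$2".
count_ext() {
    n=0
    for f in "$1"/*."$2"; do
        [ -f "$f" ] && n=$((n + 1))
    done
    echo "$n"
}

# Succeed only if the directory holds exactly one .crt and one .key file
# (the "exactly one" rule is an assumption, not a documented requirement).
check_cert_files() {
    [ "$(count_ext "${1:-.}" crt)" -eq 1 ] && [ "$(count_ext "${1:-.}" key)" -eq 1 ]
}

# Optional: succeed if the certificate's public key matches the private key.
cert_matches_key() {
    cert_pub=$(openssl x509 -in "$1" -noout -pubkey) || return 1
    key_pub=$(openssl pkey -in "$2" -pubout) || return 1
    [ "$cert_pub" = "$key_pub" ]
}

# Example, run inside the katonic directory (file names are placeholders):
#   check_cert_files . && cert_matches_key example.crt example.key
```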

3. Pull the katonic-installer image

  1. Log in to quay.io with the credentials described in the Prerequisites section.

docker login quay.io

  2. Find the image URI for the version of the katonic-installer you want to use from the release notes.

  3. Pull the image to your local machine.

docker pull quay.io/katonic/katonic-installer:v4.5
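For scripted setups, the login and pull can be done without an interactive prompt. The sketch below assumes $QUAY_USERNAME and $QUAY_PASSWORD are already exported in the environment; `pull_installer` is a hypothetical wrapper name. It uses docker login's `--password-stdin` option so the password does not appear in the process list or shell history.

```shell
#!/bin/sh
# Log in to quay.io non-interactively, then pull the installer image.
# (Hypothetical wrapper; the default image tag matches the one used above.)
pull_installer() {
    image=${1:-quay.io/katonic/katonic-installer:v4.5}
    # Abort with a message if either credential is missing.
    : "${QUAY_USERNAME:?QUAY_USERNAME is not set}"
    : "${QUAY_PASSWORD:?QUAY_PASSWORD is not set}"
    printf '%s' "$QUAY_PASSWORD" |
        docker login quay.io --username "$QUAY_USERNAME" --password-stdin &&
        docker pull "$image"
}

# Example:
#   pull_installer
```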

4. Initialize

Initialize the installer application to generate a template configuration file named katonic.yml.

Note: This command must be entered inside the katonic directory.

docker run -it --rm --name generating-yaml -v $(pwd):/install quay.io/katonic/katonic-installer:v4.5 init on-premise katonic_mlops kubernetes_already_exists private

Edit the configuration file with all necessary details about the target cluster, storage systems, and hosting domain. Read the configuration reference for more information about available keys, and consult the configuration examples for guidance on getting started.

| PARAMETER | DESCRIPTION | VALUE |
|---|---|---|
| katonic_platform_version | Katonic Platform version; has a default value. | katonic_mlops |
| deploy_on | Where Katonic MLOps is deployed. | On-Premise |
| enable_exposing_genai_applications_to_internet | Set "True" to expose genai applications to the internet. | |
| public_domain_for_genai_applications | Public FQDN of the domain for genai applications that will be exposed to the internet. | |
| private_bucket_limit | Set the private bucket size. | eg. 10GB |
| minio_storage | Set the value to amount of storage required in file manager /16 | eg. 20Gi |
| workspace_timeout_interval | Set the timeout interval in hours. | eg. 1 |
| backup_enabled | Enable backups (for on-premise, the Katonic installer supports only an AWS S3 bucket). | True or False |
| s3_bucket_name | Name of the S3 bucket. | katonic-backup |
| s3_bucket_region | Region of the S3 bucket. | us-east-1 |
| backup_schedule | Schedule of the backup. | 0 0 1 * * |
| backup_expiration | Expiration of the backup. | 2160h0m0s |
| use_custom_domain | Enable a custom domain name. | True or False |
| custom_domain_name | Custom domain name. | eg. app.katonic.ai |
| use_katonic_domain | Enable a Katonic domain name. | True or False |
| katonic_domain_prefix | Katonic domain name prefix. | eg. tesla |
| enable_pre_checks | Set this to True if you want to perform the pre-checks. | True / False |
| enable_accelerator | Set "True" to enable accelerators. | False |
| enable_playground | Set "True" to enable playground. | False |
| AD_Group_Management | Set "True" to enable signing in using Azure AD. | False |
| AD_CLIENT_ID | Client ID of the app registered for SSO in the client's Azure or Identity Provider. | |
| AD_CLIENT_SECRET | Client Secret of the app registered for SSO in the client's Azure or any other Identity Provider. | |
| AD_AUTH_URL | Authorization URL endpoint of the app registered for SSO. | |
| AD_TOKEN_URL | Token URL endpoint of the app registered for SSO. | |
| quay_username | Username for quay. | |
| quay_password | Password for quay. | |
| adminUsername | Email for the admin user. | eg. john@katonic.ai |
| adminPassword | Password for the admin user. | At least 1 special character, at least 1 upper case letter, at least 1 lower case letter, minimum 8 characters |
| adminFirstName | Admin first name. | eg. john |
| adminLastName | Admin last name. | eg. musk |
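The adminPassword rules in the table can be checked locally before the value goes into katonic.yml. The following is a minimal sketch; `valid_admin_password` is a hypothetical helper, and treating "special character" as any non-alphanumeric character is an assumption, since the exact set the platform accepts is not specified here.

```shell
#!/bin/sh
# Return 0 if the password meets the rules listed for adminPassword:
# minimum 8 characters, at least 1 upper case letter, 1 lower case letter,
# and 1 special character (assumed here to mean any non-alphanumeric one).
valid_admin_password() {
    pw=$1
    [ "${#pw}" -ge 8 ] || return 1
    case $pw in *[[:upper:]]*) ;; *) return 1 ;; esac
    case $pw in *[[:lower:]]*) ;; *) return 1 ;; esac
    case $pw in *[![:alnum:]]*) ;; *) return 1 ;; esac
}

# Example:
#   valid_admin_password 'Secret#123' && echo "password looks acceptable"
```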

5. Installing Katonic MLOps Platform

docker run -it --rm --name install-katonic -v /root/.kube:/root/.kube -v $(pwd):/inventory quay.io/katonic/katonic-installer:v4.5

Installation Verification

The installation process can take up to 45 minutes to fully complete. The installer outputs verbose logs, prints the commands needed to get kubectl access to the deployed cluster, and surfaces any errors it encounters. After installation, you can use the following command to check whether all applications are in a running state.

kubectl get pods --all-namespaces

This will show the status of all pods being created by the installation process. If you see any pods enter a crash loop or hang in a non-ready state, you can get logs from that pod by running:

kubectl logs $POD_NAME --namespace $NAMESPACE_NAME
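On a cluster with many pods, it can be easier to list only the ones that need attention. The sketch below filters the pod listing by its STATUS column; `unhealthy_pods` is a hypothetical helper name. It assumes the default column layout of `kubectl get pods --all-namespaces --no-headers` (namespace, name, ready, status, ...), and it does not catch pods that report Running while their containers are not yet ready.

```shell
#!/bin/sh
# Read `kubectl get pods --all-namespaces --no-headers` output on stdin and
# print "namespace/pod status" for pods that are neither Running nor Completed.
# (Hypothetical helper; column 4 is assumed to be STATUS.)
unhealthy_pods() {
    awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Example:
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```

Each printed namespace and pod name can then be passed to the kubectl logs command above.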

If the installation completes successfully, you should see a message that says:

TASK [platform-deployment : Credentials to access Katonic MLOps Platform] *******************************
ok: [localhost] => {
"msg": [
"Platform Domain: $domain_name",
"Username: $adminUsername",
"Password: $adminPassword"
]
}

However, the application will only be accessible via HTTPS at that FQDN if you have configured DNS for the name to point to an istio ingress load balancer with the appropriate SSL certificate that forwards traffic to your platform nodes.

Post Installation Steps

Domain

You can identify a domain for your cluster. This allows you to use any domain as the location for the cluster. For example, you could set the domain for the cluster as katonic.company.com.

For this option to work, you will need to set the required DNS routing rules between the domain and the IP address of the cluster after the katonic-installer has finished running.

You will need to create a CNAME/A record for *.<your_domain> that points to the IP address of the auto scaler for the cluster. Make sure you include the wildcard.

The domain is the same domain you entered as <your_domain> in the katonic-installer.

To get the IP address of the cluster, run the following command once the platform has been deployed:

kubectl get svc istio-ingressgateway -n istio-system | awk '{print $4}' | tail -n +2
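As an alternative to parsing the table output, the address can be read directly with kubectl's JSONPath output. This is a sketch: `ingress_ip` and `wildcard_record` are hypothetical helper names, and it assumes the load balancer reports an IP address (some environments report a hostname instead, in which case this field is empty). The printed record line is only a reminder of what to create at your DNS provider, written in zone-file style.

```shell
#!/bin/sh
# Print the external IP of the Istio ingress gateway service.
# (Assumes the load balancer exposes an IP rather than a hostname.)
ingress_ip() {
    kubectl get svc istio-ingressgateway -n istio-system \
        -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Print the wildcard A record to create for domain $1 pointing at IP $2.
wildcard_record() {
    printf '*.%s. A %s\n' "$1" "$2"
}

# Example (the domain is the example used above):
#   wildcard_record katonic.company.com "$(ingress_ip)"
```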

Test and troubleshoot

To verify the successful installation of Katonic, perform the following tests:

  • If you encounter a 500 or 502 error, connect to your cluster and run the following command:

    kubectl rollout restart deploy nodelog-deploy -n application
  • If you have any file manager-related issues:

    kubectl rollout restart sts minio
    kubectl rollout status sts minio
    kubectl rollout restart deploy minio-console
    kubectl rollout status deploy minio-console
  • Log in to the Katonic application and ensure that all the navigation panel options are operational. If this test fails, please verify that Keycloak was set up properly.

  • Create a new project and launch a Jupyter/JupyterLab workspace. If this test fails, please check that the default environment images have been loaded in the cluster.

  • Publish an app with Flask or Shiny. If this test fails, please verify that the environment images have Flask and Shiny installed.