
# Terragrunt Deployment Guide

> Step-by-step commands for deploying the complete Odin AI Platform EKS infrastructure using Terragrunt, including dry run validation and actual deployment.

This guide walks through the full deployment of the Odin AI Platform EKS infrastructure on AWS using Terragrunt. It covers tool installation, environment setup, and a phased deployment sequence designed to ensure proper dependency ordering across all infrastructure components.

Deployments are organized into nine phases:

1. **State Management** — Bootstraps the S3 bucket used to store Terraform state for the environment.
2. **EKS Infrastructure** — Provisions the VPC, subnets, NAT gateways, IAM roles, and the EKS cluster and managed node groups.
3. **Storage & Load Balancing** — Deploys the EBS CSI driver for persistent volumes and the AWS Load Balancer Controller for ALB ingress.
4. **Karpenter Autoscaling** — Sets up dynamic node provisioning with Spot instance support and interruption handling via SQS and EventBridge.
5. **KEDA Autoscaling** — Deploys KEDA for pod-level autoscaling based on CPU and memory thresholds.
6. **Data Services** — Provisions Supabase (self-hosted or Cloud), ElastiCache Redis, and Amazon MQ RabbitMQ.
7. **Odin Services** — Deploys the Odin AI Platform application stack (Web, FastAPI, Celery, Automator) via Helm.
8. **SigNoz Observability** — Deploys distributed tracing, metrics, and log aggregation via SigNoz and the k8s-infra agent.
9. **Final Deployment** — Runs a full `terragrunt apply` to reconcile any remaining resources.

Before starting, complete the prerequisites checklist with the customer and ensure all `<YOUR_*>` placeholders in the environment template are filled in. Several values — including the VPC ID, EKS cluster endpoint, and Redis and RabbitMQ endpoints — are only available after specific phases complete, so the guide flags exactly when to capture and apply them.

***

## Prerequisites

* AWS CLI configured with appropriate permissions
* Terraform (>= 1.0)
* Terragrunt (latest version)
* `kubectl` for Kubernetes management
* `helm` for Helm chart management

***

## Installation Guide

### Installing Terragrunt

**macOS (Homebrew)**

```bash theme={null}
brew install terragrunt
```

**Linux (apt)**

```bash theme={null}
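# Add HashiCorp GPG key
wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg

# Add HashiCorp repository
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list

# Update and install Terraform (the HashiCorp repository does not ship Terragrunt)
sudo apt update
sudo apt install terraform

# Install Terragrunt from the Gruntwork GitHub releases (linux/amd64 build assumed)
curl -Lo terragrunt https://github.com/gruntwork-io/terragrunt/releases/latest/download/terragrunt_linux_amd64
sudo install -o root -g root -m 0755 terragrunt /usr/local/bin/terragrunt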
```

**Windows (Chocolatey)**

```bash theme={null}
choco install terragrunt
```

### Installing kubectl

**macOS (Homebrew)**

```bash theme={null}
brew install kubectl
```

**Linux**

```bash theme={null}
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
```

**Windows (Chocolatey)**

```bash theme={null}
choco install kubernetes-cli
```

### Installing Helm

**macOS (Homebrew)**

```bash theme={null}
brew install helm
```

**Linux**

```bash theme={null}
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```

**Windows (Chocolatey)**

```bash theme={null}
choco install kubernetes-helm
```

### Verifying Installation

```bash theme={null}
terragrunt --version
terraform --version
kubectl version --client
helm version
```
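
The version commands above fail one at a time, so a single pass over all required tools can be quicker. The following is a minimal sketch; `check_tools` is a hypothetical helper, not part of the Odin tooling, and it only checks that each command is on `PATH`, not that its version is recent enough.

```bash
# check_tools CMD...
# Prints "missing: CMD" for every command not found on PATH.
# Returns 0 when everything is present, 1 otherwise.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_tools aws terraform terragrunt kubectl helm
```

Run it once before starting and install whatever it reports as missing.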

### AWS CLI Configuration

```bash theme={null}
# Install AWS CLI v2 (Linux x86_64)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
 
# Configure AWS credentials
aws configure
 
# Verify configuration
aws sts get-caller-identity
```

***

## Creating a New Environment

### Step 1: Copy the Environment Template

The `env-template-folder` contains pre-structured files with `<YOUR_*>` placeholders ready to be filled in. Copy it entirely to create your new environment folder.

```bash theme={null}
# Navigate to the terragrunt environments directory
cd terragrunt/environments
 
# Copy the full template folder to a new environment (replace 'your-env-name')
cp -r env-template-folder your-env-name
 
# The folder structure is ready:
# your-env-name/
# ├── terragrunt.hcl               # Core cluster configuration
# ├── state/
# │   └── terragrunt.hcl           # S3 state bucket configuration
# └── values/
#     ├── infrastructure.yaml      # AWS Load Balancer Controller
#     ├── karpenter-values.yaml    # Karpenter controller settings
#     ├── karpenter-nodeclasses.yaml  # EC2NodeClass definitions
#     ├── karpenter.yaml           # Karpenter NodePool definitions
#     ├── keda.yaml                # KEDA autoscaler
#     ├── aws-ebs-csi-driver.yaml  # EBS CSI driver
#     ├── odin-services.yaml       # Odin application services
#     ├── supabase.yaml            # Supabase (if self-hosting)
#     ├── ha-supabase-db.yaml      # Supabase HA DB (if self-hosting)
#     ├── cloudnative-pg.yaml      # CloudNativePG operator (if self-hosting)
#     ├── signoz.yaml              # SigNoz observability (optional)
#     └── signoz-k8s-infra.yaml    # SigNoz k8s metrics (optional)
```
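
A plain `cp -r` into a folder that already exists nests the template inside it instead of failing. The sketch below guards against that; `new_env` is a hypothetical wrapper introduced here for illustration, and it assumes you run it from `terragrunt/environments`.

```bash
# new_env <env-name> [template-dir]
# Copies the template to a new environment folder and refuses to overwrite
# an existing one. The template defaults to env-template-folder.
new_env() {
  local name="${1:-}" template="${2:-env-template-folder}"
  if [ -z "$name" ]; then
    echo "usage: new_env <env-name> [template-dir]" >&2
    return 2
  fi
  if [ ! -d "$template" ]; then
    echo "template not found: $template" >&2
    return 1
  fi
  if [ -e "$name" ]; then
    echo "already exists: $name" >&2
    return 1
  fi
  cp -r "$template" "$name"
}

# Example: new_env app-eks-prod
```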

### Step 2: Verify All Placeholders Are Present

```bash theme={null}
cd your-env-name
 
# List all placeholders that need to be filled in
grep -r "<YOUR_" . --include="*.hcl" --include="*.yaml" | sort
```

All placeholders follow the `<YOUR_*>` convention. The steps below walk through filling them in file by file.
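
The raw `grep` output repeats the same placeholder many times. As a rough sketch, the helper below collapses it to one line per distinct placeholder with an occurrence count; `list_placeholders` is a hypothetical name, and the pattern assumes placeholder names use only uppercase letters, digits, and underscores.

```bash
# list_placeholders [dir]
# Prints each distinct <YOUR_*> placeholder found in .hcl and .yaml files,
# followed by the number of times it occurs.
list_placeholders() {
  local dir="${1:-.}"
  grep -rhoE --include='*.hcl' --include='*.yaml' '<YOUR_[A-Z0-9_]+>' "$dir" \
    | sort | uniq -c | awk '{print $2, $1}'
}

# Example: list_placeholders .
```

An empty result means no `<YOUR_*>` placeholders remain in the scanned files.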

### Step 3: Provision SSL Certificates (AWS ACM)

Before setting environment variables you need the certificate ARNs. Use the AWS Console to request SSL certificates in AWS Certificate Manager (ACM) for all domains your environment will serve.

**Option A: Single wildcard certificate (recommended)**

A single wildcard certificate covers all service domains with one ARN. A wildcard matches exactly one subdomain level, so if your services sit directly under `example.com`, a single `*.example.com` certificate covers:

| Service   | Domain                        |
| --------- | ----------------------------- |
| Web       | `app.example.com`             |
| FastAPI   | `api-app.example.com`         |
| Automator | `automations-app.example.com` |
| Supabase  | `supabase-app.example.com`    |
| SigNoz    | `signoz-app.example.com`      |
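
ACM wildcard names match exactly one label, so a wildcard covers neither its own base domain nor names nested two levels below it. The sketch below checks a hostname against a wildcard before you request the certificate; `wildcard_covers` is a hypothetical helper, not part of the Odin tooling or the AWS CLI.

```bash
# wildcard_covers <wildcard> <hostname>
# Returns 0 if <hostname> sits exactly one label below the wildcard's base
# domain (e.g. "*.example.com" covers "api-app.example.com").
wildcard_covers() {
  local base="${1#\*.}" host="$2" label
  # Strip ".<base>" from the end; if nothing was stripped, the suffix differs.
  label="${host%".$base"}"
  [ "$label" != "$host" ] && [ -n "$label" ] && case "$label" in
    *.*) false ;;   # more than one label below the base domain
    *)   true ;;
  esac
}

# Example: wildcard_covers '*.example.com' "$WEB_DOMAIN" || echo "not covered"
```

Run it for every domain you plan to serve; any domain it rejects needs its own certificate or a different wildcard.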

**Option B: Per-service certificates**

Request one certificate per domain if you cannot use a wildcard. Repeat the steps below for each domain: `<YOUR_WEB_DOMAIN>`, `<YOUR_API_DOMAIN>`, `<YOUR_AUTOMATOR_DOMAIN>`, `<YOUR_SUPABASE_DOMAIN>` (only if `ENABLE_SUPABASE=true`), `<YOUR_SIGNOZ_DOMAIN>` (only if `ENABLE_SIGNOZ=true`).

**Requesting a certificate in the AWS Console**

1. Open the AWS Certificate Manager console
2. Switch to the correct region (top-right) — must match `<YOUR_AWS_REGION>`
3. Click **Request a certificate** → **Request a public certificate** → **Next**
4. Under **Fully qualified domain name**, enter the wildcard (e.g., `*.example.com`) or a specific domain
5. Set **Validation method** to **DNS validation**
6. Click **Request** — the certificate is created in `Pending validation` state

**Adding the DNS CNAME validation record**

ACM generates a CNAME record that you must add to your DNS provider to prove domain ownership. Get the values from the ACM Console by opening the certificate and expanding the domain under **Domains**.

| DNS Field         | Value                                                                     |
| ----------------- | ------------------------------------------------------------------------- |
| Record type       | `CNAME`                                                                   |
| Name / Host       | e.g., `_fa187f22ac17bce6f508bf3c56439c61.signoz-app.example.com.`         |
| Value / Points to | e.g., `_c7c97325fe38061e168e232d122c7ff3.jkddzztszm.acm-validations.aws.` |

<Info>
  Include the trailing dot (`.`) at the end of the CNAME values if your DNS provider requires it.
</Info>

**Cloudflare**

1. Log in to Cloudflare → select your domain → go to **DNS** → **Records** → **Add record**
2. Set **Type** to `CNAME`
3. Paste the ACM CNAME name into **Name** and the ACM CNAME value into **Target**
4. Set **Proxy status** to **DNS only** (grey cloud icon) — the certificate will not validate through the Cloudflare proxy
5. Click **Save**

**Route 53**

1. Open the Route 53 console → **Hosted zones** → select your zone → **Create record**
2. Set **Record type** to `CNAME`
3. Paste the ACM CNAME name into **Record name** (subdomain portion only) and the value into **Value**
4. Set TTL to `300` and click **Create records**

<Tip>
  In ACM you can also click **Create records in Route 53** to have ACM add the record automatically if the hosted zone is in the same account.
</Tip>

Once DNS propagates (typically 1–5 minutes), the certificate status changes to **Issued**. Copy the ARN from the top of the certificate — it looks like `arn:aws:acm:<region>:<account-id>:certificate/<uuid>`. Keep the ARN(s) handy for the next step.
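
A mistyped or half-filled ARN only surfaces later as a failed ALB deployment, so it can help to check the shape first. This sketch tests a string against the ARN format shown above; `is_acm_arn` is a hypothetical helper, and the pattern assumes the standard `aws` partition (not GovCloud or China regions).

```bash
# is_acm_arn <arn>
# Returns 0 if the argument looks like
# arn:aws:acm:<region>:<12-digit account id>:certificate/<uuid>.
is_acm_arn() {
  printf '%s\n' "${1:-}" | grep -Eq \
    '^arn:aws:acm:[a-z]{2}(-[a-z]+)+-[0-9]:[0-9]{12}:certificate/[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$'
}

# Example: is_acm_arn "$WILDCARD_CERTIFICATE_ARN" || echo "ARN looks wrong"
```

The check is purely syntactic. To confirm that the certificate has actually been issued, `aws acm describe-certificate --certificate-arn <arn> --query 'Certificate.Status' --output text` should print `ISSUED`.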

### Step 4: Set Environment Variables

Set these shell environment variables before running any Terragrunt commands. They are read directly by `terragrunt.hcl` via `get_env()`.

```bash theme={null}
export AWS_REGION="<YOUR_AWS_REGION>"          # e.g., "eu-west-2", "us-east-2"
export CLUSTER_NAME="<YOUR_ENV_NAME>"          # your environment folder name, e.g., "app-eks-prod"
 
# Domain configuration
export WEB_DOMAIN="<YOUR_WEB_DOMAIN>"                    # e.g., "app.example.com"
export FASTAPI_DOMAIN="<YOUR_API_DOMAIN>"                # e.g., "api-app.example.com"
export AUTOMATOR_DOMAIN="<YOUR_AUTOMATOR_DOMAIN>"        # e.g., "automations-app.example.com"
export SUPABASE_DOMAIN="<YOUR_SUPABASE_DOMAIN>"          # e.g., "supabase-app.example.com"
export SIGNOZ_DOMAIN="<YOUR_SIGNOZ_DOMAIN>"              # e.g., "signoz-app.example.com"
 
# SSL Certificate ARNs — Option A: Single wildcard certificate (recommended)
export WILDCARD_CERTIFICATE_ARN="arn:aws:acm:<YOUR_AWS_REGION>:<YOUR_AWS_ACCOUNT_ID>:certificate/<YOUR_WILDCARD_CERT_ID>"
 
# SSL Certificate ARNs — Option B: Per-service certificates
export WEB_CERTIFICATE_ARN="arn:aws:acm:<YOUR_AWS_REGION>:<YOUR_AWS_ACCOUNT_ID>:certificate/<YOUR_WEB_CERT_ID>"
export FASTAPI_CERTIFICATE_ARN="arn:aws:acm:<YOUR_AWS_REGION>:<YOUR_AWS_ACCOUNT_ID>:certificate/<YOUR_API_CERT_ID>"
export AUTOMATOR_CERTIFICATE_ARN="arn:aws:acm:<YOUR_AWS_REGION>:<YOUR_AWS_ACCOUNT_ID>:certificate/<YOUR_AUTOMATOR_CERT_ID>"
export SUPABASE_CERTIFICATE_ARN="arn:aws:acm:<YOUR_AWS_REGION>:<YOUR_AWS_ACCOUNT_ID>:certificate/<YOUR_SUPABASE_CERT_ID>"
export SIGNOZ_CERTIFICATE_ARN="arn:aws:acm:<YOUR_AWS_REGION>:<YOUR_AWS_ACCOUNT_ID>:certificate/<YOUR_SIGNOZ_CERT_ID>"
 
# Service enablement flags
export ENABLE_ALB_CONTROLLER="true"
export ENABLE_AWS_SERVICES="true"      # Set to "true" to enable ElastiCache and AmazonMQ
 
# Supabase stack (self-hosted) — enable all three together if self-hosting Supabase
export ENABLE_CNPG="true"             # CloudNativePG operator  (namespace: cnpg-system)
export ENABLE_HA_SUPABASE_DB="true"   # Supabase HA database    (namespace: ha-supabase-db)
export ENABLE_SUPABASE="true"         # Supabase application    (namespace: supabase)
 
export ENABLE_SIGNOZ="true"           # Set to "true" to enable SigNoz observability
export SSL_TERMINATION="alb"
```
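
Because `terragrunt.hcl` reads these values through `get_env()`, a variable that is unset or still holds a `<YOUR_*>` placeholder is easy to miss until a plan fails. A minimal pre-flight sketch follows; `require_env` is a hypothetical helper, and it only sees exported variables.

```bash
# require_env VAR...
# Prints a line for every variable that is unset, empty, or still contains
# a <YOUR_ placeholder. Returns 0 only when all variables look filled in.
require_env() {
  local rc=0 name value
  for name in "$@"; do
    value=$(printenv "$name" || true)
    case "$value" in
      '')         echo "unset: $name";       rc=1 ;;
      *'<YOUR_'*) echo "placeholder: $name"; rc=1 ;;
    esac
  done
  return "$rc"
}

# Example:
# require_env AWS_REGION CLUSTER_NAME WEB_DOMAIN FASTAPI_DOMAIN AUTOMATOR_DOMAIN
```

Pass only the variables that apply to your setup, for example omitting `SUPABASE_DOMAIN` when `ENABLE_SUPABASE` is not `true`.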

#### Spot Instances & Stateful Workloads

Spot instances are configured per NodePool in `values/karpenter.yaml`, not via environment variables. Each NodePool declares its own capacity strategy:

| NodePool            | `workload-type` label                        | Capacity Type             | Rationale                                                                  |
| ------------------- | -------------------------------------------- | ------------------------- | -------------------------------------------------------------------------- |
| `general`           | `general`                                    | Spot → On-Demand fallback | Cost-optimised for stateless batch/background workloads                    |
| `compute-intensive` | `compute-intensive`                          | Spot → On-Demand fallback | Cost-optimised for CPU-bound workloads                                     |
| `memory-intensive`  | `memory-intensive`                           | On-Demand → Spot fallback | Stability prioritised for high-memory pods                                 |
| `gpu`               | `gpu`                                        | Spot → On-Demand fallback | Cost-optimised for AI/ML batch workloads                                   |
| `application`       | `application`                                | On-Demand only            | Stable user-facing services (Supabase, Kong, etc.) — no Spot interruptions |
| `database`          | `database` / `node-type: database-dedicated` | On-Demand only            | Stateful — Spot interruption is unsafe for databases                       |

The `application` NodePool uses m/c instance families (generation 5+) with On-Demand only. Supabase service pods are pinned here via `nodeSelector: workload-type: "application"` to guarantee they are never interrupted by a Spot reclamation event.

The `database` NodePool (nodes labelled `node-type: database-dedicated`) never uses Spot. It sets `consolidationPolicy: WhenEmpty`, so Karpenter only consolidates a node once no workload pods remain on it, which makes it safe for stateful workloads such as PostgreSQL and CloudNativePG replicas.

**Guidelines for stateful applications on Spot:**

* Do not schedule databases, persistent queues, or any pod with a `PersistentVolumeClaim` on Spot NodePools.
* Use a `nodeSelector` targeting `node-type: database-dedicated` with the matching `database-workload: "true"` toleration for database pods.
* Use `nodeSelector: workload-type: "application"` for user-facing stateless services that must remain available without interruption.
* For stateless Odin workloads (Web, API, Celery, Automator), the `general` Spot NodePool is appropriate: Karpenter's SQS interruption handler drains Spot nodes gracefully before AWS reclaims them, and KEDA's minimum replica count (≥ 2) ensures availability during node replacement.
* To disable Spot globally, remove `"spot"` from the values list in every NodePool inside `values/karpenter.yaml`.

**How Karpenter handles Spot interruption warnings:**

AWS gives a 2-minute interruption notice before terminating a Spot instance. Karpenter uses EventBridge and SQS to act on this automatically:

```
AWS Spot Interruption Event
        │
        ▼
Amazon EventBridge (CloudWatch Events)
  Rule: EC2 Spot Instance Interruption Warning
        │
        ▼
   SQS Queue (Karpenter interruption queue)
        │
        ▼
  Karpenter Controller (polls SQS continuously)
        │
        ├── Cordons the node (no new pods scheduled)
        ├── Drains existing pods (respects PodDisruptionBudgets)
        ├── Provisions a replacement node in parallel
        └── Pods reschedule onto the new node before the 2-min window closes
```

This is configured in the `karpenter` block in `terragrunt.hcl`:

```hcl theme={null}
karpenter = {
  spot_interruption_handling = true   # creates the SQS queue and EventBridge rule
  enable_spot_instances       = true   # allows Spot in NodePool capacity requirements
}
```

### Step 5: Update Environment-Specific File Values

Find and replace the following placeholders across all files in your new environment folder:

| Placeholder             | Description                   | Example                  |
| ----------------------- | ----------------------------- | ------------------------ |
| `<YOUR_ENV_NAME>`       | Unique environment identifier | `app-eks-prod`           |
| `<YOUR_AWS_REGION>`     | AWS region of the cluster     | `eu-west-2`, `us-east-2` |
| `<YOUR_AWS_ACCOUNT_ID>` | 12-digit AWS account ID       | `123456789012`           |
| `<YOUR_ENVIRONMENT>`    | Environment tag value         | `prod`, `staging`, `dev` |
| `<YOUR_PROJECT>`        | Project tag value             | `odin`, `ekb`            |

```bash theme={null}
# Run from your new env folder to find all remaining placeholders
grep -r "<YOUR_" .
```
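
The find-and-replace itself can be scripted. The sketch below is one way to do it with `grep` and `sed`; `fill_placeholder` is a hypothetical helper, and it assumes the replacement value contains no `|`, `&`, or backslash characters, which `sed` would interpret.

```bash
# fill_placeholder <dir> <PLACEHOLDER_NAME> <value>
# Replaces every <PLACEHOLDER_NAME> (name given without angle brackets)
# in the .hcl and .yaml files under <dir>.
fill_placeholder() {
  local dir="$1" name="$2" value="$3" file
  grep -rlF --include='*.hcl' --include='*.yaml' "<${name}>" "$dir" \
    | while IFS= read -r file; do
        # -i.bak works with both GNU and BSD sed; the backup is removed after.
        sed -i.bak "s|<${name}>|${value}|g" "$file" && rm -f "$file.bak"
      done
}

# Example:
# fill_placeholder . YOUR_ENV_NAME app-eks-prod
# fill_placeholder . YOUR_AWS_REGION eu-west-2
```

Use it only for the simple values in the table above. Secrets and endpoint values are better entered by hand in the per-file steps that follow.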

#### 5.1 `terragrunt.hcl` — Core cluster configuration

```bash theme={null}
nano terragrunt.hcl
```

| Field                                      | Placeholder                | Notes                                                                           |
| ------------------------------------------ | -------------------------- | ------------------------------------------------------------------------------- |
| `cluster_name`                             | `<YOUR_ENV_NAME>`          | Must match EKS cluster name                                                     |
| `cluster_region`                           | `<YOUR_AWS_REGION>`        | AWS region                                                                      |
| `aws_account_id`                           | `<YOUR_AWS_ACCOUNT_ID>`    | 12-digit account ID                                                             |
| `vpc_cidr`                                 | `<YOUR_VPC_CIDR>`          | e.g., `192.168.0.0/16`                                                          |
| `availability_zones`                       | `<YOUR_REGION>a/b/c`       | 3 AZs in your region                                                            |
| `tags.Environment`                         | `<YOUR_ENVIRONMENT>`       | e.g., `prod`                                                                    |
| `tags.Project`                             | `<YOUR_PROJECT>`           | e.g., `odin`                                                                    |
| `aws_services.amazon_mq.rabbitmq.username` | `<YOUR_RABBITMQ_USERNAME>` | RabbitMQ admin username (only when `ENABLE_AWS_SERVICES=true`)                  |
| `aws_services.amazon_mq.rabbitmq.password` | `<YOUR_RABBITMQ_PASSWORD>` | Min 12 chars; must include uppercase, lowercase, digits, and special characters |
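
The RabbitMQ password rules in the table can be checked locally before Terraform rejects the value. This is a minimal sketch covering only the rules stated above; `check_rabbitmq_password` is a hypothetical helper, and Amazon MQ may apply further restrictions of its own.

```bash
# check_rabbitmq_password <password>
# Prints one line per rule the password breaks; returns 0 if all rules pass.
check_rabbitmq_password() {
  local pw="${1:-}" rc=0
  [ "${#pw}" -ge 12 ] || { echo "too short (min 12 characters)"; rc=1; }
  # LC_ALL=C keeps the bracket ranges strictly ASCII.
  printf '%s\n' "$pw" | LC_ALL=C grep -q '[A-Z]' || { echo "no uppercase letter"; rc=1; }
  printf '%s\n' "$pw" | LC_ALL=C grep -q '[a-z]' || { echo "no lowercase letter"; rc=1; }
  printf '%s\n' "$pw" | LC_ALL=C grep -q '[0-9]' || { echo "no digit"; rc=1; }
  printf '%s\n' "$pw" | LC_ALL=C grep -q '[^A-Za-z0-9]' || { echo "no special character"; rc=1; }
  return "$rc"
}

# Example: check_rabbitmq_password "$RABBITMQ_PASSWORD"
```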

#### 5.2 `state/terragrunt.hcl` — S3 state bucket

```bash theme={null}
nano state/terragrunt.hcl
```

| Field         | Placeholder                            | Notes                   |
| ------------- | -------------------------------------- | ----------------------- |
| `bucket_name` | `odin-terraform-state-<YOUR_ENV_NAME>` | Must be globally unique |
| `region`      | `<YOUR_AWS_REGION>`                    | Same region as cluster  |

#### 5.3 `values/infrastructure.yaml` — AWS Load Balancer Controller

<Warning>
  Obtain the VPC ID **after** the EKS cluster is created and before deploying the AWS Load Balancer Controller.
</Warning>

```bash theme={null}
# Get VPC ID after EKS cluster is created
aws eks describe-cluster --name <YOUR_ENV_NAME> \
  --query "cluster.resourcesVpcConfig.vpcId" --output text
```

```bash theme={null}
nano values/infrastructure.yaml
```

| Field                                                   | Placeholder                                | Notes                       |
| ------------------------------------------------------- | ------------------------------------------ | --------------------------- |
| `clusterName`                                           | `<YOUR_ENV_NAME>`                          | EKS cluster name            |
| `region`                                                | `<YOUR_AWS_REGION>`                        | AWS region                  |
| `vpcId`                                                 | `<YOUR_VPC_ID>`                            | Required before ALB deploy  |
| `serviceAccount.annotations.eks.amazonaws.com/role-arn` | `<YOUR_AWS_ACCOUNT_ID>`, `<YOUR_ENV_NAME>` | IAM role for ALB controller |

#### 5.4 `values/karpenter-values.yaml` — Karpenter controller

<Warning>
  Obtain the EKS cluster endpoint **after** the EKS cluster is created and before deploying Karpenter.
</Warning>

```bash theme={null}
# Get cluster endpoint after EKS cluster is created
aws eks describe-cluster --name <YOUR_ENV_NAME> \
  --query "cluster.endpoint" --output text
```

```bash theme={null}
nano values/karpenter-values.yaml
```

| Field                                                   | Placeholder                                | Notes                            |
| ------------------------------------------------------- | ------------------------------------------ | -------------------------------- |
| `serviceAccount.annotations.eks.amazonaws.com/role-arn` | `<YOUR_AWS_ACCOUNT_ID>`, `<YOUR_ENV_NAME>` | IAM role for Karpenter           |
| `env.CLUSTER_NAME`                                      | `<YOUR_ENV_NAME>`                          | EKS cluster name                 |
| `env.CLUSTER_ENDPOINT`                                  | `<YOUR_EKS_CLUSTER_ENDPOINT>`              | Required before Karpenter deploy |
| `settings.aws.defaultInstanceProfile`                   | `<YOUR_ENV_NAME>`                          | Karpenter node instance profile  |

#### 5.5 `values/karpenter-nodeclasses.yaml` — Karpenter node classes

```bash theme={null}
nano values/karpenter-nodeclasses.yaml
```

| Field                                            | Placeholder          | Notes                               |
| ------------------------------------------------ | -------------------- | ----------------------------------- |
| All `kubernetes.io/cluster/<YOUR_ENV_NAME>` tags | `<YOUR_ENV_NAME>`    | Cluster tag for subnet/SG selectors |
| `user_data` bootstrap cluster name               | `<YOUR_ENV_NAME>`    | Node bootstrap script               |
| `tags.Environment`                               | `<YOUR_ENVIRONMENT>` | e.g., `prod`                        |
| `tags.Project`                                   | `<YOUR_PROJECT>`     | e.g., `odin`                        |

#### 5.6 `values/aws-ebs-csi-driver.yaml` — EBS CSI Driver

```bash theme={null}
nano values/aws-ebs-csi-driver.yaml
```

| Field                                                              | Placeholder                                | Notes                           |
| ------------------------------------------------------------------ | ------------------------------------------ | ------------------------------- |
| `controller.serviceAccount.annotations.eks.amazonaws.com/role-arn` | `<YOUR_AWS_ACCOUNT_ID>`, `<YOUR_ENV_NAME>` | IAM role for EBS CSI controller |
| `node.serviceAccount.annotations.eks.amazonaws.com/role-arn`       | `<YOUR_AWS_ACCOUNT_ID>`, `<YOUR_ENV_NAME>` | IAM role for EBS CSI node       |
| `controller.env.AWS_DEFAULT_REGION`                                | `<YOUR_AWS_REGION>`                        | AWS region                      |
| `controller.env.AWS_REGION`                                        | `<YOUR_AWS_REGION>`                        | AWS region                      |
| `node.env.AWS_DEFAULT_REGION`                                      | `<YOUR_AWS_REGION>`                        | AWS region                      |
| `node.env.AWS_REGION`                                              | `<YOUR_AWS_REGION>`                        | AWS region                      |

#### 5.7 `values/karpenter.yaml` — Karpenter NodePools

```bash theme={null}
nano values/karpenter.yaml
```

| Field                                        | Placeholder                                              | Notes                          |
| -------------------------------------------- | -------------------------------------------------------- | ------------------------------ |
| `*.labels.Environment`                       | `<YOUR_ENVIRONMENT>`                                     | Applied to all NodePool labels |
| `*.requirements topology.kubernetes.io/zone` | `["<YOUR_REGION>a", "<YOUR_REGION>b", "<YOUR_REGION>c"]` | AZs for all NodePools          |

Node class names (`general`, `compute-intensive`, `memory-intensive`, `gpu`, `database`) must match entries in `karpenter-nodeclasses.yaml`.

#### 5.8 `values/keda.yaml` — KEDA Autoscaler

No environment-specific placeholders required. Resource limits and replica counts are pre-configured with sensible defaults. Review and adjust if needed.

#### 5.9 `values/supabase.yaml` — Supabase application (only if `ENABLE_SUPABASE=true`)

<Warning>
  The JWT secret, anon key, service role key, and database password below must match the values in `ha-supabase-db.yaml`. Generate each one once and use the same value in both files.
</Warning>

```bash theme={null}
# Generate JWT secret
openssl rand -hex 32
 
# Generate anon/service role JWTs (requires Supabase CLI)
brew install supabase/tap/supabase
supabase gen-keys
 
# Generate passwords and tokens
openssl rand -hex 24       # for passwords
openssl rand -base64 64    # for secretKeyBase
```
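
The anon and service role keys are HS256 JWTs signed with the JWT secret, so they can also be produced with `openssl` alone if the Supabase CLI is not available. The following is a sketch under assumptions: `make_jwt` and `b64url` are hypothetical helpers, and the claim set (`role`, `iss`, `iat`, `exp`) and the five-year default lifetime follow common self-hosted Supabase setups rather than anything this guide specifies, so verify them against your chart before use.

```bash
# b64url: base64url-encode stdin without padding, as JWTs require.
b64url() {
  openssl base64 -A | tr '+/' '-_' | tr -d '='
}

# make_jwt <role> <jwt-secret> [issued-at-epoch] [ttl-seconds]
# Prints an HS256-signed JWT carrying the given role claim.
make_jwt() {
  local role="$1" secret="$2"
  local iat="${3:-$(date +%s)}"
  local ttl="${4:-157680000}"   # assumed default: about five years
  local exp=$(( iat + ttl ))
  local header payload signature
  header=$(printf '%s' '{"alg":"HS256","typ":"JWT"}' | b64url)
  payload=$(printf '{"role":"%s","iss":"supabase","iat":%d,"exp":%d}' \
    "$role" "$iat" "$exp" | b64url)
  # Sign "<header>.<payload>" with HMAC-SHA256 using the JWT secret.
  signature=$(printf '%s' "${header}.${payload}" \
    | openssl dgst -sha256 -hmac "$secret" -binary | b64url)
  printf '%s.%s.%s\n' "$header" "$payload" "$signature"
}

# Example:
# JWT_SECRET=$(openssl rand -hex 32)
# make_jwt anon "$JWT_SECRET"           # anon key
# make_jwt service_role "$JWT_SECRET"   # service role key
```

Both keys must be signed with the same secret that goes into `secret.jwt.secret`; regenerating the secret invalidates them.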

```bash theme={null}
nano values/supabase.yaml
```

| Field                                 | Placeholder                                | Notes                                                       |
| ------------------------------------- | ------------------------------------------ | ----------------------------------------------------------- |
| `secret.jwt.anonKey`                  | `<YOUR_SUPABASE_ANON_KEY>`                 | Must match `ha-supabase-db.yaml` `anonKey`                  |
| `secret.jwt.serviceKey`               | `<YOUR_SUPABASE_SERVICE_ROLE_KEY>`         | Must match `ha-supabase-db.yaml` `serviceRoleKey`           |
| `secret.jwt.secret`                   | `<YOUR_SUPABASE_JWT_SECRET>`               | Must match `ha-supabase-db.yaml` `jwtSecret`                |
| `secret.db.password`                  | `<YOUR_SUPABASE_DB_PASSWORD>`              | Must match `ha-supabase-db.yaml` `postgresPassword`         |
| `secret.analytics.publicAccessToken`  | `<YOUR_SUPABASE_ANALYTICS_PUBLIC_TOKEN>`   | Internal Logflare token                                     |
| `secret.analytics.privateAccessToken` | `<YOUR_SUPABASE_ANALYTICS_PRIVATE_TOKEN>`  | Internal Logflare token                                     |
| `secret.dashboard.username`           | `<YOUR_SUPABASE_DASHBOARD_USERNAME>`       | Studio UI login                                             |
| `secret.dashboard.password`           | `<YOUR_SUPABASE_DASHBOARD_PASSWORD>`       | Studio UI login                                             |
| `secret.realtime.secretKeyBase`       | `<YOUR_SUPABASE_REALTIME_SECRET_KEY_BASE>` | Phoenix secret key                                          |
| `secret.meta.cryptoKey`               | `<YOUR_SUPABASE_META_CRYPTO_KEY>`          | `openssl rand -hex 32`                                      |
| `secret.s3.keyId`                     | `<YOUR_MINIO_KEY_ID>`                      | Must match `secret.minio.user` (`openssl rand -hex 16`)     |
| `secret.s3.accessKey`                 | `<YOUR_MINIO_ACCESS_KEY>`                  | Must match `secret.minio.password` (`openssl rand -hex 32`) |
| `secret.minio.user`                   | `<YOUR_MINIO_KEY_ID>`                      | Same value as `secret.s3.keyId`                             |
| `secret.minio.password`               | `<YOUR_MINIO_ACCESS_KEY>`                  | Same value as `secret.s3.accessKey`                         |

#### 5.10 `values/ha-supabase-db.yaml` — Supabase HA Database (only if `ENABLE_HA_SUPABASE_DB=true`)

<Warning>
  Secrets here must match `supabase.yaml`. Use the same generated values for `postgresPassword`, `jwtSecret`, `anonKey`, and `serviceRoleKey`.
</Warning>

```bash theme={null}
nano values/ha-supabase-db.yaml
```

| Field                                  | Placeholder                        | Notes                                              |
| -------------------------------------- | ---------------------------------- | -------------------------------------------------- |
| `secrets.inline.postgresPassword`      | `<YOUR_SUPABASE_DB_PASSWORD>`      | Must match `supabase.yaml` `secret.db.password`    |
| `secrets.inline.authenticatorPassword` | `<YOUR_SUPABASE_DB_PASSWORD>`      | Must be identical to `postgresPassword`            |
| `secrets.inline.pgbouncerPassword`     | `<YOUR_SUPABASE_DB_PASSWORD>`      | Must be identical to `postgresPassword`            |
| `secrets.inline.jwtSecret`             | `<YOUR_SUPABASE_JWT_SECRET>`       | Must match `supabase.yaml` `secret.jwt.secret`     |
| `secrets.inline.anonKey`               | `<YOUR_SUPABASE_ANON_KEY>`         | Must match `supabase.yaml` `secret.jwt.anonKey`    |
| `secrets.inline.serviceRoleKey`        | `<YOUR_SUPABASE_SERVICE_ROLE_KEY>` | Must match `supabase.yaml` `secret.jwt.serviceKey` |

Storage class (`ebs-csi-gp2`), instance counts, and resource limits are pre-configured. Adjust `postgres.storage.size` and `postgres.walStorage.size` for your expected data volume.

#### 5.11 `values/cloudnative-pg.yaml` — CloudNativePG Operator (only if `ENABLE_CNPG=true`)

No environment-specific placeholders required. This deploys the CNPG operator controller only. Default settings (3 replicas, resource limits) are suitable for most environments.

#### 5.12 `values/odin-services.yaml` — Odin application services

<Warning>
  Redis and RabbitMQ endpoints are only available after Terraform creates those AWS resources. Certificate ARNs must be provisioned in ACM before deployment.
</Warning>

```bash theme={null}
nano values/odin-services.yaml
```

**General settings:**

| Field                  | Placeholder                     | Notes                                                                                                 |
| ---------------------- | ------------------------------- | ----------------------------------------------------------------------------------------------------- |
| `server`               | `<YOUR_WEB_DOMAIN>`             | Main web domain                                                                                       |
| `toolkitEncryptionKey` | `<YOUR_TOOLKIT_ENCRYPTION_KEY>` | Generate: `python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"` |
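
The Python one-liner needs the `cryptography` package. A Fernet key is 32 random bytes in URL-safe base64 with padding kept, so `openssl` can produce an equivalent value. This is a sketch; `generate_fernet_key` is a hypothetical helper, and it assumes the application accepts any standard Fernet key.

```bash
# generate_fernet_key
# Prints a Fernet-compatible key: 32 random bytes, URL-safe base64,
# padding retained (44 characters in total).
generate_fernet_key() {
  openssl rand -base64 32 | tr '+/' '-_'
}

# Example: generate_fernet_key
```

Generate the key once per environment and store it securely; changing it later makes data encrypted with the old key unreadable.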

**Supabase (`dataServiceConfig`) — self-hosted (`ENABLE_SUPABASE=true`):**

| Field                        | Placeholder                                                          | Source                                             |
| ---------------------------- | -------------------------------------------------------------------- | -------------------------------------------------- |
| `supabase.projectUrl`        | `http://supabase-kong:8000`                                          | Fixed — internal Supabase Kong                     |
| `supabase.key`               | `<YOUR_SUPABASE_SERVICE_ROLE_KEY>`                                   | Same as `secret.jwt.serviceKey` in `supabase.yaml` |
| `supabase.postgres.user`     | `postgres`                                                           | Fixed for self-hosted                              |
| `supabase.postgres.host`     | `ha-supabase-db-postgres-pooler-rw.ha-supabase-db.svc.cluster.local` | Fixed — DB Pool service within the cluster         |
| `supabase.postgres.password` | `<YOUR_SUPABASE_DB_PASSWORD>`                                        | Same as `secret.db.password` in `supabase.yaml`    |
| `supabase.projectId`         | *(leave empty)*                                                      | Not used in self-hosted mode                       |

**Supabase (`dataServiceConfig`) — Supabase Cloud (`ENABLE_SUPABASE=false`):**

| Field                        | Placeholder                        | Source                                                                      |
| ---------------------------- | ---------------------------------- | --------------------------------------------------------------------------- |
| `supabase.projectUrl`        | `<YOUR_SUPABASE_PROJECT_URL>`      | Supabase dashboard → Project Settings → API                                 |
| `supabase.key`               | `<YOUR_SUPABASE_SERVICE_ROLE_KEY>` | Supabase dashboard → API → `service_role` key                               |
| `supabase.postgres.user`     | `<YOUR_SUPABASE_DB_USER>`          | Supabase dashboard → Project Settings → Database                            |
| `supabase.postgres.host`     | `<YOUR_SUPABASE_DB_HOST>`          | Supabase dashboard → Database (e.g., `aws-0-eu-west-2.pooler.supabase.com`) |
| `supabase.postgres.password` | `<YOUR_SUPABASE_DB_PASSWORD>`      | Supabase dashboard → Project Settings → Database                            |
| `supabase.projectId`         | `<YOUR_SUPABASE_PROJECT_ID>`       | From your Supabase project URL                                              |

**Redis:**

| Field        | Placeholder                                          | Notes                               |
| ------------ | ---------------------------------------------------- | ----------------------------------- |
| `redis.url`  | `rediss://<YOUR_REDIS_HOST>:6379?ssl_cert_reqs=none` | After Terraform creates ElastiCache |
| `redis.host` | `<YOUR_REDIS_HOST>`                                  | ElastiCache primary endpoint        |

```bash theme={null}
# Get Redis endpoint after Terraform apply
aws elasticache describe-cache-clusters \
  --show-cache-node-info \
  --query "CacheClusters[?starts_with(CacheClusterId,'<YOUR_ENV_NAME>')].CacheNodes[0].Endpoint.Address" \
  --output text
```
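
Once the endpoint is known, `redis.url` can be derived from `redis.host`. The following is a minimal sketch, assuming the TLS scheme, port 6379, and query string shown in the table; `build_redis_url` is a hypothetical helper name, not part of the Odin tooling.

```bash
# Hypothetical helper: build the redis.url value from the ElastiCache host.
# Assumes rediss:// (TLS), port 6379, and ssl_cert_reqs=none as in the table.
build_redis_url() {
  local host="$1"
  if [ -z "$host" ]; then
    echo "usage: build_redis_url <redis-host>" >&2
    return 1
  fi
  printf 'rediss://%s:6379?ssl_cert_reqs=none\n' "$host"
}

# Example (the host value is a placeholder):
#   build_redis_url "<YOUR_REDIS_HOST>"
```

Deriving the URL from the host keeps the two fields from drifting apart when the endpoint changes.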

**RabbitMQ:**

| Field               | Placeholder                                                                           | Notes                            |
| ------------------- | ------------------------------------------------------------------------------------- | -------------------------------- |
| `rabbitmq.url`      | `amqps://<YOUR_RABBITMQ_USERNAME>:<YOUR_RABBITMQ_PASSWORD>@<YOUR_RABBITMQ_HOST>:5671` | After Terraform creates AmazonMQ |
| `rabbitmq.host`     | `<YOUR_RABBITMQ_HOST>`                                                                | AmazonMQ broker endpoint         |
| `rabbitmq.username` | `<YOUR_RABBITMQ_USERNAME>`                                                            | Set in `terragrunt.hcl`          |
| `rabbitmq.password` | `<YOUR_RABBITMQ_PASSWORD>`                                                            | Set in `terragrunt.hcl`          |

```bash theme={null}
# Get RabbitMQ endpoint after Terraform apply
aws mq list-brokers \
  --query "BrokerSummaries[?BrokerName=='odin-rabbitmq'].BrokerId" --output text | \
  xargs -I{} aws mq describe-broker --broker-id {} \
  --query "BrokerInstances[0].Endpoints[0]" --output text
```
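
The command above typically prints a full endpoint URL rather than a bare hostname. The sketch below assumes the endpoint has the form `amqps://<host>:5671` and shows one way to derive `rabbitmq.host` and `rabbitmq.url` from it; both function names are hypothetical.

```bash
# Hypothetical helpers: turn the endpoint printed by `aws mq describe-broker`
# into the bare host and the full rabbitmq.url value.
# Assumes the endpoint looks like amqps://<host>:5671.
rabbitmq_host_from_endpoint() {
  local endpoint="$1"
  local host="${endpoint#*://}"   # drop the scheme, if present
  host="${host%%:*}"              # drop the port, if present
  printf '%s\n' "$host"
}

build_rabbitmq_url() {
  local user="$1" password="$2" host="$3"
  printf 'amqps://%s:%s@%s:5671\n' "$user" "$password" "$host"
}

# Example (all values are placeholders):
#   host=$(rabbitmq_host_from_endpoint "amqps://<YOUR_RABBITMQ_HOST>:5671")
#   build_rabbitmq_url "<YOUR_RABBITMQ_USERNAME>" "<YOUR_RABBITMQ_PASSWORD>" "$host"
```

If the password contains characters such as `@`, `:`, or `/`, it needs to be percent-encoded before it is placed in the URL; this sketch does not do that.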

**SSL / Certificate ARNs:**

| Field                                        | Placeholder                        | Notes                               |
| -------------------------------------------- | ---------------------------------- | ----------------------------------- |
| `ssl.services.web.domain`                    | `<YOUR_WEB_DOMAIN>`                | e.g., `app.example.com`             |
| `ssl.services.web.certificateArn`            | `<YOUR_WEB_CERTIFICATE_ARN>`       | ACM certificate ARN                 |
| `ssl.services.fastapiBackend.domain`         | `<YOUR_API_DOMAIN>`                | e.g., `api-app.example.com`         |
| `ssl.services.fastapiBackend.certificateArn` | `<YOUR_API_CERTIFICATE_ARN>`       | ACM certificate ARN                 |
| `ssl.services.automator.domain`              | `<YOUR_AUTOMATOR_DOMAIN>`          | e.g., `automations-app.example.com` |
| `ssl.services.automator.certificateArn`      | `<YOUR_AUTOMATOR_CERTIFICATE_ARN>` | ACM certificate ARN                 |
| `ssl.services.supabase.domain`               | `<YOUR_SUPABASE_DOMAIN>`           | e.g., `supabase-app.example.com`    |
| `ssl.services.supabase.certificateArn`       | `<YOUR_SUPABASE_CERTIFICATE_ARN>`  | ACM certificate ARN                 |

```bash theme={null}
# List ACM certificates in your region
aws acm list-certificates --region <YOUR_AWS_REGION> \
  --query "CertificateSummaryList[*].[DomainName,CertificateArn]" --output table
```

**Web frontend Supabase keys — self-hosted (`ENABLE_SUPABASE=true`):**

| Field                         | Placeholder                        | Source                                             |
| ----------------------------- | ---------------------------------- | -------------------------------------------------- |
| `web.supabase.url`            | `https://<YOUR_SUPABASE_DOMAIN>`   | External URL routed via ALB ingress                |
| `web.supabase.anonKey`        | `<YOUR_SUPABASE_ANON_KEY>`         | Same as `secret.jwt.anonKey` in `supabase.yaml`    |
| `web.supabase.serviceRoleKey` | `<YOUR_SUPABASE_SERVICE_ROLE_KEY>` | Same as `secret.jwt.serviceKey` in `supabase.yaml` |
| `web.supabase.clientanonKey`  | `<YOUR_SUPABASE_SERVICE_ROLE_KEY>` | Same as `secret.jwt.serviceKey` in `supabase.yaml` |

**Web frontend Supabase keys — Supabase Cloud (`ENABLE_SUPABASE=false`):**

| Field                         | Placeholder                        | Source                                        |
| ----------------------------- | ---------------------------------- | --------------------------------------------- |
| `web.supabase.url`            | `<YOUR_SUPABASE_PROJECT_URL>`      | Supabase dashboard → Project Settings → API   |
| `web.supabase.anonKey`        | `<YOUR_SUPABASE_ANON_KEY>`         | Supabase dashboard → API → `anon` key         |
| `web.supabase.serviceRoleKey` | `<YOUR_SUPABASE_SERVICE_ROLE_KEY>` | Supabase dashboard → API → `service_role` key |
| `web.supabase.clientanonKey`  | `<YOUR_SUPABASE_CLIENT_ANON_KEY>`  | Same as `service_role` key                    |

#### 5.13 `values/signoz.yaml` — SigNoz Observability (only if `ENABLE_SIGNOZ=true`)

```bash theme={null}
nano values/signoz.yaml
```

| Field                                                                  | Placeholder                                                                  | Notes                          |
| ---------------------------------------------------------------------- | ---------------------------------------------------------------------------- | ------------------------------ |
| `global.clusterName`                                                   | `<YOUR_ENV_NAME>`                                                            | EKS cluster name               |
| `signoz.ingress.annotations.alb.ingress.kubernetes.io/certificate-arn` | `<YOUR_AWS_REGION>`, `<YOUR_AWS_ACCOUNT_ID>`, `<YOUR_SIGNOZ_CERTIFICATE_ID>` | ACM certificate for SigNoz     |
| `signoz.ingress.hosts[0].host`                                         | `<YOUR_SIGNOZ_DOMAIN>`                                                       | e.g., `signoz-app.example.com` |

#### 5.14 `values/signoz-k8s-infra.yaml` — SigNoz K8s Metrics (only if `ENABLE_SIGNOZ=true`)

```bash theme={null}
nano values/signoz-k8s-infra.yaml
```

| Field                | Placeholder       | Notes                                |
| -------------------- | ----------------- | ------------------------------------ |
| `global.clusterName` | `<YOUR_ENV_NAME>` | EKS cluster name for metric labeling |

The OTel collector endpoint (`signoz-otel-collector.monitoring.svc.cluster.local:4317`) is pre-configured assuming both SigNoz and k8s-infra are deployed in the `monitoring` namespace. No change needed unless you use a custom release name.

#### Deployment Ordering Reminder

Some values are only available after certain infrastructure has been deployed. Follow this order:

1. **Before any deployment** — Set: `<YOUR_ENV_NAME>`, `<YOUR_AWS_REGION>`, `<YOUR_AWS_ACCOUNT_ID>`, `<YOUR_ENVIRONMENT>`, `<YOUR_PROJECT>`, `<YOUR_VPC_CIDR>`, all domain names, all certificate ARNs, all Supabase values, `<YOUR_TOOLKIT_ENCRYPTION_KEY>`, RabbitMQ username/password
2. **After EKS cluster created** — Set: `<YOUR_VPC_ID>` (`infrastructure.yaml`), `<YOUR_EKS_CLUSTER_ENDPOINT>` (`karpenter-values.yaml`)
3. **After `terraform apply` for AWS services** — Set: `<YOUR_REDIS_HOST>`, `<YOUR_RABBITMQ_HOST>` (`odin-services.yaml`)

### Step 6: Verify No Placeholders Remain

```bash theme={null}
grep -r "<YOUR_" . --include="*.hcl" --include="*.yaml"
```

The output should be empty, or list only placeholders for resources that have not been created yet (VPC, Redis, MQ, EKS). If any other placeholders remain, refer to the Step 5 sub-sections above.
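
To separate placeholders that can wait from those that block deployment, the check can be scripted. This is a minimal sketch: `check_placeholders` is a hypothetical helper, and its list of deferred placeholders is taken from the Deployment Ordering Reminder and should be adjusted if your templates differ.

```bash
# Hypothetical helper: list placeholders that must be set before any deployment.
# Placeholders for resources created in later phases are treated as deferred.
check_placeholders() {
  local dir="${1:-.}"
  local deferred='<YOUR_(VPC_ID|EKS_CLUSTER_ENDPOINT|REDIS_HOST|RABBITMQ_HOST)>'
  local blocking
  # Collect unique placeholders, then drop the deferred ones.
  blocking=$(grep -rhoE --include='*.hcl' --include='*.yaml' '<YOUR_[A-Z0-9_]+>' "$dir" \
    | sort -u | grep -vE "^${deferred}\$" || true)
  if [ -n "$blocking" ]; then
    echo "Placeholders that must be set before deploying:"
    echo "$blocking"
    return 1
  fi
  echo "No blocking placeholders found."
}

# Example: run from the environment directory
#   check_placeholders .
```

The function returns a non-zero status when blocking placeholders remain, so it can gate a deployment script.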

**Files checklist:**

| File                                | Step | Required                             |
| ----------------------------------- | ---- | ------------------------------------ |
| `terragrunt.hcl`                    | 5.1  | Always                               |
| `state/terragrunt.hcl`              | 5.2  | Always                               |
| `values/infrastructure.yaml`        | 5.3  | Always                               |
| `values/karpenter-values.yaml`      | 5.4  | Always                               |
| `values/karpenter-nodeclasses.yaml` | 5.5  | Always                               |
| `values/aws-ebs-csi-driver.yaml`    | 5.6  | Always                               |
| `values/karpenter.yaml`             | 5.7  | Always                               |
| `values/keda.yaml`                  | 5.8  | Always                               |
| `values/supabase.yaml`              | 5.9  | Only if `ENABLE_SUPABASE=true`       |
| `values/ha-supabase-db.yaml`        | 5.10 | Only if `ENABLE_HA_SUPABASE_DB=true` |
| `values/cloudnative-pg.yaml`        | 5.11 | Only if `ENABLE_CNPG=true`           |
| `values/odin-services.yaml`         | 5.12 | Always                               |
| `values/signoz.yaml`                | 5.13 | Only if `ENABLE_SIGNOZ=true`         |
| `values/signoz-k8s-infra.yaml`      | 5.14 | Only if `ENABLE_SIGNOZ=true`         |
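
The checklist can also be verified mechanically. The sketch below assumes the file paths listed in the table are relative to the environment directory and that the `ENABLE_*` flags are exported as environment variables; `check_required_files` is a hypothetical helper.

```bash
# Hypothetical helper: report checklist files that are missing from an
# environment directory, honoring the ENABLE_* flags (default: false).
check_required_files() {
  local dir="${1:-.}" missing=0 f
  local files="terragrunt.hcl state/terragrunt.hcl
    values/infrastructure.yaml values/karpenter-values.yaml
    values/karpenter-nodeclasses.yaml values/aws-ebs-csi-driver.yaml
    values/karpenter.yaml values/keda.yaml values/odin-services.yaml"
  [ "${ENABLE_SUPABASE:-false}" = "true" ] && files="$files values/supabase.yaml"
  [ "${ENABLE_HA_SUPABASE_DB:-false}" = "true" ] && files="$files values/ha-supabase-db.yaml"
  [ "${ENABLE_CNPG:-false}" = "true" ] && files="$files values/cloudnative-pg.yaml"
  [ "${ENABLE_SIGNOZ:-false}" = "true" ] && files="$files values/signoz.yaml values/signoz-k8s-infra.yaml"
  for f in $files; do   # intentional word splitting on whitespace
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Example: run from the environment directory
#   ENABLE_SIGNOZ=true check_required_files .
```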

***

## Phase 1: State Management Setup

**Purpose:** S3 bucket creation for Terraform state.

Each environment's state management module creates an S3 bucket named `odin-terraform-state-{environment-name}` with encryption, versioning, and public access blocking enabled. The state module itself uses local state (bootstrap pattern).

```bash theme={null}
cd terragrunt/environments/your-env-name/state
terragrunt init
terragrunt plan
terragrunt apply
```
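
Because the environment name becomes part of the bucket name, it must produce a valid S3 bucket name. The following sketch checks the general S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit). `state_bucket_name` is a hypothetical helper, not part of the module.

```bash
# Hypothetical helper: derive the state bucket name for an environment and
# reject names that break the general S3 bucket naming rules.
state_bucket_name() {
  local name="odin-terraform-state-$1"
  if [ "${#name}" -gt 63 ] || ! printf '%s' "$name" | grep -qE '^[a-z0-9][a-z0-9.-]*[a-z0-9]$'; then
    echo "invalid bucket name: $name" >&2
    return 1
  fi
  printf '%s\n' "$name"
}

# Example:
#   state_bucket_name your-env-name
```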

***

## Phase 2: EKS Infrastructure Deployment

**Purpose:** Core networking (VPC, subnets, NAT gateway), IAM roles and policies, EKS cluster and managed node groups.

### 2.1 Dry Run — EKS Infrastructure

**Core Infrastructure**

```bash theme={null}
cd terragrunt/environments/your-env-name
terragrunt plan -target="aws_vpc.main" \
  -target="aws_internet_gateway.main" \
  -target="aws_subnet.public" \
  -target="aws_subnet.private" \
  -target="aws_eip.nat" \
  -target="aws_nat_gateway.main" \
  -target="aws_route_table.public" \
  -target="aws_route_table.private" \
  -target="aws_route_table_association.public" \
  -target="aws_route_table_association.private"
```

**IAM Roles and Policies**

```bash theme={null}
terragrunt plan -target="aws_iam_role.cluster" \
  -target="aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy" \
  -target="aws_iam_openid_connect_provider.eks" \
  -target="aws_iam_role.node" \
  -target="aws_iam_role_policy_attachment.node_AmazonEKSWorkerNodePolicy" \
  -target="aws_iam_role_policy_attachment.node_AmazonEKS_CNI_Policy" \
  -target="aws_iam_role_policy_attachment.node_AmazonEC2ContainerRegistryReadOnly"
```

**EKS Cluster and Node Groups**

```bash theme={null}
terragrunt plan -target="aws_eks_cluster.main" \
  -target="aws_eks_node_group.main" \
  -target="kubernetes_secret.regcred"
```

#### Using a Custom / Private Docker Registry

By default, Odin AI Platform images are pulled from Docker Hub using a secret named `regcred`. If the customer hosts images in a different registry, follow these steps before deploying `odin-services`.

**Step 1 — Create the `imagePullSecret` in the target namespace**

```bash theme={null}
# Generic private registry (Docker Hub, Quay, self-hosted, etc.)
kubectl create secret docker-registry regcred \
  --namespace default \
  --docker-server=<YOUR_REGISTRY_HOST> \
  --docker-username=<YOUR_REGISTRY_USERNAME> \
  --docker-password=<YOUR_REGISTRY_PASSWORD> \
  --docker-email=<YOUR_EMAIL>
 
# AWS ECR — token expires every 12h; refresh via a CronJob or use ECR pull-through cache
# kubectl has no --docker-password-stdin flag, so pass the token via command substitution
kubectl create secret docker-registry regcred \
  --namespace default \
  --docker-server=<YOUR_AWS_ACCOUNT_ID>.dkr.ecr.<YOUR_AWS_REGION>.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$(aws ecr get-login-password --region <YOUR_AWS_REGION>)"
```

**Step 2 — Set the secret name in `values/odin-services.yaml`**

```yaml theme={null}
# values/odin-services.yaml
imagePullSecrets:
  - name: regcred          # must match the secret name created above
  # - name: customer-registry-secret  # add additional registries if needed
```

**Step 3 — Update image references**

```yaml theme={null}
web:
  image: <YOUR_REGISTRY_HOST>/<YOUR_ORG>/web:<TAG>
 
fastapiBackend:
  image: <YOUR_REGISTRY_HOST>/<YOUR_ORG>/server:<TAG>
```

**Step 4 — Verify pull access before full deployment**

```bash theme={null}
kubectl run registry-test \
  --image=<YOUR_REGISTRY_HOST>/<YOUR_ORG>/web:<TAG> \
  --overrides='{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}' \
  --restart=Never --rm -it -- echo "Pull successful"
```

### 2.2 Deploy EKS Infrastructure

**Step 1: Core Infrastructure**

```bash theme={null}
cd terragrunt/environments/your-env-name
terragrunt apply -target="aws_vpc.main" \
  -target="aws_internet_gateway.main" \
  -target="aws_subnet.public" \
  -target="aws_subnet.private" \
  -target="aws_eip.nat" \
  -target="aws_nat_gateway.main" \
  -target="aws_route_table.public" \
  -target="aws_route_table.private" \
  -target="aws_route_table_association.public" \
  -target="aws_route_table_association.private"
```

<Warning>
  After this step, update `vpcId` in `values/infrastructure.yaml` before deploying the AWS Load Balancer Controller.
</Warning>

**Step 2: IAM Roles and Policies**

```bash theme={null}
terragrunt apply -target="aws_iam_role.cluster" \
  -target="aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy" \
  -target="aws_iam_openid_connect_provider.eks" \
  -target="aws_iam_role.node" \
  -target="aws_iam_role_policy_attachment.node_AmazonEKSWorkerNodePolicy" \
  -target="aws_iam_role_policy_attachment.node_AmazonEKS_CNI_Policy" \
  -target="aws_iam_role_policy_attachment.node_AmazonEC2ContainerRegistryReadOnly"
```

**Step 3: EKS Cluster and Node Groups**

```bash theme={null}
terragrunt apply -target="aws_eks_cluster.main" \
  -target="aws_eks_node_group.main" \
  -target="kubernetes_secret.regcred"
```

<Warning>
  After this step, update `CLUSTER_ENDPOINT` in `values/karpenter-values.yaml` before deploying Karpenter.
</Warning>

**Check EKS Cluster Connectivity**

```bash theme={null}
aws eks update-kubeconfig --region $AWS_REGION --name $CLUSTER_NAME
 
kubectl cluster-info
kubectl get nodes
kubectl get secret regcred -n default
```
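
The verification commands in this guide rely on `$AWS_REGION` and `$CLUSTER_NAME` being set in the current shell. A small guard, sketched here under the assumption that both are plain environment variables, fails fast when one is missing; `require_env` is a hypothetical helper.

```bash
# Hypothetical guard: fail fast if a variable that later commands rely on is unset.
require_env() {
  local name value missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"   # portable indirect lookup of the variable named by $name
    if [ -z "$value" ]; then
      echo "error: $name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   export AWS_REGION=<YOUR_AWS_REGION> CLUSTER_NAME=<YOUR_ENV_NAME>
#   require_env AWS_REGION CLUSTER_NAME && kubectl get nodes
```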

***

## Phase 3: Storage and Load Balancing

**Purpose:** EBS CSI driver for persistent volumes, AWS Load Balancer Controller running on managed node group.

### 3.1 Dry Run — Storage and Load Balancing

**EBS CSI Driver**

```bash theme={null}
cd terragrunt/environments/your-env-name
terragrunt plan -target="aws_iam_role.ebs_csi_driver" \
  -target="aws_iam_role_policy_attachment.ebs_csi_driver" \
  -target="helm_release.ebs_csi_driver"
```

**AWS Load Balancer Controller**

```bash theme={null}
terragrunt plan -target="aws_iam_role.aws_load_balancer_controller" \
  -target="aws_iam_role_policy_attachment.aws_load_balancer_controller" \
  -target="aws_iam_policy.aws_load_balancer_controller" \
  -target="helm_release.infrastructure"
```

### 3.2 Deploy Storage and Load Balancing

**Step 1: EBS CSI Driver**

```bash theme={null}
cd terragrunt/environments/your-env-name
terragrunt apply -target="aws_iam_role.ebs_csi_driver" \
  -target="aws_iam_role_policy_attachment.ebs_csi_driver" \
  -target="helm_release.ebs_csi_driver"
```

**Verification**

```bash theme={null}
helm list -n kube-system | grep ebs
kubectl get pods -n kube-system | grep ebs-csi
kubectl get storageclass
aws iam get-role --role-name $CLUSTER_NAME-ebs-csi-driver-role --region $AWS_REGION
kubectl get sa -n kube-system | grep ebs-csi
kubectl describe sa ebs-csi-controller-sa -n kube-system
```

**Step 2: AWS Load Balancer Controller**

```bash theme={null}
terragrunt apply -target="aws_iam_role.aws_load_balancer_controller" \
  -target="aws_iam_role_policy_attachment.aws_load_balancer_controller" \
  -target="aws_iam_policy.aws_load_balancer_controller" \
  -target="helm_release.infrastructure"
```

**Verification**

```bash theme={null}
helm list -n infrastructure
kubectl get pods -n infrastructure | grep aws-load-balancer-controller
kubectl get sa -n infrastructure
kubectl describe sa aws-load-balancer-controller -n infrastructure
aws iam get-role --role-name $CLUSTER_NAME-aws-load-balancer-controller --region $AWS_REGION
kubectl logs -n infrastructure -l app.kubernetes.io/name=aws-load-balancer-controller
kubectl get ingressclass
```

***

## Phase 4: Karpenter Autoscaling

**Purpose:** IAM roles for Karpenter, Spot interruption handling, Karpenter controller and node pools.

### 4.1 Dry Run — Karpenter

**Karpenter IAM Resources**

```bash theme={null}
cd terragrunt/environments/your-env-name
terragrunt plan -target="aws_iam_role.karpenter_controller" \
  -target="aws_iam_policy.karpenter_controller" \
  -target="aws_iam_role_policy_attachment.karpenter_controller" \
  -target="aws_iam_role.karpenter_node" \
  -target="aws_iam_role_policy_attachment.karpenter_node_AmazonEKSWorkerNodePolicy" \
  -target="aws_iam_role_policy_attachment.karpenter_node_AmazonEKS_CNI_Policy" \
  -target="aws_iam_role_policy_attachment.karpenter_node_AmazonEC2ContainerRegistryReadOnly" \
  -target="aws_iam_role_policy_attachment.karpenter_node_AmazonEBSCSIDriverPolicy" \
  -target="aws_iam_instance_profile.karpenter_node"
```

**EC2 Spot Service-Linked Role (if spot instances are enabled)**

```bash theme={null}
terragrunt plan -target="aws_iam_service_linked_role.ec2_spot[0]"
```

**Karpenter Spot Interruption (if enabled in `terragrunt.hcl`)**

```bash theme={null}
terragrunt plan -target="aws_sqs_queue.karpenter_interruption_queue" \
  -target="aws_sqs_queue_policy.karpenter_interruption_queue" \
  -target="aws_cloudwatch_event_rule.karpenter_interruption" \
  -target="aws_cloudwatch_event_target.karpenter_interruption"
```

**Karpenter Helm Charts**

```bash theme={null}
terragrunt plan -target="helm_release.karpenter"
```

**Karpenter NodePools and EC2NodeClasses**

```bash theme={null}
terragrunt plan -target="kubernetes_manifest.karpenter_nodepool" \
  -target="kubernetes_manifest.karpenter_nodeclass" \
  -target="kubernetes_config_map.aws_auth"
```

<Info>
  An expected error may appear during plan: `API did not recognize GroupVersionKind from manifest (CRD may not be installed)`. This is safe to ignore: the Terraform `kubernetes_manifest` resource validates manifests against the live cluster API at plan time, and the Karpenter CRDs do not exist until the Karpenter Helm chart has been applied.
</Info>

### 4.2 Deploy Karpenter

**Step 1: Karpenter IAM Resources**

```bash theme={null}
cd terragrunt/environments/your-env-name
terragrunt apply -target="aws_iam_role.karpenter_controller" \
  -target="aws_iam_policy.karpenter_controller" \
  -target="aws_iam_role_policy_attachment.karpenter_controller" \
  -target="aws_iam_role.karpenter_node" \
  -target="aws_iam_role_policy_attachment.karpenter_node_AmazonEKSWorkerNodePolicy" \
  -target="aws_iam_role_policy_attachment.karpenter_node_AmazonEKS_CNI_Policy" \
  -target="aws_iam_role_policy_attachment.karpenter_node_AmazonEC2ContainerRegistryReadOnly" \
  -target="aws_iam_role_policy_attachment.karpenter_node_AmazonEBSCSIDriverPolicy" \
  -target="aws_iam_instance_profile.karpenter_node"
```

**Verification**

```bash theme={null}
aws iam get-role --role-name $CLUSTER_NAME-karpenter-controller --region $AWS_REGION
aws iam get-role --role-name $CLUSTER_NAME-karpenter-node --region $AWS_REGION
aws iam get-instance-profile --instance-profile-name $CLUSTER_NAME-karpenter-node --region $AWS_REGION
aws iam list-attached-role-policies --role-name $CLUSTER_NAME-karpenter-node --region $AWS_REGION
```

**Step 2: EC2 Spot Service-Linked Role (if spot instances are enabled)**

<Warning>
  The EC2 Spot service-linked role is account-wide (only one per AWS account) and must exist before Karpenter can launch Spot instances.
</Warning>

**Option A: Let Terraform create it (recommended for new deployments)**

```bash theme={null}
terragrunt apply -target="aws_iam_service_linked_role.ec2_spot[0]"
```

**Option B: Import if the role already exists**

```bash theme={null}
# Check if the role exists
aws iam get-role --role-name AWSServiceRoleForEC2Spot --region $AWS_REGION
 
# If it doesn't exist, create it manually
aws iam create-service-linked-role --aws-service-name spot.amazonaws.com --region $AWS_REGION
 
# Import into Terraform (replace ACCOUNT_ID with your 12-digit AWS account ID)
terragrunt import 'aws_iam_service_linked_role.ec2_spot[0]' \
  arn:aws:iam::ACCOUNT_ID:role/aws-service-role/spot.amazonaws.com/AWSServiceRoleForEC2Spot
```
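
To avoid hand-editing the account ID in the import command, the ARN can be generated. This sketch assumes the standard `aws` partition; `spot_role_arn` is a hypothetical helper.

```bash
# Hypothetical helper: build the ARN of the EC2 Spot service-linked role
# for a given 12-digit AWS account ID.
spot_role_arn() {
  local account_id="$1"
  if ! printf '%s' "$account_id" | grep -qE '^[0-9]{12}$'; then
    echo "error: expected a 12-digit AWS account ID" >&2
    return 1
  fi
  printf 'arn:aws:iam::%s:role/aws-service-role/spot.amazonaws.com/AWSServiceRoleForEC2Spot\n' "$account_id"
}

# Example: look up the current account ID, then import
#   ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
#   terragrunt import 'aws_iam_service_linked_role.ec2_spot[0]' "$(spot_role_arn "$ACCOUNT_ID")"
```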

<details>
  <summary>Troubleshooting Spot Instance Creation Issues</summary>

  **Check Karpenter logs for Spot-related errors:**

  ```bash theme={null}
  kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -f | grep -i spot
  ```

  Common errors: `AuthFailure.ServiceLinkedRoleCreationNotPermitted`, `UnfulfillableCapacity`, `InsufficientInstanceCapacity`.

  **Verify service-linked role exists:**

  ```bash theme={null}
  aws iam get-role --role-name AWSServiceRoleForEC2Spot --region $AWS_REGION
  ```

  **Verify IAM policy includes Spot permission:**

  ```bash theme={null}
  # Look up the policy ARN and its default version once, then inspect that version
  POLICY_ARN=$(aws iam list-policies \
    --query 'Policies[?PolicyName==`YOUR_CLUSTER_NAME-karpenter-controller`].Arn' --output text)
  VERSION_ID=$(aws iam get-policy --policy-arn "$POLICY_ARN" \
    --query 'Policy.DefaultVersionId' --output text)
  aws iam get-policy-version --policy-arn "$POLICY_ARN" --version-id "$VERSION_ID" \
    --region $AWS_REGION | grep -i "CreateServiceLinkedRole"
  ```

  **Check Spot instance availability:**

  ```bash theme={null}
  aws ec2 describe-spot-price-history \
    --instance-types r6a.4xlarge r6a.large \
    --product-descriptions "Linux/UNIX" \
    --region $AWS_REGION \
    --max-items 10
  ```

  **Check NodePool capacity types:**

  ```bash theme={null}
  kubectl get nodepool -o yaml | grep -A 5 "capacity-type"
  ```

  **Verify Spot vs On-Demand nodes:**

  ```bash theme={null}
  kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name)\t\(.metadata.labels."karpenter.sh/capacity-type")\t\(.metadata.labels."node.kubernetes.io/instance-type")"'
  ```
</details>

**Step 3: Karpenter Spot Interruption (if enabled in `terragrunt.hcl`)**

```bash theme={null}
terragrunt apply -target="aws_sqs_queue.karpenter_interruption_queue" \
  -target="aws_sqs_queue_policy.karpenter_interruption_queue" \
  -target="aws_cloudwatch_event_rule.karpenter_interruption" \
  -target="aws_cloudwatch_event_target.karpenter_interruption"
```

**Verification**

```bash theme={null}
aws events describe-rule --name $CLUSTER_NAME-karpenter-interruption --region $AWS_REGION
aws events list-targets-by-rule --rule $CLUSTER_NAME-karpenter-interruption --region $AWS_REGION
aws sqs get-queue-url --queue-name $CLUSTER_NAME-karpenter-interruption-queue --region $AWS_REGION
```

**Step 4: Karpenter Helm Chart**

```bash theme={null}
terragrunt apply -target="helm_release.karpenter"
```

**Verification**

```bash theme={null}
helm list -n kube-system | grep karpenter
kubectl get pods -n kube-system | grep karpenter
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter
kubectl describe sa karpenter -n kube-system
```

**Step 5: Karpenter Kubernetes Manifests**

```bash theme={null}
terragrunt apply -target="kubernetes_manifest.karpenter_nodepool" \
  -target="kubernetes_manifest.karpenter_nodeclass"
 
# Import the existing aws-auth ConfigMap
# Note: Use quotes to prevent zsh from interpreting brackets as glob patterns
terragrunt import 'kubernetes_config_map.aws_auth[0]' kube-system/aws-auth
 
# Then apply
terragrunt apply -target='kubernetes_config_map.aws_auth[0]'
```

**Verification**

```bash theme={null}
kubectl get nodepools -o wide
kubectl describe nodepool general
kubectl describe nodepool application
kubectl describe nodepool database
kubectl get ec2nodeclasses -o wide
kubectl get configmap aws-auth -n kube-system -o jsonpath='{.data.mapRoles}' | grep karpenter-node
kubectl get nodepools -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
kubectl get nodes -l karpenter.sh/nodepool --show-labels
kubectl get events -n kube-system --field-selector involvedObject.name=karpenter --sort-by='.lastTimestamp'
```

***

## Phase 5: KEDA Autoscaling

**Purpose:** KEDA for application-level autoscaling.

### 5.1 Dry Run — KEDA

```bash theme={null}
cd terragrunt/environments/your-env-name
terragrunt plan -target="helm_release.keda"
```

### 5.2 Deploy KEDA

```bash theme={null}
cd terragrunt/environments/your-env-name
terragrunt apply -target="helm_release.keda"
```

**Verification**

```bash theme={null}
helm list -n keda
kubectl get pods -n keda
kubectl get deployment -n keda
kubectl get crd | grep keda
kubectl get validatingwebhookconfigurations | grep keda
kubectl get svc -n keda
```

***

## Phase 6: Data Services

**Purpose:** Supabase (database), ElastiCache (Redis), RabbitMQ (message queue).

<Info>
  Deploy CloudNativePG operator first, then the HA Supabase DB cluster, then the Supabase application. The DB cluster must be ready before Supabase starts.
</Info>
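
The ordering above can be captured in a small script so that the targets are always applied in the same sequence. This is a sketch only: `data_service_targets` is a hypothetical helper, and it assumes the `ENABLE_*` flags are set as environment variables and that the resource addresses match those used in the commands in this phase.

```bash
# Hypothetical helper: print the enabled data-service targets in dependency
# order (operator, then DB cluster, then the Supabase application).
data_service_targets() {
  [ "${ENABLE_CNPG:-false}" = "true" ] && echo 'helm_release.additional_charts["cloudnative-pg"]'
  [ "${ENABLE_HA_SUPABASE_DB:-false}" = "true" ] && echo 'helm_release.additional_charts["ha-supabase-db"]'
  [ "${ENABLE_SUPABASE:-false}" = "true" ] && echo 'helm_release.supabase[0]'
  return 0
}

# Example: plan each enabled target in order
#   data_service_targets | while IFS= read -r target; do
#     terragrunt plan --target="$target"
#   done
```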

### 6.1 Dry Run — Data Services

**Step 1: CloudNativePG operator (if enabled)**

```bash theme={null}
cd terragrunt/environments/your-env-name
ENABLE_CNPG=true terragrunt plan \
  --target='helm_release.additional_charts["cloudnative-pg"]'
```

**Step 2: HA Supabase DB (if enabled)**

```bash theme={null}
ENABLE_HA_SUPABASE_DB=true terragrunt plan \
  --target='helm_release.additional_charts["ha-supabase-db"]'
```

**Step 3: Supabase application (if enabled)**

```bash theme={null}
if [ "${ENABLE_SUPABASE:-false}" = "true" ]; then
  ENABLE_SUPABASE=true terragrunt plan \
    --target='helm_release.supabase[0]'
fi
```

**Step 4: AWS Services — ElastiCache and RabbitMQ (if enabled)**

```bash theme={null}
if [ "${ENABLE_AWS_SERVICES:-false}" = "true" ]; then
  terragrunt plan -target="aws_elasticache_subnet_group.redis" \
    -target="aws_security_group.redis" \
    -target="aws_elasticache_replication_group.redis" \
    -target="aws_security_group.rabbitmq" \
    -target="aws_mq_broker.rabbitmq"
fi
```

### 6.2 Deploy Data Services

**Step 1: CloudNativePG operator (if enabled)**

```bash theme={null}
cd terragrunt/environments/your-env-name
ENABLE_CNPG=true terragrunt apply --auto-approve \
  --target='helm_release.additional_charts["cloudnative-pg"]'
```

**Step 2: HA Supabase DB (if enabled)**

```bash theme={null}
ENABLE_HA_SUPABASE_DB=true terragrunt apply --auto-approve \
  --target='helm_release.additional_charts["ha-supabase-db"]'
```

**Verify PgBouncer pooler and credentials after deployment:**

```bash theme={null}
kubectl get svc -n ha-supabase-db | grep pooler
kubectl get secrets -n ha-supabase-db
kubectl get secret ha-supabase-db-authenticator-credentials -n ha-supabase-db \
  -o jsonpath='{.data.username}' | base64 -d && echo ""
```

Use the pooler ClusterIP (or `EXTERNAL-IP` if LoadBalancer) as the `SUPABASE_POSTGRES_HOST` value in `values/odin-services.yaml` and as `secret.db.postgresHost` in `values/supabase.yaml`.

**Step 3: Supabase application (if enabled)**

All Supabase service pods run exclusively on the Karpenter `application` NodePool (On-Demand only) to prevent Spot interruptions.

```bash theme={null}
if [ "${ENABLE_SUPABASE:-false}" = "true" ]; then
  ENABLE_SUPABASE=true terragrunt apply --auto-approve \
    --target='helm_release.supabase[0]'
fi
```

**Step 4: AWS Services — ElastiCache and RabbitMQ (if enabled)**

```bash theme={null}
if [ "${ENABLE_AWS_SERVICES:-false}" = "true" ]; then
  terragrunt apply \
    -target="aws_elasticache_subnet_group.redis" \
    -target="aws_security_group.redis" \
    -target="aws_elasticache_replication_group.redis" \
    -target="aws_security_group.rabbitmq" \
    -target="aws_mq_broker.rabbitmq"
fi
```

**Verification**

```bash theme={null}
# Get connection details from Terraform outputs
terragrunt output elasticache_endpoint
terragrunt output elasticache_port
terragrunt output rabbitmq_endpoint
terragrunt output rabbitmq_port
 
# Test Redis connectivity from EKS cluster
kubectl run redis-test --image=redis:7-alpine --restart=Never -- \
  sh -c "redis-cli -h <redis-endpoint> -p 6379 --tls --insecure ping && echo 'Redis connection successful'"
kubectl logs redis-test
kubectl delete pod redis-test
 
# Check Redis encryption status
aws elasticache describe-replication-groups \
  --replication-group-id $CLUSTER_NAME-redis \
  --region $AWS_REGION \
  --query 'ReplicationGroups[0].{AtRestEncryption:AtRestEncryptionEnabled,TransitEncryption:TransitEncryptionEnabled}'
```

<Warning>
  Before deploying Odin Services, update `values/odin-services.yaml` with the Redis and RabbitMQ endpoints obtained in this phase, and confirm that all certificate ARNs are filled in.
</Warning>

***

## Phase 7: Odin Services

**Purpose:** Application deployment via Helm.

<Info>
  Before deploying, temporarily scale down `fastapiBackend` to a single replica for the initial database migration run — set `replicaCount: 1`, `workers: 1`, and `keda.minReplicas: 1`. Once the migration completes successfully, revert these values to their production defaults before re-deploying.
</Info>

### 7.1 Dry Run — Odin Services

```bash theme={null}
cd terragrunt/environments/your-env-name
terragrunt plan -target="helm_release.odin_services"
```

### 7.2 Deploy Odin Services

```bash theme={null}
cd terragrunt/environments/your-env-name
terragrunt apply -target="helm_release.odin_services"
```

**Verification**

```bash theme={null}
kubectl get pods
kubectl get ingress  # Add the ALB endpoints to your DNS provider
```

***

## Phase 8: SigNoz Observability

**Purpose:** Distributed tracing, metrics, and log aggregation via SigNoz and the k8s-infra agent.

### 8.1 Dry Run — SigNoz Charts

```bash theme={null}
cd terragrunt/environments/your-env-name
terragrunt plan -target='helm_release.additional_charts["signoz"]'
terragrunt plan -target='helm_release.additional_charts["k8s-infra"]'
```

### 8.2 Deploy SigNoz Charts

```bash theme={null}
cd terragrunt/environments/your-env-name
terragrunt apply -target='helm_release.additional_charts["signoz"]'
terragrunt apply -target='helm_release.additional_charts["k8s-infra"]'
```

**Verification**

```bash theme={null}
kubectl get pods -n monitoring
kubectl get ingress -n monitoring  # Add the ALB endpoints to your DNS provider
```

***

## Phase 9: Final Deployment

### 9.1 Complete Deployment

```bash theme={null}
cd terragrunt/environments/your-env-name
terragrunt apply
```

This final apply handles any remaining resources not explicitly targeted in previous phases.

### 9.2 Verify Deployment

```bash theme={null}
# Update kubeconfig
aws eks update-kubeconfig --region $AWS_REGION --name $CLUSTER_NAME
 
# Check cluster status
kubectl get nodes
kubectl get pods --all-namespaces
 
# Check Karpenter
kubectl get pods -n kube-system -l app.kubernetes.io/name=karpenter
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter
 
# Check AWS Load Balancer Controller
kubectl get pods -n infrastructure -l app.kubernetes.io/name=aws-load-balancer-controller
 
# Check KEDA
kubectl get pods -n keda
 
# Check Odin Services
kubectl get pods -n default
kubectl get services -n default
kubectl get ingress -n default
 
# Check all Helm releases
helm list --all-namespaces
```
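
On a busy cluster the full listings above are long, so it can help to filter for problems only. The helpers below are a sketch with hypothetical names. Note that a pod in `CrashLoopBackOff` still reports phase `Running`, so the pod filter narrows the list but does not replace a look at the `READY` and `RESTARTS` columns.

```bash
# Hypothetical helper: pods whose phase is neither Running nor Succeeded
# (for example Pending or Failed).
list_unhealthy_pods() {
  kubectl get pods --all-namespaces \
    --field-selector='status.phase!=Running,status.phase!=Succeeded'
}

# Hypothetical helper: Helm releases currently in the "failed" state.
list_failed_releases() {
  helm list --all-namespaces --failed
}

# Usage: list_unhealthy_pods; list_failed_releases
```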

***

## Troubleshooting

**State lock issues**

```bash theme={null}
terragrunt force-unlock <lock-id>
```
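
The `<lock-id>` is printed in the `Lock Info` section of the error from the command that failed to acquire the lock. Only force-unlock when you are sure no other `terragrunt` process is still running against the same state. As a sketch, the ID can be pulled out of captured output like this; the function name is hypothetical, and the parsing assumes the usual `ID: <value>` line in Terraform's lock error.

```bash
# Hypothetical helper: read Terraform/Terragrunt output on stdin and print
# the value of the first "ID:" line from the "Lock Info" section.
extract_lock_id() {
  awk '$1 == "ID:" { print $2; exit }'
}

# Usage: terragrunt plan 2>&1 | extract_lock_id
```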

**Karpenter not working**

```bash theme={null}
kubectl describe nodes
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter
```
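
When nodes are not being provisioned, it usually helps to look at the pods waiting for capacity alongside Karpenter's own resources. The sketch below uses a hypothetical function name and assumes a Karpenter version that exposes the `nodepools` and `nodeclaims` resources, which matches the NodePool and NodeClass manifests this guide deploys.

```bash
# Hypothetical helper: gather the most common Karpenter diagnostics in one go.
karpenter_diagnose() {
  echo "== Pending pods =="
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending
  echo "== NodePools and NodeClaims =="
  kubectl get nodepools,nodeclaims
  echo "== Recent controller logs =="
  kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=50
}

# Usage: karpenter_diagnose
```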

**Load Balancer issues**

```bash theme={null}
kubectl describe ingress -n default
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
```
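
Reconciliation errors from the controller are also recorded as events on the Ingress itself, which is often quicker to scan than the controller logs. A minimal sketch, with a hypothetical function name:

```bash
# Hypothetical helper: show events attached to Ingress objects, oldest first.
ingress_events() {
  local namespace="${1:-default}"
  kubectl get events -n "$namespace" \
    --field-selector involvedObject.kind=Ingress \
    --sort-by=.lastTimestamp
}

# Usage: ingress_events default
```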

**Helm chart issues**

```bash theme={null}
helm status <release-name> -n <namespace>
helm rollback <release-name> <revision> -n <namespace>
```
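
To pick a `<revision>`, list the release history first. Passing `0` as the revision (or omitting it) tells `helm rollback` to return to the previous release. The wrapper below is a sketch with a hypothetical function name. Keep in mind that releases in this guide are managed by Terraform, so a manual rollback will show up as drift on the next `terragrunt plan`.

```bash
# Hypothetical helper: show recent history, then roll back.
# The revision defaults to 0, which Helm treats as "previous release".
rollback_release() {
  local release="$1" namespace="$2" revision="${3:-0}"
  helm history "$release" -n "$namespace" --max 5
  helm rollback "$release" "$revision" -n "$namespace" --wait
}

# Usage: rollback_release <release-name> <namespace> [revision]
```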

***

## Cleanup

```bash theme={null}
# Destroy infrastructure
cd terragrunt/environments/your-env-name
terragrunt destroy -auto-approve
 
# Destroy state bucket (use with caution, and only after the infrastructure destroy above has finished)
cd state  # i.e. terragrunt/environments/your-env-name/state
terragrunt destroy -auto-approve
```
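
Because `-auto-approve` skips Terraform's own confirmation prompt, you may want a guard of your own in front of the destroy. The sketch below, with a hypothetical function name, asks the operator to type the environment name. Running `terragrunt plan -destroy` beforehand shows what would be removed.

```bash
# Hypothetical helper: succeed only if the operator types the environment name.
confirm_destroy() {
  local env_name="$1" answer=""
  printf 'Type "%s" to confirm destroy: ' "$env_name" >&2
  read -r answer || true
  [ "$answer" = "$env_name" ]
}

# Usage:
#   terragrunt plan -destroy
#   confirm_destroy your-env-name && terragrunt destroy -auto-approve
```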

***

## Monitoring and Logging

```bash theme={null}
# AWS Resources
aws eks describe-cluster --name your-env-name --region us-east-2
aws ec2 describe-instances --region us-east-2 --filters "Name=tag:kubernetes.io/cluster/your-env-name,Values=owned"

# Kubernetes Resources (kubectl top requires the Kubernetes Metrics Server)
kubectl top nodes
kubectl top pods --all-namespaces
kubectl get events --sort-by=.metadata.creationTimestamp
```
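
For a quicker health signal, the same commands can be narrowed to the cluster status and to warning events only. This is a sketch with hypothetical function names; the region default mirrors the `us-east-2` used in the commands above.

```bash
# Hypothetical helper: print only the EKS cluster status (for example ACTIVE).
cluster_status() {
  local cluster="$1" region="${2:-us-east-2}"
  aws eks describe-cluster --name "$cluster" --region "$region" \
    --query 'cluster.status' --output text
}

# Hypothetical helper: warning events across all namespaces, oldest first.
warning_events() {
  kubectl get events --all-namespaces --field-selector type=Warning \
    --sort-by=.metadata.creationTimestamp
}

# Usage: cluster_status your-env-name; warning_events
```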

***

## Quick Reference — All Deployment Commands

```bash theme={null}
# Phase 1: State Management
cd terragrunt/environments/your-env-name/state
terragrunt apply
 
# Phase 2: EKS Infrastructure
cd ..  # back to terragrunt/environments/your-env-name
 
terragrunt apply -target="aws_vpc.main" -target="aws_internet_gateway.main" \
  -target="aws_subnet.public" -target="aws_subnet.private" -target="aws_eip.nat" \
  -target="aws_nat_gateway.main" -target="aws_route_table.public" \
  -target="aws_route_table.private" -target="aws_route_table_association.public" \
  -target="aws_route_table_association.private" -auto-approve
 
terragrunt apply -target="aws_iam_role.cluster" \
  -target="aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy" \
  -target="aws_iam_openid_connect_provider.eks" -target="aws_iam_role.node" \
  -target="aws_iam_role_policy_attachment.node_AmazonEKSWorkerNodePolicy" \
  -target="aws_iam_role_policy_attachment.node_AmazonEKS_CNI_Policy" \
  -target="aws_iam_role_policy_attachment.node_AmazonEC2ContainerRegistryReadOnly" \
  -auto-approve
 
terragrunt apply -target="aws_eks_cluster.main" \
  -target="aws_eks_node_group.main" -target="kubernetes_secret.regcred" -auto-approve
 
# Phase 3: Storage and Load Balancing
terragrunt apply -target="aws_iam_role.ebs_csi_driver" \
  -target="aws_iam_role_policy_attachment.ebs_csi_driver" \
  -target="helm_release.ebs_csi_driver" -auto-approve
 
terragrunt apply -target="aws_iam_role.aws_load_balancer_controller" \
  -target="aws_iam_role_policy_attachment.aws_load_balancer_controller" \
  -target="aws_iam_policy.aws_load_balancer_controller" \
  -target="helm_release.infrastructure" -auto-approve
 
# Phase 4: Karpenter Autoscaling
terragrunt apply -target="aws_iam_role.karpenter_controller" \
  -target="aws_iam_policy.karpenter_controller" \
  -target="aws_iam_role_policy_attachment.karpenter_controller" \
  -target="aws_iam_role.karpenter_node" \
  -target="aws_iam_role_policy_attachment.karpenter_node_AmazonEKSWorkerNodePolicy" \
  -target="aws_iam_role_policy_attachment.karpenter_node_AmazonEKS_CNI_Policy" \
  -target="aws_iam_role_policy_attachment.karpenter_node_AmazonEC2ContainerRegistryReadOnly" \
  -target="aws_iam_role_policy_attachment.karpenter_node_AmazonEBSCSIDriverPolicy" \
  -target="aws_iam_instance_profile.karpenter_node" -auto-approve
 
# Spot interruption handling (if spot_interruption_handling = true)
terragrunt apply -target="aws_sqs_queue.karpenter_interruption_queue" \
  -target="aws_sqs_queue_policy.karpenter_interruption_queue" \
  -target="aws_cloudwatch_event_rule.karpenter_interruption" \
  -target="aws_cloudwatch_event_target.karpenter_interruption" -auto-approve
 
terragrunt apply -target="helm_release.karpenter" -auto-approve
 
terragrunt apply -target="kubernetes_manifest.karpenter_nodepool" \
  -target="kubernetes_manifest.karpenter_nodeclass" \
  -target='kubernetes_config_map.aws_auth[0]' -auto-approve
 
# Phase 5: KEDA Autoscaling
terragrunt apply -target="helm_release.keda" -auto-approve
 
# Phase 6: Data Services
ENABLE_CNPG=true terragrunt apply -target='helm_release.additional_charts["cloudnative-pg"]' -auto-approve
ENABLE_HA_SUPABASE_DB=true terragrunt apply -target='helm_release.additional_charts["ha-supabase-db"]' -auto-approve

if [ "${ENABLE_SUPABASE:-false}" = "true" ]; then
  ENABLE_SUPABASE=true terragrunt apply -target='helm_release.supabase[0]' -auto-approve
fi
 
if [ "${ENABLE_AWS_SERVICES:-false}" = "true" ]; then
  terragrunt apply -target="aws_elasticache_subnet_group.redis" \
    -target="aws_security_group.redis" \
    -target="aws_elasticache_replication_group.redis" \
    -target="aws_security_group.rabbitmq" \
    -target="aws_mq_broker.rabbitmq" -auto-approve
fi
 
# Phase 7: Odin Services
terragrunt apply -target="helm_release.odin_services" -auto-approve
 
# Phase 8: SigNoz (if enabled)
terragrunt apply -target='helm_release.additional_charts["signoz"]' -auto-approve
terragrunt apply -target='helm_release.additional_charts["k8s-infra"]' -auto-approve
 
# Phase 9: Final Deployment
terragrunt apply -auto-approve
```

<Info>
  Replace `your-env-name` with your actual environment name throughout. Always run a dry run (`terragrunt plan`) and review its output before applying changes.
</Info>
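
The plan-before-apply advice can be folded into a small wrapper so that each targeted phase is always previewed first. This is a sketch, not part of the Odin tooling: the function name is hypothetical, and it omits `-auto-approve` so that the apply still prompts for confirmation.

```bash
# Hypothetical helper: plan the given targets, then apply them if the plan
# succeeds. Each argument is a resource address as used with -target above.
plan_then_apply() {
  if [ "$#" -eq 0 ]; then
    echo "usage: plan_then_apply <resource-address>..." >&2
    return 2
  fi
  local args=() target
  for target in "$@"; do
    args+=("-target=$target")
  done
  terragrunt plan "${args[@]}" || return 1
  terragrunt apply "${args[@]}"
}

# Usage: plan_then_apply 'helm_release.additional_charts["signoz"]'
```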
