Deploying a RAG Chatbot on AWS with Terraform

Most people talk about RAG like the model is the system.

It isn’t.

The real system is everything around it: networking, private connectivity, containers, IAM, databases, deployment flows, and the boring infrastructure details that decide whether your app actually works in production.

This guide takes the original AWS + Terraform walkthrough and refines it into a clearer, sharper, SpaceSlam-style engineering breakdown.

Instead of just repeating Terraform blocks, we focus on why each piece exists, how the architecture fits together, and what matters if you want to build something real.

#What We Are Actually Building

We want to reproduce an end-to-end AWS setup for a RAG chatbot using Terraform.

That includes:

a custom VPC
public and private subnets
an Internet Gateway and route tables
Security Groups
an EC2 Bastion Host
an Amazon DocumentDB cluster
AWS App Runner for backend and frontend services
a VPC Connector so App Runner can reach private resources
a Bedrock interface VPC endpoint for private model access

The point is not just “infrastructure as code.”

The point is this: you want the whole system to be reproducible, reviewable, secure, and easy to evolve.

That is what Terraform gives you when used properly.

#High-Level Architecture

Here is the mental model:

The frontend is public.
The backend is deployed with App Runner and sends its outbound traffic into the VPC.
The database lives in private subnets.
A bastion host gives controlled admin access.
Bedrock is reached privately through a VPC endpoint.
Terraform wires all of it together.

So the real story is not “deploying a chatbot.”

It is: building a production-shaped AI system where the app, the data plane, and the model access path are all explicitly designed.

#Repository Structure

A clean Terraform repo usually separates concerns by resource type.

.
├── provider.tf
├── variables.tf
├── terraform.tfvars
├── vpc.tf
├── security_groups.tf
├── documentdb.tf
├── data.tf
├── bastion_host.tf
├── app_runner_back.tf
└── outputs.tf

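This matters because Terraform files are not just “files.” Terraform itself loads every .tf file in the directory as one configuration, so the split is entirely for the people reading it. The file names are how you mark your infrastructure boundaries.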

A structure like this makes it easier to reason about:

networking
security
compute
database
deployment
outputs

That becomes much more valuable as the project grows.
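The tree lists a provider.tf that the walkthrough never shows. A minimal sketch of what it could contain follows; the version constraints and the aws_region variable are assumptions, not values from the original project. The default region matches the eu-central-1 availability zones used later.

```hcl
# provider.tf (sketch; version constraints and var.aws_region are assumptions)
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Hypothetical variable so the region is set in one place.
variable "aws_region" {
  description = "Region for all resources in this walkthrough."
  type        = string
  default     = "eu-central-1"
}

provider "aws" {
  region = var.aws_region
}
```

Pinning the provider version keeps a later terraform init from silently pulling a new major release with changed behavior.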

#Terraform Basics You Actually Need

Before touching the AWS resources, know these commands:

#`terraform init`

Initializes the working directory. It downloads the required providers and configures the backend where Terraform state is stored.

#`terraform plan`

Shows what Terraform is going to create, update, or destroy. This is your chance to catch mistakes before AWS catches your wallet.

#`terraform apply`

Executes the plan and provisions the resources.

#Terraform State

Terraform keeps track of what it created through a state file, usually terraform.tfstate.

That state is critical because Terraform does not “guess” infrastructure. It compares:

what your code says should exist
what the state, refreshed against the real AWS resources, says already exists

That comparison is what drives the plan.

#Resource Referencing

Terraform resources are referenced like this:

aws_vpc.chatbot_vpc.id

That means:

resource type: aws_vpc
local resource name: chatbot_vpc
attribute: id
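The same type.name.attribute addressing is what an outputs.tf would use to surface values after an apply. A small sketch, assuming the resource names used throughout this walkthrough; the output names themselves are illustrative.

```hcl
# outputs.tf (sketch; output names are illustrative)
output "vpc_id" {
  description = "ID of the chatbot VPC."
  value       = aws_vpc.chatbot_vpc.id
}

output "docdb_endpoint" {
  description = "Writer endpoint of the DocumentDB cluster."
  value       = aws_docdb_cluster.terraform_docdb.endpoint
}

output "bastion_public_ip" {
  description = "Public IP of the bastion host."
  value       = aws_instance.terraform_bastion_host.public_ip
}
```

Each reference also creates an implicit dependency, which is how Terraform works out the order in which to create resources.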

#Variables

Variables are typically:

declared in variables.tf
assigned in terraform.tfvars
referenced as var.some_name

That lets you separate logic from environment-specific values.
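As a rough illustration, the database credentials referenced later as var.master_username and var.master_password could be declared like this. The descriptions are assumptions; the variable names come from the DocumentDB section of this walkthrough.

```hcl
# variables.tf (sketch)
variable "master_username" {
  description = "Master user for the DocumentDB cluster."
  type        = string
}

variable "master_password" {
  description = "Master password for the DocumentDB cluster."
  type        = string
  sensitive   = true # redacted in plan and apply output, but still stored in state
}

# terraform.tfvars would then assign values, for example:
#   master_username = "chatbot_admin"
# The password can be supplied through the TF_VAR_master_password
# environment variable instead of being written to a file.
```

Marking a variable as sensitive only hides it from CLI output. It does not encrypt it in the state file, which is one reason remote, access-controlled state matters.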

1. VPC Network

This is the foundation.

If your network design is wrong, the rest of the architecture becomes a pile of workarounds.

#Create a Custom VPC

resource "aws_vpc" "chatbot_vpc" {
  cidr_block           = "172.16.0.0/16"
  instance_tenancy     = "default"
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = {
    Name = "terraform_chatbot_vpc"
  }
}

#Why it exists

A VPC is your private network boundary in AWS.

The CIDR block 172.16.0.0/16 defines the IP address space for everything inside that network.

#What matters

enable_dns_support = true lets resources resolve DNS names.
enable_dns_hostnames = true gives instances with public IPs public DNS names and, together with DNS support, is required for private DNS on the Bedrock interface endpoint later.
tagging is not optional if you want sane operations later.

#Public Subnet

resource "aws_subnet" "public_subnet_1" {
  vpc_id                  = aws_vpc.chatbot_vpc.id
  cidr_block              = "172.16.1.0/24"
  map_public_ip_on_launch = true
  availability_zone       = "eu-central-1a"

  tags = {
    Name = "terraform_chatbot_subnet_public1"
  }
}

#Why it exists

This is where resources that need direct internet reachability can live.

In this architecture, that mainly means the bastion host.

#Key detail

map_public_ip_on_launch = true means EC2 instances here automatically get public IPs.

#Private Subnets

resource "aws_subnet" "private_subnet_1" {
  vpc_id                  = aws_vpc.chatbot_vpc.id
  cidr_block              = "172.16.2.0/24"
  map_public_ip_on_launch = false
  availability_zone       = "eu-central-1a"

  tags = {
    Name = "terraform_chatbot_subnet_private1"
  }
}

resource "aws_subnet" "private_subnet_2" {
  vpc_id                  = aws_vpc.chatbot_vpc.id
  cidr_block              = "172.16.3.0/24"
  map_public_ip_on_launch = false
  availability_zone       = "eu-central-1b"

  tags = {
    Name = "terraform_chatbot_subnet_private2"
  }
}

#Why they exist

Private subnets are where you place resources that should not be directly exposed to the internet.

Here, that includes:

DocumentDB
the network interfaces of the App Runner VPC connector
Bedrock interface endpoint

#Why two subnets matter

Using subnets across different availability zones improves availability, and some managed services require it: a DocumentDB subnet group, for example, needs subnets in at least two availability zones.
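If you would rather not hardcode zone names and CIDRs, one possible variant derives both. This is a sketch of an alternative, not a drop-in replacement: it changes the resource addresses to aws_subnet.private["private1"] and aws_subnet.private["private2"], so every reference elsewhere would need updating. The local map and resource name are hypothetical.

```hcl
# Look up the zones that are currently available in the configured region.
data "aws_availability_zones" "available" {
  state = "available"
}

locals {
  # Hypothetical map: key => which zone to use and which /24 to carve out.
  private_subnets = {
    private1 = { az_index = 0, netnum = 2 } # 172.16.2.0/24
    private2 = { az_index = 1, netnum = 3 } # 172.16.3.0/24
  }
}

resource "aws_subnet" "private" {
  for_each = local.private_subnets

  vpc_id                  = aws_vpc.chatbot_vpc.id
  cidr_block              = cidrsubnet(aws_vpc.chatbot_vpc.cidr_block, 8, each.value.netnum)
  availability_zone       = data.aws_availability_zones.available.names[each.value.az_index]
  map_public_ip_on_launch = false

  tags = {
    Name = "terraform_chatbot_subnet_${each.key}"
  }
}
```

cidrsubnet adds 8 bits to the /16 VPC range, so netnum 2 and 3 reproduce the same /24 blocks as the hardcoded version.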

#Internet Gateway

resource "aws_internet_gateway" "internet_gw" {
  vpc_id = aws_vpc.chatbot_vpc.id

  tags = {
    Name = "terraform_internet_gw"
  }
}

#Why it exists

Without an Internet Gateway, your public subnet is not really public.

It gives resources in public subnets a path to the internet.

#Route Tables

#Public Route Table

resource "aws_route_table" "public_route_table" {
  vpc_id = aws_vpc.chatbot_vpc.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.internet_gw.id
  }

  tags = {
    Name = "terraform_public_rtb"
  }
}

This sends all traffic that is not local to the VPC to the Internet Gateway.

#Private Route Table

resource "aws_route_table" "private_route_table" {
  vpc_id = aws_vpc.chatbot_vpc.id

  tags = {
    Name = "terraform_private_rtb"
  }
}

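No default internet route means the private subnet stays private.

It also means nothing in those subnets can reach the internet on its own, because there is no NAT gateway either. That is exactly what you want for database infrastructure, and it is why Bedrock gets a VPC endpoint later.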

#Route Table Associations

resource "aws_route_table_association" "public_assoc" {
  subnet_id      = aws_subnet.public_subnet_1.id
  route_table_id = aws_route_table.public_route_table.id
}

resource "aws_route_table_association" "private_assoc_1" {
  subnet_id      = aws_subnet.private_subnet_1.id
  route_table_id = aws_route_table.private_route_table.id
}

resource "aws_route_table_association" "private_assoc_2" {
  subnet_id      = aws_subnet.private_subnet_2.id
  route_table_id = aws_route_table.private_route_table.id
}

This is the final wiring step.

Subnets do not magically inherit the right routes. A subnet without an explicit association falls back to the VPC's main route table, so you attach each one deliberately.

2. Security Groups

Now we define who is allowed to talk to what.

Security groups are not just firewall rules. They are part of the system design.

#DocumentDB Security Group

resource "aws_security_group" "documentdb_sg" {
  vpc_id      = aws_vpc.chatbot_vpc.id
  name        = "documentdb_security_group"
  description = "Security group for DocumentDB"

  ingress {
    from_port       = 27017
    to_port         = 27017
    protocol        = "tcp"
    security_groups = [aws_security_group.bastion_sg.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "terraform_documentdb_security_group"
  }
}

#Why this is good

Instead of allowing access from a broad IP range, it allows access only from a trusted security group.

That is a much better pattern than throwing CIDRs everywhere.

#Port `27017`

That is the MongoDB-compatible port used by DocumentDB.

#Bastion Host Security Group

resource "aws_security_group" "bastion_sg" {
  vpc_id      = aws_vpc.chatbot_vpc.id
  name        = "bastion_security_group"
  description = "Security group for Bastion host"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = var.bastion_host_cidr
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "terraform_bastion_security_group"
  }
}

#Why this matters

A bastion host is already a controlled entry point. If you open SSH to the whole world, you just created a public attack surface with a fancy name.

Restrict it to a known CIDR.
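One way to enforce that rule in code is to validate the variable itself. The sketch below assumes bastion_host_cidr is a list of strings, as the cidr_blocks argument requires; the validation rules are an addition, not part of the original setup. If the variable is already declared in variables.tf, merge the validation blocks into that declaration.

```hcl
variable "bastion_host_cidr" {
  description = "CIDR blocks allowed to SSH into the bastion host."
  type        = list(string)

  # Reject anything that does not parse as a CIDR block.
  validation {
    condition = alltrue([
      for cidr in var.bastion_host_cidr : can(cidrhost(cidr, 0))
    ])
    error_message = "Every entry must be a valid CIDR block, for example 203.0.113.10/32."
  }

  # Reject the open-to-the-world range outright.
  validation {
    condition     = !contains(var.bastion_host_cidr, "0.0.0.0/0")
    error_message = "Do not open SSH to 0.0.0.0/0; list specific admin CIDRs instead."
  }
}
```

With this in place, a careless tfvars entry fails at plan time instead of becoming a live security group rule.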

3. Provisioning Amazon DocumentDB

Now we add the database layer.

For a RAG system, this is where your application might store:

embeddings
session data
chat history
metadata

#DocumentDB Subnet Group

resource "aws_docdb_subnet_group" "terraform_docdb_subnet_group" {
  name_prefix = "terraform_document_db_subnet_group"
  subnet_ids  = [
    aws_subnet.private_subnet_1.id,
    aws_subnet.private_subnet_2.id
  ]
}

#Why it exists

Managed databases need to know where they are allowed to launch. This subnet group tells DocumentDB to stay inside the private subnets.

#DocumentDB Cluster

resource "aws_docdb_cluster" "terraform_docdb" {
  cluster_identifier_prefix = "terraform-docdb-cluster"
  db_subnet_group_name      = aws_docdb_subnet_group.terraform_docdb_subnet_group.name
  vpc_security_group_ids    = [aws_security_group.documentdb_sg.id]
  engine                    = "docdb"
  engine_version            = "5.0.0"
  master_username           = var.master_username
  master_password           = var.master_password
  storage_encrypted         = true
  skip_final_snapshot       = true
}

#What matters here

it lives in the private subnet group
it uses the locked-down DocumentDB security group
storage is encrypted
credentials come from variables

#Important production note

Using raw Terraform variables for secrets is acceptable for demos. For production, use a more secure pattern such as:

AWS Secrets Manager
SSM Parameter Store
external secret injection via CI/CD

Also, skip_final_snapshot = true is fine for experimentation, not for serious environments.
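A minimal sketch of the Secrets Manager pattern, assuming the hashicorp/random provider is available. The resource names and the secret name prefix are hypothetical. The cluster would then set master_password = random_password.docdb_master.result instead of reading a variable.

```hcl
# Requires the hashicorp/random provider alongside hashicorp/aws.
resource "random_password" "docdb_master" {
  length  = 32
  special = false # sidesteps characters such as / " @ that the database rejects
}

# Hypothetical secret name prefix.
resource "aws_secretsmanager_secret" "docdb_master" {
  name_prefix = "chatbot/docdb/master-"
}

resource "aws_secretsmanager_secret_version" "docdb_master" {
  secret_id = aws_secretsmanager_secret.docdb_master.id
  secret_string = jsonencode({
    username = var.master_username
    password = random_password.docdb_master.result
  })
}
```

This removes the password from tfvars files and CI variables, but the generated value is still recorded in Terraform state, so the state itself must be stored and accessed as a secret.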

#DocumentDB Cluster Instance

resource "aws_docdb_cluster_instance" "terraform_cluster_instances" {
  identifier         = "terraform-docdb-cluster-instance"
  cluster_identifier = aws_docdb_cluster.terraform_docdb.id
  instance_class     = "db.t3.medium"
}

The cluster defines the shared storage volume, the endpoints, and the settings. The cluster instance is the actual compute capacity running it; without at least one instance, there is nothing to serve queries.

4. Bastion Host (EC2)

The bastion host is your controlled admin bridge into the private side of the system.

resource "aws_instance" "terraform_bastion_host" {
  ami                    = data.aws_ami.ubuntu_ami.id
  instance_type          = "t2.small"
  key_name               = aws_key_pair.ssh_bastion.key_name
  vpc_security_group_ids = [aws_security_group.bastion_sg.id]
  subnet_id              = aws_subnet.public_subnet_1.id

  user_data = <<-EOF
    #!/bin/bash
    apt-get update
    apt-get install -y gnupg curl wget
  EOF

  tags = {
    Name = "Terraform Bastion Host"
  }
}

#Why it exists

Because your DocumentDB cluster is private. You do not SSH into the database. You SSH into the bastion, then access private resources from there.

That is the correct pattern.
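To make that access path concrete, an output can render the SSH port-forward command for you. This is a sketch: the output name is hypothetical, the key path is a placeholder you fill in, and ubuntu is the default login user on Ubuntu AMIs.

```hcl
output "docdb_tunnel_command" {
  description = "SSH local port forward from your machine to DocumentDB through the bastion."
  value = format(
    "ssh -i <path-to-private-key> -N -L 27017:%s:27017 ubuntu@%s",
    aws_docdb_cluster.terraform_docdb.endpoint,
    aws_instance.terraform_bastion_host.public_ip
  )
}
```

After an apply, terraform output docdb_tunnel_command prints a command that forwards local port 27017 to the cluster, so a client on your machine connects to localhost while the database stays private.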

#SSH Key Pair

resource "aws_key_pair" "ssh_bastion" {
  key_name   = "ssh-key"
  public_key = file("id_bastion_bayard.pub")
}

#Important rule

Never commit private keys. Ever.

#Dynamic AMI Selection

Hardcoding an AMI ID is annoying and brittle. Use a data source instead.

data "aws_ami" "ubuntu_ami" {
  most_recent = true
  owners      = ["099720109477"]

  filter {
    name   = "architecture"
    values = ["x86_64"]
  }

  filter {
    name   = "root-device-type"
    values = ["ebs"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
  }

  filter {
    name   = "state"
    values = ["available"]
  }
}

#Why this is better

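This keeps the bastion image current without editing Terraform every time AMIs change. The trade-off: when a newer AMI matches the filters, the next plan will propose replacing the instance. If that is unwelcome, pin the image or add lifecycle { ignore_changes = [ami] } to the instance.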

That is exactly the kind of small decision that separates “toy infra” from something maintainable.

5. App Runner, ECR, and IAM

Now we move from infrastructure foundation to application deployment.

The application has two parts:

backend
frontend

The images live in ECR. The services run in App Runner.

#ECR Image URIs

You will reference image URIs like:

<account-id>.dkr.ecr.<region>.amazonaws.com/chatbot-backend:latest
<account-id>.dkr.ecr.<region>.amazonaws.com/chatbot-frontend:latest

Those are what App Runner pulls when it deploys.

#IAM Role for App Runner

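App Runner uses IAM roles for two different jobs: an access role lets the service pull images from ECR, and an instance role gives the running container its AWS permissions. This walkthrough reuses one role for both, so it needs to: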

pull images from ECR
access AWS services at runtime
possibly call Bedrock

#Trust Policy

resource "aws_iam_role" "app_runner_build_role" {
  name = "app-runner-build-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Service = [
            "build.apprunner.amazonaws.com",
            "tasks.apprunner.amazonaws.com"
          ]
        }
        Action = "sts:AssumeRole"
      }
    ]
  })
}

#Why this matters

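This does not grant permissions yet. It only says which AWS services are allowed to assume the role: build.apprunner.amazonaws.com when App Runner pulls the image, and tasks.apprunner.amazonaws.com when the running container needs AWS credentials.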

That is what a trust policy does.
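Sharing one role across both principals works, but it means the image-pull identity and the runtime identity carry the same permissions. A sketch of the split version follows; the role names are hypothetical and this is a variant, not what the original setup does.

```hcl
# Access role: assumed by App Runner only to pull the image from ECR.
resource "aws_iam_role" "app_runner_access_role" {
  name = "app-runner-access-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "build.apprunner.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# Instance role: assumed by the running container for its own AWS calls.
resource "aws_iam_role" "app_runner_instance_role" {
  name = "app-runner-instance-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "tasks.apprunner.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}
```

With this split, the ECR read policy attaches only to the access role, and Bedrock permissions attach only to the instance role. The service would then set access_role_arn and instance_role_arn to the two different ARNs.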

#Attach Policies

#ECR Read Access

resource "aws_iam_role_policy_attachment" "ecr_readonly_attachment" {
  role       = aws_iam_role.app_runner_build_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
}

#Bedrock / DocumentDB Notes

The original article attaches broad managed policies like AmazonDocDBFullAccess and AmazonBedrockFullAccess. That works for demos, but it is too broad for good production security.

A better practice is:

minimal ECR pull permissions
narrow Bedrock invocation permissions
avoid unnecessary full-access managed policies
prefer application-level auth and network control for database access

This is one of the biggest differences between “it deploys” and “it is production-grade.”
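A sketch of what "narrow Bedrock invocation permissions" could look like, attached to the role used in this walkthrough. The bedrock_model_arns variable and the policy name are assumptions; you supply the ARNs of the models your backend actually calls.

```hcl
# Hypothetical variable: the specific models the backend is allowed to call.
variable "bedrock_model_arns" {
  description = "ARNs of the Bedrock models the backend may invoke."
  type        = list(string)
}

data "aws_iam_policy_document" "bedrock_invoke" {
  statement {
    sid    = "InvokeSelectedModels"
    effect = "Allow"
    actions = [
      "bedrock:InvokeModel",
      "bedrock:InvokeModelWithResponseStream",
    ]
    resources = var.bedrock_model_arns
  }
}

# Inline policy on the runtime role instead of AmazonBedrockFullAccess.
resource "aws_iam_role_policy" "bedrock_invoke" {
  name   = "bedrock-invoke-selected-models"
  role   = aws_iam_role.app_runner_build_role.id
  policy = data.aws_iam_policy_document.bedrock_invoke.json
}
```

The role can invoke the listed models and nothing else in Bedrock: no model management, no customization jobs, no access to other models.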

6. App Runner Networking: Security Groups and VPC Connector

This is where the architecture becomes interesting.

By default, App Runner is not magically inside your private VPC world. If your database is private, you need a VPC connector.

#App Runner Security Group

resource "aws_security_group" "apprunner_sg" {
  name_prefix = "terraform-apprunner-sg-"
  vpc_id      = aws_vpc.chatbot_vpc.id

  egress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "terraform_apprunner_security_group"
  }
}

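This allows HTTPS egress, which is needed for service communication such as Bedrock access. Because the connector sits in private subnets with no internet route, that traffic can only reach targets inside the VPC, such as the Bedrock interface endpoint defined later.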

#DocumentDB ↔ App Runner Rules

resource "aws_security_group_rule" "documentdb_to_apprunner" {
  type                     = "ingress"
  from_port                = 27017
  to_port                  = 27017
  protocol                 = "tcp"
  security_group_id        = aws_security_group.documentdb_sg.id
  source_security_group_id = aws_security_group.apprunner_sg.id
}

resource "aws_security_group_rule" "apprunner_to_documentdb" {
  type                     = "egress"
  from_port                = 27017
  to_port                  = 27017
  protocol                 = "tcp"
  security_group_id        = aws_security_group.apprunner_sg.id
  source_security_group_id = aws_security_group.documentdb_sg.id
}

#Why this is important

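This is explicit private communication.

The backend can reach the database. The database accepts connections from only two security groups: the bastion and the backend.

That is exactly the pattern you want.

One caveat: the Terraform AWS provider documentation warns against combining inline ingress and egress blocks with standalone aws_security_group_rule resources on the same security group, because the two can overwrite each other. The standalone rules here avoid a reference cycle between the two groups. If you see rules flapping between applies, move the inline rules of both groups into standalone rule resources as well.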

#VPC Connector

resource "aws_apprunner_vpc_connector" "connector" {
  vpc_connector_name = "terraform-app-runner-connector"

  subnets = [
    aws_subnet.private_subnet_1.id,
    aws_subnet.private_subnet_2.id
  ]

  security_groups = [
    aws_security_group.apprunner_sg.id
  ]
}

#Why it exists

Without the VPC connector, App Runner cannot privately reach resources in your VPC.

This is the bridge between your managed container service and your internal data plane.

7. Deploying the Backend with App Runner

Now we define the backend service itself.

resource "aws_apprunner_service" "terraform_app_runner_backend" {
  service_name = "terraform-app-runner-backend"

  source_configuration {
    image_repository {
      image_configuration {
        port = "8000"
        runtime_environment_variables = {
          MONGODB_HOST     = aws_docdb_cluster.terraform_docdb.endpoint
          MONGODB_PORT     = var.docdb_port
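          # These must match the credentials the cluster was created with
          # (var.master_username / var.master_password in the DocumentDB cluster).
          MONGODB_PASSWORD = var.docdb_password
          MONGODB_USERNAME = var.docdb_username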
        }
      }
      image_identifier      = var.image_url
      image_repository_type = "ECR"
    }

    authentication_configuration {
      access_role_arn = aws_iam_role.app_runner_build_role.arn
    }

    auto_deployments_enabled = true
  }

  network_configuration {
    ingress_configuration {
      is_publicly_accessible = true
    }

    egress_configuration {
      egress_type       = "VPC"
      vpc_connector_arn = aws_apprunner_vpc_connector.connector.arn
    }
  }

  instance_configuration {
    cpu               = "1 vCPU"
    memory            = "2 GB"
    instance_role_arn = aws_iam_role.app_runner_build_role.arn
  }

  tags = {
    Name = "terraform-apprunner-service-backend"
  }
}

#What this really does

pulls your backend image from ECR
exposes port 8000
injects runtime environment variables
enables automatic redeploy on image updates
sends outbound traffic through the VPC connector
gives the service an IAM role at runtime

#Important idea

This is not “just deployment.” This is where your application runtime becomes part of your infrastructure model.
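Passing the database password through runtime_environment_variables leaves it readable in the service configuration. The App Runner resource also accepts runtime_environment_secrets, which takes Secrets Manager or SSM Parameter Store ARNs instead of values. The sketch below is a trimmed variant of the backend service, not a drop-in replacement: the resource name, the docdb_password_secret_arn variable, and the policy name are hypothetical, and it assumes the secret stores the password as a plain string.

```hcl
# Hypothetical variable: ARN of an existing Secrets Manager secret.
variable "docdb_password_secret_arn" {
  description = "ARN of a Secrets Manager secret holding the DocumentDB password."
  type        = string
}

# The instance role must be allowed to read that one secret.
resource "aws_iam_role_policy" "read_docdb_secret" {
  name = "read-docdb-password-secret"
  role = aws_iam_role.app_runner_build_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "secretsmanager:GetSecretValue"
      Resource = var.docdb_password_secret_arn
    }]
  })
}

resource "aws_apprunner_service" "backend_with_secret" {
  service_name = "terraform-app-runner-backend"

  source_configuration {
    image_repository {
      image_configuration {
        port = "8000"
        runtime_environment_variables = {
          MONGODB_HOST     = aws_docdb_cluster.terraform_docdb.endpoint
          MONGODB_PORT     = var.docdb_port
          MONGODB_USERNAME = var.docdb_username
        }
        # Key is the env var name; value is the secret ARN, not the secret itself.
        runtime_environment_secrets = {
          MONGODB_PASSWORD = var.docdb_password_secret_arn
        }
      }
      image_identifier      = var.image_url
      image_repository_type = "ECR"
    }

    authentication_configuration {
      access_role_arn = aws_iam_role.app_runner_build_role.arn
    }
  }

  network_configuration {
    egress_configuration {
      egress_type       = "VPC"
      vpc_connector_arn = aws_apprunner_vpc_connector.connector.arn
    }
  }

  instance_configuration {
    instance_role_arn = aws_iam_role.app_runner_build_role.arn
  }
}
```

The application code does not change: it still reads MONGODB_PASSWORD from the environment. What changes is that the value no longer appears in the Terraform configuration or the App Runner console.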

8. Private Bedrock Access with a VPC Endpoint

If your backend needs Bedrock, you do not want model calls bouncing through the public internet if you can avoid it.

Use an interface VPC endpoint.

#Bedrock Endpoint Security Group

resource "aws_security_group" "bedrock_endpoint_sg" {
  name_prefix = "bedrock-endpoint-sg-"
  vpc_id      = aws_vpc.chatbot_vpc.id

  ingress {
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    security_groups = [aws_security_group.apprunner_sg.id]
  }

  tags = {
    Name = "Bedrock Endpoint Security Group"
  }
}

Only resources attached to the App Runner security group, in practice the VPC connector, can talk to the endpoint over HTTPS.

#Bedrock Interface Endpoint

resource "aws_vpc_endpoint" "terraform_bedrock_endpoint" {
  vpc_id            = aws_vpc.chatbot_vpc.id
  service_name      = "com.amazonaws.eu-central-1.bedrock-runtime"
  vpc_endpoint_type = "Interface"

  security_group_ids = [
    aws_security_group.bedrock_endpoint_sg.id
  ]

  private_dns_enabled = true

  subnet_ids = [
    aws_subnet.private_subnet_1.id,
    aws_subnet.private_subnet_2.id
  ]
}

#Why this is powerful

With private DNS enabled, your backend can resolve the normal Bedrock endpoint name and still stay inside private AWS networking.

That is cleaner and more secure than exposing unnecessary public paths.

#Endpoint Policy

resource "aws_vpc_endpoint_policy" "terraform_bedrock_endpoint_policy" {
  vpc_endpoint_id = aws_vpc_endpoint.terraform_bedrock_endpoint.id
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect = "Allow",
        Principal = "*",
        Action = [
          "bedrock:InvokeModel",
          "bedrock:InvokeModelWithResponseStream"
        ],
        Resource = "*"
      }
    ]
  })
}

#Production note

This is okay for a learning setup. In a real environment, you would tighten both the principals and the allowed resources.
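A tightened version might look like the sketch below, which would replace the permissive policy rather than sit next to it, since an endpoint carries one policy. The resource name and the allowed_bedrock_model_arns variable are hypothetical. The principal is restricted with a condition on aws:PrincipalArn, which matches the role behind an assumed-role session.

```hcl
# Hypothetical variable: the only models reachable through this endpoint.
variable "allowed_bedrock_model_arns" {
  description = "Bedrock model ARNs that may be invoked through the endpoint."
  type        = list(string)
}

resource "aws_vpc_endpoint_policy" "bedrock_endpoint_policy_tight" {
  vpc_endpoint_id = aws_vpc_endpoint.terraform_bedrock_endpoint.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = "*"
      Action = [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ]
      Resource = var.allowed_bedrock_model_arns
      # Only requests made with the App Runner runtime role pass the endpoint.
      Condition = {
        ArnEquals = {
          "aws:PrincipalArn" = aws_iam_role.app_runner_build_role.arn
        }
      }
    }]
  })
}
```

The endpoint policy and the role's IAM policy are evaluated together, so a request must be allowed by both. Tightening the endpoint gives you a second, network-level guard even if an IAM policy is later loosened by mistake.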

9. What This Architecture Gets Right

This setup is strong because it separates concerns well.

#1. The database is private

That is the correct default.

#2. Admin access is controlled

The bastion host is a deliberate entry point, not accidental exposure.

#3. The app runtime is managed

App Runner gives you deployment convenience without manually managing EC2 for the app layer.

#4. Model access is treated like infrastructure

Bedrock is not just “an API call.” It is part of the network and security design.

#5. Terraform makes the whole thing reproducible

That means:

easier collaboration
easier review
easier environment replication
less console drift

10. What I Would Improve Further

If this were evolving from a demo into a stronger production system, I would improve these areas first.

#Secrets Management

Do not keep long-lived database credentials in plain Terraform variables if you can avoid it. Use Secrets Manager or Parameter Store.

#IAM Tightening

Avoid broad managed policies like full access unless absolutely necessary. Use least privilege.

#Remote State

Do not keep Terraform state only locally for collaborative environments. Use remote state with locking, such as:

S3 for state storage
DynamoDB for state locking
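A sketch of that backend configuration follows. The bucket and table names are hypothetical and both must exist before terraform init. The DynamoDB table needs a string partition key named LockID.

```hcl
terraform {
  backend "s3" {
    bucket         = "example-chatbot-terraform-state" # hypothetical bucket name
    key            = "rag-chatbot/terraform.tfstate"   # path of the state object
    region         = "eu-central-1"
    encrypt        = true                              # server-side encryption for state
    dynamodb_table = "example-terraform-locks"         # hypothetical lock table
  }
}
```

Backend blocks cannot reference variables, so these values are literals or come from -backend-config at init time. Recent Terraform releases also offer S3-native locking through the use_lockfile argument; check the S3 backend documentation for the version you run before choosing between the two.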

#Observability

Add logs, metrics, and tracing around:

backend latency
Bedrock calls
database performance
deployment events
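As one small starting point for the database side, a CloudWatch alarm on cluster CPU can be declared next to the cluster. This is a sketch: the alarm name, the threshold, and the alarm_topic_arns variable are assumptions to adjust.

```hcl
# Hypothetical variable: SNS topics to notify; empty means the alarm only changes state.
variable "alarm_topic_arns" {
  description = "SNS topic ARNs notified when an alarm fires."
  type        = list(string)
  default     = []
}

resource "aws_cloudwatch_metric_alarm" "docdb_cpu_high" {
  alarm_name          = "terraform-docdb-cpu-high"
  alarm_description   = "DocumentDB cluster CPU above 80% for 15 minutes."
  namespace           = "AWS/DocDB"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300 # seconds per datapoint
  evaluation_periods  = 3   # three consecutive datapoints
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80

  dimensions = {
    DBClusterIdentifier = aws_docdb_cluster.terraform_docdb.cluster_identifier
  }

  alarm_actions = var.alarm_topic_arns
}
```

Keeping alarms in the same Terraform configuration as the resources they watch means a new environment comes up with its monitoring already attached.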

#Environment Separation

Create separate environments for:

dev
staging
production

And keep variables, state, and naming consistent across them.
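One lightweight way to keep naming consistent is to derive names and tags from a single environment variable. The variable, the locals, and the tag keys below are all illustrative.

```hcl
variable "environment" {
  description = "Deployment environment."
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be one of: dev, staging, production."
  }
}

locals {
  name_prefix = "chatbot-${var.environment}"

  common_tags = {
    Project     = "rag-chatbot"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# Example use on a resource from this walkthrough:
#   tags = merge(local.common_tags, { Name = "${local.name_prefix}-vpc" })
```

Each environment then gets its own tfvars file, passed with terraform apply -var-file=staging.tfvars, and its own state location so that a mistake in dev cannot touch production.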

#CI/CD

Auto-deploy from ECR is useful, but you usually want a clearer pipeline that controls:

build
test
scan
release
promotion across environments

11. The Real Lesson

The biggest lesson from this architecture is simple:

a RAG app is not a prompt plus a model.

It is a system.

And once you treat it like a system, you start caring about the things that actually make AI applications real:

secure data access
network boundaries
deployment flows
secrets
observability
reproducibility
controlled model connectivity

That is the difference between AI demos and AI engineering.

Final Wrap-Up

This Terraform setup is a solid example of how to move from manual AWS clicks to a repeatable, infrastructure-as-code workflow for a RAG chatbot.

At a high level, it gives you:

a private database layer
a controlled operational access path
managed backend deployment
private connectivity to Bedrock
reproducible AWS infrastructure in code

And that is exactly the kind of architecture worth studying.

Because the model is only one part.

The system is the product.

#Takeaway

If you want to build real AI products, study this stack less like “Terraform syntax” and more like “system boundaries.”

Ask yourself:

What is public?
What must stay private?
Who is allowed to talk to what?
Where do secrets live?
How does the app reach the model?
How do I reproduce all of this safely?

That is how you stop building AI demos.

That is how you start building AI systems.
