CTF infrastructure deployment: CTFd, kCTF and Docker isolation from 50 to 2000 participants

Depov · Jun 16, 2026

DownUnderCTF 2023 served more than 2000 teams across 68 challenges, withstood a peak of 32,100 requests per second and cost $876 AUD over two weeks on Google Cloud, including a full test environment. During the competition, 4579 isolated challenge instances were created (data from the organizers' public writeup). In the same period, I stood up the infrastructure for a university CTF for 120 people: one VPS for 2000 rubles, CTFd in Docker Compose, Nginx in front. Both events ran stably. The difference is not in the "correctness" of the stack but in knowing exactly when one server is enough and when you cannot survive without Kubernetes.

There are almost no Russian-language materials on the technical side of CTF deployment. There are descriptions of network topologies on ESXi and guides to organizing events, but no step-by-step instructions for installing and configuring CTFd, setting up kCTF with nsjail isolation, or monitoring under real load. Let's close the gap.
CTFd or kCTF: what to choose for deploying a CTF platform
The first mistake novice organizers make is confusing the management platform (scoreboard) with the task environment (challenge runtime). CTFd solves the first problem: it shows tasks, accepts flags and maintains the ranking. kCTF solves the second: it deploys task containers with isolation between participants. They are not competitors; they work as a pair. DownUnderCTF uses both at the same time.

CTFd is an open-source web platform for Jeopardy-style CTFs. The project is alive: the CTFd/CTFd repository on GitHub has thousands of stars and regular releases. DownUnderCTF 2023 used version 3.6.0 and noted a marked improvement in stability compared to previous releases. There is also a commercial SaaS option, Hosted CTFd, with automatic challenge deployment via a Docker registry.

kCTF is a Kubernetes-based template from Google for deploying CTF tasks with isolation through nsjail. The documentation describes it as a template for deploying tasks on Kubernetes that uses nsjail for isolation between players. There is no scoreboard here: for the frontend you still need CTFd or an equivalent. The project is maintained by Google and used in Google CTF.
Selection criteria and limitations of each approach
The key question: do you need per-connection isolation between participants? If a task involves RCE (pwn, web with a shell), a participant can damage the environment for others: delete the flag, kill the process, or try to break out of the container (Escape to Host, T1611, Privilege Escalation tactic in MITRE ATT&CK).

CTFd + Docker on a VPS does not hold up when there are more than 500 participants and the tasks are CPU/RAM-hungry, when you need automatic restarts of crashed instances, or when there are tasks with full RCE in the pwn category.

kCTF + GKE is overkill for a university event of 50-100 people (the Kubernetes overhead is not justified), when the team has no experience with kubectl and Terraform (learning will take longer than running the CTF), or when the tasks are static: OSINT, crypto without a server side.

According to the kCTF documentation, when using GKE you need to watch the GCP project quotas: each task with a Load Balancer consumes an external IP, and with 30+ tasks the default quotas can be exhausted. Deploy kCTF in advance to make sure the quotas are sufficient and to build up the account's reputation so that quota requests are approved automatically.
CTFd Docker Compose: installation and configuration
Environment requirements
• OS: Ubuntu 22.04+ or Debian 12+ (recommended), CentOS 8+, or any GNU/Linux with Docker Engine
• RAM: 2 GB minimum for CTFd + MySQL + Redis (up to 100 participants); 4-8 GB recommended for 200-500 participants
• CPU: 2 vCPUs minimum; 4 vCPUs recommended for more than 200 participants
• Disk: 20 GB SSD minimum (task files + MySQL database)
• Network: public IP, open ports 80/443, domain with a DNS record
• Dependencies: Docker Engine 24+, Docker Compose v2, certbot for TLS
CTFd ships with a ready-made docker-compose.yml. The basic configuration has three services: the application, MySQL, and Redis for caching. Clone: git clone https://github.com/CTFd/CTFd.git && cd CTFd.

Before launch: SECRET_KEY. This is the cryptographic key for sessions. If the variable is not set, CTFd generates the key automatically and stores it in the volume (.data). To set it explicitly, pass SECRET_KEY through the environment section in docker-compose.yml. You can generate one like this: python3 -c "import os; print(os.urandom(32).hex())". DATABASE_URL and REDIS_URL are also set through environment (they are already configured in the default compose file).
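The one-liner above prints a key to the terminal. As a minimal sketch, the same idea can write the key straight into a .env file, which Docker Compose reads for ${SECRET_KEY} substitution when the file sits next to docker-compose.yml. The function names and the choice to refuse overwriting an existing file are assumptions made for this example, not something CTFd requires.

```python
import os
import secrets
from pathlib import Path


def make_env_line(nbytes: int = 32) -> str:
    """Return a SECRET_KEY=<hex> line backed by the OS CSPRNG."""
    return f"SECRET_KEY={secrets.token_hex(nbytes)}"


def write_env_file(path: str = ".env") -> Path:
    """Create the file with owner-only permissions.

    O_EXCL makes the call fail if the file already exists, so a key that is
    already in use is never silently replaced (that would drop all sessions).
    """
    target = Path(path)
    fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(make_env_line() + "\n")
    return target
```

Calling write_env_file() once in the CTFd directory before docker compose up -d is enough; a second call raises FileExistsError instead of rotating the key.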

YAML:
# docker-compose.yml (fragment: the db and cache services are omitted here)
services:
  ctfd:
    image: ctfd/ctfd:latest
    ports:
      - "8000:8000"
    environment:
      - SECRET_KEY=${SECRET_KEY}
      - DATABASE_URL=mysql+pymysql://ctfd:ctfd@db/ctfd
      - REDIS_URL=redis://cache:6379
    depends_on:
      - db
      - cache
    restart: always

Launch: docker compose up -d. CTFd is available on port 8000. The first visit opens the setup wizard: competition name, administrator account, mode (team/user).

DownUnderCTF, at a scale of 2000+ teams, abandoned the containerized database in favor of CloudSQL MySQL and Cloud Redis, a reasonable step under load. For events of up to 500 participants, containerized MySQL and Redis in the same compose file work without problems. I ran a CTF for 300 people with this configuration without a single database incident.
Reverse proxy, TLS and CTFd plugin setup
Exposing CTFd without TLS is a direct path to session interception. An Nginx or Traefik front end covers three tasks: TLS termination, rate limiting, static caching.

Nginx configuration: proxy_pass http://127.0.0.1:8000 with the X-Forwarded-For, X-Forwarded-Proto and Host headers passed through. TLS through Let's Encrypt: certbot --nginx -d ctf.example.com. In CTFd, enable REVERSE_PROXY=true through an environment variable or in config.py (the value is processed through Werkzeug's ProxyFix). Without this, the platform will not see participants' real IPs, and per-IP rate limiting turns into a fiction.

Rate limiting is not an option but a necessity. According to DownUnderCTF 2023, in the middle of the competition there was a load spike, effectively a DoS from the participants. The cluster autoscaled and withstood it, but the load returned to normal only after rate limits were enabled on Cloudflare. Nginx configuration: limit_req_zone $binary_remote_addr zone=ctfd:10m rate=10r/s; in the http block and limit_req zone=ctfd burst=20 nodelay; in location /. As a public-facing application (Exploit Public-Facing Application, T1190, Initial Access), CTFd must be protected from automated brute forcing and flood attacks (T1499.002, Service Exhaustion Flood).
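What rate=10r/s with burst=20 nodelay means in practice is easy to misread, so here is a simplified model of the accounting for a single client IP. It is a sketch of the documented behavior, not Nginx code: the class name and the use of seconds as floats are assumptions made for the example.

```python
class LimitReqModel:
    """Simplified per-IP model of Nginx limit_req with the nodelay option."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.burst = burst
        self.excess = 0.0   # requests currently "owed" above the steady rate
        self.last = None    # timestamp of the last accepted request

    def allow(self, now: float) -> bool:
        if self.last is None:          # first request from this IP
            self.last = now
            return True
        # The bucket drains at `rate` per second; each request adds one unit.
        excess = max(0.0, self.excess - self.rate * (now - self.last) + 1.0)
        if excess > self.burst:
            return False               # Nginx would reject (503 by default)
        self.excess = excess
        self.last = now
        return True


limiter = LimitReqModel(rate_per_s=10, burst=20)
accepted_at_once = sum(limiter.allow(0.0) for _ in range(30))
accepted_one_second_later = sum(limiter.allow(1.0) for _ in range(30))
```

In this model a client that fires 30 requests at the same instant gets 21 through (one plus the burst of 20), and a second later only 10 more slots have drained. That is why the burst value, not only the rate, decides how a scoreboard refresh storm is handled.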

Access to the CTFd admin panel (NIST CSF PR.AA-01, management of identities and credentials for authorized users): restrict /admin by IP via Nginx allow/deny or through a VPN. Do not be lazy about this: at a real CTF I saw participants trying to brute-force the admin panel.

On CTFd plugins: dynamic scoring (decay scoring) is built into CTFd 3.x; a first blood bonus and other custom plugins are installed by copying them into the CTFd/plugins/ directory and restarting the container with docker compose restart ctfd.
kCTF: Kubernetes CTF platform and nsjail isolation
kCTF architecture: three layers of isolation
kCTF is built on three levels, and understanding each one is critical for a correct deployment:

Level 1: nsjail. A sandbox that runs in LISTEN mode and creates a separate environment for each TCP connection. According to the kCTF documentation, nsjail uses Linux user namespaces and requires explicitly listing the files available inside the sandbox. Each connection is isolated: a participant with RCE cannot affect the others. A bonus is support for Proof of Work (PoW): a computational puzzle before the connection, which can be enabled on the fly when abuse is detected (Compute Hijacking, T1496.001 in ATT&CK, for example when someone decides to mine on your infrastructure).

Code:
# nsjail.cfg
mode: LISTEN
port: 1337
time_limit: 30
rlimit_as_type: SOFT
rlimit_cpu_type: SOFT
rlimit_nproc: 1
mount { src: "/chroot" dst: "/" is_bind: true }
mount { src: "/dev/null" dst: "/dev/null" is_bind: true }
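The Proof of Work idea mentioned for Level 1 can be illustrated with a generic hashcash-style puzzle. kCTF ships its own PoW implementation, which this sketch does not reproduce; the SHA-256 construction, the function names and the difficulty value are all assumptions chosen to show the principle: solving is expensive for the client, checking is one hash for the server.

```python
import hashlib
import itertools
import os


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash decides whether the work was done."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force a nonce, about 2**difficulty hashes on average."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty):
            return nonce


challenge = os.urandom(16)            # fresh random challenge per connection
nonce = solve(challenge, difficulty=12)
```

Raising the difficulty by one bit doubles the expected client work while the server cost stays constant, which is what makes it usable as a throttle that can be switched on during an event.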

Level 2: Docker container. nsjail runs inside Docker. Each kCTF Pod has two containers: challenge (the task itself) and healthcheck. The healthcheck is a separate container that runs the exploit script and verifies that the task is solvable. For pwn tasks pwntools is often used, although kCTF is not tied to a specific tool: any solve script will do. If the healthcheck fails, Kubernetes automatically restarts the Pod. Participants' traffic will not reach a broken task.
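Since any solve script will do, the core of a healthcheck can be as small as the sketch below, written with the standard library only. It shows just the checking logic, not how kCTF wires the healthcheck container into the Pod; the host, port, payload and the CTF{ flag prefix are hypothetical placeholders.

```python
import socket


def flag_in_response(conn: socket.socket, payload: bytes,
                     flag_prefix: bytes = b"CTF{") -> bool:
    """Send the exploit payload and read until the flag prefix appears or the peer closes."""
    data = b""
    try:
        conn.sendall(payload)
        while flag_prefix not in data:
            chunk = conn.recv(4096)
            if not chunk:          # peer closed the connection
                break
            data += chunk
    except OSError:                # timeouts and resets count as "not solved"
        pass
    return flag_prefix in data


def challenge_is_solvable(host: str, port: int, payload: bytes,
                          timeout: float = 5.0) -> bool:
    """Connect to the task and report whether the payload still yields a flag."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            return flag_in_response(conn, payload)
    except OSError:                # refused or unreachable: the task is down
        return False
```

A wrapper script would call challenge_is_solvable("127.0.0.1", 1337, b"...") and exit non-zero on False; a real pwn task would replace the single payload with a full exploit, typically via pwntools.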

Level 3: Kubernetes Deployment. A template for creating Pods with a given number of replicas and CPU/RAM limits. Without a Load Balancer, tasks are not reachable from outside, and this is an advantage: you can deploy and test a task before publishing it.
Deploying and managing the cluster
Creating the cluster: kctf cluster create --domain-name ctf.example.com. According to the kCTF documentation, when using a subdomain you need NS records pointing to Google DNS so that tasks are reachable at addresses of the form *.subdomain.example.com.

For production, move at least part of the nodes to regular VMs: kctf cluster resize --min-nodes 1 --max-nodes 3 --num-nodes 1 --machine-type n2-standard-4 (without the spot flag). Spot nodes (which replaced the obsolete Preemptible VMs in GKE) are good as a reserve for peaks: they are 60-91% cheaper than regular ones but can be preempted at any time. Imagine: the CTF final, the last hour, and GCP takes your nodes away. Don't do that.

Deploying a task: kctf chal create mychal && kctf chal start. Verification: kctf chal status shows the state of the Deployment and the healthcheck result. If the healthcheck fails, participants see dropped connections; this is the first thing to check when someone complains about an "unstable task".

To align with the Kubernetes STIG (DISA): limit RBAC to the minimum necessary rights, do not expose the API server to participants, and apply Network Policies to segment traffic between task Pods. Tasks should not see each other over the network; otherwise you get a chain of compromise through lateral movement between Pods.
Docker isolation of CTF tasks: shared and instanced models
Shared model: all participants connect to one instance. Suitable for stateless tasks: a crypto server, a web task without storage, protocol analysis tasks. One docker run, one port, minimal resources.

Instanced model: each participant receives a separate instance. Necessary for tasks with RCE, file write or configuration modification, that is, anything that lets a participant delete the flag, kill the process or break the task for everyone else.

DownUnderCTF created 4579 isolated instances, most of them for beginner-level tasks. This is an important detail: it is novice participants who most often break shared instances unintentionally. Someone gets a shell and runs rm -rf / "as an experiment": sound familiar?

For instanced tasks without kCTF there is Klodd, an integration for rCTF that deploys tasks on Kubernetes at the participant's request. It requires a GKE cluster. An alternative is tiny-instancer for smaller scales. When working on one host without a cluster, each task is launched through docker run --memory=256m --cpus=0.5 --pids-limit=50 --read-only with tmpfs for writable directories.
Protection against container escape in CTF infrastructure
Container escape (Escape to Host, T1611) is a real threat, not a theoretical one. In CTF infrastructure, participants deliberately look for vulnerabilities, and the sandbox is a legitimate target. Related MITRE ATT&CK techniques: Container Administration Command (T1609) when the Docker socket is accessible, Build Image on Host (T1612) when images can be built, Container and Resource Discovery (T1613) for reconnaissance inside the container.

Protection checklist (tested in practice):
1. Docker socket (/var/run/docker.sock): never mount it into a task container. It gives full control of the host. Mounting docker.sock is the most common mistake I see from novice organizers
2. --security-opt=no-new-privileges: prohibits privilege escalation inside the container
3. Metadata endpoint: block 169.254.169.254 through iptables at the host level. The rCTF documentation directly warns that access to metadata on AWS/GCP exposes cloud account credentials
4. Separate hosts: CTFd and the tasks must not share one VPS. The rule of thumb: avoid reusing the same VPS as the CTF platform
5. Resource limits: --memory, --cpus and --pids-limit prevent fork bombs and exhaustion of host resources (Service Exhaustion Flood, T1499.002)
6. Read-only filesystem: --read-only with tmpfs for /tmp and other writable paths
7. nsjail inside Docker: even without kCTF, wrapping the task in nsjail inside the container significantly reduces the attack surface
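For the single-host case, the flags from this checklist can be assembled in one place so that no task is started without them. This is a minimal sketch: the function name, the default limits and the example image and port values are assumptions, while the docker run flags themselves are standard Docker CLI options.

```python
import shlex


def docker_run_argv(image: str, name: str, host_port: int, container_port: int,
                    memory: str = "256m", cpus: str = "0.5",
                    pids_limit: int = 50) -> list[str]:
    """Build a hardened `docker run` command line for one task instance."""
    return [
        "docker", "run", "--detach", "--rm",
        "--name", name,
        f"--memory={memory}",                 # cap RAM per instance
        f"--cpus={cpus}",                     # cap CPU share
        f"--pids-limit={pids_limit}",         # stops fork bombs
        "--read-only",                        # immutable root filesystem
        "--tmpfs", "/tmp",                    # the only writable path
        "--security-opt=no-new-privileges",   # no setuid escalation
        "--publish", f"{host_port}:{container_port}",
        image,
    ]


argv = docker_run_argv("registry.example/chal-web:1", "chal-web-team42", 31337, 1337)
command_line = shlex.join(argv)   # printable form for logs or a dry run
```

Passing argv to subprocess.run(argv, check=True) starts the instance. Building the list in code rather than by hand also makes it easy to assert that nothing like /var/run/docker.sock ever ends up among the arguments.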
Key metrics and alerts
Platform (CTFd): HTTP QPS and latency (p50/p95/p99), with peaks at the start, when new tasks are published and before the finish. DownUnderCTF recorded characteristic peaks at three moments: the start of the CTF with the first wave of tasks, the release of the second wave, and the finish. HTTP 5xx error rate: CTFd has historically suffered from leaked database connections, which showed up as mass 500 errors. In version 3.6.0 this improved. An alert at a 5xx rate above 1% is the minimum threshold for response.
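The 1% threshold is simple arithmetic over status-code counters for a time window. In a real setup this would normally be an alerting rule in Prometheus or Grafana; the sketch below only shows the calculation, and the function names and sample counts are made up for the example.

```python
def error_rate_5xx(status_counts: dict[int, int]) -> float:
    """Share of responses in the window that had a 5xx status code."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0                      # no traffic, nothing to alert on
    errors = sum(count for status, count in status_counts.items()
                 if 500 <= status <= 599)
    return errors / total


def should_alert(status_counts: dict[int, int], threshold: float = 0.01) -> bool:
    """Fire when the 5xx share is strictly above the threshold (1% by default)."""
    return error_rate_5xx(status_counts) > threshold


quiet_window = {200: 9_800, 302: 150, 404: 45, 500: 5}
leaking_window = {200: 9_000, 500: 700, 502: 300}
```

With these sample windows the first one stays silent (5 errors out of 10,000 responses) and the second one fires (1,000 out of 10,000), which is the pattern a database connection leak tends to produce.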

Tasks: CPU and memory per container (cAdvisor is built into kubelet; for Docker, metrics come through a /metrics endpoint), the number of active containers (a sharp increase can mean abuse), network I/O per container (anomalously heavy traffic points to a potential breakout or mining).

Infrastructure: disk I/O and usage (participants can fill the disk through write primitives), node CPU/RAM for Kubernetes as the basis for autoscaling.

Prometheus is added as a service in the compose file or runs separately with a config pointing at the targets. Grafana uses Prometheus as a datasource; community dashboards for Docker and Kubernetes are available in the Grafana registry.

Layers of protection against abuse: Nginx limit_req (10-20 r/s per IP is a reasonable threshold), Cloudflare or a similar CDN for L7 DDoS protection (for CTFs with public access), Kubernetes Network Policy for segmenting task Pods, and CTFd's built-in rate limiting on flag submission attempts (configured in the admin panel).
CTF Infrastructure Deployment Checklist
Before the start of the competition, go through the list. The format is suitable for handing over to the infrastructure team:
1. The CTFd SECRET_KEY is generated cryptographically, not a default value
2. TLS is configured through Let's Encrypt, the HTTP-to-HTTPS redirect is active
3. Rate limiting is enabled on the reverse proxy: 10-20 r/s per IP
4. The Docker socket is not mounted in any task container
5. The metadata endpoint (169.254.169.254) is blocked at the host level
6. Each task container has limits: --memory, --cpus, --pids-limit
7. MySQL database backup is configured, at least every 30 minutes during the event
8. Prometheus collects metrics from CTFd and the task containers
9. An alert on 5xx error rate >1% is configured in Alertmanager or Grafana
10. A load test was run with the expected number of concurrent users
11. For instanced tasks: the healthcheck confirms the task is solvable
12. DNS TTL is lowered to 300 seconds so you can migrate quickly
13. CTFd and the tasks run on different hosts or in different namespaces in k8s
14. The CTFd admin panel is restricted by IP or via VPN
