Self-Hosted Deployment

ScanRook can be deployed entirely on your own infrastructure. Self-hosting gives you full control over your data, allows operation in air-gapped environments, and helps meet compliance requirements that prohibit sending artifacts or vulnerability data to third-party services. This guide covers the architecture, prerequisites, Kubernetes deployment, configuration, scaling, and offline operation.

Architecture overview

The ScanRook platform consists of five components that communicate via PostgreSQL and S3-compatible object storage.

  • Web application (Next.js) -- Dashboard, API routes, scan job management, user authentication, and SSE progress streaming.
  • Worker service (Go) -- Polls PostgreSQL for queued jobs, downloads artifacts from S3, executes the scanner binary, tails NDJSON progress, and uploads reports.
  • Scanner binary (Rust) -- Core scanning engine. Auto-detects file types, extracts package inventories, and enriches findings from OSV, NVD, and distro feeds. Bundled inside the worker container image.
  • PostgreSQL -- Job queue, scan events, user data, and optional CVE cache.
  • S3-compatible storage -- MinIO, AWS S3, or any S3-compatible service. Stores uploaded artifacts and scan report JSON files.
The end-to-end flow between these components:

Browser                           Infrastructure
  |                                   |
  |-- presigned POST --> [ S3 (uploads bucket) ]
  |                                   |
  |-- POST /api/jobs --> [ Web (Next.js) ] --> [ PostgreSQL ]
  |                                   |              |
  |                                   |    polls scan_jobs (status=queued)
  |                                   |              |
  |                                   |      [ Worker (Go) ]
  |                                   |        |          |
  |                                   |  downloads from S3  executes scanner
  |                                   |        |          |
  |                                   |  tails NDJSON     [ Scanner (Rust) ]
  |                                   |        |
  |                                   |  inserts scan_events --> pg_notify
  |                                   |        |
  |                                   |  uploads report --> [ S3 (reports bucket) ]
  |                                   |
  |<-- SSE /api/jobs/[id]/events ---- [ Web ] <-- polls scan_events
  |<-- GET /api/jobs/[id]/report ---- [ Web ] <-- fetches from S3

Prerequisites

What you need before deploying ScanRook.

  • Kubernetes cluster (1.25+) or Docker Compose for single-node deployments
  • PostgreSQL 15+ -- managed service or self-hosted (e.g. CloudNativePG, Amazon RDS)
  • S3-compatible object storage -- MinIO (recommended for self-hosted), AWS S3, Google Cloud Storage with S3 compatibility, or DigitalOcean Spaces
  • 4 GB RAM minimum (8 GB recommended for worker nodes running concurrent scans)
  • Domain name with TLS certificate -- for the web dashboard. Use cert-manager with Let's Encrypt or provide your own certificate.
  • Container registry access -- to pull ScanRook container images (ghcr.io/devinshawntripp/scanrook-web and ghcr.io/devinshawntripp/scanrook-worker)

Kubernetes deployment

Step-by-step instructions for deploying ScanRook on Kubernetes.

1. Create the namespace

kubectl create namespace scanrook

2. Create secrets

Store database credentials, S3 keys, and auth secrets. Replace the placeholder values with your actual credentials.

apiVersion: v1
kind: Secret
metadata:
  name: scanrook-secrets
  namespace: scanrook
type: Opaque
stringData:
  DATABASE_URL: "postgres://user:pass@db-host:5432/scanrook?sslmode=require"
  S3_ACCESS_KEY: "your-access-key"
  S3_SECRET_KEY: "your-secret-key"
  NEXTAUTH_SECRET: "generate-with-openssl-rand-base64-32"
  NVD_API_KEY: "your-nvd-api-key"  # optional but recommended
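
The NEXTAUTH_SECRET placeholder above names its own generation command; you can produce a suitable value before creating the Secret:

```shell
# Generate a 32-byte random secret, base64-encoded (44 characters)
openssl rand -base64 32
```

Paste the output into stringData.NEXTAUTH_SECRET, or pass it directly with kubectl create secret generic --from-literal if you prefer not to keep the manifest on disk.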

3. Create ConfigMap

Non-sensitive configuration shared by the web and worker deployments.

apiVersion: v1
kind: ConfigMap
metadata:
  name: scanrook-config
  namespace: scanrook
data:
  S3_ENDPOINT: "minio.scanrook.svc:9000"
  S3_USE_SSL: "false"
  S3_REGION: "us-east-1"
  UPLOADS_BUCKET: "uploads"
  REPORTS_BUCKET: "reports"
  NEXTAUTH_URL: "https://scanrook.example.com"
  SCANNER_PATH: "/usr/local/bin/scanrook"
  SCRATCH_DIR: "/scratch"
  WORKER_CONCURRENCY: "2"
  WORKER_STALE_JOB_TIMEOUT_SECONDS: "1800"
  HTTP_ADDR: ":8080"

4. Web deployment

Three replicas are recommended for high availability. The web application serves the dashboard and API routes.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: scanrook-web
  namespace: scanrook
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scanrook-web
  template:
    metadata:
      labels:
        app: scanrook-web
    spec:
      containers:
        - name: web
          image: ghcr.io/devinshawntripp/scanrook-web:latest
          ports:
            - containerPort: 3000
          envFrom:
            - configMapRef:
                name: scanrook-config
            - secretRef:
                name: scanrook-secrets
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: scanrook-web
  namespace: scanrook
spec:
  selector:
    app: scanrook-web
  ports:
    - port: 3000
      targetPort: 3000
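
With three web replicas, a PodDisruptionBudget keeps the dashboard reachable during voluntary disruptions such as node drains. This is an optional sketch, not part of the stock manifests; adjust minAvailable to your tolerance:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: scanrook-web-pdb
  namespace: scanrook
spec:
  # Keep at least 2 of the 3 web pods running during drains
  minAvailable: 2
  selector:
    matchLabels:
      app: scanrook-web
```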

5. Worker deployment

Workers execute scans. Scale the replica count based on your expected scan volume. Each worker runs WORKER_CONCURRENCY parallel jobs.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: scanrook-worker
  namespace: scanrook
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scanrook-worker
  template:
    metadata:
      labels:
        app: scanrook-worker
    spec:
      containers:
        - name: worker
          image: ghcr.io/devinshawntripp/scanrook-worker:latest
          envFrom:
            - configMapRef:
                name: scanrook-config
            - secretRef:
                name: scanrook-secrets
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          volumeMounts:
            - name: scratch
              mountPath: /scratch
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 30
      volumes:
        - name: scratch
          emptyDir:
            sizeLimit: 10Gi
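
The emptyDir sizeLimit above caps scratch space at 10 GiB per pod. If your scans routinely handle larger artifacts, raise the limit and declare matching ephemeral-storage resources so the scheduler places workers on nodes with enough disk. A hedged sketch; size to your largest expected artifacts:

```yaml
# In the worker container spec
resources:
  requests:
    ephemeral-storage: "10Gi"
  limits:
    ephemeral-storage: "20Gi"
# And in the pod's volume definition
volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 20Gi
```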

6. Ingress

Expose the web application with TLS. This example uses nginx-ingress with cert-manager.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: scanrook-ingress
  namespace: scanrook
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - scanrook.example.com
      secretName: scanrook-tls
  rules:
    - host: scanrook.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: scanrook-web
                port:
                  number: 3000

Environment variable reference

All environment variables used by the web and worker components.

Variable                          Component    Required  Purpose
DATABASE_URL                      Web, Worker  Yes       PostgreSQL connection string
S3_ENDPOINT                       Web, Worker  Yes       S3-compatible object storage endpoint
S3_ACCESS_KEY                     Web, Worker  Yes       S3 access key ID
S3_SECRET_KEY                     Web, Worker  Yes       S3 secret access key
S3_USE_SSL                        Web, Worker  No        Enable TLS for S3 connections
S3_REGION                         Web, Worker  No        S3 region (e.g. us-east-1)
UPLOADS_BUCKET                    Web, Worker  Yes       Bucket for uploaded artifacts
REPORTS_BUCKET                    Web, Worker  Yes       Bucket for scan report JSON files
SCANNER_PATH                      Worker       No        Path to the scanrook binary inside the worker container
SCRATCH_DIR                       Worker       No        Temporary directory for downloaded artifacts during scans
WORKER_CONCURRENCY                Worker       No        Number of parallel scan jobs per worker pod
WORKER_STALE_JOB_TIMEOUT_SECONDS  Worker       No        Seconds before a running job with no heartbeat is marked failed
NEXTAUTH_URL                      Web          Yes       Canonical URL of the web application (e.g. https://scanrook.example.com)
NEXTAUTH_SECRET                   Web          Yes       Secret used to encrypt session tokens (generate with openssl rand -base64 32)
NVD_API_KEY                       Worker       No        NVD API key for higher rate limits during enrichment
HTTP_ADDR                         Worker       No        Listen address for the worker health endpoint

Scaling

Tuning ScanRook for high-volume scan workloads.

ScanRook scales horizontally at the worker layer. Each worker pod polls PostgreSQL for queued jobs using SELECT ... FOR UPDATE SKIP LOCKED, so multiple workers can safely process jobs in parallel without conflicts.
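
The exact query lives inside the Go worker, but the claim pattern it describes looks roughly like the following sketch (the id and created_at columns are assumptions; only scan_jobs and status appear in the source):

```sql
-- Atomically claim one queued job. SKIP LOCKED makes competing
-- workers pass over rows a peer has already locked, so no two
-- workers ever claim the same job.
UPDATE scan_jobs
SET status = 'running'
WHERE id = (
  SELECT id FROM scan_jobs
  WHERE status = 'queued'
  ORDER BY created_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING id;
```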

Worker concurrency

The WORKER_CONCURRENCY environment variable controls how many scans a single worker pod runs in parallel. The default is 2. For worker pods with 2 GB+ memory, you can safely increase this to 3-4. Total cluster scan throughput is replicas x WORKER_CONCURRENCY.
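
For the deployment defaults shown earlier (3 replicas, concurrency 2), the formula works out as:

```shell
# Cluster-wide parallel scan capacity = replicas x WORKER_CONCURRENCY
REPLICAS=3
WORKER_CONCURRENCY=2
echo $(( REPLICAS * WORKER_CONCURRENCY ))  # prints 6
```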

Horizontal pod autoscaling

For dynamic scaling, use a Kubernetes HPA based on CPU utilization or a custom metric derived from the scan_jobs queue depth.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scanrook-worker-hpa
  namespace: scanrook
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scanrook-worker
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Recommendations for high-volume environments

  • Run 3+ worker replicas with WORKER_CONCURRENCY=1 each for better fault isolation
  • Use dedicated worker nodes with node selectors or taints to prevent scan workloads from competing with the web application
  • Enable PostgreSQL CVE caching via DATABASE_URL on the scanner to avoid redundant API calls across workers
  • Add Redis as a distributed cache layer for even faster lookups across multiple worker pods
  • Monitor the scan_jobs table for queue depth (jobs with status = 'queued') to detect backpressure
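
The queue-depth check in the last bullet is a one-line query against the job table; run it from psql or wire it into your monitoring agent:

```sql
-- Jobs waiting for a worker. A persistently growing count means
-- you need more worker replicas or higher WORKER_CONCURRENCY.
SELECT count(*) AS queue_depth
FROM scan_jobs
WHERE status = 'queued';
```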

Air-gapped operation

Running ScanRook without internet access to external vulnerability databases.

ScanRook can operate in fully air-gapped environments by pre-seeding its vulnerability cache before deploying to the isolated network. The scanner checks its local file cache, then PostgreSQL, then Redis before making any external API calls. If the cache contains the needed data, no outbound requests are made.

Pre-seeding the cache

On a machine with internet access, use the scanrook db commands to warm the cache with vulnerability data for your target artifacts.

# On a machine with internet access:

# Warm cache for a specific artifact
scanrook db download --file ./myapp.tar

# Or update all sources broadly
scanrook db update --source all --file ./myapp.tar

# Check cache status
scanrook db check

# Package the cache directory for transfer
tar -czf scanrook-cache.tar.gz ~/.scanrook/cache/
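
Before carrying the archive across the air gap, it is worth listing its contents to confirm the cache files were actually captured. A sketch using a throwaway directory in place of ~/.scanrook/cache (the feed.json filename is illustrative, not a real cache file):

```shell
# Stand-in for the real cache directory
mkdir -p /tmp/cache-demo/cache
echo '{}' > /tmp/cache-demo/cache/feed.json

# Package it the same way as above, then list the archive
tar -czf /tmp/scanrook-cache.tar.gz -C /tmp/cache-demo cache
tar -tzf /tmp/scanrook-cache.tar.gz
```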

Deploying the cache

Transfer the cache archive to your air-gapped environment and mount it into the worker pods. Set the SCANNER_CACHE environment variable to point to the mounted path.

# Extract the cache on the air-gapped host
tar -xzf scanrook-cache.tar.gz -C /opt/scanrook/

# Mount as a volume in the worker deployment
volumes:
  - name: vuln-cache
    hostPath:
      path: /opt/scanrook/cache
      type: Directory

# Reference in the container spec
volumeMounts:
  - name: vuln-cache
    mountPath: /cache
    readOnly: true

# Set the environment variable
env:
  - name: SCANNER_CACHE
    value: "/cache"

Disabling external enrichment

To prevent the scanner from attempting outbound connections (which would fail and slow down scans), explicitly disable enrichment sources that require internet access.

env:
  - name: SCANNER_NVD_ENRICH
    value: "0"
  - name: SCANNER_OSV_ENRICH
    value: "0"
  - name: SCANNER_SKIP_CACHE
    value: "0"  # ensure cache is used

With enrichment disabled and a pre-warmed cache, scans will use only cached vulnerability data. Periodically refresh the cache on an internet-connected machine and transfer updated archives to the air-gapped environment.

Further reading

Related documentation for getting started and using the CLI.

  • Quickstart -- Install ScanRook and run your first scan in under two minutes.
  • CLI Reference -- Complete reference for all subcommands, flags, and environment variables.
  • Enrichment -- How ScanRook queries vulnerability databases and merges findings.
  • Data Sources -- Full provider table with ecosystem coverage and integration status.