Monitoring and alerting best practices

Effective monitoring lets you detect and respond to problems before they affect users. Plan monitoring setup as part of your initial deployment, not as optional follow-up work.

Monitor container health and resources

Track CPU and memory usage across all Retool containers. Sustained high CPU on api containers is an early sign that you need to scale horizontally. Memory pressure on workflows-worker or code-executor containers often precedes workflow failures.

Kubernetes deployments expose container metrics through the standard metrics API. Route these to your monitoring platform of choice (Prometheus, Datadog, Amazon CloudWatch, etc.) using your cluster's metrics pipeline. For each core container type, alert when CPU or memory consistently exceeds 80% of the container's allocated limit.
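As a rough illustration of what "consistently exceeds 80%" can mean in an alert rule, the sketch below fires only when every reading in a recent window is above the threshold, so a single spike does not page anyone. The function name, the window size of five samples, and the sample values are assumptions made for illustration. In practice you would express the same condition in your monitoring platform's own rule language.

```python
from typing import Sequence


def sustained_breach(
    usage: Sequence[float],
    limit: float,
    threshold: float = 0.80,
    min_samples: int = 5,
) -> bool:
    """Return True when the last `min_samples` readings all exceed threshold * limit.

    `usage` and `limit` must share a unit (for example millicores or MiB).
    The defaults are illustrative, not values recommended by Retool.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    if len(usage) < min_samples:
        # Not enough data yet to call the breach sustained.
        return False
    recent = usage[-min_samples:]
    return all(sample / limit > threshold for sample in recent)


# Hypothetical memory readings in MiB against a 1000 MiB limit.
steady_pressure = [850, 900, 870, 910, 880]
single_dip = [850, 900, 700, 910, 880]
print(sustained_breach(steady_pressure, limit=1000))
print(sustained_breach(single_dip, limit=1000))
```

Requiring the whole window to breach trades some detection latency for fewer false alarms. A shorter window reacts faster but is noisier.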

Forward container logs to a centralized system

Retool containers emit structured JSON logs that are useful for debugging errors and tracing request lifecycles. Forwarding these logs to a centralized logging system (such as ELK, Datadog, Splunk, or CloudWatch Logs) makes it possible to search and correlate them across containers.

You can include audit events in container logs by setting the LOG_AUDIT_EVENTS environment variable. This lets your observability tooling ingest user actions alongside application logs.

Refer to the container logs guide for details on accessing logs and using requestId to trace a request across services.
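If your logging system is unavailable, the same kind of trace can be approximated on exported log lines. The following is a minimal sketch that assumes each log line is a single JSON object carrying a requestId field. The other field names and values in the sample (container, msg) are hypothetical and may not match what Retool actually emits.

```python
import json
from typing import Iterable, Iterator


def trace_request(lines: Iterable[str], request_id: str) -> Iterator[dict]:
    """Yield parsed log records whose requestId matches `request_id`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Skip lines that are not JSON, such as startup banners.
            continue
        if isinstance(record, dict) and record.get("requestId") == request_id:
            yield record


# Hypothetical sample lines; only requestId comes from the text above.
sample_logs = [
    '{"requestId": "abc-123", "container": "api", "msg": "query received"}',
    '{"requestId": "zzz-999", "container": "api", "msg": "unrelated request"}',
    "plain text line that is not JSON",
    '{"requestId": "abc-123", "container": "db-connector", "msg": "query executed"}',
]

for record in trace_request(sample_logs, "abc-123"):
    print(record["container"], record["msg"])
```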

Enable Retool telemetry

Retool provides an observability agent that can forward metrics and health data to your own monitoring destination. Enabling telemetry gives you a structured signal from Retool's internal health checks without requiring you to instrument the application yourself.

Refer to the telemetry guide for setup instructions.

Watch for common warning signs

Beyond resource utilization, set up alerts for the following operational signals:

  • An increased error rate in api container logs indicates failed queries or backend errors reaching users.
  • A growing Temporal task queue depth means workflows are queuing faster than workers can process them. Scale workflows-worker or check for stuck workflow runs.
  • Database connection pool exhaustion causes Retool to log errors when it cannot acquire a database connection. Increase your database connection limit or scale down replicas if connection count is the bottleneck.
  • Agent sandbox pods that fail to start cause builders to lose their editing sessions. Check node capacity and your maxTotalJobs cap.
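The queue-depth signal in the list above is about a trend, not a single value. One simple way to express "growing" is sketched below: the depth never drops across a recent window and ends higher than it started. The function name, the window of four samples, and the readings are assumptions for illustration. How you obtain the depth readings depends on your Temporal and metrics setup.

```python
from typing import Sequence


def backlog_is_growing(depths: Sequence[int], window: int = 4) -> bool:
    """Return True when the last `window` readings never drop and end higher.

    The window size is an illustrative default, not a recommended value.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    if len(depths) < window:
        # Too few samples to establish a trend.
        return False
    recent = list(depths[-window:])
    never_drops = all(
        later >= earlier for earlier, later in zip(recent, recent[1:])
    )
    return never_drops and recent[-1] > recent[0]


# Hypothetical queue depth readings taken at a fixed interval.
print(backlog_is_growing([3, 5, 5, 9]))  # depth keeps climbing
print(backlog_is_growing([3, 5, 4, 9]))  # workers caught up briefly
```

A flat but high queue depth does not trigger this check, so it would typically be paired with an absolute threshold.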

Set up health checks

Configure your load balancer and Kubernetes readiness probes to use Retool's health check endpoint. This ensures that traffic routes only to healthy containers and that Kubernetes removes unhealthy pods from rotation automatically.
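To make the probe behavior concrete, the sketch below mimics what a readiness probe does: it sends a GET request and treats anything other than an HTTP 200 response within the timeout as unhealthy. The URL is passed in as a parameter because the health check path is not given here, and the example address is a hypothetical placeholder. Kubernetes and load balancers perform this check natively, so this script is only useful for manual verification or smoke tests.

```python
import http.client
import urllib.request


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True only when `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and non-2xx responses
        # (urllib raises HTTPError, an OSError subclass, for 4xx and 5xx).
        return False


if __name__ == "__main__":
    # Hypothetical address; substitute your deployment's health check URL.
    print(is_healthy("http://retool.example.internal/health-check-path"))
```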

The infrastructure scaling guide describes the multi-container architecture and load balancer setup.

Review audit logs regularly

Retool stores a record of user actions in audit logs. Review audit logs periodically to identify anomalous behavior such as unexpected resource access, unusual sign-in patterns, or privilege escalations. If you forward audit events to your logging system via LOG_AUDIT_EVENTS, you can alert on specific event types automatically.
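Alerting on specific event types amounts to matching each forwarded audit event against a watchlist. The sketch below shows the idea on JSON log lines. The field name eventType, the event type values, and the user field are all hypothetical, so inspect your own audit log output for the real names before building a rule on them.

```python
import json
from typing import Iterable, List, Set


def find_watched_events(
    lines: Iterable[str],
    watched_types: Set[str],
    type_field: str = "eventType",  # hypothetical field name
) -> List[dict]:
    """Return parsed audit records whose type is in `watched_types`."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON lines mixed into the stream
        if not isinstance(record, dict):
            continue
        value = record.get(type_field)
        if isinstance(value, str) and value in watched_types:
            matches.append(record)
    return matches


# Hypothetical audit events; names and values are illustrative only.
audit_lines = [
    '{"eventType": "PAGE_VIEW", "user": "ana@example.com"}',
    '{"eventType": "PERMISSION_CHANGE", "user": "sam@example.com"}',
    '{"eventType": "LOGIN_FAILED", "user": "unknown@example.com"}',
]
watchlist = {"PERMISSION_CHANGE", "LOGIN_FAILED"}

for event in find_watched_events(audit_lines, watchlist):
    print(event["eventType"], event["user"])
```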