Reliability (Self-hosted)

Reliability for Retool Self-hosted depends on the compute, storage, network, and security services provided by your public cloud provider. Architect on the assumption that these services will fail, and identify and configure appropriate failover mechanisms based on your organization's recovery objectives.

Best Practices

Leverage multiple availability zones

Cloud providers support running applications and managed services across multiple availability zones (AZs). Before installing Retool, configure your managed Kubernetes infrastructure so that hosts run in multiple availability zones. Managed database services also support Multi-AZ deployments. For example, if an AZ failure occurs, Amazon EKS automatically shifts managed nodes and pods to the availability zones that are healthy. If Amazon RDS detects an issue with the primary instance, it performs a DNS switch to the secondary database instance so that Retool can continue operating.
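As a rough illustration of how you might verify that hosts really are spread across zones before installing Retool, the sketch below counts ready nodes per zone using the well-known Kubernetes node label `topology.kubernetes.io/zone`. The input shape (a list of dicts with `labels` and `ready` keys), the function names, and the example zone names are assumptions made for this example, not part of any Retool or cloud provider API.

```python
from collections import Counter

# Well-known Kubernetes node label that records the availability zone.
ZONE_LABEL = "topology.kubernetes.io/zone"


def zone_spread(nodes):
    """Count ready nodes per availability zone.

    `nodes` is a hypothetical list of dicts with "labels" and "ready" keys,
    standing in for whatever your cluster API or CLI returns.
    """
    return Counter(
        node["labels"].get(ZONE_LABEL, "unknown")
        for node in nodes
        if node.get("ready", False)
    )


def is_multi_az(nodes, min_zones=2):
    """Return True when ready nodes exist in at least `min_zones` known zones."""
    spread = zone_spread(nodes)
    known_zones = [zone for zone in spread if zone != "unknown"]
    return len(known_zones) >= min_zones


# Illustrative data only: two ready nodes in different zones, one not ready.
example_nodes = [
    {"labels": {ZONE_LABEL: "us-east-1a"}, "ready": True},
    {"labels": {ZONE_LABEL: "us-east-1b"}, "ready": True},
    {"labels": {ZONE_LABEL: "us-east-1b"}, "ready": False},
]
```

Calling `is_multi_az(example_nodes)` on this data reports a multi-AZ spread, while dropping the first node leaves only one zone with a ready host.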

Deploying Retool across multiple availability zones increases costs, including cross-AZ network traffic and the additional pods that must run. The same is true of Multi-AZ managed databases. Use the pricing calculators and cost management tools provided by your cloud provider to estimate these costs.

Identify Recovery Time Objective (RTO) and Recovery Point Objective (RPO) and consider disaster recovery (DR) strategies

Resource: https://disaster-recovery.workshop.aws/en/services/containers/eks/eks-cluster-multi-region.html
Resource: https://learn.microsoft.com/en-us/azure/aks/operator-best-practices-multi-region
Resource: https://cloud.google.com/architecture/dr-scenarios-building-blocks

Cloud infrastructure requires architecting for failure, the most significant being a region outage. Understanding an organization's RTO and RPO helps administrators and architects identify appropriate DR strategies. Use the following Disaster Recovery Example to better understand how Retool can support DR.
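One way to reason about RTO and RPO is to compare them against two estimates for each candidate DR strategy: how long a restore takes (bounded by the RTO) and how much time can pass between backups or replication syncs (bounded by the RPO). The sketch below is a minimal model under those assumptions. The strategy names and all numbers are invented for illustration and should be replaced with your own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrStrategy:
    name: str
    backup_interval_min: float  # worst-case data loss window, in minutes
    restore_time_min: float     # estimated time to restore service, in minutes


def meets_objectives(strategy, rto_min, rpo_min):
    """True when the strategy restores within the RTO and loses no more than the RPO."""
    return (
        strategy.restore_time_min <= rto_min
        and strategy.backup_interval_min <= rpo_min
    )


def viable(strategies, rto_min, rpo_min):
    """Names of the strategies that satisfy both objectives."""
    return [s.name for s in strategies if meets_objectives(s, rto_min, rpo_min)]


# Hypothetical candidates with made-up estimates.
candidates = [
    DrStrategy("daily-snapshot-restore", backup_interval_min=24 * 60, restore_time_min=4 * 60),
    DrStrategy("cross-region-read-replica", backup_interval_min=5, restore_time_min=30),
]
```

With an RTO of 60 minutes and an RPO of 15 minutes, `viable(candidates, 60, 15)` keeps only the replica-based option. Relaxing both objectives admits the cheaper snapshot approach as well.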

Employ load balancing

Resource: https://kubernetes.io/docs/concepts/services-networking/ingress/
Resource: https://docs.aws.amazon.com/eks/latest/userguide/aws-load-balancer-controller.html

Load balancing distributes traffic across managed nodes and pods within the Kubernetes cluster, which improves the reliability of the application. It is supported via Kubernetes Ingress. Retool uses an Ingress controller to proxy traffic from the load balancer to a specified Kubernetes Service.
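To make the reliability benefit concrete, here is a toy round-robin balancer that skips backends marked unhealthy. It only illustrates the general idea of distributing traffic across pods. It is not how any particular Ingress controller or cloud load balancer is implemented, and the class name and pod names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Rotate through backends in order, skipping any marked unhealthy."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._ring = cycle(self._backends)

    def mark_unhealthy(self, backend):
        self._healthy.discard(backend)

    def mark_healthy(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def next_backend(self):
        if not self._healthy:
            raise RuntimeError("no healthy backends available")
        # At most one full pass around the ring is needed to find a healthy backend.
        for _ in range(len(self._backends)):
            candidate = next(self._ring)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")
```

When one backend is marked unhealthy, requests continue to rotate across the remaining ones, which is the property that lets a single pod failure go unnoticed by users.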

Use Kubernetes ReplicaSet

Resource: https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/
Resource: https://docs.retool.com/self-hosted/quickstarts/kubernetes/helm
Resource: https://docs.retool.com/self-hosted/quickstarts/kubernetes/manifests

A Kubernetes ReplicaSet provides a declarative way to specify the number of pods that must be running at all times. If a pod fails or is removed, the ReplicaSet controller creates a replacement pod to restore the desired count, and containers that fail their liveness checks are restarted. ReplicaSet support is provided as part of the Retool manifests and Helm deployment for Retool Self-hosted. This provides a more resilient architecture than VM deployments alone.
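The declarative behavior described above amounts to a reconciliation loop: compare the desired replica count with the pods that are still active, then create or delete pods to close the gap. The sketch below is a simplified model of that comparison, not the actual controller code. The `reconcile` function, its input mapping, and the pod names are assumptions for illustration, while the phase strings are the standard Kubernetes pod phases.

```python
# Pod phases that mean the pod has finished and no longer counts toward replicas.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def reconcile(desired_replicas, pod_phases):
    """Decide what to do to reach the desired replica count.

    `pod_phases` is a hypothetical mapping of pod name to phase string.
    Returns how many pods to create and which surplus pods to delete.
    """
    active = sorted(
        name for name, phase in pod_phases.items()
        if phase not in TERMINAL_PHASES
    )
    missing = max(desired_replicas - len(active), 0)
    surplus = active[desired_replicas:]
    return {"create": missing, "delete": surplus}
```

For example, with three desired replicas and one of three pods in the `Failed` phase, the result asks for one new pod and no deletions.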

Leverage managed database services

Managed databases from cloud providers offer automated snapshots, patching, and replication (via Multi-AZ configuration), which improves the resilience and reliability of the Retool platform. Configure all Retool production instances to use these services to improve your RTO and RPO considerably. The Retool database stores apps, workflows, resources, and other settings, so maintaining database health and hygiene is important when recovering from outages.
