Reliability (Self-hosted)
Reliability of Retool Self-hosted depends on the compute, storage, network, and security services provided by the public cloud provider. Architecturally, these services should be expected to fail, with appropriate failover mechanisms identified and configured according to the organization's recovery objectives.
Best Practices
Leverage multiple availability zones
Cloud providers support running applications and managed services across multiple availability zones (AZs). Kubernetes-managed infrastructure must be configured with hosts running in multiple availability zones before Retool is installed. Managed database services also support Multi-AZ deployments. If an AZ failure occurs, Amazon EKS will automatically shift managed nodes and pods to the availability zones that are healthy. If Amazon RDS detects an issue with the primary instance, it will perform a DNS switch to the secondary database instance, allowing Retool to continue operating.
Deploying Retool applications to multiple availability zones increases costs, including cross-AZ network traffic and the additional pods that must run. The same is true of Multi-AZ managed databases. Consult the pricing calculators and cost management tools provided by your cloud provider.
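The multi-zone prerequisite described above can be checked before installation. The following is a minimal sketch, not a Retool tool: it assumes node data shaped like the items in the JSON printed by `kubectl get nodes -o json` and reads the well-known `topology.kubernetes.io/zone` node label. The function names, sample nodes, and zone names are hypothetical and used for illustration only.

```python
from collections import Counter

# Well-known Kubernetes node label that records the zone a node runs in.
ZONE_LABEL = "topology.kubernetes.io/zone"


def nodes_per_zone(nodes):
    """Count nodes in each availability zone.

    `nodes` is a list of dicts shaped like the items in the JSON printed by
    `kubectl get nodes -o json` (only metadata.labels is read).
    """
    zones = Counter()
    for node in nodes:
        labels = node.get("metadata", {}).get("labels", {})
        zones[labels.get(ZONE_LABEL, "unknown")] += 1
    return dict(zones)


def is_multi_az(nodes, min_zones=2):
    """Return True if nodes are spread over at least `min_zones` known zones."""
    known = [zone for zone in nodes_per_zone(nodes) if zone != "unknown"]
    return len(known) >= min_zones


# Hypothetical sample data; node and zone names are illustrative.
sample_nodes = [
    {"metadata": {"name": "node-a", "labels": {ZONE_LABEL: "us-east-1a"}}},
    {"metadata": {"name": "node-b", "labels": {ZONE_LABEL: "us-east-1b"}}},
    {"metadata": {"name": "node-c", "labels": {ZONE_LABEL: "us-east-1a"}}},
]

print(nodes_per_zone(sample_nodes))
print(is_multi_az(sample_nodes))
```

A cluster whose nodes all report the same zone would fail this check and would not survive the loss of that zone, regardless of how many pods are running.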
Identify Recovery Time Objective (RTO) and Recovery Point Objective (RPO) and consider disaster recovery (DR) strategies
Resource: https://disaster-recovery.workshop.aws/en/services/containers/eks/eks-cluster-multi-region.html
Resource: https://learn.microsoft.com/en-us/azure/aks/operator-best-practices-multi-region
Resource: https://cloud.google.com/architecture/dr-scenarios-building-blocks
Cloud infrastructure must be architected for failure, the most significant failure being a region outage. Understanding an organization's RTO and RPO helps administrators and architects identify suitable DR strategies. See the Disaster Recovery Example to better understand how Retool can support DR.
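To illustrate how RTO and RPO narrow the choice of DR strategy, the sketch below compares candidate strategies against a set of objectives. It is an illustration under simplifying assumptions: worst-case data loss is taken to be one full backup or replication interval, and the restore estimate is a single number. The class names, strategy names, and all durations are hypothetical and do not come from Retool or any cloud provider.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable window of data loss


@dataclass
class DrStrategy:
    name: str
    backup_interval: timedelta    # assumed worst-case data loss
    estimated_restore: timedelta  # assumed time to restore service


def meets_objectives(strategy, objectives):
    """Return (meets_rto, meets_rpo) for a candidate strategy."""
    meets_rto = strategy.estimated_restore <= objectives.rto
    meets_rpo = strategy.backup_interval <= objectives.rpo
    return meets_rto, meets_rpo


# Hypothetical objectives and strategies; every value is illustrative.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
strategies = [
    DrStrategy("daily snapshot restore", timedelta(hours=24), timedelta(hours=6)),
    DrStrategy("cross-region replica", timedelta(minutes=5), timedelta(minutes=30)),
]

for strategy in strategies:
    meets_rto, meets_rpo = meets_objectives(strategy, objectives)
    print(f"{strategy.name}: meets RTO={meets_rto}, meets RPO={meets_rpo}")
```

In practice the restore estimate should come from a rehearsed recovery rather than a guess, since an untested estimate is the most common reason an RTO is missed.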
Employ load balancing
Resource: https://kubernetes.io/docs/concepts/services-networking/ingress/
Resource: https://docs.aws.amazon.com/eks/latest/userguide/aws-load-balancer-controller.html
Load balancing distributes traffic across managed nodes and pods within the Kubernetes cluster, which improves the reliability of the application, and is supported via Kubernetes Ingress. The Retool platform uses an Ingress Controller to proxy traffic from the load balancer to a specified Kubernetes Service.
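The reliability benefit of load balancing comes from sending traffic only to backends that are healthy. The toy sketch below shows that idea with simple round-robin selection. It is not how any particular load balancer or Ingress Controller is implemented; real products use their own algorithms and health checks. The function name, pod names, and request IDs are hypothetical.

```python
from itertools import cycle


def distribute(requests, backends):
    """Assign each request to a healthy backend in round-robin order."""
    healthy = [b["name"] for b in backends if b["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy backends available")
    rotation = cycle(healthy)
    return {request: next(rotation) for request in requests}


# Hypothetical pods; "pod-2" is marked unhealthy and receives no traffic.
backends = [
    {"name": "pod-1", "healthy": True},
    {"name": "pod-2", "healthy": False},
    {"name": "pod-3", "healthy": True},
]

assignments = distribute(["r1", "r2", "r3", "r4"], backends)
print(assignments)
```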
Use Kubernetes ReplicaSet
Resource: https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/
Resource: https://docs.retool.com/self-hosted/quickstarts/kubernetes/helm
Resource: https://docs.retool.com/self-hosted/quickstarts/kubernetes/manifests
A Kubernetes ReplicaSet provides a declarative way of specifying the number of healthy pods that must be running at all times. If a pod fails or is removed, Kubernetes schedules a replacement pod so that the desired count is restored. A ReplicaSet is provided as part of the Retool manifests and Helm deployment for Retool Self-hosted. This provides a more resilient architecture than VM deployments alone.
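The declarative model described above can be pictured as a reconciliation pass that compares the desired replica count with the pods actually running. The following is a simplified model for illustration only, not the Kubernetes controller implementation. The `reconcile` function, pod names, and the `healthy` flag are hypothetical.

```python
def reconcile(desired_replicas, pods):
    """Return the pod list after one reconciliation pass.

    `pods` is a list of dicts with 'name' and 'healthy' keys. Unhealthy pods
    are dropped, replacements are created until the desired count is reached,
    and surplus pods are trimmed.
    """
    running = [pod for pod in pods if pod["healthy"]]
    missing = desired_replicas - len(running)
    # range() is empty when missing <= 0, so no replacements are created.
    for i in range(missing):
        running.append({"name": f"replacement-{i}", "healthy": True})
    # Trim surplus pods if more are running than desired.
    return running[:desired_replicas]


# Hypothetical cluster state: three replicas desired, one pod has failed.
current = [
    {"name": "retool-1", "healthy": True},
    {"name": "retool-2", "healthy": False},
    {"name": "retool-3", "healthy": True},
]

after = reconcile(3, current)
print([pod["name"] for pod in after])
```

Because only the desired count is declared, the same pass handles a failed pod, a deleted pod, and a scale-down without separate logic for each case.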
Leverage managed database services
Cloud provider managed databases provide automated snapshots, patching, and replication (via Multi-AZ configuration), improving the resilience and reliability of the Retool platform. All Retool production instances should be configured to use these services, which helps improve your RTO and RPO considerably. The Retool database stores apps, workflows, resources, and other settings, so maintaining database health and hygiene is important when recovering from outages.