#sre

# Performance Tuning & Capacity Planning

## 🧠 Core Concepts

- **Latency**: response time (p50, p95, p99)
- **Throughput**: requests/sec (RPS, TPS)
- **Bottlenecks**: CPU, RAM, disk I/O, DB, network
- **Capacity Planning**: forecasting + headroom
- **Horizontal Scaling**: more nodes
- **Vertical Scaling**: bigger nodes

---

## ⚙️ AWS

### ALB
- Layer 7 load balancing
- Auto-scales, but request pre-warming from AWS Support ahead of known traffic spikes
- Key metrics: `TargetResponseTime`, 5xx errors, `ActiveConnectionCount`

### Athena
- Optimize with partitioning, Parquet/ORC, larger files
- Avoid scanning all data (partition pruning, select only needed columns)
- Default quota: ~20 concurrent DML queries (adjustable)

### Airflow
- Tune the scheduler (`dag_concurrency`, `min_file_process_interval`)
- Use the Celery or Kubernetes executor for scaling
- Optimize the metadata DB (PgBouncer, autovacuum)

### Glue
- Minimize shuffles, use parallelism, partition pruning
- Choose the right worker type (G.1X, G.2X)

### S3
- ~5,500 GET/HEAD req/s per prefix (scale out with more prefixes)
- Use multipart upload, parallel reads
- Watch for `503 SlowDown` on spikes
- Plan storage: lifecycle rules, cold storage classes

---

## ☸️ Kubernetes

### Resources
- Set `requests` and `limits` for isolation & stability
- Avoid under-/overprovisioning

### HPA
- Auto-scale pods based on CPU/memory/custom metrics
- Use the Prometheus Adapter for custom metrics

### Cluster Autoscaler
- Scales node pools based on unschedulable pods
- Ensure ASG limits and cooldowns are tuned

### Scheduling
- Use affinity/anti-affinity, taints/tolerations
- QoS: Guaranteed > Burstable > BestEffort

---

## 🐘 PostgreSQL

### Query Tuning
- Use `EXPLAIN ANALYZE`
- Add indexes, avoid seq scans, monitor the slow query log

### Config
- `shared_buffers` ≈ 25% of RAM
- Tune `work_mem`, `effective_cache_size`, `max_connections`
- Use connection pooling (PgBouncer)

### Vacuum
- Monitor dead tuples
- Tune autovacuum for large tables

### Scaling
- Read replicas
- Partitioning for huge tables
- Monitor cache hit ratio, locks, bloat

---

## 📊 Monitoring / Observability

### Prometheus
- Pull-based metric collection
- Avoid high-cardinality labels (`user_id`, etc.)
- Watch `prometheus_tsdb_head_series`, memory usage

### Grafana
- Visualize metrics, build dashboards
- Alert on key SLOs (latency, errors)

### ELK Stack
- Shard sizing: ~30 GB/shard; avoid too many small shards
- Use ILM to manage the index lifecycle
- Avoid indexing high-volume noisy logs
- Ship logs via Beats/Logstash → Elasticsearch
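The percentile and headroom ideas from Core Concepts can be sketched in a few lines of Python. This is a minimal illustration, not a production formula: the nearest-rank percentile method, the 30% headroom factor, and the N+1 node for failure tolerance are all illustrative assumptions, and `rps_per_node` would come from your own load tests.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n) in sorted order."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

def nodes_needed(peak_rps, rps_per_node, headroom=0.3):
    """Capacity plan: provision for peak + headroom, plus one spare node (N+1)."""
    return math.ceil(peak_rps * (1 + headroom) / rps_per_node) + 1

# A tail-heavy latency sample (ms): p50 stays low while p95/p99 expose outliers.
latencies_ms = [12, 15, 11, 240, 14, 13, 18, 95, 16, 12]
print(percentile(latencies_ms, 50))   # median
print(percentile(latencies_ms, 95))   # tail latency
print(nodes_needed(peak_rps=4000, rps_per_node=500))
```

Note how one slow request (240 ms) barely moves p50 but dominates p95/p99, which is why SLOs are usually set on tail percentiles rather than averages.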