#sre
# Performance Tuning & Capacity Planning
## 🧠 Core Concepts
- **Latency**: response time (p50, p95, p99)
- **Throughput**: requests/sec (RPS, TPS)
- **Bottlenecks**: CPU, RAM, Disk I/O, DB, Network
- **Capacity Planning**: forecasting + headroom
- **Horizontal Scaling**: more nodes
- **Vertical Scaling**: bigger nodes
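The percentile notation above (p50/p95/p99) can be sketched with a minimal nearest-rank implementation; the latency samples are illustrative:

```python
# Sketch: nearest-rank percentiles over latency samples (values are illustrative).
def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p% of all samples."""
    ordered = sorted(samples)
    # ceil(p/100 * n) as a 1-based rank, clamped to at least 1
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]

latencies_ms = [12, 15, 11, 240, 14, 13, 18, 16, 900, 17]
for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
# p50 = 15 ms, p95 = 900 ms, p99 = 900 ms
```

Note how a single 900 ms outlier dominates both p95 and p99 with only 10 samples: tail percentiles need large sample counts to be meaningful.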
---
## ⚙️ AWS
### ALB
- Layer 7 load balancing
- Auto-scales with load; use **LCU Reservation** to pre-provision capacity for sudden traffic spikes
- Key metrics: `TargetResponseTime`, `HTTPCode_ELB_5XX_Count`, `ActiveConnectionCount`
### Athena
- Optimize with partitioning, Parquet/ORC, and fewer, larger files (avoid many small files)
- Avoid scanning all data
- Default soft limit: ~20 concurrent DML queries (adjustable via Service Quotas)
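Why partitioning and columnar formats matter: Athena bills per byte scanned. A rough cost sketch, assuming the common $5/TB list price (region-dependent, an assumption here) and illustrative data sizes:

```python
# Sketch: rough Athena cost model. Price per TB scanned is an assumption
# ($5/TB is the list price in several regions; check yours).
PRICE_PER_TB = 5.00

def scan_cost_usd(bytes_scanned):
    return bytes_scanned / 1024**4 * PRICE_PER_TB

raw_csv = 2 * 1024**4          # 2 TB full scan of raw CSV
pruned_parquet = 40 * 1024**3  # 40 GB after partition pruning + columnar format

print(f"full CSV scan:  ${scan_cost_usd(raw_csv):.2f}")       # $10.00
print(f"pruned Parquet: ${scan_cost_usd(pruned_parquet):.2f}")  # $0.20
```

The ~50x difference comes from two independent effects: partition pruning skips irrelevant data, and columnar formats read only the queried columns.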
### Airflow
- Tune the scheduler (`parsing_processes`, `min_file_process_interval`, `max_active_tasks_per_dag`, formerly `dag_concurrency`)
- Use the Celery or Kubernetes executor for horizontal scaling
- Optimize the metadata DB (PgBouncer, autovacuum)
### Glue
- Minimize shuffles, use parallelism, partition pruning
- Choose the right worker type (G.1X, G.2X) and DPU count for the workload
### S3
- ~3,500 PUT/POST/DELETE and ~5,500 GET/HEAD req/sec **per prefix** (scale out with more prefixes)
- Use multipart upload and parallel byte-range reads
- Watch for `503 SlowDown` errors during request spikes
- Plan storage: lifecycle rules, cold storage
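Since the request limits above apply per prefix, one way to scale is spreading keys across hashed prefixes. A minimal sketch (key names and shard count are hypothetical):

```python
# Sketch: spread keys across hash-derived prefixes so per-prefix S3 request
# limits multiply. Key names and shard count are hypothetical.
import hashlib

def sharded_key(key, shards=16):
    # First hex chars of a stable digest give an evenly spread shard id.
    digest = hashlib.md5(key.encode()).hexdigest()[:2]
    shard = int(digest, 16) % shards
    return f"{shard:02x}/{key}"

print(sharded_key("logs/2024/06/01/app.log"))
```

With 16 shard prefixes, aggregate capacity is roughly 16x the per-prefix limit; the trade-off is that listing or range-scanning by date now requires fanning out across all shards.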
---
## ☸️ Kubernetes
### Resources
- Set `requests` and `limits` for isolation & stability
- Avoid under/overprovisioning
### HPA
- Auto-scale pods based on CPU/memory/custom metrics
- Use Prometheus Adapter for custom metrics
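The HPA's core formula is `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the min/max bounds. A sketch of that calculation (bounds are illustrative):

```python
import math

# Sketch of the HPA scaling formula:
# desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric),
# clamped to [minReplicas, maxReplicas]. Bounds here are illustrative.
def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# 4 pods averaging 80% CPU against a 50% target -> scale out to 7
print(desired_replicas(4, 80, 50))  # 7
```

Note the ratio is relative to the *target*, not absolute usage: halving the target utilization roughly doubles the replica count for the same load.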
### Cluster Autoscaler
- Scales node pools based on unschedulable pods
- Ensure ASG limits and cooldowns are tuned
### Scheduling
- Use affinity/anti-affinity, taints/tolerations
- QoS: Guaranteed > Burstable > BestEffort
---
## 🐘 PostgreSQL
### Query Tuning
- Use `EXPLAIN ANALYZE`
- Add indexes, avoid seq scans, monitor slow logs
### Config
- `shared_buffers` ≈ 25% RAM
- `work_mem` (per sort/hash operation), `effective_cache_size` (planner hint), cap `max_connections`
- Use connection pooling (PgBouncer)
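The sizing rules of thumb above can be sketched as a starting-point calculator; the percentages are common guidance, not hard rules, so benchmark against the real workload:

```python
# Sketch: rule-of-thumb starting points for PostgreSQL memory settings.
# Percentages are common guidance (not hard rules); benchmark your workload.
def pg_memory_hints(ram_gb):
    return {
        "shared_buffers": f"{ram_gb * 0.25:.0f}GB",        # ~25% of RAM
        "effective_cache_size": f"{ram_gb * 0.75:.0f}GB",  # ~50-75% of RAM
        "work_mem": "4MB",  # per sort/hash per query; raise with care
    }

print(pg_memory_hints(64))
# {'shared_buffers': '16GB', 'effective_cache_size': '48GB', 'work_mem': '4MB'}
```

`work_mem` deserves the most caution: it applies per sort/hash node per query, so `max_connections * work_mem * (nodes per query)` can far exceed RAM if set naively.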
### Vacuum
- Monitor dead tuples
- Tune autovacuum for large tables
### Scaling
- Read replicas
- Partitioning for huge tables
- Monitor cache hit ratio, locks, bloat
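The cache hit ratio mentioned above comes from the `blks_hit` / `blks_read` counters in `pg_stat_database`; a minimal sketch of the calculation (counter values are illustrative):

```python
# Sketch: buffer cache hit ratio from pg_stat_database counters.
# blks_hit = blocks found in shared_buffers, blks_read = blocks read from disk.
def cache_hit_ratio(blks_hit, blks_read):
    total = blks_hit + blks_read
    return blks_hit / total if total else 0.0

# Healthy OLTP workloads typically sit above ~0.99.
print(f"{cache_hit_ratio(9_950_000, 50_000):.3f}")  # 0.995
```

A falling ratio usually means the working set has outgrown `shared_buffers` (or a seq-scan-heavy query is churning the cache), so track it over time rather than as a single snapshot.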
---
## 📊 Monitoring / Observability
### Prometheus
- Pull-based metric collection
- Avoid high cardinality (`user_id`, etc.)
- Watch `tsdb_head_series`, memory usage
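Why high-cardinality labels hurt: active series count is the product of each label's distinct values, so one unbounded label multiplies everything else. A sketch with illustrative cardinalities:

```python
# Sketch: active series = product of per-label cardinalities, so one
# unbounded label (user_id) multiplies every other label's contribution.
from math import prod

def series_count(label_cardinalities):
    return prod(label_cardinalities.values())

bounded = {"method": 5, "status": 6, "instance": 20}
unbounded = dict(bounded, user_id=100_000)

print(series_count(bounded))    # 600
print(series_count(unbounded))  # 60000000
```

This is the worst case (full cross-product); real combinations are sparser, but an unbounded label still grows without limit as new values appear, which is what blows up `tsdb_head_series` and memory.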
### Grafana
- Visualize metrics, build dashboards
- Use alerting on key SLOs (latency, errors)
### ELK Stack
- Shard sizing: aim for ~10-50 GB per shard; avoid many small shards
- Use ILM to manage index lifecycle
- Avoid indexing high-volume noisy logs
- Use Beats/Logstash → Elasticsearch
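The shard-sizing and ILM points above combine into a simple capacity estimate for daily indices; a sketch with illustrative figures:

```python
import math

# Sketch: total shard count for daily indices, from daily ingest volume,
# retention, and a target shard size. Figures are illustrative;
# ~10-50 GB per shard is common guidance.
def total_shards(daily_gb, retention_days, target_shard_gb=30, replicas=1):
    primaries_per_day = max(1, math.ceil(daily_gb / target_shard_gb))
    return retention_days * primaries_per_day * (1 + replicas)

# 90 GB/day, 30-day retention, 1 replica -> 3 primaries * 2 copies * 30 days
print(total_shards(90, 30))  # 180
```

Each shard carries fixed heap and cluster-state overhead, which is why the notes warn against too many: doubling retention or replicas doubles the shard count even with ingest unchanged.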