#sre #monitoring

# Latency:
![[latency.png]]
- The time it takes to process a request.
- Measured for both successful and failed requests.
- Reported as the average, the median, or the 95th percentile (the value that 95% of requests do not exceed).

# Traffic:
![[traffic.png]]
- The amount of work performed by the system.
- Can be measured in requests per second, page visits per second, or other metrics.
- For a database, it can be the number of queries per second, records per second, or transactions per second (TPS).
- For the network, it can be the number of bytes per second or packets per second.
- It helps to understand how heavily the system is loaded.

# Errors:
![[errors.png]]
- The number of requests that failed.
- These can be HTTP errors, database errors, network errors, or application errors.
- It helps to understand what problems occur in the system.

# Saturation:
![[saturation.png]]
- How close the system is to its resource limits.
- This can be the number of simultaneous requests, CPU usage, memory usage, and disk space usage.
- It helps to understand how much capacity the system needs.

How to apply SRE Golden Signals:
- Measure these metrics in real time.
- Set thresholds for each metric.
- Set alerts for each metric.
- Set up monitoring for each metric.
- Analyze these metrics to identify problems.
- Solve problems that arise.
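
As a rough illustration of the "measure, set thresholds, alert" steps, here is a minimal Python sketch that computes the four signals over one window of requests and lists which thresholds are exceeded. Everything named here is an assumption made for the example, not part of any specific monitoring tool: the `Request` record, the `golden_signals` and `breached` helpers, the rule that a status of 500 or above counts as a failure, the use of in-flight requests as the saturation measure, and the threshold values. It also assumes the window contains at least one request.

```python
import math
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class Request:
    duration_ms: float  # time taken to process the request
    status: int         # HTTP-style status code (assumption for this sketch)


def percentile(values, pct):
    """Nearest-rank percentile: smallest value that pct% of values do not exceed."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_s, in_flight, max_in_flight):
    """Compute the four signals for one non-empty window of requests."""
    durations = [r.duration_ms for r in requests]
    # Assumption: any status >= 500 counts as a failed request.
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_avg_ms": mean(durations),
        "latency_median_ms": median(durations),
        "latency_p95_ms": percentile(durations, 95),
        "traffic_rps": len(requests) / window_s,
        "error_rate": failed / len(requests),
        # Assumption: saturation is in-flight requests relative to a known limit.
        "saturation": in_flight / max_in_flight,
    }


# Hypothetical thresholds; real values depend on the service.
THRESHOLDS = {"latency_p95_ms": 500.0, "error_rate": 0.01, "saturation": 0.8}


def breached(signals, thresholds=THRESHOLDS):
    """Return the names of signals that exceed their threshold."""
    return sorted(name for name, limit in thresholds.items() if signals[name] > limit)


# Made-up sample window: 18 fast successes and 2 slow failures in 10 seconds.
sample = [Request(100.0, 200)] * 18 + [Request(900.0, 500), Request(1200.0, 503)]
signals = golden_signals(sample, window_s=10, in_flight=45, max_in_flight=50)
alerts = breached(signals)
print(signals)
print(alerts)
```

In this made-up sample the median latency looks healthy while the 95th percentile does not, which is one reason to track a high percentile alongside the average or median. In practice a monitoring system would usually compute these values and evaluate the alert rules; the sketch only shows the arithmetic behind them.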