
SRE Overview

What is Site Reliability Engineering (SRE)?

Site Reliability Engineering (SRE) is an engineering discipline that applies software development principles to operations tasks. It aims to build scalable, reliable systems by automating manual operations, managing risks, and optimizing the balance between innovation and system stability.

Key areas of SRE include availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

Who Does SRE?

SRE roles are typically filled by engineers who have a mix of skills in software development, systems administration, and operations. This includes:

Expectations in Modern Software Engineering Roles

Today, many software engineers are expected to contribute to reliability alongside their primary development tasks. This integration of responsibilities is sometimes referred to as "You build it, you run it."

Typical expectations include:

  1. Participation in On-Call Rotations: Engineers may join on-call schedules to support the systems they build.
  2. Incident Management: Responding to and learning from outages to improve system reliability.
  3. Monitoring & Alerting: Implementing metrics and dashboards for observability.
  4. Infrastructure Knowledge: Using tools like Kubernetes, Terraform, or CI/CD pipelines to deploy and manage applications.
  5. Collaboration with SRE Teams: Partnering with dedicated SREs to build robust, scalable systems.

Balancing Product Development with SRE Responsibilities

By blending product development and SRE principles, organizations aim to achieve faster development cycles without compromising system reliability.

Comparison of Approaches to Manage Operational Concerns in Software Engineering

Managing operational concerns in software engineering varies significantly depending on the team's size, maturity, and culture. Below is a comparison of the most common approaches:

Traditional Operations Team
  • Description: A separate operations team handles all deployment, monitoring, and incident management tasks.
  • Advantages: Clear division of responsibilities; operations expertise.
  • Disadvantages: Siloed communication; slower deployments; less developer accountability.
  • Best For: Large organizations with legacy systems.

DevOps
  • Description: Developers and operators collaborate closely, sharing responsibilities for deployment and reliability.
  • Advantages: Improved collaboration; faster feedback loops; continuous delivery focus.
  • Disadvantages: Requires cultural change; can lack clarity in responsibilities.
  • Best For: Organizations transitioning to modern workflows.

SRE (Dedicated Team)
  • Description: A specialized team applies SRE principles to manage system reliability and automate operations.
  • Advantages: Strong reliability focus; expertise in automation; reduces toil.
  • Disadvantages: Can create a new silo; requires investment in SRE roles and tools.
  • Best For: Companies with complex, high-scale systems.

Integrated SRE/Dev Teams
  • Description: Product teams adopt SRE practices and handle their system's operational concerns.
  • Advantages: Increased accountability; faster issue resolution; seamless feedback.
  • Disadvantages: Higher cognitive load on developers; requires training and a mindset shift.
  • Best For: Small-to-medium teams with modern infrastructure.

"You Build It, You Run It"
  • Description: Developers own end-to-end responsibility for their services, including operations and on-call duties.
  • Advantages: Maximum ownership; clear accountability; faster iteration.
  • Disadvantages: Burnout risk; requires robust tooling and culture; not scalable for complex systems.
  • Best For: Startups or small teams.

Key Factors to Consider

  1. Team Size and Structure: Smaller teams benefit from integrated approaches (e.g., DevOps, "You Build It, You Run It"), while larger organizations may require dedicated operations or SRE teams.
  2. System Complexity: Highly complex systems often require dedicated SRE teams to handle specialized concerns.
  3. Cultural Readiness: Success in DevOps or SRE models depends on cultural alignment and willingness to adapt.
  4. Operational Maturity: Teams with strong processes and automation can handle more integrated approaches.
  5. Business Needs: Fast-paced environments prioritize speed, while mission-critical systems prioritize reliability.

Recommendations by Context

  • Startup with small, nimble teams: "You Build It, You Run It" or DevOps
  • Mid-sized company scaling up: DevOps with elements of SRE
  • Enterprise with legacy systems: Traditional Ops transitioning to DevOps or SRE
  • Large-scale, high-availability system: Dedicated SRE team working in close collaboration with Dev teams

Selecting the right model involves balancing operational needs with team capabilities, ensuring long-term sustainability and business alignment.

Key Concepts in SRE

Product: Change Velocity (features) and Service's SLO (stability)

  1. Structural Conflict:

    • Product development teams and SRE teams often have conflicting goals: development wants rapid innovation, while SRE seeks system stability.
    • The key conflict lies between pace of innovation and product stability.
  2. Error Budget:

    • The introduction of an error budget resolves this conflict.
    • 100% reliability is not practical. In fact, aiming for 100% availability is counterproductive as users can’t distinguish between 100% and 99.999% uptime.
    • Other systems (like ISPs, laptops, Wi-Fi) contribute to availability, so striving for perfection in one service offers minimal benefit.
  3. Defining the Right Reliability Target:

    • The reliability target is a business or product decision, not a technical one. Factors to consider:

      • How much downtime will users tolerate?
      • What alternatives do users have if the service is down?
      • How does availability affect user behavior and product usage?
    • The availability target defines the error budget:

      • E.g., 99.99% availability leaves a 0.01% error budget (downtime allowed).
  4. Spending the Error Budget:

    • The error budget can be spent in any way that doesn’t exceed the allocated downtime.

    • The development team aims to innovate and roll out features quickly, which can use up the error budget, but the trade-off is faster growth.

    • Phased rollouts and 1% experiments are examples of methods that can help manage risk while using the error budget for quick launches.

  5. Changing the Incentives:

    • The goal for SREs shifts from "zero outages" to managing outages in a way that maximizes feature velocity.
    • Outages are no longer seen as "bad", but as part of the process of innovation. Both the SRE and product development teams collaborate to manage these incidents.

Product development teams are cross-functional groups within an organization responsible for designing, building, testing, and delivering a product. These teams typically consist of members with diverse skills, such as engineers, designers, product managers, and marketers, working collaboratively to create and improve a product throughout its lifecycle. Their goal is to meet customer needs, ensure product quality, and align with business objectives.

SRE (Site Reliability Engineering) teams are responsible for ensuring the reliability, scalability, and performance of software systems in production environments. They bridge the gap between development and operations by applying engineering practices to infrastructure and operations, focusing on automation, monitoring, incident response, and proactive reliability improvements. SRE teams aim to maintain high availability and efficient system performance while balancing the speed of development with operational stability.

Monitoring in SRE

Monitoring is one of the primary means by which service owners keep track of a system’s health and availability. A thoughtful monitoring strategy is crucial for identifying issues in real-time.

Common Monitoring Approach

Valid Monitoring Output

  1. Alerts:

    • Alerts signify that a human needs to take immediate action in response to something that is either happening or about to happen, in order to improve the situation.
  2. Tickets:

    • Tickets signify that a human needs to take action, but not immediately. The system cannot resolve the situation automatically, but if a human intervenes within a few days, no damage will result.
  3. Logging:

    • Logs are for diagnostic or forensic purposes. They are not actively monitored, and no one is expected to look at them unless something else prompts them to do so.
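
As a minimal illustration of these three output types, the sketch below routes a monitoring event to a page, a ticket, or a log entry; the event names and messages are invented for illustration and do not describe any particular monitoring system.

    from enum import Enum

    class Output(Enum):
        ALERT = "alert"    # a human must act immediately
        TICKET = "ticket"  # a human must act within a few days
        LOG = "log"        # recorded for later diagnosis; nobody is notified

    def route(event: str, output: Output) -> str:
        """Illustrative routing of a monitoring event by output type."""
        if output is Output.ALERT:
            return f"PAGE the on-call engineer: {event}"
        if output is Output.TICKET:
            return f"FILE a ticket: {event}"
        return f"APPEND to the log: {event}"

    print(route("error rate exceeds SLO", Output.ALERT))
    print(route("disk will fill within four days", Output.TICKET))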

Emergency Response in SRE

  1. Reliability Metrics:

    • Reliability is determined by Mean Time to Failure (MTTF) and Mean Time to Repair (MTTR).
    • The most important metric for evaluating the effectiveness of emergency response is MTTR, which measures how quickly the response team can restore the system to health.
  2. Impact of Humans:

    • Humans add latency to emergency response. A system that requires fewer human interventions will generally have higher availability.
    • Systems designed to avoid emergencies that require human action will typically outperform those that need hands-on intervention.
  3. Playbooks for Effective Response:

    • Playbooks significantly improve MTTR: a documented set of best practices can speed up recovery by roughly 3x compared with a "winging it" strategy.
    • While smart engineers who can think on the fly are essential, clear troubleshooting steps and tips are valuable during high-stakes, time-sensitive situations.
  4. SRE Practices:

    • Company SRE relies on on-call playbooks and exercises like the “Wheel of Misfortune” to prepare engineers for effective emergency response.

Change Management in SRE

  1. Cause of Outages:

    • Roughly 70% of outages are caused by changes in a live system.
  2. Best Practices for Change Management:

    • SRE best practices in managing changes involve automation to achieve the following:
      • Progressive Rollouts: Gradually introducing changes to minimize impact on users.
      • Quick and Accurate Problem Detection: Identifying issues early to prevent large-scale impact.
      • Safe Rollbacks: Rolling back changes promptly and safely when problems arise.
  3. Benefits of Automation:

    • These practices help minimize the number of users and operations exposed to problematic changes.
    • By removing humans from the loop, the system avoids common human errors such as:
      • Fatigue
      • Familiarity/Contempt
      • Inattention to Repetitive Tasks
  4. Resulting Improvements:

    • The outcome is a combination of increased release velocity and improved safety in deployments.
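
The automation described above can be sketched roughly as follows; the stage fractions, health check, and rollback hook are hypothetical placeholders rather than a description of any real deployment system.

    from typing import Callable

    def progressive_rollout(stages: list[float],
                            apply_stage: Callable[[float], None],
                            is_healthy: Callable[[], bool],
                            rollback: Callable[[], None]) -> bool:
        """Widen a change gradually, rolling back as soon as a problem is detected."""
        for fraction in stages:
            apply_stage(fraction)          # progressive rollout
            if not is_healthy():           # quick and accurate problem detection
                rollback()                 # safe rollback limits user exposure
                return False
        return True

    ok = progressive_rollout(
        stages=[0.01, 0.10, 0.50, 1.00],
        apply_stage=lambda f: print(f"serving new version to {f:.0%} of traffic"),
        is_healthy=lambda: True,           # stand-in for automated monitoring checks
        rollback=lambda: print("rolling back"),
    )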

Demand Forecasting and Capacity Planning

  1. Goal of Demand Forecasting and Capacity Planning:

    • Ensure sufficient capacity and redundancy to serve projected future demand with the required availability.
  2. Challenges:

    • A surprising number of services and teams fail to ensure that the required capacity is in place when needed.
    • The process must account for both organic growth (from natural product adoption) and inorganic growth (due to events like feature launches or marketing campaigns).
  3. Mandatory Steps in Capacity Planning:

    • Accurate Organic Demand Forecast: Predicting future demand well enough to acquire necessary capacity in time.
    • Incorporation of Inorganic Demand Sources: Including business-driven changes (e.g., launches, marketing) in the demand forecast.
    • Regular Load Testing: Correlating raw capacity (servers, disks, etc.) with actual service capacity through load testing.
  4. SRE's Role in Capacity Planning:

    • Since capacity is critical to service availability, the SRE team must be responsible for both capacity planning and provisioning.
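
A hedged sketch of the forecasting arithmetic these steps imply; the growth rate, launch uplift, per-server capacity, and redundancy headroom are invented numbers, with per-server capacity standing in for the figure load testing would supply.

    import math

    def servers_needed(current_qps: float, organic_growth: float,
                       launch_uplift_qps: float, qps_per_server: float,
                       redundancy_factor: float = 1.25) -> int:
        """Capacity to provision: organic plus inorganic demand, with redundancy headroom."""
        projected_qps = current_qps * (1 + organic_growth) + launch_uplift_qps
        return math.ceil(projected_qps * redundancy_factor / qps_per_server)

    # 10,000 QPS today, 30% organic growth, a launch expected to add 2,000 QPS,
    # 500 QPS per server (from load testing), 25% redundancy headroom.
    print(servers_needed(10_000, 0.30, 2_000, 500))  # 38 servers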

Provisioning

  1. Provisioning Overview:

    • Provisioning combines both change management and capacity planning.
    • It must be done quickly and only when necessary, as capacity is expensive.
  2. Risks in Provisioning:

    • Provisioning involves activities like spinning up new instances, modifying existing systems (e.g., configuration files, load balancers, networking), and validating the performance of the new capacity.
    • These operations are riskier than frequent tasks like load shifting, which occurs multiple times per hour.
  3. Best Practices:

    • Treat provisioning with extra caution, given its complexity and the importance of ensuring new capacity works as needed.

Efficiency and Performance

  1. Importance of Resource Efficiency:

    • Efficient use of resources is critical when a service cares about its costs.
    • SRE is responsible for provisioning, and therefore also for managing resource utilization.
    • A well-managed provisioning strategy directly impacts the service’s total costs.
  2. Factors Affecting Efficiency:

    • Resource use is determined by:
      • Demand (load) on the service.
      • Capacity of the system.
      • Software efficiency.
  3. SRE's Role:

    • SREs predict demand, provision capacity, and improve software to ensure efficient service operation.
    • As load increases, the system can slow down, reducing capacity and efficiency.
    • When a system slows too much, it risks becoming unable to serve requests, equating to infinite slowness.
  4. Provisioning and Performance:

    • SREs target specific response speeds when provisioning to meet capacity goals.
    • Both SREs and product developers actively monitor and modify the service to improve performance, increase capacity, and enhance efficiency.

Managing Risk

  1. Importance of Reducing Failure:

    • Unreliable systems can erode users' confidence, so reducing the chance of failure is a priority.
    • The cost of reliability does not increase linearly: each incremental improvement in reliability can cost much more than the previous increment.
  2. Dimensions of Cost:

    • Redundant Resources: Costs associated with additional machine/compute resources and redundant equipment for maintenance, data durability, etc.
    • Opportunity Cost: The cost when engineering resources are used to build systems that reduce risk, instead of focusing on user-facing features or new products.
  3. SRE’s Approach to Risk:

    • SRE manages service reliability by managing risk.
    • Risk is treated as a continuum, where SREs aim to balance reliability improvements with the appropriate level of risk tolerance for each service.
  4. Risk Alignment with Business Needs:

    • SRE aligns the level of reliability with the business’s willingness to bear risk.
    • Services should be reliable enough to meet business goals but not overly reliable, as excessive reliability could waste opportunities for new features or reduce operational efficiency.
    • Availability targets (e.g., 99.99%) are viewed as both a minimum and a maximum, ensuring reliability is optimized without excessive investment.
  5. Goal:

    • Explicitly balancing reliability and cost, allowing for thoughtful risk-taking decisions that align with business priorities.

Measuring Service Risk

  1. Objective Metrics for Optimization:

    • At Company, we aim to identify an objective metric to represent the properties we want to optimize in a system.
    • By setting a target, we can assess current performance and track improvements or degradations over time.
  2. Challenges in Measuring Service Risk:

    • Service failures can lead to multiple impacts: user dissatisfaction, revenue loss, brand damage, and reputational harm, which are hard to measure.
    • To make the problem manageable, the focus is placed on unplanned downtime as a key measure of risk.
  3. Unplanned Downtime as a Risk Metric:

    • For most services, unplanned downtime is the most straightforward way to represent risk tolerance.
    • Downtime is captured through the desired service availability, often expressed in "nines" (e.g., 99.9%, 99.99%, 99.999%).
    • The formula for time-based availability is:
      availability = uptime / (uptime + downtime)
      
    • For example, a system with 99.99% availability can be down for up to 52.56 minutes in a year and still meet the target.
  4. Global Service Considerations:

    • At Company, time-based availability metrics are less meaningful due to globally distributed services and fault isolation.
    • Instead of uptime, availability is defined by the request success rate, calculated over a rolling window of time:
      availability = successful requests / total requests
      
  5. Request Success Rate:

    • For a system serving 2.5 million requests in a day with a target of 99.99% availability, it can afford up to 250 errors in that day.
    • Not all requests are equal—failing a new user sign-up is different from failing a background email polling request.
    • Availability, based on request success rate, is a reasonable approximation of unplanned downtime from the end-user perspective.
  6. Applicability to Non-serving Systems:

    • The request success rate metric can also apply to non-serving systems (e.g., batch, pipeline, storage, transactional systems).
    • In batch systems, availability can be calculated based on the success rate of records processed.
  7. Tracking Performance Against Availability Targets:

    • Quarterly availability targets are set for services and performance is tracked weekly or daily.
    • This allows the team to manage the service to a high-level availability objective, addressing any deviations as they arise.
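
Both availability definitions from this section can be sketched in a few lines; the figures reuse the 99.99% and 2.5-million-request examples above.

    def time_based_availability(uptime_minutes: float, downtime_minutes: float) -> float:
        return uptime_minutes / (uptime_minutes + downtime_minutes)

    def request_availability(successful_requests: int, total_requests: int) -> float:
        return successful_requests / total_requests

    minutes_per_year = 365.25 * 24 * 60
    allowed_downtime = minutes_per_year * (1 - 0.9999)
    print(f"99.99% allows ~{allowed_downtime:.1f} minutes of downtime per year")
    print(time_based_availability(minutes_per_year - allowed_downtime, allowed_downtime))  # ~0.9999

    # 2.5 million requests per day at a 99.99% target leaves room for 250 failed requests.
    print(request_availability(2_500_000 - 250, 2_500_000))  # 0.9999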

Risk Tolerance of Services

  1. Defining Risk Tolerance:

    • The risk tolerance of a service refers to the acceptable level of risk or failure the service can endure before it impacts business goals.
    • In formal environments or safety-critical systems, risk tolerance is often defined directly in the product or service's definition.
    • At Company, services’ risk tolerance is not always clearly defined, so SREs work with product owners to translate business goals into explicit engineering objectives.
  2. Consumer vs Infrastructure Services:

    • Consumer Services (e.g., Company Maps, Docs) often have product teams responsible for defining risk tolerance and availability requirements.
    • Infrastructure Services (e.g., storage systems) typically lack dedicated product teams and must derive risk tolerance through collaboration with engineers.
  3. Identifying Risk Tolerance of Consumer Services:

    • Factors to consider:
      • Target availability level.
      • Impact of different types of failures.
      • Service cost and its relationship to risk tolerance.
      • Other important service metrics (e.g., latency, performance).
  4. Target Level of Availability:

    • The availability target depends on the service's function and market positioning:
      • What do users expect?
      • Does the service tie directly to revenue?
      • Is the service paid or free?
      • What do competitors offer?
      • Is it targeted at consumers or enterprises?
    • Company Apps for Work: Typically has high availability targets (e.g., 99.9%) due to its critical role in enterprises.
    • YouTube: Initially had a lower availability target due to its rapid growth phase and consumer-focused nature.
  5. Types of Failures:

    • The shape of failures is crucial:
      • Constant low-rate failures versus occasional full-site outages.
      • Partial failures (e.g., rendering issues) may be less impactful than full outages or data exposure failures (e.g., private data leaks).
      • Maintenance windows may be acceptable for some services, such as the Ads Frontend.
  6. Cost Considerations:

    • Cost is a key factor in determining availability targets, especially when failure translates directly into revenue loss (e.g., in Ads).
    • Cost/benefit analysis helps determine if increasing availability is worth the investment:
      • Example: Increasing availability from 99.9% to 99.99% might increase revenue by $900 (based on $1M in service revenue).
    • Services whose revenue is not directly tied to availability can use the background error rate of Internet Service Providers (ISPs), typically between 0.01% and 1%, as a baseline for an acceptable failure rate.
  7. Other Important Service Metrics:

    • Latency is another important metric:
      • AdWords has a strict latency requirement because it must not slow down the search experience.
      • AdSense has a more relaxed latency requirement because it serves ads asynchronously, allowing for greater flexibility in provisioning and reducing operational costs.
    • Different services have different latency goals, which influence how they are engineered and provisioned.

Summary

Risk tolerance is a complex and multifaceted concept that depends on service type, business goals, and cost considerations. For consumer services, it's often easier to define, with clear product ownership and expectations. For infrastructure services, the lack of direct product teams requires more collaboration to assess and define appropriate risk tolerance.

Identifying the Risk Tolerance of Infrastructure Services

Identifying the risk tolerance of infrastructure services involves understanding how different users interact with the service and determining acceptable levels of performance, availability, and reliability based on those needs. The process is more complex than for consumer services because infrastructure components often serve multiple clients with varying requirements.

Key Considerations for Identifying Risk Tolerance in Infrastructure Services

1. Target Level of Availability

2. Types of Failures

3. Cost Considerations

4. Explicitly Defined Service Levels

Example: Frontend Infrastructure

  • Company’s frontend infrastructure, responsible for handling user requests, must deliver a high level of reliability since failures here directly affect user experience. Ensuring availability and quick response times is crucial, as lost requests cannot be retried once they fail.
  • These systems are designed with high reliability in mind to meet the strict needs of consumer-facing services. However, even in such critical systems, the level of reliability might still vary based on factors like traffic load, redundancy, and infrastructure configuration.

Conclusion

In summary, identifying the risk tolerance of infrastructure services is about balancing the varying needs of different users (low latency vs. throughput, high reliability vs. cost-effectiveness) and offering flexible service levels that allow clients to make informed decisions about reliability and cost trade-offs. This approach helps manage infrastructure resources efficiently while ensuring that different types of users can rely on the service according to their specific requirements.

Forming Your Error Budget

To make informed decisions about the amount of risk a service can tolerate, teams use an error budget, which is based on the Service Level Objective (SLO). The error budget provides an objective metric that specifies how much unreliability is acceptable within a given time period (usually a quarter). This approach removes subjective negotiations between teams and helps manage service reliability.

Key Practices for Forming an Error Budget:

  1. Defining the SLO:

    • Product Management defines an SLO, which sets expectations for how much uptime a service should have per quarter.
  2. Measuring Uptime:

    • The actual uptime of the service is measured by a neutral third party, typically a monitoring system.
  3. Calculating the Error Budget:

    • The difference between the expected uptime (SLO) and the actual uptime is the error budget, which quantifies the allowed unreliability for the quarter.
  4. Using the Error Budget:

    • As long as the actual uptime is above the SLO, meaning there is still error budget remaining, new releases can be pushed.

Example:

  • If a service has an SLO to successfully serve 99.999% of all queries in a quarter, the error budget allows a failure rate of 0.001%.
  • If a problem causes a failure of 0.0002% of the queries, it consumes 20% of the service’s quarterly error budget.

This structured approach helps teams assess the service's reliability objectively and manage risk appropriately within defined limits.
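
A minimal sketch of the arithmetic in the example above (a 99.999% SLO and an incident that fails 0.0002% of the quarter's queries).

    def error_budget(slo: float) -> float:
        """Allowed failure rate implied by an SLO, e.g. 0.99999 -> 0.001%."""
        return 1.0 - slo

    def budget_consumed(incident_failure_rate: float, slo: float) -> float:
        """Fraction of the quarterly error budget used by one incident."""
        return incident_failure_rate / error_budget(slo)

    slo = 0.99999                      # serve 99.999% of queries successfully this quarter
    incident_failure_rate = 0.000002   # the incident failed 0.0002% of all queries
    print(f"{budget_consumed(incident_failure_rate, slo):.0%} of the error budget consumed")  # 20%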

Benefits of Error Budgets

The main benefit of an error budget is that it provides a common incentive that aligns both product development and SRE (Site Reliability Engineering) teams to focus on finding the right balance between innovation and reliability.

Key Benefits and Practices:

  1. Managing Release Velocity:

    • Many products use an error budget control loop to manage release velocity. As long as the system’s SLOs are met, releases can continue.
    • If SLO violations occur often enough to use up the error budget, releases are temporarily halted to invest additional resources in system testing and development to improve resilience, performance, etc.
  2. More Subtle Approaches:

    • More effective techniques than the simple on/off control can be used, such as slowing down releases or rolling them back when the error budget is close to being used up.
  3. Self-Policing for Product Development:

    • If product development wants to reduce testing or increase push velocity, but SRE is resistant, the error budget serves as a guiding metric for decision-making.
    • When the error budget is large, product developers can take more risks. When the budget is nearly drained, they will push for more testing or slower release velocity to avoid stalling their launch.
    • This leads to product development teams becoming self-policing, as they manage their own risk based on the available error budget.
  4. Handling Network Outages or Datacenter Failures:

    • Events such as network outages or datacenter failures that reduce the measured SLO will also eat into the error budget. As a result, new pushes may be reduced for the rest of the quarter.
    • The entire team supports this reduction, as everyone shares the responsibility for uptime.
  5. Balancing Reliability and Innovation:

    • The error budget also helps identify the costs of setting overly high reliability targets, which can lead to inflexibility and slow innovation.
    • If the team faces challenges in launching new features, they may decide to loosen the SLO (and thus increase the error budget) to promote more innovation.

Service Level Terminology

Indicators (SLI)

An SLI (Service Level Indicator) is a carefully defined quantitative measure of some aspect of the level of service that is provided.

Common SLIs include request latency, error rate, system throughput, and availability.

SLIs are often aggregated, with raw data collected over a measurement window and then turned into a rate, average, or percentile.

Sometimes only a proxy SLI is available, especially if the desired metric is hard to obtain or interpret. For example, client-side latency is often more relevant to users, but measuring it may only be possible at the server.

Another important SLI for SREs is availability: the fraction of time a service is usable. Availability is often defined as the fraction of well-formed requests that succeed (called yield).

Although 100% availability is impossible, near-100% availability is often achievable, and is commonly expressed in terms of the number of "nines" in the availability percentage: for example, 99.99% availability is referred to as "four nines."

Objectives (SLO)

An SLO (Service Level Objective) is a target value or range of values for a service level that is measured by an SLI. A typical SLO structure is:

SLI ≤ target, or lower bound ≤ SLI ≤ upper bound

For example, an SLO might specify that the average search request latency should be less than 100 milliseconds.

Choosing an appropriate SLO can be complex. For example, while you may not control the queries per second (QPS) for incoming HTTP requests (which depends on user demand), you can set an SLO for latency, which could motivate optimizations like low-latency equipment or frontend design.

Although 100 milliseconds for latency is arbitrary, fast is better than slow in general. Studies suggest that user-experienced latency above certain values can drive users away.
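
As a small sketch of the 100 ms example, the check below compares average latency against the SLO target; the sample values are made up.

    def meets_latency_slo(latencies_ms: list[float], target_ms: float = 100.0) -> bool:
        """True if the average request latency is under the SLO target."""
        return sum(latencies_ms) / len(latencies_ms) < target_ms

    samples = [42.0, 87.5, 120.3, 65.1, 98.7]   # hypothetical search request latencies in ms
    print(meets_latency_slo(samples))           # True: the average is about 82.7 ms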

Agreements (SLA)

An SLA (Service Level Agreement) is an explicit or implicit contract with users that includes the consequences of meeting or missing the SLOs contained within it. Consequences are often financial (e.g., rebates or penalties), but can take other forms.

Difference Between SLO and SLA:

SRE teams don’t typically construct SLAs because SLAs are tied to business and product decisions. However, SREs do help avoid the consequences of missed SLOs and are involved in defining the SLIs to ensure there is an objective way to measure the SLOs in the agreement.

Example:

Company Search doesn’t have an SLA for the public, but if it's unavailable, it affects reputation and advertising revenue. Many other Company services, such as Company for Work, do have explicit SLAs.

Whether or not a service has an SLA, defining SLIs and SLOs is valuable for managing the service.

Indicators in Practice

Identifying Meaningful Metrics

Selecting appropriate metrics to measure your service’s performance involves understanding what matters to both you and your users. Avoid tracking every possible metric—focus on a handful of representative indicators to evaluate and reason about the system’s health effectively.

Key Guidelines:

What Do You and Your Users Care About?

Your choice of Service Level Indicators (SLIs) should reflect what your users value most. Different types of services prioritize different indicators:

Categories of SLIs by Service Type:

  1. User-Facing Serving Systems (e.g., search frontends):

    • Availability: Could we respond to the request?
    • Latency: How long did it take to respond?
    • Throughput: How many requests could be handled?
  2. Storage Systems:

    • Latency: How long does it take to read or write data?
    • Availability: Can we access the data on demand?
    • Durability: Is the data still there when we need it?
      • For an extended discussion, see Chapter 26.
  3. Big Data Systems (e.g., data processing pipelines):

    • Throughput: How much data is being processed?
    • End-to-End Latency: How long does it take for data to progress from ingestion to completion?
      • Some pipelines may also set latency targets for individual processing stages.
  4. Correctness (relevant to all systems):

    • Key Question: Was the correct answer returned, the right data retrieved, or the right analysis done?
    • While correctness is critical for system health, it often pertains to the data rather than the infrastructure itself and may not fall under SRE responsibility.

Summary

Focusing on the SLIs most relevant to your service type and user needs ensures effective monitoring and better system health management.

Collecting Indicators

Server-Side vs. Client-Side Metrics


Aggregation of Metrics

Challenges in Aggregation:

Distribution Over Averages:

Why Focus on High Percentiles?


Standardizing Indicators

Benefits of Standardization:

Components of Standardized SLIs:

Reusable Templates:


Summary

Effective monitoring of indicators requires thoughtful collection, careful aggregation, and standardized definitions. Using distributions and high-percentile values ensures deeper insights into system behavior, while standardized templates save effort and foster consistency.
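
Because the summary favors distributions and high percentiles over averages, here is a minimal nearest-rank percentile sketch; it assumes nothing about any specific monitoring stack, and the sample latencies are invented.

    def percentile(samples: list[float], pct: float) -> float:
        """Nearest-rank percentile, e.g. pct=99 for the 99th percentile."""
        ordered = sorted(samples)
        rank = max(1, round(pct / 100 * len(ordered)))
        return ordered[rank - 1]

    latencies_ms = [12, 15, 14, 13, 250, 16, 15, 14, 13, 900]   # a long tail hides in the mean
    print(sum(latencies_ms) / len(latencies_ms))   # mean of ~126 ms looks tolerable
    print(percentile(latencies_ms, 90))            # 250 ms
    print(percentile(latencies_ms, 99))            # 900 ms shows the worst-case user experience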

Objectives in Practice

Start with User Needs

Defining Objectives

Examples:

Multi-Target SLOs:

Heterogeneous Workloads:

SLO (Service Level Objective)

A Service Level Objective (SLO) is a specific, measurable target for the performance or reliability of a service, often expressed as a percentage over a defined time period. It represents the expected level of service agreed upon between stakeholders, serving as a benchmark for evaluating system health.

For example: "99.9% of requests should respond within 200ms over the last 30 days."

SLOs are measured using SLIs (Service Level Indicators) and often underpin SLAs (Service Level Agreements), guiding operational priorities and reliability efforts.

Managing SLO

Error Budgets

Choosing SLO Targets

Control Measures

SLOs are integral to system management through control loops:

  1. Monitor and measure SLIs.
  2. Compare SLIs to SLOs.
  3. Determine actions needed to meet targets.
  4. Take action based on findings.

Example: If latency is rising and threatens to exceed the SLO:

  1. Hypothesize that servers are CPU-bound.
  2. Add more servers to balance the load.
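
A rough sketch of the four-step control loop above, applied to the latency example; the measurement and remediation hooks are hypothetical stand-ins.

    def slo_control_loop(measure_sli, slo_target: float, take_action) -> None:
        """One iteration: measure the SLI, compare to the SLO, decide, and act."""
        sli = measure_sli()                                  # 1. monitor and measure the SLI
        if sli < slo_target:                                 # 2. compare the SLI to the SLO
            action = "add servers to spread CPU-bound load"  # 3. decide what is needed
            take_action(action)                              # 4. act on the decision

    slo_control_loop(
        measure_sli=lambda: 0.9982,   # e.g. fraction of requests served under the latency threshold
        slo_target=0.999,
        take_action=print,
    )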

Publishing SLOs

Strategies:

SLOs as Prioritization Tools

SLA: Agreements in Practice

Crafting an SLA

SRE's Role in SLA Development:

Applying SLO Principles to SLAs

Being Conservative in SLAs

Best Practices:

Toil vs Engineering

What Is Toil?

Toil is not simply "work I don't like to do"; administrative overhead (covered below) is a separate category of work.

Toil is work tied to running a production service that is manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as the service grows.

Why Less Toil Is Better

Calculating Toil

Key Takeaways

  • Toil must be managed to ensure time is spent effectively on engineering.
  • Automation and strategy are critical tools for reducing toil.
  • Balancing toil reinforces the promise of SRE as an engineering-focused discipline.

What is Engineering?

Engineering work is novel work that requires human judgment, produces permanent improvements to the service, and is guided by a strategy.

Categories of Work

A. Engineering

  1. Software Engineering:

    • Writing or modifying code, including associated design and documentation.
    • Examples:
      • Writing automation scripts.
      • Creating tools or frameworks.
      • Adding features to improve scalability and reliability.
  2. Systems Engineering:

    • Configuring production systems or documenting changes for lasting benefits.
    • Examples:
      • Updating monitoring setups.
      • Configuring load balancers.
      • Consulting on architecture and productionization for Dev teams.

B. Not Engineering

  1. Overhead:

    • Administrative work not directly tied to production services.
    • Examples:
      • HR paperwork.
      • Meetings, training, or bug queue hygiene.
  2. Toil:

    • Manual, repetitive work tied to running a production service (e.g., handling alerts, manually executing scripts).

Balance Between Toil and Engineering

Is Toil Always Bad?

A. When Toil Is Acceptable:

B. When Toil Becomes Toxic:

  1. Career Stagnation:
    • Excessive toil leaves little time for impactful engineering work, limiting career growth.
  2. Low Morale:
    • Leads to burnout, boredom, and dissatisfaction.
  3. Organizational Impact:
    • Confusion: Dilutes the engineering focus of SRE.
    • Slowed Progress: Reduces feature velocity.
    • Bad Precedents: Encourages Devs to shift operational burdens to SRE.
    • Attrition: Talented engineers may leave for better opportunities.
    • Breach of Faith: Betrays the promise of a balanced engineering role for new hires.

Reducing Toil: A Collective Effort

Conclusion: Let’s invent more and toil less.

Monitoring Distributed Systems

Definitions

Monitoring

Types of Monitoring

  1. White-box Monitoring:

    • Based on internal metrics exposed by the system.
    • Sources include:
      • Logs
      • Interfaces (e.g., JVM Profiling Interface)
      • HTTP endpoints for internal statistics.
  2. Black-box Monitoring:

    • Tests externally visible behavior as a user would experience it.

Alert

Root Cause

Node/Machine

Push

Reasons to Monitor

  1. Analyze Long-Term Trends:

    • Database size and growth rate.
    • Daily active user count trends.
  2. Compare Over Time/Experiment Groups:

    • Performance differences between technologies (e.g., DB versions).
    • Impact of changes (e.g., adding a memcache node).
  3. Conduct Ad Hoc Debugging:

    • Investigate latency spikes or related events.
  4. Build Dashboards:

    • Visualize core metrics using the Four Golden Signals:
      • Latency
      • Traffic
      • Errors
      • Saturation
  5. Alerting:

    • Notify when:
      • Something is broken: Immediate human intervention required.
      • Something may break soon: Preemptive action recommended.

Effective Alerting

Alerting Costs

Conclusion

Monitoring and alerting are essential for:

Setting Reasonable Expectations for Monitoring

Key Takeaways


Design Principles for Monitoring

  1. Avoid Over-Reliance on Manual Observation:

    • Monitoring should not require humans to “stare at a screen” for issues.
    • Use automated alerts and dashboards for immediate problem detection.
  2. Keep Rules Simple and Clear:

    • Rules should detect unexpected changes with minimal complexity.
    • Example: Simple rules for end-user request rate anomalies are effective and quick.
  3. Understand System Use Cases:

    • Critical Monitoring (e.g., alerting):
      • Must be robust and simple to minimize noise.
    • Secondary Monitoring (e.g., capacity planning):
      • Can tolerate higher complexity and fragility.
  4. Limit Dependency Hierarchies:

    • Complex dependency-based rules (e.g., “if X, then Y”) are avoided.
    • Stable, well-defined dependencies (e.g., datacenter drain rules) are exceptions.

Goals for Monitoring Systems


Symptoms vs. Causes

Monitoring must answer:

Examples

  • Symptom: HTTP 500s or 404s. Cause: Database servers refusing connections.
  • Symptom: Slow responses. Cause: Overloaded CPUs or partial network packet loss.
  • Symptom: Users in Antarctica not receiving GIFs. Cause: CDN blacklisted client IPs.
  • Symptom: Private content is world-readable. Cause: Software push forgot ACLs.

Importance of "What" vs. "Why"


Black-Box vs. White-Box Monitoring

  • Focus: black-box monitoring covers symptoms; white-box monitoring covers causes and internal system insights.
  • Use Case: black-box detects current problems; white-box identifies imminent issues and root causes.
  • Telemetry: black-box offers minimal insight into internals; white-box inspects logs, metrics, or endpoints.
  • Discipline: black-box pages for ongoing, real problems only; white-box helps distinguish between symptoms and deeper issues.

Practical Considerations

  1. Telemetry for Debugging:

    • White-box monitoring is essential for debugging.
    • Example: Distinguish between:
      • Slow database server.
      • Network issues between web servers and the database.
  2. Paging Rules:

    • Black-box monitoring ensures paging is reserved for active issues with real impact.
    • Avoid using black-box monitoring for imminent, not-yet-occurring problems.
  3. Example for Combined Use:

    • Web servers slow on database-heavy requests:
      • White-box metrics: Database response times and server status.
      • Black-box results: End-user symptoms like HTTP errors.

Conclusion

Effective monitoring balances:

The Four Golden Signals

The four golden signals—latency, traffic, errors, and saturation—are critical metrics for monitoring user-facing systems. If you can monitor only four things, focus on these.


1. Latency


2. Traffic


3. Errors


4. Saturation


Practical Considerations

Latency Analysis: Worrying About the Tail

Paging and Monitoring


Conclusion

Focusing on latency, traffic, errors, and saturation provides a comprehensive, actionable framework for monitoring. By prioritizing these signals, teams can ensure a balance between simplicity, accuracy, and actionable insights, enabling proactive issue detection and resolution.

Choosing an Appropriate Resolution for Measurements

Granularity is critical when measuring different aspects of a system. The resolution should match the needs of the system while balancing cost, complexity, and usefulness.

Examples of Granularity Choices:

Cost vs. Detail:


As Simple as Possible, No Simpler

Monitoring systems must balance complexity with effectiveness.

Avoiding Over-Complexity:

Guidelines for Simplicity:


Monitoring and Alerting Philosophy

Company's approach provides a foundation for creating effective monitoring systems.

Key Questions for Alerts:

Pager Philosophy:

Focus on User Impact:


Practical Example: CPU Monitoring

  1. Granular Collection:
    • Record CPU utilization every second.
    • Increment a bucket (e.g., 0–5%, 6–10%, etc.) for each second.
  2. Aggregation:
    • Summarize bucket data every minute to balance cost and insight.
  3. Alerting:
    • Set thresholds for high CPU utilization trends (e.g., 95th percentile over 5 minutes).
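
A minimal sketch of the bucketing scheme above; the bucket boundaries and the per-minute summary format are assumptions made for illustration.

    import random

    # Upper bounds of utilization buckets: 0-5%, 6-10%, 11-25%, 26-50%, 51-75%, 76-90%, 91-100%.
    BUCKET_UPPER_BOUNDS = [5, 10, 25, 50, 75, 90, 100]

    def bucket_index(cpu_percent: float) -> int:
        """Index of the first bucket whose upper bound covers the sample."""
        for i, upper in enumerate(BUCKET_UPPER_BOUNDS):
            if cpu_percent <= upper:
                return i
        return len(BUCKET_UPPER_BOUNDS) - 1

    # Record one sample per second, then summarize the counts once per minute.
    counts = [0] * len(BUCKET_UPPER_BOUNDS)
    for _ in range(60):
        sample = random.uniform(0, 100)   # stand-in for a real per-second CPU reading
        counts[bucket_index(sample)] += 1

    print(dict(zip(BUCKET_UPPER_BOUNDS, counts)))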

Monitoring for the Long Term

Modern production systems are dynamic, with changing software architecture, load characteristics, and performance targets. As systems evolve, monitoring systems must adapt to track these changes. Alerts that were once rare and difficult to automate may become more frequent, potentially warranting automated scripts to resolve them.

Long-Term Monitoring Considerations:


Conclusion

A healthy monitoring and alerting pipeline should be simple and easy to reason about. It should focus on symptoms for paging and use cause-oriented heuristics as tools for debugging.

Long-Term Monitoring Goals:

Ultimately, a successful on-call rotation and product depend on choosing the right alerts, aligning targets with achievable goals, and enabling rapid diagnosis to maintain system health over time.

Release Engineering

Release engineering is a rapidly growing discipline within software engineering, primarily focused on building and delivering software. It encompasses a broad set of skills, including an understanding of:

Release engineers have expertise across multiple domains: development, configuration management, test integration, system administration, and customer support.

The Importance of Reliable Release Processes

Running reliable services requires reliable release processes. Site Reliability Engineers (SREs) need to ensure that the binaries and configurations they use are built in a reproducible, automated way. This ensures that releases are repeatable and not "unique snowflakes." Every change in the release process should be intentional, not accidental. SREs are concerned with the entire process, from source code to deployment.

The Role of a Release Engineer

At Company, release engineering is a distinct job function. Release engineers collaborate with software engineers (SWEs) and SREs to define all the steps necessary to release software, which includes:

Company is a data-driven company, and release engineering is no exception. The company uses tools to track key metrics such as:

These tools are often designed and developed by release engineers.

Release engineers also define best practices for using these tools to ensure consistent, repeatable releases. These best practices cover various aspects of the release process, including:

The goal is to make sure that tools behave correctly by default and are adequately documented, enabling teams to focus on features and user needs rather than reinventing release processes.

Collaboration Between Release Engineers and SREs

Company has a large number of SREs responsible for safely deploying products and maintaining service uptime. Release engineers and SREs work together to:

This collaboration ensures that the release process meets business requirements while maintaining service reliability.

Key Points from Release Engineering Philosophy

Self-Service Model

High Velocity

Hermetic Builds

Enforcement of Policies and Procedures

Packaging

Rapid System

Continuous Build and Deployment Process

  1. Building: Blaze compiles binaries and runs unit tests.
  2. Branching: Releases are created from specific branches, not directly from the mainline, to ensure consistency.
  3. Testing: Continuous test systems catch failures early.
  4. Deployment: MPM packages are deployed using Rapid, and for complex deployments, Sisyphus is used for zero-downtime updates.

Configuration Management

Key Points from Software Simplicity Philosophy

System Stability vs. Agility

The Virtue of Boring

Managing Code and Complexity

Minimal APIs

Modularity

Release Simplicity

SRE Practices

Monitoring

Incident Response

Development

Product

Further Reading from Company SRE
- Resilience Testing: Company conducts company-wide testing to ensure readiness for unexpected events.
- Capacity Planning: It’s vital for long-term stability and doesn’t require predicting the future.
- Network Security: New approach for securing corporate networks through device and user credentials instead of privileged intranets.

Incident Management

Postmortem and Root-Cause Analysis

Testing

Capacity Planning and Load Balancing

Addressing Cascading Failures

Being On-Call in SRE

Introduction

On-Call in IT Context

Key Differences of SRE Approach

Life of an On-Call Engineer

Key Responsibilities

Response Time and Service Availability

Non-Paging Events

On-Call Rotation

Balanced On-Call

Quantity of On-Call

Quality of On-Call

Compensation

Feeling Safe

Postmortem and Improvement

Avoiding Inappropriate Operational Load

Operational Overload

Operational Underload

Alert Management

Collaboration with Developer Teams

Conclusion

Effective Troubleshooting

Theory

Common Pitfalls

Ineffective troubleshooting is often caused by problems during the Triage, Examine, and Diagnose steps, typically due to insufficient system understanding.

Conclusion

A methodical, hypothesis-driven approach, combined with an understanding of common troubleshooting pitfalls, leads to more effective problem-solving. It's essential to test hypotheses and remain open to simpler, more probable causes to avoid wasting time on irrelevant or improbable theories.

Key Points for Troubleshooting in Complex Systems

Problem Report

Triage

Examine

Diagnose

Asking the Right Questions

Conclusion

Test and Treat

Experimental Method for Troubleshooting

Test Design Considerations

Documentation and Tracking

Negative Results Are Magic

Importance of Negative Results

The Role of Negative Results

Cure

Proving the Cause

Postmortem

Making Troubleshooting Easier

Fundamental Approaches to Troubleshooting

Preventing Troubleshooting Needs

Emergency Response: What to Do When Systems Break

Key Principles

Test-Induced Emergency Example

Findings: What Went Well

Lessons Learned

Change-Induced Emergency

Key Scenario

Findings: What Went Well

Lessons Learned

Key Takeaways

Process-Induced Emergency

Key Scenario

Findings: What Went Well

Lessons Learned

Key Takeaways

All Problems Have Solutions

Key Insights

Learning from the Past: Asking Big Questions

Key Takeaways

Conclusion

While each emergency case may have its own unique trigger, the common thread is that effective incident response, learning from failures, and proactive testing lead to continuous improvement. The lessons from Company’s experiences can be applied to organizations of any size to improve both processes and systems.

Managing Incidents

Unmanaged Incidents: A Case Study

Imagine being Mary, an on-call engineer for a company, when you suddenly face a cascading failure in your service:

  1. Problem Starts: Your service stops serving traffic in one datacenter, then another, and eventually all five.
  2. Traffic Overload: The remaining datacenters can't handle the load, causing further overload.
  3. Troubleshooting: You begin investigating logs, suspecting an error in a recent module update. Rolling back doesn’t help, so you call Josephine for assistance.
  4. Escalating Pressure: Your boss and other business leaders demand answers, putting pressure on you while you try to fix the issue. Meanwhile, others offer suggestions that are irrelevant or unhelpful.
  5. Further Complications: A colleague, Malcolm, decides to implement a quick fix (changing CPU affinity) without consulting anyone, making the situation worse.

Anatomy of an Unmanaged Incident

Despite everyone trying to do their job, the incident spirals out of control due to several key factors:

1. Sharp Focus on the Technical Problem

2. Poor Communication

3. Freelancing

Key Takeaways

Elements of Incident Management Process

Overview

A well-designed incident management process is essential to effectively handle incidents. Company's incident management system is based on the Incident Command System (ICS), which emphasizes clarity and scalability. Key features of an effective incident management process include clear roles, delegated responsibilities, and a well-defined command structure.

Key Features of Incident Management

1. Recursive Separation of Responsibilities

2. Roles in Incident Management

3. A Recognized Command Post

4. Live Incident State Document

5. Clear, Live Handoff

Key Takeaways

A Managed Incident

Scenario

Mary, the on-call engineer, receives an alert at 2 p.m. that one of the datacenters has stopped serving traffic. She begins investigating, and soon the second datacenter is also down. Knowing the importance of having a structured incident management system, Mary uses the framework to handle the situation effectively.

Incident Management Process

  1. Initial Response

    • Mary immediately informs Sabrina, asking her to take command of the incident.
    • Sabrina quickly gets a rundown of the situation from Mary and updates the incident status via email to the prearranged mailing list, keeping VPs informed.
    • Sabrina acknowledges that she can't yet scope the full impact and asks Mary for an assessment. Mary estimates that no users are impacted yet and hopes to avoid a third datacenter failure.
    • Sabrina records Mary’s response in a live incident document.
  2. Escalation and Communication

    • When the third datacenter fails, Sabrina updates the email thread with this new information, keeping the VPs updated without overwhelming them with technical details.
    • Sabrina asks an external communications representative to begin drafting user messaging and reaches out to the developer on-call (Josephine) for assistance, after getting approval from Mary.
  3. Collaboration and Documentation

    • Robin volunteers to help, and Sabrina reminds both Robin and Josephine to prioritize tasks assigned by Mary and to keep her informed of their actions.
    • Robin and Josephine read the live incident document to familiarize themselves with the current state of the incident.
    • After testing a previous release fix that didn’t work, Mary shares this with Robin, who updates IRC. Sabrina logs this update in the live incident document.
  4. Shift Change and Handoff

    • By 5 p.m., Sabrina starts organizing staff replacements for the evening shift and updates the incident document.
    • A phone conference is held at 5:45 p.m. to ensure everyone is on the same page regarding the incident status.
    • At 6 p.m., the incident is handed off to the team in the sister office.
  5. Post-Incident

    • When Mary returns the next morning, she learns that her colleagues have already mitigated the issue, closed the incident, and started working on the postmortem analysis.
    • Mary uses the lessons learned from the incident to plan structural improvements to prevent similar issues in the future.

Key Takeaways

This approach illustrates how an organized, managed incident can lead to quicker resolution and valuable learning for future improvements.

When to Declare an Incident

Key Guidelines for Declaring an Incident

It’s crucial to declare an incident early in order to respond more effectively. Delaying the declaration and letting a problem grow can lead to a chaotic response. To determine whether to declare an incident, the following conditions should be considered:

Importance of Proactive Incident Management

Incident management proficiency diminishes if it isn’t regularly practiced. To keep skills sharp, engineers can apply the incident management framework to operational changes that span across time zones or involve multiple teams. This ensures familiarity with the process when an actual incident arises.

Regular Use of Incident Management Framework

By integrating incident management into regular operations and testing, teams remain prepared to handle incidents efficiently when they arise.

Best Practices for Incident Management

1. Prioritize

2. Prepare

3. Trust

4. Introspect

5. Consider Alternatives

6. Practice

7. Change It Around

By following these best practices, teams can respond more effectively to incidents, reduce emotional stress, and improve their ability to handle future incidents.

Postmortem Philosophy

The primary goals of writing a postmortem are to:

While root-cause analysis techniques are extensive and can vary by service, the focus should always be on learning and improvement, not on punishment. Writing a postmortem is a learning opportunity for the entire organization.

Common Postmortem Triggers:

Postmortem Criteria:

It’s important to define when a postmortem is required before an incident occurs. Along with the common triggers, any stakeholder can request a postmortem.

Blameless Postmortems:

Postmortems Should Be Constructive:

Example of Blameless vs. Blaming Language:

Best Practices for Writing a Postmortem:

By embracing blameless postmortems, companies can turn incidents into valuable learning opportunities and improve their systems for the future.

Collaborate and Share Knowledge

Collaboration is a key aspect of the postmortem process, with a focus on real-time data collection, crowdsourced solutions, and broad knowledge-sharing.

Postmortem Workflow Features:

Formal Review Process:

Once the initial review is completed, the postmortem is shared more broadly with the larger engineering team or an internal mailing list to maximize learning.

Best Practice: No Postmortem Left Unreviewed

An unreviewed postmortem is as if it never existed. Regular review sessions are encouraged to ensure completeness and closure of discussions. These sessions help capture ideas and finalize the action items.

Postmortem Repository:

Incorporating a thorough and collaborative review process ensures that postmortems are effective, and their learnings are widely shared and applied across teams.

Introducing a Postmortem Culture

Creating a postmortem culture in an organization requires continuous cultivation and reinforcement. Senior management plays a vital role in encouraging and participating in the postmortem process, but ultimately, it is the engineers who must be self-motivated to embrace a blameless postmortem culture.

Key Activities to Reinforce Postmortem Culture:

Overcoming Challenges:

Introducing postmortems can face resistance, often due to the perceived cost of preparation. The following strategies help:

Best Practices:

Cultural Integration:

Postmortems are now an integral part of Company's culture, ensuring that any significant incident is followed by a comprehensive postmortem. This cultural norm ensures continuous improvement in the organization’s incident response and system resilience.

Testing for Reliability

Site Reliability Engineers (SREs) play a key role in quantifying confidence in the systems they manage. They do so by applying classical software testing techniques at scale to assess both past and future reliability.

Measuring Reliability:

Testing to Reduce Uncertainty:

Testing is used to demonstrate equivalence before and after a change, thereby reducing uncertainty. Thorough testing helps predict the future reliability of a system, and the level of testing required depends on the system's reliability requirements.

The Relationship Between Testing and System Changes:

Testing and Mean Time to Repair (MTTR):

The Impact of Testing on Release Velocity:

Testing Terminology:

Types of Software Testing

Software tests can be divided into traditional and production tests, each serving distinct purposes in the software development lifecycle.

Traditional Tests

  1. Unit Tests

    • Test the smallest, isolated unit of software (e.g., functions, classes).
    • Ensure correctness and can act as a specification for expected behavior.
    • Cheap and fast (run in milliseconds).
    • Often used for test-driven development.
  2. Integration Tests

    • Test interactions between components or units.
    • Verify that combined components function correctly.
    • Often use mocks for dependencies (e.g., mock databases).
  3. System Tests

    • Test the entire system, verifying end-to-end functionality.
    • Types of system tests:
      • Smoke Tests: Basic checks to ensure critical functionality works.
      • Performance Tests: Ensure the system's performance remains stable over time.
      • Regression Tests: Ensure that previously fixed bugs do not reappear.

Production Tests

  1. Configuration Tests

    • Verify that the production system is correctly configured.
    • Run outside the test environment, ensuring the system matches expected configurations.
    • Essential in distributed environments where production and development configurations may differ.
  2. Stress Tests

    • Push the system beyond its normal operating limits.
    • Help understand system limits (e.g., database load, server overload).
    • Identify failure points under extreme conditions.
  3. Canary Tests

    • Gradually roll out changes to a small subset of users or servers.
    • Acts as a form of user acceptance testing in a live environment.
    • Allows early detection of faults before full deployment.
    • Enables quick rollback if issues are detected.
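
A rough sketch of the decision a canary test enables; the error-rate comparison and the 2x tolerance are arbitrary assumptions, not a description of a specific canarying tool.

    def canary_looks_healthy(canary_errors: int, canary_requests: int,
                             baseline_errors: int, baseline_requests: int,
                             tolerance: float = 2.0) -> bool:
        """Pass the canary if its error rate is no worse than `tolerance` times the baseline's."""
        canary_rate = canary_errors / canary_requests
        baseline_rate = baseline_errors / baseline_requests
        return canary_rate <= baseline_rate * tolerance

    # 0.3% errors on the canary versus 0.1% on the baseline fails a 2x tolerance: roll back.
    print(canary_looks_healthy(30, 10_000, 100, 100_000))   # False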

Key Concepts

Creating a Test and Build Environment

When SREs join a project that is already in progress, comprehensive testing may not yet be in place. In such cases, prioritize building test coverage incrementally.

Steps to Begin Testing

  1. Prioritize the Codebase:

    • Rank components of the system by importance to identify the most critical areas to test.
    • Example: Mission-critical or business-critical code (e.g., billing) should be prioritized.
  2. Focus on APIs:

    • Test APIs that other teams rely on, as issues in these areas can cause significant downstream problems.
  3. Smoke Tests:

    • Implement low-effort, high-impact smoke tests for every release to catch obvious issues early.
    • This helps ship reliable software and improves the testing culture.
  4. Bug Reporting as Test Cases:

    • Convert every reported bug into a test case. Initially these tests fail; once the bug is fixed they pass and join a growing regression suite (see the sketch after this list).
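
As a hypothetical example of step 4: a bug report says a `slugify` helper drops digits when the input contains accented characters. The failing test is checked in first and becomes a permanent regression guard once the fix lands (pytest assumed; the module path and bug number are illustrative):

```python
# Regression test derived from a hypothetical bug report:
# "slugify() drops digits when the input contains accented characters."
from myproject.text import slugify  # hypothetical module under test

def test_slugify_keeps_digits_with_accented_input():
    # Fails until the bug is fixed, then guards against reoccurrence.
    assert slugify("Résumé 2024") == "resume-2024"
```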

Establishing Testing Infrastructure

  1. Source Control:

    • Set up a versioned source control system to track all changes in the codebase.
  2. Continuous Build System:

    • Add a continuous build system that builds the software and runs the tests on every code submission.
    • Ensure the system notifies engineers immediately when a change breaks the build, so the team can prioritize fixing it (a minimal gating sketch follows this list).
  3. Importance of Fixing Broken Builds:

    • Treat defects that break the build as high priority because:
      • Defects become harder to fix once more changes are layered on top of them.
      • Broken software slows down development.
      • Release cadences lose their value.
      • Emergency releases become more complex.
  4. Stability and Agility:

    • A stable and reliable build system fosters agility: developers can iterate faster when they can trust the build.
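
A continuous build system is usually an off-the-shelf CI service, but the core gate can be sketched as a small script that builds, runs the tests, and fails loudly. This is a sketch under assumptions: the `src` layout is hypothetical, and `notify_team` is a placeholder for your alerting integration:

```python
import subprocess
import sys

def run(step: str, cmd: list[str]) -> bool:
    """Run one build step, echoing the command so the CI log shows where it failed."""
    print(f"--- {step}: {' '.join(cmd)}")
    return subprocess.run(cmd).returncode == 0

def notify_team(message: str) -> None:
    # Placeholder: a real setup would page or post to the team's channel.
    print(f"NOTIFY: {message}", file=sys.stderr)

def main() -> int:
    steps = [
        ("build", ["python", "-m", "compileall", "src"]),   # assumed source layout
        ("tests", ["python", "-m", "pytest", "-q"]),
    ]
    for name, cmd in steps:
        if not run(name, cmd):
            notify_team(f"Build broken at step '{name}' - fix before submitting more changes.")
            return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```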

Tools for Optimizing Build Systems

Goal-Oriented Testing

Considerations for Different Software Types

Testing at Scale

To drive reliability at scale, SRE takes a systems perspective on testing: addressing dependencies and selecting effective testing environments.

Key Concepts of Testing at Scale

  1. Unit Tests and Dependencies:

    • A small unit test typically has a limited set of dependencies (e.g., source file, testing library, runtime libraries, hardware).
    • These dependencies should each have their own test coverage to ensure comprehensive testing.
    • If a unit test relies on an unchecked runtime library, unrelated changes in the environment can allow the test to pass despite faults in the code under test (see the sketch after this list).
  2. Release Tests and Transitive Dependencies:

    • A release test often depends on many components, potentially having a transitive dependency on every object in the code repository.
    • If the test requires a clean copy of the production environment, each small patch might necessitate a full disaster recovery iteration.
  3. Branch Points in Testing:

    • Testing environments aim to select branch points in versions and merges to minimize uncertainty.
    • This reduces the number of iterations needed to test and ensures that faults are detected more efficiently.
  4. Resolving Faults:

    • When uncertainty resolves into a fault, additional branch points need to be selected for further testing.
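
One lightweight way to keep environmental dependencies from silently changing test outcomes is to make the pinned runtime versions part of the test suite itself, so drift in the environment surfaces as an explicit failure rather than a misleading pass. The pinned interpreter and package versions below are purely illustrative:

```python
import sys
from importlib import metadata

# Hypothetical pins for this repository; drift fails loudly instead of silently.
EXPECTED = {
    "requests": "2.31.0",
    "protobuf": "4.25.1",
}

def test_runtime_dependencies_are_pinned():
    assert sys.version_info[:2] == (3, 11), "interpreter drifted from the pinned version"
    for package, want in EXPECTED.items():
        assert metadata.version(package) == want, f"{package} drifted from {want}"
```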

Testing Scalable Tools

SRE Tools and Their Testing Needs

SRE-developed tools are responsible for various tasks, such as:

Characteristics of SRE Tools:

Barrier Defenses Against Risky Software

Software that bypasses the usual, heavily tested APIs can create significant risk, especially when it interacts with live services. For example, a database engine may allow administrators to turn off transactions; a utility that relies on this setting could cause serious problems if it is accidentally run on a user-facing replica.

Strategies to Mitigate Risk:

  1. Use a separate tool to place a barrier in the replication configuration, ensuring the replica doesn’t pass its health check and is never released to users.
  2. Configure the risky software to check for this barrier at startup, so that it can only operate on replicas already fenced off from users (sketched below).
  3. Use the replica health validation tool to remove the barrier once it is no longer needed.
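
A sketch of steps 1 and 2 above, under the assumption that the barrier is a marker recorded in the replica's replication configuration; the `read_replication_config` helper and the `barrier` key are hypothetical:

```python
class BarrierNotSet(Exception):
    """Raised when a risky tool is pointed at a replica that may still serve users."""

def assert_replica_is_barriered(read_replication_config, replica: str) -> None:
    """Refuse to start unless the replica has been fenced off from users.

    read_replication_config(replica) -> dict is assumed to return the replica's
    replication settings, including a 'barrier' marker placed by the separate
    fencing tool (step 1). The risky utility calls this check on startup
    (step 2) so it can only ever touch unhealthy, unreleased replicas.
    """
    config = read_replication_config(replica)
    if not config.get("barrier"):
        raise BarrierNotSet(
            f"{replica} has no barrier set; it may still pass health checks "
            "and serve user traffic - refusing to run."
        )
```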

Testing Automation Tools

Automation tools in SRE are responsible for tasks such as:

Characteristics of Automation Tools:

Challenges in Testing Automation Tools:

Example of Automation Tool Interactions:

Testing Disaster

Disaster Recovery Tools

Many disaster recovery tools are designed to operate offline and are used for:

Benefits of Offline Tools:

Online Repair Tools

Online repair tools operate outside the mainstream API and require more complex testing. One of the challenges is ensuring that normal behavior (eventual consistency) interacts well with the repair process.
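
One common way to keep a repair process from fighting eventual consistency is to re-check an apparently inconsistent record after a grace period and repair it only if the discrepancy persists. This is a hedged sketch, not a prescribed design; the `read_primary`, `read_replica`, and `repair` callables are assumptions:

```python
import time

def repair_if_still_inconsistent(key, read_primary, read_replica, repair,
                                 grace_seconds: float = 30.0) -> bool:
    """Repair a replica record only if it stays inconsistent past the grace period.

    This avoids "repairing" records that are merely lagging and would have
    converged on their own under normal eventual consistency.
    """
    if read_primary(key) == read_replica(key):
        return False                      # already consistent, nothing to do

    time.sleep(grace_seconds)             # give normal replication a chance

    primary_value = read_primary(key)
    if primary_value == read_replica(key):
        return False                      # converged on its own; no repair
    repair(key, primary_value)            # still divergent: apply the repair
    return True
```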

Challenges with Online Repair Tools:

Using Statistical Tests

Statistical Testing Techniques:

The Need for Speed

Test Likelihood and Statistical Uncertainty

Testing Deadlines

Release Testing Process:

Error Rate Calculation:

Pushing to Production

Segregation of Testing Infrastructure and Production Configuration

SRE Model and the Impact of Segregation

Testing Distributed Architecture Migration

Expect Testing Fail

Evolution of Release Cadence

Reliability Management in SRE

Configuration Management

Handling Configuration Files

Conclusion

Integration Testing for Configuration Files

Risks of Configuration Files in Interpreted Languages

Risks of Custom Syntax and Text-Based Configuration Files

Role of Site Reliability Engineering (SRE)

Example of Configuration Failure

Conclusion

Production Probes

Risks Beyond Testing and Monitoring

While testing ensures that known data is handled correctly and monitoring confirms system behavior in the face of unknown user data, real risk management is more complex. It also requires coverage for the three sets of requests described in the next section.

Three Sets of Requests

When dealing with different software versions (older and newer), there are three types of requests:

  1. Known bad requests (expected to fail)
  2. Known good requests that can be replayed against production
  3. Known good requests that can’t be replayed against production (these need their own testing)
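
A probe driver can handle the three categories above explicitly: replay the safe requests against production, assert that known bad requests are rejected, and route non-replayable requests to a dedicated test environment. The sketch below is illustrative; the request fields and `send_to_prod` / `send_to_test_env` callables are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ProbeRequest:
    payload: dict
    expect_success: bool   # known good vs. known bad
    replayable: bool       # safe to send to production?

def run_probes(requests, send_to_prod, send_to_test_env) -> list[str]:
    """Run probe requests, returning human-readable descriptions of any failures."""
    failures = []
    for req in requests:
        # Non-replayable requests need their own testing path (step 3 above).
        target = send_to_prod if req.replayable else send_to_test_env
        ok = target(req.payload)
        if ok != req.expect_success:
            failures.append(f"unexpected result for {req.payload!r}: got ok={ok}")
    return failures
```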

Challenges with Different Versions

Fake Backend Versions in Release Testing

Monitoring and Rollout Automation

Role of Monitoring Probes

Handling Failures in Probes

Benefits of Production Probes

Conclusion


Glossary

| Term | Definition |
| --- | --- |
| Availability | The proportion of time a system is operational and accessible. |
| SLO (Service Level Objective) | A target level of service for a given metric, often expressed as a percentage. |
| SLA (Service Level Agreement) | A formal agreement that outlines the expected level of service between parties. |
| SLI (Service Level Indicator) | A metric that measures the level of service provided. |
| Error Budget | The amount of downtime or errors a service is allowed within a given period while still meeting the SLO. |
| Incident | An event or series of events that disrupts the normal operation of a service. |
| MTTR (Mean Time to Recovery) | The average time taken to recover from an incident. |
| MTTF (Mean Time to Failure) | The average time a service or system operates before failing. |
| MTBF (Mean Time Between Failures) | The average time between two consecutive failures of a system. |
| Postmortem | A detailed report and analysis after an incident to understand what went wrong and prevent future occurrences. |
| Root Cause Analysis (RCA) | The process of identifying the underlying cause of an incident or failure. |
| Redundancy | The practice of having multiple, backup components to avoid single points of failure. |
| Load Balancing | Distributing workloads across multiple resources to ensure efficient usage and prevent overloading a single resource. |
| Chaos Engineering | The practice of intentionally introducing failures to test how well a system can withstand and recover from them. |
| Scalability | The ability of a system to handle increased load by adding resources or optimizing performance. |
| Capacity Planning | The process of determining the optimal resources (e.g., CPU, memory, storage) needed to handle a system’s workload. |
| Resilience | The ability of a system to handle failures or stress without compromising its functionality or performance. |
| Observability | The ability to measure, monitor, and analyze a system’s internal state using logs, metrics, and traces. |
| Monitoring | The practice of continuously collecting and analyzing data from systems to ensure they are functioning as expected. |
| Alerting | The process of notifying teams about issues or anomalies detected in systems. |
| Automation | The use of software tools to automatically perform repetitive tasks like deployment, testing, or recovery. |
| DevOps | A cultural and technical movement aimed at improving collaboration and automation between development and operations teams. |
| CI/CD (Continuous Integration / Continuous Deployment) | Practices that automate the integration of code changes and the deployment of those changes to production. |
| Reliability | The probability of a system operating without failure over a given time period. |
| Latency | The time delay between a request being sent and the corresponding response being received. |
| Throughput | The rate at which a system can process requests or data. |
| Traffic Shaping | The practice of managing the amount and type of traffic sent through a system to avoid overloading it. |
| On-Call | A practice where engineers or support teams are available to respond to incidents or service disruptions. |
| Canary Releases | A software release strategy where a new version is rolled out to a small subset of users to test it before a full-scale release. |
| Blue/Green Deployment | A deployment method where two identical production environments (blue and green) are used, allowing for easy switching between them. |
| Zero-Downtime Deployment | A deployment process that ensures no downtime during the deployment of new software or infrastructure changes. |
| Distributed System | A system in which components are located on different networked machines, communicating and coordinating with each other. |
| Service Mesh | A dedicated infrastructure layer that handles service-to-service communication in a microservices architecture. |
| Microservices | A design pattern in which an application is broken down into smaller, independent services that can be deployed and scaled independently. |
| Auto-scaling | The process of automatically adjusting the number of active servers or resources in response to traffic or load. |