Carlos Aguni

Highly motivated self-taught IT analyst. Always learning and ready to explore new skills. An eternal apprentice.


Devops Institute - SRE Foundation Certification

15 Nov 2021 »

file.pdf

  • Certifications
    • Site Reliability Engineering (SRE) Foundation℠
      • https://www.devopsinstitute.com/certifications/sre-foundation/
    • Site Reliability Engineering (SRE) Practitioner
      • https://www.devopsinstitute.com/certifications/sre-practitioner/

Site Reliability Engineering (SRE) Foundation℠

What Skills & Knowledge Will You Validate?

  • SRE Principles & Practices
  • Service Level Objectives & Error Budgets
  • Reducing Toil
  • Monitoring & Service Level Indicators
  • SRE Tools & Automation
  • Anti-Fragility & Learning from Failure
  • Organizational Impact of SRE
  • SRE, Other Frameworks, The Future

  • Benefits for Organizations

format of the examination

question weighting

conceptand terminology list

  • Resources
    • https://dev.to/squadcast/golden-signals-monitoring-from-first-principles-kpn
    • Books
      • Livro SRE (Google) https://landing.google.com/sre/sre-book/toc/index.html
      • Livro de trabalho SRE (Google publicação O’Reilly) https://landing.google.com/sre/workbook/toc/
  • Resources
    • Quode SRE
      • file.pdf
    • DOFD Reference Card
      • file.pdf
    • Master DevOps Glossary
      • file.pdf
    • Seeking SRE
      • file.pdf
    • Site Reliability Engineer
      • file.pdf
    • The Site Reliability Workbook
      • file.pdf

Cert Prep Foundation

  • Which of the following ITSM processes are most critical to DevOps?
    • Change, Release, and Service Asset and Configuration Management
  • Which of the following is not a goal of Devops leadership?
    • Control and evaluate workers using metrics
    • The job of DevOps leadership is to help organizations become better at self-diagnosis, self-improvement, and to make sure that local discoveries can be translated and converted to global improvements.
  • Which of the Three Ways encourages peer review of production changes?
    • The Second Way
    • The Second Way is Feedback.
    • In order for this to be successful, the feedback loops should be short so that information is received in a timely manner. Feedback should also be visible (e.g., via dashboards) to amplify the feedback. Without this feedback, teams lack the information needed to continually improve. With shorter feedback loops - sharing and knowledge grow, and so trust grows.
    • As a result we better understand our customers – both internal and external. We fix defects faster and are better able to prevent problems. Processes are improved resulting in better flow, faster delivery speeds and ultimately there is less reactive, unplanned work.
  • A small group of individuals recently returned from a conference where they learned about DevOps. They cannot agree on how to get started. Where should an IT organization start when adopting DevOps practices?
    • Understand why the organization exists
    • The “Why” is the center of the Golden Circle, and understanding why will determine the what and the how.
  • Which choice is a characteristic of a DevOps culture?
    • Shared vision, goals, and incentives
  • Which statement about DevOps teams is the most accurate?
    • They should have shared accountabilities
    • Expand upon the concept of an Agile or Scrum team
    • Embed Dev and Ops skills into a single holistic group
    • May be temporary or dedicated to a specific product
    • May be cross-functional ‘tiger teams’ for short-term projects
    • May evolve to provide shared services
    • Have shared accountabilities
    • Should adhere to the defined standards for development, automation, risk and compliance that applies to all DevOps teams
  • An organization is implementing DevOps. The developers are concerned that their ITSM processes are too complex, slow, and will not support DevOps principles and practices. Which IT framework will help the organization instill agile thinking into existing ITSM processes?
    • Agile service management
    • Agile service management is a framework that ensures that ITSM processes reflect agile values and are designed with “just enough” control and structure in order to effectively and efficiently deliver services that facilitate customer outcomes when and how they are needed.
  • An organization has identified they have a culture of blame and fear, where incidents are not valued and failure is not embraced as a learning opportunity. There are many single points of failure and employees suffer daily as a result of the fragility of the systems, enduring painful war-rooms during frequent outages. What should this organization look to in order to improve the situation?
    • Safety culture in DevOps is about both psychological safety, rewarding the right behaviors and being fearless about reporting problems, and systemic safety, using techniques and tooling to prevent, preempt and predict and remediate failure.
  • An organization is trying to overcome the challenges of their legacy silo culture where teams have been organized by subject matter expertise. What is this organization suffering from?
    • Cultural debt
    • Cultural debt happens when:
      • Silos become impenetrable (fiefdoms) Companies hire the wrong people
      • Employees don’t feel empowered
      • Employees don’t feel their contributions matter
      • Peoples’ contributions aren’t acknowledged
      • People aren’t given the time or resources needed to make improvements
      • Feedback loops are negative or non-existent
      • Information is hidden
  • which choice best describes change fatigue?
    • apathy
    • Change fatigue is a general sense of apathy or passive resignation towards organizational changes by individuals or teams.
  • What statement about Kanban is correct?
    • It pulls work through a process.
    • Kanban is a method of work that pulls the flow of work through a process at a manageable pace.
  • When trying to effect major change, who should be engaged in planning activities and serve as change agents?
    • Early adopters
    • Early adopters are people who represent opinion leaders. They enjoy leadership roles, and embrace change opportunities. They are already aware of the need to change and so are very comfortable adopting new ideas. Strategies to appeal to this population include how-to manuals and information sheets on implementation. They do not need information to convince them to change.
  • What are ‘The Three Ways?’
    • The key principles of DevOps
    • In The Three Ways, the first way is Flow, or understanding and increasing the flow of work from Dev to Ops. The second way is Feedback, creating short feedback loops that enable continuous improvement from Ops to Dev. The third way is continuous experimentation and learning. Creating a culture that fosters experimentation, taking risks and learning from failure.
  • An organization has just completed the deployment of a pilot release using DevOps practices and a preliminary deployment pipeline. Which metric would provide the most information to help them continually improve?
    • Mean Time to Repair (MTTR)
    • Change lead and cycle times
    • Knowledge sharing
    • Mean Time to Repair is the average time required to repair a failed component or device. MTTR does not include the time required to recover or restore service. Change lead time is a measure of the time from a request for a change to delivery of the change. Cycle times a measure of the time from start of work to ready for delivery.
  • The theory of constraints supports which of the Three Ways?
    • The third way
    • The Theory of Constraints is a methodology for identifying the most important limiting factor (i.e., constraint) that stands in the way of achieving a goal and then systematically improving that constraint until it is no longer the limiting factor.

Catchpoint Survey

https://www.catchpoint.com/blog/site-reliability-engineer

I view availability as the system up or down, latency as delays before a response is generated and end-user response time as how long before the user received the information they wanted

VALET

https://sre.google/workbook/slo-engineering-case-studies/

VALET

  • Volume (traffic)
    • How much business volume can my service handle?
  • Availability
    • Is the service up when I need it?
  • Latency
    • Does the service respond fast when I use it?
  • Errors
    • Does the service throw an error when I use it?
  • Tickets
    • Does the service require manual intervention to complete my request?

Westrum Organization Culture

Pathological, Bureaucratic, Generative

Wikipedia High Availability

https://en.wikipedia.org/wiki/High_availability

Devops exam

  1. Which of the following is the BEST definition of SRE?
    A. What happens when a software engineer is tasked with doing operations
    B. A discipline that incorporates aspects of software engineering and applies them to
    infrastructure and operations problems
    C. A role that handles post-production operational work such as managing SLOs and
    Error Budgets
    D. A discipline that incorporates aspects of software engineering and applies them to
    post-production management of services
  2. Which of the following is NOT a pillar of DevOps according to Google?
    A. Reduce organizational silos
    B. Accept failure as normal
    C. Implement gradual changes
    D. Increase flow from right to left
  3. What are services “managed to” in SRE?
    A. Service Level Objectives (SLOs)
    B. Service Level Indicators
    C. Error Budgets
    D. Service Level Agreements
  4. Cost of failure can be reduced by?
    A. Improving Mean Time to Repair (MTTR)
    B. Making smaller changes
    C. Canary deployments
    D. All of the Above
  5. Which of the following is the most widely tracked Service Level Objectives (SLOs)?
    A. Change failure rate
    B. Security
    C. Availability
    D. Capacity
  6. According to the Catchpoint survey, what is the definition of “latency”?
    A. Delays associated with downtime
    B. The total time it takes from when a user makes a request until they receive a
    response
    C. The difference between the Service Level Objective and the Error Budget
    D. The delay incurred in communicating a message
  7. A team has a monthly Service Level Objective for availability of 99.9%. How much time
    is allocated for their Error Budget?
    A. 45 minutes
    B. 43 minutes
    C. 25 minutes
    D. 32 minutes
  8. What does the “T” represent in the SLO VALET model?
    A. Traffic
    B. Tickets
    C. Testing
    D. Time to Repair
  9. A travel website used by millions of customers experience a one-hour outage which
    exceeds the agreed Error Budget. What could be one possible impact on the business?
    A. Failure of related applications
    B. Employee distractions
    C. Missed Service Level Indicators
    D. Social media backlash
  10. The Error Budget should be burned to zero every month?
    A. TRUE
    B. FALSE
  11. A healthcare company SRE team have started exploring which operational tasks they
    can possibly automate. What would be an example of an automatable task??
    A. Regular work
    B. Blameless post-mortems
    C. Physical meetings to approve production deployments
    D. Planning sessions
  12. Experimentation to solve a problem is a type of toil?
    A. TRUE
    B. FALSE
  13. A twenty-year-old software company is having a difficult time staying competitive with
    so many new startups. It takes too long to bring new features to market and those that are
    released often have errors. What practice would help the company go faster with higher
    quality?
    A. Service Level Objectives (SLOs)
    B. Reduction of Toil
    C. Managing Error Budgets
    D. Auto scaling infrastructure
  14. At least 50% of a SRE’s time should be spent on engineering work that:
    A. Makes improvements to a service
    B. Adheres to error budget policies
    C. Improves service support
    D. Reduces costs
  15. What is one of the reasons that Google introduced the 50% rule for engineering work for
    SREs?
    A. To provide time for SREs to learn from developers
    B. To ensure that one team or person has enough engineering skill to help automate
    toil
    C. To ensure that one team or person does not become “ops”
    D. To ensure that there is adequate time to evaluate and implement automation
  16. Which of the following describes a Service Level Indicator?
    A. A target level of reliability for a service
    B. Data that indicates whether a service level objective is being met
    C. A formal contract that defines service levels
    D. A service level requirement that is defined by the customer
  17. While many numbers can serve as a Service Level Indicator, Google recommends
    treating SLIs as a:
    A. Formula
    B. Automated algorithm
    C. Ratio
    D. Observable data
  18. An automated process that collects and aggregates data from multiple remote end
    points is known as
    A. Alerting
    B. Observability
    C. Telemetry
    D. Application Performance Management
  19. What is the goal of Application Performance Management?
    A. The monitoring and aggregation of remote data for application aspects such
    network, servers, application software
    B. The monitoring and management of performance and availability
    of software applications
    C. The use of a hardware or software component to monitor system resources and
    performance
    D. Anomaly detection
  20. Which of the following is NOT part of a monitoring approach?
    A. Establishing a rule of what is right and what is wrong
    B. Aggregation across a time horizon graphing at an appropriate scale
    C. Dashboards and displays of SLO’s and associated SLI’s
    D. Incident ticket system to track incident response
  21. An operations team is implementing a new monitoring solution and must agree on the
    CPU threshold that would generate an alert. What is this an example of?
    A. Graphing
    B. Dashboarding
    C. Incident identification
    D. Anomaly detection
  22. Which of the following is the definition for Telemetry?
    A. The monitoring and management of performance and availability
    of software applications
    B. The use of a hardware or software component to monitor the system resources and
    performance of a computer system
    C. Ways for engineers to communicate quantitative data about systems
    D. The highly automated communications process by which measurements are made
    and other data collected at remote or inaccessible points and transmitted to
    receiving equipment for monitoring.
  23. Which of the following is true about SRE led service automation
    A. Testing is focused on things that are known to go wrong
    B. It is encouraged that more and more features are pushed production without
    consideration of the impact
    C. Environments must be provisioned as infrastructure-(or configuration)-as-code to
    support consistency
    D. Automated deployments are replaced by manual deployments done by SREs
  24. What type of testing should SREs perform in production?
    A. Functional testing
    B. Non-functional testing
    C. User requirements testing
    D. Functional and non-functional testing
  25. What is the primary benefit of Infrastructure as Code?
    A. Developers can test their code in production since infrastructure code is treated the
    same as application code
    B. The same code is used to create development, test, staging and production so all
    environments are consistent
    C. It sets the stage for containerization
    D. It allows SREs to increase reliability even with more features being pushed into
    production faster
  26. Why is it important to have instrumentation in place to make a service externally
    visible?
    A. To ensure that the correct data and service level indicators are being returned and
    that log files are being generated and stored
    B. To ensure that pre-prod testing and production testing are identical in order to
    validate results
    C. To ensure that the DevOps pipeline is focused on increasing reliability through better
    production testing
    D. To ensure that developers and business stakeholders can access the service
  27. Which of the following is NOT a benefit of outlining the future growth envelope?
    A. Toil is reduced upfront
    B. Rework is reduced
    C. Security and audit events are centralized
    D. The total cost of service ownership is reduced
  28. Which of the following is NOT a benefit of provisioning with Infrastructure as Code?
    A. It would be easier to test and audit changes in production
    B. All environments pre-production and post-production would be consistent
    C. Developers need to understand environments in order to write infrastructure code
    D. It would be easier to identify errors in test environments
  29. If an organization is struggling with adopting a learning from failure culture, what should
    they do?
    A. Do more testing before releasing
    B. Do more retrospectives
    C. Investigate the cost of downtime
    D. Invest in specialized training
  30. In the Westrum study, what type of organization causes an enquiry into failure?
    A. Pathological
    B. Generative
    C. Bureaucratic
    D. Learning
  31. What is the primary benefit of doing regular fire drills?
    A. To ensure that the organization can introduce chaos engineering
    B. To ensure that funding for the business continuity plan is adequate
    C. To ensure that the business can continue to operate during unforeseen events or
    failures
    D. To identify production incidents before they become major incidents
  32. What is the best definition of Chaos Engineering?
    A. A discipline of randomly shutting servers or infrastructure down in order to practice
    recovery in an unexpected outage.
    B. A discipline of engaging in frequent fire drills to ensure that the business can recover
    in the event of a major outage
    C. A discipline of experimenting on a software system in production in order to build
    confidence in the system’s capability to withstand turbulent and unexpected
    conditions
    D. A discipline that incorporates aspects of software engineering and applies them to
    the post-production management of services
  33. Which of the following describes a Full SRE approach to SRE adoption?
    A. Ownership of SRE is with the service/delivery teams
    B. There is shared responsibility and shared on-call between SRE and development
    teams
    C. Full ownership of operational practices is with the SRE team
    D. Mixing Operations and Engineering teams
  34. An on-call program is essential to ensure Service Level Objectives are met. Who
    should participate in the on-call program?
    A. Everyone
    B. SRE teams and Incident Response teams
    C. The Service Desk and SRE teams
    D. Representatives from Operations and Engineering teams
  35. What percentage of time does Google advocate for on-call time for SREs?
    A. 25%
    B. 50%
    C. 30%
    D. 10%
  36. What is one of the key outputs of a blameless post mortem?
    A. A list of follow up actions to mitigate future similar incidents
    B. A list of tasks that SREs can take to reduce toil in order to mitigate similar incidents
    C. A discussion about the resources that are needed to avoid future similar incidents
    D. A list of Service Level Objectives and how they were impacted by the incident
  37. What aspect of ITIL/ITSM can particularly support SRE?
    A. The Change Advisory Board (CAB)
    B. The Service Value System
    C. The Service Desk
    D. Process models
  38. What is the primary responsibility of a Database Reliability Engineer?
    A. Helping to move organizations from a bureaucratic to generative flow of
    information
    B. To ensure data reliability by performing frequent fire drills and data integrity tests
    C. To apply chaos engineering to the database to confirm failover and restore
    responses
    D. To ensure that data is adequately backed up and recoverable
  39. Legacy applications can also benefit from SRE principles and practices. What Service
    Level Objectives could be established to focus on legacy processing?
    A. Number of legacy applications
    B. Lines of COBOL code
    C. Mainframe processor speed
    D. Batch processing time
  40. Which of the following is NOT a key advantage of aligning SRE and Agile practices?
    A. The Definition of Done will be clearer and have a more end to end perspective
    B. Customers will realize more value from working software
    C. Backlogs of toil make work visible and can help prioritize engineering and
    automation stories
    D. SREs will be embedded into Agile or Scrum Teams