Carlos Aguni

Highly motivated self-taught IT analyst. Always learning and ready to explore new skills. An eternal apprentice.


Observability Study

01 Dec 2021

Prologue

https://wentzwu.com/2020/08/24/microservices-containerization-and-serverless/

https://www.akana.com/resources/microservices-why-should-businesses-care

Concept

https://passionateaboutoss.com/getting-confused-by-key-assurance-metrics/

  • MTBF (Mean Time Between Failures) – the average elapsed time between failures of a system, service or device. It’s the basic measure of availability / reliability of the system / service / device. The higher, the better.
  • MTTR (Mean Time to Repair) – generally used to denote the average time to close a trouble ticket (to repair a failed system / service / device). It’s the basic measure of corrective action efficiency. The lower, the better.

Some also use MTTR to mean Mean Time to Recover / Resolve (i.e. MTTD + MTTR in the source article's diagram) or Mean Time to Respond (the MTTD interval in that diagram, spent acknowledging an event and creating a ticket). See why I get confused?

  • MTTD (Mean Time to Detect / Diagnose) – the average time taken from when an event is first generated and timestamped to when the NOC detects / diagnoses the cause and generates a ticket. The lower, the better.

  • MTTF (Mean Time to Failure) – the average system / service / device up-time
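
The four definitions above can be made concrete with a few timestamps. The following is a minimal sketch, assuming a hypothetical `Incident` record with three timestamps (event start, ticket creation, ticket close) and made-up sample data. It takes MTTR as ticket-open to ticket-close, which is only one of the competing readings mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record layout, not taken from any ticketing tool.
    started: datetime   # event first generated and timestamped
    detected: datetime  # cause diagnosed and ticket created
    resolved: datetime  # ticket closed, service repaired


def _mean_delta(deltas):
    """Average a sequence of timedeltas."""
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mttd(incidents):
    """Mean Time to Detect: event start to ticket creation."""
    return _mean_delta(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean Time to Repair: ticket creation to ticket close."""
    return _mean_delta(i.resolved - i.detected for i in incidents)


def mtbf(incidents):
    """Mean Time Between Failures: start of one failure to start of the next."""
    ordered = sorted(incidents, key=lambda i: i.started)
    return _mean_delta(b.started - a.started for a, b in zip(ordered, ordered[1:]))


def mttf(incidents):
    """Mean Time to Failure: up-time from one repair to the next failure."""
    ordered = sorted(incidents, key=lambda i: i.started)
    return _mean_delta(b.started - a.resolved for a, b in zip(ordered, ordered[1:]))


# Made-up sample data for illustration only.
incidents = [
    Incident(datetime(2021, 12, 1, 0, 0), datetime(2021, 12, 1, 0, 10), datetime(2021, 12, 1, 1, 0)),
    Incident(datetime(2021, 12, 2, 0, 0), datetime(2021, 12, 2, 0, 20), datetime(2021, 12, 2, 0, 50)),
    Incident(datetime(2021, 12, 4, 0, 0), datetime(2021, 12, 4, 0, 30), datetime(2021, 12, 4, 1, 40)),
]

print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents))
print("MTTF:", mttf(incidents))
```

With these definitions, the gap between two failure starts is the down-time of the first incident plus the up-time that follows it, which is why MTBF comes out slightly larger than MTTF.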

https://www.atlassian.com/incident-management/kpis/common-metrics

  • MTTR (mean time to respond) is the average time it takes to recover from a product or system failure from the time when you are first alerted to that failure.
  • MTTA (mean time to acknowledge) is the average time it takes from when an alert is triggered to when work begins on the issue.
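
The two alert-based metrics above both start the clock at the alert rather than at the failure itself. A minimal sketch, assuming each alert is a hypothetical `(alerted, acknowledged, recovered)` tuple with made-up values:

```python
from datetime import datetime
from statistics import mean

# Hypothetical alert log: (alerted, acknowledged, recovered).
alerts = [
    (datetime(2021, 12, 1, 9, 0), datetime(2021, 12, 1, 9, 5), datetime(2021, 12, 1, 9, 45)),
    (datetime(2021, 12, 1, 14, 0), datetime(2021, 12, 1, 14, 15), datetime(2021, 12, 1, 15, 15)),
]


def mean_minutes(intervals):
    """Average length, in minutes, of (start, end) intervals."""
    return mean((end - start).total_seconds() for start, end in intervals) / 60


# MTTA: alert triggered -> work begins on the issue.
mtta_minutes = mean_minutes((alerted, acked) for alerted, acked, _ in alerts)

# MTTR (respond): first alerted -> recovered from the failure.
mttr_respond_minutes = mean_minutes((alerted, recovered) for alerted, _, recovered in alerts)

print(f"MTTA: {mtta_minutes:.1f} min, MTTR (respond): {mttr_respond_minutes:.1f} min")
```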

https://www.guyfighel.com/blog/tag/SRE

AWS re:Invent 2019: Cut through the chaos: Gain operational visibility and insight (MGT301-R1)

file.pdf

types of monitoring blackbox vs whitebox

https://www.meshcloud.io/2020/08/28/multi-cloud-monitoring-a-cloud-security-essential/

Spaghetti Dash

https://github.com/leeoniya/uPlot/issues/108

https://twitter.com/suprememoocow/status/1392065948845215746/photo/1

https://twitter.com/suprememoocow/status/1398236948502827011/photo/1

Why Dashboards Are Useless and Observability Is the New Buzzword

https://pt.slideshare.net/timetrix/why-dashboards-are-useless-and-observability-is-the-new-buzzword

Example Dash

https://indico.cern.ch/event/639271/contributions/2591582/attachments/1460975/2256862/2017_05_WLCGTransfers.pdf

linkedin

Dashboards Observability Perks

https://docs.splunk.com/Documentation/ITSI/4.10.2/SI/DeepDives

Phantom Cyber Corporation Splunk Computer Software Orchestration Privately Held Company

https://splunkbase.splunk.com/app/4261/

https://twitter.com/mattdavies_uk/status/932906148155527168

https://twitter.com/RKela/status/1159117734426398720

Grafana Many Datasources

https://blog.zabbix.com/configuring-grafana-with-zabbix/8007/

Netflix Edgar

https://netflixtechblog.com/edgar-solving-mysteries-faster-with-observability-e1a76302c71f

Availability

https://manishsharma.blog/2020/02/04/design-for-availability-game-of-9s/
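
The "game of 9s" comes down to simple arithmetic: each extra nine cuts the permitted down-time by a factor of ten. The sketch below is an illustration under the assumption of a 365-day year, with availability taken as up-time divided by total time. The function names are made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Down-time budget implied by an availability target over a period."""
    return period_minutes * (1 - availability_pct / 100)


def availability(mttf_hours, mttr_hours):
    """Steady-state availability: up-time / (up-time + down-time)."""
    return mttf_hours / (mttf_hours + mttr_hours)


for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes of down-time per year")

print(f"MTTF 999 h, MTTR 1 h -> availability {availability(999, 1):.3%}")
```

Three nines (99.9%) leaves roughly 8.76 hours of down-time per year, and four nines leaves under an hour.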

SLOs SLIs SLAs

distributed tracing dapper jaeger

https://logz.io/blog/distributed-tracing-dapper-jaeger/

https://eng.uber.com/microservice-architecture/

Uber’s microservice architecture circa mid-2018 from Jaeger

monitoring data quality at scale

https://eng.uber.com/monitoring-data-quality-at-scale/

cat pink

cat blue

Critical Path Analysis for Microservice Architectures

https://eng.uber.com/crisp-critical-path-analysis-for-microservice-architectures/

Introducing uGroup: Uber’s Consumer Management Framework

https://eng.uber.com/introducing-ugroup-ubers-consumer-management-framework/

Capacity Planning

A Systematic Approach to Capacity Planning in the Real World

https://www.slideshare.net/arunkejariwal/a-systematic-approach-to-capacity-planning-in-the-real-world

  • Metrics
    • Average, Standard deviation, 95th, 99th percentile
  • Techniques
    • Moving Average - EMA (exponential moving average)
    • Correlation
    • Beta Analysis
    • MACD
    • Forecasting - ARIMA
  • Limitations
    • Changing usage patterns
      • Organic growth, behavioral, cultural
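
Several of the metrics and techniques listed above fit in a few lines of plain Python. This is a rough sketch, not the method from the slides: the percentile uses the nearest-rank convention, the EMA is seeded with the first sample, and the MACD periods (12, 26, 9) are the conventional defaults rather than values from the talk. ARIMA forecasting is left out because it needs a dedicated statistics library.

```python
import math
from statistics import mean, pstdev


def percentile(values, p):
    """Nearest-rank percentile (one of several common conventions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2 / (span + 1)
    out = [values[0]]  # seed with the first sample (an assumption)
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


def macd(values, fast=12, slow=26, signal=9):
    """MACD line (fast EMA minus slow EMA) and its signal line."""
    macd_line = [f - s for f, s in zip(ema(values, fast), ema(values, slow))]
    return macd_line, ema(macd_line, signal)


# Made-up daily request counts, growing steadily.
requests = [1000 + 25 * day for day in range(60)]

print("mean:", mean(requests), "stdev:", round(pstdev(requests), 1))
print("p95:", percentile(requests, 95), "p99:", percentile(requests, 99))
macd_line, signal_line = macd(requests)
print("MACD above zero (load trending up):", macd_line[-1] > 0)
```

On a steadily growing series the fast EMA stays above the slow one, so the MACD line is positive. A crossing below the signal line would hint that growth is slowing, which ties into the "changing usage patterns" limitation.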

Beta analysis

Rolling Beta
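
Beta is usually defined as the covariance of a series with a reference series divided by the variance of the reference. A rolling beta recomputes that over a sliding window. The sketch below assumes that definition and uses made-up data. Which series serves as the reference (for example overall traffic) is a choice this example does not take from the slides.

```python
from statistics import mean


def beta(reference, metric):
    """Slope of metric against reference: cov(reference, metric) / var(reference)."""
    mean_ref, mean_met = mean(reference), mean(metric)
    cov = sum((r - mean_ref) * (m - mean_met) for r, m in zip(reference, metric))
    var = sum((r - mean_ref) ** 2 for r in reference)
    return cov / var  # raises ZeroDivisionError if the reference is constant


def rolling_beta(reference, metric, window):
    """Beta over each consecutive window of `window` samples."""
    return [
        beta(reference[i - window:i], metric[i - window:i])
        for i in range(window, len(reference) + 1)
    ]


# Made-up data: CPU use grows twice as fast as traffic.
traffic = [1, 2, 3, 4, 5, 6]
cpu = [2 * t + 1 for t in traffic]

print(rolling_beta(traffic, cpu, window=3))
```

A beta that drifts upward over time would suggest the metric is becoming more sensitive to the reference, which is worth knowing before extrapolating capacity needs.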

examples

Guidance: use AppDynamics

https://devops.com/wp-content/uploads/2019/07/Datadog-NPM.png

Flux: A New Approach to System Intuition

Imagine a suit that is wired with tens of thousands of electrodes.

Unknown unknowns

https://thenewstack.io/how-comprehensive-observability-can-save-devops-from-unknown-unknowns/

Observability involves analyzing “unknown unknowns versus known unknowns,” Fong-Jones said. “Monitoring was about measuring things that you knew to predict in advance, whereas observability helps you understand how and why.”

https://blog.paessler.com/the-future-of-monitoring-the-rise-of-observability

https://devops.com/visualize-logs-to-get-more-value-from-data/

SREcon19 Asia/Pacific - Why Does My Monitoring Suck?

SREcon22 Americas - A Fresh Look at Operational Debt

https://youtu.be/oeFpJv-ujXM?list=PLbRoZ5Rrl5leMkjJdKIOI-vOxMVr7U5w_&t=846

Talks

YOW! 2018 Brendan Gregg - Cloud Performance Root Cause Analysis at Netflix #YOW

SPS

https://netflixtechblog.com/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a

Charity Majors - HoneyComb

SREcon18 Europe - Observability for Emerging Infra: What Got You Here Won’t Get You There

You have an observable system when your team can quickly and reliably track down any new problem with no prior knowledge.

  • Monitoring
    • known-unknowns
      • What’s the capacity of the app tier, do I need to increase it?
      • Is mean latency above historical norm?
      • When will we run out of disk space?
  • Observability
    • unknown-unknowns
      • Why are we getting some complaints about users seeing the wrong image?
      • Is the code I just wrote working as intended?
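
A known-unknown such as "When will we run out of disk space?" can be answered by extrapolating a metric that is already collected. The sketch below is one simple way to do that, assuming one usage sample per day and a least-squares straight-line fit. The function name and the sample numbers are made up.

```python
def days_until_full(used_gb, capacity_gb):
    """Extrapolate daily usage samples (oldest first) with a least-squares line.

    Returns the number of days after the last sample at which the fitted
    line reaches capacity, or None if usage is flat or shrinking.
    """
    n = len(used_gb)
    if n < 2:
        raise ValueError("need at least two samples")
    days = range(n)
    mean_x = sum(days) / n
    mean_y = sum(used_gb) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, used_gb)) / sum(
        (x - mean_x) ** 2 for x in days
    )
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity_gb - intercept) / slope
    return day_full - (n - 1)


print(days_until_full([100, 110, 120, 130], capacity_gb=200))
```

The observability questions in the list have no such precomputed answer: nobody knew in advance which metric to fit a line to.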

Extra

https://newrelic.com/blog/best-practices/what-is-observability

Kubicast

https://www.getup.io/kubicast

instana

  • Fernando:
    • Big Data in Practice: How 45 Successful Companies Used Big Data Analytics to Deliver Extraordinary Results (book by Bernard Marr)
    • The live-action Mulan (Disney+)
  • João:
    • The Art of Self-Defense (Netflix)

elastic

  • which logs am I going to read?
  • everything talks to everything; where is the error?
  • guarantee the SLA. MTTR

  • whitebox vs blackbox instrumentation
  • blackbox instrumentation: pre-established rules
    • a "magic" agent attaches to the process, profiles it, and pulls out the information it needs
      • try to extract as much as possible while staying unintrusive enough not to burden the code's performance
    • Java: everything exposed via JMX
    • pre-established rules: whatever is available by default.
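
The contrast in these notes can be illustrated with a toy example. Blackbox instrumentation, in the sense used here, is an agent that attaches from the outside and records generic data without code changes. Whitebox instrumentation is the code itself emitting what it knows. The sketch below imitates the agent by wrapping a method at runtime. `OrderService`, `attach_agent` and `timings` are hypothetical names, and real agents (such as Java agents reading JMX) work differently.

```python
import functools
import time
from collections import Counter, defaultdict

timings = defaultdict(list)  # what the toy "agent" collects


def attach_agent(obj, method_name):
    """Blackbox style: wrap an existing method from outside, no code changes."""
    original = getattr(obj, method_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            # Generic, default data only: how long the call took.
            timings[method_name].append(time.perf_counter() - start)

    setattr(obj, method_name, wrapper)


class OrderService:
    """Toy service. The counters are whitebox: the code emits them itself."""

    def __init__(self):
        self.metrics = Counter()

    def place_order(self, item):
        self.metrics["orders_total"] += 1
        if item == "gift-card":
            # Business-specific detail an outside agent could not know about.
            self.metrics["gift_cards_total"] += 1
        return f"ok:{item}"


service = OrderService()
attach_agent(service, "place_order")
service.place_order("book")
service.place_order("gift-card")

print("blackbox view:", len(timings["place_order"]), "calls timed")
print("whitebox view:", dict(service.metrics))
```

The agent sees every call and its latency for free but knows nothing about gift cards, while the whitebox counters carry domain meaning at the cost of touching the code.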

new relic

  • monitoring
    • what?
    • when?
  • observability
    • why?
    • how?

flowcharting awesome

https://www.linkedin.com/posts/raphaelisp_dashboard-para-monitoramento-pops-atendidos-activity-6914692173141524480-zEvK?utm_source=linkedin_share&utm_medium=member_desktop_web

Observability Concepts you should know

https://blog.devgenius.io/observability-concepts-you-should-know-943fc057b208

Helios

https://app.gethelios.dev/trace-vis

Lightlytics

https://www.linkedin.com/posts/stavsitnikov_grabbed-this-photo-taken-by-alex-elshamouty-activity-7004844830900023296-Uacc?utm_source=share&utm_medium=member_desktop