Observability Study

Prologue

https://wentzwu.com/2020/08/24/microservices-containerization-and-serverless/

https://www.akana.com/resources/microservices-why-should-businesses-care

Concept

https://passionateaboutoss.com/getting-confused-by-key-assurance-metrics/

MTBF (Mean Time Between Failures) – the average elapsed time between failures of a system, service or device. It’s the basic measure of availability / reliability of the system / service / device. The higher, the better.
MTTR (Mean Time to Repair) – generally used to denote the average time to close a trouble ticket (to repair a failed system / service / device). It’s the basic measure of corrective action efficiency. The lower, the better.

Some also use MTTR as a Mean Time to Recover / Resolve (ie MTTD + MTTR in the diagram above) or Mean Time to Respond (MTTD in the diagram above to acknowledge an event and create a ticket). See why I get confused?

MTTD (Mean Time to Detect / Diagnose) – the average time taken from when an event is first generated and timestamped to when the NOC detects / diagnoses the cause and generates a ticket. The lower, the better.
MTTF (Mean Time to Failure) – the average system / service / device up-time

https://www.atlassian.com/incident-management/kpis/common-metrics

MTTR (mean time to respond) is the average time it takes to recover from a product or system failure from the time when you are first alerted to that failure.
MTTA (mean time to acknowledge) is the average time it takes from when an alert is triggered to when work begins on the issue.

https://www.guyfighel.com/blog/tag/SRE

AWS re:Invent 2019: Cut through the chaos: Gain operational visibility and insight (MGT301-R1)

file.pdf

types of monitoring blackbox vs whitebox

https://www.meshcloud.io/2020/08/28/multi-cloud-monitoring-a-cloud-security-essential/

Spaghetti Dash

https://github.com/leeoniya/uPlot/issues/108

https://twitter.com/suprememoocow/status/1392065948845215746/photo/1

https://twitter.com/suprememoocow/status/1398236948502827011/photo/1

Why Dashboards Are Useless and Observability Is the New Buzzword

https://pt.slideshare.net/timetrix/why-dashboards-are-useless-and-observability-is-the-new-buzzword

Example Dash

https://indico.cern.ch/event/639271/contributions/2591582/attachments/1460975/2256862/2017_05_WLCGTransfers.pdf

Dashboards Observability Perks

https://docs.splunk.com/Documentation/ITSI/4.10.2/SI/DeepDives

Phantom Cyber Corporation Splunk Computer Software Orchestration Privately Held Company

https://splunkbase.splunk.com/app/4261/

https://twitter.com/mattdavies_uk/status/932906148155527168

https://twitter.com/RKela/status/1159117734426398720

Grafana Many Datasources

https://blog.zabbix.com/configuring-grafana-with-zabbix/8007/

Netflix Edgar

https://netflixtechblog.com/edgar-solving-mysteries-faster-with-observability-e1a76302c71f

Availability

https://manishsharma.blog/2020/02/04/design-for-availability-game-of-9s/

SLOs SLIs SLAs

distributed tracing dapper jaeger

https://logz.io/blog/distributed-tracing-dapper-jaeger/

https://eng.uber.com/microservice-architecture/

Uber’s microservice architecture circa mid-2018 from Jaeger

monitoring data quality at scale

https://eng.uber.com/monitoring-data-quality-at-scale/

cat pink

cat blue

Critical Path Analysis for Microservice Architectures

https://eng.uber.com/crisp-critical-path-analysis-for-microservice-architectures/

Introducing uGroup: Uber’s Consumer Management Framework

https://eng.uber.com/introducing-ugroup-ubers-consumer-management-framework/

Capacity Planning

A Systematic Approach to Capacity Planning in the Real World

https://www.slideshare.net/arunkejariwal/a-systematic-approach-to-capacity-planning-in-the-real-world

Metrics
- Average, Standard deviation, 95th, 99th percentile
Techniques
- Moving Average - EMA (exponential moving average)
- Correlation
- Beta Analysis
- MACD
- Forecasting - ARIMA
Limitations
- Changing usage patterns
  - Organic growth, behavioral, cultural

Beta analysis

Rolling Beta

examples

Orientaçao utilizar AppDynamics

https://devops.com/wp-content/uploads/2019/07/Datadog-NPM.png

Flux: A New Approach to System Intuition

Imagine a suit that is wired with tens of thousands of electrodes.

Unknown unknowns

https://thenewstack.io/how-comprehensive-observability-can-save-devops-from-unknown-unknowns/

Observability involves analyzing “unknown unknowns versus known unknowns,” Fong-Jones said. “Monitoring was about measuring things that you knew to predict in advance, whereas observability helps you understand how and why.”

https://blog.paessler.com/the-future-of-monitoring-the-rise-of-observability

https://devops.com/visualize-logs-to-get-more-value-from-data/

SREcon19 Asia/Pacific - Why Does My Monitoring Suck?

SREcon22 Americas - A Fresh Look at Operational Debt

https://youtu.be/oeFpJv-ujXM?list=PLbRoZ5Rrl5leMkjJdKIOI-vOxMVr7U5w_&t=846

Talks

YOW! 2018 Brendan Gregg - Cloud Performance Root Cause Analysis at Netflix #YOW

SPS

https://netflixtechblog.com/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a

Charity Majors - HoneyComb

SREcon18 Europe - Observability for Emerging Infra: What Got You Here Won’t Get You There

You have an observable system when your team can quickly and reliably track down any new problem with no prior knowledge.

Monitoring
- known-unknowns
  - What’s the capacity of the app tier, do I need to increase it?
  - Is mean latency above historical norm?
  - When will we run out of disk space?
Observability
- unknown-unknowns
  - Why are we getting some complaints about users seeing the wrong image?
  - Is the code I just wrote working as intended?

Extra

https://newrelic.com/blog/best-practices/what-is-observability

Kubicast

https://www.getup.io/kubicast

instana

Fernando:
- Big Data in Practice: How 45 Successful Companies Used Big Data Analytics to Deliver Extraordinary Results (Livro do Bernard Marr)
- Live-action da Mulan (Disney+)
João:
- A Arte da Autodefesa (Netflix)

elastic

quais logs eu vou ler?
todos falam com todos onde está o erro?
garantir o SLA. MTTR
instrumentacao whitebox vs blackbox
instrumentacao blackbox regras pre estabelecidas
- agente magico attache no processo e faça o profiling e retire as informações que precise
  - tentar extrair o maximo que pode sem ser instrusivo o sufficiente para nao onerar a performance do codigo
- java: tudo exposto via jmx
- regras pre-estabelecidades. tudo o que for default.

new relic

monitoramento
- o que?
- quando?
observability
- por que?
- como?

flowcharting awesome

https://www.linkedin.com/posts/raphaelisp_dashboard-para-monitoramento-pops-atendidos-activity-6914692173141524480-zEvK?utm_source=linkedin_share&utm_medium=member_desktop_web

Observability Concepts you should know

https://blog.devgenius.io/observability-concepts-you-should-know-943fc057b208

Helios

https://app.gethelios.dev/trace-vis

Lightlitcs

https://www.linkedin.com/posts/stavsitnikov_grabbed-this-photo-taken-by-alex-elshamouty-activity-7004844830900023296-Uacc?utm_source=share&utm_medium=member_desktop

Carlos Aguni