97 Things Every Data Engineer Should Know
- 3 - About the Storage Layer - Julien Le Dem
- We want to reduce the data footprint, so the data costs less to store and is faster to retrieve.
- Those implementation details are usually hidden behind what is commonly known as pushdowns.
- Projection pushdown: consists of reading only the columns requested.
- Predicate pushdown: consists of avoiding deserializing rows that are going to be discarded by a filter.
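- A minimal sketch of both pushdowns using PyArrow to read a Parquet file (the file path, column names, and filter are assumptions for illustration):
```python
import pyarrow.parquet as pq

# Projection pushdown: only the two requested columns are read from disk.
# Predicate pushdown: row groups whose statistics cannot match amount > 100
# are skipped entirely, so those rows are never deserialized.
table = pq.read_table(
    "events.parquet",                  # hypothetical file
    columns=["user_id", "amount"],     # projection pushdown
    filters=[("amount", ">", 100)],    # predicate pushdown
)
print(table.num_rows)
```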
- 7 - Be Intentional About the Batching Model in Your Data Pipelines - Raghotham Murthy
- DTW: Data Time Window batching model. A batch is created for a time window once all records with a date_timestamp in that window have been received.
- ATW: Arrival Time Window batching model. A batch is created at a certain wall-clock time with records that were received in the time window prior to that time.
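- A rough sketch of how the two models batch the same records (window size, field names, and values are my own illustration, not from the chapter):
```python
from datetime import datetime

# Each record carries the time the event happened (data timestamp)
# and the time it reached the pipeline (arrival timestamp).
records = [
    {"id": 1, "data_ts": datetime(2021, 6, 1, 0, 55), "arrival_ts": datetime(2021, 6, 1, 1, 2)},
    {"id": 2, "data_ts": datetime(2021, 6, 1, 1, 10), "arrival_ts": datetime(2021, 6, 1, 1, 12)},
]

def hour_window(ts: datetime) -> datetime:
    return ts.replace(minute=0, second=0, microsecond=0)

# DTW: batches are keyed by the data timestamp; the 00:00-01:00 batch can only
# close once we are confident every record for that window has arrived.
dtw = {}
for r in records:
    dtw.setdefault(hour_window(r["data_ts"]), []).append(r["id"])

# ATW: batches are keyed by arrival time; a batch closes at a wall-clock time
# regardless of which data window its records belong to.
atw = {}
for r in records:
    atw.setdefault(hour_window(r["arrival_ts"]), []).append(r["id"])

print(dtw)  # record 1 falls in the 00:00 data-time batch
print(atw)  # record 1 falls in the 01:00 arrival-time batch
```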
- 8 - Beware of Silver-Bullet Syndrome - Thomas Nield
- “We all have encountered this person: the one who zealously promotes Apache Spark or pushes to have all data wrangling work done in Alteryx Designer”.
- Do you really want your professional identity to be simply a tool stack?
- Would you rather your resume say, “I know SQL, MongoDB, and Tableau” or
- “I am an adaptable professional who can navigate ambiguous environments, overcome departmental barriers, and provide technical insights to maximize data value for an organization”?
- Build your professional identity on skills, problem-solving, and adaptability - not a fleeting technology.
- 12 - Change Data Capture
- This technique is a lot more robust than batch exports of the tables and has a low footprint on the production database.
- Things to keep in mind:
- Scale
- The CDC pipeline has to be robust enough to handle high data volume. For example, in PostgreSQL, delays in reading the WAL can cause the database’s disk space to run out.
- Replication lag
- This refers to the duration between the time that a transaction is committed in the primary database and the time it becomes available in the data warehouse.
- Schema changes
- It is important to propagate the schema changes to the data warehouse. Sometimes a schema change might require a historical sync.
- Masking
- You have to mask sensitive columns for compliance purposes.
- Historical syncs
- Before applying the CDC changes, an initial historical sync of the tables is needed.
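- A toy sketch of applying one CDC change event to a warehouse table while masking a sensitive column (the event shape, column list, and hashing choice are assumptions, not from the chapter):
```python
import hashlib

SENSITIVE_COLUMNS = {"email", "ssn"}  # assumed compliance list

def mask(value: str) -> str:
    # One-way hash: the column can still be joined on but not read back.
    return hashlib.sha256(value.encode()).hexdigest()

def apply_change(event: dict, warehouse_rows: dict) -> None:
    """Apply a single insert/update/delete change event to an in-memory 'table'."""
    key = event["primary_key"]
    if event["op"] == "delete":
        warehouse_rows.pop(key, None)
        return
    row = dict(event["after"])
    for col in SENSITIVE_COLUMNS & row.keys():
        row[col] = mask(str(row[col]))
    warehouse_rows[key] = row

warehouse = {}
apply_change(
    {"op": "insert", "primary_key": 42, "after": {"id": 42, "email": "a@example.com", "plan": "pro"}},
    warehouse,
)
print(warehouse[42]["plan"], warehouse[42]["email"][:12], "...")
```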
- 19 - Data Pipeline Design Patterns for Reusability and Extensibility
- SRP: single-responsibility principle
- DRY: Don’t Repeat Yourself
- Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma et al.
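- A minimal sketch (my own illustration, not from the chapter) of the single-responsibility idea applied to a pipeline: each step does exactly one thing, and the composition logic is written once and reused:
```python
from typing import Callable, Iterable

Step = Callable[[Iterable[dict]], Iterable[dict]]

def drop_nulls(rows: Iterable[dict]) -> Iterable[dict]:
    # Single responsibility: remove records with missing ids, nothing else.
    return (r for r in rows if r.get("id") is not None)

def deduplicate(rows: Iterable[dict]) -> Iterable[dict]:
    # Single responsibility: keep the first occurrence of each id.
    seen = set()
    for r in rows:
        if r["id"] not in seen:
            seen.add(r["id"])
            yield r

def run_pipeline(rows: Iterable[dict], steps: list[Step]) -> list[dict]:
    # DRY: every pipeline reuses this one composition function.
    for step in steps:
        rows = step(rows)
    return list(rows)

print(run_pipeline([{"id": 1}, {"id": None}, {"id": 1}], [drop_nulls, deduplicate]))
```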
- 33 - Friends Don’t Let Friends Do Dual-Writes - Gunnar Morling
- Use CDC (e.g., with Debezium) instead of dual writes.
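- A rough sketch of the outbox idea behind this advice (table names and SQL are illustrative assumptions): the business row and the event row are written in one local transaction, and a CDC tool such as Debezium later publishes the outbox rows, so no code ever writes to two systems at once:
```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT, payload TEXT)")

def place_order(order_id: int, total: float) -> None:
    # Both inserts commit (or roll back) together; the message broker is not touched here.
    with conn:
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps({"event": "OrderPlaced", "id": order_id, "total": total})),
        )

place_order(1, 99.5)
# A CDC connector (e.g., Debezium) would tail the log and publish these outbox rows downstream.
print(conn.execute("SELECT topic, payload FROM outbox").fetchall())
```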
- 53 - Observability for Data Engineers - Barr Moses
- Domain-Oriented Data Meshes
- https://www.montecarlodata.com/blog-what-is-a-data-mesh-and-how-not-to-mesh-it-up/
- Cloud warehouses
- https://www.montecarlodata.com/blog-how-to-migrate-to-snowflake-like-a-boss/
- Data-modeling solutions
- https://en.wikipedia.org/wiki/Data_modeling
- Data downtime
- https://www.montecarlodata.com/blog-the-rise-of-data-downtime/
- Data observability: Five pillars
- https://www.montecarlodata.com/blog-what-is-data-observability/
- Delivery of data products as platforms
- https://www.montecarlodata.com/blog-how-to-build-your-data-platform-like-a-product/
- 61 - Six Dimensions for Picking an Analytical Data Warehouse - Gleb Mezhanskiy
- “I propose six dimensions for evaluating data-warehousing solutions for the following use cases”:
- Ingesting and storing all analytical data
- Executing data transformations (the T of ELT)
- Serving data to consumers (powering dashboards and ad hoc analysis)
- Scalability
- Price Elasticity
- Resource-based (pay per node of a certain configuration)
- Usage-based (e.g., pay per GB of data scanned or CPU time)
- Interoperability
- Querying and Transformation Features
- Speed
- A TPC-DS benchmark by Fivetran shows that most mainstream (as of 2021) engines are roughly in the same ballpark in terms of performance.
- Zero Maintenance
- 62 - Small Files in a Big Data World - Adi Polak
- Why does this happen?
- IoT ingestion
- Spark jobs: many parallel tasks, each writing its own small output file
- Over-partitioned Hive tables
- hive.merge.smallfiles.avgsize
- hive.merge.size.per.task
- Be aware of small files when designing data pipelines. Try to avoid them but know you can fix them too!
- https://docs.microsoft.com/en-us/azure/storage/blobs/scalability-targets
- https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html
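- A minimal PySpark-style sketch of compacting small files by cutting the number of output partitions before writing (the paths and partition count are assumptions):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Read a directory full of small Parquet files...
df = spark.read.parquet("s3://bucket/events/")  # hypothetical path

# ...and rewrite it as fewer, larger files. coalesce() lowers the partition
# count without a full shuffle; use repartition() instead if the data is skewed.
df.coalesce(16).write.mode("overwrite").parquet("s3://bucket/events_compacted/")
```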
- 73 - The Implications of the CAP Theorem
- Consistency means all clients see the same response to a query
- Availability means every query from a client gets a response
- Partition tolerance means the system keeps working even if the network loses messages between nodes.
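- A toy illustration (my own, not from the chapter) of the trade-off during a network partition: a two-replica store must give up either consistency or availability:
```python
class Replica:
    def __init__(self) -> None:
        self.value = "v1"

a, b = Replica(), Replica()
partitioned = True   # the replicas can no longer exchange messages
a.value = "v2"       # a write lands on replica a only

def read_cp(replica: Replica) -> str:
    # CP choice: refuse to answer rather than risk returning a stale value.
    if partitioned:
        raise RuntimeError("unavailable during partition")
    return replica.value

def read_ap(replica: Replica) -> str:
    # AP choice: always answer, even though the replicas may now disagree.
    return replica.value

print(read_ap(a), read_ap(b))  # -> v2 v1: available but inconsistent
```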
- 81 - Threading and Concurrency in Data Processing - Matthew Housley, PhD
- Solving the C10K Problem
- Nginx was designed from the ground up to serve more than 10,000 simultaneous clients (C10K) by tasking each thread to manage many connections.
- The Go programming language builds software on concurrency primitives - goroutines - that are automatically multiplexed across a small number of threads optimized to available cores.
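- The same idea (many logical tasks multiplexed onto a few OS threads) can be sketched in Python with asyncio; this is my own illustration, not code from the chapter:
```python
import asyncio

async def handle_client(client_id: int) -> None:
    # Simulate waiting on network I/O; while this task sleeps, the single
    # event-loop thread is free to service thousands of other clients.
    await asyncio.sleep(0.1)

async def main() -> None:
    # 10,000 concurrent "connections" without 10,000 OS threads.
    await asyncio.gather(*(handle_client(i) for i in range(10_000)))

asyncio.run(main())
```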
- Further Reading
- Summary of the Amazon Kinesis Event in us-east-1
- https://aws.amazon.com/message/11201/
- AWS Reveals It Broke Itself
- https://www.theregister.com/2020/11/30/aws_outage_explanation/
- Maximum Number of Threads Per Process in Linux?
- https://stackoverflow.com/questions/344203/maximum-number-of-threads-per-process-in-linux
- Make your Program Slower with threads
- https://brooker.co.za/blog/2014/12/06/random.html
- nginx
- https://www.aosabook.org/en/nginx.html
- The secret to 10 million concurrent connections: The kernel is the problem, not the solution
- http://highscalability.com/blog/2013/5/13/the-secret-to-10-million-concurrent-connections-the-kernel-i.html
- 88 - What is a Data Mesh, and How Not to Mesh It Up - Barr Moses and Lior Gavish
- As first defined by Zhamak Dehghani, the original architect of the term, a data mesh is a type of data-platform architecture that embraces the ubiquity of data in the enterprise by leveraging a domain-oriented, self-serve design.
- This idea borrows from Eric Evans’s theory of domain-driven design, a flexible, scalable software development paradigm that matches the structure and language of your code with its corresponding business domain.
- If you haven’t already, we highly recommend reading Dehghani’s groundbreaking article
- How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh?
- https://martinfowler.com/articles/data-monolith-to-mesh.html
- or watching Max Schulte’s tech talk on
- Why Zalando transitioned to a data mesh
- https://databricks.com/session_na20/data-mesh-in-practice-how-europes-leading-online-platform-for-fashion-goes-beyond-the-data-lake
- Domain-oriented data architectures, like data meshes, give teams the best of both worlds: a centralized database (or a distributed data lake) with domains (or business areas) responsible for handling their own pipelines.
- Data mesh actually mandates scalable, self-serve observability in your data
- https://towardsdatascience.com/what-is-data-observability-40b337971e3e
- Domains cannot truly own their data if they don’t have observability
- Encryption for data at rest and in motion
- Data product versioning
- Data product schemas
- Data product discovery, catalog registration, and publishing
- Data governance and standardization
- Data production lineage
- Data product monitoring, alerting, and logging
- Data product quality metrics