97 Things Every Data Engineer Should Know
- 3 - About the Storage Layer - Julien Le Dem
- We want to reduce the data footprint, so the data costs less to store and is faster to retrieve.
- Those implementation details are usually hidden behind what is commonly known as pushdowns.
- Projection pushdown: consists of reading only the columns requested.
- Predicate pushdown: consists of avoiding deserializing rows that are going to be discarded by a filter.
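- A minimal sketch of both pushdowns using PyArrow to read a Parquet file (the file path, column names, and filter are assumptions for illustration):
```python
import pyarrow.parquet as pq

# Projection pushdown: only the two requested columns are read from disk.
# Predicate pushdown: row groups whose statistics cannot match amount > 100
# are skipped entirely, so those rows are never deserialized.
table = pq.read_table(
    "events.parquet",                  # hypothetical file
    columns=["user_id", "amount"],     # projection pushdown
    filters=[("amount", ">", 100)],    # predicate pushdown
)
print(table.num_rows)
```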
- 7 - Be Intentional About the Batching Model in Your Data Pipelines - Raghotham Murthy
- DTW: Data Time Window batching model. A batch is created for a time window once all records with a date_timestamp in that window have been received.
- ATW: Arrival Time Window batching model. A batch is created at a certain wall-clock time with records that were received in the time window prior to that time.
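- A rough sketch of how the two models batch the same records (window size, field names, and values are my own illustration, not from the chapter):
```python
from datetime import datetime

# Each record carries the time the event happened (data timestamp)
# and the time it reached the pipeline (arrival timestamp).
records = [
    {"id": 1, "data_ts": datetime(2021, 6, 1, 0, 55), "arrival_ts": datetime(2021, 6, 1, 1, 2)},
    {"id": 2, "data_ts": datetime(2021, 6, 1, 1, 10), "arrival_ts": datetime(2021, 6, 1, 1, 12)},
]

def hour_window(ts: datetime) -> datetime:
    return ts.replace(minute=0, second=0, microsecond=0)

# DTW: batches are keyed by the data timestamp; the 00:00-01:00 batch can only
# close once we are confident every record for that window has arrived.
dtw = {}
for r in records:
    dtw.setdefault(hour_window(r["data_ts"]), []).append(r["id"])

# ATW: batches are keyed by arrival time; a batch closes at a wall-clock time
# regardless of which data window its records belong to.
atw = {}
for r in records:
    atw.setdefault(hour_window(r["arrival_ts"]), []).append(r["id"])

print(dtw)  # record 1 falls in the 00:00 data-time batch
print(atw)  # record 1 falls in the 01:00 arrival-time batch
```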
- 8 - Beware of Silver-Bullet Syndrome - Thomas Nield
- “We all have encountered this person: the one who zealously promotes Apache Spark or pushes to have all data wrangling work done in Alteryx Designer”.
- Do you really want your professional identity to be simply a tool stack?
- Would you rather your resume say, “I know SQL, MongoDB, and Tableau” or
- “I am an adaptable professional who can navigate ambiguous environments, overcome departmental barriers, and provide technical insights to maximize data value for an organization”?
- Build your professional identity on skills, problem-solving, and adaptability - not a fleeting technology.
- 12 - Change Data Capture
- This technique is a lot more robust than batch exports of the tables and has a low footprint on the production database.
- Things to keep in mind:
- Scale
- The CDC pipeline has to be robust enough to handle high data volume. For example, in PostgreSQL, delays in reading the WAL can cause the database’s disk space to run out.
- Replication lag
- This refers to the duration between the time that a transaction is committed in the primary database and the time it becomes available in the data warehouse.
- Schema changes
- It is important to propagate the schema changes to the data warehouse. Sometimes a schema change might require a historical sync.
- Masking
- You have to mask sensitive columns for compliance purposes.
- Historical syncs
- Before applying the CDC changes, an initial historical sync of the tables is needed.
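- A toy sketch of applying one CDC change event to a warehouse table while masking a sensitive column (the event shape, column list, and hashing choice are assumptions, not from the chapter):
```python
import hashlib

SENSITIVE_COLUMNS = {"email", "ssn"}  # assumed compliance list

def mask(value: str) -> str:
    # One-way hash: the column can still be joined on but not read back.
    return hashlib.sha256(value.encode()).hexdigest()

def apply_change(event: dict, warehouse_rows: dict) -> None:
    """Apply a single insert/update/delete change event to an in-memory 'table'."""
    key = event["primary_key"]
    if event["op"] == "delete":
        warehouse_rows.pop(key, None)
        return
    row = dict(event["after"])
    for col in SENSITIVE_COLUMNS & row.keys():
        row[col] = mask(str(row[col]))
    warehouse_rows[key] = row

warehouse = {}
apply_change(
    {"op": "insert", "primary_key": 42, "after": {"id": 42, "email": "a@example.com", "plan": "pro"}},
    warehouse,
)
print(warehouse[42]["plan"], warehouse[42]["email"][:12], "...")
```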
- 19 - Data Pipeline Design Patterns for Reusability and Extensibility
- SRP: single-responsibility principle
- DRY: Don’t Repeat Yourself
- Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma et al.
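- A minimal sketch (my own illustration, not from the chapter) of the single-responsibility idea applied to a pipeline: each step does exactly one thing, and the composition logic is written once and reused:
```python
from typing import Callable, Iterable

Step = Callable[[Iterable[dict]], Iterable[dict]]

def drop_nulls(rows: Iterable[dict]) -> Iterable[dict]:
    # Single responsibility: remove records with missing ids, nothing else.
    return (r for r in rows if r.get("id") is not None)

def deduplicate(rows: Iterable[dict]) -> Iterable[dict]:
    # Single responsibility: keep the first occurrence of each id.
    seen = set()
    for r in rows:
        if r["id"] not in seen:
            seen.add(r["id"])
            yield r

def run_pipeline(rows: Iterable[dict], steps: list[Step]) -> list[dict]:
    # DRY: every pipeline reuses this one composition function.
    for step in steps:
        rows = step(rows)
    return list(rows)

print(run_pipeline([{"id": 1}, {"id": None}, {"id": 1}], [drop_nulls, deduplicate]))
```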
- 33 - Friends Don’t Let Friends Do Dual-Writes - Gunnar Morling
- Use CDC (e.g., with Debezium) instead of dual writes.
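- A rough sketch of the outbox idea behind this advice (table names and SQL are illustrative assumptions): the business row and the event row are written in one local transaction, and a CDC tool such as Debezium later publishes the outbox rows, so no code ever writes to two systems at once:
```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT, payload TEXT)")

def place_order(order_id: int, total: float) -> None:
    # Both inserts commit (or roll back) together; the message broker is not touched here.
    with conn:
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps({"event": "OrderPlaced", "id": order_id, "total": total})),
        )

place_order(1, 99.5)
# A CDC connector (e.g., Debezium) would tail the log and publish these outbox rows downstream.
print(conn.execute("SELECT topic, payload FROM outbox").fetchall())
```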
- 53 - Observability for Data Engineers - Barr Moses
- Domain-Oriented Data Meshes
- https://www.montecarlodata.com/blog-what-is-a-data-mesh-and-how-not-to-mesh-it-up/
- Cloud warehouses
- https://www.montecarlodata.com/blog-how-to-migrate-to-snowflake-like-a-boss/
- Data-modeling solutions
- https://en.wikipedia.org/wiki/Data_modeling
- Data downtime
- https://www.montecarlodata.com/blog-the-rise-of-data-downtime/
- Data observability: Five pillars
- https://www.montecarlodata.com/blog-what-is-data-observability/
- Delivery of data products as platforms
- https://www.montecarlodata.com/blog-how-to-build-your-data-platform-like-a-product/
- 61 - Six Dimensions for Picking an Analytical Data Warehouse - Gleb Mezhanskiy
- “I propose six dimensions for evaluating data-warehousing solutions for the following use cases”:
- Ingesting and storing all analytical data
- Executing data transformations (the T of ELT)
- Serving data to consumers (powering dashboards and ad hoc analysis)
- Scalability
- Price Elasticity
- Resource-based (pay per node of a certain configuration)
- Usage-based (e.g., pay per GB of data scanned or CPU time)
- Interoperability
- Querying and Transformation Features
- Speed
- A TPC-DS benchmark by Fivetran shows that most mainstream (as of 2021) engines are roughly in the same ballpark in terms of performance.
- Zero Maintenance
- 62 - Small Files in a Big Data World - Adi Polak
- Why does this happen?
- IoT ingestion
- Spark jobs: many parallel tasks, each writing its own small output file
- Over-partitioned Hive tables
- hive.merge.smallfiles.avgsize
- hive.merge.size.per.task
- Be aware of small files when designing data pipelines. Try to avoid them but know you can fix them too!
- https://docs.microsoft.com/en-us/azure/storage/blobs/scalability-targets
- https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html
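- A minimal PySpark-style sketch of compacting small files by cutting the number of output partitions before writing (the paths and partition count are assumptions):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Read a directory full of small Parquet files...
df = spark.read.parquet("s3://bucket/events/")  # hypothetical path

# ...and rewrite it as fewer, larger files. coalesce() lowers the partition
# count without a full shuffle; use repartition() instead if the data is skewed.
df.coalesce(16).write.mode("overwrite").parquet("s3://bucket/events_compacted/")
```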
- 73 - The Implications of the CAP Theorem
- Consistency means all clients see the same response to a query
- Availability means every query from a client gets a response
- Partition tolerance means the system keeps working even if the network loses messages between nodes.
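- A toy illustration (my own, not from the chapter) of the trade-off during a network partition: a two-replica store must give up either consistency or availability:
```python
class Replica:
    def __init__(self) -> None:
        self.value = "v1"

a, b = Replica(), Replica()
partitioned = True   # the replicas can no longer exchange messages
a.value = "v2"       # a write lands on replica a only

def read_cp(replica: Replica) -> str:
    # CP choice: refuse to answer rather than risk returning a stale value.
    if partitioned:
        raise RuntimeError("unavailable during partition")
    return replica.value

def read_ap(replica: Replica) -> str:
    # AP choice: always answer, even though the replicas may now disagree.
    return replica.value

print(read_ap(a), read_ap(b))  # -> v2 v1: available but inconsistent
```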
- 81 - Threading and Concurrency in Data Processing - Matthew Housley, PhD
- Solving the C10K Problem
- Nginx was designed from the ground up to serve more than 10,000 simultaneous clients (C10K) by tasking each thread to manage many connections.
- The Go programming language builds software on concurrency primitives - goroutines - that are automatically multiplexed across a small number of threads optimized to available cores.
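- The same idea (many logical tasks multiplexed onto a few OS threads) can be sketched in Python with asyncio; this is my own illustration, not code from the chapter:
```python
import asyncio

async def handle_client(client_id: int) -> None:
    # Simulate waiting on network I/O; while this task sleeps, the single
    # event-loop thread is free to service thousands of other clients.
    await asyncio.sleep(0.1)

async def main() -> None:
    # 10,000 concurrent "connections" without 10,000 OS threads.
    await asyncio.gather(*(handle_client(i) for i in range(10_000)))

asyncio.run(main())
```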
- Further Reading
- Summary of the Amazon Kinesis Event in us-east-1
- https://aws.amazon.com/message/11201/
- AWS Reveals It Broke Itself
- https://www.theregister.com/2020/11/30/aws_outage_explanation/
- Maximum Number of Threads Per Process in Linux?
- https://stackoverflow.com/questions/344203/maximum-number-of-threads-per-process-in-linux
- Make your Program Slower with threads
- https://brooker.co.za/blog/2014/12/06/random.html
- nginx
- https://www.aosabook.org/en/nginx.html
- The secret to 10 million concurrent connections: The kernel is the problem, not the solution
- http://highscalability.com/blog/2013/5/13/the-secret-to-10-million-concurrent-connections-the-kernel-i.html
- 88 - What is a Data Mesh, and How Not to Mesh It Up - Barr Moses and Lior Gavish
- As first defined by Zhamak Dehghani, the original architect of the term, a data mesh is a type of data-platform architecture that embraces the ubiquity of data in the enterprise by leveraging a domain-oriented, self-serve design.
- This idea borrows from Eric Evans’s theory of domain-driven design, a flexible, scalable software development paradigm that matches the structure and language of your code with its corresponding business domain.
- If you haven’t already, we highly recommend reading Dehghani’s groundbreaking article
- How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh?
- https://martinfowler.com/articles/data-monolith-to-mesh.html
- or watching Max Schulte’s tech talk on
- Why Zalando transitioned to a data mesh
- https://databricks.com/session_na20/data-mesh-in-practice-how-europes-leading-online-platform-for-fashion-goes-beyond-the-data-lake
- Domain-oriented data architectures, like data meshes, give teams the best of both worlds: a centralized database (or a distributed data lake) with domains (or business areas) responsible for handling their own pipelines.
- Data mesh actually mandates scalable, self-serve observability in your data
- https://towardsdatascience.com/what-is-data-observability-40b337971e3e
- Domains cannot truly own their data if they don’t have observability
- Encryption for data at rest and in motion
- Data product versioning
- Data product schemas
- Data product discovery, catalog registration, and publishing
- Data governance and standardization
- Data production lineage
- Data product monitoring, alerting, and logging
- Data product quality metrics