Carlos Aguni

Highly motivated self-taught IT analyst. Always learning and ready to explore new skills. An eternal apprentice.


97 things

10 Apr 2022 »

Every Data Engineer Should Know

  • 3 - About the Storage Layer - Julien Le Dem
    • We want to reduce data footprint, so the data costs less to store and is faster to retrieve.
    • Those implementation details are usually hidden behind what is commonly known as pushdowns.
    • Projection pushdown: consists of reading only the columns requested.
    • Predicate pushdown: consists of avoiding deserializing rows that are going to be discarded by a filter.
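A minimal pure-Python sketch (a toy, not a real file format) of what these pushdowns buy you: with column-oriented storage, projection pushdown materializes only the requested columns, and predicate pushdown decides which row indices survive before any other column is touched.

```python
# Toy column-oriented storage: one list per column.
TABLE = {
    "user_id": [1, 2, 3, 4],
    "country": ["BR", "US", "BR", "DE"],
    "payload": ["a" * 100, "b" * 100, "c" * 100, "d" * 100],
}

def scan(columns, predicate_col=None, predicate=None):
    """Return rows as dicts, touching only the columns that are needed."""
    # Predicate pushdown: evaluate the filter against a single column
    # first, so discarded rows are never deserialized from other columns.
    if predicate_col is not None:
        keep = [i for i, v in enumerate(TABLE[predicate_col]) if predicate(v)]
    else:
        keep = range(len(next(iter(TABLE.values()))))
    # Projection pushdown: materialize only the requested columns.
    return [{c: TABLE[c][i] for c in columns} for i in keep]

rows = scan(["user_id"], predicate_col="country", predicate=lambda v: v == "BR")
# → [{'user_id': 1}, {'user_id': 3}] -- the wide "payload" column was never read
```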
  • 7 - Be Intentional About the Batching Model in Your Data Pipelines - Raghotham Murthy
    • DTW: Data time window batching model; a batch is created for a time window once all records with a date_timestamp in that window have been received.
    • ATW: Arrival time window batching model; a batch is created at a certain wall-clock time with the records that were received in the time window prior to that time.
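A toy sketch of the two models with invented timestamps: the same late-arriving record lands in the DTW batch for its event time, but misses the ATW batch that closed before it arrived.

```python
from datetime import datetime

# Each record carries an event timestamp and an arrival timestamp.
records = [
    {"id": 1, "event_time": datetime(2022, 4, 10, 9, 59), "arrived": datetime(2022, 4, 10, 10, 1)},
    {"id": 2, "event_time": datetime(2022, 4, 10, 10, 2), "arrived": datetime(2022, 4, 10, 10, 3)},
    {"id": 3, "event_time": datetime(2022, 4, 10, 9, 58), "arrived": datetime(2022, 4, 10, 10, 30)},  # late
]

def batch_ids(records, key, start, end):
    """IDs of records whose `key` timestamp falls in [start, end)."""
    return [r["id"] for r in records if start <= r[key] < end]

win = (datetime(2022, 4, 10, 9, 0), datetime(2022, 4, 10, 10, 0))
# DTW: batch by when the event happened -- the late record still belongs
# here, so the batch can only close once every such record has arrived.
dtw = batch_ids(records, "event_time", *win)   # [1, 3]
# ATW: batch by when the record arrived -- closes at wall-clock time,
# so the late record falls into a later batch instead.
atw = batch_ids(records, "arrived", *win)      # []
```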
  • 8 - Beware of Silver-Bullet Syndrome - Thomas Nield
    • “We all have encountered this person: the one who zealously promotes Apache Spark or pushes to have all data wrangling work done in Alteryx Designer”.
    • Do you really want your professional identity to be simply a tool stack?
    • Would you rather your resume say, “I know SQL, MongoDB, and Tableau” or
    • “I am an adaptable professional who can navigate ambiguous environments, overcome departmental barriers, and provide technical insights to maximize data value for an organization”?
    • Build your professional identity on skills, problem-solving, and adaptability - not a fleeting technology.
  • 12 - Change Data Capture
    • This technique is a lot more robust than batch exports of the tables and has a low footprint on the production database.
    • Watch out for
      • Scale
        • The CDC pipeline has to be robust enough for high data volumes. For example, in PostgreSQL, delays in reading the WAL can cause the database’s disk space to run out.
      • Replication lag
        • This refers to the duration between the time that a transaction is committed in the primary database and the time it becomes available in the data warehouse.
      • Schema changes
        • It is important to propagate the schema changes to the data warehouse. Sometimes a schema change might require a historical sync
      • Masking
        • You have to mask sensitive columns for compliance purposes.
      • Historical syncs
        • Before applying the CDC changes, an initial historical sync of the tables is needed.
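A hypothetical sketch of the apply side (real tools such as Debezium emit much richer event envelopes; the event shape here is invented): inserts and updates are both upserts on the primary key, deletes remove the row, and sensitive columns are masked before a row ever reaches the replica.

```python
# Hypothetical CDC event stream for a single table.
events = [
    {"op": "insert", "id": 1, "row": {"id": 1, "email": "a@x.com", "plan": "free"}},
    {"op": "update", "id": 1, "row": {"id": 1, "email": "a@x.com", "plan": "pro"}},
    {"op": "insert", "id": 2, "row": {"id": 2, "email": "b@y.com", "plan": "free"}},
    {"op": "delete", "id": 2, "row": None},
]

SENSITIVE = {"email"}

def mask(row):
    # Mask sensitive columns for compliance before loading the warehouse.
    return {k: ("***" if k in SENSITIVE else v) for k, v in row.items()}

def apply_cdc(replica, events):
    for e in events:
        if e["op"] == "delete":
            replica.pop(e["id"], None)
        else:  # insert and update are both upserts on the primary key
            replica[e["id"]] = mask(e["row"])
    return replica

replica = {}  # in practice, seeded by an initial historical sync
apply_cdc(replica, events)
# → {1: {'id': 1, 'email': '***', 'plan': 'pro'}}
```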
  • 19 - Data Pipeline Design Patterns for Reusability and Extensibility
    • SRP: single-responsibility principle
    • DRY: Don’t Repeat Yourself
    • Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma et al.
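A small sketch of how the two principles play out in a pipeline (the step names here are invented): each step has a single responsibility and is testable in isolation, while the composition logic is written exactly once.

```python
# DRY: the chaining logic lives in one place.
def compose(*steps):
    """Chain single-responsibility steps into one pipeline callable."""
    def pipeline(records):
        for step in steps:
            records = step(records)
        return records
    return pipeline

# SRP: each step does one thing.
drop_nulls = lambda rs: [r for r in rs if r.get("value") is not None]
to_celsius = lambda rs: [{**r, "value": (r["value"] - 32) * 5 / 9} for r in rs]

pipeline = compose(drop_nulls, to_celsius)
out = pipeline([{"value": 212}, {"value": None}, {"value": 32}])
# → [{'value': 100.0}, {'value': 0.0}]
```

Extending the pipeline means adding a new step, not editing existing ones.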
  • 33 - Friends Don’t Let Friends Do Dual-Writes - Gunnar Morling
    • CDC Debezium
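The usual alternative to a dual write (database + message broker, where one of the two can fail) is the transactional outbox: write the business row and the event in one transaction, then let a CDC tool such as Debezium stream the outbox table to the broker. A sketch with SQLite standing in for the production database:

```python
import sqlite3, json

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")

def place_order(order_id, total):
    with db:  # one transaction: both rows commit, or neither does
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"event": "OrderPlaced", "id": order_id, "total": total}),),
        )

place_order(42, 9.99)
# CDC would tail the outbox table and publish each row to the broker.
events = [json.loads(p) for (p,) in db.execute("SELECT payload FROM outbox")]
# → [{'event': 'OrderPlaced', 'id': 42, 'total': 9.99}]
```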
  • 53 - Observability for Data Engineers - Barr Moses
    • Domain-Oriented Data Meshes
      • https://www.montecarlodata.com/blog-what-is-a-data-mesh-and-how-not-to-mesh-it-up/
    • Cloud warehouses
      • https://www.montecarlodata.com/blog-how-to-migrate-to-snowflake-like-a-boss/
    • Data-modeling solutions
      • https://en.wikipedia.org/wiki/Data_modeling
    • Data downtime
      • https://www.montecarlodata.com/blog-the-rise-of-data-downtime/
    • Data observability: Five pillars
      • https://www.montecarlodata.com/blog-what-is-data-observability/
    • Delivery of data products as platforms
      • https://www.montecarlodata.com/blog-how-to-build-your-data-platform-like-a-product/
  • 61 - Six Dimensions for Picking an Analytical Data Warehouse - Gleb Mezhanskiy
    • “I propose six dimensions for evaluating data-warehousing solutions for the following use cases”:
      • Ingesting and storing all analytical data
      • Executing data transformations (the T of ELT)
      • Serving data to consumers (powering dashboards and ad hoc analysis)
    • Scalability
    • Price Elasticity
      • Resource-based (pay per node of a certain configuration)
      • Usage-based (e.g., pay per GB of data scanned or CPU time)
    • Interoperability
    • Querying and Transformation Features
    • Speed
      • A TPC-DS benchmark by Fivetran shows that most mainstream (as of 2021) engines are roughly in the same ballpark in terms of performance.
    • Zero Maintenance
  • 62 - Small files in a Big Data World - Adi Polak
    • Why does this happen?
      • IoT ingestion
      • Spark jobs: many parallel tasks, each writing its own output file
      • Over-partitioned Hive tables
        • hive.merge.smallfiles.avgsize
        • hive.merge.size.per.task
    • Be aware of small files when designing data pipelines. Try to avoid them but know you can fix them too!
    • https://docs.microsoft.com/en-us/azure/storage/blobs/scalability-targets
    • https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html
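A compaction sketch in plain Python, similar in spirit to the Hive merge settings above: many tiny files are rewritten into parts of at least a target size. The 1 KB target is only for the demo; real jobs aim for hundreds of MB per file.

```python
import os, glob, tempfile

TARGET_SIZE = 1024  # bytes; real systems target 128 MB-1 GB per file

def compact(src_dir, dst_dir):
    """Merge small text files in src_dir into fewer, larger part files."""
    os.makedirs(dst_dir, exist_ok=True)
    buf, size, n = [], 0, 0
    def flush():
        nonlocal buf, size, n
        if buf:
            with open(os.path.join(dst_dir, f"part-{n:05d}.txt"), "w") as f:
                f.writelines(buf)
            buf, size = [], 0
            n += 1
    for path in sorted(glob.glob(os.path.join(src_dir, "*.txt"))):
        with open(path) as f:
            lines = f.readlines()
        buf.extend(lines)
        size += sum(len(l) for l in lines)
        if size >= TARGET_SIZE:
            flush()
    flush()  # write whatever remains
    return sorted(os.listdir(dst_dir))

# 100 tiny files in, a handful of compacted parts out.
src, dst = tempfile.mkdtemp(), tempfile.mkdtemp()
for i in range(100):
    with open(os.path.join(src, f"small-{i:03d}.txt"), "w") as f:
        f.write(f"record {i}\n")
parts = compact(src, dst)
```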
  • 73 - The Implications of the CAP Theorem
    • Consistency means all clients see the same response to a query
    • Availability means every query from a client gets a response
    • Partition tolerance means the system keeps working even if messages between nodes are lost.
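A toy illustration of the trade-off: during a network partition, a replica in "CP" mode refuses possibly-stale reads (consistent but not available), while an "AP" replica answers with whatever it last saw (available but maybe stale).

```python
class Replica:
    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP"
        self.data = {"x": 1}        # last value replicated before the partition
        self.partitioned = False    # lost contact with the primary?

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            # Consistency over availability: refuse to answer.
            raise TimeoutError("partitioned: refusing possibly-stale read")
        # Availability over consistency: answer, possibly with stale data.
        return self.data[key]

ap = Replica("AP"); ap.partitioned = True
cp = Replica("CP"); cp.partitioned = True
ap.read("x")  # → 1 (available, possibly stale)
# cp.read("x") raises TimeoutError (consistent, not available)
```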
  • 81 - Threading and Concurrency in Data Processing - Matthew Housley, PhD
    • Solving the C10K Problem
      • Nginx was designed from the ground up to serve more than 10,000 simultaneous clients (C10K) by tasking each thread to manage many connections.
      • The Go programming language builds software on concurrency primitives - goroutines - that are automatically multiplexed across a small number of threads optimized to available cores.
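The same multiplexing idea can be sketched with Python's asyncio: one OS thread services thousands of concurrent "connections" because the waits are non-blocking, instead of dedicating a thread per connection.

```python
import asyncio, threading

async def handle(conn_id):
    await asyncio.sleep(0.01)        # simulated network wait (non-blocking)
    return threading.get_ident()     # which OS thread served this connection

async def main():
    # Launch 1,000 concurrent connections on a single event loop.
    return await asyncio.gather(*(handle(i) for i in range(1000)))

thread_ids = asyncio.run(main())
# Every connection was served by the same single thread.
```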
    • Further Reading
      • Summary of the Amazon Kinesis Event in us-east-1
        • https://aws.amazon.com/message/11201/
      • AWS Reveals It Broke Itself
        • https://www.theregister.com/2020/11/30/aws_outage_explanation/
      • Maximum Number of Threads Per Process in Linux?
        • https://stackoverflow.com/questions/344203/maximum-number-of-threads-per-process-in-linux
      • Make your Program Slower with threads
        • https://brooker.co.za/blog/2014/12/06/random.html
      • nginx
        • https://www.aosabook.org/en/nginx.html
      • The secret to 10 million concurrent connections: The kernel is the problem, not the solution
        • http://highscalability.com/blog/2013/5/13/the-secret-to-10-million-concurrent-connections-the-kernel-i.html
  • 88 - What is a Data Mesh, and How Not to Mesh It Up - Barr Moses and Lior Gavish
    • As first defined by Zhamak Dehghani, the original architect of the term, a data mesh is a type of data-platform architecture that embraces the ubiquity of data in the enterprise by leveraging a domain-oriented, self-serve design.
    • This idea borrows from Eric Evans’s theory of domain-driven design, a flexible, scalable software development paradigm that matches the structure and language of your code with its corresponding business domain.
    • If you haven’t already, we highly recommend reading Dehghani’s groundbreaking article
      • How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh?
        • https://martinfowler.com/articles/data-monolith-to-mesh.html
      • or watching Max Schulte’s tech talk on
        • Why Zalando transitioned to a data mesh
          • https://databricks.com/session_na20/data-mesh-in-practice-how-europes-leading-online-platform-for-fashion-goes-beyond-the-data-lake
    • Domain-oriented data architectures, like data meshes, give teams the best of both worlds: a centralized database (or a distributed data lake) with domains (or business areas) responsible for handling their own pipelines.
    • Data mesh actually mandates scalable, self-serve observability in your data
      • https://towardsdatascience.com/what-is-data-observability-40b337971e3e
    • Domains cannot truly own their data if they don’t have observability
      • Encryption for data at rest and in motion
      • Data product versioning
      • Data product schemas
      • Data product discovery, catalog registration, and publishing
      • Data governance and standardization
      • Data product lineage
      • Data product monitoring, alerting, and logging
      • Data product quality metrics

Every SRE Should Know