The ETL Checklist: 10 Critical Design Decisions Before You Write a Single Line of Code

Every ETL project starts with a spark of optimism. You have a source, a target, and a pressing need to move data from point A to point B. The temptation to jump straight into writing code is almost irresistible. But experienced teams know that the first line of code is rarely the problem—it's the decisions made before that line that determine whether the pipeline survives its first month in production.

This checklist covers ten design decisions that deserve your attention before you open an editor. They are not listed in strict priority order; the weight of each depends on your context. But ignoring any one of them is a bet that your data is simpler than it actually is. Let's walk through them one by one.

1. Who Needs This Checklist and What Goes Wrong Without It

If you are a data engineer, analytics engineer, or a developer tasked with moving data between systems, this checklist is for you. It is also for team leads who review ETL designs and want a structured way to catch oversights before they become incidents.

Without a deliberate design phase, teams commonly face a cascade of problems. The most frequent is schema drift: a source system adds a column, and the pipeline silently drops it or breaks. Another classic is performance collapse: a full load that worked fine with 100,000 rows fails catastrophically when the source grows to 10 million rows. Then there is the debugging nightmare: a failure occurs at 3 AM, and the logs contain nothing but a cryptic database error with no context about which row or transformation caused it.

These problems share a root cause—decisions made in haste, often because the team felt pressure to deliver quickly. But rushing the design phase rarely saves time; it just shifts the time cost to later, where it multiplies. A few hours of upfront thinking can save weeks of firefighting.

We have seen teams spend months rebuilding a pipeline that was originally written in two days. The rebuild happened because nobody had asked: what happens when the source changes its schema? How do we handle duplicate rows? What is our retry strategy? These are not advanced questions—they are basic hygiene. Yet they are routinely skipped.

This checklist is not a silver bullet. It will not tell you which tool to use or which cloud service to pick. It is a set of questions and trade-offs that every ETL design must address. By the end, you should have a clear picture of your pipeline's shape, its failure modes, and its maintainability over time.

2. Prerequisites: What You Need to Settle Before the First Design Meeting

Before you can make informed design decisions, you need a clear understanding of three things: the source, the target, and the business rules that connect them. This sounds obvious, but in practice, teams often start designing with only a vague idea of what the source looks like.

Source Profiling Is Non-Negotiable

You cannot design a reliable pipeline without knowing your data's shape, volume, and quirks. Source profiling means running queries to discover column types, null percentages, distinct counts, min/max values, and patterns of missing data. It also means checking for things like trailing spaces, inconsistent date formats, and hidden characters that break delimiters.

One team we heard about spent weeks building a pipeline that ingested customer addresses. The source had a column called state that appeared to contain two-letter state codes. But a profile revealed that 3% of rows had three-letter codes, 1% had full state names, and 0.5% were blank. Without profiling, those rows would have silently failed or produced garbage in the target.

Profiling does not need to be exhaustive on day one. But you should run enough queries to understand the data's boundaries. A good rule of thumb is to profile at least 100,000 rows or 10% of the table, whichever is larger.

Target Schema Design

Your target schema is not just a dump of the source. It should be designed for the queries that will run against it. This means thinking about partitioning, indexing, and data types that match the analytical workload. If the target is a data warehouse, consider using star schemas or dimensional models. If it is a data lake, think about file formats (Parquet, ORC) and compression.

One common mistake is to mirror the source schema exactly, including its quirks. For example, a source might store dates as strings in MM/DD/YYYY format. In the target, you should convert them to proper date types. This is a transformation that belongs in the ETL, not in every downstream query.

Business Rules and Data Contracts

Business rules define what the data means and how it should be transformed. For example, a rule might say: "If the order status is 'cancelled' and the refund date is null, set the refund date to the current timestamp." These rules must be documented and agreed upon by stakeholders before coding begins.

Data contracts are a newer concept that formalizes the expectations between data producers and consumers. A contract specifies the schema, semantics, and service-level agreements (SLAs) for a dataset. While not every project needs a formal contract, having a written agreement about what the source guarantees (or does not guarantee) can prevent endless back-and-forth later.

3. The Core Workflow: Ten Design Decisions in Sequence

These ten decisions form the backbone of your ETL design. They are not a rigid sequence—some decisions will loop back and influence others—but this order provides a logical flow from understanding to implementation.

Decision 1: Full Load or Incremental?

The first and most impactful decision is whether to reload the entire dataset each time or only process changes. Full loads are simpler to implement and debug, but they become impractical as data grows. Incremental loads are more efficient but require a reliable way to detect changes—usually a timestamp column, a change data capture (CDC) feed, or a version number.

If your source does not have a reliable change indicator, you may need to use a full load with a small window, or implement a hash-based comparison to find changed rows. Both approaches have trade-offs in complexity and resource usage.

Decision 2: Push or Pull?

Will your pipeline pull data from the source, or will the source push data to you? Pull is more common and gives you control over scheduling. Push is useful when the source cannot be queried directly (e.g., a third-party API that sends webhooks). Push also reduces load on the source database.

For pull-based pipelines, consider the frequency of polling. Too frequent and you overload the source; too infrequent and data freshness suffers. A good starting point is to match the source's natural update cadence.

Decision 3: Batch or Stream?

Batch processing is simpler and more forgiving of errors. Streaming is necessary for real-time use cases but introduces complexity around state management, exactly-once semantics, and ordering. Most pipelines start as batch and evolve to streaming only when latency requirements demand it.

If you choose batch, decide on the batch window. Hourly batches are common for operational reporting; daily batches work for many analytics use cases. If you choose stream, you will need a stream processing framework like Apache Kafka or AWS Kinesis, and you must plan for checkpointing and recovery.

Decision 4: Schema on Read or Write?

Schema-on-write enforces a fixed schema before data is stored. Schema-on-read applies the schema at query time, allowing raw data to be stored in a flexible format like JSON or Avro. The former is stricter and catches errors early; the latter is more flexible but pushes complexity to consumers.

For data warehouses, schema-on-write is the norm. For data lakes, schema-on-read is common, but it requires careful management of schema evolution. A hybrid approach is to store raw data in a schema-on-read format and then apply a schema-on-write layer for curated datasets.

Decision 5: How to Handle Schema Changes?

Schema changes are inevitable. Your design must account for new columns, deleted columns, changed data types, and renamed columns. Common strategies include:

Ignore: silently drop unknown columns. Simple but risky.
Fail: raise an alert and stop the pipeline. Safe but may cause downtime.
Auto-evolve: automatically add new columns to the target. Flexible but can lead to schema bloat.
Versioned: maintain multiple schema versions and route data accordingly. Complex but robust.

Most teams start with a fail-fast approach and evolve to auto-evolve after they have monitoring in place.

Decision 6: Error Handling and Retry Logic

Not all errors are equal. Transient errors (network timeouts, deadlocks) should be retried with exponential backoff. Permanent errors (invalid data, constraint violations) should be quarantined for manual review. Design a dead-letter queue or error table where problematic rows are stored with context about why they failed.

Also decide on the retry limit. Three retries with a 30-second delay is a common baseline. Beyond that, alert a human. Do not retry indefinitely—that can mask underlying issues.

Decision 7: Logging and Observability

Your pipeline must produce enough information to diagnose failures without overwhelming storage. Log the start and end of each run, the number of rows processed, the number of errors, and the duration. Include a unique run ID that can be traced from source to target.

Consider structured logging (JSON format) so that logs can be ingested into a centralized monitoring system. Also think about metrics: record throughput, latency, error rates, and data quality checks (e.g., row count comparisons between source and target).

Decision 8: Testing Strategy

Testing an ETL pipeline is harder than testing application code because the data is unpredictable. Still, you need tests. Start with unit tests for individual transformations—test that a function correctly handles nulls, edge cases, and type conversions. Then add integration tests that run a small subset of real data through the entire pipeline and verify the output.

Data quality tests are also essential. For example, after loading, run a query to check that no duplicate primary keys exist, that foreign key references are valid, and that numeric fields fall within expected ranges. Automate these checks and alert on failures.

Decision 9: Deployment and Versioning

Your ETL code should be version-controlled, just like any other software. Use Git and follow a branching strategy. Automate deployments with CI/CD so that changes are tested and promoted through environments (dev, staging, prod).

Also version your data models. When you change a transformation, the downstream consumers need to know. A simple approach is to include a version number in the table name or metadata column.

Decision 10: Monitoring and Alerting

Once your pipeline is in production, you need to know when it fails. Set up alerts for failures, delays, and data quality violations. Use a monitoring tool that can track pipeline health over time and provide dashboards for stakeholders.

Do not alert on every minor issue. Tune thresholds so that alerts are actionable. A good rule is to alert when a run fails, when a run takes longer than twice its expected duration, or when row counts deviate by more than 5% from the historical average.

4. Tools, Setup, and Environment Realities

Your design decisions will be shaped by the tools and environment you have. This section covers practical considerations around infrastructure, cost, and team skills.

Cloud vs. On-Premises

Cloud platforms offer managed ETL services (AWS Glue, Azure Data Factory, Google Cloud Dataflow) that reduce operational overhead. On-premises solutions give you more control but require you to manage servers, storage, and networking. For most teams, the cloud is the default choice, but be aware of egress costs and data residency requirements.

If you are in a regulated industry, you may need to keep data on-premises or in a specific region. Factor that into your decision early.

Orchestration Tools

You need a way to schedule and monitor your pipeline. Apache Airflow is the most popular open-source orchestrator, but it has a learning curve. Managed alternatives like Prefect, Dagster, and cloud-native schedulers (e.g., AWS Step Functions) are also worth considering.

Choose an orchestrator that supports retries, dependencies, and alerting out of the box. Avoid cron jobs for anything beyond simple scripts—they lack visibility and error handling.

Development Environment

Set up a local development environment that mirrors production as closely as possible. Use containers (Docker) to package your ETL code and dependencies. This ensures that what works on your laptop will work in production.

Also invest in a test data generator. Production data is often sensitive and cannot be used in development. A good generator can create realistic data with the same distributions and edge cases as the real source.

Cost Considerations

ETL pipelines consume compute and storage resources. Full loads cost more than incremental loads. Streaming costs more than batch. Make sure you estimate the cost of running the pipeline at scale before committing to a design.

One common surprise is the cost of scanning data in cloud data warehouses. If your pipeline reads the entire source table every time, you may incur significant query costs. Incremental loads reduce this, but they require a partition or index on the change column.

5. Variations for Different Constraints

Not every ETL project looks the same. Here are three common scenarios and how the design decisions shift.

Scenario A: Small Data, Simple Transformations

If you are moving a few thousand rows a day from a single source to a single target, you can afford to keep things simple. Use full loads, batch processing, and a straightforward error-handling strategy. Schema-on-write is fine. You may not need a sophisticated orchestrator—a cron job with logging might suffice.

But even in simple scenarios, do not skip profiling and testing. The data may be small, but the business may depend on its accuracy.

Scenario B: Large Data, Complex Business Rules

When you have billions of rows and dozens of transformations, every decision matters. Incremental loads are mandatory. You need a robust CDC mechanism, probably from a log-based source like Debezium. Schema evolution must be handled carefully, perhaps with a schema registry. Error handling should include a dead-letter queue and automated retries with exponential backoff.

Testing becomes critical. You cannot afford to reprocess a full load after a bug is discovered. Invest in data quality checks and a staging environment that mirrors production.

Scenario C: Real-Time Requirements

If the business needs data within seconds of it being created, you need streaming. This changes the design significantly. You must handle out-of-order events, exactly-once semantics, and stateful operations. Tools like Apache Flink or Kafka Streams become relevant.

Even with streaming, you may still need a batch layer for historical reprocessing. The Lambda architecture (batch + stream) is one approach, though the Kappa architecture (stream only) is gaining popularity for simpler systems.

6. Pitfalls, Debugging, and What to Check When It Fails

No matter how well you design, things will go wrong. Here are common pitfalls and how to debug them.

Silent Data Loss

The most dangerous failure is one that goes unnoticed. A row is dropped because of a type mismatch, but the pipeline reports success. To catch this, always compare row counts between source and target after each run. If the counts differ, investigate immediately.

Also check for nulls where you expect values. A common culprit is a join that silently drops rows because of a missing foreign key.

Performance Degradation Over Time

Pipelines that run fine for weeks may suddenly slow down. The usual cause is data growth. If you are using full loads, the time to process increases linearly with data volume. Switch to incremental loads or partition your source queries.

Another cause is lock contention. If your pipeline holds long-running transactions, it may block other processes. Consider using snapshot isolation or read-only replicas.

Unhandled Schema Changes

A new column appears in the source, and your pipeline ignores it. Later, a downstream report breaks because it expected that column. The fix is to monitor schema changes proactively. Many databases have system tables that track column additions. Poll them and alert when changes occur.

Debugging Steps

When a pipeline fails, follow these steps:

Check the logs for the error message and the run ID.
Identify which step failed (extract, transform, or load).
Look at the data that caused the failure—often a single row with unexpected values.
Reproduce the failure in a test environment with the same data.
Fix the transformation or add a data quality rule to quarantine the problematic row.
Rerun the pipeline from the point of failure, not from the beginning.

7. FAQ: Common Questions About ETL Design Decisions

Should I use a custom ETL framework or build from scratch?

It depends on your team's size and the complexity of your pipelines. Custom frameworks (like Airflow with Python operators) give you flexibility and are easier to debug. Managed services reduce operational overhead but may lock you into a vendor. For most teams, starting with a managed service and migrating to a custom framework when needed is a safe path.

How do I choose between ELT and ETL?

ELT (extract, load, transform) pushes transformations to the target system, which is often a data warehouse with powerful compute. This is popular in modern cloud warehouses because it leverages their scalability. ETL (extract, transform, load) transforms data before loading, which is useful when the target is a simple database or when you need to clean data before storage. Choose ELT if your target is a cloud warehouse and your transformations are SQL-friendly. Choose ETL if your transformations require complex logic or if the target has limited compute.

What is the best way to handle slowly changing dimensions (SCDs)?

SCDs are a classic data warehousing challenge. Type 1 overwrites old values; Type 2 maintains history by adding new rows with effective dates; Type 3 stores limited history in separate columns. The choice depends on whether you need to track historical changes. For most analytical use cases, Type 2 is the standard, but it increases storage and query complexity. Start with Type 1 if history is not required, and evolve to Type 2 only when needed.

How often should I run my pipeline?

The frequency should match the business need for fresh data. If reports are generated daily, a daily batch is sufficient. If the operations team needs near-real-time visibility, consider hourly or streaming. Also consider the source system's load—do not poll more often than necessary.

8. What to Do Next: Specific Actions After Reading This Checklist

You have read through the ten decisions. Now it is time to apply them to your project. Here are five concrete next steps:

Profile your source data within the next week. Run queries to understand its shape, volume, and quirks. Document the results in a shared document.
Write down your business rules for each transformation. Get sign-off from stakeholders. This will prevent scope creep later.
Choose your incremental strategy. If your source has a reliable timestamp column, plan to use it. If not, investigate CDC tools or hash-based comparison.
Set up a test environment with a small subset of real data. Write at least one unit test and one integration test before you write production code.
Design your error handling and monitoring. Decide on retry limits, dead-letter queues, and alert thresholds. Implement logging from day one.

These steps will not make your pipeline perfect, but they will give it a solid foundation. The goal is not to eliminate all failures—that is impossible—but to make failures visible, diagnosable, and recoverable. Start with one decision at a time, and iterate as you learn more about your data.

The ETL Checklist: 10 Critical Design Decisions Before You Write a Single Line of Code

Table of Contents

1. Who Needs This Checklist and What Goes Wrong Without It

2. Prerequisites: What You Need to Settle Before the First Design Meeting

Source Profiling Is Non-Negotiable

Target Schema Design

Business Rules and Data Contracts

3. The Core Workflow: Ten Design Decisions in Sequence

Decision 1: Full Load or Incremental?

Decision 2: Push or Pull?

Decision 3: Batch or Stream?

Decision 4: Schema on Read or Write?

Decision 5: How to Handle Schema Changes?

Decision 6: Error Handling and Retry Logic

Decision 7: Logging and Observability

Decision 8: Testing Strategy

Decision 9: Deployment and Versioning

Decision 10: Monitoring and Alerting

4. Tools, Setup, and Environment Realities

Cloud vs. On-Premises

Orchestration Tools

Development Environment

Cost Considerations

5. Variations for Different Constraints

Scenario A: Small Data, Simple Transformations

Scenario B: Large Data, Complex Business Rules

Scenario C: Real-Time Requirements

6. Pitfalls, Debugging, and What to Check When It Fails

Silent Data Loss

Performance Degradation Over Time

Unhandled Schema Changes

Debugging Steps

7. FAQ: Common Questions About ETL Design Decisions

Should I use a custom ETL framework or build from scratch?

How do I choose between ELT and ETL?

What is the best way to handle slowly changing dimensions (SCDs)?

How often should I run my pipeline?

8. What to Do Next: Specific Actions After Reading This Checklist

Comments (0)

Table of Contents

1. Who Needs This Checklist and What Goes Wrong Without It

2. Prerequisites: What You Need to Settle Before the First Design Meeting

Source Profiling Is Non-Negotiable

Target Schema Design

Business Rules and Data Contracts

3. The Core Workflow: Ten Design Decisions in Sequence

Decision 1: Full Load or Incremental?

Decision 2: Push or Pull?

Decision 3: Batch or Stream?

Decision 4: Schema on Read or Write?

Decision 5: How to Handle Schema Changes?

Decision 6: Error Handling and Retry Logic

Decision 7: Logging and Observability

Decision 8: Testing Strategy

Decision 9: Deployment and Versioning

Decision 10: Monitoring and Alerting

4. Tools, Setup, and Environment Realities

Cloud vs. On-Premises

Orchestration Tools

Development Environment

Cost Considerations

5. Variations for Different Constraints

Scenario A: Small Data, Simple Transformations

Scenario B: Large Data, Complex Business Rules

Scenario C: Real-Time Requirements

6. Pitfalls, Debugging, and What to Check When It Fails

Silent Data Loss

Performance Degradation Over Time

Unhandled Schema Changes

Debugging Steps

7. FAQ: Common Questions About ETL Design Decisions

Should I use a custom ETL framework or build from scratch?

How do I choose between ELT and ETL?

What is the best way to handle slowly changing dimensions (SCDs)?

How often should I run my pipeline?

8. What to Do Next: Specific Actions After Reading This Checklist

Share this article:

Comments (0)

Related Articles

Building Your First ETL Pipeline: A Toy Train Analogy for Beginners

ETL Process Design: Building a Lego City from Jumbled Toy Parts

ETL Is a Smoothie Recipe: Designing Data Workflows with Joy