The Problem
A common proof-of-concept pipeline design for ingesting 5 million daily learner activity logs from an external LMS API into Snowflake — while functional enough to demonstrate the concept — contains four critical anti-patterns that would cause outages, data loss, or security incidents at production scale. This write-up diagnoses each failure mode and documents the redesign.
The Naive Design & Its Failure Modes
The original proposal loaded all 5M rows into a Matillion in-memory variable, wiped the target table nightly before reinserting, used row-by-row database inserts, and stored API credentials directly inside the script. Each of these decisions creates a distinct failure mode.
Storing 5M rows in a Matillion local variable
Matillion is an orchestration engine, not a data processing engine. Attempting to buffer 5 million rows in instance memory will trigger an Out-of-Memory crash, taking down the entire orchestrator — not just this job. The fix is to treat Matillion as a coordinator: the Python script fetches data in paginated chunks and writes each chunk directly to S3 as a compressed Parquet file, keeping memory consumption flat regardless of dataset size.
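A minimal sketch of this coordinator pattern. The page size, the fetch callable, and the chunk-writer are all illustrative assumptions; in the real job the writer would serialize each page to compressed Parquet and upload it to the S3 Bronze prefix (e.g. via pyarrow and boto3) rather than accumulating anything in a Matillion variable:

```python
from typing import Callable, Iterator

PAGE_SIZE = 10_000  # hypothetical API page size

def fetch_chunks(fetch_page: Callable[[int], list]) -> Iterator[list]:
    """Yield one page of records at a time; memory stays flat at one page."""
    page = 0
    while True:
        records = fetch_page(page)
        if not records:
            return
        yield records
        page += 1

def run_extract(fetch_page, write_chunk) -> int:
    """Drive the paginated fetch, handing each chunk straight to storage.

    write_chunk(i, chunk) is where the real job would write Parquet to S3;
    nothing is ever held beyond the current page.
    """
    total = 0
    for i, chunk in enumerate(fetch_chunks(fetch_page)):
        write_chunk(i, chunk)
        total += len(chunk)
    return total
```

Because the generator hands off each page before fetching the next, peak memory is one page regardless of whether the dataset is 5 thousand or 5 million rows.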
Truncate-and-load with no rollback
Wiping the target table before each run means a mid-run API failure leaves the ANALYTICS_READY table empty — causing downtime for every dashboard and model that depends on it. It also makes historical trend analysis impossible, since yesterday's data is gone. The fix is a Medallion staging pattern: raw data lands in a transient table first, and a MERGE upsert promotes only clean, validated records into the analytics layer.
Row-by-row inserts via Database Query Component
Snowflake's billing model charges by compute time. Row-by-row inserts create enormous per-transaction overhead and can turn a seconds-long load into an hours-long credit drain. Snowflake's native COPY INTO command — invoked through Matillion's S3 Load component — is purpose-built for bulk ingestion: it parallelizes across the virtual warehouse's nodes and loads millions of rows in seconds.
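An illustrative COPY INTO of the kind the S3 Load component would issue. The stage name, table name, and error-handling choice are assumptions for the sketch, not the production identifiers:

```python
# Bulk load from the S3 Bronze stage into the transient RAW table.
# MATCH_BY_COLUMN_NAME maps Parquet columns to table columns by name;
# ON_ERROR = ABORT_STATEMENT fails the whole batch rather than loading
# a partial, silently incomplete result.
COPY_SQL = """
COPY INTO RAW_LEARNER_ACTIVITY
FROM @BRONZE_STAGE/learner_activity/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
ON_ERROR = ABORT_STATEMENT
""".strip()
```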
Hardcoded API credentials in the script body
Embedding an API secret key inside a Matillion job script exposes it to every user with job-view access and to any log aggregation system. In an educational context, this likely violates FERPA and university data governance standards. The fix is to route credentials through AWS Secrets Manager: the key is encrypted at rest, injected only at runtime, and access is auditable via IAM policy.
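A sketch of the runtime credential fetch. The secret id and the "api_key" field name inside the secret's JSON payload are assumptions; the client is injected so the helper stays testable, but in the job it would be boto3.client("secretsmanager"):

```python
import json

def get_api_key(secrets_client, secret_id: str) -> str:
    """Fetch the API key at runtime; nothing is persisted in the job script.

    secrets_client is a boto3 Secrets Manager client. The secret is
    decrypted by Secrets Manager on retrieval, and every access is
    auditable through IAM / CloudTrail.
    """
    resp = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(resp["SecretString"])["api_key"]
```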
The Redesigned Architecture
The redesign shifts from a memory-heavy ETL model to a storage-first ELT approach — using S3 as an immutable Bronze layer and Snowflake's own compute for all transformation and merge logic. The pipeline is idempotent: re-running it produces the same result, with no risk of data loss or duplication.
```mermaid
flowchart LR
API["External API\n(Learner Activity Logs)"]
SM["AWS Secrets Manager\n(API Key at runtime)"]
PY["Matillion\nPython Script\n(paginated fetch)"]
S3["AWS S3\nBronze Layer\n(Parquet, encrypted)"]
LOAD["Matillion S3 Load\n(COPY INTO)"]
RAW["Snowflake\nRAW_LEARNER_ACTIVITY\n(Transient Table)"]
MERGE["Matillion SQL Merge\n(UPSERT on activity_id)"]
READY["ANALYTICS_READY\nSilver Layer\n(with INGESTED_AT, BATCH_ID)"]
BI["Downstream\nBI / AI Teams"]
ALERT["SNS / Slack\nFailure Alerts"]
SM -->|"inject key"| PY
API -->|"paginated chunks"| PY
PY -->|"write Parquet chunks"| S3
S3 -->|"COPY INTO"| LOAD
LOAD --> RAW
RAW --> MERGE
MERGE --> READY
READY --> BI
LOAD -->|"on failure"| ALERT
PY -->|"on failure"| ALERT
style S3 fill:#3b5998,color:#fff,stroke:#2d4a7a
style READY fill:#2997ff,color:#fff,stroke:#0077ed
```
Architecture Diagram
The full system view — showing data flow from the external API through Matillion orchestration, S3 Bronze staging, Snowflake bulk load, and MERGE-based promotion to the analytics layer, with IAM and Secrets Manager securing every boundary.
Key Design Decisions
High-water mark over full historical pull
Instead of pulling all 5M rows every night, the Python script first queries MAX(activity_timestamp) from the analytics table and passes that timestamp to the API as a filter. Only records created or modified since the last successful run are fetched. This reduces API load, ingestion time, and Snowflake compute spend — and makes the pipeline naturally incremental as the dataset grows.
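A sketch of turning the high-water mark into an API filter. The "modified_since" parameter name is an assumption about the LMS API; a missing high-water mark (the first run, or an empty table) falls back to a full pull:

```python
from datetime import datetime, timezone
from typing import Optional

def build_fetch_params(last_loaded: Optional[datetime]) -> dict:
    """Translate the table's high-water mark into API query parameters.

    last_loaded is the result of SELECT MAX(activity_timestamp) on the
    analytics table; None means no prior data, so no filter is applied.
    """
    params = {"page_size": 10_000}
    if last_loaded is not None:
        params["modified_since"] = last_loaded.isoformat()
    return params
```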
Parquet on S3 as the Bronze layer
Writing compressed Parquet files to S3 (rather than loading directly into Snowflake) decouples extraction from ingestion. S3 becomes an immutable audit log — if the Snowflake load fails, raw data is preserved and re-loadable with no additional API call. Parquet's columnar compression also reduces S3 storage costs and speeds up COPY INTO throughput compared to CSV.
UPSERT on activity_id for late-arriving updates
Educational platforms routinely backfill or modify historical records — a student's
session completion status might be updated days after the activity occurred. A pure
INSERT pattern would produce duplicates; a full reload would be expensive and fragile.
A MERGE on activity_id handles both cases: it updates existing rows in-place and inserts only net-new records, rewriting only the affected Snowflake micro-partitions.
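An illustrative shape for that statement. Only activity_id comes from the source; the other column names are assumptions about the schema:

```python
# MERGE from the transient RAW table into the analytics layer, keyed on
# activity_id: late-arriving updates hit WHEN MATCHED, net-new activity
# hits WHEN NOT MATCHED, and unchanged micro-partitions are untouched.
MERGE_SQL = """
MERGE INTO ANALYTICS_READY AS tgt
USING RAW_LEARNER_ACTIVITY AS src
  ON tgt.activity_id = src.activity_id
WHEN MATCHED THEN UPDATE SET
  tgt.status = src.status,
  tgt.activity_timestamp = src.activity_timestamp
WHEN NOT MATCHED THEN INSERT (activity_id, status, activity_timestamp)
  VALUES (src.activity_id, src.status, src.activity_timestamp)
""".strip()
```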
Audit columns and alerting on failure paths
Every promoted row is stamped with INGESTED_AT and BATCH_ID for lineage tracking. Matillion's failure paths are wired to SNS/Slack so any API outage or load error surfaces immediately — preventing the silent data gaps that truncate-and-load pipelines are notorious for hiding until a stakeholder notices stale dashboards.
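A small sketch of how those stamps might be generated once per run — the UUID-per-batch choice is an assumption, but it lets a bad batch be traced (or rolled back) as a single unit:

```python
import uuid
from datetime import datetime, timezone

def audit_columns() -> dict:
    """Generate the lineage stamps applied to every row in one run.

    BATCH_ID ties all rows of a run together; INGESTED_AT records load
    time independently of the activity's own timestamp.
    """
    return {
        "BATCH_ID": str(uuid.uuid4()),
        "INGESTED_AT": datetime.now(timezone.utc).isoformat(),
    }
```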
Reflection
The most instructive part of this design exercise was tracing how each anti-pattern compounds the others. A truncate-and-load strategy feels safe until you add the OOM risk: now a crash during extraction leaves the analytics table empty with no recovery path. Fixing memory without fixing the load strategy still leaves you exposed. The Medallion pattern resolves all four failure modes together — extraction, staging, and promotion are independent and individually recoverable.
The FERPA dimension reinforced something I've applied in production: security controls aren't optional plumbing to add later. Routing credentials through AWS Secrets Manager from day one costs nothing at small scale and saves significant remediation effort once an audit reveals a hardcoded key in a shared job definition.