Case Study

Designing a Scalable Learner Data Pipeline

Diagnosing four critical anti-patterns in a Matillion + Snowflake ETL design and rebuilding it as a robust, idempotent Medallion ELT architecture — with S3 staging, high-water mark incremental loads, and MERGE-based upserts.

Matillion · Snowflake · AWS S3 · Python · Medallion Architecture · FERPA

The Problem

A common proof-of-concept pipeline design for ingesting 5 million daily learner activity logs from an external LMS API into Snowflake — while functional enough to demonstrate the concept — contains four critical anti-patterns that would cause outages, data loss, or security incidents at production scale. This write-up diagnoses each failure mode and documents the redesign.

The Naive Design & Its Failure Modes

The original proposal loaded all 5M rows into a Matillion in-memory variable, wiped the target table nightly before reinserting, used row-by-row database inserts, and stored API credentials directly inside the script. Each of these decisions creates a distinct failure mode.

Anti-Pattern #1 — Memory

Storing 5M rows in a Matillion local variable

Matillion is an orchestration engine, not a data processing engine. Attempting to buffer 5 million rows in instance memory will trigger an Out-of-Memory crash, taking down the entire orchestrator — not just this job. The fix is to treat Matillion as a coordinator: the Python script fetches data in paginated chunks and writes each chunk directly to S3 as a compressed Parquet file, keeping memory consumption flat regardless of dataset size.
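The paginated pattern can be sketched as a generator that streams one chunk at a time. This is a minimal illustration, not Matillion's API: `fetch_page` and `write_chunk` are hypothetical stand-ins for the real LMS API call and the Parquet-to-S3 upload (which would use pyarrow and boto3 in the actual job).

```python
from typing import Callable, Dict, Iterator, List

def stream_pages(fetch_page: Callable[[int], List[Dict]]) -> Iterator[List[Dict]]:
    """Yield one page of records at a time so memory stays flat.

    fetch_page stands in for the real API call and returns an
    empty list once the pages are exhausted.
    """
    page = 0
    while True:
        records = fetch_page(page)
        if not records:
            return
        yield records
        page += 1

def run_extract(fetch_page: Callable[[int], List[Dict]],
                write_chunk: Callable[[List[Dict]], None]) -> int:
    """Drive the extraction: each chunk goes straight to the sink
    (e.g. an S3 Parquet upload) and is never accumulated in memory."""
    total = 0
    for chunk in stream_pages(fetch_page):
        write_chunk(chunk)
        total += len(chunk)
    return total
```

Because only one page is resident at a time, peak memory is bounded by the page size, not the 5M-row total.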

Anti-Pattern #2 — Data Integrity

Truncate-and-load with no rollback

Wiping the target table before each run means a mid-run API failure leaves the ANALYTICS_READY table empty — causing downtime for every dashboard and model that depends on it. It also makes historical trend analysis impossible since yesterday's data is gone. The fix is a Medallion staging pattern: raw data lands in a transient table first, and a MERGE upsert promotes only clean, validated records into the analytics layer.
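The promotion step can be expressed as a small SQL builder. The table and column names below are illustrative placeholders, not the production schema; the shape of the generated MERGE matches standard Snowflake syntax.

```python
from typing import List

def build_merge_sql(target: str, staging: str, key: str, cols: List[str]) -> str:
    """Render the Snowflake MERGE that promotes validated staging rows
    into the analytics layer: update on key match, insert otherwise."""
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in cols)
    all_cols = [key] + cols
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"s.{c}" for c in all_cols)
    return (
        f"MERGE INTO {target} t "
        f"USING {staging} s ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )
```

If the transient staging load fails, the MERGE never runs and the analytics table is left untouched, which is the rollback safety the truncate-and-load design lacked.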

Anti-Pattern #3 — Performance

Row-by-row inserts via Database Query Component

Snowflake's billing model charges by compute time. Row-by-row inserts create enormous per-transaction overhead and can turn a seconds-long load into an hours-long credit drain. Snowflake's native COPY INTO command — invoked through Matillion's S3 Load component — is purpose-built for bulk ingestion: it parallelizes across Snowflake's virtual warehouse nodes and loads millions of rows in seconds.
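The statement the S3 Load component issues looks roughly like the one rendered below. This is a sketch: the stage path and table names are placeholders, and a production job would typically reference a named external stage rather than a raw S3 URL.

```python
def build_copy_into(table: str, stage_path: str) -> str:
    """Render a bulk COPY INTO for Parquet files staged in S3.
    MATCH_BY_COLUMN_NAME maps Parquet columns to table columns
    by name rather than by position."""
    return (
        f"COPY INTO {table} FROM {stage_path} "
        f"FILE_FORMAT = (TYPE = PARQUET) "
        f"MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
    )
```

One set-based statement replaces 5M individual INSERT transactions, which is where the compute savings come from.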

Anti-Pattern #4 — Security

Hardcoded API credentials in the script body

Embedding an API secret key inside a Matillion job script exposes it to every user with job-view access and to any log aggregation system. In an educational context, this likely violates FERPA and university data governance standards. The fix is to route credentials through AWS Secrets Manager: the key is encrypted at rest, injected only at runtime, and access is auditable via IAM policy.
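The runtime lookup is a few lines. In production `sm_client` would be a boto3 Secrets Manager client (`boto3.client("secretsmanager")`); the secret ID and the `api_key` field name inside the secret JSON are assumptions for illustration.

```python
import json

def fetch_api_key(sm_client, secret_id: str) -> str:
    """Pull the LMS API key at runtime instead of hardcoding it.
    sm_client is any object exposing the Secrets Manager
    get_secret_value(SecretId=...) call signature."""
    resp = sm_client.get_secret_value(SecretId=secret_id)
    return json.loads(resp["SecretString"])["api_key"]
```

Nothing sensitive ever lands in the job definition or the logs; IAM controls who can call `get_secret_value`, and every access is recorded in CloudTrail.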

The Redesigned Architecture

The redesign shifts from a memory-heavy ETL model to a storage-first ELT approach — using S3 as an immutable Bronze layer and Snowflake's own compute for all transformation and merge logic. The pipeline is idempotent: re-running it produces the same result, with no risk of data loss or duplication.

flowchart LR
    API["External API\n(Learner Activity Logs)"]
    SM["AWS Secrets Manager\n(API Key at runtime)"]
    PY["Matillion\nPython Script\n(paginated fetch)"]
    S3["AWS S3\nBronze Layer\n(Parquet, encrypted)"]
    LOAD["Matillion S3 Load\n(COPY INTO)"]
    RAW["Snowflake\nRAW_LEARNER_ACTIVITY\n(Transient Table)"]
    MERGE["Matillion SQL Merge\n(UPSERT on activity_id)"]
    READY["ANALYTICS_READY\nSilver Layer\n(with INGESTED_AT, BATCH_ID)"]
    BI["Downstream\nBI / AI Teams"]
    ALERT["SNS / Slack\nFailure Alerts"]

    SM -->|"inject key"| PY
    API -->|"paginated chunks"| PY
    PY -->|"write Parquet chunks"| S3
    S3 -->|"COPY INTO"| LOAD
    LOAD --> RAW
    RAW --> MERGE
    MERGE --> READY
    READY --> BI
    LOAD -->|"on failure"| ALERT
    PY -->|"on failure"| ALERT

    style S3 fill:#3b5998,color:#fff,stroke:#2d4a7a
    style READY fill:#2997ff,color:#fff,stroke:#0077ed

Architecture Diagram

The full system view — showing data flow from the external API through Matillion orchestration, S3 Bronze staging, Snowflake bulk load, and MERGE-based promotion to the analytics layer, with IAM and Secrets Manager securing every boundary.


Key Design Decisions

Decision #1 — Extraction Strategy

High-water mark over full historical pull

Instead of pulling all 5M rows every night, the Python script first queries MAX(activity_timestamp) from the analytics table and passes that timestamp to the API as a filter. Only records created or modified since the last successful run are fetched. This reduces API load, ingestion time, and Snowflake compute spend — and makes the pipeline naturally incremental as the dataset grows.
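A sketch of the watermark-to-filter step, assuming the API accepts a `modified_since` parameter (the parameter name and timestamp format are illustrative):

```python
from typing import Dict, Optional

# Full-history fallback used only on the very first run,
# when the analytics table is still empty.
EPOCH_FLOOR = "1970-01-01T00:00:00Z"

def watermark_filter(max_ts: Optional[str]) -> Dict[str, str]:
    """Build the incremental API filter from the last successful load.
    max_ts is the result of SELECT MAX(activity_timestamp) on the
    analytics table; None means the table is empty."""
    return {"modified_since": max_ts or EPOCH_FLOOR}
```

Because the watermark is read from the analytics table itself, a failed run simply repeats the same window on retry: no bookkeeping table to keep in sync.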

Decision #2 — Storage Format

Parquet on S3 as the Bronze layer

Writing compressed Parquet files to S3 (rather than loading directly into Snowflake) decouples extraction from ingestion. S3 becomes an immutable audit log — if the Snowflake load fails, raw data is preserved and re-loadable with no additional API call. Parquet's columnar compression also reduces S3 storage costs and speeds up COPY INTO throughput compared to CSV.
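A deterministic, date-partitioned key layout makes the Bronze layer easy to audit and re-load. The prefix scheme below is an assumption for illustration, not a Matillion or Snowflake convention:

```python
def bronze_key(run_date: str, batch_id: str, chunk_idx: int) -> str:
    """Deterministic S3 key for one Parquet chunk in the Bronze layer.
    The dt= partition prefix keeps daily loads browsable, and the
    unique batch_id prevents reruns from overwriting earlier evidence."""
    return (
        f"bronze/learner_activity/dt={run_date}/"
        f"{batch_id}/chunk_{chunk_idx:05d}.parquet"
    )
```

If a Snowflake load fails, pointing COPY INTO at the same `dt=` prefix replays the day with zero additional API calls.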

Decision #3 — Merge Strategy

UPSERT on activity_id for late-arriving updates

Educational platforms routinely backfill or modify historical records — a student's session completion status might be updated days after the activity occurred. A pure INSERT pattern would produce duplicates; a full reload would be expensive and fragile. A MERGE on activity_id handles both cases: it updates existing rows in-place and inserts only net-new records, rewriting only the affected Snowflake micro-partitions.
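When a single batch carries both an original row and its later correction, the batch should be deduplicated before the MERGE so the winner is deterministic. A minimal sketch (the field names mirror the document's schema but the helper itself is illustrative):

```python
from typing import Dict, List

def latest_per_key(records: List[Dict],
                   key: str = "activity_id",
                   ts: str = "activity_timestamp") -> List[Dict]:
    """Keep only the newest version of each activity within a batch,
    so one MERGE pass never sees two candidate rows for the same key."""
    best: Dict[str, Dict] = {}
    for r in records:
        k = r[key]
        if k not in best or r[ts] > best[k][ts]:
            best[k] = r
    return list(best.values())
```

This mirrors what a `QUALIFY ROW_NUMBER() ... = 1` filter would do on the staging table in SQL.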

Decision #4 — Observability

Audit columns and alerting on failure paths

Every promoted row is stamped with INGESTED_AT and BATCH_ID for lineage tracking. Matillion's failure paths are wired to SNS/Slack so any API outage or load error surfaces immediately — preventing the silent data gaps that truncate-and-load pipelines are notorious for hiding until a stakeholder notices stale dashboards.
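The stamping step is trivial to implement and pays for itself the first time lineage is questioned. The column names match the architecture diagram; the helper itself is a sketch:

```python
import uuid
from datetime import datetime, timezone

def stamp_batch(records, batch_id=None):
    """Attach INGESTED_AT / BATCH_ID lineage columns to every record
    before promotion. All rows in one run share the same timestamp
    and batch_id, so a bad batch can be traced (or rolled back) as a unit."""
    batch_id = batch_id or uuid.uuid4().hex
    ingested_at = datetime.now(timezone.utc).isoformat()
    for r in records:
        r["INGESTED_AT"] = ingested_at
        r["BATCH_ID"] = batch_id
    return records
```

With BATCH_ID on every row, `DELETE ... WHERE BATCH_ID = ?` is a clean remediation path if a batch is later found to be corrupt.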

Reflection

The most instructive part of this design exercise was tracing how each anti-pattern compounds the others. A truncate-and-load strategy feels safe until you add the OOM risk: now a crash during extraction leaves the analytics table empty with no recovery path. Fixing memory without fixing the load strategy still leaves you exposed. The Medallion pattern resolves all four failure modes together — extraction, staging, and promotion are independent and individually recoverable.

The FERPA dimension reinforced something I've applied in production: security controls aren't optional plumbing to add later. Routing credentials through AWS Secrets Manager from day one costs nothing at small scale and saves significant remediation effort once an audit reveals a hardcoded key in a shared job definition.