Case Study

App Store Ecosystem Analytics

Transformed 500MB of raw Kaggle data into a smooth 50KB interactive bar chart race visualizing 13 years of App Store genre competition — built with Python and D3.js, published on Medium.

D3.js · Python · Pandas · NumPy · HTML/CSS · Observable
Read on Medium · GitHub

Overview

The App Store has over 13 years of public performance data — but raw datasets don't tell stories. This project answers one question visually: which app genres dominated the App Store, and how did that competition shift year by year?

The core technical challenge: the source dataset is 1.2M+ records (~500MB). Loading that volume in a browser would crash it or take minutes. The solution was treating Python as an ETL layer — pre-aggregating to two tiny CSV files (~50KB total) — then using D3.js exclusively for rendering a synchronized animated bar chart race.

Data Pipeline

Python handles all the heavy lifting. D3.js only ever sees pre-aggregated, browser-safe data.

flowchart LR
    A[/"Kaggle Dataset\n1.2M+ App Records\n~500MB raw CSV"/] -->|"data_prep.ipynb\nPandas + NumPy"| B{"Pre-aggregation\nby genre + time"}
    B -->|"monthly granularity"| C[/"genre_race_monthly.csv\n~45KB — monthly rankings"/]
    B -->|"yearly granularity"| D[/"genre_race_yearly.csv\n~5KB — yearly rankings"/]
    C -->|"fetched by"| E["D3.js\nObservable Notebook"]
    D -->|"fetched by"| E
    E -->|"renders"| F["SVG Bar Chart Race\n13 Years Animated"]
    F -->|"published"| G["Medium Article\n+ YouTube Demo"]

Pre-aggregation Design

Both output files are produced from a single Jupyter notebook (data_prep.ipynb). The 99.99% size reduction — 500MB to 50KB — is what makes browser rendering possible.

flowchart TD
    RAW["Raw Dataset — 1.2M+ rows\napp_name · genre · rating · reviews · release_date · price"] --> CLEAN["Data Cleaning\n— Parse release_date to datetime\n— Drop nulls in genre\n— Normalize genre labels"]
    CLEAN --> AGG["Aggregation\n— Group by genre + year_month\n— Sum review counts per genre\n— Rank genres per time period"]
    AGG --> M["genre_race_monthly.csv\nColumns: year_month · genre · review_count · rank\nRows: ~2,500"]
    AGG --> Y["genre_race_yearly.csv\nColumns: year · genre · review_count · rank\nRows: ~180"]
    style M fill:#2997ff,color:#fff,stroke:#0077ed
    style Y fill:#2997ff,color:#fff,stroke:#0077ed
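The notebook itself isn't reproduced here, but the cleaning and aggregation steps in the diagram reduce to a handful of Pandas operations. A minimal sketch, assuming the raw columns match the diagram (`genre`, `reviews`, `release_date`); the function name and exact normalization rules are illustrative, not the notebook's actual code:

```python
import pandas as pd

def build_race_tables(raw: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Condense raw app records into the two small race tables."""
    df = raw.copy()

    # Cleaning: parse dates, drop rows missing genre, normalize genre labels
    df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
    df = df.dropna(subset=["genre", "release_date"])
    df["genre"] = df["genre"].str.strip().str.title()

    # Monthly grain: total reviews per genre per month, ranked within each month
    df["year_month"] = df["release_date"].dt.to_period("M").astype(str)
    monthly = (df.groupby(["year_month", "genre"], as_index=False)["reviews"].sum()
                 .rename(columns={"reviews": "review_count"}))
    monthly["rank"] = (monthly.groupby("year_month")["review_count"]
                              .rank(ascending=False, method="first").astype(int))

    # Yearly grain: same aggregation at a coarser period
    df["year"] = df["release_date"].dt.year
    yearly = (df.groupby(["year", "genre"], as_index=False)["reviews"].sum()
                .rename(columns={"reviews": "review_count"}))
    yearly["rank"] = (yearly.groupby("year")["review_count"]
                            .rank(ascending=False, method="first").astype(int))
    return monthly, yearly
```

Because both tables come out of the same cleaned frame, the monthly and yearly files can never disagree about a genre's totals.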

Key Engineering Decisions

Decision #1 — Visualization Library

Why D3.js over Plotly, Streamlit charts, or Tableau?

A synchronized bar chart race with smooth interpolation requires frame-level SVG control. D3.js provides this via d3.transition() and selection.join() — every bar's position, width, and label animates independently per keyframe. Plotly and Streamlit charts treat each frame as a full re-render, producing non-interpolated jumps. Tableau cannot produce custom animations of this type. D3.js is the only library with the low-level SVG control this visualization requires.

Decision #2 — ETL Strategy

Why pre-aggregate with Python instead of processing in the browser?

1.2M records at ~500MB would exhaust browser memory or take minutes to parse. Python is the right tool for heavy transformation: Pandas group-bys and aggregations on this dataset finish in well under a second. The browser's job is exclusively rendering, not computing. This separation of concerns — Python for ETL, D3.js for rendering — mirrors production analytics architecture: the warehouse transforms data; the BI layer renders it.
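To make the speed claim concrete, here's an illustrative benchmark on synthetic data of the same shape — the genre list, date range, and row count are stand-ins, not the real dataset:

```python
import time
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset: 1.2M rows across a few genres
rng = np.random.default_rng(0)
n = 1_200_000
months = pd.period_range("2008-07", "2021-06", freq="M").astype(str)
df = pd.DataFrame({
    "genre": rng.choice(["Games", "Utilities", "Education", "Music"], size=n),
    "year_month": rng.choice(months, size=n),
    "reviews": rng.integers(0, 1_000, size=n),
})

# The entire "heavy" step D3 never sees: one group-by over 1.2M rows
t0 = time.perf_counter()
agg = df.groupby(["year_month", "genre"], as_index=False)["reviews"].sum()
elapsed = time.perf_counter() - t0
print(f"{len(df):,} rows -> {len(agg):,} rows in {elapsed:.3f}s")
```

On commodity hardware this completes far faster than a browser could even finish parsing a 500MB CSV, and the output is small enough to ship as a static file.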

Decision #3 — Two Output Granularities

Why monthly and yearly as separate files?

Monthly data reveals short-term market shifts and seasonal patterns. Yearly data reveals long-term structural trends (the rise of Games from 2008–2014, the plateau of Utilities). Combining them forces a tradeoff between granularity and payload size. Two focused files, loaded on demand, keeps each visualization fast and purposeful — the same principle as maintaining separate fact tables at different time grains in a dimensional model.
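The two grains are not independent: the yearly table is a roll-up of the monthly one, so keeping them consistent costs nothing. A small illustration with toy numbers (not the project's data):

```python
import pandas as pd

# Toy monthly fact table at the finer grain
monthly = pd.DataFrame({
    "year_month": ["2013-01", "2013-02", "2014-01"],
    "genre": ["Games", "Games", "Games"],
    "review_count": [120, 80, 300],
})

# The coarser grain is just a further roll-up of the finer one —
# the yearly file could be regenerated from the monthly file at any time
yearly = (monthly.assign(year=monthly["year_month"].str[:4].astype(int))
                 .groupby(["year", "genre"], as_index=False)["review_count"].sum())
print(yearly)
```

Shipping both files pre-computed simply moves that roll-up out of the browser, in keeping with the project's ETL-first design.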

Outcome & Reflection

This became the most-read article on my Medium profile and the project that generated the most LinkedIn engagement. The key insight that resonated with readers: data engineering principles — ETL, pre-aggregation, separation of concerns — apply equally to frontend data visualization.

Any visualization project above ~10K rows benefits from treating data preparation as a proper pipeline problem. The D3.js code doesn't need to know how the data was shaped — it just needs clean, small, consistently structured input. That's what a good data layer provides.