Overview
The App Store has over 13 years of public performance data — but raw datasets don't tell stories. This project answers one question visually: which app genres dominated the App Store, and how did that competition shift year by year?
The core technical challenge: the source dataset is 1.2M+ records (~500MB). Loading that volume in a browser would crash it or take minutes. The solution was treating Python as an ETL layer — pre-aggregating to two tiny CSV files (~50KB total) — then using D3.js exclusively for rendering a synchronized animated bar chart race.
Data Pipeline
Python handles all the heavy lifting. D3.js only ever sees pre-aggregated, browser-safe data.
```mermaid
flowchart LR
A[/"Kaggle Dataset\n1.2M+ App Records\n~500MB raw CSV"/] -->|"data_prep.ipynb\nPandas + NumPy"| B{"Pre-aggregation\nby genre + time"}
B -->|"monthly granularity"| C[/"genre_race_monthly.csv\n~45KB — monthly rankings"/]
B -->|"yearly granularity"| D[/"genre_race_yearly.csv\n~5KB — yearly rankings"/]
C -->|"fetched by"| E["D3.js\nObservable Notebook"]
D -->|"fetched by"| E
E -->|"renders"| F["SVG Bar Chart Race\n13 Years Animated"]
F -->|"published"| G["Medium Article\n+ YouTube Demo"]
```

Pre-aggregation Design
Both output files are produced from a single Jupyter notebook (data_prep.ipynb).
The 99.99% size reduction — 500MB to 50KB — is what makes browser rendering possible.
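The cleaning and aggregation steps can be sketched in pandas. This is a minimal sketch, not the notebook's actual code: the column names (`genre`, `reviews`, `release_date`) come from the dataset schema shown below, and the normalization rule (`str.title()`) is an illustrative assumption.

```python
import pandas as pd


def build_race_tables(raw: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Sketch of the pre-aggregation: clean, group by genre + period, rank."""
    df = raw.copy()
    # Parse release_date to datetime; drop rows missing a genre or a date
    df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
    df = df.dropna(subset=["genre", "release_date"])
    # Normalize genre labels (illustrative rule: trim + title-case)
    df["genre"] = df["genre"].str.strip().str.title()
    df["year_month"] = df["release_date"].dt.to_period("M").astype(str)
    df["year"] = df["release_date"].dt.year

    def aggregate(time_col: str) -> pd.DataFrame:
        # Sum review counts per genre within each time period
        agg = (
            df.groupby([time_col, "genre"], as_index=False)["reviews"]
            .sum()
            .rename(columns={"reviews": "review_count"})
        )
        # Rank genres within each period (1 = most-reviewed)
        agg["rank"] = (
            agg.groupby(time_col)["review_count"]
            .rank(method="first", ascending=False)
            .astype(int)
        )
        return agg.sort_values([time_col, "rank"])

    return aggregate("year_month"), aggregate("year")
```

Each output DataFrame is then written with `to_csv(..., index=False)`, which is what keeps the browser payload tiny.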
```mermaid
flowchart TD
RAW["Raw Dataset — 1.2M+ rows\napp_name · genre · rating · reviews · release_date · price"] --> CLEAN["Data Cleaning\n— Parse release_date to datetime\n— Drop nulls in genre\n— Normalize genre labels"]
CLEAN --> AGG["Aggregation\n— Group by genre + year_month\n— Sum review counts per genre\n— Rank genres per time period"]
AGG --> M["genre_race_monthly.csv\nColumns: year_month · genre · review_count · rank\nRows: ~2,500"]
AGG --> Y["genre_race_yearly.csv\nColumns: year · genre · review_count · rank\nRows: ~180"]
style M fill:#2997ff,color:#fff,stroke:#0077ed
style Y fill:#2997ff,color:#fff,stroke:#0077ed
```

Key Engineering Decisions
Why D3.js over Plotly, Streamlit charts, or Tableau?
A synchronized bar chart race with smooth interpolation requires frame-level SVG control. D3.js provides this via d3.transition() and selection.join(): every bar's position, width, and label animates independently per keyframe. Plotly and Streamlit charts treat each frame as a full re-render, producing non-interpolated jumps, and Tableau does not support custom animation at this level. Of these options, only D3.js offers the low-level SVG control this visualization requires.
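The interpolation D3 performs between keyframes is conceptually simple, and can be illustrated outside the browser. This Python sketch (not part of the project code) linearly blends each genre's bar value between two consecutive keyframes, which is the data-level equivalent of what d3.transition() animates:

```python
def interpolate_keyframes(
    prev: dict[str, float], curr: dict[str, float], steps: int
) -> list[dict[str, float]]:
    """Linearly interpolate each genre's value between two keyframes."""
    genres = set(prev) | set(curr)
    frames = []
    for i in range(1, steps + 1):
        t = i / steps  # interpolation parameter, 0 < t <= 1
        # Genres absent from a keyframe are treated as value 0 (bar entering/exiting)
        frames.append(
            {g: prev.get(g, 0.0) * (1 - t) + curr.get(g, 0.0) * t for g in genres}
        )
    return frames
```

Libraries that re-render each frame from scratch skip this blending step entirely, which is exactly why their animations jump instead of glide.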
Why pre-aggregate with Python instead of processing in the browser?
1.2M records at ~500MB will exhaust browser memory or take minutes to parse. Python is the right tool for heavy transformation: Pandas group-bys and aggregations run in milliseconds on this dataset. The browser's job is exclusively rendering, not computing. This separation of concerns — Python for ETL, D3.js for rendering — mirrors production analytics architecture: the warehouse transforms data; the BI layer renders it.
Why monthly and yearly as separate files?
Monthly data reveals short-term market shifts and seasonal patterns. Yearly data reveals long-term structural trends (the rise of Games from 2008–2014, the plateau of Utilities). Combining them forces a tradeoff between granularity and payload size. Two focused files, loaded on demand, keep each visualization fast and purposeful — the same principle as maintaining separate fact tables at different time grains in a dimensional model.
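The two files relate the way fact tables at different grains do: the coarser grain is a rollup of the finer one. Assuming the monthly file's columns (year_month, genre, review_count, rank), the yearly table could be derived like this — a sketch, not the notebook's actual code:

```python
import pandas as pd


def rollup_to_yearly(monthly: pd.DataFrame) -> pd.DataFrame:
    """Roll the monthly grain up to yearly and re-rank within each year."""
    out = monthly.copy()
    # year_month is a "YYYY-MM" string; the first four characters are the year
    out["year"] = out["year_month"].str.slice(0, 4).astype(int)
    yearly = out.groupby(["year", "genre"], as_index=False)["review_count"].sum()
    # Ranks must be recomputed at the new grain, not carried over
    yearly["rank"] = (
        yearly.groupby("year")["review_count"]
        .rank(method="first", ascending=False)
        .astype(int)
    )
    return yearly.sort_values(["year", "rank"])
```

Note that the monthly ranks are discarded and recomputed: a genre's rank is only meaningful at the grain it was computed for.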
Outcome & Reflection
This became the most-read article on my Medium profile and the project that generated the most LinkedIn engagement. The key insight that resonated with readers: data engineering principles — ETL, pre-aggregation, separation of concerns — apply equally to frontend data visualization.
Any visualization project above ~10K rows benefits from treating data preparation as a proper pipeline problem. The D3.js code doesn't need to know how the data was shaped — it just needs clean, small, consistently structured input. That's what a good data layer provides.