A serverless AWS data platform I built from scratch to replace 21 Excel files with automated operations infrastructure. It supports a 79-person annotation team delivering 2.5M evaluations annually across 88 concurrent ML projects.
The manual operations infrastructure couldn't keep up as the team grew from 40 to 79 people and projects scaled from 20 to 88 concurrent workstreams.
Real technical problems I encountered and how I solved them.
Single-table design: partition key is member_id, sort key follows the PROJECT#week composite format. A GSI with the keys inverted serves project-first queries, and begins_with() on the sort key handles week filtering. Together these cover every access pattern without table scans.
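The key scheme can be sketched as follows. Table and index names (`ops-platform`, `gsi-project`) and the attribute names are illustrative assumptions, not the real resource names; the query dicts are shaped for DynamoDB's low-level `Query` API.

```python
# Sketch of the single-table key scheme. "ops-platform" and "gsi-project"
# are placeholder names, not the platform's real resources.

def member_sort_key(project_id: str, week: str) -> str:
    """Base-table sort key: PROJECT#week composite."""
    return f"{project_id}#{week}"

def member_week_query(member_id: str, project_id: str, week_prefix: str) -> dict:
    """Query params for one member's rows on a project, narrowed to a
    week prefix via begins_with on the sort key -- no table scan."""
    return {
        "TableName": "ops-platform",
        "KeyConditionExpression": "member_id = :m AND begins_with(sk, :pfx)",
        "ExpressionAttributeValues": {
            ":m": {"S": member_id},
            ":pfx": {"S": f"{project_id}#{week_prefix}"},
        },
    }

def project_first_query(project_id: str) -> dict:
    """Same data, project-first: the GSI inverts the keys so the project
    id becomes the partition key."""
    return {
        "TableName": "ops-platform",
        "IndexName": "gsi-project",
        "KeyConditionExpression": "project_id = :p",
        "ExpressionAttributeValues": {":p": {"S": project_id}},
    }
```

Because the week is the suffix of the composite sort key, a single `begins_with` prefix covers both "all weeks for a project" and "one week for a project" without a filter expression.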
Quantified results from replacing manual operations with the platform.
| Metric | Value |
|---|---|
| Browser-based UIs | 7 (Team, Requests, Capacity, Throughput, Quality, Bug Tracker, Worker Dashboard) |
| Lambda functions | 4 (API, ingestion, alerts, analytics export) |
| DynamoDB tables | 10 (all on-demand, PITR enabled) |
| EventBridge schedules | 10 (ingestion, alerts, exports, backups) |
| API endpoints | 30+ REST endpoints (GET/POST/PUT/DELETE) |
| Alert types | 7 (OOTO conflict, quality, shrinkage, unresolved worker, late request, capacity risk, SLA breach) |
| Team size supported | 79 annotators, 88 concurrent projects |
| Athena tables | 6 (quality, throughput, allocations, requests, members, projects) |
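One of the alert types in the table above, capacity risk, can be sketched as a pure rule over allocation rows. This is a hypothetical reconstruction: the real alerts Lambda runs on an EventBridge schedule and reads DynamoDB, but the rule itself is shown over plain dicts, and the 40-hour capacity and 90% threshold are assumed parameters.

```python
# Hypothetical sketch of the "capacity risk" alert rule. The real Lambda
# is EventBridge-scheduled and reads allocation rows from DynamoDB; here
# the rows are plain dicts so the rule logic stands alone.

def capacity_risk_alerts(allocations, capacity_hours=40.0, threshold=0.9):
    """Flag (member, week) pairs whose total allocated hours exceed a
    share of weekly capacity. Each row in `allocations` is a dict:
    {"member_id": ..., "week": ..., "hours": ...}."""
    totals: dict[tuple[str, str], float] = {}
    for row in allocations:
        key = (row["member_id"], row["week"])
        totals[key] = totals.get(key, 0.0) + row["hours"]
    return [
        {"member_id": member, "week": week, "hours": hours}
        for (member, week), hours in totals.items()
        if hours > capacity_hours * threshold
    ]
```

Keeping each rule a pure function of its input rows makes the seven alert types independently testable, with the scheduling and I/O concentrated in the Lambda handler.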
What I learned and what I'd do differently.