PRISM: Program & Resource Intelligence System

A serverless AWS data platform I built from scratch to replace 21 Excel files with automated operations infrastructure. It supports a 79-person annotation team delivering 2.5M evaluations annually across 88 concurrent ML projects.

  • 2.5M annotations delivered
  • 92.1% quality maintained
  • 80% manual work reduced
  • 88 concurrent projects
  • 7 browser UIs

Technology Stack

Compute & API
  • Lambda
  • API Gateway
  • EventBridge
  • CloudWatch

Data Storage & Analytics
  • DynamoDB
  • S3
  • Athena
  • Glue

Frontend & Development
  • HTML5
  • CSS3
  • JavaScript
  • Python

The Problem

Manual operations infrastructure couldn't scale with team growth.

Before: Manual Operations
  • 21 Excel files tracking quality, throughput, allocations, requests, team data
  • 3+ hours weekly capacity planning via spreadsheet formulas
  • 9 Python scripts for data ingestion, quality calculations, reporting
  • No automated alerts. Issues discovered reactively
  • Data scattered across files, impossible to query holistically
  • Cell-position-dependent formulas breaking with schema changes
  • No historical backfill or point-in-time recovery
After: Automated Platform
  • 10 DynamoDB tables as single source of truth
  • Automated capacity planning via assignment engine
  • 7 browser-based UIs for all operational workflows
  • 7 proactive alert types catching issues before impact
  • SQL-queryable analytics via S3 + Athena
  • Schema-agnostic data model supporting evolution
  • Point-in-time recovery + historical backfill from Week 1

The manual operations infrastructure couldn't keep up with team growth from 40 to 79 people and projects scaling from 20 to 88 concurrent workstreams.

System Architecture

[Architecture diagram] Data flows:
  • 7 Browser UIs (Team, Requests, Capacity, Quality, Throughput, Bugs) call API Gateway REST endpoints, served by the API Lambda (30+ endpoints) backed by DynamoDB (10 tables, on-demand, PITR enabled)
  • CloudWatch triggers the Ingestion Lambda, which auto-ingests throughput data from SageMaker
  • EventBridge (10 schedules) triggers workflows, including the Alerts Lambda (7 alert types, email notifications) and the Analytics Lambda (weekly exports, ~2.7s runtime)
  • The Analytics Lambda exports to the S3 data lake (JSON Lines format, week-partitioned), queryable in SQL via Athena (6 tables)

Development Timeline

Week 1-2
Foundation & Data Modeling
Designed DynamoDB key structure, created 10 tables, built basic API Lambda with GET/POST endpoints for team members and project allocations. First UI: simple team roster.
Week 3-4
Core UIs & Capacity Engine
Built Requests UI, Capacity Planning UI with auto-assignment engine. Implemented batch operations (bulk reassignment, multi-select filters). Reduced weekly planning time from 3+ hours to minutes.
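The auto-assignment engine's internals aren't shown here, so as a rough illustration, a greedy capacity-filling pass could look like the following (field names and data shapes are my assumptions, not the real schema):

```python
def auto_assign(members, requests):
    """Greedy capacity pass: fill each request from members with spare hours.

    members:  list of {"id": str, "capacity": int, "assigned": int}
    requests: list of {"project": str, "hours": int}, highest priority first
    Returns a list of {"member", "project", "hours"} assignments.
    """
    assignments = []
    for req in requests:
        remaining = req["hours"]
        # Walk members by most spare capacity first.
        for m in sorted(members, key=lambda m: m["capacity"] - m["assigned"],
                        reverse=True):
            if remaining <= 0:
                break
            spare = m["capacity"] - m["assigned"]
            if spare <= 0:
                continue
            take = min(spare, remaining)
            m["assigned"] += take
            remaining -= take
            assignments.append(
                {"member": m["id"], "project": req["project"], "hours": take})
    return assignments
```

A single deterministic pass like this is easy to re-run after manual overrides, which matters when planners still tweak individual assignments by hand.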
Week 5-6
Quality & Throughput Tracking
Added CloudWatch ingestion Lambda for automated throughput data from SageMaker. Built Quality UI with manual entry + validation. Throughput UI with batch reassignment and per-project drill-down.
Week 7-8
Alerts & Monitoring
Implemented 7-trigger alert system: OOTO conflicts, quality degradation, shrinkage risks, unresolved workers, late requests, capacity risks, and SLA breaches. EventBridge schedules for daily/weekly checks. Email notifications via SES.
Week 9-10
Analytics Pipeline
Built analytics export Lambda (~2.7s runtime), S3 JSON Lines exports with week partitioning, Athena tables (6 total), Glue crawlers. Historical backfill from Week 1. Foundation for QuickSight dashboards.
Week 11-12
Bug Tracker & Polish
Added Bug Tracker UI (15-field form, screenshot attachments with client-side compression, severity/category filters). Worker Dashboard for self-service. Final UI polish and documentation.

Key Challenges Solved

Real technical problems I encountered and how I solved them.

DynamoDB Key Design
Problem
Needed to query allocations by: (1) member ID, (2) project ID, (3) week number. DynamoDB only allows queries on partition + sort key. Can't have 3 query patterns without GSIs.
Solution
Composite sort key: PROJECT#week. Partition key: member_id. GSI with inverted keys for project-first queries. Week filtering via begins_with() on sort key. Supports all patterns without table scans.
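A sketch of what those access patterns could look like in code. Attribute names, the exact key format, and the zero-padding are assumptions on my part, not the production schema:

```python
def allocation_sk(project_id, week):
    """Composite sort key: project first, then week (exact format assumed)."""
    return f"PROJECT#{project_id}#W{week:02d}"

def project_prefix(project_id):
    """Sort-key prefix for begins_with() filtering inside one member's partition."""
    return f"PROJECT#{project_id}#"

def query_member_allocations(table, member_id, project_id=None):
    """Query a boto3 Table by member, optionally narrowed to one project."""
    from boto3.dynamodb.conditions import Key  # deferred so the pure helpers stay import-free
    cond = Key("member_id").eq(member_id)
    if project_id is not None:
        cond = cond & Key("sk").begins_with(project_prefix(project_id))
    return table.query(KeyConditionExpression=cond)["Items"]
```

Because the project ID leads the sort key, one `begins_with()` condition narrows a member's partition to a single project without a scan; the inverted-key GSI covers the project-first direction.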
Alert False Positives
Problem
OOTO conflict alert flagged members with OOTO on Thursday but who only worked Mon-Wed. False positives caused alert fatigue. Team stopped trusting notifications.
Solution
Changed logic from "has any allocation during OOTO week" to "has allocation on specific days matching OOTO dates." Cross-reference hours-per-day-per-allocation with OOTO calendar. Zero false positives since fix.
Backward Compatibility
Problem
Category taxonomy changed mid-project. Legacy data used old category IDs. Couldn't migrate 10,000+ records without risk. Reports needed to work across old and new data.
Solution
ELT pattern: stored raw category IDs, transformed at query time. Built mapping layer in Lambda that translates old IDs to the new taxonomy on read. 30-minute fix instead of risky migration. Zero data loss.
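A read-time translation layer in this style is only a few lines; the mapping values below are made up for illustration:

```python
# Legacy IDs stay untouched in storage; translation happens on read (ELT).
LEGACY_CATEGORY_MAP = {
    "cat_01": "text_classification",
    "cat_02": "image_labeling",
}

def normalize_category(raw_id):
    """Map an old category ID to the new taxonomy; new IDs pass through."""
    return LEGACY_CATEGORY_MAP.get(raw_id, raw_id)

def normalize_item(item):
    """Return a copy of a stored record with its category translated on read."""
    out = dict(item)
    out["category"] = normalize_category(out["category"])
    return out
```

Since unknown IDs pass through unchanged, old and new records flow through the same reporting code with no migration and no special-casing downstream.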
Lambda Cold Starts
Problem
Main API Lambda handled 30+ endpoints. Cold starts took 3-4 seconds with all dependencies. UIs felt slow on first load after idle periods.
Solution
Lazy imports for heavy dependencies (boto3 clients only imported when needed). EventBridge warmer pinging Lambda every 5 minutes during work hours. Provisioned concurrency during peak times. Cold starts down to <1s.
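The lazy-import and warmer pattern can be sketched like this (the `warmer` event marker and client cache are my assumptions about the setup):

```python
import json

_clients = {}

def get_client(service):
    """Create boto3 clients on first use so unused SDK setup never runs at cold start."""
    if service not in _clients:
        import boto3  # deferred import: paid once, on the first real request
        _clients[service] = boto3.client(service)
    return _clients[service]

def handler(event, context):
    # EventBridge warmer pings carry a marker; answer before touching any AWS SDK.
    if event.get("warmer"):
        return {"statusCode": 200, "body": "warm"}
    dynamodb = get_client("dynamodb")  # real requests initialize clients lazily
    # ... endpoint routing for the 30+ REST operations would go here ...
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```

The warmer short-circuit keeps the ping itself cheap: a warm invocation never initializes clients it doesn't need.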
Analytics Historical Backfill
Problem
Analytics pipeline launched Week 10, but leadership needed historical data from Week 1. DynamoDB doesn't have native time-travel. Manual reconstruction from old Excel files would be error-prone.
Solution
Built backfill mode in analytics Lambda: reads current state, allows specifying week range, reconstructs point-in-time snapshots from changelog data. Exported Weeks 1-9 retroactively in single run. Complete historical dataset in S3.
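The snapshot reconstruction amounts to replaying the changelog up to each week boundary; a minimal sketch with illustrative field names:

```python
def reconstruct_snapshots(changelog, start_week, end_week):
    """Replay changelog entries to rebuild end-of-week state for each week.

    changelog: list of {"week": int, "key": str, "value": object}
    Returns {week: {key: value}} for start_week..end_week inclusive.
    """
    events = sorted(changelog, key=lambda e: e["week"])
    state, snapshots, i = {}, {}, 0
    for week in range(start_week, end_week + 1):
        # Apply every change up to and including this week.
        while i < len(events) and events[i]["week"] <= week:
            state[events[i]["key"]] = events[i]["value"]
            i += 1
        snapshots[week] = dict(state)  # point-in-time copy for this week
    return snapshots
```

One sorted pass covers an arbitrary week range, which is what lets Weeks 1-9 export in a single run.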
Screenshot Storage
Problem
Bug tracker needed screenshot attachments. S3 uploads from browser = CORS complexity + presigned URLs + extra latency. Wanted to keep bugs atomic in DynamoDB.
Solution
Client-side image compression to base64 (Canvas API, quality: 0.7). Store directly in DynamoDB item as string. Lazy-load images in UI. Trade-off: item size increase, but bugs stay atomic and queryable. Works well for <100KB screenshots.
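On the server side, one guard keeps this trade-off safe against DynamoDB's 400 KB item limit; the field name and threshold below are illustrative:

```python
MAX_SCREENSHOT_BYTES = 100 * 1024  # stay well under DynamoDB's 400 KB item limit

def attach_screenshot(bug_item, data_url):
    """Attach a base64 data URL (compressed client-side); reject oversized payloads."""
    if len(data_url.encode("utf-8")) > MAX_SCREENSHOT_BYTES:
        raise ValueError("screenshot too large; compress further before upload")
    bug_item = dict(bug_item)
    bug_item["screenshot"] = data_url
    return bug_item
```

Rejecting oversized payloads at write time keeps the "bug stays atomic" guarantee honest instead of failing later inside `put_item`.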

Metrics & Impact

Quantified results from replacing manual operations with the platform.

โฑ๏ธ
80% Reduction in Manual Work
Weekly capacity planning went from 3+ hours of spreadsheet manipulation to minutes with the auto-assign engine. Quality data entry, throughput tracking, and availability management all moved from scattered spreadsheets to purpose-built UIs.
2.5M Annotations Delivered
The team delivered 2.5 million annotations in 2025, 25% above target, across 88 concurrent projects at 82% utilization and 92.1% quality. The platform provided the visibility to make this possible.
Proactive Issue Detection
7 automated alert triggers catch OOTO conflicts, quality drops, shrinkage risks, unresolved workers, and late requests before they become problems. Alerts only fire when issues exist, preventing fatigue.
Single Source of Truth
10 DynamoDB tables replaced 21 Excel files, 9 Python scripts, and a fragile cell-position-dependent pipeline. All team data now lives in one queryable system with point-in-time recovery.
SQL-Queryable Analytics
Weekly automated exports to S3 in JSON Lines format, queryable via Athena. Historical backfill from Week 1 through current. Foundation for QuickSight dashboards for leadership reporting.
~2.7s Full Export
The analytics export Lambda processes all 10 tables and writes week-partitioned data to S3 in under 3 seconds. Supports backfill ranges, selective table exports, and specific week targeting.
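The export's core moves are serializing to JSON Lines and writing under a week partition; a sketch (the S3 prefix layout is an assumption, not the actual bucket structure):

```python
import json

def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, default=str) for r in records)

def export_key(table_name, week):
    """Week-partitioned S3 key so Athena can prune by the week= partition."""
    return f"exports/{table_name}/week={week:02d}/{table_name}.jsonl"

def export_table(s3, bucket, table_name, week, records):
    """Write one table's weekly snapshot; `s3` is a boto3 S3 client."""
    s3.put_object(Bucket=bucket,
                  Key=export_key(table_name, week),
                  Body=to_jsonl(records).encode("utf-8"))
```

The `week=NN` key convention is what Hive-style partitioning in Athena keys off, so per-week queries read only the objects they need.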

System Scale

Browser-based UIs: 7 (Team, Requests, Capacity, Throughput, Quality, Bug Tracker, Worker Dashboard)
Lambda functions: 4 (API, ingestion, alerts, analytics export)
DynamoDB tables: 10 (all on-demand, PITR enabled)
EventBridge schedules: 10 (ingestion, alerts, exports, backups)
API endpoints: 30+ REST endpoints (GET/POST/PUT/DELETE)
Alert types: 7 (OOTO conflict, quality, shrinkage, unresolved worker, late request, capacity risk, SLA breach)
Team size supported: 79 annotators, 88 concurrent projects
Athena tables: 6 (quality, throughput, allocations, requests, members, projects)

Reflections

What I learned and what I'd do differently.

What I Learned

Data Modeling Is the Foundation
Getting the DynamoDB key design right (or wrong) cascades through everything. Composite sort keys, GSI choices, and access pattern planning saved me from costly refactors, but I learned this by hitting walls first.
Build for Change
The category taxonomy changed mid-project. Because I stored raw data and transformed at query time (ELT), I didn't need to migrate a single record. The backward-compatibility layer took 30 minutes instead of a risky data migration.
Alerts Need Discipline
Early versions of the OOTO conflict alert had false positives, flagging members who had OOTO on Thursday but only worked Mon-Wed. The fix was checking hours per day per allocation, not just "has any allocation." Precision matters more than coverage in alerting.
Ship, Then Iterate
The first version of every UI was functional but rough. Beta testing with real users surfaced needs I never would have anticipated: multi-select filters, bulk batch reassignment, screenshot attachments. The best features came from feedback, not planning.