PRISM: Program & Resource Intelligence System

A serverless AWS data platform I built from scratch to replace 21 Excel files with automated operations infrastructure. It supports a 79-person annotation team delivering 2.5M evaluations annually across 88 concurrent ML projects.

  • 2.5M annotations delivered
  • 92.1% quality maintained
  • 80% manual work reduced
  • 88 concurrent projects
  • 7 browser UIs

Technology Stack

Compute & API
  • Lambda
  • API Gateway
  • EventBridge
  • CloudWatch

Data Storage & Analytics
  • DynamoDB
  • S3
  • Athena
  • Glue

Frontend & Development
  • HTML5
  • CSS3
  • JavaScript
  • Python

The Problem

Manual operations infrastructure couldn't scale with team growth.

Before: Manual Operations
  • 21 Excel files tracking quality, throughput, allocations, requests, team data
  • 3+ hours weekly capacity planning via spreadsheet formulas
  • 9 Python scripts for data ingestion, quality calculations, reporting
  • No automated alerts. Issues discovered reactively
  • Data scattered across files, impossible to query holistically
  • Cell-position-dependent formulas breaking with schema changes
  • No historical backfill or point-in-time recovery
After: Automated Platform
  • 10 DynamoDB tables as single source of truth
  • Automated capacity planning via assignment engine
  • 7 browser-based UIs for all operational workflows
  • 7 proactive alert types catching issues before impact
  • SQL-queryable analytics via S3 + Athena
  • Schema-agnostic data model supporting evolution
  • Point-in-time recovery + historical backfill from Week 1

The manual operations infrastructure couldn't keep up with team growth from 40 to 79 people and projects scaling from 20 to 88 concurrent workstreams.

System Architecture

[Architecture diagram] Data flows:
  • 7 Browser UIs (Team, Requests, Capacity, Quality, Throughput, Bugs) call API Gateway REST endpoints, served by the API Lambda (30+ endpoints) backed by DynamoDB (10 tables, on-demand, PITR enabled)
  • CloudWatch triggers the Ingestion Lambda, which auto-ingests throughput data from SageMaker
  • EventBridge (10 schedules) triggers workflows, including the Alerts Lambda (7 alert types, email notifications) and the Analytics Lambda (weekly exports, ~2.7s runtime)
  • The Analytics Lambda exports to the S3 data lake (JSON Lines format, week-partitioned), queryable in SQL via Athena (6 tables)

Development Timeline

Week 1-2
Foundation & Data Modeling
Designed DynamoDB key structure, created 10 tables, built basic API Lambda with GET/POST endpoints for team members and project allocations. First UI: simple team roster.
Week 3-4
Core UIs & Capacity Engine
Built Requests UI, Capacity Planning UI with auto-assignment engine. Implemented batch operations (bulk reassignment, multi-select filters). Reduced weekly planning time from 3+ hours to minutes.
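The auto-assignment engine's internals aren't shown here, so as a rough illustration, a greedy capacity-filling pass could look like the following (field names and data shapes are my assumptions, not the real schema):

```python
def auto_assign(members, requests):
    """Greedy capacity pass: fill each request from members with spare hours.

    members:  list of {"id": str, "capacity": int, "assigned": int}
    requests: list of {"project": str, "hours": int}, highest priority first
    Returns a list of {"member", "project", "hours"} assignments.
    """
    assignments = []
    for req in requests:
        remaining = req["hours"]
        # Walk members by most spare capacity first.
        for m in sorted(members, key=lambda m: m["capacity"] - m["assigned"],
                        reverse=True):
            if remaining <= 0:
                break
            spare = m["capacity"] - m["assigned"]
            if spare <= 0:
                continue
            take = min(spare, remaining)
            m["assigned"] += take
            remaining -= take
            assignments.append(
                {"member": m["id"], "project": req["project"], "hours": take})
    return assignments
```

A single deterministic pass like this is easy to re-run after manual overrides, which matters when planners still tweak individual assignments by hand.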
Week 5-6
Quality & Throughput Tracking
Added CloudWatch ingestion Lambda for automated throughput data from SageMaker. Built Quality UI with manual entry + validation. Throughput UI with batch reassignment and per-project drill-down.
Week 7-8
Alerts & Monitoring
Implemented 7-trigger alert system: OOTO conflicts, quality degradation, shrinkage risks, unresolved workers, late requests, capacity risks, and SLA breaches. EventBridge schedules for daily/weekly checks. Email notifications via SES.
Week 9-10
Analytics Pipeline
Built analytics export Lambda (~2.7s runtime), S3 JSON Lines exports with week partitioning, Athena tables (6 total), Glue crawlers. Historical backfill from Week 1. Foundation for QuickSight dashboards.
Week 11-12
Bug Tracker & Polish
Added Bug Tracker UI (15-field form, screenshot attachments with client-side compression, severity/category filters). Worker Dashboard for self-service. Final UI polish and documentation.

Key Challenges Solved

Real technical problems I encountered and how I solved them.

DynamoDB Key Design
Problem
Needed to query allocations by: (1) member ID, (2) project ID, (3) week number. DynamoDB only allows queries on partition + sort key. Can't have 3 query patterns without GSIs.
Solution
Composite sort key: PROJECT#week. Partition key: member_id. GSI with inverted keys for project-first queries. Week filtering via begins_with() on sort key. Supports all patterns without table scans.
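A sketch of what those access patterns could look like in code. Attribute names, the exact key format, and the zero-padding are assumptions on my part, not the production schema:

```python
def allocation_sk(project_id, week):
    """Composite sort key: project first, then week (exact format assumed)."""
    return f"PROJECT#{project_id}#W{week:02d}"

def project_prefix(project_id):
    """Sort-key prefix for begins_with() filtering inside one member's partition."""
    return f"PROJECT#{project_id}#"

def query_member_allocations(table, member_id, project_id=None):
    """Query a boto3 Table by member, optionally narrowed to one project."""
    from boto3.dynamodb.conditions import Key  # deferred so the pure helpers stay import-free
    cond = Key("member_id").eq(member_id)
    if project_id is not None:
        cond = cond & Key("sk").begins_with(project_prefix(project_id))
    return table.query(KeyConditionExpression=cond)["Items"]
```

Because the project ID leads the sort key, one `begins_with()` condition narrows a member's partition to a single project without a scan; the inverted-key GSI covers the project-first direction.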
Alert False Positives
Problem
OOTO conflict alert flagged members with OOTO on Thursday but who only worked Mon-Wed. False positives caused alert fatigue. Team stopped trusting notifications.
Solution
Changed logic from "has any allocation during OOTO week" to "has allocation on specific days matching OOTO dates." Cross-reference hours-per-day-per-allocation with OOTO calendar. Zero false positives since fix.
Backward Compatibility
Problem
Category taxonomy changed mid-project. Legacy data used old category IDs. Couldn't migrate 10,000+ records without risk. Reports needed to work across old and new data.
Solution
ELT pattern: stored raw category IDs, transformed at query time. Built mapping layer in Lambda that translates old IDs to the new taxonomy on read. 30-minute fix instead of risky migration. Zero data loss.
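A read-time translation layer in this style is only a few lines; the mapping values below are made up for illustration:

```python
# Legacy IDs stay untouched in storage; translation happens on read (ELT).
LEGACY_CATEGORY_MAP = {
    "cat_01": "text_classification",
    "cat_02": "image_labeling",
}

def normalize_category(raw_id):
    """Map an old category ID to the new taxonomy; new IDs pass through."""
    return LEGACY_CATEGORY_MAP.get(raw_id, raw_id)

def normalize_item(item):
    """Return a copy of a stored record with its category translated on read."""
    out = dict(item)
    out["category"] = normalize_category(out["category"])
    return out
```

Since unknown IDs pass through unchanged, old and new records flow through the same reporting code with no migration and no special-casing downstream.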
Lambda Cold Starts
Problem
Main API Lambda handled 30+ endpoints. Cold starts took 3-4 seconds with all dependencies. UIs felt slow on first load after idle periods.
Solution
Lazy imports for heavy dependencies (boto3 clients only imported when needed). EventBridge warmer pinging Lambda every 5 minutes during work hours. Provisioned concurrency during peak times. Cold starts down to <1s.
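The lazy-import and warmer pattern can be sketched like this (the `warmer` event marker and client cache are my assumptions about the setup):

```python
import json

_clients = {}

def get_client(service):
    """Create boto3 clients on first use so unused SDK setup never runs at cold start."""
    if service not in _clients:
        import boto3  # deferred import: paid once, on the first real request
        _clients[service] = boto3.client(service)
    return _clients[service]

def handler(event, context):
    # EventBridge warmer pings carry a marker; answer before touching any AWS SDK.
    if event.get("warmer"):
        return {"statusCode": 200, "body": "warm"}
    dynamodb = get_client("dynamodb")  # real requests initialize clients lazily
    # ... endpoint routing for the 30+ REST operations would go here ...
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```

The warmer short-circuit keeps the ping itself cheap: a warm invocation never initializes clients it doesn't need.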
Analytics Historical Backfill
Problem
Analytics pipeline launched Week 10, but leadership needed historical data from Week 1. DynamoDB doesn't have native time-travel. Manual reconstruction from old Excel files would be error-prone.
Solution
Built backfill mode in analytics Lambda: reads current state, allows specifying week range, reconstructs point-in-time snapshots from changelog data. Exported Weeks 1-9 retroactively in single run. Complete historical dataset in S3.
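The snapshot reconstruction amounts to replaying the changelog up to each week boundary; a minimal sketch with illustrative field names:

```python
def reconstruct_snapshots(changelog, start_week, end_week):
    """Replay changelog entries to rebuild end-of-week state for each week.

    changelog: list of {"week": int, "key": str, "value": object}
    Returns {week: {key: value}} for start_week..end_week inclusive.
    """
    events = sorted(changelog, key=lambda e: e["week"])
    state, snapshots, i = {}, {}, 0
    for week in range(start_week, end_week + 1):
        # Apply every change up to and including this week.
        while i < len(events) and events[i]["week"] <= week:
            state[events[i]["key"]] = events[i]["value"]
            i += 1
        snapshots[week] = dict(state)  # point-in-time copy for this week
    return snapshots
```

One sorted pass covers an arbitrary week range, which is what lets Weeks 1-9 export in a single run.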
Screenshot Storage
Problem
Bug tracker needed screenshot attachments. S3 uploads from browser = CORS complexity + presigned URLs + extra latency. Wanted to keep bugs atomic in DynamoDB.
Solution
Client-side image compression to base64 (Canvas API, quality: 0.7). Store directly in DynamoDB item as string. Lazy-load images in UI. Trade-off: item size increase, but bugs stay atomic and queryable. Works well for <100KB screenshots.
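On the server side, one guard keeps this trade-off safe against DynamoDB's 400 KB item limit; the field name and threshold below are illustrative:

```python
MAX_SCREENSHOT_BYTES = 100 * 1024  # stay well under DynamoDB's 400 KB item limit

def attach_screenshot(bug_item, data_url):
    """Attach a base64 data URL (compressed client-side); reject oversized payloads."""
    if len(data_url.encode("utf-8")) > MAX_SCREENSHOT_BYTES:
        raise ValueError("screenshot too large; compress further before upload")
    bug_item = dict(bug_item)
    bug_item["screenshot"] = data_url
    return bug_item
```

Rejecting oversized payloads at write time keeps the "bug stays atomic" guarantee honest instead of failing later inside `put_item`.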

Metrics & Impact

Quantified results from replacing manual operations with the platform.

โฑ๏ธ
80% Reduction in Manual Work
Weekly capacity planning went from 3+ hours of spreadsheet manipulation to minutes with the auto-assign engine. Quality data entry, throughput tracking, and availability management all moved from scattered spreadsheets to purpose-built UIs.
2.5M Annotations Delivered
The team delivered 2.5 million annotations in 2025, 25% above target, across 88 concurrent projects at 82% utilization and 92.1% quality. The platform provided the visibility to make this possible.
Proactive Issue Detection
7 automated alert triggers catch OOTO conflicts, quality drops, shrinkage risks, unresolved workers, and late requests before they become problems. Alerts only fire when issues exist, preventing fatigue.
Single Source of Truth
10 DynamoDB tables replaced 21 Excel files, 9 Python scripts, and a fragile cell-position-dependent pipeline. All team data now lives in one queryable system with point-in-time recovery.
SQL-Queryable Analytics
Weekly automated exports to S3 in JSON Lines format, queryable via Athena. Historical backfill from Week 1 through current. Foundation for QuickSight dashboards for leadership reporting.
~2.7s Full Export
The analytics export Lambda processes all 10 tables and writes week-partitioned data to S3 in under 3 seconds. Supports backfill ranges, selective table exports, and specific week targeting.
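The export's core moves are serializing to JSON Lines and writing under a week partition; a sketch (the S3 prefix layout is an assumption, not the actual bucket structure):

```python
import json

def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, default=str) for r in records)

def export_key(table_name, week):
    """Week-partitioned S3 key so Athena can prune by the week= partition."""
    return f"exports/{table_name}/week={week:02d}/{table_name}.jsonl"

def export_table(s3, bucket, table_name, week, records):
    """Write one table's weekly snapshot; `s3` is a boto3 S3 client."""
    s3.put_object(Bucket=bucket,
                  Key=export_key(table_name, week),
                  Body=to_jsonl(records).encode("utf-8"))
```

The `week=NN` key convention is what Hive-style partitioning in Athena keys off, so per-week queries read only the objects they need.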

System Scale

Browser-based UIs: 7 (Team, Requests, Capacity, Throughput, Quality, Bug Tracker, Worker Dashboard)
Lambda functions: 4 (API, ingestion, alerts, analytics export)
DynamoDB tables: 10 (all on-demand, PITR enabled)
EventBridge schedules: 10 (ingestion, alerts, exports, backups)
API endpoints: 30+ REST endpoints (GET/POST/PUT/DELETE)
Alert types: 7 (OOTO conflict, quality, shrinkage, unresolved worker, late request, capacity risk, SLA breach)
Team size supported: 79 annotators, 88 concurrent projects
Athena tables: 6 (quality, throughput, allocations, requests, members, projects)

Reflections

What I learned and what I'd do differently.

What I Learned

Data Modeling Is the Foundation
Getting the DynamoDB key design right (or wrong) cascades through everything. Composite sort keys, GSI choices, and access pattern planning saved me from costly refactors, but I learned this by hitting walls first.
Build for Change
The category taxonomy changed mid-project. Because I stored raw data and transformed at query time (ELT), I didn't need to migrate a single record. The backward-compatibility layer took 30 minutes instead of a risky data migration.
Alerts Need Discipline
Early versions of the OOTO conflict alert had false positives, flagging members who had OOTO on Thursday but only worked Mon-Wed. The fix was checking hours per day per allocation, not just "has any allocation." Precision matters more than coverage in alerting.
Ship, Then Iterate
The first version of every UI was functional but rough. Beta testing with real users surfaced needs I never would have anticipated: multi-select filters, bulk batch reassignment, screenshot attachments. The best features came from feedback, not planning.