Migrations considered helpful! And the tale of the Matryoshka Migration
Migrations are an art, not a science. A migration is a juggling act between breaking glass, keeping the lights on, and opening new doors.
Migrations are the evolution of a system through incremental changes.
Migrations are not limited to infrastructure; customer/product changes can be migrations too.
The difference between a single commit and a migration is that a migration is a long series of commits. Also, migrations generally affect a lot more than one person.
Migrations require updating docs, people’s brains, and muscle memory.
Done well, migrations reduce complexity and accelerate product development.
Done poorly, migrations increase complexity and slow product development.
Why did I write this?
Circa November 2024, some seasoned engineers, a few newer folks, and I had a passionate discussion:
We’ve increasingly moved our product towards being easy – wizards, fast onboarding – at the cost of simplicity – removing non-essential complexity. Does that resonate with anyone else? The easy things become quite hard to manage in the medium-to-long term.
From what I heard, yes. It resonated, resoundingly.
How do we think about the tradeoff between simple and easy in a longer view? If we build something easy today, how do we think about incremental architectural changes to reduce our complexity? Likely easier said than done, but how have we thought about it in the past, vs. now?
What are the biggest sources of pain and complexity right now in the platform, and are there ways to reduce that complexity over time, balancing it with product roadmap, new features, etc.?
These are great questions!
This conversation happened in November 2024, when I had just completed a 3-year migration project of my own. So I was already in a headspace of reflecting on the past 5 years and what wisdom could be gleaned from them.
Outcomes of this doc
At Amperity, the word “migration” carries a lot of historical baggage. Let’s sit with the baggage, process it, and accept it.
Let’s:
- Share examples of good outcomes of migrations.
- Share examples of troubled migrations and how they turned successful.
- Share common themes from the above.
How do big migrations start?
Usually, but not always, a design-and-architecture or staff engineering group gets together and studies a big problem. A design doc, or a series of design docs, gets written. It may be very long: 20+ pages. Usually, an engineer is chosen as the Directly Responsible Individual (DRI) to start and navigate the project.
Examples of successful big migrations
We did it! Yay, us! This is in no particular order, and I’m obviously forgetting some as I write this:
| Before | After | Big unlocks (non-exhaustive) | Est. finish date |
|---|---|---|---|
| Workflow System V0, built on Apache Airflow | Workflow System V1 | Customers can own Workflows – Reliable Workflows – Customer vs. Platform Attribution of Failures | 2024 |
| Finagle/Thrift Service Framework | Prodigal Service Framework | Developer Efficiency | 2022 |
| Self-Running Apache Kafka on VMs in the cloud | Purchase Confluent Kafka-as-a-Service – Kafka Topics Managed in Infrastructure as Code in a CD pipeline | Developer Efficiency – Infra Cost Efficiency | 2022 |
| Cloud Storage Spaghetti | Storage Service and APIs | Bridge / Delta Sharing – Strong security posture – Bring Your Own Storage | 2024 |
| Job Inputs/Outputs in Cloud Storage | Job Inputs/Outputs in Redis | Developer Efficiency | 2022 |
| API Framework V0 | API Framework V1 | Developer Efficiency | 2022 |
| ADLS Gen 1 | ADLS Gen 2 | Infra Cost Efficiency – Platform Reliability – Security Posture – Developer Efficiency | 2021 |
| Loggly for structured logs + Honeycomb for traces | Honeycomb for traces and structured logs | Infra Cost Efficiency – Developer Efficiency – Security Posture | 2023 |
| Plain Text Logging | Structured Logging and Tracing | Speed to debug/resolve incidents – Developer Efficiency | 2021 |
| Deprecated Azure Public IP | Modern Azure Public IP | BFD: Connections to customer systems won’t break when Azure retires their old infra – Developer Efficiency | 2024 |
| Loading Dock | Not Loading Dock | Security – Developer Efficiency | 2024 |
| Tables In Accumulo + HDFS | Files in Cloud Storage, removal of Accumulo/HDFS | Infra Cost Efficiency – Developer Efficiency | 2020 |
| HDFS | Cloud Storage, full removal of HDFS | Infra Cost Efficiency – Developer Efficiency | idk |
| Product Configuration in Spaghetti/Accumulo | Unified Product Configuration | Sandboxes – Developer Efficiency | idk |
| Service Data in Spaghetti/Accumulo | Service Data in PostgreSQL | Sandboxes – Developer Efficiency | idk |
| Something I’m Forgetting | Something I’m Forgetting | idk | |
TODO:
I’ve worked on backend/infra most of my time at Amperity, so I’m probably missing some of the more direct customer-facing migrations. Feel free to comment here!
What’s in a successful big migration?
Usually:
- A Directly Responsible Individual (DRI) who is also a direct contributor to the code (or holds enough sway with engineers who are direct contributors)
- A central Design Doc, kept up to date. A Google Doc works well.
- Maintain buy-in from engineering and product
- You will be asked, “What?” and “Why?”. Do Repeat Yourself, and link to the Design Doc.
- Small focused team: Team-to-team handoffs and trying to coordinate across teams is the path of pain.
- This may include embedding: bringing together SMEs from different teams (e.g. a product domain SME with a cloud infrastructure SME) into a “tiger team” or “v(irtual) team”
- Observe the Churn Rule: take the initiative on updating callers, in backward-compatible fashion. Avoid imposing work on teams who are likely busy and far from the migration details
- Friction in the migration strategy directly feeds back into accelerating the migration (e.g. they may start with a runbook, then feel the repetitive parts, and then automate them)
- Ratcheting: when the new pattern is ready and in use, prevent new usages of the old pattern from being added, so that usages of the old pattern monotonically decrease.
- See also: Strangler Fig Pattern
- See also: Shitlists (CI tests that fail when old pattern is re-introduced)
- Examples at Amperity:
- `mulch` is a homegrown Shitlist for Clojure code. It blocks new usages of deprecated functions in our Clojure and ClojureScript code.
- After we migrated a few services from Finagle/Thrift to Prodigal, we produced a runbook to migrate services and made Prodigal the officially blessed path for new services.
- After we migrated a few services from Aurora to K8s, we produced a repeatable migration runbook. We updated the “add a new service” tools/docs/guidance to target K8s. Soon after, 5 new services needed to be added, and they were stood up directly on K8s.
- They are finished and recognized as finished. How?
- Written and shared definition of done
- Announcement that it’s done
- Deliver an early win and/or other incremental wins
- Caveat: Avoid starting with the easiest parts
- Up front research to find the riskiest, hardest parts and derisk them
- Sequence work so that the hard parts get easier
- Example: we don’t know how to use K8s, and have no K8s tooling, which is a big operational risk we need to solve before using it widely in production → Let’s start by using K8s in a lesser-used part of the product to build expertise and foundational tooling
- Or, start with the riskiest part to prove the migration strategy
- Example: in both the Prodigal and K8s migration, one of the first services we tackled was Tenant Service, which is a high-traffic, core service that could bring the whole app down
- Sometimes migrations stall because they dive head-first into the easy parts, and then the harder parts lead to re-considering earlier design decisions.
- Also, Scenic Routes: take a little detour to get obvious wins for customers, which are clearly beneficial to the migration
- WARNING: when Scenic Routes keep getting pulled into the critical path, and they grow in scope, this can lead to a Matryoshka Migration
- Learn from Incidents
- We run AARs and incorporate the learnings into the Design Doc
- Delete the dead code!
- If the only dead stuff is code, it’s often easiest to delete it all at once, when we’re confident it’s unused. Big red line PRs are tasty!
- Shut down the dead infra!
- In infra migrations targeting cost efficiency (e.g. shutdown of Accumulo/HDFS), slowly scaling down the old system helps deliver incremental wins.
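To make the Ratcheting and Shitlist bullets above concrete, here is a minimal sketch of a CI-style ratchet check in Python. It is not how `mulch` works; the deprecated pattern (`legacy_client.`), the file paths, and every function name below are hypothetical and only illustrate the idea: each file gets a ceiling on old-pattern usages, new usages fail the check, and the ceiling is only ever lowered.

```python
from __future__ import annotations

import re
from pathlib import Path

# Hypothetical old pattern; in practice this would be whatever is being migrated away from.
DEPRECATED = re.compile(r"\blegacy_client\.")


def count_usages(files: dict[str, str]) -> dict[str, int]:
    """Count old-pattern matches per file, omitting files with zero matches."""
    counts = {}
    for path, text in files.items():
        n = len(DEPRECATED.findall(text))
        if n:
            counts[path] = n
    return counts


def ratchet_violations(counts: dict[str, int], allowed: dict[str, int]) -> list[str]:
    """Return one message per file that exceeds its allowed number of usages."""
    return [
        f"{path}: {n} usages of the old pattern, allowed {allowed.get(path, 0)}"
        for path, n in sorted(counts.items())
        if n > allowed.get(path, 0)
    ]


def tightened_allowlist(counts: dict[str, int], allowed: dict[str, int]) -> dict[str, int]:
    """Lower each ceiling to the current count so removed usages cannot come back."""
    tightened = {}
    for path, limit in allowed.items():
        new_limit = min(limit, counts.get(path, 0))
        if new_limit:
            tightened[path] = new_limit
    return tightened


def check_repo(root: str, allowed: dict[str, int]) -> list[str]:
    """Scan *.py files under root and return ratchet violations (empty means pass)."""
    files = {
        p.relative_to(root).as_posix(): p.read_text(encoding="utf-8")
        for p in Path(root).rglob("*.py")
    }
    return ratchet_violations(count_usages(files), allowed)
```

A CI job could call `check_repo` and fail the build when the returned list is non-empty. After a cleanup PR merges, committing the result of `tightened_allowlist` keeps the count of old usages moving in one direction only. Files that are not in the allowlist get a ceiling of zero, which is what blocks new code from adopting the old pattern.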
How do big migrations struggle?
Pattern: Matryoshka Migration
Commonly, migrations turn into Matryoshka Migrations. Matryoshka is the Russian nesting doll. 🪆 Migrations nested in migrations.
A Matryoshka Migration is one that has diverged a bit too far from its original goals.
This migration has gone on so long that we started other migrations that affect the same parts of the system.
The critical path of the migration has gotten longer and more complicated, rather than shorter and simpler.
You can smell it:
- A migration started
- It has been almost done for a really long time: months, or years.
- Though it’s done with good intentions (e.g. Scenic Routes motivated by product/customers), the migration keeps going… and going… and going… or pausing inexplicably.
Pattern: Drained DRI
For example, a migration was started 1 year ago. Then, re-orgs and staffing changes (e.g. hiring, attrition, reduction in force) happened.
All the teams, code ownership, and managers have shuffled around.
Company strategy can change, too.
However, the friction that motivated the migration, and the friction of unfinished migrations, usually rear their ugly heads again.
It’s hard work for the DRI to re-establish buy-in and kickstart the migration. Sometimes they have to do this multiple times.
What can help:
- Design Doc
- Switch DRI
- Embedding, re-establishing a focus team
- Let the DRI take a break
Turn this Matryoshka around!
Some of the successful migrations became Matryoshka Migrations and then turned around. How?
- Recognize that it’s a Matryoshka Migration.
- Compare the original goals/timeline with the current timeline. A smell is that the original timeline hasn’t been re-evaluated.
- Re-evaluate the cost of complexity inherent in having multiple, large in-progress migrations.
- Re-focus on the original goals. Retell the customer value of those goals, and re-establish buy-in for them.
- Re-evaluate what’s really in the critical path, and focus on it.
- Re-establish a small focused team with a “mandate”, and free up their time to focus on it.
Own up to blocked and stalled big migrations
Let’s be leaders: communicate, and have empathy.
Yes, it is okay, and sometimes better for the company, to wait to start, pause, or stop a migration.
There’s a negative outcome we’ve seen sometimes at Amperity. Example:
- Engineers or customers feel pain from a problem that exists today
- Engineers have designed a solution, socialized it with engineers, and feel that they have clear buy-in.
- Engineers feel blocked from starting or continuing work that they believe would solve the problem.
- This blockage lacks a clear explanation: nameless leaders withhold their buy-in
- Outcome: a tenured, high-value engineer is demoralized.
If we decide to block, pause, or cancel a migration, let’s communicate clearly and honestly why.
Reflect on finished big migrations
When a migration finishes, we can run retrospectives to gather and share what we learned. Consider this doc a “Meta-Retrospective” of 5 years of migrations at Amperity.
One example retrospective is the Spark 2-to-3 Retrospective.
So… what should we do?
It’s a fact. We have a complex product and complex infrastructure.
What are the biggest sources of complexity right now?
Are there ways to reduce that complexity over time, balancing it with product roadmap, new features, etc.?
Yes!
We’ll do all the migrations, all at the same time, all as fast as possible. No - sorry, that’s a joke. Today, we can’t do all of them at full speed.
- As engineering leaders (e.g. staff-eng community), let’s catalog and prioritize all our major migrations that are executing and proposed.
- If some migrations are entering Matryoshka status, and still valuable, let’s focus on finishing them.
- We should decide which ones to pause, and which ones to hold off on starting, with empathy.
- Note that pausing does not rule out progress: work may continue gradually in the background, likely not on any team’s Jira board. Perhaps interspersed among new feature commits, written on quiet Wednesday and Friday afternoons. (See below: Tidy as we go)
We’ve done a lot of the hard thinking on ways to simplify that complexity, but we haven’t started or implemented them fully.
Examples:
- Datasets: groups of named tables. Tables are tables!
- Amperity’s internal library, `comp-graph`, for forming Computation Graphs of complex layered transformations. (Credit to Kevin Litwack)
- Amperity Sandboxes: a massive enabler for testing, sharing changes. I think this is done from a product standpoint – what is left from the engineering perspective?
- Coordinated Changes
- Explicit vs. implicit configuration / semantic action-at-a-distance
- Workbench for exploring Datasets (state)
- Drafts
Tidy as we go
We can pragmatically “Tidy as we go”. Tidy First? is a nice 2-hour read on this topic.
Quick summary:
- “Tidying” is at the level of individual commits, a conversation among engineers and the code; it’s not called out as a “migration project”.
- Tidying means effectively carrying out a “low-urgency migration” bit-by-bit, interspersed with new-feature commits. In other words, gradual refactoring towards a higher quality, lower-friction codebase.
- Tidying requires trust and good judgment.
Example at Amperity:
- UI Code Modernization: Reframe Spaghetti → React/Helix
Sources and inspiration
This doc is a synthesis of my own experience and several other people’s great ideas.
(iterate inc @amperity-engineering)
- I’ve chatted about this topic with these people at Amperity (I took 5 seconds on this list, it is not exhaustive, and in no particular order.)
- John Rush
- Jeff Stokes
- Greg Look
- Stephen Meyles
- Hemanth Srinivas
- Aria Haghighi
- Bryce Covert
- Brandon Vincent
- Ace Levenberg
- Drew Inglis
- Graeme Roche
- Kevin Litwack
- Cary Lee
- Joe Christianson
- … many others!
- Amperity `#staff-engineering` discussions
- Pushing Through Friction – Dan Na
- What is friction? It is resistance; when things feel harder than they ought to be.
- Friction in engineering demotivates otherwise smart and highly-motivated engineers.
- Friction in a product turns customers away.
- “WTF Factor” and “Normalization of Deviance”: The harms when tech debt, stalled migrations, or struggling migrations “swept under the rug” become normalized in an organization.
- “Pushing Through Friction Is The Job [of a tech lead or staff+ engineer].”
- Tanya Reilly – The Staff Engineer’s Path
- Chapter: Leading big projects
- Section: Why have we stopped?
- Tidy First? – Kent Beck
- Will Larson
- You will always have more problems than engineers – Matt Schellhas