diff options
Diffstat (limited to 'doc/development/multi_version_compatibility.md')
-rw-r--r-- | doc/development/multi_version_compatibility.md | 164 |
1 files changed, 164 insertions, 0 deletions
diff --git a/doc/development/multi_version_compatibility.md b/doc/development/multi_version_compatibility.md index aedd5c1ffb7..5ca43b9b818 100644 --- a/doc/development/multi_version_compatibility.md +++ b/doc/development/multi_version_compatibility.md @@ -20,6 +20,121 @@ but AJAX requests to URLs (like the GraphQL endpoint) won't match the pattern. With this canary setup, we'd be in this mixed-versions state for an extended period of time until canary is promoted to production and post-deployment migrations run. +Also be aware that during a deployment to production, Web, API, and +Sidekiq nodes are updated in parallel, but they may finish at +different times. That means there may be a window of time when the +application code is not in sync across the whole fleet. Changes that +cut across Sidekiq, Web, and/or the API may [introduce unexpected +errors until the deployment is complete](#builds-failing-due-to-varying-deployment-times-across-node-types). + +One way to handle this is to use a feature flag that is disabled by +default. The feature flag can be enabled when the deployment is in a +consistent state. However, this method of synchronization doesn't +guarantee that customers with on-premise instances can [upgrade with +zero downtime](https://docs.gitlab.com/omnibus/update/#zero-downtime-updates) +since point releases bundle many changes together. Minimizing the time +between when versions are out of sync across the fleet may help mitigate +errors caused by upgrades. + +## Requirements for zero downtime upgrades + +One way to guarantee zero downtime upgrades for on-premise instances is following the +[expand and contract pattern](https://martinfowler.com/bliki/ParallelChange.html). + +This means that every breaking change is broken down in three phases: expand, migrate, and contract. + +1. **expand**: a breaking change is introduced keeping the software backward-compatible. +1. **migrate**: all consumers are updated to make use of the new implementation. +1. **contract**: backward compatibility is removed. + +Those three phases **must be part of different milestones**, to allow zero downtime upgrades. + +Depending on the support level for the feature, the contract phase could be delayed until the next major release. + +## Expand and contract examples + +Route changes, changing Sidekiq worker parameters, and database migrations are all perfect examples of a breaking change. +Let's see how we can handle them safely. + +### Route changes + +When changing routing we should pay attention to make sure a route generated from the new version can be served by the old one and vice versa. +As you can see in [an example later on this page](#some-links-to-issues-and-mrs-were-broken), not doing it can lead to an outage. +This type of change may look like an immediate switch between the two implementations. However, +especially with the canary stage, there is an extended period of time where both version of the code +coexists in production. + +1. **expand**: a new route is added, pointing to the same controller as the old one. But nothing in the application will generate links for the new routes. +1. **migrate**: now that every machine in the fleet can understand the new route, we can generate links with the new routing. +1. **contract**: the old route can be safely removed. (If the old route was likely to be widely shared, like the link to a repository file, we might want to add redirects and keep the old route for a longer period.) + +### Changing Sidekiq worker's parameters + +This topic is explained in detail in [Sidekiq Compatibility across Updates](sidekiq_style_guide.md#sidekiq-compatibility-across-updates). + +When we need to add a new parameter to a Sidekiq worker class, we can split this into the following steps: + +1. **expand**: the worker class adds a new parameter with a default value. +1. **migrate**: we add the new parameter to all the invocations of the worker. +1. **contract**: we remove the default value. + +At a first look, it may seem safe to bundle expand and migrate into a single milestone, but this will cause an outage if Puma restarts before Sidekiq. +Puma enqueues jobs with an extra parameter that the old Sidekiq cannot handle. + +### Database migrations + +The following graph is a simplified visual representation of a deployment, this will guide us in understanding how expand and contract is implemented in our migrations strategy. + +There's a special consideration here. Using our post-deployment migrations framework allows us to bundle all three phases into one milestone. + +```mermaid +gantt + title Deployment + dateFormat HH:mm + + section Deploy box + Run migrations :done, migr, after schemaA, 2m + Run post-deployment migrations :postmigr, after mcvn , 2m + + section Database + Schema A :done, schemaA, 00:00 , 1h + Schema B :crit, schemaB, after migr, 58m + Schema C. : schmeaC, after postmigr, 1h + + section Machine A + Version N :done, mavn, 00:00 , 75m + Version N+1 : after mavn, 105m + + section Machine B + Version N :done, mbvn, 00:00 , 105m + Version N+1 : mbdone, after mbvn, 75m + + section Machine C + Version N :done, mcvn, 00:00 , 2h + Version N+1 : mbcdone, after mcvn, 1h +``` + +If we look at this schema from a database point of view, we can see two deployments feed into a single GitLab deployment: + +1. from `Schema A` to `Schema B` +1. from `Schema B` to `Schema C` + +And these deployments align perfectly with application changes. + +1. At the beginning we have `Version N` on `Schema A`. +1. Then we have a _long_ transition periond with both `Version N` and `Version N+1` on `Schema B`. +1. When we only have `Version N+1` on `Schema B` the schema changes again. +1. Finally we have `Version N+1` on `Schema C`. + +With all those details in mind, let's imagine we need to replace a query, and this query has an index to support it. + +1. **expand**: this is the from `Schema A` to `Schema B` deployment. We add the new index, but the application will ignore it for now +1. **migrate**: this is the `Version N` to `Version N+1` application deployment. The new code is deployed, at this point in time only the new query will run. +1. **contract**: from `Schema B` to `Schema C` (post-deployment migration). Nothing uses the old index anymore, we can safely remove it. + +This is only an example. More complex migrations, especially when background migrations are needed will +still require more than one milestone. For details please refer to our [migration style guide](migration_style_guide.md). + ## Examples of previous incidents ### Some links to issues and MRs were broken @@ -60,3 +175,52 @@ We added a `NOT NULL` constraint to a column and marked it as a `NOT VALID` cons But even with that, this was still a problem because the old servers were still inserting new rows with null values. For more information, see [the relevant issue](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1944). + +### Downtime on release features between canary and production deployment + +To address the issue, we added a new column to an existing table with a `NOT NULL` constraint without +specifying a default value. In other words, this requires the application to set a value to the column. + +The older version of the application didn't set the `NOT NULL` constraint since the entity/concept didn't +exist before. + +The problem starts right after the canary deployment is complete. At that moment, +the database migration (to add the column) has successfully run and canary instance starts using +the new application code, hence QA was successful. Unfortunately, the production +instance still uses the older code, so it started failing to insert a new release entry. + +For more information, see [this issue related to the Releases API](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/64151). + +### Builds failing due to varying deployment times across node types + +In [one production issue](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2442), +CI builds that used the `parallel` keyword and depending on the +variable `CI_NODE_TOTAL` being an integer failed. This was caused because after a user pushed a commit: + +1. New code: Sidekiq created a new pipeline and new build. `build.options[:parallel]` is a `Hash`. +1. Old code: Runners requested a job from an API node that is running the previous version. +1. As a result, the [new code](https://gitlab.com/gitlab-org/gitlab/blob/42b82a9a3ac5a96f9152aad6cbc583c42b9fb082/app/models/concerns/ci/contextable.rb#L104) +was not run on the API server. The runner's request failed because the +older API server tried return the `CI_NODE_TOTAL` CI variable, but +instead of sending an integer value (e.g. 9), it sent a serialized +`Hash` value (`{:number=>9, :total=>9}`). + +If you look at the [deployment pipeline](https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/202212), +you see all nodes were updated in parallel: + +![GitLab.com deployment pipeline](img/deployment_pipeline_v13_3.png) + +However, even though the updated started around the same time, the completion time varied significantly: + +|Node type|Duration (min)| +|---------|--------------| +|API |54 | +|Sidekiq |21 | +|K8S |8 | + +Builds that used the `parallel` keyword and depended on `CI_NODE_TOTAL` +and `CI_NODE_INDEX` would fail during the time after Sidekiq was +updated. Since Kubernetes (K8S) also runs Sidekiq pods, the window could +have been as long as 46 minutes or as short as 33 minutes. Either way, +having a feature flag to turn on after the deployment finished would +prevent this from happening. |