7.4 SQL Server Replication¶
This page covers bidirectional merge replication between two MSSQL 2022 Standard Edition instances for the RCIIS database. Each Kubernetes cluster runs a SQL Server StatefulSet deployed via the rciis Helm chart. Cross-cluster connectivity is provided by Cilium Cluster Mesh.
Bidirectional means each SQL Server instance is both a Publisher and a Subscriber. The instance on Cluster A publishes its changes and the instance on Cluster B subscribes to them. The instance on Cluster B publishes its changes and the instance on Cluster A subscribes to them. This results in two publications, two subscriptions, and two Merge Agents — one running on each instance.
Key Decisions¶
- Topology: Active-Active bidirectional (Cluster A ↔ Cluster B)
- Replication Tool: SQL Server 2022 Merge Replication
- Edition: Standard
- Database: RCIIS
- Conflict Resolution: DATETIME (Later Wins) using the `UpdatedDateUTC` column
- Authentication: SQL Server Authentication (`sa` account)
- Networking: Cilium Cluster Mesh global services (port 1433)
Architecture¶
            Cluster A                                    Cluster B
┌────────────────────────────────┐           ┌────────────────────────────────┐
│ Namespace: rciis-prod          │           │ Namespace: rciis-prod          │
│                                │           │                                │
│ StatefulSet: mssql             │           │ StatefulSet: mssql             │
│ Database: RCIIS                │           │ Database: RCIIS                │
│ MSSQL_PID: Standard            │           │ MSSQL_PID: Standard            │
│                                │           │                                │
│  ┌──────────────────────────┐  │           │  ┌──────────────────────────┐  │
│  │ Publisher                │  │   Merge   │  │ Subscriber               │  │
│  │ RCIIS_Merge_Pub          │──┼── Agent ──┼─►│ (to Cluster A's pub)     │  │
│  └──────────────────────────┘  │    A→B    │  └──────────────────────────┘  │
│                                │           │                                │
│  ┌──────────────────────────┐  │           │  ┌──────────────────────────┐  │
│  │ Subscriber               │  │   Merge   │  │ Publisher                │  │
│  │ (to Cluster B's pub)     │◄─┼── Agent ──┼──│ RCIIS_Merge_Pub          │  │
│  └──────────────────────────┘  │    B→A    │  └──────────────────────────┘  │
│                                │           │                                │
│  ┌──────────────────────────┐  │           │  ┌──────────────────────────┐  │
│  │ Distributor (local)      │  │           │  │ Distributor (local)      │  │
│  └──────────────────────────┘  │           │  └──────────────────────────┘  │
│                                │           │                                │
│ Global Svc: mssql-a            │           │ Global Svc: mssql-b            │
│ (Cilium Cluster Mesh)          │           │ (Cilium Cluster Mesh)          │
│ Port: 1433                     │           │ Port: 1433                     │
└────────────────────────────────┘           └────────────────────────────────┘
                │                                            │
                └─────────── Cilium Cluster Mesh ───────────┘
Both SQL Server instances have a local Distributor, a Publication, and a Subscription to the other instance's Publication. Each Merge Agent runs on its respective Publisher and pushes changes to the remote Subscriber, then uploads the Subscriber's changes back.
Prerequisites¶
- SQL Server 2022 Standard Edition StatefulSet deployed on both clusters via the rciis Helm chart — see 5.3.2 Data Services
- SQL Server Agent enabled (`MSSQL_AGENT_ENABLED: "1"` in Helm values)
- HADR enabled (`MSSQL_ENABLE_HADR: "1"` in Helm values)
- Cilium Cluster Mesh connected between clusters — see 5.1.1 Networking
- `RCIIS` database created on Cluster A with all schema migrations applied
- Tools: `kubectl`, `sqlcmd` 18 (shipped in the SQL Server container image at `/opt/mssql-tools18/bin`, or a local install)
Helm Values¶
Verify MSSQL_PID is set to Standard in both clusters:
# charts/rciis/values.yaml (or environment override)
statefulsets:
  mssql:
    env:
      MSSQL_PID:
        value: "Standard"
      MSSQL_ENABLE_HADR:
        value: "1"
      MSSQL_AGENT_ENABLED:
        value: "1"
| Variable | Purpose |
|---|---|
| `MSSQL_PID` | The SQL Server product edition. Standard enables merge replication and is licensed for production. |
| `MSSQL_ENABLE_HADR` | Enables the High Availability and Disaster Recovery feature set, which is required for SQL Server Agent to manage replication agent jobs. |
| `MSSQL_AGENT_ENABLED` | Starts the SQL Server Agent service when the container starts. The Agent is the job scheduler that runs the Merge Agent and Snapshot Agent jobs on a schedule. Without the Agent, no replication jobs will execute. |
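To confirm a running instance actually picked these values up, you can query the server properties directly (a quick check; assumes `$MSSQL_SA_PASSWORD` is exported in your shell — repeat with `--context cluster-b` for the other instance):

```shell
kubectl exec -it statefulset/mssql -n rciis-prod --context cluster-a -- \
  /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$MSSQL_SA_PASSWORD" -C \
  -Q "SELECT SERVERPROPERTY('Edition') AS Edition, SERVERPROPERTY('IsHadrEnabled') AS IsHadrEnabled"
```

Edition should report Standard and IsHadrEnabled should be 1.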
Phase 1: Expose MSSQL via Cilium Global Services¶
Both clusters use the same namespace (rciis-prod), so the default mssql service resolves identically on each cluster. If both clusters annotate the same service name as global, Cilium Cluster Mesh will load-balance connections across both SQL Server instances rather than routing to a specific one. For merge replication, each instance must be able to address the other instance specifically — so each cluster creates a distinct global service that the remote cluster can use to reach it.
Cluster A — Create Global Service¶
The following YAML defines a Kubernetes Service that Cilium Cluster Mesh advertises to all connected clusters:
| Field | Purpose |
|---|---|
| `name: mssql-a` | A unique service name that Cluster B uses to reach this specific instance. It must differ from the default `mssql` service to avoid Cilium load-balancing across both clusters. |
| `service.cilium.io/global: "true"` | Advertises this service to all clusters in the Cilium Cluster Mesh, making it discoverable from Cluster B. |
| `service.cilium.io/shared: "true"` | Allows remote clusters to route traffic to this service's endpoints. Without this, the service is visible but unreachable. |
| `type: ClusterIP` | Internal-only service — no external load balancer is needed. Cross-cluster routing is handled by Cilium Cluster Mesh, not by external IPs. |
| `selector: app: mssql` | Routes traffic to pods with the label `app: mssql`, which matches the mssql StatefulSet pods. |
| `port: 1433` | SQL Server's default listener port. |
# mssql-global-a.yaml — deploy to Cluster A only
apiVersion: v1
kind: Service
metadata:
  name: mssql-a
  namespace: rciis-prod
  annotations:
    service.cilium.io/global: "true"
    service.cilium.io/shared: "true"
spec:
  type: ClusterIP
  selector:
    app: mssql
  ports:
    - port: 1433
      targetPort: 1433
      protocol: TCP
Apply the service to Cluster A using kubectl apply:
| Flag | Purpose |
|---|---|
| `apply -f` | Create or update the Kubernetes resource defined in the specified YAML file. |
| `--context cluster-a` | Tells kubectl which Kubernetes cluster to target. Each cluster is registered as a context in your `~/.kube/config`. Without `--context`, kubectl uses whatever context is currently active, which could be the wrong cluster. |
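Assuming the manifest above was saved as `mssql-global-a.yaml`, the apply command is:

```shell
kubectl apply -f mssql-global-a.yaml --context cluster-a
```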
Cluster B — Create Global Service¶
# mssql-global-b.yaml — deploy to Cluster B only
apiVersion: v1
kind: Service
metadata:
  name: mssql-b
  namespace: rciis-prod
  annotations:
    service.cilium.io/global: "true"
    service.cilium.io/shared: "true"
spec:
  type: ClusterIP
  selector:
    app: mssql
  ports:
    - port: 1433
      targetPort: 1433
      protocol: TCP
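Apply the service to Cluster B the same way, assuming the manifest was saved as `mssql-global-b.yaml`:

```shell
kubectl apply -f mssql-global-b.yaml --context cluster-b
```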
Service Resolution¶
Once the global services are created, each cluster can reach the other cluster's SQL Server through Cilium Cluster Mesh's DNS:
| From | To reach Cluster A's SQL Server | To reach Cluster B's SQL Server |
|---|---|---|
| Cluster A | `mssql.rciis-prod.svc` (local) | `mssql-b.rciis-prod.svc` (via Cluster Mesh) |
| Cluster B | `mssql-a.rciis-prod.svc` (via Cluster Mesh) | `mssql.rciis-prod.svc` (local) |
Verify Cross-Cluster Connectivity¶
Test that each cluster can connect to the other's SQL Server instance by running a temporary pod with sqlcmd pre-installed.
kubectl run flags:
| Flag | Purpose |
|---|---|
| `sqltest` | Name of the temporary pod. |
| `-it` | `-i` keeps stdin open (interactive), `-t` allocates a terminal (TTY). Together they allow you to see the command output. |
| `--rm` | Automatically delete the pod after it exits. Without this, the pod stays in Completed state and must be manually deleted. |
| `--image` | The container image to run. `mcr.microsoft.com/mssql/server:2022-latest` ships `sqlcmd` 18 at `/opt/mssql-tools18/bin`. (The older `mcr.microsoft.com/mssql-tools` image carries `sqlcmd` 17 at `/opt/mssql-tools/bin`, which does not support the `-C` flag used below.) |
| `--context` | Which Kubernetes cluster to create the pod on. Determines which cluster the network request originates from. |
| `--restart=Never` | Do not restart the pod if it exits. This makes it a one-shot command rather than a long-running service. |
| `-n` | The namespace to create the pod in. Must match the namespace where the Cilium global services are deployed. |
| `--` | Separator between kubectl flags and the command to run inside the container. Everything after `--` is passed to the container as its entrypoint arguments. |

sqlcmd flags:

| Flag | Purpose |
|---|---|
| `-S` | Server address in hostname,port format (comma-separated, not colon). Here it uses the Cilium global service name to reach the remote cluster's SQL Server. |
| `-U` | Username to authenticate with. `sa` is the SQL Server system administrator account. |
| `-P` | Password for the specified user. `$MSSQL_SA_PASSWORD` is an environment variable that should be set in your local shell before running the command. |
| `-C` | Trust the server's TLS certificate without validation. Required because SQL Server on Linux uses a self-signed certificate by default. Without `-C`, sqlcmd refuses to connect. |
| `-Q` | Execute the specified query and exit immediately. Without `-Q`, sqlcmd enters interactive mode. |

# From Cluster A, connect to Cluster B's SQL Server instance
kubectl run sqltest -it --rm --image=mcr.microsoft.com/mssql/server:2022-latest \
  --context cluster-a --restart=Never -n rciis-prod -- \
  /opt/mssql-tools18/bin/sqlcmd \
  -S mssql-b.rciis-prod.svc,1433 \
  -U sa -P "$MSSQL_SA_PASSWORD" -C \
  -Q "SELECT @@SERVERNAME, @@VERSION"

# From Cluster B, connect to Cluster A's SQL Server instance
kubectl run sqltest -it --rm --image=mcr.microsoft.com/mssql/server:2022-latest \
  --context cluster-b --restart=Never -n rciis-prod -- \
  /opt/mssql-tools18/bin/sqlcmd \
  -S mssql-a.rciis-prod.svc,1433 \
  -U sa -P "$MSSQL_SA_PASSWORD" -C \
  -Q "SELECT @@SERVERNAME, @@VERSION"
Both commands should return the remote server's name and SQL Server version. If either fails with a connection timeout, check that Cilium Cluster Mesh is healthy (cilium clustermesh status) and that the global service annotations are present (kubectl get svc -n rciis-prod -o yaml).
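The two checks mentioned above can be run as follows (a sketch; assumes the cilium CLI is installed locally):

```shell
# Cluster Mesh health, from each side
cilium clustermesh status --context cluster-a
cilium clustermesh status --context cluster-b

# Confirm the global/shared annotations are present on each service
kubectl get svc mssql-a -n rciis-prod --context cluster-a -o yaml | grep cilium.io
kubectl get svc mssql-b -n rciis-prod --context cluster-b -o yaml | grep cilium.io
```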
Phase 2: Configure Replication Prerequisites¶
These steps prepare both SQL Server instances for merge replication by adjusting server-level settings and creating the directory structure that the Snapshot Agent needs.
Connect to each SQL Server instance using sqlcmd. The kubectl exec command opens an interactive session inside the running SQL Server pod and launches the sqlcmd tool.
kubectl exec flags:
| Flag | Purpose |
|---|---|
| `-it` | `-i` keeps stdin open, `-t` allocates a terminal. Together they give you an interactive sqlcmd session where you can type SQL commands. |
| `statefulset/mssql` | Target the pod managed by the mssql StatefulSet. Kubernetes resolves this to the running pod (e.g., mssql-0). |
| `-n rciis-prod` | The namespace where the StatefulSet runs. |
| `--context` | Which Kubernetes cluster to target — cluster-a or cluster-b. |
| `--` | Separator between kubectl flags and the command to run inside the pod. |
sqlcmd flags (same as Phase 1 verification, but connecting to localhost instead of the remote cluster):
| Flag | Purpose |
|---|---|
| `-S localhost` | Connect to the SQL Server running on the same pod. Since we are inside the pod via kubectl exec, localhost reaches the local SQL Server instance. |
| `-U sa` | Authenticate as the `sa` (system administrator) account. |
| `-P "$MSSQL_SA_PASSWORD"` | The sa password, sourced from your local shell environment variable. |
| `-C` | Trust the server's self-signed TLS certificate. |
# Connect to Cluster A's SQL Server instance
kubectl exec -it statefulset/mssql -n rciis-prod --context cluster-a -- \
/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$MSSQL_SA_PASSWORD" -C
# Connect to Cluster B's SQL Server instance
kubectl exec -it statefulset/mssql -n rciis-prod --context cluster-b -- \
/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$MSSQL_SA_PASSWORD" -C
2.1 Remove the Text Replication Size Limit¶
By default, SQL Server limits the size of text, ntext, and image column values that can be replicated to 65,536 bytes. Any value larger than this limit is truncated during replication, which causes data loss. Setting the limit to -1 removes it entirely, allowing columns of any size to replicate in full.
Run on both SQL Server instances:
-- Enable advanced configuration options (required to change max text repl size)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- Remove the replication size limit for text/image columns
-- -1 means unlimited — all data replicates regardless of size
EXEC sp_configure 'max text repl size', -1;
RECONFIGURE;
GO
| Parameter | Value | Purpose |
|---|---|---|
| `show advanced options` | `1` | Unhides advanced server configuration options. Without this, `max text repl size` is not visible to `sp_configure`. |
| `max text repl size` | `-1` | The maximum size (in bytes) of text/image data that can be replicated. `-1` removes the limit. The default of 65536 would truncate any RCIIS column value larger than 64 KB. |
| `RECONFIGURE` | — | Applies the configuration change immediately without requiring a server restart. |
2.2 Create the Snapshot Working Directory¶
Merge replication uses a Snapshot Agent to generate a baseline copy of the publication's schema and data. The Snapshot Agent writes this baseline as a set of files — schema scripts (.sch), bulk-copy data files (.bcp), and index definitions (.idx) — to a directory on the Publisher's filesystem called the working directory. The Merge Agent then reads these files from the working directory and applies them to the Subscriber during its first synchronisation.
SQL Server does not create this directory automatically. If the directory does not exist when the Distributor is configured (Phase 4) or when the Snapshot Agent runs (Phase 7), the operation fails with a file-not-found error.
Run on both SQL Server instances from the pod shell:
# Cluster A
kubectl exec -it statefulset/mssql -n rciis-prod --context cluster-a -- \
mkdir -p /var/opt/mssql/data/repldata
# Cluster B
kubectl exec -it statefulset/mssql -n rciis-prod --context cluster-b -- \
mkdir -p /var/opt/mssql/data/repldata
The mkdir -p command creates the directory and any missing parent directories. The -p flag also prevents an error if the directory already exists. The path /var/opt/mssql/data/repldata is inside the persistent volume mounted by the StatefulSet, so it survives pod restarts.
Phase 3: Mark GUID Columns as ROWGUIDCOL¶
Merge replication tracks individual rows using a globally unique identifier. It requires every replicated table to have a uniqueidentifier column marked with the ROWGUIDCOL property. If a table does not have one, merge replication automatically adds a column called rowguid to the table — which changes the table schema and can break application code that does SELECT * or relies on a fixed column list.
The RCIIS database already has a Guid Id primary key on every business table (via BaseEntity). By marking the existing Id column as the ROWGUIDCOL, merge replication uses it for tracking and does not add any extra columns.
The one exception is the Identifier table, which uses a string Prefix primary key instead of a GUID. Merge replication will add a rowguid column to that table automatically. This is acceptable because the Identifier table is only accessed by the identifier generator service — never by SELECT *.
Run on Cluster A's SQL Server instance (the instance that currently has all the data). Cluster B does not have the RCIIS database yet — it will receive these column property changes when the snapshot is applied in Phase 9.
USE RCIIS;
GO
-- Dynamically generate ALTER statements for every table that has a uniqueidentifier
-- column named 'Id' but does not already have the ROWGUIDCOL property set.
DECLARE @sql NVARCHAR(MAX) = N'';
SELECT @sql = @sql +
'ALTER TABLE [' + SCHEMA_NAME(t.schema_id) + '].[' + t.name + ']
ALTER COLUMN [Id] ADD ROWGUIDCOL;
'
FROM sys.tables t
INNER JOIN sys.columns c ON t.object_id = c.object_id
WHERE c.name = 'Id'
AND c.system_type_id = 36 -- 36 = the system type ID for uniqueidentifier
AND t.is_ms_shipped = 0 -- Exclude SQL Server system tables
AND t.name <> '__EFMigrationsHistory' -- Exclude EF Core migration history (not replicated)
AND NOT EXISTS ( -- Skip tables that already have a ROWGUIDCOL
SELECT 1 FROM sys.columns c2
WHERE c2.object_id = t.object_id AND c2.is_rowguidcol = 1
)
ORDER BY t.name;
-- Execute the generated ALTER statements
EXEC sp_executesql @sql;
GO
Verify that only the Identifier table and __EFMigrationsHistory remain without a ROWGUIDCOL:
SELECT
SCHEMA_NAME(t.schema_id) AS SchemaName,
t.name AS TableName
FROM sys.tables t
WHERE t.is_ms_shipped = 0
AND NOT EXISTS (
SELECT 1 FROM sys.columns c
WHERE c.object_id = t.object_id AND c.is_rowguidcol = 1
)
ORDER BY t.name;
GO
If this query returns any table other than Identifier and __EFMigrationsHistory, investigate why — that table is missing a uniqueidentifier column and merge replication will add one automatically.
Phase 4: Configure the Distributor¶
The Distributor is a SQL Server component that manages replication metadata. It stores agent history, snapshot locations, synchronisation state, and pending transactions in a dedicated database called distribution. Every SQL Server instance that acts as a Publisher must have access to a Distributor — either local (on the same instance) or remote (on a separate instance).
In this bidirectional setup, both SQL Server instances act as Publishers, so each one needs its own local Distributor. A shared remote Distributor would be a single point of failure — if it goes down, both directions of replication stop.
Cluster A's SQL Server Instance¶
USE master;
GO
-- Register this SQL Server instance as its own Distributor.
-- The @password secures the connection between the Distributor and Publishers.
EXEC sp_adddistributor
@distributor = @@SERVERNAME, -- @@SERVERNAME = this instance's name (self as Distributor)
@password = N'ReplDistributor!Pass123'; -- Internal password for Distributor-to-Publisher communication
GO
-- Create the distribution database.
-- This is a regular SQL Server database that stores all replication metadata.
EXEC sp_adddistributiondb
@database = N'distribution', -- Name of the distribution database (convention is 'distribution')
@data_folder = N'/var/opt/mssql/data', -- Where to store the .mdf data file
@log_folder = N'/var/opt/mssql/data', -- Where to store the .ldf log file
@max_distretention = 72, -- Maximum hours to retain replication transactions (3 days).
-- Transactions older than this are purged even if not delivered.
@history_retention = 48, -- Hours to retain agent history records (2 days).
-- Older history rows are cleaned up by the Distribution Cleanup job.
@security_mode = 0; -- 0 = SQL Server Authentication. 1 = Windows Authentication
-- (not applicable on Linux containers).
GO
-- Register this instance as a Publisher that uses the local distribution database.
-- This tells SQL Server "this instance is allowed to create publications, and its
-- replication metadata lives in the 'distribution' database."
USE distribution;
GO
EXEC sp_adddistpublisher
@publisher = @@SERVERNAME, -- The Publisher name (this instance)
@distribution_db = N'distribution', -- Which distribution database this Publisher uses
@working_directory = N'/var/opt/mssql/data/repldata', -- Where snapshot files are read/written
-- This must match the directory created in Phase 2.2
@security_mode = 0; -- SQL Server Authentication
GO
Cluster B's SQL Server Instance¶
Run the identical commands on Cluster B's SQL Server instance:
USE master;
GO
EXEC sp_adddistributor
@distributor = @@SERVERNAME,
@password = N'ReplDistributor!Pass123';
GO
EXEC sp_adddistributiondb
@database = N'distribution',
@data_folder = N'/var/opt/mssql/data',
@log_folder = N'/var/opt/mssql/data',
@max_distretention = 72,
@history_retention = 48,
@security_mode = 0;
GO
USE distribution;
GO
EXEC sp_adddistpublisher
@publisher = @@SERVERNAME,
@distribution_db = N'distribution',
@working_directory = N'/var/opt/mssql/data/repldata',
@security_mode = 0;
GO
Verify on Both SQL Server Instances¶
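Run `sp_get_distributor`, the system procedure that reports the local Distributor state:

```sql
-- Returns one row with columns: installed, distribution server,
-- distribution db installed, is distribution publisher,
-- has remote distribution publisher
EXEC sp_get_distributor;
GO
```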
Expected output: installed = 1, distribution db installed = 1. If either value is 0, the Distributor was not configured correctly — re-run the commands above.
Phase 5: Enable Merge Publishing¶
Before a database can host a merge publication, SQL Server requires that merge publishing is explicitly enabled on it. This sets internal flags on the database that activate change tracking infrastructure — without this flag, any attempt to call sp_addmergepublication fails with an error.
Run on both SQL Server instances:
USE master;
GO
EXEC sp_replicationdboption
@dbname = N'RCIIS', -- The database to enable for merge publishing
@optname = N'merge publish', -- The replication option to set. 'merge publish' enables
-- merge replication. Other options include 'publish' (for
-- transactional replication) and 'subscribe' (for subscriptions).
@value = N'true'; -- Enable the option. Use 'false' to disable.
GO
Phase 6: Create the Publication and Articles on Cluster A¶
This phase creates the merge publication on Cluster A's SQL Server instance. A publication defines which tables are replicated, how conflicts are resolved, and how the snapshot is generated. Each table included in the publication is called an article.
6.1 Create the Publication¶
USE RCIIS;
GO
EXEC sp_addmergepublication
@publication = N'RCIIS_Merge_Pub', -- A name for this publication (must be unique within the database)
@description = N'Merge publication for RCIIS bidirectional replication',
-- Subscription expiration: if a Subscriber does not sync within this many days,
-- its subscription is marked inactive and must be reinitialised.
@retention = 14,
-- Snapshot format: 'native' uses SQL Server's binary BCP format, which is faster to
-- generate and apply than 'character' mode. Both the Publisher and Subscriber are SQL
-- Server, so native mode is appropriate.
@sync_mode = N'native',
-- Push vs pull subscriptions:
-- Push = the Merge Agent runs on the Publisher and connects to the Subscriber.
-- Pull = the Merge Agent runs on the Subscriber and connects to the Publisher.
-- We enable both, but will use push subscriptions in this guide.
@allow_push = N'true',
@allow_pull = N'true',
-- Anonymous subscriptions allow any server to subscribe without being registered.
-- Disabled for security — only explicitly registered subscribers can sync.
@allow_anonymous = N'false',
@enabled_for_internet = N'false',
-- Snapshot storage: 'true' means snapshot files are written to the Distributor's
-- default working_directory (configured in Phase 4). An alternate folder can be
-- specified if needed, but the default is simpler.
@snapshot_in_defaultfolder = N'true',
-- Compress snapshot files using CAB compression to reduce transfer time when the
-- Merge Agent copies them to the Subscriber over the network.
@compress_snapshot = N'true',
-- Where to log conflict details: 'both' means conflict rows are logged on both the
-- Publisher and the Subscriber. This allows administrators to review conflicts from
-- either SQL Server instance without needing cross-cluster access.
@conflict_logging = N'both',
-- DDL replication: 1 = replicate supported schema changes (ALTER TABLE ADD/DROP/ALTER
-- COLUMN, etc.) from Publisher to Subscriber automatically. 0 = do not replicate DDL.
@replicate_ddl = 1,
@allow_subscriber_initiated_snapshot = N'false',
@allow_web_synchronization = N'false',
-- Dynamic filters allow each Subscriber to receive a different subset of rows (e.g.,
-- based on a user ID). The RCIIS database replicates all rows to both instances, so
-- dynamic filters are not needed.
@dynamic_filters = N'false',
-- How many days to retain conflict history rows before the cleanup job deletes them.
@conflict_retention = 14,
@keep_partition_changes = N'false',
-- Maximum concurrent merge sessions: 0 = unlimited. Since there is only one Subscriber,
-- this is not a concern.
@max_concurrent_merge = 0,
@max_concurrent_dynamic_snapshots = 0;
GO
6.2 Add the Snapshot Agent¶
The Snapshot Agent is a SQL Server Agent job that generates the baseline snapshot files. This call registers the job but does not start it — the snapshot is triggered manually in Phase 7.
EXEC sp_addpublication_snapshot
@publication = N'RCIIS_Merge_Pub',
-- Schedule configuration: frequency_type = 1 means "one time" — the job does not
-- run on a recurring schedule. We trigger it manually with sp_startpublication_snapshot
-- when needed (initial setup, after adding new articles, after reinitialisation).
@frequency_type = 1,
@frequency_interval = 1,
@frequency_relative_interval = 1,
@frequency_recurrence_factor = 0,
@frequency_subday = 8,
@frequency_subday_interval = 1,
@active_start_time_of_day = 0,
@active_end_time_of_day = 235959,
@active_start_date = 0,
@active_end_date = 0,
-- The Snapshot Agent connects to the Publisher to read data. These credentials tell
-- it how to authenticate.
@publisher_security_mode = 0, -- 0 = SQL Server Authentication
@publisher_login = N'sa',
@publisher_password = N'<MSSQL_SA_PASSWORD>';
GO
6.3 Add Articles¶
Each user table in the RCIIS database (except __EFMigrationsHistory) must be added to the publication as an article. The __EFMigrationsHistory table is excluded because each SQL Server instance must maintain its own migration history independently — if it were replicated, applying a migration on one instance would insert a history row that replicates to the other, causing the other instance to skip all DDL in that migration (because EF Core's IF NOT EXISTS check finds the history row and skips the migration block).
The following script dynamically generates and executes sp_addmergearticle calls for every eligible table:
USE RCIIS;
GO
DECLARE @sql NVARCHAR(MAX) = N'';
SELECT @sql = @sql +
'EXEC sp_addmergearticle
@publication = N''RCIIS_Merge_Pub'',
@article = N''' + t.name + ''',
@source_owner = N''' + SCHEMA_NAME(t.schema_id) + ''',
@source_object = N''' + t.name + ''',
@type = N''table'',
@column_tracking = N''true'',
@identityrangemanagementoption = N''manual'',
@verify_resolver_signature = 0;
'
FROM sys.tables t
INNER JOIN sys.indexes i ON t.object_id = i.object_id AND i.is_primary_key = 1
WHERE t.is_ms_shipped = 0
AND t.name <> '__EFMigrationsHistory'
ORDER BY t.name;
EXEC sp_executesql @sql;
GO
| Parameter | Value | Purpose |
|---|---|---|
| `@publication` | `RCIIS_Merge_Pub` | Which publication this article belongs to. |
| `@article` | Table name | A logical name for this article (by convention, same as the table name). |
| `@source_owner` | Schema name | The schema that owns the table (e.g., dbo). |
| `@source_object` | Table name | The actual table name in the database. |
| `@type` | `table` | The article type. `table` means the entire table is replicated. Other types include `func` (function) and `proc` (stored procedure). |
| `@column_tracking` | `true` | Track changes at the column level instead of the row level. If two instances modify different columns of the same row between syncs, both changes are merged without a conflict. A conflict is only raised when the same column of the same row is changed on both instances. |
| `@identityrangemanagementoption` | `manual` | Do not let merge replication manage identity column ranges. The RCIIS database uses GUIDs for all primary keys (no identity/auto-increment columns), so automatic identity range management is unnecessary. |
| `@verify_resolver_signature` | `0` | Do not verify the digital signature of custom conflict resolvers. The built-in DATETIME resolver we use in Phase 6.4 is a signed Microsoft component, but setting this to 0 avoids issues in environments where the certificate chain cannot be validated. |
6.4 Configure Conflict Resolution¶
When both SQL Server instances modify the same column of the same row between syncs, the Merge Agent must decide which change wins. The DATETIME (Later Wins) resolver compares a specified datetime column on both versions of the row and keeps the version with the more recent timestamp.
All RCIIS business tables inherit UpdatedDateUTC from BaseEntity, which the application sets to DateTime.UtcNow on every write. This makes it the natural column for last-writer-wins resolution — the most recently modified version always wins.
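The resolver's rule can be stated in a few lines: compare `UpdatedDateUTC` on the two conflicting versions and keep the later one. The sketch below illustrates the rule only (it is not SQL Server's implementation; `RowVersion` is a hypothetical stand-in for a conflicting row version):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class RowVersion:
    values: dict        # column name -> value
    updated: datetime   # the UpdatedDateUTC column

def later_wins(publisher: RowVersion, subscriber: RowVersion) -> RowVersion:
    """Keep the version with the more recent UpdatedDateUTC, as the
    DATETIME (Later Wins) resolver does for a conflicting row."""
    return publisher if publisher.updated >= subscriber.updated else subscriber

# Both instances changed the same column of the same row between syncs:
a = RowVersion({"Status": "Approved"}, datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc))
b = RowVersion({"Status": "Rejected"}, datetime(2024, 5, 1, 10, 5, tzinfo=timezone.utc))
print(later_wins(a, b).values["Status"])  # the 10:05 write wins -> Rejected
```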
Generate and execute resolver configuration for all tables with UpdatedDateUTC:
USE RCIIS;
GO
DECLARE @sql NVARCHAR(MAX) = N'';
SELECT @sql = @sql +
'EXEC sp_changemergearticle
@publication = N''RCIIS_Merge_Pub'',
@article = N''' + t.name + ''',
@property = N''article_resolver'',
@value = N''Microsoft SQL Server DATETIME (Later Wins) Conflict Resolver'';
EXEC sp_changemergearticle
@publication = N''RCIIS_Merge_Pub'',
@article = N''' + t.name + ''',
@property = N''resolver_info'',
@value = N''UpdatedDateUTC'';
'
FROM sys.tables t
INNER JOIN sys.columns c ON t.object_id = c.object_id
WHERE c.name = 'UpdatedDateUTC'
AND t.is_ms_shipped = 0
AND t.name <> '__EFMigrationsHistory'
ORDER BY t.name;
EXEC sp_executesql @sql;
GO
| `sp_changemergearticle` Parameter | Purpose |
|---|---|
| `@property = 'article_resolver'` | Sets the conflict resolver for this article. Microsoft SQL Server DATETIME (Later Wins) Conflict Resolver is a built-in resolver that compares a datetime column to determine the winner. |
| `@property = 'resolver_info'` | Tells the resolver which column to compare. `UpdatedDateUTC` is the datetime column from BaseEntity that the application updates on every write. |
The Identifier table does not inherit from BaseEntity and has no UpdatedDateUTC column; it has a DateTime column named Date instead. Set its resolver separately:
EXEC sp_changemergearticle
@publication = N'RCIIS_Merge_Pub',
@article = N'Identifier',
@property = N'article_resolver',
@value = N'Microsoft SQL Server DATETIME (Later Wins) Conflict Resolver';
EXEC sp_changemergearticle
@publication = N'RCIIS_Merge_Pub',
@article = N'Identifier',
@property = N'resolver_info',
@value = N'Date';
GO
Identifier Table Conflict Risk
The Identifier table uses sequential counters (Prefix + SequentialNumber). If both SQL Server instances generate identifiers for the same prefix simultaneously, they will produce the same sequential number. The DATETIME resolver will pick one and discard the other, which means one instance silently loses its generated identifier. Use node-aware prefixes in the application layer (e.g., JRN-A-001 on Cluster A, JRN-B-001 on Cluster B) to prevent this.
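The node-aware scheme can be sketched as follows. This is a hypothetical illustration (the real generator lives in the identifier service); `node_suffix` would come from per-cluster configuration such as a Helm value:

```python
def next_identifier(prefix: str, node_suffix: str, sequential_number: int) -> str:
    """Embed the cluster's node suffix in the identifier so the two
    instances can never mint the same value for the same prefix."""
    return f"{prefix}-{node_suffix}-{sequential_number:03d}"

print(next_identifier("JRN", "A", 1))  # JRN-A-001 on Cluster A
print(next_identifier("JRN", "B", 1))  # JRN-B-001 on Cluster B (no collision)
```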
6.5 Verify Articles and Resolvers¶
USE RCIIS;
GO
SELECT
a.name AS ArticleName,
a.source_owner,
a.column_tracking,
a.article_resolver,
a.resolver_info
FROM dbo.sysmergearticles a
JOIN dbo.sysmergepublications p ON a.pubid = p.pubid
WHERE p.name = 'RCIIS_Merge_Pub'
ORDER BY a.name;
GO
Every article should show Microsoft SQL Server DATETIME (Later Wins) Conflict Resolver with UpdatedDateUTC (or Date for the Identifier table). If any article shows NULL for article_resolver, the default priority-based resolver will be used — re-run the sp_changemergearticle calls for that table.
Phase 7: Generate the Snapshot on Cluster A¶
The snapshot is the baseline that the Merge Agent uses to initialise Cluster B's SQL Server instance. It contains the schema definitions (CREATE TABLE, indexes, constraints), data in BCP bulk-copy format, and merge tracking metadata (triggers, ROWGUIDCOL, system tables) for every article in the publication. Without a snapshot, the subscription in Phase 9 has nothing to apply.
Run on Cluster A's SQL Server instance:
USE RCIIS;
GO
-- Trigger the Snapshot Agent job to generate snapshot files.
-- The agent reads every article in the publication, exports the schema and data,
-- and writes the files to the working_directory configured in Phase 4.
EXEC sp_startpublication_snapshot
@publication = N'RCIIS_Merge_Pub';
GO
Monitor snapshot generation. This can take several minutes depending on the size of the RCIIS database:
USE distribution;
GO
-- The MSsnapshot_history table records the progress and status of the Snapshot Agent.
-- status values: 1 = Started, 2 = Succeeded, 3 = In Progress, 4 = Idle, 5 = Retrying, 6 = Failed
SELECT id, name, status, start_time, duration, comments
FROM dbo.MSsnapshot_history
ORDER BY start_time DESC;
GO
Wait for status = 2 (Succeeded). Then verify that snapshot files were written:
kubectl exec -it statefulset/mssql -n rciis-prod --context cluster-a -- \
ls -la /var/opt/mssql/data/repldata/unc/
The directory should contain .sch, .bcp, .idx, and .pre files — one set per article.
Phase 8: Create Linked Servers¶
A linked server is a named reference to a remote SQL Server instance. It stores the remote server's network address and authentication credentials so that SQL Server can open connections to it. The Merge Agent uses linked servers to establish push subscription connections — when Cluster A's Merge Agent needs to push changes to Cluster B, it looks up the CLUSTER_B linked server to find the address and credentials.
On Cluster A's SQL Server Instance (link to Cluster B)¶
USE master;
GO
-- Create a linked server named CLUSTER_B that points to Cluster B's SQL Server instance.
EXEC sp_addlinkedserver
@server = N'CLUSTER_B', -- Logical name for the remote server (used in subscription commands)
@srvproduct = N'', -- Leave empty for SQL Server (used for non-SQL Server providers)
@provider = N'SQLNCLI', -- SQL Server Native Client OLE DB provider
@datasrc = N'mssql-b.rciis-prod.svc,1433'; -- Network address: the Cilium global service from Phase 1
-- Format: hostname,port (comma-separated, not colon)
GO
-- Configure the login mapping: when this instance connects to CLUSTER_B,
-- use SQL Server Authentication with the sa account.
EXEC sp_addlinkedsrvlogin
@rmtsrvname = N'CLUSTER_B', -- Which linked server this login applies to
@useself = N'false', -- Do not pass through the current user's Windows credentials
@rmtuser = N'sa', -- Remote username to authenticate as
@rmtpassword = N'<MSSQL_SA_PASSWORD>'; -- Remote password
GO
-- Verify the connection works by querying a system view on the remote instance.
-- This SELECT crosses the linked server connection to read from CLUSTER_B's master database.
SELECT * FROM CLUSTER_B.master.sys.databases;
GO
On Cluster B's SQL Server Instance (link to Cluster A)¶
USE master;
GO
EXEC sp_addlinkedserver
@server = N'CLUSTER_A',
@srvproduct = N'',
@provider = N'SQLNCLI',
@datasrc = N'mssql-a.rciis-prod.svc,1433';
GO
EXEC sp_addlinkedsrvlogin
@rmtsrvname = N'CLUSTER_A',
@useself = N'false',
@rmtuser = N'sa',
@rmtpassword = N'<MSSQL_SA_PASSWORD>';
GO
-- Verify
SELECT * FROM CLUSTER_A.master.sys.databases;
GO
If the verification query fails with "Login failed" or "Cannot connect", check that the sa password matches across both instances and that the Cilium Cluster Mesh services from Phase 1 are resolving correctly.
Phase 9: Subscribe Cluster B to Cluster A's Publication¶
This is the first direction of the bidirectional setup. Cluster A's SQL Server instance (the Publisher) creates a push subscription to Cluster B's instance. "Push" means the Merge Agent runs on the Publisher side (Cluster A) and connects to the Subscriber (Cluster B) to deliver changes.
On the first run, the Merge Agent detects that Cluster B does not yet have the RCIIS data initialised for this subscription, so it applies the snapshot from Phase 7 — creating all tables, indexes, merge tracking metadata, and copying all data rows.
Run on Cluster A's SQL Server instance:
9.1 Create the Subscription and Merge Agent¶
USE RCIIS;
GO
-- Create the subscription. This tells Cluster A's publication that Cluster B is a subscriber.
EXEC sp_addmergesubscription
@publication = N'RCIIS_Merge_Pub', -- Which publication Cluster B subscribes to
@subscriber = N'CLUSTER_B', -- The linked server name for the remote instance (from Phase 8)
@subscriber_db = N'RCIIS', -- The database name on the Subscriber that receives replicated data
@subscription_type = N'push', -- Push = Merge Agent runs on Publisher (Cluster A).
-- Pull = Merge Agent runs on Subscriber (Cluster B).
-- Push is preferred when the Publisher has better network access.
@sync_type = N'automatic', -- On the first sync, automatically apply the snapshot to initialise
-- the Subscriber. After initialisation, the agent syncs incrementally.
@subscriber_type = N'global', -- Global subscriber = full read/write participant in the topology.
-- The alternative is 'local', which has lower priority and can only
-- receive changes (not originate them).
    @subscription_priority = 75; -- Priority for conflict resolution (0–99.99). Priority matters only
                                 -- to the default priority-based resolver; with the DATETIME (Later
                                 -- Wins) resolver from Phase 6.4 configured, conflicts are decided
                                 -- by UpdatedDateUTC instead.
GO
-- Create the SQL Server Agent job that runs the Merge Agent.
-- This job handles all synchronisation: uploading Subscriber changes to the Publisher,
-- then downloading Publisher changes to the Subscriber.
EXEC sp_addmergepushsubscription_agent
@publication = N'RCIIS_Merge_Pub',
@subscriber = N'CLUSTER_B',
@subscriber_db = N'RCIIS',
-- Credentials for connecting to the Subscriber (Cluster B):
@subscriber_security_mode = 0, -- 0 = SQL Server Authentication
@subscriber_login = N'sa',
@subscriber_password = N'<MSSQL_SA_PASSWORD>',
-- Credentials for connecting to the Publisher (this instance):
@publisher_security_mode = 0,
@publisher_login = N'sa',
@publisher_password = N'<MSSQL_SA_PASSWORD>',
-- Schedule: frequency_type = 64 means "auto-start" — the job starts when SQL Server
-- Agent starts. Combined with Phase 14's schedule, this ensures the agent runs
-- continuously on a 1-minute interval.
@frequency_type = 64,
@frequency_interval = 0,
@frequency_relative_interval = 0,
@frequency_recurrence_factor = 0,
@frequency_subday = 0,
@frequency_subday_interval = 0;
GO
9.2 Start the Merge Agent and Apply the Snapshot¶
SQL Server Agent creates the Merge Agent as a job with an auto-generated name. Find it and start it:
USE msdb;
GO
-- List all replication-related jobs on this instance.
-- The REPL% category filter matches all replication agent jobs
-- (Merge Agent, Snapshot Agent, Log Reader Agent, Distribution Agent).
SELECT name FROM dbo.sysjobs
WHERE category_id IN (
SELECT category_id FROM dbo.syscategories WHERE name LIKE 'REPL%'
)
ORDER BY name;
GO
-- Start the A → B Merge Agent job.
-- Replace <MERGE_AGENT_JOB_A> with the actual job name from the query above.
EXEC dbo.sp_start_job @job_name = N'<MERGE_AGENT_JOB_A>';
GO
On the first run, the Merge Agent applies the snapshot — creating the RCIIS database schema, copying all data, and setting up merge tracking metadata on Cluster B. Monitor progress:
USE distribution;
GO
-- MSmerge_history records every Merge Agent run, including how many rows were
-- transferred and whether the run succeeded or failed.
SELECT TOP 20
time,
duration,
comments,
delivery_rate,
publisher_insertcount, -- Rows pushed from Publisher to Subscriber
subscriber_insertcount, -- Rows uploaded from Subscriber to Publisher
CASE runstatus
WHEN 1 THEN 'Started'
WHEN 2 THEN 'Succeeded'
WHEN 3 THEN 'In Progress'
WHEN 4 THEN 'Idle'
WHEN 5 THEN 'Retrying'
WHEN 6 THEN 'Failed'
END AS Status
FROM dbo.MSmerge_history
ORDER BY time DESC;
GO
Wait for Succeeded or Idle status. The initial snapshot application can take several minutes.
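When scripting this wait (for example in a CI pipeline), the gating logic is worth making explicit. A small sketch using the runstatus codes from the query above; how the latest status is fetched from SQL Server is left out:

```python
RUN_STATUS = {1: "Started", 2: "Succeeded", 3: "In Progress",
              4: "Idle", 5: "Retrying", 6: "Failed"}

def safe_to_proceed(latest_status: int) -> bool:
    """Decide whether Phase 9.3 verification can start.
    2 (Succeeded) or 4 (Idle) means the Merge Agent finished its work;
    6 (Failed) should stop the rollout rather than silently continue;
    anything else means keep polling."""
    if latest_status == 6:
        raise RuntimeError("Merge Agent run failed; check MSmerge_history comments")
    return latest_status in (2, 4)

print(safe_to_proceed(3))  # → False (still applying the snapshot)
print(safe_to_proceed(2))  # → True
```

Treating 5 (Retrying) as "keep polling" rather than failure matters here, since transient cross-cluster network blips put the agent into retry mode.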
9.3 Verify Cluster B¶
Connect to Cluster B's SQL Server instance and confirm the RCIIS database was populated:
kubectl exec -it statefulset/mssql -n rciis-prod --context cluster-b -- \
/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$MSSQL_SA_PASSWORD" -C
USE RCIIS;
GO
-- Count user tables — must match Cluster A
SELECT COUNT(*) AS TableCount
FROM sys.tables
WHERE is_ms_shipped = 0;
GO
-- Verify merge tracking metadata was created.
-- These MSmerge_* system tables are added by the Snapshot Agent and are used by
-- the Merge Agent to track row versions, tombstones (deleted rows), and generations
-- (batches of changes).
SELECT name FROM sys.tables WHERE name LIKE 'MSmerge_%' ORDER BY name;
GO
Do Not Proceed Until Cluster B Is Fully Initialised
Verify the table count matches Cluster A and the MSmerge_* metadata tables exist. Phase 10 creates a publication on Cluster B, which requires the RCIIS database and all its tables to be present.
Phase 10: Create the Publication and Articles on Cluster B¶
Now that Cluster B's SQL Server instance has a complete copy of the RCIIS database (including the ROWGUIDCOL markings from Phase 3, delivered via the snapshot), create the second publication. This is the mirror of Phase 6 — the same publication settings, articles, and conflict resolvers, but on Cluster B's instance.
Run all commands in this phase on Cluster B's SQL Server instance.
10.1 Create the Publication and Snapshot Agent¶
The parameters are identical to Phase 6.1 and 6.2 — refer to those sections for detailed explanations of each parameter.
USE RCIIS;
GO
EXEC sp_addmergepublication
@publication = N'RCIIS_Merge_Pub',
@description = N'Merge publication for RCIIS bidirectional replication',
@retention = 14,
@sync_mode = N'native',
@allow_push = N'true',
@allow_pull = N'true',
@allow_anonymous = N'false',
@enabled_for_internet = N'false',
@snapshot_in_defaultfolder = N'true',
@compress_snapshot = N'true',
@conflict_logging = N'both',
@replicate_ddl = 1,
@allow_subscriber_initiated_snapshot = N'false',
@allow_web_synchronization = N'false',
@dynamic_filters = N'false',
@conflict_retention = 14,
@keep_partition_changes = N'false',
@max_concurrent_merge = 0,
@max_concurrent_dynamic_snapshots = 0;
GO
EXEC sp_addpublication_snapshot
@publication = N'RCIIS_Merge_Pub',
@frequency_type = 1,
@frequency_interval = 1,
@frequency_relative_interval = 1,
@frequency_recurrence_factor = 0,
@frequency_subday = 8,
@frequency_subday_interval = 1,
@active_start_time_of_day = 0,
@active_end_time_of_day = 235959,
@active_start_date = 0,
@active_end_date = 0,
@publisher_security_mode = 0,
@publisher_login = N'sa',
@publisher_password = N'<MSSQL_SA_PASSWORD>';
GO
10.2 Add Articles and Configure Conflict Resolution¶
The same dynamic scripts from Phase 6.3 and 6.4. Since Cluster B now has all the RCIIS tables (delivered by the snapshot), the sys.tables queries will find the same tables as on Cluster A.
USE RCIIS;
GO
-- Add all articles (same as Phase 6.3)
DECLARE @sql NVARCHAR(MAX) = N'';
SELECT @sql = @sql +
'EXEC sp_addmergearticle
@publication = N''RCIIS_Merge_Pub'',
@article = N''' + t.name + ''',
@source_owner = N''' + SCHEMA_NAME(t.schema_id) + ''',
@source_object = N''' + t.name + ''',
@type = N''table'',
@column_tracking = N''true'',
@identityrangemanagementoption = N''manual'',
@verify_resolver_signature = 0;
'
FROM sys.tables t
INNER JOIN sys.indexes i ON t.object_id = i.object_id AND i.is_primary_key = 1
WHERE t.is_ms_shipped = 0
AND t.name <> '__EFMigrationsHistory'
ORDER BY t.name;
EXEC sp_executesql @sql;
GO
-- Configure DATETIME (Later Wins) resolver for all BaseEntity tables (same as Phase 6.4)
DECLARE @sql NVARCHAR(MAX) = N''; -- New batch after the GO above, so the variable must be redeclared
SELECT @sql = @sql +
'EXEC sp_changemergearticle
@publication = N''RCIIS_Merge_Pub'',
@article = N''' + t.name + ''',
@property = N''article_resolver'',
@value = N''Microsoft SQL Server DATETIME (Later Wins) Conflict Resolver'';
EXEC sp_changemergearticle
@publication = N''RCIIS_Merge_Pub'',
@article = N''' + t.name + ''',
@property = N''resolver_info'',
@value = N''UpdatedDateUTC'';
'
FROM sys.tables t
INNER JOIN sys.columns c ON t.object_id = c.object_id
WHERE c.name = 'UpdatedDateUTC'
AND t.is_ms_shipped = 0
AND t.name <> '__EFMigrationsHistory'
ORDER BY t.name;
EXEC sp_executesql @sql;
GO
-- Identifier table resolver (uses Date instead of UpdatedDateUTC)
EXEC sp_changemergearticle
@publication = N'RCIIS_Merge_Pub',
@article = N'Identifier',
@property = N'article_resolver',
@value = N'Microsoft SQL Server DATETIME (Later Wins) Conflict Resolver';
EXEC sp_changemergearticle
@publication = N'RCIIS_Merge_Pub',
@article = N'Identifier',
@property = N'resolver_info',
@value = N'Date';
GO
10.3 Verify¶
SELECT
a.name AS ArticleName,
a.article_resolver,
a.resolver_info
FROM dbo.sysmergearticles a
JOIN dbo.sysmergepublications p ON a.pubid = p.pubid
WHERE p.name = 'RCIIS_Merge_Pub'
ORDER BY a.name;
GO
The article count and resolver configuration should be identical to Cluster A's publication (Phase 6.5).
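That comparison can be automated. A sketch assuming each side's verification query has been exported as a list of (article, article_resolver, resolver_info) tuples; the variable names and sample data are illustrative:

```python
def config_drift(pub_a, pub_b):
    """Compare article configurations from the two publications and
    return the names of articles that differ or exist on only one side."""
    a = {t[0]: t[1:] for t in pub_a}
    b = {t[0]: t[1:] for t in pub_b}
    return sorted(name for name in a.keys() | b.keys()
                  if a.get(name) != b.get(name))

cluster_a = [("Country",
              "Microsoft SQL Server DATETIME (Later Wins) Conflict Resolver",
              "UpdatedDateUTC")]
cluster_b = [("Country", None, None)]  # resolver not yet configured on B

print(config_drift(cluster_a, cluster_b))  # → ['Country']
print(config_drift(cluster_a, cluster_a)) # → []
```

Any article reported here would resolve conflicts differently depending on direction, which defeats the Later-Wins design, so drift should be fixed before Phase 11.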
Phase 11: Generate the Snapshot on Cluster B¶
Even though Cluster A already has all the data, Cluster B's publication still needs a snapshot. The snapshot is required for the subscription metadata and for future reinitialisation scenarios — if Cluster A's database ever becomes corrupted, the Merge Agent can reinitialise it from Cluster B's snapshot.
Run on Cluster B's SQL Server instance:
USE RCIIS;
GO
-- Trigger the Snapshot Agent job, exactly as in Phase 7
EXEC sp_startpublication_snapshot
    @publication = N'RCIIS_Merge_Pub';
GO
Monitor and wait for runstatus = 2 (Succeeded):
USE distribution;
GO
SELECT h.agent_id, a.name, h.runstatus, h.start_time, h.duration, h.comments
FROM dbo.MSsnapshot_history h
JOIN dbo.MSsnapshot_agents a ON h.agent_id = a.id
ORDER BY h.start_time DESC;
GO
Phase 12: Subscribe Cluster A to Cluster B's Publication¶
This is the second direction of the bidirectional setup. Cluster B's SQL Server instance (the Publisher) creates a push subscription to Cluster A's instance. Since Cluster A already has all the RCIIS data and merge tracking metadata, @sync_type = N'none' tells the Merge Agent to skip snapshot application entirely and begin synchronising changes immediately.
Run on Cluster B's SQL Server instance:
USE RCIIS;
GO
-- Create the subscription.
-- @sync_type = 'none' is critical here — it means "do not apply a snapshot, the
-- Subscriber (Cluster A) already has the data." If you used 'automatic' instead,
-- the Merge Agent would overwrite Cluster A's data with the snapshot, which could
-- cause data loss if changes were made on Cluster A since Phase 9.
EXEC sp_addmergesubscription
@publication = N'RCIIS_Merge_Pub',
@subscriber = N'CLUSTER_A', -- Linked server for Cluster A (from Phase 8)
@subscriber_db = N'RCIIS',
@subscription_type = N'push',
@sync_type = N'none', -- Skip snapshot — Cluster A already has all data
@subscriber_type = N'global', -- Full read/write participant
@subscription_priority = 75; -- Equal priority to the Publisher
GO
-- Create the Merge Agent job (same parameter pattern as Phase 9.1).
EXEC sp_addmergepushsubscription_agent
@publication = N'RCIIS_Merge_Pub',
@subscriber = N'CLUSTER_A',
@subscriber_db = N'RCIIS',
@subscriber_security_mode = 0,
@subscriber_login = N'sa',
@subscriber_password = N'<MSSQL_SA_PASSWORD>',
@publisher_security_mode = 0,
@publisher_login = N'sa',
@publisher_password = N'<MSSQL_SA_PASSWORD>',
@frequency_type = 64, -- Auto-start with SQL Server Agent
@frequency_interval = 0,
@frequency_relative_interval = 0,
@frequency_recurrence_factor = 0,
@frequency_subday = 0,
@frequency_subday_interval = 0;
GO
Start the Merge Agent:
USE msdb;
GO
-- List replication jobs to find the auto-generated Merge Agent job name
SELECT name FROM dbo.sysjobs
WHERE category_id IN (
SELECT category_id FROM dbo.syscategories WHERE name LIKE 'REPL%'
)
ORDER BY name;
GO
-- Start the B → A Merge Agent
EXEC dbo.sp_start_job @job_name = N'<MERGE_AGENT_JOB_B>';
GO
Phase 13: Validate Bidirectional Replication¶
At this point, both directions are active:
- A → B: Merge Agent on Cluster A's SQL Server instance pushes changes to Cluster B
- B → A: Merge Agent on Cluster B's SQL Server instance pushes changes to Cluster A
Test Cluster A → Cluster B¶
Insert a test row on Cluster A's SQL Server instance:
-- On Cluster A's SQL Server instance
USE RCIIS;
GO
INSERT INTO dbo.Country (Id, Code, CreatedDateTimeUTC, UpdatedDateUTC, UpdatedBy)
VALUES (NEWID(), 'T01', GETUTCDATE(), GETUTCDATE(), '00000000-0000-0000-0000-000000000000');
GO
Trigger a sync (or wait for the scheduled agent run):
-- On Cluster A's SQL Server instance — manually start the Merge Agent
EXEC msdb.dbo.sp_start_job @job_name = N'<MERGE_AGENT_JOB_A>';
GO
Verify on Cluster B's SQL Server instance:
-- On Cluster B's SQL Server instance
USE RCIIS;
GO
SELECT Id, Code FROM dbo.Country WHERE Code = 'T01';
GO
-- Expected: 1 row
Test Cluster B → Cluster A¶
Insert a test row on Cluster B's SQL Server instance:
-- On Cluster B's SQL Server instance
USE RCIIS;
GO
INSERT INTO dbo.Country (Id, Code, CreatedDateTimeUTC, UpdatedDateUTC, UpdatedBy)
VALUES (NEWID(), 'T02', GETUTCDATE(), GETUTCDATE(), '00000000-0000-0000-0000-000000000000');
GO
Trigger a sync:
-- On Cluster B's SQL Server instance
EXEC msdb.dbo.sp_start_job @job_name = N'<MERGE_AGENT_JOB_B>';
GO
Verify on Cluster A's SQL Server instance:
-- On Cluster A's SQL Server instance
USE RCIIS;
GO
SELECT Id, Code FROM dbo.Country WHERE Code = 'T02';
GO
-- Expected: 1 row
Clean Up Test Data¶
-- On either SQL Server instance (the delete syncs on the next merge run)
DELETE FROM dbo.Country WHERE Code IN ('T01', 'T02');
GO
Phase 14: Configure Merge Agent Schedules¶
The Merge Agents are currently set to auto-start but do not run on a recurring schedule. Configure them to run on a 1-minute interval so that changes are synchronised continuously. The schedule is set independently on each SQL Server instance because each instance runs its own Merge Agent.
Cluster A's SQL Server Instance¶
USE msdb;
GO
EXEC sp_update_jobschedule
@job_name = N'<MERGE_AGENT_JOB_A>', -- The Merge Agent job name from Phase 9.2
    @name = N'Merge Agent Schedule', -- Name of the existing schedule to update. The schedule created in
                                     -- Phase 9 has an auto-generated name; look it up in msdb.dbo.sysschedules
                                     -- and substitute it here if it differs.
-- Frequency: run daily, every 1 minute.
@freq_type = 4, -- 4 = Daily (1 = One-time, 8 = Weekly, 16 = Monthly)
@freq_interval = 1, -- Every 1 day (applies to freq_type = 4)
@freq_subday_type = 4, -- 4 = Minutes (1 = At specified time, 2 = Seconds, 8 = Hours)
@freq_subday_interval = 1, -- Every 1 minute. Change to 5 for every 5 minutes, etc.
    @active_start_time = 0; -- Schedule active from 00:00:00 (midnight), i.e. all day
GO
Cluster B's SQL Server Instance¶
USE msdb;
GO
EXEC sp_update_jobschedule
@job_name = N'<MERGE_AGENT_JOB_B>',
@name = N'Merge Agent Schedule',
@freq_type = 4,
@freq_interval = 1,
@freq_subday_type = 4,
@freq_subday_interval = 1,
@active_start_time = 0;
GO
With @freq_subday_interval = 1, each Merge Agent runs every minute. During each run it uploads the Subscriber's changes to the Publisher, then downloads the Publisher's changes to the Subscriber. The maximum replication lag between the two instances is approximately 1 minute under normal conditions.
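The "approximately 1 minute" figure comes from simple worst-case arithmetic: a change committed just after a merge run begins waits one full interval for the next run, plus the run itself. A sketch where run_seconds is an assumed typical agent run time, not a measured value:

```python
def worst_case_lag_seconds(interval_minutes: int, run_seconds: int = 10) -> int:
    """Rough worst-case one-direction replication lag: a change committed
    immediately after a merge run starts waits a full interval for the
    next run, then the run itself has to complete."""
    return interval_minutes * 60 + run_seconds

print(worst_case_lag_seconds(1))  # → 70  (with the 1-minute schedule)
print(worst_case_lag_seconds(5))  # → 310 (if relaxed to every 5 minutes)
```

This also shows the trade-off if the interval is increased to reduce agent overhead: lag grows linearly with the interval.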
Phase 15: Configure Monitoring¶
All monitoring queries should be run on both SQL Server instances to get a complete picture of both replication directions. Connect using the kubectl exec + sqlcmd pattern from Phase 2.
Merge Agent History¶
Shows the last 20 sync events for the Merge Agent running on this instance:
USE distribution;
GO
SELECT TOP 20
h.time, -- When the sync ran
h.duration, -- How long it took (seconds)
h.comments, -- Human-readable status message
h.delivery_rate, -- Rows per second
h.publisher_insertcount + h.publisher_updatecount + h.publisher_deletecount AS PublisherChanges,
h.subscriber_insertcount + h.subscriber_updatecount + h.subscriber_deletecount AS SubscriberChanges,
CASE h.runstatus
WHEN 1 THEN 'Started'
WHEN 2 THEN 'Succeeded'
WHEN 3 THEN 'In Progress'
WHEN 4 THEN 'Idle' -- No changes to sync
WHEN 5 THEN 'Retrying' -- Failed but will retry
WHEN 6 THEN 'Failed' -- Check comments for error details
END AS Status
FROM dbo.MSmerge_history h
ORDER BY h.time DESC;
GO
Check for Conflicts¶
Returns any rows that were involved in a merge conflict. The "losing" row's data is stored in a conflict table (one per article) for manual review:
USE RCIIS;
GO
-- Summary of all conflicts across all articles in the publication
EXEC sp_helpmergeconflictrows
@publication = N'RCIIS_Merge_Pub';
GO
Check for Errors¶
USE distribution;
GO
-- Show the 20 most recent Merge Agent failures
-- runstatus = 6 means the agent run failed
SELECT TOP 20 time, comments
FROM dbo.MSmerge_history
WHERE runstatus = 6
ORDER BY time DESC;
GO
Agent Job Status¶
USE msdb;
GO
-- Shows the last execution result for each replication job on this instance.
-- Check this after connectivity issues or agent restarts.
SELECT
j.name,
j.enabled, -- 1 = enabled, 0 = disabled
CASE jh.run_status
WHEN 0 THEN 'Failed'
WHEN 1 THEN 'Succeeded'
WHEN 2 THEN 'Retry'
WHEN 3 THEN 'Cancelled'
WHEN 4 THEN 'In Progress'
END AS LastRunStatus,
jh.run_date AS LastRunDate, -- YYYYMMDD format
jh.run_duration AS LastRunDuration, -- HHMMSS format
jh.message AS LastMessage
FROM dbo.sysjobs j
LEFT JOIN (
SELECT job_id, run_status, run_date, run_duration, message,
ROW_NUMBER() OVER (PARTITION BY job_id ORDER BY run_date DESC, run_time DESC) AS rn
FROM dbo.sysjobhistory
WHERE step_id = 0 -- step_id 0 = the overall job outcome (not individual steps)
) jh ON j.job_id = jh.job_id AND jh.rn = 1
WHERE j.category_id IN (
SELECT category_id FROM dbo.syscategories WHERE name LIKE 'REPL%'
)
ORDER BY j.name;
GO
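The integer encodings noted in the query comments (run_date as YYYYMMDD, run_duration as HHMMSS without zero padding) are easy to misread. A small decoding helper, not part of the deployment itself:

```python
from datetime import date, timedelta

def decode_run_date(run_date: int) -> date:
    """msdb.dbo.sysjobhistory.run_date is an int in YYYYMMDD form."""
    return date(run_date // 10000, run_date // 100 % 100, run_date % 100)

def decode_run_duration(run_duration: int) -> timedelta:
    """msdb.dbo.sysjobhistory.run_duration is an int in HHMMSS form.
    It is not zero-padded: 142 means 1 minute 42 seconds, not 142 seconds."""
    hours, rest = divmod(run_duration, 10000)
    minutes, seconds = divmod(rest, 100)
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

print(decode_run_date(20240131))   # → 2024-01-31
print(decode_run_duration(142))    # → 0:01:42
print(decode_run_duration(13542))  # → 1:35:42
```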
Database Migrations¶
Database migrations are applied automatically by FluxCD during each reconciliation. The migration.sql script is baked into the harbor.devops.africa/nucleus/mssql-tools Docker image alongside a migration.sh entrypoint script. FluxCD runs this image as a Kubernetes Job (db-migration) with sync-wave "1" and hook type Sync — meaning the migration job runs before the application pods are updated. The job connects to the local SQL Server instance, creates the RCIIS database if it does not exist, and executes the migration script.
Each migration block in migration.sql is wrapped in IF NOT EXISTS (SELECT * FROM [__EFMigrationsHistory] WHERE ...), which makes the script idempotent — it can safely run multiple times without reapplying already-completed migrations.
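The idempotency guard amounts to a set-membership check against the history table. A sketch of that pattern with hypothetical migration ids (the real script does this per block in T-SQL, not in application code):

```python
def pending_migrations(history_ids, migrations):
    """Mirror of the IF NOT EXISTS guard: a migration block runs only if
    its id is absent from __EFMigrationsHistory, so replaying the full
    script is a no-op for already-applied blocks."""
    applied = set(history_ids)
    return [(mid, sql) for mid, sql in migrations if mid not in applied]

migrations = [("20240101_Init", "CREATE TABLE ..."),
              ("20240301_AddCol", "ALTER TABLE ...")]

print([m for m, _ in pending_migrations(["20240101_Init"], migrations)])
# → ['20240301_AddCol']
```

Idempotency is what makes the FluxCD model safe: the same job image can run on every reconciliation without reapplying completed migrations.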
How This Interacts with Merge Replication¶
Merge replication with @replicate_ddl = 1 replicates supported DDL changes (ALTER TABLE ADD/DROP/ALTER COLUMN, etc.) to existing articles. However, __EFMigrationsHistory is excluded from both publications, so each SQL Server instance maintains its own migration history independently.
Because FluxCD runs the db-migration job on each cluster during its own reconciliation, both SQL Server instances receive the same migration. The deployment process is:
- Update the db-migration job's image tag in values.yaml (or the environment override) to the new image that contains the updated migration.sql.
- FluxCD reconciles Cluster A → the db-migration job runs and applies the migration to Cluster A's SQL Server instance.
- FluxCD reconciles Cluster B → the db-migration job runs and applies the same migration to Cluster B's SQL Server instance.
Since each instance runs the migration independently and __EFMigrationsHistory is not replicated, there is no risk of the migration history row from one instance interfering with the other.
Coordinating Merge Agents During Migrations¶
If a migration contains DDL changes to tables that are already articles in the publication (e.g., ALTER TABLE ADD COLUMN), the Merge Agent should be paused to prevent it from encountering a partial schema state during the migration.
- Pause both Merge Agents before the FluxCD reconciliation:
  -- On Cluster A's SQL Server instance
  EXEC msdb.dbo.sp_stop_job @job_name = N'<MERGE_AGENT_JOB_A>';
  -- On Cluster B's SQL Server instance
  EXEC msdb.dbo.sp_stop_job @job_name = N'<MERGE_AGENT_JOB_B>';
- Reconcile both clusters via FluxCD. The db-migration job runs on each cluster and applies the migration.
- Resume both Merge Agents after both syncs complete:
  -- On Cluster A's SQL Server instance
  EXEC msdb.dbo.sp_start_job @job_name = N'<MERGE_AGENT_JOB_A>';
  -- On Cluster B's SQL Server instance
  EXEC msdb.dbo.sp_start_job @job_name = N'<MERGE_AGENT_JOB_B>';
Adding New Tables¶
If the migration adds new tables, they are not automatically included in an existing publication — they must be registered as articles explicitly. After the FluxCD reconciliation has applied the migration on both clusters:
- Pause both Merge Agents (if not already paused from the migration coordination above).
- Add the new table as an article on both SQL Server instances:
  USE RCIIS;
  GO
  -- Mark the GUID PK as ROWGUIDCOL (prevents merge replication from adding a rowguid column)
  ALTER TABLE [dbo].[NewTableName] ALTER COLUMN [Id] ADD ROWGUIDCOL;
  GO
  -- Register the table as an article in this instance's publication
  EXEC sp_addmergearticle
      @publication = N'RCIIS_Merge_Pub',
      @article = N'NewTableName',
      @source_owner = N'dbo',
      @source_object = N'NewTableName',
      @type = N'table',
      @column_tracking = N'true',
      @identityrangemanagementoption = N'manual',
      @verify_resolver_signature = 0;
  GO
  -- Set the conflict resolver
  EXEC sp_changemergearticle
      @publication = N'RCIIS_Merge_Pub',
      @article = N'NewTableName',
      @property = N'article_resolver',
      @value = N'Microsoft SQL Server DATETIME (Later Wins) Conflict Resolver';
  EXEC sp_changemergearticle
      @publication = N'RCIIS_Merge_Pub',
      @article = N'NewTableName',
      @property = N'resolver_info',
      @value = N'UpdatedDateUTC';
  GO
- Regenerate snapshots on both SQL Server instances so the new table is included for future reinitialisation:
  EXEC sp_startpublication_snapshot
      @publication = N'RCIIS_Merge_Pub';
  GO
- Resume both Merge Agents:
  -- On Cluster A's SQL Server instance
  EXEC msdb.dbo.sp_start_job @job_name = N'<MERGE_AGENT_JOB_A>';
  -- On Cluster B's SQL Server instance
  EXEC msdb.dbo.sp_start_job @job_name = N'<MERGE_AGENT_JOB_B>';
Operational Runbooks¶
Pause Replication¶
Stop the Merge Agent on both SQL Server instances:
-- On Cluster A's SQL Server instance
EXEC msdb.dbo.sp_stop_job @job_name = N'<MERGE_AGENT_JOB_A>';
-- On Cluster B's SQL Server instance
EXEC msdb.dbo.sp_stop_job @job_name = N'<MERGE_AGENT_JOB_B>';
Resume Replication¶
-- On Cluster A's SQL Server instance
EXEC msdb.dbo.sp_start_job @job_name = N'<MERGE_AGENT_JOB_A>';
-- On Cluster B's SQL Server instance
EXEC msdb.dbo.sp_start_job @job_name = N'<MERGE_AGENT_JOB_B>';
Force a Manual Sync¶
Each Merge Agent handles one direction. To sync both directions immediately, start both agents:
-- On Cluster A's SQL Server instance (syncs A → B and uploads B → A)
EXEC msdb.dbo.sp_start_job @job_name = N'<MERGE_AGENT_JOB_A>';
-- On Cluster B's SQL Server instance (syncs B → A and uploads A → B)
EXEC msdb.dbo.sp_start_job @job_name = N'<MERGE_AGENT_JOB_B>';
Network Partition Recovery¶
After a Cilium Cluster Mesh outage, both Merge Agents will fail because they cannot reach the remote SQL Server instance. The agents will enter retry mode automatically (runstatus = 5). After connectivity is restored, verify the Cluster Mesh and cross-cluster SQL connectivity:
# Check Cilium Cluster Mesh health on each cluster
cilium clustermesh status --context cluster-a
cilium clustermesh status --context cluster-b
# Verify SQL connectivity from Cluster A → Cluster B
kubectl exec -it statefulset/mssql -n rciis-prod --context cluster-a -- \
/opt/mssql-tools18/bin/sqlcmd \
-S mssql-b.rciis-prod.svc,1433 \
-U sa -P "$MSSQL_SA_PASSWORD" -C \
-Q "SELECT 1"
# Verify SQL connectivity from Cluster B → Cluster A
kubectl exec -it statefulset/mssql -n rciis-prod --context cluster-b -- \
/opt/mssql-tools18/bin/sqlcmd \
-S mssql-a.rciis-prod.svc,1433 \
-U sa -P "$MSSQL_SA_PASSWORD" -C \
-Q "SELECT 1"
If the Merge Agents do not recover automatically after connectivity is restored, restart them:
-- On Cluster A's SQL Server instance
EXEC msdb.dbo.sp_start_job @job_name = N'<MERGE_AGENT_JOB_A>';
-- On Cluster B's SQL Server instance
EXEC msdb.dbo.sp_start_job @job_name = N'<MERGE_AGENT_JOB_B>';
Reinitialise a Subscriber¶
If one SQL Server instance's data becomes corrupted or falls out of sync beyond the retention period (14 days without a successful merge), its subscription must be reinitialised from the other instance's snapshot. This example reinitialises Cluster B from Cluster A's snapshot:
-- Run on Cluster A's SQL Server instance (the Publisher for the A → B direction)
USE RCIIS;
GO
-- Mark the subscription for reinitialisation.
-- @upload_first = 'true' tells the Merge Agent to upload any pending changes from
-- Cluster B before overwriting it with the snapshot. This prevents data loss for
-- changes that were written to Cluster B but not yet synced.
EXEC sp_reinitmergesubscription
@publication = N'RCIIS_Merge_Pub',
@subscriber = N'CLUSTER_B',
@subscriber_db = N'RCIIS',
@upload_first = N'true';
GO
-- Regenerate the snapshot so it includes the latest data
EXEC sp_startpublication_snapshot
@publication = N'RCIIS_Merge_Pub';
GO
The next Merge Agent run uploads pending changes from Cluster B, then reapplies the snapshot.
Schema Overhead¶
Merge replication adds the following to each replicated table:
| Object | Purpose |
|---|---|
| ROWGUIDCOL property on Id column | Row identity for merge tracking |
| rowguid column (Identifier table only) | Auto-added by merge replication where no GUID column exists |
| INSERT / UPDATE / DELETE triggers | Track every DML operation for merge synchronisation |
| MSmerge_* system tables | Merge metadata (generations, tombstones, contents) |
The triggers and system tables are managed automatically by SQL Server and do not require manual maintenance. They add overhead to every DML operation — roughly 10–20% for write-heavy workloads.
Limitations¶
- Standard Edition constraints. Maximum 128 GB RAM per instance, 24 CPU cores, no online index rebuild.
- __EFMigrationsHistory excluded. This table is not added as an article on either SQL Server instance. Migrations must be applied independently on both instances.
- Schema overhead. Merge replication adds triggers, tracking columns, and MSmerge_* system tables to every replicated table.
- Snapshot required for new articles. Adding a new table to either publication requires regenerating the snapshot on that instance.
- Conflict detection is eventual. Conflicts are detected during the Merge Agent sync, not at write time. The sync interval determines how quickly conflicts surface.
- NTP synchronisation required. The DATETIME (Later Wins) resolver depends on accurate clocks across both Kubernetes clusters.
- Identifier table requires partitioning. Use node-aware prefixes (e.g., JRN-A-001 on Cluster A, JRN-B-001 on Cluster B) to avoid sequential counter collisions.
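The node-aware prefix scheme is simple to sketch. The series name JRN and the three-digit counter width are taken from the example prefixes above; the helper itself is hypothetical, not application code:

```python
def next_identifier(node: str, counter: int, series: str = "JRN") -> str:
    """Build a node-aware identifier such as JRN-A-001. Embedding the
    cluster letter keeps the sequential counters generated independently
    on Cluster A and Cluster B from ever colliding."""
    return f"{series}-{node}-{counter:03d}"

print(next_identifier("A", 1))   # → JRN-A-001
print(next_identifier("B", 42))  # → JRN-B-042
```

Without the node component, both clusters would eventually mint the same sequential value between merge runs and produce a conflict on every collision.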
Key Files¶
| File | Description |
|---|---|
| charts/rciis/values.yaml | Helm values defining the MSSQL StatefulSet (statefulsets.mssql) |
| charts/rciis/templates/backup-job.yaml | BACKUP TO URL job template |
| apps/rciis/secrets/{env}/mssql-admin.yaml | SOPS-encrypted sa password |