16 Apr 2006

SQL Server 2005 Merge Replication Subscriber unexpectedly deleting during data Synchronisation.

Within my current project I have been using SQL Server 2005 merge replication.  While I have been testing some issues have been raised by our testers and I would like to write about another one of these in particular.

Data, which was entered at the subscriber, was being removed while synchronizing with the publisher.  This can happen when the data entered isn’t part of the partition for a given subscriber.  However, on this occasion we expected the data to remain in place.  After some investigation by me and my colleague, James Rowland-Jones, we found the issue.

We noted that the affected tables were organised in parent child relationships at a table / schema level but they were joined child parent in the replication topology.  They were joined via a join filter with a partition filter on the child record.  So Customers 1 -> N Orders but partitioned On Orders with a join filter back to Customers.  The topology was created in this way to help partition the data sent to the individual subscribers i.e. a subscriber would only see customers that it had created orders for.

To troubleshoot the problem we used SQL Server profiler to track all the SQL statements used in a synchronization session.  We found that the issue was to due to the synchronization process and how it processed the records sent from the subscriber. 

When the subscriber has data entered it doesn’t evaluate or set its current partition mapping. The mappings are worked out dynamically during the synchronization process.  Only the data in MsMerge_Contents is sent to the publisher for evaluation.  Whilst working this out the process does not take into account the fact that a child record at a data level was a parent with the partitioning key in the topology.  During the replication process the articles were processed in foreign key relationship order so the parent data was being added first. A check was then being made to see what partition this data belonged to.  As the child data had not yet been evaluated the partitioning data wasn’t available so the topology sent a delete back to the subscriber.

The way to resolve this was to enforce the processing order manually. We had to make a change to the database and replication.  The change to the database was to set NOT FOR REPLICATION on all foreign keys.  We also changed the topology to enforce processing order within replication. In TSQL you’d do this by setting the @processing_order parameter in sp_addmergearticle. Unfortunately in RMO this could not be set via a property or method.  I had to create a T-SQL statement to call sp_changemergearticle to update the processing order after the articles were created.

However, there is a gotcha to watch out for with this approach as well.  As soon as one starts setting the processing order manually you need to set it for all objects.  This is because any object without an explicit number will be evaluated before those that have. So the rule is: when enforcing a processing order you will need to enforce it for all objects: Tables, Views, Procedures and Functions. 

There are two basic approaches to setting up the processing order.  The first is to give all objects a unique number.  The second is to give all object types a unique number.  We went for the latter as we thought it’d be simpler to manage overall. It would also allow us to set big bands or ranges to allow for future changes.  However, we had to also make sure we processed the replication joins in a set order.  Therefore within the tables we gave another higher number in the range to each level of join.   Tables at the top of the replication tree were 10000 the tables they had join filters to were 20000 and so on. When we hit dependencies within objects we simply gave them another range.  We hit this with functions that referenced other functions.

Once we’d set the processing order correctly and set NOT FOR REPLICATION on all the foreign keys everything worked and we were able to return the issue as fixed back to the testers -  which was nice. I should also point out that we did try just changing the NOT FOR REPLICATION option only but this didn’t work for us. We had to make both changes for this to work.