30 Sep 2012

Missing the option of creating table dependent objects within separate files

I have been developing database for over 12 years for use with applications and business intelligence projects. One of the challenges I used to face was how to put the SQL Server database under source control.  Over the years I came across several methods: Direct Database development with nightly backup,  Script(s) place under source control, SSMS Projects. 
Microsoft help to solved the problem when they release Visual Studio Data Database Professional (DataDude), when Visual Studio 2005 was around,  which add a project structure, development methods, build and deployment methods.   Recently Microsoft has release a new replacement database development tool: SQL Server Data Tools, which has some improvements and some missing features. 
One of the missing features is the folder structure choice that which I talk about in another post :  http://zogamorph.blogspot.co.uk/2012/06/how-to-create-old-database-project.html.  Another change is the dropping of the development method of one object to each file, which was started to be relaxed within Visual Studio 2008 release. The option is still there, however it's not easy to apply it as self imposed rule when using the new table designer.   
Here are a few reason while I think having one object per-file is a good idea:
  • Forces all the objects to be named instead of have system created names.  This would help finding build problems within the database project.  Having all the objects name is a good coding practices.  Also make comparing database easier with 3rd party products.
  • Reduce the risk of code being lost as there are less chances of  check-in conflicts.  This  can also help to improve productivity by not having to waste time resolving them.
  • Can help relate / audit deployment changes with source control check-ins.  As if the file has been changed then the object would have been changed.  Meaning you can compare the file changes list within source control to deployment script to make sure that all generated changes within the deployment script are valid.
  • Also can help projects which chooses data model to be the is the version of truth for the database.  As tables can be under data model tool control and database developers can added operational indexes without them being lost each time the data model is synchronise to database project. 
  • Reduce the number for developer to do deployments.  When doing a get latest and if source control only brings down is indexes files then a developer might not have to waste time in deploying the database.
  • It the same style method that some code developers applied to their coding standards.  .Net developers, with the help of Jet Brains Reshaper, put all there objects in single files for some the same reason as I mention above.
I have raise a Microsoft Connect ticket to get a option add to the table designer to allow linked objects to be added as separate file to it easier when sticking to the rule one object per-file.  If you agree with me and would like to see the feature added please vote for it with your feedback against the following ticket: https://connect.microsoft.com/SQLServer/feedback/details/748552/create-table-dependent-objects-within-separate-files



11 Jun 2012

Cannot update a database that has been registered as a data-tier application

Within SQL Server 2008 R2 a new feature was introduced, which was the data-tier application database. This feature allow developers / database administrators to package a database schema so that it could be deployed to any other database server and versioned. More about this feature can be found here: http://msdn.microsoft.com/en-us/library/ee240739(v=sql.105).aspx or http://dacguy.wordpress.com/

This feature has been included, as I have just found out, to previous versions: SQL Server 2005 via Services Pack 4 and 2008 via Services Pack 2. However SQL Server Management Studio, for SQL Server 2008, hasn't been updated with the tools for the new feature, you have to use the SQL Server 2008 R2 version. The reason for including the support to the previous version might have something to do with the release of the new SQL Server Data Tools (SSDT), the replacement to visual studio database professional, which uses DAC Packs and the framework to deploy a database.

I have been using the SSDT to deploy a new SQL Server 2008 (SP3) database for my current project. When I was trying to deploy an update to my database I was getting the following error: Cannot update a database that has been registered as a data-tier application without also updating its registration.

I was surprised to be getting this error, even know I did once registered my database as a data-tier application, but since then I had re-created the database a few times without registering as data-tier application. I was even getting the error when I manually drop and re-create the database. Dropping the database within SQL Sever 2008 doesn’t delete the data-tier application registration. So if another database is created with the same name the registration is then maintain when using the DAC framework.

To get around this problem you need to remove the data-tier application registration to this with SQL Server 2008 management studio you will have to run the following SQL statement: DELETE FROM [msdb].dbo.[sysdac_instances_internal] WHERE [instance_name] = '<Databasename,varchar,>' .

5 Jun 2012

How to create the old database project folders for SQL Server Data Tools.

I have been working with the new SQL server data tools database project. This is going to be the successor to database professional tool, also known as data dude. There have been several differences to how this product works.

One of them is the new tool project folder structure. There is now no folder structure within the project which means any file is automatically created into the root folder of the project. This is unlike database professionals, which had 2 comprehensive folder structures, the options were: by object type folder tree or schema related folder tree. My preference was always the object type folder structure, were script files which contain similar objects were all stored in the same folder e.g.: tables under a table folder, primary keys under the keys folder, views under a view folder etc.

If you missed the object tree folder structure, like I have, there is a method to create this structure very quickly within your new data base project by using the following steps:

  1. Create a new SQL server data tools project
  2. Once the project has been created, edit the project file.
    • Right mouse clicking on the project
    • Select Unloading Project option
    • Right mouse clicking on the project (which will be grey out)
    • Select the Edit Project option
  3. Then copy the contents of the file:DBProFolderList.xml into your new SQL server within the project node. I would recommend inserting before the closing tag to project.
  4. Reload the project.

However, once you have folder structure in place the tool will not honour it. You will have to spend time ensuring that object script files are created or moved into the correct location.  Also be aware that the schema compare tool, when reverse engineering into the project, uses the schema related tree folder structure to place newly created objects from the database.

28 Feb 2012

SQL Server maintenance via PowerShell

I was asked to help create a maintenance plan for a SQL server. I chose to implement the plan using PowerShell for the following reasons: 
  • PowerShell is a procedural scripting language and the maintenance plan would have a few procedural steps and iterations through objects.
  • PowerShell was also built on top of the .Net framework. This means I could make use of the SQL Shared Management Object (SMO) library within my script.
  • SQL Server also has a SQL PowerShell provider, which is pre-loaded when you run SQLPS. Support also has been extended to SQL Server Agent as there is a native PowerShell, which uses SQLPS, job subsystem.
  • I have been using PowerShell for a while and found it really useful; it allows you to solve problems, much more easily, than using the standard SQL tools.
A copy of my script, and related scripts, are available from my SkyDrive using the following link: SkyDrive

The maintenance script, SQLServerMaintanceTask.PS1, does the following, depending of the day of execution:
  • Removes the old backup files from the previous week.
  • Backup the database by doing either a full backup or differential.
  • Checks the fragmentation level of indexes and perform if required either the rebuild or reorganisation.
  • Update any statistics which hadn't been updated.
While writing the maintenance script I also created a few reusable functions which are as follows:
  • Defrag-SQLServer-Indexes: iterates through a database index objects performs the required rebuild or reorganised. (Defrag-SQLServer-Indexes.ps1)
  • Update-SQLServer-DBStatistics: iterates through database statistics objects and performs an update is they are a day-old. (Update-SQLServer-DBStatistics.ps1)
  • BackUp-Database / BackUp-DatabaseLog: performs a database backup / performs a database log backup. (BackUp-Database.ps1 / BackUp-DatabaseLog.ps1)
  • Get-ManagedPassword / Set-ManagedPassword: functions used to create and store a password for the backup file. (ManagedPassword.ps1)
I needed to use a password to secure the backup file. Also the password needed to be stored in an encrypted manner and accessible to the PowerShell script. Some friends, from Endjin, pointed me to an article which helped me to create the functions to securely store a password for use within PowerShell: thepowershellguy.com

While working on previous projects I found I was always doing a common task / asking a common question, which was to get the file stats of a database. This task was easy to accomplish if there is only one database, but annoying if there were more than one. While working on the PowerShell SQL server maintenance script I also came up with this little function to help get the stats of all the databases on a server: Get-DBFileStats. (Get-DBFileStats.ps1)

There are some benefits using a PowerShell maintenance script:
  • One it is easier to have a centralised maintenance server. As the scripts can connect to remote servers, currently only using Windows authentication, without the need to have a SQL Server Agent multi-server administration configured.
  • Using a PowerShell script you're not tied into using SQL Server agent. The automation of the script can be done by using Windows Scheduler or some other scheduling component.
  • PowerShell can interact with interact with the file system and SQL server within the same script.
To be to execute the script the following components would need to be installed and pre-loaded into power shell:
  • Microsoft® Windows PowerShell Extensions for SQL Server® 2008 R2
  • Microsoft® SQL Server® 2008 R2 Shared Management Objects.
Which is available to download from the following link: Microsoft® SQL Server® 2008 R2 Feature Pack

Also by using power shell the outputs can be stored easier to file in a number of formats: HTML,CVS or XML

17 Oct 2011

A Business Intelligence project ALM's

Table of Contents

1   Introduction.
2   Automation process components
  2.1 The Build.
  2.2 The Deployment script.
  2.3 Automated deployment and execution.
3   Visual studio tips.
  3.1 Reducing unresolve reference errors.
  3.2 Composite Projects.
4   Resources.

1 Introduction

As I have mentioned in previous posts I have been working on a data warehouse project. One of my main roles in this project was to work on the build, deployment and development processes. Within this post, I would like to show how I, with help from my colleagues, was able to automate a build, deployment and execute the data warehouse code.

The reason for developing this automation was to try and find problems before we released to user acceptance testing (UAT). Once this automation started running it did capture errors. One error in particular, which would only appeared in release, was missing permissions. As the developers and the testing team all have different elevated rights, so that they can do their jobs, masked the problem ever existed.

2 Automation process components

A child with building blocks

2.1 The Build

To develop our data warehouse solution the project used the Microsoft development tools: Team Foundation Server (TFS) 2010 and Visual Studio (VS) 2010 / 2008. TFS 2010 was used as a source control repository and automated build platform (Team build). VS 2010 was used to develop our database code and VS 2008 - Business Intelligence Studio (BIDS) for creating the integration packages and analysis services cubes.

As we were using the above tools I was able to create, using my company's IP, a build that would drop deployable artefacts. This was then wrapped into a TFS build which would be triggered by a number of check-ins. This gave us our continuous build for the project which was helpful to find problems like: incompatible code between team members; incomplete check-ins missing files from the change set.

However, if a build was successful it didn't mean that the build was deployable. An example of this is an unresolved reference warning which shouldn’t have been a warning but an error. This happened to us, one of our builds which we tried, had several unresolved reference warnings, we tried to deploy but failed because a view referenced to a column within a table was invalid.

A parachutist

2.2 The Deployment script

We used a PowerShell script to manage the process of deploying, in the correct order, our databases, SSIS packages and analysis cubes. The script, which my colleagues had created, used a configuration file to list the artefacts for deploying with how and where. Also the script interaction used a text menu interface, which was created from the following location: http://mspowershell.blogspot.com/2009/02/cli-menu-in-powershell.html

This was helpful in reducing the work that we had to do for writing a release process. As we only had to write was: select menu option blah, select menu option blah etc. The script help reduce deployment errors from incorrect commands typed in or artefacts deployed in the wrong order.

A running nan

2.3 Automated deployment and execution

To have the automated build, deployment and execution process I reuse the build and deployment components. The build was refactored and reused within a scheduled trigger for out of office hours. The deployment script was extended to include support being called by automation tool. The automation tool was called TFSDeployer. TFSDeployer is a service which can be configured to listen for certain build events and then run a powershell script when captured. For more details about the application, please use the following link: http://tfsdeployer.codeplex.com/.

The script, which TFSDeployer would execute, when the out of office build completed, was configured to do the following steps:

  • Get the TFS build details.
  • Copy the contents of the build drop location to a deployment area.
  • Deploy the data warehouse solution using the deployment script
  • If the deployment failed then update the TFS build to failed and the status to broken
  • If the deployment was successful then execute a SQL Server job; which was configured to run the ETL.
  • If the SQL Server job reported failure then the TFS build was set as failed and the status to broken
  • If the SQL job reported success then the TFS build was set as successful and the status set for ready for testing

This then gave us the ability to test our builds and as mentioned before allowed us to find many problems before we went into UAT. This also gave us a platform to perform and automate regression testing.

3 Visual studio tips

3.1 Reducing unresolve reference warnings

Here are some tips on how we managed to reduce some of our unresolved reference warnings.

  • Converting some of derived tables to use common table expression with column definition syntax: WITH <CTEName>(Col1, Col2, Col3) AS
  • Adding a server and database variables for references which were going to go to be referred to via a link server. Within the code also wrapped the reference variables within square brackets.
  • Include objects which were created by code for example: select into tables and indexes. With indexes we also took another approach, where possible, which was to disable and enable them.

3.2 Composite Projects

During one of our early development cycles we came across a circular reference problem. This caused a problem at deployment time. I had to deploy the two databases about three times each to get a complete deployment.

The method I used to reduce a circular reference to happen was to use composite projects. The way I chose to split databases into composite projects was to put the storage objects like tables into one project. Then created another project for the code objects like views and referenced the storage project. The idea for this was based on some information I read from: http://vsdatabaseguide.codeplex.com/

4 Resources

Integrating a rules engine with integration services

Introduction

As I have mentioned in several previous posts, I have been working on a data warehouse project. One of my main roles within this project was to work with our clients JRules / Java consultants to help integrate IBM JRules, a rules engine, with Microsoft SQL Server 2008 R2: Integration Services (SSIS).
The goal of the integration was to have the rules engine as part of the ETL process to help evaluate business rules. The reasons to use a rules engine within the ETL process were as follows:
  • To allow business people to write the business rules for the data warehouse in a language they know: English
  • To give the business people the responsibility for maintaining the rules to ensure that the data warehouse is up to date.
  • To be able to update the rules without having to deploy the entire ETL solution.
IBM JRules was used within the project because it was already in use within their organisation. The challenge was how to cross the divide between Java and Microsoft .Net.

Options to consider for integrating with the rules engine

Below is a list of options which I and some of my other colleagues thought about on how to integrate JRules and SSIS
  • How the rules engine will access the data.
    • Receiving the data: Should the data be supplied with the request to execute the rule engine.
    • Reading the data: Is the rules engine responsible for reading the data when requested to run.
  • Which processing method
    • Batch: Process a batch of data within the rules engine.
    • Row by Row: Make separate requests to the rules engine row by row.
  • The communication / execution method
    • Using Web Services method, Restful Services method.
    • Directly coding against the rules engine API libraries. For our integration we would have to have used some sort of interoperability classes like: JNBridge, JIntegra.
  • How to apply the results of the rule engine. Does the rule engine apply the results directly or are they returned to the ETL for processing.
  • Error handling
    • Rules engine errors: how to record the errors from the rule engine.
    • Integration errors: how to handle errors caused by a failure of the integration. 

Our approach

Our approach to the integration was to use a batch process model with a SOAP Web method for triggering the rule processing. The rule processing would be responsible for reading, processing and recording the results of the rules execution. The SSIS packages would prepare the batch of data; trigger the rule processing and then process the results. Below is a diagram to illustrate the integration approach:

SSISIntergationJRules

Web service method
The web service method, which was written by our clients JRules/Java consultants, was defined with the following parameters: the batch identifier; the name of the collection of rules for evaluation.
This method would then start running the rules processing procedures which are as follows:
  • start reading the data
  • execute the rules over each complete source row once fetched
  • store the results back to the database
After the completion of the rule processing the web service method would give a response indicating the success of the rules execution. The web service method was optimise to use multiple threads and to stream read the data. Also the way that the rules are implemented within JRules also helps to optimise the rule processing.

Data Integration
The way that we passed data between SSIS and JRules was to use a couple of tables, which use a name value pair structure. One table, Inbound, was used to store the prepared data for rule processing. The other table, Outbound, to store the results from rule processing which the SSIS would process.
The reasons to use a generic structure are as follows:
  • If the rules parameters changed then only the ETL process would need to be updated
  • The rule integration was easier to code
  • To have the integration only use two tables to help debugging
SSIS Integration
SSIS packages would be responsible for creating the batch data. Afterwards trigger the rules engine processing method and wait for the response. Once the response was received the results would be applied or/and log any errors.
The method I use to call the Web method was to write a .Net control flow component. The main reason I chose to write a component was to have easier code reuse. Also maintenance was easier as I only had to update one component and all packages would pick up the new component.
The approach I took with my custom component was to separate the SSIS integration and the actual call to the web service. Creating a code library that handled the interaction with the web service. This was then wrapped with another code library that handled the SSIS integration.
Taking this approach also allowed me to write a console application to test the interaction with the rules engine using the same calling code as the packages would use. This console application allowed us to debug any integration problems that we came across.

Integration results

Currently we are able to process a batch of 100,000 source rows within a five minute period. This is adequate for a normal day runs where we would only have to process changes for a day. However when there is a requirement to re-run the rules against entire warehouse data the process can take a little while. This is because a number of batches are required to complete the rules processing. There is an improvement currently being investigated, by another team member, to reduce the amount of data sent to the rule engine for processing. The method which is currently being looked at is as follows: Find the unique rows to send to the rules engine for processing and then apply the results back to the required rows.

21 Sep 2011

Comparing Master Data Services instances

As I have mentioned in previous posts, I've been working on a data warehouse project. Within the project we decided to use SQL Server 2008 R2 Master Data Services (MDS) to store all the warehouse specific reference data. Here are some reasons why MDS was used:
  • Allowed us to manage the reference data.
  • Reference data could be loaded into the warehouse like a source system
  • Would allow the client’s data governance team to easily update the reference data and keep the warehouse up-to-date
For active development of entities and data loads we use a sandpit instance. When the ETL was ready to use the new entities or loaded data, a cut of the sandpit instance would be promoted to the development environment. We came across a problem when we needed to identify some changes which were accidentally made on the development instance.

I came up with a method which helped to identify the changes fairly easily, below is the method I used. I will say that it’s not a perfect solution and might not work for everyone or continue to work when MDS updates have been applied.

Run the following SQL script on both instance of the MDS database servers:

   1:  CREATE DATABASE MDSCompare
   2:   
   3:  DECLARE @vColList AS VARCHAR(MAX)
   4:  DECLARE @vViewName AS SYSNAME
   5:  DECLARE @vSQL AS VARCHAR(MAX)
   6:  DECLARE @vEntityID AS INT
   7:  DECLARE @vModelID AS INT = (SELECT id FROM mdm.tblModel where Name = '<ModelName,Char,Master Data>')
   8:   
   9:  DECLARE EntityID_Cursor CURSOR LOCAL FORWARD_ONLY FAST_FORWARD READ_ONLY
  10:  FOR SELECT ID FROM mdm.tblEntity e WHERE model_id= @vModelID order by id
  11:   
  12:  OPEN EntityID_Cursor
  13:   
  14:  FETCH NEXT FROM EntityID_Cursor
  15:  INTO @vEntityID
  16:   
  17:  WHILE @@FETCH_STATUS = 0
  18:  BEGIN
  19:   
  20:      SELECT @vViewName = REPLACE(e.Name,' ','')
  21:           , @vColList  = COALESCE(@vColList + ', [' + a.name + ISNULL('_' + la.name,'') +']', '[' + a.name + ISNULL('_' + la.name,'') +']')
  22:      FROM mdm.tblAttribute a
  23:          INNER JOIN mdm.tblEntity e
  24:              ON e.id = a.entity_id
  25:          LEFT OUTER JOIN mdm.tblAttribute la
  26:              ON a.domainEntity_Id = la.entity_id
  27:              AND la.attributeType_id = 1
  28:              AND la.IsSystem = 1
  29:      WHERE a.entity_id = @vEntityID   
  30:      AND a.attributeType_id <> 3
  31:   
  32:      SET @vSQL = 'SELECT ' + @vColList + ' INTO MDSCompare.dbo.' + @vViewName + ' FROM mdm.' + @vViewName
  33:   
  34:      EXEC (@vSQL)
  35:     
  36:      FETCH NEXT FROM EntityID_Cursor
  37:      INTO @vEntityID
  38:   
  39:      SELECT @vColList = null
  40:          , @vViewName = null
  41:          , @vSQL = null
  42:         
  43:  END
  44:     
  45:  CLOSE EntityID_Cursor
  46:  DEALLOCATE EntityID_Cursor

Then to test that all the entity which are required have been created use the following script:

   1:  DECLARE @vModelID AS INT = (SELECT id FROM mdm.tblModel where Name = '<ModelName,Char,Master Data>')
   2:   
   3:  ;WITH MDSEntity(Entityname)
   4:  AS
   5:  (
   6:      select REPLACE(e.Name,' ','')  Entityname
   7:      from mdm.tblEntity e WHERE model_id= @vModelID
   8:  )
   9:  , CompareTables (TableName)
  10:  AS
  11:  (
  12:      SELECT Table_Name
  13:      FROM MDSCompare.INFORMATION_SCHEMA.TABLES t
  14:      where t.table_schema = 'dbo'
  15:  )
  16:  SELECT *
  17:  FROM MDSEntity e
  18:  LEFT OUTER JOIN CompareTables c
  19:      ON e.Entityname = c.TableName
  20:  WHERE c.TableName IS NULL

Then use the visual studio 2010 schema compare tool against the two instances of the MDSCompare databases to highlight any structural changes which have been made to the entities.

To find data changes use the following script on the MDSCompare databases to create primary keys:

   1:  USE MDSCompare
   2:   
   3:  SELECT 'ALTER TABLE [' + table_schema + '].[' + table_name + '] WITH NOCHECK ADD CONSTRAINT PK_' + table_name+'_Code PRIMARY KEY CLUSTERED (Code)'
   4:  FROM INFORMATION_SCHEMA.COLUMNS
   5:  WHERE column_name = 'code'

Then use the visual studio 2010 data compare tool to highlight any data differences.

While using the above methods to do the comparisons I found some differences which I had to ignore. The reason we had to ignore them was because of the following scenario:

While loading data through the batch staging process we had to set the default name attribute of the entities to be an empty string. This is because it would not allow nulls and we did not want to use this attribute. However the development instance has the default name attribute set with nulls. I believe this was because the MDS deployment tool had converted the empty element of the deployment package, which was created because of the empty string, to null while uploading.

19 Sep 2011

Choosing the right CDC tool for the job

Introduction

I have, as mentioned in a previous blog, been working on a data warehouse project using CDC for extracting the source system data. I would like to share some of the experiences and evaluation criteria used for selecting a CDC tool for our project.
The reason for using a specialist tool was as follows:
  • Couldn't move all the source databases to SQL Server 2008
  • Needed the capture change data to be sent to another server instance
  • A consistent management experience for CDC between all our source systems. The two database vendors the tool needed to support was SQL Server and IBM AS400

Timescales

First we underestimated how long it would take to select a CDC tool. We plan for 1 month, for both evaluation criteria and running the tests, and it took us about 2/3 months. The time was taken in evaluating the tools against the criteria and gaining access to database systems from other projects.
However our development of the ETL solution was able to continue, while the CDC tool hadn’t been selected, as we used generic columns for the CDC metadata. Our requirements allowed this approach because only a selected set of metadata was needed from the CDC tool which most tools offered. The method to ensure that the changes were picked up in the right order was to use a row ID, which was a big integer identity seed, within the CDC destination table. This was able to work as most CDC tool records the changes in the same order as they are made.

Evaluation criteria

Here are some categories of criteria, which our client used, to select which CDC tool to evaluate, these are common sense categories:
  • Impact: How CDC would impact both source and target systems servers and databases. The Server impact was measured in how much resources it would take to use on the server: memory, CPU, etc.
The database impact was whether it made any schema changes and if it had performance implications for the application. This was important to know as there was a third-party tool where the support contract would have been invalidated if any schema changes were made to their database. To tests for schema changes I use a simple procedure which was to use VSDBCMD tool to import the schema into a schema file before installing CDC. After the install import again to another file and use VS2010 schema compare tool to compare against the schema files.
  • Schema changes: Does the CDC tool cope with schema changes being made on the source system. Could the tool continue to work if the schema changes were made onto columns which weren't being captured or if any new columns were added.
  • Speed: How quickly were changes committed to the target server. The metrics for this was by the volume of changes and speed to commit them. Also how quickly the initial synchronisation took to complete.
  • Management: What was the management tool, how easy was the management tool to use, how quickly CDC was recoverable from errors or disasters.
While evaluating CDC tool we also found that we had to consider some other factors. One consideration was how it impacted the operational procedures of the source systems, for example backups, recovery and deployments.
Another factor was the number of databases and tables which we wanted to capture data from. We had 20 source databases, which had about 15 tables each. Depending on the CDC tool more time would have to be spent on the development and deploying the CDC solution.

To develop or to configure that is the question?  (sorry couldn't resist the Shakespeare pun)
That is the question that we found ourselves asking. As some tools required development of code to create a CDC solution. Also with the tools which required development the experience and environment had to be considered as part of the evaluation. Such as the level of privileges required on the development computer; is there any integration with source control or how to protect from loss of work; how easy is it to transfer development from one developer to another.
Another aspect of the CDC tools evaluation was how easy it was to deploy from our development environment to production. How easy was the CDC solution to deploy and was there any automation through scripting or exporting.

Observations

While evaluating we found the same tool gave a different level of change data capture experience depending on the database vendor it was configured against. For example: when configured against the AS400 the tool was able to give a full row of data. But when configured against and SQL server it was only able to give the columns which had changed. The reason for this was how the tool had implemented CDC for the SQL server. There was no requirement for the replication components for SQL server to be installed, without this component the SQL server log file only records the columns which had changed. Hence why the tool was only able to give only the change columns data and not the full row of data.
We also found that the same CDC tool behaved differently when running against different processing editions (x86 and x64). The difference was with settings and memory requirements.

Useful links

Source system data to warehouse via CDC.

Introduction

I have been working on a data warehouse project with a difference. The difference is that the ETL is not doing the classic extract of source system data, instead the source system is going to send its data to the data warehouse by using change data capture (CDC). The decision for using CDC was as follows:
  • To have the ETL only process the data that has changed within the source system between each run of the ETL. This would help the ETL perform as it would only have to process the change data and not work out what the changes were first.
  • Not to have the ETL processes impact the central source system database. The CDC would be responsible for delivering the changes of the source system to another database server. Then the ETL could be run at any time within the day and wouldn't be responsible for blocking the transactional source system. As there was plans to have the ETL run at the end of the day for two time zones e.g. US and UK.
I would like to share some of the experiences, decisions and challenges that we face while trying to build a data warehouse using CDC. The first decision, which I will put into another post, was which CDC software to use. As a consequence of using CDC we faced a challenge on how to extract data from source systems where CDC couldn't be applied, One such system was Master Data Services (MDS) this was due to how the data is stored within its database. As this would mean the transforms for this source system would have to use a different approach to where CDC was used. What we decided to do was mocked CDC through use of SSIS and then stored the delta changes that we found through our dataflow process. The reason for choosing this approach was so that all the transforms had a consistent approach.

Development

Early into development we discovered that we had to define what was meant by processing intraday changes. Did this mean every row which was captured by CDC within that Day? In other words, that every change which was captured should create a history change within the data warehouse leaving the last change as current. Or did this mean capture events changes within the day? For example scanned the changes for that day and find the flags for the records that are required. For us this meant capture event changes. Importance of understanding the meaning of intraday changes had an impact on how we approached coding the transform procedures.

Another challenge that we faced was how to deal with transforms which require data from two tables. The reason why this was a challenge is because of the following scenario: An entity within the warehouse has a business rule which requires data from 2 tables from the source system before it can be evaluated. However within the source system data has been changed and captured for one table only. For example: business rule order status requires data from the OrderHeader table and the OrderLine table.  As illustrated below.

BusinessRules

The source system updates the OrderLine table and a CDC captures the changes and sends them to the ETL. Then the ETL process runs with only the OrderLine changes tries to evaluate the order status business rule but how to evaluate this when OrderHeader data has not been supplied? As illustrated below.

MissingBusinessRules

There are many ways this can be resolved, here are some examples:
  • Change the application process to create updates to the required linked tables. This might not solve the problem as the CDC software may not have committed all of the changes before the ETL process starts running, unless the CDC is part of the ETL.  Also this may cause performance problems for the application team.
  • To design the data warehouse scheme and business rules so that the transforms don't require joins between tables to create a data warehouse. This I will highlight may not be practical as it could mean it basically becomes a copy of source systems schemas which would give you no value for reporting and analysing.
  • To get transform to read the missing data from the data warehouse if CDC hasn’t passed on the data. This was a good option to consider however there are few considerations to be aware of:
    • Could have performance issues with the transform for some of the following reason: Using left outer joins on warehouse tables and depending on the warehouse schema could require a large number of tables; have use of function calls to IsNull or Coalesce; The size of the tables data warehouse could make query tuning very difficult.
    • Adding complexity to untranslate business rules or transforms if source system data isn’t being natively stored
    • Harder to read and maintain as the transform isn’t just transforming source data
    • The data warehouse could become even wide or larger if the source systems native data is stored within the warehouse to help the transform.  This will have an impact for querying and processing performance; also add overhead cost for storage.
We solve the problem by extending one of our client's requirements. The requirement was: keep a copy of the source system data changes for each day for a further 30 days. To meet this we configured the CDC to send all the changes to an archive database. Then have the ETL process to extract the day changes from the archive database. Rather than have CDC drop the changes to a landing database and then have the ETL copy the data again to the archive database as well as transform the data. The extension of the requirement was to keep all changes of the latest version of each primary key within the archive database. This then allowed us to write a post-extract process to extract any missing data required for any of the transforms where data wasn't present. This then allowed us to write the transforms in a manner that always assumed that all data was in the same location.

Testing

We found that we had to change the way that we prepared to test our ETL process by using datasets. Normally we would create some datasets which would mirror the source system in a particular state for testing certain scenarios for transforms and business rules. However this time we only needed to create change records for testing but the problem was how to ensure that the change records were fed in a controlled manner to ensure consistent mirrored states of source system and valid transitions were maintained. We resolved this problem by investing time to create a test harness solution. Which mimic our CDC tool of choice by inserting the test data into the archive database and then running the ETL process. One thing I would recommend is that you trial your CDC tool over your source systems to see what data changes will be actually captured. This will help to create valid test data for testing transforms and business rules within your ETL process

Related posts.

9 Apr 2010

Configuring Excel Services & PowerPivot on multi server Topology

I have been working with a couple of colleagues, James Dawson and Russell Seymour , on installing and configuring PowerPivot within a SharePoint 2010 beta 2 farm.

We used the following instructions to install PowerPivot on one of the application tier server: http://msdn.microsoft.com/en-us/library/ee210616(SQL.105).aspx.

After following the instructions we were able to verify the installation as we had a SharePoint site with PowerPivot gallery.  We managed to uploaded and open a couple of Excel workbooks, with PowerPivot embedded, within the web browser.  Also we were able to connect to the PowerPivot service and could see the PowerPivot data files loaded. When I tested the slicers I got greeted with the following error:

The data connection uses Windows Authentication and Excel Services is unable to delegate user credentials. The following connections failed to refresh: Sandbox

We did some research and found the following blog post: http://powerpivotgeek.com/2009/12/11/excel-services-delegation/.  I implemented the changes it recommended to our Excel workbooks and tried again this time getting the following error:

The data connection uses None as the external data authentication method and Unattended Services has not been configured on Excel Services. The following connections failed to refresh: SandboxHowever at the time of this error we had already configured the Excel Services with the Secure Store Service application identity that we had created. 

The problem proved to be with the configuration of all the service applications. At the start all the services applications had their own application pools with there own domain users. 

The solution in the end was to run all the services applications under the following app pool configuration: SharePoint Web Services System.