DB Schema Testing


== Overview: Database Schema Version Control ==

It seems that version control for database schema changes has traditionally received little developer attention. Almost everyone just goes ahead and changes the database schema in place, trying to keep it backwards-compatible.

That works if the change is small or non-intrusive (as it is in most cases). For more complex changes you can write an ad-hoc script that makes the necessary alterations and data conversions. In the most serious cases you stop the service, back up the database (everyone skips this step once in a lifetime, and only once), upgrade the code, run the conversion procedure, restart the service, and hope it all went well.

This path of least resistance works reasonably well and rarely leads to disasters. However, it is still error-prone, and the trend is to make schema upgrades less stressful and less error-prone, for both developers and production admins.

More integrated development environments seem to get built-in database schema versioning naturally. Less integrated ones get a consistent migration strategy only at the individual product level, and the tools are usually ad hoc.

For example, Ruby on Rails has a very refreshing idea of [http://api.rubyonrails.com/classes/ActiveRecord/Migration.html Active Migrations]. They work seamlessly, because:

* Rails is tightly integrated with the underlying database;
* migrations are not tied to SQL: to create a table or add a new field to it, you use Ruby itself; this helps with thinking about data changes in the right way;
* keeping a Rails application under Subversion is the encouraged practice;
* the rake utility knows about migrations, and helps in creating and deploying them;
* the [http://manuals.rubyonrails.com/read/book/17 capistrano] utility also knows about migrations, and helps in pushing them to remote production/testing servers;

Perl is, of course, less integrated, and its primary offering, [http://search.cpan.org/~daniel/DBIx-Migration-0.05/ DBIx::Migration], could use more attention. Being tied to SQL is its main drawback.

I believe that mimicking Active Migrations functionality should become a standard for every major open-source language, such as Perl, Python, or PHP. That's just like how everyone currently expects something similar to CPAN from every new language that tries to attract serious attention.

Meanwhile, Microsoft is taking advantage of its integrated development environment and making it easier to keep database schemas under common version control.

[http://codebetter.com/blogs/jeff.lynch/default.aspx Jeff Lynch] writes in [http://codebetter.com/blogs/jeff.lynch/archive/2006/10/17/Visual-Studio-Team-Edition-for-Database-Professionals.aspx Microsoft Visual Studio Team Edition for Database Professionals]:

…with Visual Studio Team Edition for Database Professionals, database development now becomes a fully supported part of your application’s development lifecycle. Now all database development is done “off line” and in a “sandbox” environment (this should make your dba stand up and sing!). All user defined functions and stored procedures can be fully unit tested using representative test data automatically generated by the toolset. And best of all, this new Visual Studio sku fully integrates into Team Foundation Server so your database schema (SQL scripts) can be put under source control just like any other C#, BizTalk or Web Application!

[http://blogs.msdn.com/sachinre/default.aspx Sachin Rekhi] writes in [http://blogs.msdn.com/sachinre/archive/2006/08/30/732298.aspx Extensibility in Team Edition for Database Professionals]:

'''Test Conditions.''' The database unit testing feature allows you verify tests using either SQL assertions or easily configurable UI client-side test conditions. We ship a set of test conditions in the box, including row count, scalar value, empty resultset, etc. But these test conditions are completely extensible so you can imagine creating your own to do more powerful test verification.

'''Check-in Policies.''' Team System also allows you to create custom check-in policies that require certain actions to be performed prior to check-in. For example, a testing policy that ships with TFS enforces that a specific set of tests is run prior to checking in your code. You can implement other such db specific policies if you desired.

Several third-party tools are available to take control of your database schema for MS SQL Server, e.g.:

* Skilled Software: [http://www.skilledsoftware.com/sqlsourcecontrol.htm SQL Source Control 2003];
* Best SoftTool Inc.: [http://www.bestsofttool.com/SQLSourceSafe/SSS_Introduction.aspx SQLSourceSafe];

[http://www.otug.org/meeting/200209/Agility_and_the_Database_v7.pdf Pramod Sadalage and Peter Schuh (2002)] [http://www.springerlink.com/content/hfyrc9dke44u6dhv/] suggest that Agile DBAs maintain what they call a database change log and an update log, the minimum you require for simple stovepipe projects where a single application accesses your database. However, to support more complex environments where many applications access your database, you also require a data migration log. Let's explore how you use each log:

# Database change log. This log contains the data definition language (DDL) source code that implements all database schema changes in the order that they were applied throughout the course of a project. This includes structural changes such as adding, dropping, renaming, or modifying things such as tables, views, columns, and indices.
# Update log. This log contains the source code for future changes to the database schema that are to be run after the deprecation period for database changes. [http://www.agiledata.org/essays/databaseRefactoring.html The Process of Database Refactoring] argues that changing your database schema is inherently more difficult than changing application source code – other developers on your project team need time to update their own code and worse yet other applications may access your database and therefore need to be modified and deployed as well. Therefore you will find that you need to maintain both the original and changed portions of your schema, as well as any scaffolding code to keep your data in sync, for a period of time called the “deprecation period.”
# Data migration log. This log contains the data manipulation language (DML) to reformat or cleanse the source data throughout the course of your project. You may choose to implement these changes using data cleansing utilities, often the heart of extract-transform-load (ETL) tools, examples of which are listed in Table 1.

You may choose to implement each logical script as a collection of physical scripts, perhaps one for each development iteration or even one for each individual database refactoring, or you may choose to implement it as a single script that includes the ability to run only a portion of the changes. You need to be able to apply subsets of your changes in order to put your database schemas into known states. For example, you may find yourself in development iteration 10 and discover that you want to roll your schema back to the way it was at the beginning of iteration 8.
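The roll-back scenario above can be sketched in code. The following is a minimal Python illustration using the built-in sqlite3 module as a stand-in database; the iteration numbers, table names, and the pairing of each change with a hand-written undo script are all hypothetical:

```python
import sqlite3

# Hypothetical per-iteration scripts: each entry pairs a change (DDL) with the
# statement that undoes it, so the schema can be put into any iteration's state.
SCRIPTS = {
    8:  ("CREATE TABLE Orders   (OrderID   INTEGER PRIMARY KEY)", "DROP TABLE Orders"),
    9:  ("CREATE TABLE Invoices (InvoiceID INTEGER PRIMARY KEY)", "DROP TABLE Invoices"),
    10: ("CREATE TABLE Audit    (AuditID   INTEGER PRIMARY KEY)", "DROP TABLE Audit"),
}

def goto(conn, current, target):
    """Apply forward scripts or undo scripts until the schema is at `target`."""
    while current < target:
        current += 1
        conn.execute(SCRIPTS[current][0])   # roll forward one iteration
    while current > target:
        conn.execute(SCRIPTS[current][1])   # undo one iteration
        current -= 1
    return current

conn = sqlite3.connect(":memory:")
v = goto(conn, 7, 10)   # bring a fresh schema up through iteration 10
v = goto(conn, v, 8)    # ...then roll it back to the end of iteration 8
tables = {r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}
print(sorted(tables))   # -> ['Orders']
```

The essential design point is that every forward change carries its own undo, which is what makes arbitrary known states reachable.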

== Introduction ==

Have you seen this situation before?

* Your team is writing an enterprise application around a database
* Since everyone is building around the same database, the schema of the database is in flux
* Everyone has their own "local" copies of the database
* Every time someone changes the schema, all of these copies need the latest schema to work with the latest build of the code
* Every time you deploy to a staging or production database, the schema needs to work with the latest build of the code
* Factors such as schema dependencies, data changes, configuration changes and remote developers muddy the water

How do you currently address this problem of keeping the database versions in working order? Do you suspect it is taking more time than necessary? There are many ways to approach this problem, and the answer depends on the workflow in your environment. The following article describes a distilled, simple methodology you can use as a starting point.

* '''Since it can be implemented with ANSI SQL, it is database agnostic'''
* '''Since it depends on scripting, it requires negligible storage management, and it can fit in your current code version management program'''

== Background ==

Any way you look at it, managing development is a hard problem. As your development team and application grow in size and the requirements change, your overhead costs of management inevitably grow as well. There is a constant need for documentation, regression testing, and version management within this flux to maintain stability, all without losing "momentum".

As evolving applications change build versions, there are many "application hosts" that need to be synchronized. Often, this leads to a large number of unnecessary bugs, and an excessive percentage of time dedicated solely to synchronizing versions of the "latest build" among various test, staging, and production platforms.

This article is one of many designed to help you effectively manage your changes with minimal time involved. Since code version management programs have been around since before I was born, this article deals with database version management.

== Your options ==

CVS, SourceSafe, ClearCase, SourceGear, Subversion... The code versioning programs go on and on... What about databases? How do we seamlessly upgrade the schema/data of our production database with minimal or no downtime? How do multiple developers "get latest" on the database schema and data?

You should be aware that some great database "diff" programs already exist. [http://www.starinix.com/sqlcompare01.htm StarInix], [http://www.red-gate.com/products/SQL_Compare/index.htm RedGate], and [http://www.sqldelta.com/ SQL Delta] come readily to mind. Some of the higher-end ones come with a hefty price tag, because they generate synchronization scripts for you. ''Remember that time you trusted an all-purpose program to generate your HTML for you?'' ''Do you really want to trust an all-purpose program to automatically change your database schema for you?''

Don't get me wrong, these programs are a great starting point for viewing and resolving differences, but I highly recommend a more "hands-on" approach to a versioning management policy. Why? '''''When you write the scripts yourself, you not only have a higher degree of control, you have a higher degree of awareness of what is changing in your database, when it changed, and who initiated the change.'''''

== A Proposed Solution ==

As each database presents its own unique challenge, I don't offer you a one-size-fits-all downloadable application. I offer a time-tested and provably effective methodology that's flexible enough to adapt to the workflow of your development cycle, provided you're ready to get your hands a little dirty with SQL scripting!

=== What you will need ===
* ''An existing source control system''
** Anything, really. You can use the same repository as your code base, or simply a single text file on an accessible network share as a last resort.
* ''Database permissions to run powerful scripts''
** (Nominated developers will need something such as "db_owner" access on the databases where they need to make changes)
* ''A competent knowledge of SQL''
** You will need to know how to write scripts to update the schema and data of your specific brand of database. 90% of the time these are relatively simple scripts.

=== Known limitations to this methodology ===
* There are some data types that can't be manipulated through the use of text-based scripts. These include:
** Binary (image, bitmap, blob)
** Variable-length text (Text-16)
** This may vary depending on your Database platform. Can you employ replication?
* Data changes that rely on temporal data (''datetimes, automatically generated IDs produced by identity seeds'') can't be guaranteed consistent across multiple databases.
** If your code is relying on consistent autonumbers between database versions, make sure this isn't a conceptual design flaw!

== The Strategy ==
=== The Versioning Table (in essence) ===

For the sake of flexibility and verbosity, declare a table within the database (I'll call it DBVersion) to serve as a version indicator. Each row should store some form of version identifier (e.g., if you use FogBugz, this would be your [http://www.fogcreek.com/FogBugz/ CASEID]), a brief description, when the change happened, who initiated the change, and any extra information you may find convenient.

This table will take the guesswork out of determining the last update to the database. DBVersionID serves as its primary key.
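As a concrete sketch, here is one possible shape for the DBVersion table, expressed with Python's built-in sqlite3 module as a portable stand-in; the exact column names and types are assumptions to adapt to your own conventions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed column set; adapt names and types to your own conventions.
conn.execute("""
    CREATE TABLE DBVersion (
        DBVersionID   INTEGER PRIMARY KEY,  -- version identifier (e.g. a case ID)
        Description   TEXT NOT NULL,        -- brief summary of the change
        ExecutionDate TEXT NOT NULL,        -- when the change happened
        UserID        TEXT NOT NULL         -- who initiated the change
    )""")
conn.execute("INSERT INTO DBVersion VALUES "
             "(1, 'Initial schema', datetime('now'), 'alice')")
conn.commit()

# The highest DBVersionID is the last update applied to this database.
latest = conn.execute("SELECT MAX(DBVersionID) FROM DBVersion").fetchone()[0]
print(latest)   # -> 1
```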

=== The Script ===

In conjunction with the table, we need a script that performs actions driven by the values in the table. The following is the script algorithm:

 Start transaction [ALL]
     For each version change i = 1 to N:
         If not exists (select 1 from DBVersion where DBVersionID = i):
             Start transaction [i]
                 <do the update for version i here>
                 If failure: rollback and exit
                 Else: insert into DBVersion values (i, <description>, getDate(), User_ID())
             Commit transaction [i]
     Next i
 Commit transaction [ALL]

Now observe this closely. This script is designed to update only as much as it needs to, without performing the same update twice. If the database already has a record of a given version change, the script simply moves on. The script will run on databases at different versions with a consistent outcome, and leaves a clear audit trail of how and when the changes occurred. If, for some reason, the script fails, the nested transaction structure forces the database to roll back to its state before execution.
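For readers who prefer runnable code, here is a hedged translation of the algorithm into Python with the built-in sqlite3 module; the MIGRATIONS dictionary and its DDL are hypothetical. sqlite3 does not support the nested per-step transactions, so this sketch uses a single outer transaction that rolls everything back on failure:

```python
import sqlite3

# Hypothetical migration steps, keyed by version ID (1..N).
MIGRATIONS = {
    1: "CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT)",
    2: "ALTER TABLE Customers ADD COLUMN DateCreated TEXT",
}

def apply_migrations(conn):
    """Run each not-yet-applied step once, recording it in DBVersion."""
    conn.execute("""CREATE TABLE IF NOT EXISTS DBVersion (
        DBVersionID   INTEGER PRIMARY KEY,
        Description   TEXT NOT NULL,
        ExecutionDate TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)""")
    try:
        for version, ddl in sorted(MIGRATIONS.items()):
            already = conn.execute(
                "SELECT 1 FROM DBVersion WHERE DBVersionID = ?",
                (version,)).fetchone()
            if already:          # this step has run before -> skip it
                continue
            conn.execute(ddl)
            conn.execute(
                "INSERT INTO DBVersion (DBVersionID, Description) VALUES (?, ?)",
                (version, ddl[:40]))
        conn.commit()            # all outstanding steps succeeded
    except sqlite3.Error:
        conn.rollback()          # any failure restores the pre-run state
        raise

conn = sqlite3.connect(":memory:")
apply_migrations(conn)
apply_migrations(conn)   # second run is a no-op: both steps are recorded
print(conn.execute("SELECT COUNT(*) FROM DBVersion").fetchone()[0])   # -> 2
```

Running the function twice demonstrates the key property: steps already recorded in DBVersion are never performed again.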

If the script is built properly in this manner, synchronizing the database can be as easy as "getting the latest" version of the code, and running the script before testing. Is someone complaining the latest build doesn't work? Tell them to run the script again and take a hike!

If there is one fundamental rule to the script, it is '''NEVER DELETE FROM THE SCRIPT'''. Occasionally, it seems tempting to delete past mistakes in the script, but this ruins the serial record of changes inside the DBVersion table! Instead of deleting, append an additional version change that amends previous mistakes. After all, most scripts will (hopefully) not require a critically long execution time.
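A small sketch of the append-don't-delete rule, again in Python with sqlite3; the Widgets table and its steps are hypothetical. Step 2 picks the wrong column type, and step 3 amends it rather than rewriting step 2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DBVersion "
             "(DBVersionID INTEGER PRIMARY KEY, Description TEXT)")

# Step 2 chose the wrong type for Price. Do NOT edit or delete step 2:
# databases that already ran it would fall out of sync with the script.
# Step 3 amends the mistake instead. (Hypothetical Widgets example.)
STEPS = [
    (1, "CREATE TABLE Widgets (WidgetID INTEGER PRIMARY KEY)"),
    (2, "ALTER TABLE Widgets ADD COLUMN Price TEXT"),          # the mistake
    (3, "ALTER TABLE Widgets ADD COLUMN PriceCents INTEGER"),  # the amendment
]

for version, ddl in STEPS:
    if conn.execute("SELECT 1 FROM DBVersion WHERE DBVersionID = ?",
                    (version,)).fetchone():
        continue                 # already applied; preserve the audit trail
    conn.execute(ddl)
    conn.execute("INSERT INTO DBVersion VALUES (?, ?)", (version, ddl))
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM DBVersion").fetchone()[0])   # -> 3
```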

Even though it's not necessary to adopt the DBVersion methodology from the inception of the database, there are clear advantages to having a build script from the ground up.

=== Real-Life Example Using SQL Server ===

Every major project begins with a single step. Adopting this versioning and "getting into the habit" is the hardest part, so I will supply some code for our beloved SQL Server Northwind database as a starting example.

 --Enforce this is Northwind
 USE Northwind

 --To ensure atomicity, begin the main transaction
 BEGIN TRANSACTION

 --make sure the DBVersion table exists
 if not exists (select 1 from dbo.sysobjects where id = object_id(N'[dbo].[DBVersion]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
 begin
   print N'Building the foundation ''DBVersion'' Table'
   --build the table
   CREATE TABLE DBVersion (
     [DBVersionID] int NOT NULL PRIMARY KEY,
     [Description] varchar (255) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
     [ExecutionDate] datetime NOT NULL,
     [UserID] int NOT NULL
   )

   --any error rolls the script back and terminates it
   if (@@error <> 0)
   begin
     ROLLBACK TRANSACTION
     RETURN
   end

   --insert the first version marker
   INSERT INTO DBVersion values (1, 'Build the DBVersion Table', getDate(), User_ID())
 end
 --finished step one

 --Continuing, adding to this script is relatively simple! Observe...
 if not exists (select 1 from DBVersion where DBVersionID = 2)
 begin
   print N'Adding a ''DateCreated'' field onto the Customers table for auditing purposes'

   --add the column, or whatever else you may need to do
   ALTER TABLE Customers add DateCreated datetime not null default getDate()

   --any error rolls the script back and terminates it
   if (@@error <> 0)
   begin
     ROLLBACK TRANSACTION
     RETURN
   end

   --insert the second version marker
   INSERT INTO DBVersion values (2, 'Adding the DateCreated field to customers', getDate(), User_ID())
 end

 --from here on, continually add steps in the previous manner as needed.
 COMMIT TRANSACTION

Note that as the above code grows more verbose, these scripts can reach a daunting length in a short amount of time, so it may be to your benefit to split the script into multiple "chapters" of the conceptual evolution of your database.

=== Modifications ===

The above script is a bare-bones example for you to build upon for your specific needs. Change it as you like, the core concept lies within the algorithm.

* Maybe you could write a stored procedure that takes a version identifier and an update batch as arguments.
* Maybe some parts of your data will rely on replication services for updates.
* Maybe there's some extra grunt work involved in updating legacy databases.
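The first suggestion can be sketched as an ordinary function rather than a stored procedure. This is an illustration only, using Python's sqlite3 module as a stand-in database; the apply_step name and its signature are assumptions:

```python
import sqlite3

def apply_step(conn, version_id, description, batch):
    """Hypothetical helper in the spirit of the suggested stored procedure:
    run `batch` only if `version_id` is not yet recorded in DBVersion."""
    if conn.execute("SELECT 1 FROM DBVersion WHERE DBVersionID = ?",
                    (version_id,)).fetchone():
        return False                  # already applied on this database
    conn.executescript(batch)         # the update batch itself
    conn.execute("INSERT INTO DBVersion VALUES (?, ?)",
                 (version_id, description))
    conn.commit()
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DBVersion "
             "(DBVersionID INTEGER PRIMARY KEY, Description TEXT)")

ddl = "CREATE TABLE Products (ProductID INTEGER PRIMARY KEY);"
first = apply_step(conn, 1, "Create Products", ddl)
second = apply_step(conn, 1, "Create Products", ddl)   # no-op: version 1 exists
print(first, second)   # -> True False
```

Centralizing the check-then-record logic in one place keeps each individual version step down to its identifier, a description, and the batch itself.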

== Database Unit Testing ==

Whenever you change your database schema, perhaps as the result of a database refactoring, you must be able to regression test your database to ensure that it still works. Tools that can help include:

* [http://gojko.net/fitnesse/dbfit DBFit]
* [http://www.dbunit.org/ DBUnit]
* [http://www.ndbunit.org/ NDbUnit]
* [http://www.ounit.com/ OUnit] for Oracle (being replaced soon by [http://www.unit-test.com/ Qute])
* [http://sqlunit.sourceforge.net/ SQLUnit]
** [http://tsqlunit.sourceforge.net/ TSQLUnit] (for testing T-SQL in MS SQL Server)
* [http://msdn.microsoft.com/vstudio/teamsystem/products/dbpro/ Visual Studio Team Edition for Database Professionals]
* [http://www.sqledit.com/ DTM Database Tools]
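None of these tools is required to get started; even a few hand-written assertions against a freshly migrated test database catch many regressions. Here is a minimal sketch using Python's built-in unittest and sqlite3 modules, with a hypothetical schema and expected version:

```python
import sqlite3
import unittest

class SchemaRegressionTest(unittest.TestCase):
    """Check that a freshly migrated database is in the expected state."""

    def setUp(self):
        self.conn = sqlite3.connect(":memory:")
        # Stand-in for running the real migration script against a test copy.
        self.conn.executescript("""
            CREATE TABLE DBVersion (DBVersionID INTEGER PRIMARY KEY);
            CREATE TABLE Customers (
                CustomerID  INTEGER PRIMARY KEY,
                DateCreated TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP);
            INSERT INTO DBVersion VALUES (1), (2);
        """)

    def test_schema_is_at_expected_version(self):
        latest = self.conn.execute(
            "SELECT MAX(DBVersionID) FROM DBVersion").fetchone()[0]
        self.assertEqual(latest, 2)

    def test_customers_has_audit_column(self):
        cols = [r[1] for r in self.conn.execute("PRAGMA table_info(Customers)")]
        self.assertIn("DateCreated", cols)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(SchemaRegressionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The same two checks (schema version and expected structure) translate directly into any of the frameworks listed above.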

== References and Suggested Readings ==

* [http://www.agiledata.org/essays/databaseTesting.html Database Testing: How to Regression Test a Relational Database]
* [http://www.agiledata.org/essays/bestPractices.html Agile Database Best Practices]
* [http://www.agiledata.org/essays/agileDataModeling.html Agile/Evolutionary Data Modeling]
* [http://www.agiledata.org/essays/culturalImpedanceMismatch.html The Cultural Impedance Mismatch Between Data Professionals and Application Developers]
* [http://www.agiledata.org/essays/databaseRefactoring.html The Process of Database Refactoring]
* [http://www.ambysoft.com/surveys/ Survey Results (Agile and Data Management)]
* [http://www.codeproject.com/useritems/DatabaseSchemaVersioning.asp Database Schema Versioning Management 101]
