Laundering dirty data
Two problems have been occupying most of my time this month: one concerned database design (or rather the lack of it), the other mapping. The database issue arose when one of our customers decided it was time to give their line-of-business application some much-needed attention. This application started out as a single Access MDB file, but was later split into two Access MDB files: one for the data and the other containing the user interface. Finally, the data tables were upsized to SQL Server, leaving only the front end in Access.
This system’s problems were legion, but the people using it couldn’t really see them. There were data type and length mismatches; text held in non-standard collation sequences; almost no relationships defined; and no sensible defaults for any fields. The Access front end worked, but no-one wanted to make any more changes to it because it was becoming increasingly fragile. Patches upon patches, queries built on queries, and a byzantine labyrinth of forms and subforms, macros and VBA code made it too difficult to predict how long even a small change would take, and almost impossible to avoid unintended consequences. Hence, for extra reporting power, the company has now turned to SQL Server Reporting Services and is seeing the raw data tables in all their glory – an eye-opening, even eye-watering, experience.
Although no relationships were defined in the database, relationships were implied by field names, by the data itself and by the behaviour of the front end. Unfortunately, though, the related data was sometimes stored as an integer in one table and as characters in the other. Elsewhere, the same data was stored as either nvarchar(50) or nvarchar(100), depending on which table you looked at. (An nvarchar column holds variable-length Unicode text up to the maximum number of characters given in the brackets.) Wherever the Access front end copied data from an nvarchar(100) into an nvarchar(50), any characters beyond the 50th could be truncated without warning.
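As a minimal illustration of that silent truncation, here is a Python sketch; the store function is purely hypothetical, standing in for what an insert into a too-short column does:

```python
def store(value: str, max_len: int) -> str:
    # Mimics a too-short nvarchar column: anything past max_len
    # characters is silently dropped, with no error raised.
    return value[:max_len]

# An 80-character note copied into a 50-character column loses its tail.
long_note = "x" * 80
saved = store(long_note, 50)
print(len(saved))  # 50 - the last 30 characters are gone
```

The danger is precisely that nothing fails: the shortened value looks perfectly valid afterwards.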
Now if data types and lengths don’t match, SQL Server prevents you from defining relationships between the columns, so all of these problems needed to be fixed before we could define the relationships. To find all the mismatches, we listed all the columns in all the tables, along with their data types, then sorted them by column name. All columns with the same or similar names would appear next to each other, and we could see where their types and lengths didn’t match. Fixing these problems meant editing the table definitions wherever we found a mismatch, which we did on a copy of the database using SQL Server Management Studio, making sure we saved the change script every time. Once these changes were done, we could test that the application still worked and, when satisfied, replay the change scripts onto the live database.
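On SQL Server itself, that listing can be pulled from the INFORMATION_SCHEMA.COLUMNS view. As a self-contained sketch of the same idea, here is the equivalent against a toy SQLite database (the table and column names are invented for the example):

```python
import sqlite3

# Toy stand-in for the real database: two tables whose CustomerRef
# columns disagree on type, as in the article.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Orders    (OrderID INTEGER, CustomerRef INTEGER,
                        ShipCity NVARCHAR(50));
CREATE TABLE Customers (CustomerRef NVARCHAR(50),
                        CustomerName NVARCHAR(100));
""")

# Gather (column, table, declared type) for every column in every table.
rows = []
for (table,) in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"):
    for _, col, decl_type, *_ in con.execute(f"PRAGMA table_info({table})"):
        rows.append((col, table, decl_type))

# Sorting by column name puts matching names next to each other,
# so mismatched types and lengths stand out.
for col, table, decl_type in sorted(rows):
    print(f"{col:15} {table:10} {decl_type}")
```

In the sorted output, the two CustomerRef rows sit adjacent, one declared INTEGER and the other NVARCHAR(50), which is exactly the kind of mismatch we were hunting for.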
We also took the opportunity to correct inappropriate use of certain data types. Many columns containing financial data had been defined as FLOAT, which is fine for scientific quantities whose magnitude may become very large or very small, but useless for money. Floats can’t represent most decimal fractions exactly, and the conversion from binary to decimal can cause anomalies, such as the value 36.2 being stored as 36.199999996754. Defining a column as MONEY rather than FLOAT allows values to be stored exactly to four decimal places – 36.2 is stored as 36.2000 and base-conversion errors are avoided.
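You can see the same effect outside SQL Server with Python’s decimal module: converting a binary float to Decimal exposes the value that is actually stored, while a fixed-point decimal scaled to four places (as MONEY is) holds the value exactly. A small sketch:

```python
from decimal import Decimal

# A binary float cannot hold 36.2 exactly; converting the float
# to Decimal reveals the value actually stored.
stored_as_float = Decimal(36.2)
print(stored_as_float)  # not exactly 36.2

# A fixed-point value scaled to four decimal places, as a MONEY
# column stores it, holds 36.2 exactly as 36.2000.
stored_as_money = Decimal("36.2").quantize(Decimal("0.0001"))
print(stored_as_money)
```

The precise stray digits differ from the article’s example (Python uses a 64-bit double throughout), but the underlying problem is the same: the binary value is close to 36.2, not equal to it.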
Where a value isn’t monetary but still needs to be stored exactly, you can instead specify DECIMAL(18,4), which, like MONEY, stores values exactly to four decimal places but carries no implication that the value is money. We used this data type for things like discount rates, VAT rates and exchange rates. The numbers in brackets after the word DECIMAL indicate that we want to store 18 digits, of which the last four fall after the decimal point; for example, a number like 12,345,678,901,234.5678.
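The same fixed-point arithmetic can be sketched in Python, again with the decimal module mimicking a DECIMAL(18,4) column (the rate and amount here are invented for the example):

```python
from decimal import Decimal, ROUND_HALF_UP

# Scale of a DECIMAL(18,4) column: four digits after the point.
FOUR_DP = Decimal("0.0001")

def to_18_4(value: str) -> Decimal:
    # Mimic storing into a DECIMAL(18,4) column: the value is held
    # exactly, rounded to four decimal places.
    return Decimal(value).quantize(FOUR_DP, rounding=ROUND_HALF_UP)

vat_rate = to_18_4("0.20")     # a hypothetical 20% VAT rate -> 0.2000
net = to_18_4("149.99")        # -> 149.9900
vat = (net * vat_rate).quantize(FOUR_DP, rounding=ROUND_HALF_UP)
print(net, vat_rate, vat)      # 149.9900 0.2000 29.9980
```

Every intermediate value stays an exact decimal, so repeated calculations never accumulate the base-conversion drift that FLOAT columns suffer from.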