Ditch messy spreadsheets and switch to a database
We’ve looked at the pitfalls of using a spreadsheet application such as Excel to store lists of data. This approach may seem like the best solution at first, but you can run into problems sharing that data with multiple users, validating the content or even navigating your data. Why? Because you’re using a tool that wasn’t designed to do the job.
Now we’ll consider an imaginary (but typical) case of a business using a spreadsheet-based list, and look at how this could be converted to a database application to overcome such problems.
How workbooks get out of hand
Our list began as a simple record of projects undertaken for clients. As the company grew, so too did the number of clients, with names and contact details added to the workbook. Also, some way was needed of recording what various members of staff were doing on these projects, so even more data was added into this workbook.
At this point the spreadsheet approach became unworkable: there were far too many people trying to keep it up to date, often at the same time. The company tried instituting a rota, so that people took it in turns to update the workbook, but this meant that some tasks were forgotten about before they were recorded.
In the end, people set up their own workbooks to keep track of their tasks, sometimes remembering to copy the data into the main workbook at the end of the week. Employees developed their own shorthand for these books, and some changed the formatting and the order of the columns to suit their way of working. Copying this data into the main workbook resulted in a horrible mess.
This may be a made-up example, but I’ve actually seen all of these practices in real life. Let’s take a closer look at some of the issues thrown up by this method of working.
Plenty of problems
You can see the first sheet of our imaginary spreadsheet. The first column details the name of the project to which each entry refers. Some of these names are long, however, so staff may have been tempted to use abbreviations; as a result, typos have crept in. This makes it difficult to tie up which tasks belong to which project. The solution doesn’t have to be difficult: you could choose a short name for each project that everyone agrees on, or give each project an ID number and translate this to the project name automatically.
There’s a similar problem with the Started column. Some cells contain a date, but others record only a month – and one or two records just say “Yes”. Excel does support data validation, so it’s possible to ensure that particular cells always contain data of a particular type – but when a spreadsheet is developed in an ad hoc fashion, it’s rarely used.
“At this point the spreadsheet approach becomes unworkable: there were far too many people trying to keep it up to date”
You won’t have this problem in a database application, since the data type of the field will be fixed from the outset. If you don’t know the exact date when work began, you can use the first of the month, or 1 January if you only know the year. If the project hasn’t yet been started, you might leave the field blank – a NULL in database terms. If you knew the project had been started but didn’t know when, you can use a date that would ordinarily be impossible for your data, such as 1/1/1900. Immediately it becomes easy to sort projects and gain a chronological overview of activity.
A more subtle challenge is presented by the column labelled Client. The entries in this column aren’t linked to anything else in the workbook, but there’s a list of Customers on Sheet 1, which is probably what it refers to. Storing multiple lists of the same items, referred to by different names, is confusing. You need to clarify the naming and settle on an unambiguous name for this entity: are they clients or customers?
The Status column is another one where there’s been no validation, so people have again opted to write whatever they want. It would be better to establish a short list of all the permissible values.
The second sheet – Sheet 1 – is just as problematic. For a start, the sheet name isn’t descriptive. What it actually contains is a list headed Customers, but this isn’t formatted as a table in Excel: the address is in one field, which limits your ability to use Excel’s built-in tools to search or sort it. You could, for example, filter for addresses that contain “Cardiff”, but the results would also include those on Cardiff Road in Newport.
When it comes to addresses, the best approach is to use separate fields for the postcode, county, city, and street (although county information is optional for UK addresses – see No counties, please, we’re British). Street should contain everything that isn’t in the other parts of the address.
There’s a Contact field, which presents problems too. Where we have several contacts within a single-client business, their names have all been lumped into this field, with their phone numbers and email addresses similarly placed into the other fields. Separating these out will be challenging – especially if there are three names in the Contact field but only two phone numbers.
The final column in this sheet is headed Last Contacted: employees are supposed to update this each time they make contact with a customer. Since this information is an extra thing for the employee to remember, and there’s no guarantee they will – especially since it’s hidden out of the way on a second sheet– it’s unreliable. This is really something the computer should be tracking automatically.
Finally we come to the Tasks sheets, which detail the tasks and comments for each worker. These aren’t named consistently, and don’t contain the same columns in the same order. While it makes sense for individual users to enter their data on their own sheets, the lack of coherence makes it difficult to collate and analyse the data. When a manager wants to see what work has been done on each project, for example, all the tasks have to be copied by hand from the individual sheets into one list before they can be sorted and reported on.
Building your database
Sorting out these issues will take some work, possibly several days. Since users will probably have to continue to use the old system while we’re building a new one, it’s best to make a copy of the existing workbooks from which to work. This means we’ll want to document every step in converting the data, so we can quickly do it again when the time comes to switch over to the new system.
The first thing you need to do is clean the data in your Excel workbook. Using Find & Replace can help, and you should delete any column or row that doesn’t contain data (except for the column heading row, which must be kept). Add an ID column to each sheet, in column A, and populate it with incremental numbers by typing 1 in the first cell, selecting to the bottom of the data (Shift+End, Down) then using the Fill Down command (Ctrl+D). Create a master list of project names, and wherever a project name is recorded, use the VLookup() function to confirm its master ID number; if there’s no number, there’s an inconsistency in your data.
Once your data is clean, it’s time to design a new database to hold it. We’ll use Access 2013, because in our theoretical example it’s available to all our users through our Office 365 subscription. When you create a new Access database, you get a choice of creating it as an Access Web App or an Access Desktop Database. Web Apps have a simplified interface and can be used only if you have Office 365 with SharePoint Online or SharePoint Server 2013 with Access Services and SQL Server 2012. We’ll use the traditional Desktop Database, since it offers more options and greater control over the user experience.
Select to create a new Desktop Database and name it: Access creates a new table called “Table 1”, and places you into the Design View with one column, called “ID”. Here you can design the tables you’ll need in your database. Every table should have an ID field (an automatically incremental integer), but to avoid confusion it’s best to give it a more descriptive name. In the Projects table it would be “ProjectID”, “CustomerID” in the Customers table, and so on.
You can set the data type for every column created, and you need to give each column a name and set any other properties and formatting as appropriate for the field. As with the ID field, make sure the column names make it obvious what data should go in the field – so, for example, use ProjectName rather than just Name, DueDate rather than Due. You can use the Name & Caption button on the ribbon to create an abbreviated caption as well as the explicit name. You can use spaces in column names, but you’ll have to surround them with square brackets when writing queries and reports.
“While it makes sense for users to enter their data on their own sheets, the lack of coherence makes it difficult to analyse”
Set the formatting on columns such as PercentageComplete to be Percent and dates to be ShortDate, and also the maximum length of text fields to a sensible value, or they’ll all be 255 characters long. Remember that some words (such as Date) are reserved, so you can’t use them as column names: use TaskDate or something else more descriptive instead.
When it comes to columns where you want to look up a value in another table (such as the Customer column in the Projects table), define those other tables in Access before you add the lookup column. When it comes to Status, the simplest option is to just type the values to be shown in the dropdown list – but this makes it difficult to add or edit the list of possible values later. Unless you’re dealing with a short list where possible values are unlikely to change – such as a field recording someone’s sex – it’s a better idea to create another table for entries such as ProjectStatus. This allows you to easily add extra options to the list in future without a programming change.
While we’re designing our database, we can implement improvements over the old spreadsheet-based way of doing things. One complaint our users had with their Excel workbooks was that each task contained only one cell for comments, and sometimes they needed to make more than one comment on a task – or, the supervisor needed to make a comment about a task and then the user reply to this. Cramming everything into a single cell made it difficult to see when, and by whom, comments were made. We can do better by creating a separate table for comments, linked to the Tasks table. In this way, each task can have as many comments as necessary, with separate fields for the date, username and text of each one.
Another enhancement we can make is to set entries such as ProjectStatus to display in a particular order, rather than alphabetically – for example, you might want “Completed” to go at the bottom of the list. To do this, add a DisplayOrder column and use it to sort the lookup list. Don’t be tempted to use the ID field; with this, any new records could only go on the end of the list.
To ensure our data remains clean, we can mark fields that the user must fill in as “Required”, and add validation to ensure that the data entered is in the correct form. You can make life easier by setting sensible default values: the CommentDate field on the Comments table could have its default value set to “=Date()”, which will automatically set it to today’s date whenever a new Comment is created. You can use validation along with a “Withdrawn” column in a table (a Boolean) to stop users adding new records with specific values. This allows you to keep historic values that used to be valid, but that aren’t used any more. These features can all be found on the Table Tools | Fields tab on the ribbon or in the Field Properties in Table Design View.
Importing your data
Once your tables are set up, you can use the External Data | Import & Link | Excel button on the ribbon to append the data from your Excel workbook to the tables in your Access database. Make a backup of your blank Access database before you start, in case anything goes wrong, and start by populating the small tables by hand if necessary. Take another backup once this is done, so you can get back to this point if anything goes wrong in the following steps.
Now import the main tables that don’t rely on any other tables, such as Customers, before finishing with the tables that do have relationships, such as Projects and Tasks. If you rearrange and rename the columns in your Excel workbook to match the fields in your Access database as closely as possible, you shouldn’t have any difficulty importing the data. Remember to make a note of everything you do so you can repeat it later if you need to convert the data again.
Once the data is imported, the tables in Datasheet View should work much as the Excel worksheets did – but with much better data validation, searching and sorting. If you wish, you can now start to design new forms and reports based on this data: for example, a Master/Detail form for Projects might show the data of one Project at the top of the form and a grid of the Tasks for that project at the bottom.
You could also set up a “My Tasks” form that lists all the outstanding tasks for the current user and an Overdue Tasks report that lists all the outstanding tasks for all users that are past their due date.
No counties, please, we’re British
If you’re storing addresses in your database, it’s important to understand what information you actually need. Although county information can be useful for marketing – and may be needed for some overseas addresses – it’s no longer officially used in UK addresses.
The reason is that UK postal addresses rely on the concept of a “post town”, where post for you is sent and sorted before it’s delivered to your door. Not all towns or villages are served by post towns in the same county – for example, Melbourn (in Cambridgeshire) gets its mail through Royston (in Hertfordshire) – so specifying a county in the address doesn’t necessarily help anyone.
To avoid confusion, the Post Office stopped using counties in addresses back in 1996, relying on postcode information instead – and by 2016, it plans to remove county names from the “alias data file” of supplementary address information. So, if you include a county in a UK address it will simply be ignored.