When migrating a site there is never a shortage of bad data that you need to deal with, but there is not any place more troublesome than when you encounter it for the first time while attempting to migrate a website. Dealing with Bad Data, in this case, is defined as any data that cannot conform to the expectations of you or your migration team and the code/structure of the new site. Regardless if that is because the data is missing something or because it is formatted in an unexpected way dealing with bad data can really have a negative impact on your site.
Being alert to potential issues will help to reduce issues and make your migration go faster and smoother. Creating a system/process for this pre-migration stage will be critical in ensuring your team is able to move data without losing it or having lots of manual steps to massage the data. Here are our approaches and tips to make it easier for you.
Locating Problematic Data
One of the first steps on the road to migration is creating a set of spreadsheets for the new site that will indicate where the content from the old site is coming from and where it should be placed in the new site and what changes need to be done if any. That is a good start, but just assuming that you know what the data will look like on the old site can be a pitfall that will cost you your budget.
For many fields, it can be useful to do queries on a snapshot of the old database. Fields that have a limited number of distinct values should have a distinct() query run against them. For example, a field that should have a ‘Y’ or ‘N’ value, may also have ‘y’ or ‘n’ values, too, or even ‘true’, ‘false’, 0, 1, ‘’, NULL, ‘t’, ‘f’, and so on. Find all the outliers before you begin coding, so you can take them into account. For a yes/no field like this, you may want to code a process plugin that can interpret all of them into a single pair of values that you can then reuse on other fields in your migration.
Dates are particularly troublesome fields. If the old database field is not in ISO date format, you may find that dates are recorded in all sorts of ways. Converting them all correctly for use in a MySQL date or DateTime field can be difficult. Often the easiest thing to do is not try to do it yourself, but rather use PHP’s strtotime() function. But that may not be good enough if your dates might be outside of the timestamp epoch (roughly, 1970 - 2038). Then you’ll have to fall back on PHP’s date class.
This brings up the next issue with dates: they may have contextual problems. Is it ok if you find a date that’s 100 years in the future or 100 years in the past? Often, dates need to be range checked somehow.
Another problem dates can have is timezones. Often, dates don’t have an associated timezone in a database. You may need to know the default timezone for the old system, or perhaps where the client is located. Timezone date calculations are surprisingly finicky to get right.
If there are unstructured text fields in the old data, you may need to watch out for data that is outside the “normal” 7 bit ASCII code page. Even Western European languages have codes outside of this range, things like ø, ç, ü and so on, including currency signs like £ and €. Then there are multibyte string data, used for representing characters on the vast number of code pages for languages that use other alphabets or other special-purpose glyphs, like for math and physics.
There can also be problems with characters that have been entered for various symbols: emojis that employ character codes that are specific to Windows or the Mac and cause problems in their native forms.
If your MySQL database is not set up right, it might reject inserting such data. Choosing a “reasonable” format can be a challenge sometimes, but usually, the better answer is to detect these situations and write some custom code to fix them, to convert the wacky characters into something acceptable and displayable.
A common problem often seen when migrating older Drupal sites, but can also happen with other technologies as well, is that text fields will have self-referencing data encoded in it. A very common situation is that images on the site are inserted into the text with an
, for example. This may not work correctly on the new site, especially if, as a part of the migration, the existing files have been reorganized.
In Drupal 7 sites, you may instead see a blob of JSON that is a media entity reference. You will have to find these and use the mapping tables created when the media entities were migrated to change them. You will probably have to find another way to deal with them, as media entities in Drupal 8 are not yet integrated into the default CKEditor WYSIWYG fields.
Getting Cozy with Your Data
While investigating the existing data, be sensitive to changes in patterns. Maybe you’re dealing with an e-commerce system and the states that an order goes through start having a different pattern at some point in time in the data. This can indicate a point when the old system was modified or upgraded, and your assumptions about what data you’ll see and how to migrate it across to the new system need to take this into account.
This is true especially if you’re migrating off a system that no longer has any technical support. Maybe because the company that implemented isn’t around anymore or is otherwise not available, or the software in use is obsolete. You may have to guess how data objects are constructed out of the database tables. There may be extra tables no longer in use - but do they have historic information that you should be capturing and moving? You will need to have several sessions with the client to ask them about particular examples in their data. Puzzling this out when the client themselves doesn’t know how their system works can keep you up at night.
You’ll have to try to rebuild the original designer’s mindset to understand how they structured the data. If the data has any complexity to it, you will have to learn how related data objects are referenced, and rebuild these connections correctly in Drupal.
Sometimes references are conditional on the state of one or more fields. For example, if there’s an original order field in the order table, maybe that is only filled in when this order is actually a return. What does the order total field represent in this case? Should it be positive or negative? These are the kinds of issues to be sensitive to. Often an initial audit of an old site won’t find these items.
Data Issues can come at many levels
Here are just a few of the areas where most problems tend to pop up when migrating data into a new Drupal site. Usually, with Raw data, the issues are around the context of the data and in the actual meaning of the data. Finding and fixing these can require a great deal of careful testing and creative solution development. Being aware of those issues ahead of time and following through with a plan during development can help your team have a smoother path forward during the migration process.
Try to use your favorite database tool to investigate the incoming data fields for the following
- Look at select and checkbox lists to verify that the values are consistent throughout
- Check date fields for good and consistent date formats and reasonable data ranges, and consider if there are any timezone issues
- Look at your content and especially that of unstructured text. See if there are character set issues, internal HTML references to pages on the old site, and image URLs.
- Check deeply on how the old system is structured, especially if you do not have access to any technical resources for that system. Verify that you know how and why the data in the old system is structured the way they are
- Create a UML map of the old platform, and one of the new ones, then create a diagram of where things go and what things are being dropped.