Migrating a Wordpress blog to Drupal whilst importing embedded images and files for use with the Media Module

One of our clients is in the process of standardizing on Drupal for their various web presences and micro-sites. This is in line with the ‘One Drupal to rule them all’ phenomenon Dries has written and presented on. In this case the client’s web presence is a mixture of PHP, Zend applications, Wordpress and various other non-PHP platforms.

by lee.rowlands

Background

As part of this process they wanted to migrate their old Wordpress blog to Drupal. At the time of migration we evaluated the Wordpress Migrate module, but it didn’t meet our needs for a number of reasons, including the need for additional flexibility in matching the Wordpress users to our existing Drupal accounts and some additional data massaging we had to do. Note that our migration work started back in September 2011 and has continued on and off since then; improvements to both Migrate and Wordpress Migrate in the meantime may have changed this outcome.

In case you’re not familiar with the Migrate module, there is a wide array of resources available on the project page to get you started. It offers a very powerful approach to migrations, with support for a wide range of Drupal content out of the box, and it can be easily extended to meet practically any use case.

The blog being migrated in this case featured a large number of embedded images as well as links to files. These images and links were in the body of the Wordpress posts and were added using the Wordpress WYSIWYG. The destination Drupal site was running the Media module with the Media WYSIWYG plugin to allow the authors to embed media assets into the Drupal node body. As a result the migration had to identify media assets (images and files) in the Wordpress post bodies and import them as managed files into Drupal. That way they could be managed with the Media module, embedded into new posts using the Media WYSIWYG plugin, and exposed to Views to be added to other content displays as needed.

Note that the upcoming 7.x-2.4 release of Migrate will include improved support for file migration, so this may alter the approach required.

So to recap our requirements:

  • Migrate Wordpress blog posts by directly accessing the database (no WXR file)
  • Parse blog posts for links and images referencing media assets (files and images) on a particular domain
  • Migrate these files into Drupal as managed files (available to Media module)
  • Update the body of the newly created nodes to reference the newly imported files (note there are no file or image fields involved here).

Enter QueryPath

To handle the heavy lifting of parsing the blog posts for images and links we turned to QueryPath, available to Drupal thanks to the QueryPath module.

For the uninitiated, QueryPath is like jQuery for PHP. Much of the syntax is similar, so if you're familiar with jQuery you should hit the ground running with QueryPath.

Interestingly, Four Kitchens recently wrote about using QueryPath in a static HTML file migration. It's always nice when your peers arrive at the same solution independently.

Putting it all together

We won't go into detail about the basics of migration; there are plenty of other posts and resources describing how to perform a migration from an SQL source. Instead we'll focus on the handling of the file assets.

When you create a migration class you can implement a complete() method to perform last-minute modifications on the imported data. Migrate calls it after the entity has been saved, so the node ID is already available. This is where we chose to perform our file handling. Our first step was to parse the document with QueryPath. QueryPath expects an XHTML document, not a fragment, so we had to start by wrapping our post body to resemble a full document.

 $body = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>' . $entity->body[LANGUAGE_NONE][0]['value'] . '</body></html>';

Then we can pass this document to QueryPath:

 $qp = @htmlqp($body, NULL, array('convert_to_encoding' => 'utf-8'));
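The wrapping step can be tried outside Drupal. The following is a minimal sketch that swaps in PHP's bundled DOM extension for QueryPath so that it runs on its own; the helper name wrap_fragment() and the sample markup are assumptions for illustration, not part of the original migration.

```php
<?php
// Hypothetical helper (not from the original migration): wrap an HTML
// fragment in a minimal document that declares its encoding.
function wrap_fragment($fragment) {
  return '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
    . $fragment . '</body></html>';
}

// Stand-in for QueryPath: the DOM extension that ships with PHP.
$doc = new DOMDocument();
// Collect parser warnings about imperfect markup instead of printing them,
// similar in spirit to the @ operator used with htmlqp().
libxml_use_internal_errors(TRUE);
$doc->loadHTML(wrap_fragment('<p>Café <a href="/wp-content/uploads/2011/09/report.pdf">report</a></p>'));
libxml_clear_errors();

// Collect the href of every a tag found in the parsed body.
$hrefs = array();
foreach ($doc->getElementsByTagName('a') as $a) {
  $hrefs[] = $a->getAttribute('href');
}
```

Without the wrapper the parser has no encoding declaration to work from, which is why the fragment is placed inside a full document first.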

We made two passes over the body: the first to find links to media assets (files, images) and the second to find image tags. Whilst looping we kept track of whether we'd changed the markup, so we didn't need to update the body content if nothing changed. For performance reasons we also kept an array of found files in case one file was linked twice in the same body, since we didn't want to fetch it twice; that logic isn't shown here. Here's some example code for the files logic. We won't bother to show the image logic as it's basically the same thing, but instead of finding a tags we look for img tags, and instead of updating the href of the a tag we update the src of the img tag.

// Grab all a tags in the body.
$links = $qp->find('a');
$markup_changed = FALSE;

// Loop over the links.
foreach ($links as $link) {
  // Match on href, we're looking for items relative to wp-content.
  if (preg_match('@wp-content/uploads(.*)\.(pdf|txt|xls|doc|odf|jpg|png|gif|jpeg)@', $link->attr('href'))) {
    $src = $link->attr('href');
    try {
      // Open the remote file as a read-only stream.
      if (($handle = fopen($src, 'r')) &&
        // Parse the url.
        ($url = parse_url($src)) &&
        // Check we have a url path.
        !empty($url['path']) &&
        // Get the filename.
        ($filename = basename($url['path']))) {
        // Construct a uri.
        $uri = 'public://' . $filename;
        // Save the item as a managed file directly from the stream.
        $file = file_save_data($handle, $uri);
        if ($file) {
          // Now we update the markup with the new url. Use the saved file's
          // uri, as file_save_data() renames the file if the name is taken.
          $new_url = file_create_url($file->uri);
          // Use QueryPath to update the url.
          $link->attr('href', $new_url);
          // Flag things have changed.
          $markup_changed = TRUE;
          // Record this file as used by this node/our module.
          file_usage_add($file, 'MYMODULE', 'node', $entity->nid);
        }
        fclose($handle);
      }
    }
    catch (Exception $e) {
      // We use error log for convenience as we invoke the import using drush on the cli.
      error_log('Failed to fetch file: ' . $src . ' found in post ' . $row->id);
    }
  }
}
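Two pieces of the loop can be exercised in isolation: deciding which URLs count as assets, and the found-files array mentioned earlier whose logic isn't shown. The following is a rough sketch only; asset_uri(), fetch_once() and the example URLs are hypothetical names invented here, and the fake fetch callback stands in for the real download and file_save_data() step.

```php
<?php
// The pattern used in the loop above.
$pattern = '@wp-content/uploads(.*)\.(pdf|txt|xls|doc|odf|jpg|png|gif|jpeg)@';

// Hypothetical helper: map a matching URL to a public:// URI, else FALSE.
function asset_uri($src, $pattern) {
  if (!preg_match($pattern, $src)) {
    return FALSE;
  }
  $url = parse_url($src);
  if (empty($url['path'])) {
    return FALSE;
  }
  // basename() on the path drops the directories; the query string is
  // already excluded because only the path component is used.
  return 'public://' . basename($url['path']);
}

// Hypothetical helper: call $fetch at most once per distinct source URL.
// Failures (FALSE) are remembered too, so a broken link is not retried.
function fetch_once($src, array &$found, $fetch) {
  if (!array_key_exists($src, $found)) {
    $found[$src] = call_user_func($fetch, $src);
  }
  return $found[$src];
}

$calls = 0;
// Stand-in for the real download-and-save step.
$fake_fetch = function ($src) use (&$calls, $pattern) {
  $calls++;
  return asset_uri($src, $pattern);
};

$found = array();
$src = 'http://blog.example.com/wp-content/uploads/2011/09/report.pdf?ver=2';
$first = fetch_once($src, $found, $fake_fetch);
$second = fetch_once($src, $found, $fake_fetch);
$skipped = asset_uri('http://blog.example.com/about-us', $pattern);
```

Note that the pattern is not anchored at the end, so a URL such as report.pdf.bak would also match, and that two uploads sharing a filename in different directories map to the same public:// URI.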

After we’ve looped over images and links, we just need to do the final update of the imported entity. The code looks like this.

// Did we update our markup?
if ($markup_changed) {
  // We need to update the node's body markup.
  $entity->body[LANGUAGE_NONE][0]['value'] = $qp->top()->find('body')->innerXHTML();
}
// Save the node.
node_save($entity);
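The write-back step, where only the contents of the body tag are kept and the wrapper added for parsing is discarded, can be sketched without Drupal or QueryPath. This example again uses PHP's DOM extension as a stand-in; the sample fragment and the replacement URL are placeholders, not values from the original migration.

```php
<?php
// Stand-in for QueryPath: PHP's bundled DOM extension (an assumption made
// so the sketch runs on its own).
$fragment = '<p><a href="http://blog.example.com/wp-content/uploads/report.pdf">report</a></p>';
$doc = new DOMDocument();
libxml_use_internal_errors(TRUE);
$doc->loadHTML('<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>' . $fragment . '</body></html>');
libxml_clear_errors();

// Placeholder for the URL that file_create_url() would return.
$new_url = 'http://example.com/sites/default/files/report.pdf';
$markup_changed = FALSE;
foreach ($doc->getElementsByTagName('a') as $a) {
  $a->setAttribute('href', $new_url);
  $markup_changed = TRUE;
}

// Serialise only the children of body, dropping the html/head wrapper.
$html = '';
if ($markup_changed) {
  $body = $doc->getElementsByTagName('body')->item(0);
  foreach ($body->childNodes as $child) {
    $html .= $doc->saveHTML($child);
  }
}
```

The resulting string is what would be assigned back to the node body before saving.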

Summary

All in all, Migrate + QueryPath make a very powerful toolbox for performing content migrations, particularly when a degree of massaging is required.

Posted by lee.rowlands
Senior Drupal Developer

Comments

Comment by David Stanley

Awesome post. I wish I would have found this last week. I just migrated ~1700 posts from wordpress into Drupal 7. Everything went well, except the media isn't managed. I'd like to try it again - this time following the path you described. Can you elaborate on how you connected to the sql source and those steps? I used the WXR file, so I'm a bit in the dark. Thanks! (now back to reading the rest of the posts on this site - wow!)

Comment by lee.rowlands

Drupal lets you nominate additional database connections in your settings.php, and you can then switch between them. E.g. $databases['default'] is the default, but you can nominate others.
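As a rough illustration of that reply, a second connection in a Drupal 7 settings.php might look like the sketch below. The 'wordpress' key, the credentials and the choice of columns are placeholders assumed for this example, not details from the original migration.

```php
<?php
// settings.php: the existing Drupal connection stays under 'default'.
// The 'wordpress' key and every credential below are placeholders.
$databases['wordpress']['default'] = array(
  'driver' => 'mysql',
  'database' => 'wordpress_db',
  'username' => 'wp_user',
  'password' => 'secret',
  'host' => 'localhost',
  'prefix' => '',
);

// Inside a bootstrapped Drupal 7 site (for example in a migration class
// constructor) the second connection can then be used explicitly:
//   $query = Database::getConnection('default', 'wordpress')
//     ->select('wp_posts', 'p')
//     ->fields('p', array('ID', 'post_title', 'post_content'));
// or by switching the active connection with db_set_active('wordpress')
// and switching back afterwards with db_set_active().
```

Building the query from an explicit connection avoids having to remember to switch back to the default database afterwards.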

Comment by David Stanley

Thanks. I ended up doing the image manipulation after the wordpress import. Just grabbed each node, loaded it as an entity, created a file object, attached it to the node, and then saved. Along the way I edited the body text and some other stuff. Thanks for the great article -- I would not have known where to start without it.

Comment by Si

Hi Lee, this is great, and I've modified somewhat to embed Media module JSON, rather than straight links.

What I am finding is if I have html entities in the link text (eg nbsp, amp) then I have some issues with some querypath operations breaking due to the way the utf-8 conversion renders them. Can you expand on the purpose of the UTF-8 conversion? I'm assuming there's a good reason for it and I've been working around it by manually cleaning my data before running operations such as $querypath->replaceWith()

Comment by lee.rowlands

Hi Si, yeah, I've found that QueryPath (as with most PHP DOM tools) chokes on malformed HTML. If you have the full HTML, including the encoding, then it might be better to let QueryPath read that from the HTML, especially if it is well formed.

Comment by Chris

Hi Lee,

Thanks for the informative post. Can you please elaborate where you executed the calls to QueryPath: was this under prepareRow(), or prepare()?

Thank you.
