Creating a custom process plugin for Drupal 8 Migrate

By Kevin, June 28th, 2019

I've done a lot of migrations in Drupal 5, 6, and 7. Migrating into 5 and 6 was pretty rough. Version 5 basically required raw scripts and chunked loops, and so did 6. By the time I had to do migrations for projects again, I was well into developing projects in Drupal 7.

At that point there was a tool available in the community for making migrations into the platform easier. Community veteran Mike Ryan provided the pathway with the Migrate Framework. I learned a lot from this module. I also did a lot of major work in this period. One of the biggest Drupal 7 migrations I did involved moving a 10-year-old MSSQL database, full of hundreds of thousands of records, into Drupal Commerce for Sports at the Beach. Another involved migrating ~18 websites made up of static HTML, ASP.NET Classic and WordPress into CSV (using a Ruby script to scrape the sites into a CSV!), then into Google Sheets, and then using the Google Sheet itself as the source to migrate into Drupal 7 for Erickson Living.

Not one byte of data was missed, and it was largely due to the fantastic tools provided by Mike and other contributors over the years.

The main idea behind this Migrate framework now resides in Drupal 8 core as an official module. Migration definitions are done in YAML files, and like before, your main job is to define a source, destination, and process. Source contains the information about the data source and any other configuration for connecting to or dealing with it. Destination contains information about where the data will be stored (node, user, taxonomy, media, etc). Process contains the information about how to map source data to fields and properties on the destination entity.

The Migrate module also provides a plugin pipeline system for transforming and mapping data. This is where you will find a lot of delight and sheer power in the Migrate module. I will admit, writing definitions entirely in YAML files, versus writing PHP classes (Drupal 7), took me time to get used to. After about 2 weeks of trial and error, I had a much better handle on it. I have spent the last few months working with it on a project, and in a few weeks will have a new site launched that will contain 200,000 nodes and users comprising over 20 years of content and data, without missing a single thing.

Here is a practical example that happened today. One of the migrations uses a JSON API as a migration source. The application uses the URL to fetch the JSON response, and begins iterating over the data with the JSON data parser plugin. This response has 36,000 items. Among the data are PDFs that we want to capture, but there was a problem. As of today, the web server hosting the files had stopped responding to http requests and would only respond to https requests. This means that the plugin that downloads the PDF files to the local file system is going to fail, unless we can tell it to use https.

I wound up writing a new process plugin called "protocol_switch". This plugin's only job would be to rewrite any instance of http:// to https://. There are two different cases here. One of the source fields provides the URL to a published PDF, but the other field provides an array of arrays with named keys pointing at URLs and dates. So the plugin needs to handle both cases.

Before we jump to the code, first we write up a unit test to cover the cases we can think of that the plugin needs to cover:

  • Given a string, the plugin should convert a 'from' to the 'to' value in the string
  • Given an array, the plugin should loop and convert a 'from' to the 'to' value in each string
  • Given an array with named keys, the plugin should loop and convert a 'from' to the 'to' value in each string
  • No configuration given should throw an exception
  • Bad configuration (i.e., saying http instead of http://) should throw an exception
  • String replacement should be case insensitive

<?php

namespace Drupal\Tests\mymodule_migration\Unit\Plugin\migrate\process;

use Drupal\mymodule_migration\Plugin\migrate\process\ProtocolSwitch;
use Drupal\Tests\migrate\Unit\process\MigrateProcessTestCase;

/**
 * Tests the protocol_switch process plugin.
 *
 * @group mymodule_migration
 * @coversDefaultClass \Drupal\mymodule_migration\Plugin\migrate\process\ProtocolSwitch
 */
class ProtocolSwitchTest extends MigrateProcessTestCase {

  /**
   * {@inheritdoc}
   */
  protected function setUp() {
    $this->configuredPlugin = new ProtocolSwitch(['from' => 'http://', 'to' => 'https://', 'key' => 'url'], 'protocol_switch', []);
    $this->misconfiguredPlugin = new ProtocolSwitch(['from' => 'http', 'to' => 'https'], 'protocol_switch', []);
    $this->noConfigPlugin = new ProtocolSwitch([], 'protocol_switch', []);
    parent::setUp();
  }

  /**
   * Data provider for testProtocolSwitched().
   *
   * @return array
   *   An array containing input values and expected output values.
   */
  public function urlDataProvider() {
    return [
      [
        'http://www.google.com',
        'https://www.google.com',
      ],
      [
        'HTTP://www.google.com',
        'https://www.google.com',
      ],
      [
        'https://httpwww.google.com',
        'https://httpwww.google.com',
      ],
      [
        [
          'http://www.google.com',
          'http://www.google.com',
          'http://www.google.com',
        ],
        [
          'https://www.google.com',
          'https://www.google.com',
          'https://www.google.com',
        ],
      ],
    ];
  }

  /**
   * Data provider for testProtocolSwitchedArray().
   *
   * @return array
   *   An array containing input values and expected output values.
   */
  public function urlArrayDataProvider() {
    return [
      [
        [
          'date' => '2019-01-01',
          'url' => 'http://www.google.com',
        ],
        [
          'date' => '2019-01-01',
          'url' => 'https://www.google.com',
        ],
      ],
      [
        [
          'date' => '2019-01-01',
          'url' => 'HtTp://www.google.com',
        ],
        [
          'date' => '2019-01-01',
          'url' => 'https://www.google.com',
        ],
      ],
    ];
  }

  /**
   * Tests that strings and flat arrays of strings are converted.
   *
   * @param string|array $input
   *   The input value.
   * @param string|array $expected
   *   The expected output.
   *
   * @dataProvider urlDataProvider
   */
  public function testProtocolSwitched($input, $expected) {
    $output = $this->configuredPlugin->transform($input, $this->migrateExecutable, $this->row, 'destinationproperty');
    // PHPUnit takes the expected value first, then the actual value.
    $this->assertSame($expected, $output);
  }

  /**
   * Tests that arrays with named keys are converted.
   *
   * @param array $input
   *   The input value.
   * @param array $expected
   *   The expected output.
   *
   * @dataProvider urlArrayDataProvider
   */
  public function testProtocolSwitchedArray(array $input, array $expected) {
    $output = $this->configuredPlugin->transform($input, $this->migrateExecutable, $this->row, 'destinationproperty');
    // Expected value first, then the actual value.
    $this->assertArrayEquals($expected, $output);
  }

  /**
   * Tests that missing configuration throws an exception.
   *
   * @param string|array $input
   *   The input value.
   * @param string|array $expected
   *   The expected output (unused; supplied by the data provider).
   *
   * @dataProvider urlDataProvider
   *
   * @expectedException \Drupal\migrate\MigrateException
   */
  public function testNoConfigurationThrowsException($input, $expected) {
    $this->noConfigPlugin->transform($input, $this->migrateExecutable, $this->row, 'destinationproperty');
  }

  /**
   * Tests that bad configuration (no "://") throws an exception.
   *
   * @param string|array $input
   *   The input value.
   * @param string|array $expected
   *   The expected output (unused; supplied by the data provider).
   *
   * @dataProvider urlDataProvider
   *
   * @expectedException \Drupal\migrate\MigrateException
   */
  public function testBadConfigurationThrowsException($input, $expected) {
    $this->misconfiguredPlugin->transform($input, $this->migrateExecutable, $this->row, 'destinationproperty');
  }
}

This looks good. We create a few plugin instances with different configurations to test the cases outlined above.

Hopping over to the implementation, the plugin class itself is pretty straightforward:

<?php

declare(strict_types = 1);

namespace Drupal\mymodule_migration\Plugin\migrate\process;

use Drupal\migrate\MigrateException;
use Drupal\migrate\ProcessPluginBase;
use Drupal\migrate\MigrateExecutableInterface;
use Drupal\migrate\Row;

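/**
 * Replaces one URL protocol with another in a string or array of strings.
 *
 * Configuration keys:
 * - from: The protocol to replace, including "://".
 * - to: The replacement protocol, including "://".
 * - key: (optional) For arrays of arrays, the key holding the URL.
 *
 * @MigrateProcessPlugin(
 *   id = "protocol_switch",
 *   handle_multiples = FALSE
 * )
 */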
class ProtocolSwitch extends ProcessPluginBase {

  /**
   * {@inheritdoc}
   */
  public function transform($value, MigrateExecutableInterface $migrate_executable, Row $row, $destination_property) {
    // Compare strpos() strictly against FALSE: a match at offset 0 is falsy,
    // so a bare "!" check would misreport it as missing.
    if (!isset($this->configuration['from'])) {
      throw new MigrateException('No "from" value specified for the protocol_switch plugin.');
    }
    elseif (strpos($this->configuration['from'], '://') === FALSE) {
      throw new MigrateException('"://" not specified for the "from" value of the protocol_switch plugin.');
    }

    if (!isset($this->configuration['to'])) {
      throw new MigrateException('No "to" value specified for the protocol_switch plugin.');
    }
    elseif (strpos($this->configuration['to'], '://') === FALSE) {
      throw new MigrateException('"://" not specified for the "to" value of the protocol_switch plugin.');
    }

    return (is_array($value)) ? $this->process($value) : $this->convert($value);
  }

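  /**
   * Converts every item of an array.
   *
   * If "key" is configured and an item is an array containing that key, the
   * item is replaced by the converted value stored under the key.
   *
   * @param array $value
   *   The source values.
   *
   * @return array
   *   The converted values.
   */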
  protected function process(array $value) : array {
    foreach ($value as $i => $item) {
      if (isset($this->configuration['key']) && isset($item[$this->configuration['key']])) {
        $item = $item[$this->configuration['key']];
      }

      $value[$i] = $this->convert($item);
    }

    return $value;
  }

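  /**
   * Replaces "from" with "to" in a single string, ignoring case.
   *
   * @param string $value
   *   The source string.
   *
   * @return string
   *   The converted string.
   */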
  protected function convert(string $value) : string {
    return str_ireplace($this->configuration['from'], $this->configuration['to'], $value);
  }

}

Regardless of what or how it is passed in, the plugin class will return a string or array containing strings with values replaced. All tests pass, so we are good to roll!
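To make the three input shapes concrete, here is a minimal standalone sketch of the same conversion logic, runnable outside Drupal. The function name protocol_switch_sketch() and its signature are hypothetical and exist only for illustration. It also adds an explicit is_array() check that the plugin above does not have.

```php
<?php

declare(strict_types = 1);

/**
 * Standalone approximation of the plugin's convert() and process() methods.
 *
 * Hypothetical helper for illustration only; it is not part of Drupal.
 */
function protocol_switch_sketch($value, string $from, string $to, ?string $key = NULL) {
  // Case-insensitive replacement on a single string.
  $convert = function (string $item) use ($from, $to): string {
    return str_ireplace($from, $to, $item);
  };

  if (!is_array($value)) {
    return $convert($value);
  }

  foreach ($value as $i => $item) {
    // If a key is configured and the item is an array holding that key, only
    // the value stored under the key is kept (assumption: explicit is_array()
    // check added here for clarity).
    if ($key !== NULL && is_array($item) && isset($item[$key])) {
      $item = $item[$key];
    }
    $value[$i] = $convert($item);
  }

  return $value;
}

// A single URL string.
$single = protocol_switch_sketch('HTTP://example.com/a.pdf', 'http://', 'https://');
// $single is 'https://example.com/a.pdf'.

// An array of arrays with a "url" key: each item collapses to its URL.
$revisions = protocol_switch_sketch(
  [['date' => '2019-01-01', 'url' => 'http://example.com/r1.pdf']],
  'http://',
  'https://',
  'url'
);
// $revisions is ['https://example.com/r1.pdf'].
```

Note that for an array of arrays the date is dropped and only the converted URL remains, which is the shape a downstream file download step needs.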

To apply this to our migration, we just need to slot our new plugin into the process pipeline for each field with our configuration:

field_published_paper:
  - plugin: protocol_switch
    from: 'http://'
    to: 'https://'
    source: published_pdf
  - plugin: file_import
    destination: 'private://published'
field_paper_revisions:
  - plugin: protocol_switch
    from: 'http://'
    to: 'https://'
    key: 'url'
    source: pdf_revisions
  - plugin: file_import
    destination: 'private://revisions'

In each case, the 'source' is either a URL string or an array containing URLs. The plugin modifies the values and passes the result to the next plugin in the chain, file_import. The result was a fully passing migration for thousands of records with no errors and no exceptions, averting what was almost a show-stopping Apache issue. All in all, it took roughly an hour to put together, most of it spent thinking through and writing the test cases. There are some simple refactors that can be targeted now too, like moving the configuration checks to the constructor, but its current state gets us there for now.
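As a rough idea of what the constructor refactor could look like, here is a standalone sketch. The class name ProtocolSwitchSketch is hypothetical, and it throws InvalidArgumentException only so that it runs without Drupal. In the real plugin, the class would extend ProcessPluginBase, accept the usual ($configuration, $plugin_id, $plugin_definition) arguments, call parent::__construct(), and throw MigrateException instead.

```php
<?php

declare(strict_types = 1);

/**
 * Hypothetical stand-in for the plugin that validates configuration up front.
 */
final class ProtocolSwitchSketch {

  /** @var string The protocol to replace, including "://". */
  private $from;

  /** @var string The replacement protocol, including "://". */
  private $to;

  public function __construct(array $configuration) {
    // Validate both values once, at construction time, instead of on every
    // transform() call.
    foreach (['from', 'to'] as $name) {
      if (!isset($configuration[$name])) {
        throw new InvalidArgumentException(sprintf('No "%s" value specified for the protocol_switch plugin.', $name));
      }
      if (strpos((string) $configuration[$name], '://') === FALSE) {
        throw new InvalidArgumentException(sprintf('"://" not specified for the "%s" value of the protocol_switch plugin.', $name));
      }
    }

    $this->from = (string) $configuration['from'];
    $this->to = (string) $configuration['to'];
  }

  /**
   * Replaces "from" with "to" in a single string, ignoring case.
   */
  public function convert(string $value): string {
    return str_ireplace($this->from, $this->to, $value);
  }

}
```

One consequence of this design: the unit test above builds its misconfigured plugins in setUp(), so with constructor validation those instances would have to be created inside the exception test methods instead.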

This method of plugin chaining and transforming is extremely flexible and powerful. At any time you can add new plugins or remove ones you don't need anymore and rerun migrations without having to track down classes composed of many methods and inheritance chains. It's all contained in the migrate definition file.

I wrote over a dozen custom plugins for this particular project to solve different issues like this and will be sharing more over the coming weeks. The Migrate / Migrate Plus modules are awesome.