
A DOM for PHP

PHP provides the function token_get_all() for working with PHP source code, but it is too low level and requires a lot of work to extract information about the source code. This article shows why token_get_all() is inadequate and introduces a higher-level approach to working with source code using Pharborist.

by cameron.zemek

Problem

In porting the Token module from Drupal 7 to Drupal 8, a lot of the changes were mundane, such as changing calls from element_children($element) to Element::children($element) and adding a use declaration for the Element class. There are a lot of these changes to make in Drupal 8 core, and other PHP code bases could also benefit from tools that help automate such changes.

Tokens

token_get_all() is the first step in understanding PHP source code, and is what is called a lexer. It groups the individual characters of the source into a sequence of tokens. For example, let's run token_get_all() over the source:

<?php
$x = 2 * calculate($a, $b);

The result is:

Array(
    [0] => Array(
            [0] => T_OPEN_TAG
            [1] => <?php

        )
    [1] => Array(
            [0] => T_VARIABLE
            [1] => $x
        )
    [2] => Array(
            [0] => T_WHITESPACE
            [1] =>  
        )
    [3] => =
    [4] => Array(
            [0] => T_WHITESPACE
            [1] =>  
        )
    [5] => Array (
            [0] => T_LNUMBER
            [1] => 2
        )
    [6] => Array(
            [0] => T_WHITESPACE
            [1] =>  
        )
    [7] => *
    [8] => Array(
            [0] => T_WHITESPACE
            [1] =>  
        )
    [9] => Array (
            [0] => T_STRING
            [1] => calculate
        )
    [10] => (
    [11] => Array(
            [0] => T_VARIABLE
            [1] => $a
        )
    [12] => ,
    [13] => Array(
            [0] => T_WHITESPACE
            [1] =>  
        )
    [14] => Array(
            [0] => T_VARIABLE
            [1] => $b
        )
    [15] => )
    [16] => ;
)
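The listing above shows token names rather than the raw values token_get_all() returns: each array token actually holds an integer id, the token text, and a line number, and token_name() maps the id to a readable name. A minimal sketch of how a similar listing could be produced is below; describe_tokens() is a hypothetical helper written for this illustration, not part of PHP.

```php
<?php
// The source to tokenize. A nowdoc keeps $x, $a and $b literal.
$source = <<<'PHP'
<?php
$x = 2 * calculate($a, $b);
PHP;

/**
 * Hypothetical helper: returns one readable string per token.
 */
function describe_tokens(string $source): array {
  $lines = [];
  foreach (token_get_all($source) as $token) {
    if (is_array($token)) {
      // Array tokens hold [0] an integer id, [1] the text, [2] the line number.
      $lines[] = token_name($token[0]) . ' ' . var_export($token[1], true);
    }
    else {
      // Single-character tokens such as = or ; are returned as plain strings.
      $lines[] = $token;
    }
  }
  return $lines;
}

echo implode(PHP_EOL, describe_tokens($source)), PHP_EOL;
```

Note that single-character tokens carry no type or line information at all, which is one more thing code built directly on token_get_all() has to handle.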

The issue with using just token_get_all() to work with PHP source code is that a token can be used in multiple different contexts. For example, to find all procedural function calls you might think to look for a T_STRING followed by '(' while ignoring T_WHITESPACE, T_COMMENT, and T_DOC_COMMENT. This is not correct because it also matches object method calls (e.g. $obj->myMethod()) and class method calls (e.g. MyClass::staticMethod()). It also fails to match dynamic function calls such as $callback(). Even more problematic is dealing with keywords such as static, which is used in three different contexts: static variables, the static method modifier, and the static class keyword (e.g. static::$classProperty).
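To make the function call problem concrete, here is a small sketch of the naive matcher described above. naive_function_calls() is a hypothetical helper written for this illustration, and the sample source simply reuses the example calls from the paragraph.

```php
<?php
/**
 * Hypothetical helper implementing the naive rule: report every T_STRING
 * whose next significant token is "(".
 */
function naive_function_calls(string $source): array {
  $ignored = [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT];
  // Drop the tokens the rule says to ignore, then re-index the array.
  $tokens = array_values(array_filter(
    token_get_all($source),
    function ($token) use ($ignored) {
      return !is_array($token) || !in_array($token[0], $ignored, true);
    }
  ));
  $names = [];
  foreach ($tokens as $i => $token) {
    $next = $tokens[$i + 1] ?? null;
    if (is_array($token) && $token[0] === T_STRING && $next === '(') {
      $names[] = $token[1];
    }
  }
  return $names;
}

$source = <<<'PHP'
<?php
calculate($a, $b);
$obj->myMethod();
MyClass::staticMethod();
$callback();
PHP;

print_r(naive_function_calls($source));
```

The matcher reports calculate, myMethod and staticMethod, and says nothing about $callback(). Only the first of those is a procedural function call. Fixing this means inspecting the tokens before each T_STRING as well, which is the kind of context tracking a parser does for you.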

DOM

Imagine that instead of working with tokens, PHP had a Document Object Model, and you could work with PHP source code the way you work with the DOM or one of the many JavaScript libraries like jQuery. Pharborist is a library attempting to do this. The library parses the source code into a syntax tree that retains all whitespace and comments, so you can make changes to the tree and then write the source code back out exactly the same except for the change (and can therefore make a patch file with just the change). For example, to find procedural function calls in a source file:

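<?php
// Assumes Parser lives in the Pharborist namespace, like FunctionCallNode,
// and that the library is autoloaded (e.g. via Composer).
use Pharborist\Parser;

$tree = Parser::parseFile('example.php');
/** @var \Pharborist\FunctionCallNode $function_call */
foreach ($tree->find('\Pharborist\FunctionCallNode') as $function_call) {
  echo $function_call->getName() . ' called on line ' . $function_call->getSourcePosition()->getLineNumber() . PHP_EOL;
  // Fix a spelling mistake. getName() returns a node (setText() is called
  // on it below), so cast it to a string before comparing with ===.
  if ((string) $function_call->getName() === 'chcek_plain') {
    $function_call->getName()->setText('check_plain');
  }
}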
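
The reason a whitespace-preserving tree can produce a patch with just the change is that nothing from the original text is thrown away. As a rough illustration of that property using only token_get_all() (this is not how Pharborist is implemented, and tokens_to_source() is a hypothetical helper), concatenating the token texts reproduces the input exactly, so editing one token changes only that part of the output:

```php
<?php
/**
 * Hypothetical helper: rebuilds source text from token_get_all() output.
 */
function tokens_to_source(array $tokens): string {
  $out = '';
  foreach ($tokens as $token) {
    // Array tokens keep their text in [1]; other tokens are plain strings.
    $out .= is_array($token) ? $token[1] : $token;
  }
  return $out;
}

// Sample input with a comment and irregular spacing.
$source = "<?php\n// Escape the title.\n\$title  =  chcek_plain( \$raw );\n";
$tokens = token_get_all($source);

// Unmodified tokens reproduce the input byte for byte.
var_dump(tokens_to_source($tokens) === $source);

// Fix the typo in a single token. Everything else is left untouched.
foreach ($tokens as &$token) {
  if (is_array($token) && $token[0] === T_STRING && $token[1] === 'chcek_plain') {
    $token[1] = 'check_plain';
  }
}
unset($token);

echo tokens_to_source($tokens);
```

This sketch still matches on a bare T_STRING, so it has the context problems described in the Tokens section. A syntax tree keeps the same lossless property while also knowing that the name belongs to a function call.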

Conclusion

Pharborist is still in early alpha and is subject to API changes (in particular class and property names), so now is a great time to provide input and help out :)

Senior Drupal Developer