Drupal World

Simple, Safe, Fast: A New Tool for Data Cleaning

You can’t hear about data without hearing about safety nowadays. And rightly so! Ensuring that data — continue reading
Posted by Alexander Varwijk
August 7, 2018

You can’t hear about data without hearing about safety nowadays. And rightly so! Ensuring that data stays safe should be a priority for any online business. However, keeping data safe sometimes conflicts with other organizational goals. To ensure that bugs can be fixed quickly and feature development runs smoothly, developers rely on a realistic environment (as close to the original) for development and testing. We know that creating a realistic environment while keeping user data safe can be challenging, and that’s why we built the entity sanitizer module for data cleaning.

You don’t want to download production data to local developer machines or use it in online development environments for various safety reasons. To avoid using production data, you can create demo content for your platform instead that tries to emulate a production environment. This content can be used for development and testing. However, when the number of features increases, so do the different possibilities of interactions between those features. We’ve found that this can make it difficult to keep the demo content up to date or representative. This problem is even more intense for Open Social’s enterprise and support team because enterprise sites could have so many customizations that the default demo content is no longer an adequate testing environment.

While trying to keep our demo content up to date, we quickly realized that our users were sometimes creating interactions between different pieces of their platform that were difficult to capture in a demo content description. The only way to quickly reproduce their bug was to manually recreate their scenario. However, recreating their exact scenario was sometimes difficult and time-consuming. For quicker reproduction, we needed the structure of their production data. But how do we this safely? We needed to find a way to clean their data without the risk of exposing name or address information.

When you search in Drupal for “data sanitizing”, there are quite a few results that you’ll find. There was one thing that they had in common, which was that you are required to specify which data should be sanitized. We didn’t like this approach because it meant that if we missed an entity or a field in our description we risked having production data in our local environment.

Introducing the entity sanitizer module

We realized that Drupal is already aware of all the data that it manages, as well as how it is stored in the database. We used this to invert the process of ‘specifying what should be cleaned’ to ‘cleaning everything by default’ and specifying what shouldn’t be cleaned instead. Our results have been published on Drupal.org as the Entity Sanitizer module. By default, this module will query Drupal APIs to figure out which data is available and sanitize everything. The module does not have an interface but instead fully relies on drush. Simply enable the module and run the entity-sanitize (es) drush command.

Of course, if sanitizing a large database takes too long then developers will refuse to perform this crucial step. Loading and saving thousands of entities can be incredibly slow. To avoid this, we’ve chosen to work with raw database update queries which can be a lot faster. Ideally, when going this route you have only one query per database table (usually corresponding to a field).

Using raw SQL queries to update fields in place is possible, but it has a few trade-offs: we’re no longer database agnostic and no PHP can be used for individual field instances to create placeholder values.

In the Open Social installations that we run, we almost exclusively use SQL databases (MariaDB specifically) as a storage backend. Therefore, losing the ability to use the sanitizer for other databases was not a big problem for us. By setting up the module in a flexible enough manner (swappable plugins) we could always add support for other databases at a later date.

Find all the Drupal data

In Drupal 8, almost all data is stored in entities. These are either content entities or configuration entities. A few exceptions exist in the form of “simple configuration” (a configuration that’s not a configuration entity) but that is usually stored either as an environment variable or in the codebase already, so it can be ignored for sanitizing.  This makes the entity system a prime candidate to be used to find all the data that exists in a Drupal platform.

Most of the work to find all the fields as well as the tables where data is stored is done through the Sanitizer service of the entity_sanitizer module. First, we ask the EntityTypeManager for all the entities that it knows about. At this point, we look at whether the entity we found is a fieldable entity by checking if it implements the FieldableEntityInterface. Next, we ask the EntityFieldManager for all the storage definitions for the fields of the entity and the bundle.

Finally, we use our own implementation of the protected DefaultTableMapping::generateFieldTableName method to convert the field to a table name for the SQL query. The columns that should be altered are hardcoded in the field type plugins which are called by the Sanitizer service.

The check for a FieldableEntityInterface is necessary because fields that are added using the Field API are different than those stored on the entity itself (known as “base fields”). Unfortunately, our logic is currently only wired to work with non-base fields. We do have plans to improve this, which we could do by better utilizing the DefaultTableMapping class that is responsible for mapping fields to database tables.

Dynamic values without PHP

Not being able to use PHP to generate random values for our fields was slightly more problematic. For example, simply setting all string values as “Sanitized” would be very unhelpful for a developer that suddenly can’t distinguish two nodes from one another.

Thankfully MySQL has a few string manipulation methods that we can use. We concatenate the bundle, field name, field type, entity id, revision id, language code, and value delta together into a giant string, which we then use as the new field value. The combination of entity id, revision id, language code and delta value should always be unique.

In our StringSanitizer plugin this creates the query as follows:

“CONCAT_WS(‘ ‘, ‘Sanitized ‘, {$table_name}.bundle, ‘{$field_name} field of type {$this->getPluginId()}’, {$table_name}.entity_id, {$table_name}.revision_id, {$table_name}.langcode, {$table_name}.delta)”,.

Although this works well for creating unique strings for all field occurrences that are recognizable for a developer, they are fairly long. For example, for a user profile where both the first and last names are sanitized this could lead to a display name such as “Sanitized profile first_name field of type string 5 1 en 0 Sanitized profile last_name field of type string 5 1 en 0”.

We hope to fix the length problem in the future. One way we could achieve this is by creating a dictionary table in the database that only exists during the sanitization. The table could consist of an id column and a string column where each row is a single word of the Lorem Ipsum series of text. Inserting this can be done in a single static query. All update queries could then reference words in that table based on the entity id. Using offsets, queries can be created to select only one or multiple words based on the field type. By keeping all data operations in the database they can still be highly performant.

Handling missing local files

Although we were quite pleased with the sanitized data, one thing that could improve the local developer experience was to make sure that files that were uploaded on the platform worked locally. We didn’t want to simply copy all the files or set-up a proxy because the files may be private. Instead, we chose to create a submodule for the entity sanitizer project.


no data cleaning in Pachamama community

The entity_sanitizer_image_fallback module uses Drupal’s PublicStream and PrivateStream (if enabled) services to inject a wrapper around all of a website’s image requests. When an image request is made it will quickly figure out if a file is available locally or whether it would result in a 404 error. If it finds that the file is not available then it reroutes the request to its own FileController.

The FileController class uses PHP’s gd toolkit to draw a gray rectangle in the desired size with a black cross and the actual size of the image displayed in the center. The size that is requested is determined by the module’s stream wrapper and is usually determined by the requested image style.
Data cleaning used in Pachamama community

This module allows a website to look natural without having the local files present while still allowing the upload of new files for testing. As an added bonus, the generated images clearly show the size of a requested image. This allows us to spot places where a requested image size is much larger than what is actually required in the display.

Bringing it all together

We have started to add the entity sanitizer module to all the projects that we now set-up, as well as to all the existing projects we already had. By using Platform.sh, we can ensure that all environments that are not the production environment automatically get sanitized. This allows any developer to easily grab a copy of the development environment database. After locally enabling the entity_sanitizer_image_fallback module, they now have a working local environment up and running quickly.

As the module page lists at the time of writing, there are a few improvements that we want to make in the future. Reworking how we find the fields in the database is one of the first things on our list to ensure that base fields (or unfieldable entities) are also properly sanitized. Another change we will welcome is to implement the idea of a database dictionary to improve the usefulness and brevity of our sanitized strings. Finally, we want to provide users with a tool to whitelist certain entities/bundles/fields from sanitization so that data that is known to be safe could be preserved.

How do provide your developers with a safe and useful environment to develop in? How is your experience with the entity sanitizer module? Anything you would like to see improved? Let us know in the comments below!


feature list

Further reading

In this article we discuss

Related articles