Internals
Basics
Anonymization processus is based upon a single UPDATE SQL query, each anonymizer adds it own SET statements, and a few JOIN clauses if necessary.
Enum and multiple-column anonymizers are base implementations for anonymizers that use pre-generated sample data lists as source values for anonymizing your database. In order to make those efficient, the sample data lists are first inserted into a temporary table in database.
Once data is inserted into temporary tables, anonymization will use an SQL JOIN statement on the anonymizing SQL UPDATE query.
Joining on the target table to anonymize is only possible if you have an arbitrary row identifier to JOIN on. Without an identifier, RDBMS will optimise data-unrelated joins to a single selected row from the sample table, and all target table rows will have the same value.
In order to achieve this, the anonymizator will add a temporary integer column on each target table and populate it using a sequence, that will be dropped once anonymization is done.
Some anonymizers will generate their own SET statements instead of using a sample table join, they are not covered by this documentation page.
Targeted UPDATE query
Standard SQL
Considering that:
"client"is the target table the user wants to be anonymized,"sample_1"is a sample table, corresponding to a first anonymizer,"sample_2"is another sample table, corresponding to a second anonymizer,
the nearest SQL standard variant we target, which actually is for PostgreSQL is:
UPDATE
"client"
SET
"nom" = "sample_1"."value",
"civilite" = "sample_2"."value"
-- See note 1: self-join using FROM.
FROM "client" AS "_target_table"
LEFT JOIN (
SELECT
"value",
-- See note 2: where the random happens.
ROW_NUMBER() OVER (ORDER BY random()) AS "rownum"
FROM "sample_1"
-- See note 4: minor micro-optimisation.
LIMIT 171224
) AS "sample_1"
-- See note 1: we join over the FROM table, not the target one.
-- See note 3: MOD() function usage for joining.
ON MOD("_target_table"."_db_tools_id", 589) = "sample_1"."rownum"
LEFT JOIN (
SELECT
"value",
-- See note 2: where the random happens.
ROW_NUMBER() OVER (ORDER BY random()) AS "rownum"
FROM "sample_2"
-- See note 4: minor micro-optimisation.
LIMIT 171224
) AS "sample_2"
-- See note 1: we join over the FROM table, not the target one.
-- See note 3: MOD() function usage for joining.
ON MOD("_target_table"."_db_tools_id", 2) = "sample_2"."rownum"
WHERE
-- See note 1: self-join WHERE condition
"client"."_target_table" = "_target_table"."_db_tools_id" -- Pour le self-join
;SQL Standard
UPDATEquery cannotJOINover the updated table, we only can use a cross join using theFROMstatement. OtherJOINstatements can only target tables from theFROMclause.In order to work around this, we do a cartesian production of the target table with itself, using the anonymizer identifier as
JOINcondition in theWHEREclause.In order to randomly select samples from the sample table, we add an
ORDER BY random()clause.random()expression will vary depending upon the RDBMS implementation.Because we need to
JOINover the target table, otherwise most RDBMS would use a single row and apply it to all update row as an optimization, we select theROW_NUMBER()over the ordered by random partition. We will use this row number toJOINon the anonymizer identifier.Because the sample table in most case will have less rows than the target table, we need to reduce the anonymizer identifier to the sample table row count in order for each target row to have a corresponding sample value. This is done using the
MOD()(modulo) magic in theJOINcondition.When the target table has less rows than the sample table, we restrict each sample table
JOINLIMITstatement to the target table count. Table is counted prior to anonymization.
MySQL specifics
MySQL has many limitations:
it has a very particuliar way of joining in
UPDATEqueries allowing toJOINover the updated table directly,it does not optimize joining over a self table if the joined column has no index, it does not detect the redundancy and will execute a cartesian product of the target table over the target table, which explodes the complexity to a point where a single second SQL query on PostgreSQL will literally take hours to execute on MySQL.
In order to workaround this, we chose to create an index over the anonymizer identifier sequence, which wouldn't be necessary otherwise.
SQLite specifics
SQLite does not permit DDL statements that alter tables, it forces you to create a new table with the new schema, copy data, then remove the previous version and rename the new one.
In order to avoid data copy, we are going to use the rowid magic column instead for joining, which allows us to work without adding the unique identifier on tables.
Hence the following SQL variant:
UPDATE
"client"
SET
"nom" = "sample_1"."value",
"civilite" = "sample_2"."value"
FROM (
SELECT
*,
"rowid" AS "_db_tools_id"
FROM "client"
) AS "_target_table"
-- ...
WHERE
"client"."rowid" = "_target_table"."_db_tools_id"Which then allows all anonymizers working consistently with SQLite.
SQL Server specifics
When using the same table as the updated one in the FROM clause, SQL Server will shadow the updated table for the benefit of the one from the FROM clause, hence the generated UPDATE not working.
For working around this, SQL Server has its own variant which is:
UPDATE
"client"
SET
"nom" = "sample_1"."value",
"civilite" = "sample_2"."value"
FROM (
SELECT * FROM "client"
) AS "_target_table"
-- ...Which is semantically equivalent and solve the table reference shadowing issue.
Other variants
Only PostgreSQL and MySQL are extensively tested for now, other SQL dialects will be supported in the future.