or How Decentriq's Data Clean Rooms enable cross-company data collaboration in a GDPR-compliant way
[This blog has been co-authored by Anna Maria Tonikidou and Christian Meisser from the legal services firm LEXR]
Mergers and acquisitions (M&A) totalled almost $2.49 trillion in value in the first three quarters of 2019. Prominent examples include Saudi Aramco/Saudi Basic Industries Corporation, AbbVie/Allergan, Bristol-Myers Squibb/Celgene, and United Technologies/Raytheon.1 For any M&A, due diligence is essential to verify information about the two partners and to estimate the future value of the potentially combined company.2
Identifying the size of the shared customer base is an essential task done early in the process. At this stage, however, the two companies are usually not legally allowed to exchange their customer databases.
So far, the only workaround has been to hire a trusted third party who gathers the data from both companies and compares them. This comes with a lengthy legal process in itself, especially when the companies are in different jurisdictions and different privacy regulations apply.
What if the two companies could determine the size of their shared customer base without having to share their data with anyone?
In this blog, we will first introduce the concept of private set intersection as a generalization of the shared-customers problem above. We then show how its solution has been implemented in Decentriq's platform in a way that removes the need for a trusted third party (including Decentriq and any cloud provider). With Decentriq Data Clean Rooms, both parties provide their customer databases to the platform and receive very specific security and privacy guarantees: provably, nobody (not even Decentriq) can access their unencrypted data, and only the size of the shared customer base is output. While we focus here on private set intersection, Decentriq's platform extends to many other use-cases, such as combining data from multiple hospitals and helping brands share data with their ad partners.
Together with the legal services company LEXR, we will argue that this and similar processing in the Decentriq platform is in line with the General Data Protection Regulation (GDPR) because individual-level data are not shared.
Private set intersection
A private set intersection (PSI) is the process of determining the intersection of two or more datasets (think lists of customer names) without revealing any of the data to anyone. In the M&A case described above, this means calculating the number of shared customers of two companies without disclosing any customer information to any of the companies or any third party.
This situation is illustrated below. The customer databases are exemplified on the left and the right, while the output is the number of shared customers, which in this case is two. As the set intersection should be private, the two customer databases must be kept confidential at all times.
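To make the goal concrete, here is a toy illustration (all names made up) of the only thing a PSI is allowed to reveal: the size of the overlap, never the records themselves.

```python
# Toy illustration of a PSI's permitted output: only the size of the
# overlap is revealed, never the shared records themselves.
company_a = {"Freddie Mercury", "Brian May", "Roger Taylor", "John Deacon"}
company_b = {"Brian May", "John Deacon", "David Bowie"}

# In a real PSI, neither party ever sees the other's set; only this
# count may leave the computation.
shared_customer_count = len(company_a & company_b)
print(shared_customer_count)  # → 2
```

The whole difficulty of PSI is obtaining exactly this count without any party (or third party) ever holding both sets in the clear.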
This is not a trivial task. Using a trusted third party comes with the lengthy processes and costs discussed above. To avoid third parties, hashing approaches have traditionally been applied. Unfortunately, none of them is satisfactory:
- Naïve approaches apply the same hash function to the names in both databases, exchange the results and compare the hashes.3 Identical names produce the same hash and can thus be identified as shared. However, as each party knows the hashes of its own customers, it can also infer the names of the shared customers. This can already represent a violation of local privacy laws.
- More involved approaches use double-hashing techniques. These are more complicated, susceptible to privacy attacks and, most importantly, still fail in the common case of slight differences in the names – think "Freddy Mercury" in one database vs "Fred Mercury" in the other.
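A short sketch shows why the naïve approach leaks (names and sets are made up for illustration): with a common public hash function, each party can trivially re-identify which of its own customers matched, and a single spelling difference defeats the match.

```python
import hashlib

def h(name: str) -> str:
    # Both parties agree on the same public hash function.
    return hashlib.sha256(name.encode()).hexdigest()

ours = {"Freddy Mercury", "Brian May"}
# Hashes received from the other party:
theirs_hashes = {h("Fred Mercury"), h("Brian May")}

# Because we know the hashes of our own customers, a match reveals
# exactly *which* customer is shared -- the intersection is not private.
shared = {name for name in ours if h(name) in theirs_hashes}
print(shared)  # {'Brian May'}

# And a slight spelling difference breaks the match entirely:
assert h("Freddy Mercury") != h("Fred Mercury")
```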
New developments come to the rescue. Recent advances in hardware-based cryptography enable new, strictly superior solutions to the private set intersection problem.
The key to privacy-preserving PSI is encrypted in an enclave
The Decentriq platform leverages Intel’s Software Guard Extensions (Intel SGX) technology to create so-called secure enclave programs. These are isolated computer programs which can provide additional security and privacy guarantees even when running on public cloud infrastructure.
The figure below illustrates the situation. The Decentriq users Mobiliar and Ringier want to compute their shared customer base in a privacy-preserving and GDPR-compliant way. First, they define the computations to be performed and the participants, along with their permissions, in the Data Clean Room. Once the Data Clean Room is published, any connecting party receives the relevant security proofs before establishing a connection. Then they locally encrypt their customer databases and submit them into the secure enclave. Provably, this particular secure enclave is the one and only program that can ever decrypt this data. In the enclave, the identifiers are matched, and the number of shared customers in a given time frame is sent back to both Mobiliar and Ringier.
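The flow can be sketched end to end. Everything below is a stand-in (the "encryption" is mocked with a keyed hash, and the key and names are made up) meant only to mirror the shape of the protocol, not Decentriq's actual API: each party encrypts locally, submits, and only an aggregate count leaves the matching step.

```python
import hashlib

# Illustrative sketch of the clean-room flow -- not Decentriq's API.
# "Encryption" is mocked with a keyed hash; a real deployment encrypts
# to a key that only the remotely attested enclave holds.

def local_encrypt(rows, enclave_key):
    # Runs on each party's own machine before any data leaves it.
    return {hashlib.sha256((enclave_key + row).encode()).hexdigest() for row in rows}

def enclave_match(blob_a, blob_b):
    # Conceptually runs inside the enclave: only a count is released.
    return len(blob_a & blob_b)

ENCLAVE_KEY = "key-known-only-to-the-attested-enclave"  # assumption
mobiliar_blob = local_encrypt(["Anna Meier", "Paul Weber"], ENCLAVE_KEY)
ringier_blob = local_encrypt(["Paul Weber", "Lara Huber"], ENCLAVE_KEY)
print(enclave_match(mobiliar_blob, ringier_blob))  # → 1
```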
The technical details are postponed to the appendix section "How to provide the security guarantees"; for now, it suffices to say that the use of Decentriq Data Clean Rooms gives Mobiliar and Ringier the following security and privacy guarantees:
- Only the particular enclave Mobiliar and Ringier are connected to can decrypt their customer databases.
- Nobody can access the decrypted data, including Decentriq and potential infrastructure providers such as cloud providers.
- The secure enclave only outputs privacy-preserving aggregate statistics such as the number of shared customers.
Using the Decentriq Data Clean Rooms provides Mobiliar and Ringier with a simple and safe way of performing the private set intersection. Compared to other approaches, it requires neither a trusted third party nor complicated protocols, while also making it possible to use more sophisticated matching algorithms (fuzzy matching) and to output additional privacy-preserving statistics. Crucially, as long as the above guarantees hold and the output is non-personal data (e.g. the number of shared customers), the described use of Decentriq Data Clean Rooms is in line with the GDPR. This is discussed in detail in the following section written by Anna Maria Tonikidou and Christian Meisser from the legal services firm LEXR.
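As a sketch of what fuzzy matching can mean here, the snippet below matches names by similarity ratio using Python's standard `difflib`; the threshold is an assumption, and production systems would use more robust record-linkage techniques.

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    # Similarity ratio in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

db_a = ["Freddy Mercury", "Brian May", "Roger Taylor"]
db_b = ["Fred Mercury", "Brian May", "David Bowie"]

# "Freddy Mercury" and "Fred Mercury" now count as the same customer,
# which exact hashing could never achieve.
shared = sum(any(similar(a, b) for b in db_b) for a in db_a)
print(shared)  # → 2
```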
Why using the Decentriq platform is in line with GDPR
The term 'personal data' is the entryway to the application of the GDPR. 'Personal data' is defined in Article 4 (1) GDPR as any information relating to an identified or identifiable natural person. Such a person is referred to as a data subject. The data subjects are identifiable if they can be directly or indirectly identified. The definition of personal data is based on the realistic risk of identification, and the applicability of data protection rules should be based on risk of harm and likely severity.4
According to Recital 26 (5) GDPR, the principles of data protection should not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.
In contrast to anonymous information, there is no mention of the qualification of encrypted information in the GDPR, and so far, no EU/EEA court has explicitly decided whether encrypted data is personal or not. However, the highest agency for data protection regulation in Bavaria (Landesamt für Datenschutzaufsicht) has concluded that encrypted data does not fall under the category of personal data, under the premise that it is encrypted with strong state-of-the-art cryptographic methods.5
Whether encrypted data are personal data therefore depends on the circumstances, particularly on the means reasonably likely to be used (fair or foul) to re-identify individuals.6 Factors affecting encrypted data's security against decryption include the following:
- Strength of encryption method (the algorithm's cryptographic strength)
- Key management, such as security of decryption key storage, and key access control.7
Under WP136, ‘anonymised’ data may be considered anonymous in a provider’s hands if ‘within the specific scheme in which those other controllers (e.g. providers) are operating, re-identification is explicitly excluded and appropriate technical measures have been taken in this respect’.8
According to the UK Information Commissioner's Office (ICO), if (i) a party has encrypted personal data itself and (ii) is responsible for managing the key, it is processing data covered by the GDPR, since it has the ability to re-identify individuals through decryption of that dataset.9 On that basis, Hon/Millard/Walden suggest that if a party cannot view data, it cannot identify data subjects, and therefore identification may be excluded by excluding others from being able to access or read data.10 By analogy with key-coded data, to the person encrypting personal data, such as a cloud user with the decryption key, the data remain 'personal data'.11 However, in another person’s hands, such as a cloud-based platform provider storing encrypted data without access to the key and no means 'reasonably likely' to be used for decryption, the data may be considered anonymous.12 This removes cloud providers from the scope of data protection legislation, at least where data have been strongly encrypted by the controller before transmission, and the provider cannot access the key.
With encryption, many of the parties who process the data do not hold the encryption key; the key stays with the generator of the data. This is the case with the Decentriq Data Clean Rooms, meaning that encryption here bears similarities to the effects of anonymization, as Decentriq has no means of reversing the process to access the raw data. In fact, Decentriq has no way of knowing whether personally identifiable information is contained in the datasets transferred to the platform, and as such it would be impossible to define the scope of processing within a data processing agreement with its clients. Decentriq also has no better chance of accessing the data than any outside party without the key. The strong encryption used by Decentriq's Data Clean Rooms therefore has effects similar to anonymization, i.e. it renders personal data in the sense of the GDPR into non-personal data from the point of encryption.
As a result of the above, for all intents and purposes, the Decentriq platform as a host of encrypted data is not processing personal data under the definition of the GDPR. Decentriq cannot access that data, and even if its servers were breached, data subjects would be at little risk from a privacy standpoint since the data would also be unintelligible to the wrongdoers.
In this blog, we have introduced the private set intersection problem and motivated it with the use-case of a potential merger of two companies, where the number of shared customers should be computed privately.
We have argued that traditional approaches to this problem are not satisfactory and that new technologies such as Intel SGX enable strictly superior solutions. One such solution is Decentriq’s Data Clean Rooms platform which enables provably privacy preserving computation on data.
We argued that the use of Decentriq's Data Clean Rooms is in line with the GDPR, even when the computation is performed on personally identifiable data, as in the outlined case. Even though we have used the example of private set intersection, this generalizes to the many other confidential computing use-cases supported by the Decentriq platform.
If you are interested in learning more about Decentriq's Data Clean Rooms and receiving the full GDPR assessment done by LEXR, reach out to firstname.lastname@example.org
Appendix - How to provide the security guarantees
This section covers the technical details of performing a private set intersection with Decentriq's Data Clean Rooms and describes how the security and privacy guarantees can be achieved.
Coming back to the security and privacy guarantees, the following points must be ensured to achieve them:
- The secure enclave program must only compute the number of shared customers and delete all input data afterwards.
- The cryptographic keys Mobiliar and Ringier use to encrypt their customer databases must only be known to the particular secure enclave they are connected to.
- The data decrypted by the secure enclave must not be accessible to anyone, including administrators with potential access to the operating system, such as Decentriq and infrastructure providers.
This can be achieved using Intel’s SGX technology. Already in place for seven years, the technology is available on most modern Intel CPUs. As outlined in much more detail in this blog, Intel SGX-based secure enclave programs are founded on two main security pillars.
As a first security pillar, a process called remote attestation allows a user to confirm, for a remotely running program: i) that it is a secure enclave program; ii) its program logic (source code); iii) a cryptographic key that enables encrypting data in such a way that it can only be decrypted by that particular program. A user performs remote attestation by inspecting a particular piece of data received from the enclave; in the figure above, these data are indicated as security proofs. Using these security proofs, i) is confirmed by checking a cryptographic signature that can only be obtained by a secure enclave program; ii) is checked by comparing a hash of the secure enclave's source code to an expected value; and iii) is achieved by using a public key sent as part of the security proofs, whose private counterpart is known only to the enclave (it was randomly generated when the secure enclave was started). As remote attestation allows Mobiliar and Ringier to verify which program is running remotely and to encrypt data only for this particular secure enclave, it satisfies points 1 and 2.
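The three attestation checks can be sketched as follows. The HMAC-based "signature", the key names and the measurement value are all stand-ins (real SGX attestation relies on an Intel-rooted signature chain, not a shared key), but the three checks map one-to-one onto i)–iii) above.

```python
import hashlib
import hmac

# Stand-ins for the root of trust and the audited enclave code measurement.
ATTESTATION_ROOT_KEY = b"stand-in-for-intel-root-of-trust"   # assumption
EXPECTED_MEASUREMENT = hashlib.sha256(b"audited enclave source code").hexdigest()

def sign(measurement: str, enclave_public_key: str) -> str:
    # Stand-in for the signature only a genuine secure enclave can obtain.
    payload = (measurement + enclave_public_key).encode()
    return hmac.new(ATTESTATION_ROOT_KEY, payload, hashlib.sha256).hexdigest()

def verify_security_proofs(proof: dict) -> str:
    # i) check the signature proving this is a genuine secure enclave
    expected = sign(proof["measurement"], proof["enclave_public_key"])
    if not hmac.compare_digest(expected, proof["signature"]):
        raise ValueError("not a genuine secure enclave")
    # ii) check that the enclave runs exactly the expected source code
    if proof["measurement"] != EXPECTED_MEASUREMENT:
        raise ValueError("unexpected enclave code")
    # iii) return the key for encrypting data only this enclave can decrypt
    return proof["enclave_public_key"]

proof = {
    "measurement": EXPECTED_MEASUREMENT,
    "enclave_public_key": "enclave-generated-public-key",
    "signature": sign(EXPECTED_MEASUREMENT, "enclave-generated-public-key"),
}
print(verify_security_proofs(proof))  # key to encrypt the databases with
```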
As a second security pillar, enclave memory isolation and enclave memory encryption protect the data sent into the secure enclave even from potential system administrators. In order to process data, a CPU must read data from memory. In traditional computing, these data must be unencrypted in order for the CPU to perform computation on them. With Intel SGX, the CPU can read encrypted data from memory because a dedicated decryption/encryption chip inside the CPU handles the memory access of secure enclave programs. The encryption/decryption is done on the fly within the CPU itself when enclave data or code is leaving/entering the processor package. As a final protection, an additional layer of memory address translation prevents access by non-enclave programs to the secure enclave’s memory. Together, enclave memory encryption and enclave memory isolation satisfy point 3.
We hope that this section has shed some light on how the security and privacy guarantees can be achieved. The underlying technical details are quite involved: leveraging the powerful Intel SGX technology requires expert knowledge, and Decentriq specializes in exactly that. If you have any questions regarding the security or use of the Decentriq Data Clean Rooms, reach out to email@example.com
4) Ustaran E, European Data Protection Law and Practice, 44.
5) Tätigkeitsbericht 2017/18 - Bayerisches Landesamt für Datenschutzaufsicht, 89.
6) Mourby M, Are pseudonymized data always personal data? Implications of the GDPR for administrative data research in the UK, in Computer Law & Security Review, 2018, Vol. 34, 224.
8) Opinion 4/2007 on the concept of personal data, WP136 (2007).
10) Hon/Millard/Walden, The problem of 'personal data' in cloud computing: what information is regulated? – the cloud of unknowing, in International Data Privacy Law, 2011, Vol. 1, No. 4, 219.