or How decentriq enables cross-company data collaboration in a GDPR-compliant way
[This blog has been co-authored by Anna Maria Tonikidou and Christian Meisser from the legal services firm LEXR]
Mergers and acquisitions (M&A) totalled almost $2.49 trillion in value in the first three quarters of 2019. Some prominent examples are Saudi Aramco/Saudi Basic Industries Corporation, AbbVie/Allergan, Bristol-Myers Squibb/Celgene, and United Technologies/Raytheon1. For any M&A, due diligence is essential to confirm information about the two partners and to estimate the future value of the potentially combined company2.
Identifying the size of the shared customer base is an essential task done early in the process. At this stage, however, the two companies are usually not legally allowed to exchange their customer databases.
So far, the only workaround has been to hire a trusted third party who gathers the data from both companies and compares it. This comes with a lengthy legal process of its own, especially when the companies are in different jurisdictions and different privacy regulations apply.
What if the two companies could determine the size of their shared customer base without having to share their data with anyone?
In this blog, we first introduce the concept of private set intersection as a generalization of the shared-customers problem above. We then show how its solution has been implemented in decentriq’s avato platform in a way that removes the need for a trusted third party (including decentriq itself). With avato, both parties submit their customer databases to the platform and receive strong security and privacy guarantees: provably, nobody (not even decentriq) can access their unencrypted data, and only the size of the shared customer base is output. While we focus here on private set intersection, avato extends to many other use-cases, in particular privacy-preserving machine learning.
Together with the legal services company LEXR, we will argue that this and similar processing in avato is in line with the General Data Protection Regulation (GDPR) because individual-level data are not shared.
Private set intersection
A private set intersection (PSI) is the process of determining the intersection of two or more datasets (think lists of customer names) without revealing any of the data to anyone. In the M&A case described above, this means calculating the number of shared customers of two companies without disclosing any customer information to any of the companies or any third party.
This situation is illustrated below. The customer databases are exemplified on the left and the right, while the output is the number of shared customers which in this case is two. As the set intersection should be private, the two customer databases should be kept confidential at all times.
This is not a trivial task. Using a trusted third party comes with the lengthy processes and costs discussed above. To avoid third parties, hashing approaches have traditionally been applied. Unfortunately, none of them is really satisfactory:
- Naïve approaches apply the same hashing function to the names in both databases, exchange the results and compare the hashes3. Identical names produce the same hash and can thus be identified as shared. As each party knows the hashes of their own customers, they can also infer the names of the shared customers. This alone can represent a violation of local privacy laws.
- More involved approaches use double-hashing techniques. These are more complicated, susceptible to privacy attacks and, most importantly, still fail in the common case of slight differences in names – think “Freddy Mercury” in one database vs “Fred Mercury” in the other.
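A minimal sketch (in plain Python, with made-up names) illustrates both weaknesses of the naïve hashing approach:

```python
import hashlib

def hash_names(names):
    """Hash each customer name with SHA-256, as in the naive approach.

    Returns a mapping from hash to name, so the party can later look
    up which of its own customers a shared hash corresponds to.
    """
    return {hashlib.sha256(n.encode()).hexdigest(): n for n in names}

company_a = ["Freddy Mercury", "Brian May", "Roger Taylor"]
company_b = ["Fred Mercury", "Brian May", "John Deacon"]

hashes_a = hash_names(company_a)
hashes_b = hash_names(company_b)

# The parties exchange the hashes and intersect them.
shared = set(hashes_a) & set(hashes_b)

# Weakness 1: each party can map the shared hashes back to names,
# so the identities of shared customers are leaked.
print([hashes_a[h] for h in shared])  # ['Brian May']

# Weakness 2: "Freddy Mercury" and "Fred Mercury" produce completely
# unrelated hashes, so this genuine match is missed entirely.
```

Even the honest-looking exchange of hashes therefore reveals exactly who the shared customers are, while near-duplicate records slip through unmatched.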
New developments come to the rescue. Recent advances in hardware-based cryptography enable new, strictly superior solutions to the private set intersection problem.
The key to privacy-preserving PSI is encrypted in an enclave
The avato platform leverages Intel’s Software Guard Extensions (Intel SGX) technology to create so-called secure enclave programs. These are isolated computer programs which can provide additional security and privacy guarantees even when running on public cloud infrastructure.
The figure below illustrates the situation. Anna and Paul work at the two companies and are tasked with computing the size of their shared customer base in a privacy-preserving and GDPR compliant way. They decide to use an avato secure enclave. After receiving the relevant security proofs, they locally encrypt their customer databases and submit them into the secure enclave. Provably, this particular secure enclave is the one and only program that can ever decrypt this data. In the enclave, the identifiers are matched, and the number of shared customers is sent back to Anna and Paul.
We postpone the technical details to the appendix section “How to provide the security guarantees”; in short, the use of an avato secure enclave gives Anna and Paul the following security and privacy guarantees:
- Only the particular enclave program Anna and Paul are connected to can decrypt their customer databases.
- Nobody can access the decrypted data, including decentriq and potential infrastructure providers running avato.
- The secure enclave only outputs privacy-preserving aggregate statistics such as the number of shared customers.
Using avato provides Anna and Paul with a simple and safe way of performing the private set intersection. Compared to other approaches, it does not require a trusted third party or complicated protocols, while also enabling more sophisticated matching algorithms (fuzzy matching) and additional privacy-preserving output statistics. Crucially, as long as the above guarantees hold and the output is non-personal data (e.g. the number of shared customers), the described use of avato is in line with GDPR. This is discussed in detail in the following section written by Anna Maria Tonikidou and Christian Meisser from the legal services firm LEXR.
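Conceptually, the computation performed inside the enclave is simple. The sketch below shows the logic only (plain Python, with hypothetical example identifiers); in avato this runs inside the attested enclave on the decrypted inputs, and only the count ever leaves the enclave:

```python
def psi_count(customers_a, customers_b):
    """The enclave-side logic of the private set intersection:
    match the identifiers of both parties and return ONLY the size
    of the intersection. The raw identifiers never leave the enclave."""
    return len(set(customers_a) & set(customers_b))

# Hypothetical customer identifiers submitted (encrypted) by each party.
anna_customers = ["alice@example.com", "bob@example.com", "carol@example.com"]
paul_customers = ["bob@example.com", "carol@example.com", "dave@example.com"]

print(psi_count(anna_customers, paul_customers))  # 2
```

The point of the enclave is not that this computation is hard, but that the guarantees above ensure nobody can observe the inputs or intermediate state while it runs.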
Why using avato is in line with GDPR
The term 'personal data' is the entryway to the application of the GDPR. 'Personal data' is defined in Article 4 (1) GDPR as any information relating to an identified or identifiable natural person. Such a person is referred to as a data subject. The data subjects are identifiable if they can be directly or indirectly identified. The definition of personal data is based on the realistic risk of identification, and the applicability of data protection rules should be based on risk of harm and likely severity.4
According to Recital 26 (5) GDPR, the principles of data protection should not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.
In contrast to anonymous information, there is no mention of the qualification of encrypted information in the GDPR, and so far, no EU/EEA court has explicitly decided whether encrypted data is personal or not. However, the highest agency for data protection regulation in Bavaria (Landesamt für Datenschutzaufsicht) has concluded that encrypted data does not fall under the category of personal data, under the premise that it is encrypted with strong state-of-the-art cryptographic methods.5
Whether encrypted data are personal data therefore depends on the circumstances, particularly on the means reasonably likely to be used (fair or foul) to re-identify individuals.6 Factors affecting encrypted data's security against decryption include the following:
- Strength of encryption method (the algorithm's cryptographic strength)
- Key management, such as secure storage of the decryption key and control of access to it.7
Under WP136, ‘anonymised’ data may be considered anonymous in a provider’s hands if ‘within the specific scheme in which those other controllers (e.g. providers) are operating, re-identification is explicitly excluded and appropriate technical measures have been taken in this respect’.8
According to the UK Information Commissioner's Office (ICO), if (i) a party has encrypted personal data itself and (ii) is responsible for managing the key, it is processing data covered by the GDPR, since it has the ability to re-identify individuals through decryption of that dataset.9 On that basis, Hon/Millard/Walden suggest that if a party cannot view data, it cannot identify data subjects, and therefore identification may be excluded by excluding others from being able to access or read data.10 By analogy with key-coded data, to the person encrypting personal data, such as a cloud user with the decryption key, the data remain 'personal data'.11 However, in another person’s hands, such as a cloud-based platform provider storing encrypted data without access to the key and no means 'reasonably likely' to be used for decryption, the data may be considered anonymous.12 This removes cloud providers from the scope of data protection legislation, at least where data have been strongly encrypted by the controller before transmission, and the provider cannot access the key.
With encryption, many of the parties who process the data do not hold the encryption key; the key stays with the generator of the data. This is the case with avato, meaning that encryption here bears similarities to the effects of anonymization, as decentriq has no means of reversing the process to access the raw data. In fact, decentriq has no way of knowing whether personally identifiable information is contained in the sets transferred to avato at all, and as such it would be impossible to define the scope of processing within a data processing agreement with its clients. decentriq also has no better chance of accessing the data than any outside party without the key. avato's strong encryption therefore bears effects similar to anonymization, i.e. it renders personal data in the sense of the GDPR into non-personal data from the point of encryption.
As a result of the above, for all intents and purposes, avato as a host of encrypted data is not processing personal data under the definition of the GDPR. decentriq cannot access that data, and even if its servers were breached, data subjects would be at little risk from a privacy standpoint since the data would also be unintelligible to the wrongdoers.
In this blog we have introduced the private set intersection problem and motivated it with the use-case of a potential merger of two companies, where the number of shared customers should be computed privately. We have argued that traditional approaches to this problem are not satisfactory and that new technologies such as Intel SGX enable strictly superior solutions. One such solution is decentriq’s avato platform, which enables provably privacy-preserving computation on data. We have argued that the use of avato is in line with GDPR, even when the computation is performed on personally identifiable data, as in the outlined case. Even though we have used the example of private set intersection, this generalizes to the many more confidential computing use-cases supported by avato.
If you are interested in learning more about avato and would like to receive the full GDPR assessment done by LEXR, reach out to firstname.lastname@example.org.
Appendix - How to provide the security guarantees
This section covers the technical details of performing a private set intersection with avato and describes how the security and privacy guarantees are achieved.
To make things more concrete, this demo video shows how Anna and Paul use the dedicated Python API and web application to create an avato secure enclave, get the security proofs, encrypt their data and privately compute the number of shared customers.
Coming back to the security and privacy guarantees, the following points must be ensured to achieve them:
1. The secure enclave program must only compute the number of shared customers and delete all input data afterwards.
2. The cryptographic keys Anna and Paul use to encrypt their customer databases must only be known to the particular secure enclave they are connected to.
3. The data decrypted by the secure enclave must not be accessible to anyone, including administrators with potential access to the operating system such as decentriq and infrastructure providers.
This can be achieved using Intel’s SGX technology. Already in place for seven years, the technology is available on most modern Intel CPUs. As outlined in much more detail in this blog, Intel SGX-based secure enclave programs are founded on two main security pillars.
As a first security pillar, a process called remote attestation allows a user to confirm, for a remotely running program: i) that it is a secure enclave program; ii) its program logic (source code); iii) a cryptographic key that enables encrypting data in a way that can only be decrypted by that particular program. A user performs remote attestation by inspecting a particular piece of data received from the enclave. In the figure above, these data are indicated as security proofs. Using these security proofs, i) is confirmed by checking a cryptographic signature that can only be obtained by a secure enclave program; ii) is checked by comparing a hash of the secure enclave’s source code to an expected value; and iii) is achieved by using a public key sent as part of the security proofs, whose private counterpart is known only to the enclave (it was randomly generated when the secure enclave was started). As remote attestation allows Anna and Paul to verify what program is running remotely and to encrypt data only for this particular secure enclave, it satisfies points 1 and 2.
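The verification side of remote attestation can be sketched as follows. This is a deliberately simplified illustration: the `SecurityProofs` structure, `EXPECTED_CODE_HASH` and the `verify_sgx_signature` callback are hypothetical stand-ins; a real Intel SGX quote has a richer format and its signature is verified against Intel's attestation infrastructure.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class SecurityProofs:
    """Hypothetical, simplified shape of the security proofs."""
    code_hash: bytes       # measurement of the enclave's code (cf. MRENCLAVE)
    enclave_pubkey: bytes  # public key generated inside the enclave at startup
    signature: bytes       # signature chaining back to the SGX hardware root

# Hash of the audited enclave source, computed independently by the user.
# (Placeholder value for this sketch.)
EXPECTED_CODE_HASH = hashlib.sha256(b"psi-enclave-source-v1").digest()

def verify_proofs(proofs: SecurityProofs, verify_sgx_signature) -> bool:
    # i) only a genuine secure enclave can produce a valid signature;
    #    the actual check is delegated to the injected callback here.
    if not verify_sgx_signature(proofs):
        return False
    # ii) the reported code measurement must match the audited source.
    if proofs.code_hash != EXPECTED_CODE_HASH:
        return False
    # iii) both checks passed: data encrypted under proofs.enclave_pubkey
    #      can only be decrypted by this particular enclave.
    return True
```

Only after `verify_proofs` succeeds would Anna and Paul encrypt their customer databases under the attested public key.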
As a second security pillar, enclave memory isolation and enclave memory encryption protect the data sent into the secure enclave even from potential system administrators. In order to process data, a CPU must read data from memory. In traditional computing, these data must be unencrypted in order for the CPU to perform computation on them. With Intel SGX, the CPU can read encrypted data from memory because a dedicated decryption/encryption chip inside the CPU handles the memory access of secure enclave programs. The encryption/decryption is done on the fly within the CPU itself when enclave data or code is leaving/entering the processor package. As a final protection, an additional layer of memory address translation prevents access by non-enclave programs to the secure enclave’s memory. Together, enclave memory encryption and enclave memory isolation satisfy point 3.
We hope that this section has shed some light on how the security and privacy guarantees can be achieved. The underlying technical details are quite involved. It requires expert knowledge to leverage the powerful Intel SGX technology, and decentriq specializes in that. If you have any questions regarding the security or use of the avato platform, reach out to email@example.com.
4) Ustaran E, European Data Protection Law and Practice, 44.
5) Tätigkeitsbericht 2017/18 - Bayerisches Landesamt für Datenschutzaufsicht, 89.
6) Mourby M, Are pseudonymized data always personal data? Implications of the GDPR for administrative data research in the UK, in Computer Law & Security Review, 2018, Vol. 34, 224.
8) Opinion 4/2007 on the concept of personal data, WP136 (2007).
10) Hon/Millard/Walden, The problem of 'personal data' in cloud computing: what information is regulated? – the cloud of unknowing, in International Data Privacy Law, 2011, Vol. 1, No. 4, 219.