Privacy-Preserving Onion Site Popularity Measurement

This page provides details about a small, short-term, safe, and secure measurement study of the popularity of Tor onion sites. In particular, we aim to show that it is feasible to estimate the popularity of onion sites by using machine learning techniques to classify accesses to the Facebook onion site front page, while providing strong user privacy guarantees.

We believe that this study is safe: we only learn the total site usage, with added noise. Because we collect data from Tor middle relays, we cannot learn the identity of individual users. Because we use PrivCount, we do not learn site usage on single relays, or total site usage without any noise. All unblinded data is kept on the collecting relay, and is automatically destroyed after processing. See the safety section for more details.

Who are we?

The results of this work will appear in the following publication:

Inside Job: Applying Traffic Analysis to Measure Tor from Within
25th Symposium on Network and Distributed System Security (NDSS 2018)
Rob Jansen, Marc Juarez, Rafael Galvez, Tariq Elahi, and Claudia Diaz

What is the general idea?

We are working on understanding how machine learning techniques that have traditionally been used to fingerprint websites for the purpose of deanonymizing Tor users can instead be used to measure the popularity of Tor onion sites. Although website fingerprinting is usually done by someone in a position to observe part of the network path between the client and its guard (including someone who runs the guard itself), our popularity measurement is less intrusive and only requires running middle relays.

Our popularity measurement involves:

  1. running a middle relay and observing circuits;
  2. guessing if an observed circuit is a hidden service circuit (already done by Kwon et al., albeit from a guard node position);
  3. for hidden service circuits, guessing the position of the relay in the circuit (we focus on the middle position next to the client-side guard); and
  4. for hidden service circuits observed from the correct relay position, guessing the onion service website (i.e. the Facebook login page) based on a trained classifier.

If guessing the onion site from the middle is successful, then it can be used to discover onion service popularity by measuring the frequency with which each onion site is accessed. The goal of our small study is to provide a proof-of-concept by measuring the popularity of a single onion page: the Facebook onion site login page.
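The four steps above form a cascade: each stage only runs on circuits that the previous stage accepted. A minimal sketch of that control flow follows. The three classifier arguments are hypothetical callables, the signed-integer trace format is an assumption made for illustration, and the toy stand-ins are not the trained models used in the study.

```python
from typing import Callable, Optional, Sequence

# Hypothetical trace format: one signed integer per observed cell,
# +1 for a cell sent towards one side of the circuit, -1 for the other.
CellTrace = Sequence[int]
Predicate = Callable[[CellTrace], bool]


def classify_circuit(
    trace: CellTrace,
    is_onion_circuit: Predicate,
    is_client_side_middle: Predicate,
    guess_site: Callable[[CellTrace], Optional[str]],
) -> Optional[str]:
    """Run the cascade; return a site label, or None if any stage rejects."""
    if not is_onion_circuit(trace):        # step 2: circuit type
        return None
    if not is_client_side_middle(trace):   # step 3: relay position
        return None
    return guess_site(trace)               # step 4: site guess


# Toy stand-ins so the sketch runs. They are NOT real classifiers.
def toy_is_onion(trace: CellTrace) -> bool:
    return len(trace) >= 4


def toy_is_middle(trace: CellTrace) -> bool:
    return trace[0] == 1


def toy_guess_site(trace: CellTrace) -> Optional[str]:
    return "target-page" if sum(trace) < 0 else None
```

Keeping the stages as separate callables means each one can be trained and evaluated on its own, and a circuit rejected early never reaches the site classifier.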

What are we doing? How is this safe?

Our measurement study explicitly prioritizes user safety as a primary goal. We practice data minimization, limit measurement granularity, and provide additional security to the measurement process as described below. We have incorporated feedback from the Tor Research Safety Board into our methodology. See our feedback request and the safety board response.

Because this measurement is done from the middle relay position, onion-encryption technically prevents us from learning any client-identifying information. Although this protects users to some extent, we further protect users by utilizing the state-of-the-art in safe Tor measurement tools and techniques. Specifically, we use PrivCount and the techniques set out by Jansen and Johnson in “Safely Measuring Tor” to provide differential privacy and securely aggregate measurements across all of our relay data collectors.

During the measurement process, circuit and cell metadata will be used by the classifiers to make their guesses. Circuit metadata includes internal middle relay state that is used to identify the previous and next relay in the circuit (the circuit ID, channel ID, and public relay fingerprint and flags). Cell metadata includes whether the cell was sent or received and to/from which side of the circuit, the previous and next circuit ID and channel ID, the cell type and relay command type (if known), and a timestamp of when the cell was sent or received.

The metadata is sent in real time from Tor to PrivCount, where it will be temporarily stored in volatile memory (RAM); the longest time that PrivCount will store the data in RAM is the lifetime of the circuit (normally around 10 minutes). When the circuit closes, PrivCount will pass the metadata to a previously-trained classifier, which will make the guesses as appropriate. PrivCount will increment basic circuit counters that will allow us to compute the following statistics:

  1. The fraction of all middle circuits that we predict are client-side rendezvous circuits
  2. The fraction of the above, in which we predict that our middle is the second relay from the client (next to the client’s guard)
  3. The fraction of the above, in which we predict the circuit was used to access the Facebook onion site front page
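The buffering and counting described above can be sketched as follows. This is an illustration under stated assumptions, not PrivCount's code: `CircuitTracker`, the counter names, the `(timestamp, direction)` cell tuple, and the `predict` callable are all hypothetical. Cell metadata is held in memory only while a circuit is open, and the three statistics are nested ratios of four counters.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Cell = Tuple[float, int]              # hypothetical: (timestamp, direction)
Prediction = Tuple[bool, bool, bool]  # (rendezvous?, second hop?, target page?)


class CircuitTracker:
    """Buffer per-circuit cell metadata in RAM and count predictions on close."""

    def __init__(self, predict: Callable[[List[Cell]], Prediction]) -> None:
        self.predict = predict
        self.counters: Counter = Counter()
        self._open: Dict[int, List[Cell]] = {}

    def on_cell(self, circuit_id: int, cell: Cell) -> None:
        self._open.setdefault(circuit_id, []).append(cell)

    def on_close(self, circuit_id: int) -> None:
        # Remove the buffer first so no metadata outlives the circuit.
        cells = self._open.pop(circuit_id, [])
        rendezvous, second_hop, target = self.predict(cells)
        self.counters["circuits"] += 1
        if rendezvous:
            self.counters["rendezvous"] += 1
            if second_hop:
                self.counters["second_hop"] += 1
                if target:
                    self.counters["target_site"] += 1
        # `cells` goes out of scope here; only the counters remain.


def nested_fractions(c: Counter) -> Tuple[float, float, float]:
    """Return statistics 1-3 as ratios of each counter to the one above it."""
    def ratio(num: int, den: int) -> float:
        return num / den if den > 0 else 0.0

    return (
        ratio(c["rendezvous"], c["circuits"]),
        ratio(c["target_site"] and c["second_hop"] or c["second_hop"], c["rendezvous"]),
        ratio(c["target_site"], c["second_hop"]),
    )
```

Because the published counts are noisy, a denominator can be zero or negative in practice; the `ratio` helper returns 0.0 in that case rather than raising.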

During our measurement, we visit the Facebook onion site front page with our own client to generate some ground truth circuits which we can use to check the accuracy of our predictions. Additionally, we collect some counters when our relays serve in the rendezvous position that can also be used to check our predictions. These include:

  1. The fraction of entry/middle/exit circuits that are rendezvous circuits
  2. The fraction of rendezvous circuits that connect to a known Facebook ASN

Once these counters are incremented, all metadata corresponding to the circuit and its cells is destroyed.

The PrivCount counters are initialized to noisy values to ensure differential privacy is maintained (cf. “Safely Measuring Tor”), and are then blinded and distributed across several share keepers to provide secure aggregation. At the end of the process, we learn only the value of these noisy counts aggregated across all data collectors, and nothing else about the information that was used during the measurement process. Specifically, we do not learn relay-specific inputs to the final counter value, and client usage of Tor during our measurement will be protected under differential privacy (i.e., the final counter values include noise to protect the true counts).
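To make the idea of a noisy, blinded counter concrete, here is a simplified sketch of additive blinding. It is not PrivCount's implementation: the modulus `Q`, the function names, and the use of `random.gauss` for noise are assumptions chosen for illustration, and a real deployment would calibrate the noise and sample it securely. Each data collector starts its counter at noise plus a set of random blinding values and hands one blinding value to each share keeper; only the sum over all parties reveals the noisy total.

```python
import random
import secrets
from typing import List, Sequence, Tuple

Q = 2**61 - 1  # assumed modulus for all counter arithmetic


def start_counter(sigma: float, num_keepers: int,
                  rng: random.Random) -> Tuple[int, List[int]]:
    """Return (blinded initial value, one blinding value per share keeper)."""
    noise = round(rng.gauss(0.0, sigma))  # illustrative noise source only
    blinds = [secrets.randbelow(Q) for _ in range(num_keepers)]
    return (noise + sum(blinds)) % Q, blinds


def increment(value: int, amount: int = 1) -> int:
    """Count an event on the blinded value without unblinding it."""
    return (value + amount) % Q


def aggregate(collector_values: Sequence[int],
              keeper_sums: Sequence[int]) -> int:
    """Combine all collectors and keepers into the signed noisy total."""
    total = (sum(collector_values) - sum(keeper_sums)) % Q
    # Map the upper half of the range back to negative numbers,
    # since noise can push a small count below zero.
    return total - Q if total > Q // 2 else total
```

In this sketch each share keeper would publish only the sum (mod `Q`) of the blinding values it received, so a single collector's value looks uniformly random on its own, and the blinding cancels only when every collector value and every keeper sum are combined.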

Why are we doing this?

This work has value to the community that we believe offsets the potential risks associated with the measurement.

  • We highlight the positive use of Tor and onion services by focusing our measurement on the Facebook onion site. Understanding which parts of the Tor protocol are used most often can help Tor researchers and developers focus their effort on improvements that can have the largest impact on the widest set of users.
  • We believe that showing how website fingerprinting can be applied for purposes other than client deanonymization (the focus of most recent website fingerprinting research) is novel and interesting and may spur additional research that may ultimately help us better understand the real world risks associated with fingerprinting techniques. This may, in turn, lead to the development of better fingerprinting defenses.
  • We believe that the risk from middle nodes is too often overlooked in the literature, and we think there is value in showing what can be discovered from the relay position with the fewest requirements.

Which relays are involved in the measurement?

The following relays are part of our PrivCount deployment. The data collected by these relays is blinded and securely aggregated while also being protected by differential privacy. Each individual relay learns nothing except the final combined results from all relays.

What is the measurement status?

We have previously performed three measurements:

  • Measurement 1:
    • start time 2017-08-02 01:54:12 UTC
    • end time 2017-08-03 01:54:12 UTC
  • Measurement 2:
    • start time 2017-08-07 01:23:43 UTC
    • end time 2017-08-08 01:23:43 UTC
  • Measurement 3:
    • start time 2017-08-09 15:01:59 UTC
    • end time 2017-08-10 15:01:59 UTC

We do not plan to perform any additional measurements for this project.