January 29, 2021
The static vs. dynamic distinction is also present in other areas. For example, static malware detectors are based on analyzing the raw bytes of executable files. On the other hand, dynamic malware detectors run the executable in a sandboxed environment and inspect its actions. The static approach is simpler and cheaper, but the dynamic approach can provide higher accuracy.
Implementing the Dynamic Approach to Fingerprinting Detection
Then, we visited web sites using our customized Chrome browser. To extract function call data from the browser, we utilized the browser’s console log. We wrote annotated messages into the console log and processed these messages from Python.
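As a rough illustration of the log-processing step, the sketch below counts annotated console messages per script. The message format is an assumption made for this example (a hypothetical `FP_TRACE` marker followed by a JSON object with `script` and `api` fields). The article does not describe the actual annotation format.

```python
import json
from collections import Counter, defaultdict

# Hypothetical marker that the instrumented browser is assumed to write
# in front of every annotated console message.
MARKER = "FP_TRACE "


def parse_console_log(lines):
    """Count annotated API events per script URL.

    Assumes each annotated line ends with a JSON object such as
    {"script": "<url>", "api": "<property or function name>"}.
    """
    calls = defaultdict(Counter)
    for line in lines:
        pos = line.find(MARKER)
        if pos == -1:
            continue  # ordinary console output, not one of our annotations
        try:
            event = json.loads(line[pos + len(MARKER):])
        except json.JSONDecodeError:
            continue  # truncated or malformed message
        if not isinstance(event, dict):
            continue
        script, api = event.get("script"), event.get("api")
        if script and api:
            calls[script][api] += 1
    return calls


sample_log = [
    'INFO FP_TRACE {"script": "https://a.example/fp.js", "api": "plugins"}',
    'INFO FP_TRACE {"script": "https://a.example/fp.js", "api": "plugins"}',
    'INFO FP_TRACE {"script": "https://a.example/fp.js", "api": "toDataURL"}',
    "INFO some unrelated console message",
]
calls = parse_console_log(sample_log)
```

Grouping the counts by script URL keeps the events of different scripts on the same page separate, which is convenient if each script is later labeled on its own.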
Building the Data Set
- Another difficulty of data collection was the difference between headed and headless browsers. The analyst used a headed browser for their investigations, but the web crawling was performed by a headless browser. Therefore, we needed to verify that the headed and the headless approaches produce the same (or nearly the same) function calls. If there was a discrepancy, we removed the given case from the labeled data set.
- We applied de-duplication to the data set. If multiple data points had the same feature vector, we kept only one of them.
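The de-duplication step can be sketched as follows. This is a minimal illustration, assuming each example is a `(feature_vector, label)` pair. Keeping the first occurrence is an assumption, since the article does not say which duplicate was retained.

```python
def deduplicate(examples):
    """Keep one example per distinct feature vector (first occurrence wins)."""
    seen = set()
    unique = []
    for features, label in examples:
        key = tuple(features)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append((features, label))
    return unique


examples = [
    ([1, 0, 3], 1),
    ([1, 0, 3], 1),  # same feature vector as the first example
    ([0, 2, 0], 0),
]
unique_examples = deduplicate(examples)
```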
After all this effort, we obtained 409 labeled examples in total. The data set is fairly balanced: 258 examples are negative (63%) and 151 are positive (37%). An example negative case is https://www.roomkey.com/js/connector/connector.js (web site: https://wyndhamhotels.com). An example positive case is https://www.mobile.de/resources/c0ad9057f4200d85dea57fe1e15731 (web site: https://mobile.de).
Fingerprint Detector Feature Engineering
We derived the features from property access and function call events that can be associated with fingerprinting. We selected 76 properties and 30 functions in total and assigned a counter to each of them. If a property was accessed or a function was called, the corresponding counter was incremented. The features can be divided into the following groups:
- window.navigator properties (43 features): This group includes counters for well-known indicators of fingerprinting such as plugins and javaEnabled, as well as less conventional ones such as mediaCapabilities and maxTouchPoints.
- window.screen properties (33 features): Screen attributes have been used for fingerprinting for a long time. Some example features from this group are availHeight, availWidth, colorDepth and fontSmoothingEnabled.
- canvas functions (23 features): Canvas fingerprinting works by exploiting the HTML5 canvas element. The fingerprinting script draws text with the font and size of its choice and adds background colors. Then, the hash code of the canvas pixel data is used as the fingerprint. Some example counter features that we defined in this group are fillText, fillRect and toDataURL.
- audio functions (6 features): Audio fingerprinting is conceptually similar to canvas fingerprinting but it exploits the audio context and the oscillator node elements instead of the canvas. Some example counter features that we defined in this group are createOscillator, createDynamicsCompressor and oscillator_start.
- other (11 features): This group contains additional features that are not related to window properties or canvas/audio functions.
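The counter features described above can be illustrated with a short sketch. The feature list here contains only the names mentioned in the text, not the full set used in the experiments, and the input is assumed to be a flat list of event names observed for one script.

```python
from collections import Counter

# Subset of feature names taken from the groups above. The complete list
# is not given in the article, so this selection is illustrative only.
FEATURE_NAMES = [
    "plugins", "javaEnabled", "mediaCapabilities", "maxTouchPoints",   # navigator
    "availHeight", "availWidth", "colorDepth", "fontSmoothingEnabled", # screen
    "fillText", "fillRect", "toDataURL",                               # canvas
    "createOscillator", "createDynamicsCompressor", "oscillator_start" # audio
]
KNOWN = set(FEATURE_NAMES)


def to_feature_vector(events):
    """Turn a list of observed event names into a fixed-order count vector."""
    counts = Counter(e for e in events if e in KNOWN)  # ignore untracked events
    return [counts[name] for name in FEATURE_NAMES]


events = ["fillText", "fillRect", "fillRect", "plugins", "someOtherCall"]
vector = to_feature_vector(events)
```

Each position of the vector corresponds to one counter, so a script that never touches a tracked property or function simply yields zeros.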
Machine Learning Experiments
After defining the features, we trained machine learning models. We compared 7 different models: the first is a logistic regression, and the remaining 6 are tree-based nonlinear models. The evaluation scheme is 20-fold cross-validation, and the evaluation metric is accuracy (#correct decisions / #all decisions). To implement the experiment, we used scikit-learn. The results are summarized in the following table:
The cross-validation accuracy is pretty high for all algorithms (above 90%). The best score was achieved by the most complex model (#7: GradientBoosting with max_depth=3). This suggests that our proposed counter features are strong indicators of fingerprinting, and that the relationship with the label is not trivial.
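The evaluation scheme described above (20-fold cross-validation, accuracy metric, scikit-learn) could be set up roughly as in the sketch below. The data is synthetic and the labeling rule is made up for the example, so the printed numbers say nothing about the real data set. Only two of the seven models are shown, and all hyperparameters other than `max_depth=3` are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the counter features (NOT the article's data):
# non-negative integer counts, labeled by an invented rule.
rng = np.random.default_rng(0)
X = rng.poisson(lam=2.0, size=(400, 10))
y = (X[:, 0] + X[:, 1] > 4).astype(int)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GradientBoosting": GradientBoostingClassifier(max_depth=3, random_state=0),
}

mean_accuracy = {}
for name, model in models.items():
    # cv=20 gives 20 folds; for classifiers scikit-learn stratifies them by label.
    fold_scores = cross_val_score(model, X, y, cv=20, scoring="accuracy")
    mean_accuracy[name] = fold_scores.mean()
    print(f"{name}: {mean_accuracy[name]:.3f}")
```

With only a few hundred examples, a high fold count such as 20 leaves most of the data available for training in every round, at the cost of small test folds.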
The GradientBoosting algorithm provides an importance value for each input feature. Let’s investigate what model #7 thinks about feature importance:
The two most predictive features are cpuClass and fillRect. The distribution of labels for each value of cpuClass × fillRect is shown below (the feature values are integers, but the dots were randomly perturbed within grid cells for better visibility):
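For readers who want to inspect importances themselves, a fitted scikit-learn gradient boosting model exposes them through its `feature_importances_` attribute. The sketch below uses synthetic data whose labels are deliberately constructed to depend on the first two columns, so its ranking is an artifact of that construction and not a reproduction of the result above. The feature names are borrowed from the text purely as column labels.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Column labels borrowed from the article; the data itself is synthetic.
feature_names = ["cpuClass", "fillRect", "plugins", "colorDepth", "toDataURL"]

rng = np.random.default_rng(0)
X = rng.poisson(lam=2.0, size=(400, len(feature_names)))
# Invented rule: the label depends only on the first two columns.
y = (X[:, 0] + X[:, 1] > 4).astype(int)

model = GradientBoostingClassifier(max_depth=3, random_state=0).fit(X, y)

# Pair each feature with its importance and sort in descending order.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:12s} {importance:.3f}")
```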
Protecting Homes with AI
The question arises: How can we utilize the machine learning classifier to provide practical value in the real world? One solution is to automatically generate an AI-based domain blacklist. The outline of the approach is as follows:
- We periodically scan the web, searching for new trackers.
- In the end, we can post-process the candidate list:
- A human analyst double checks the candidates and filters out false positives.
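The candidate-generation and review steps can be sketched as below. This is a simplified illustration under stated assumptions: scripts are assumed to arrive already scored by the classifier as `(url, score)` pairs, the 0.5 threshold is arbitrary, and the analyst review is modeled as a callback. The function names are hypothetical.

```python
from urllib.parse import urlparse


def candidate_domains(scored_scripts, threshold=0.5):
    """Collect host names of scripts the classifier flags as fingerprinting.

    `scored_scripts` is an iterable of (script_url, score) pairs, where the
    score is assumed to be the classifier's estimated probability.
    """
    domains = set()
    for url, score in scored_scripts:
        if score >= threshold:
            host = urlparse(url).hostname
            if host:
                domains.add(host)
    return domains


def finalize_blacklist(candidates, is_false_positive):
    """Drop candidates that the (human) reviewer marks as false positives."""
    return sorted(d for d in candidates if not is_false_positive(d))


scored = [
    ("https://tracker.example/fp.js", 0.97),
    ("https://cdn.example/lib.js", 0.08),
    ("https://shared.example/widget.js", 0.90),
]
candidates = candidate_domains(scored)
# Stand-in for the analyst's decision: one candidate is rejected on review.
blacklist = finalize_blacklist(candidates, lambda d: d == "shared.example")
```

Keeping the review step as a separate function mirrors the outline: the classifier only proposes candidates, and a person decides what ends up on the list.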
An advantage of the AI-based approach is that it can detect zero-day trackers that are not yet included in the publicly available sources. As a consequence, we can offer stronger protection against trackers.