Search | arXiv e-print repository

Google's Chrome Antitrust Paradox

Authors: Shaoor Munir, Konrad Kollnig, Anastasia Shuba, Zubair Shafiq

Abstract: This article delves into Google's dominance of the browser market, highlighting how Google's Chrome browser is playing a critical role in asserting Google's dominance in other markets. While Google perpetuates the perception that Google Chrome is a neutral platform built on open-source technologies, we argue that Chrome is instrumental in Google's strategy to reinforce its dominance in online adve… ▽ More This article delves into Google's dominance of the browser market, highlighting how Google's Chrome browser is playing a critical role in asserting Google's dominance in other markets. While Google perpetuates the perception that Google Chrome is a neutral platform built on open-source technologies, we argue that Chrome is instrumental in Google's strategy to reinforce its dominance in online advertising, publishing, and the browser market itself. Our examination of Google's strategic acquisitions, anti-competitive practices, and the implementation of so-called "privacy controls," shows that Chrome is far from a neutral gateway to the web. Rather, it serves as a key tool for Google to maintain and extend its market power, often to the detriment of competition and innovation. We examine how Chrome not only bolsters Google's position in advertising and publishing through practices such as coercion, and self-preferencing, it also helps leverage its advertising clout to engage in a "pay-to-play" paradigm, which serves as a cornerstone in Google's larger strategy of market control. We also discuss potential regulatory interventions and remedies, drawing on historical antitrust precedents. We propose a triad of solutions motivated from our analysis of Google's abuse of Chrome: behavioral remedies targeting specific anti-competitive practices, structural remedies involving an internal separation of Google's divisions, and divestment of Chrome from Google. Despite Chrome's dominance and its critical role in Google's ecosystem, it has escaped antitrust scrutiny -- a gap our article aims to bridge. Addressing this gap is instrumental to solve current market imbalances and future challenges brought on by increasingly hegemonizing technology firms, ensuring a competitive digital environment that nurtures innovation and safeguards consumer interests. △ Less

Submitted 26 June, 2024; v1 submitted 4 April, 2024; originally announced June 2024.

arXiv:2406.07647 [pdf, other]

FP-Inconsistent: Detecting Evasive Bots using Browser Fingerprint Inconsistencies

Authors: Hari Venugopalan, Shaoor Munir, Shuaib Ahmed, Tangbaihe Wang, Samuel T. King, Zubair Shafiq

Abstract: As browser fingerprinting is increasingly being used for bot detection, bots have started altering their fingerprints for evasion. We conduct the first large-scale evaluation of evasive bots to investigate whether and how altering fingerprints helps bots evade detection. To systematically investigate evasive bots, we deploy a honey site incorporating two anti-bot services (DataDome and BotD) and s… ▽ More As browser fingerprinting is increasingly being used for bot detection, bots have started altering their fingerprints for evasion. We conduct the first large-scale evaluation of evasive bots to investigate whether and how altering fingerprints helps bots evade detection. To systematically investigate evasive bots, we deploy a honey site incorporating two anti-bot services (DataDome and BotD) and solicit bot traffic from 20 different bot services that purport to sell "realistic and undetectable traffic". Across half a million requests from 20 different bot services on our honey site, we find an average evasion rate of 52.93% against DataDome and 44.56% evasion rate against BotD. Our comparison of fingerprint attributes from bot services that evade each anti-bot service individually as well as bot services that evade both shows that bot services indeed alter different browser fingerprint attributes for evasion. Further, our analysis reveals the presence of inconsistent fingerprint attributes in evasive bots. Given evasive bots seem to have difficulty in ensuring consistency in their fingerprint attributes, we propose a data-driven approach to discover rules to detect such inconsistencies across space (two attributes in a given browser fingerprint) and time (a single attribute at two different points in time). These rules, which can be readily deployed by anti-bot services, reduce the evasion rate of evasive bots against DataDome and BotD by 48.11% and 44.95% respectively. △ Less

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.06958 [pdf, other]

Turning the Tide on Dark Pools? Towards Multi-Stakeholder Vulnerability Notifications in the Ad-Tech Supply Chain

Authors: Yash Vekaria, Rishab Nithyanand, Zubair Shafiq

Abstract: Online advertising relies on a complex and opaque supply chain that involves multiple stakeholders, including advertisers, publishers, and ad-networks, each with distinct and sometimes conflicting incentives. Recent research has demonstrated the existence of ad-tech supply chain vulnerabilities such as dark pooling, where low-quality publishers bundle their ad inventory with higher-quality ones to… ▽ More Online advertising relies on a complex and opaque supply chain that involves multiple stakeholders, including advertisers, publishers, and ad-networks, each with distinct and sometimes conflicting incentives. Recent research has demonstrated the existence of ad-tech supply chain vulnerabilities such as dark pooling, where low-quality publishers bundle their ad inventory with higher-quality ones to mislead advertisers. We investigate the effectiveness of vulnerability notification campaigns aimed at mitigating dark pooling. Prior research on vulnerability notifications has primarily focused on single-stakeholder scenarios, and it is unclear whether vulnerability notifications can be effective in the multi-stakeholder ad-tech supply chain. We implement an automated vulnerability notification pipeline to systematically evaluate the responsiveness of various stakeholders, including publishers, ad-networks, and advertisers to vulnerability notifications by academics and activists. Our nine-month long multi-stakeholder notification study shows that notifications are an effective method for reducing dark pooling vulnerabilities in the online advertising ecosystem, especially when targeted towards ad-networks. Further, the sender reputation does not impact responses to notifications from activists and academics in a statistically different way. In addition to being the first notification study targeting the online advertising ecosystem, we are also the first to study multi-stakeholder context in vulnerability notifications. △ Less

Submitted 11 June, 2024; originally announced June 2024.

ACM Class: K.4.1; K.4.3; K.4.4; D.2.0; D.2.4; G.3; H.3.7; K.1; K.6.1; K.6.5

arXiv:2406.05310 [pdf, other]

COOKIEGUARD: Characterizing and Isolating the First-Party Cookie Jar

Authors: Pouneh Nikkhah Bahrami, Aurore Fass, Zubair Shafiq

Abstract: As third-party cookies are going away, first-party cookies are increasingly being used for tracking. Prior research has shown that third-party scripts write (or \textit{ghost-write}) first-party cookies in the browser's cookie jar because they are included in the website's main frame. What is more is that a third-party script is able to access all first-party cookies, both the actual first-party c… ▽ More As third-party cookies are going away, first-party cookies are increasingly being used for tracking. Prior research has shown that third-party scripts write (or \textit{ghost-write}) first-party cookies in the browser's cookie jar because they are included in the website's main frame. What is more is that a third-party script is able to access all first-party cookies, both the actual first-party cookies as well as the ghost-written first-party cookies by different third-party scripts. Existing isolation mechanisms in the web browser such as SOP and CSP are not designed to address this lack of isolation between first-party cookies written by different third-parties. We conduct a comprehensive analysis of cross-domain first-party cookie retrieval, exfiltration, and modification on top-10K websites. Most notably, we find 18\% and 4\% of the first-party cookies are exfiltrated and overwritten, respectively, by cross-domain third-party scripts. We propose \name to introduce isolation between first-party cookies set by different third-party scripts in the main frame. To this end, \name intercepts cookie get and set operations between third-party scripts and the browser's cookie jar to enforce strict isolation between first-party cookies set by different third-party domains. Our evaluation of \name shows that it effectively blocks all cross-domain cookie read/write operations to provide a fully isolated cookie jar. While it generally does not impact appearance, navigation, or other website functionality, the strict isolation policy disrupts Single Sign-On (SSO) on just 11\% of websites that rely on first-party cookies for session management. Our work demonstrates the feasibility of isolating first-party cookies. △ Less

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2405.18385 [pdf, other]

Blocking Tracking JavaScript at the Function Granularity

Authors: Abdul Haddi Amjad, Shaoor Munir, Zubair Shafiq, Muhammad Ali Gulzar

Abstract: Modern websites extensively rely on JavaScript to implement both functionality and tracking. Existing privacy enhancing content blocking tools struggle against mixed scripts, which simultaneously implement both functionality and tracking, because blocking the script would break functionality and not blocking it would allow tracking. We propose Not.js, a fine grained JavaScript blocking tool that o… ▽ More Modern websites extensively rely on JavaScript to implement both functionality and tracking. Existing privacy enhancing content blocking tools struggle against mixed scripts, which simultaneously implement both functionality and tracking, because blocking the script would break functionality and not blocking it would allow tracking. We propose Not.js, a fine grained JavaScript blocking tool that operates at the function level granularity. Not.js's strengths lie in analyzing the dynamic execution context, including the call stack and calling context of each JavaScript function, and then encoding this context to build a rich graph representation. Not.js trains a supervised machine learning classifier on a webpage's graph representation to first detect tracking at the JavaScript function level and then automatically generate surrogate scripts that preserve functionality while removing tracking. Our evaluation of Not.js on the top 10K websites demonstrates that it achieves high precision (94%) and recall (98%) in detecting tracking JavaScript functions, outperforming the state of the art while being robust against off the shelf JavaScript obfuscation. Fine grained detection of tracking functions allows Not.js to automatically generate surrogate scripts that remove tracking JavaScript functions without causing major breakage. Our deployment of Not.js shows that mixed scripts are present on 62.3% of the top 10K websites, with 70.6% of the mixed scripts being third party that engage in tracking activities such as cookie ghostwriting. We share a sample of the tracking functions detected by Not.js within mixed scripts not currently on filter lists with filter list authors, who confirm that these scripts are not blocked due to potential functionality breakage, despite being known to implement tracking. △ Less

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2309.12591 [pdf, other]

Before Blue Birds Became X-tinct: Understanding the Effect of Regime Change on Twitter's Advertising and Compliance of Advertising Policies

Authors: Yash Vekaria, Zubair Shafiq, Savvas Zannettou

Abstract: Social media platforms, including Twitter (now X), have policies in place to maintain a safe and trustworthy advertising environment. However, the extent to which these policies are adhered to and enforced remains a subject of interest and concern. We present the first large-scale audit of advertising on Twitter focusing on compliance with the platform's advertising policies, particularly those re… ▽ More Social media platforms, including Twitter (now X), have policies in place to maintain a safe and trustworthy advertising environment. However, the extent to which these policies are adhered to and enforced remains a subject of interest and concern. We present the first large-scale audit of advertising on Twitter focusing on compliance with the platform's advertising policies, particularly those related to political and adult content. We investigate the compliance of advertisements on Twitter with the platform's stated policies and the impact of recent acquisition on the advertising activity of the platform. By analyzing 34K advertisements from ~6M tweets, collected over six months, we find evidence of widespread noncompliance with Twitter's political and adult content advertising policies suggesting a lack of effective ad content moderation. We also find that Elon Musk's acquisition of Twitter had a noticeable impact on the advertising landscape, with most existing advertisers either completely stopping their advertising activity or reducing it. Major brands decreased their advertising on Twitter, suggesting a negative immediate effect on the platform's advertising revenue. Our findings underscore the importance of external audits to monitor compliance and improve transparency in online advertising. △ Less

Submitted 21 September, 2023; originally announced September 2023.

ACM Class: K.4.1; K.4.2; K.4.3

arXiv:2308.08160 [pdf, other]

Benchmarking Adversarial Robustness of Compressed Deep Learning Models

Authors: Brijesh Vora, Kartik Patwari, Syed Mahbub Hafiz, Zubair Shafiq, Chen-Nee Chuah

Abstract: The increasing size of Deep Neural Networks (DNNs) poses a pressing need for model compression, particularly when employed on resource constrained devices. Concurrently, the susceptibility of DNNs to adversarial attacks presents another significant hurdle. Despite substantial research on both model compression and adversarial robustness, their joint examination remains underexplored. Our study bri… ▽ More The increasing size of Deep Neural Networks (DNNs) poses a pressing need for model compression, particularly when employed on resource constrained devices. Concurrently, the susceptibility of DNNs to adversarial attacks presents another significant hurdle. Despite substantial research on both model compression and adversarial robustness, their joint examination remains underexplored. Our study bridges this gap, seeking to understand the effect of adversarial inputs crafted for base models on their pruned versions. To examine this relationship, we have developed a comprehensive benchmark across diverse adversarial attacks and popular DNN models. We uniquely focus on models not previously exposed to adversarial training and apply pruning schemes optimized for accuracy and performance. Our findings reveal that while the benefits of pruning enhanced generalizability, compression, and faster inference times are preserved, adversarial robustness remains comparable to the base model. This suggests that model compression while offering its unique advantages, does not undermine adversarial robustness. △ Less

Submitted 16 August, 2023; originally announced August 2023.

arXiv:2308.03417 [pdf, other]

PURL: Safe and Effective Sanitization of Link Decoration

Authors: Shaoor Munir, Patrick Lee, Umar Iqbal, Zubair Shafiq, Sandra Siby

Abstract: While privacy-focused browsers have taken steps to block third-party cookies and mitigate browser fingerprinting, novel tracking techniques that can bypass existing countermeasures continue to emerge. Since trackers need to share information from the client-side to the server-side through link decoration regardless of the tracking technique they employ, a promising orthogonal approach is to detect… ▽ More While privacy-focused browsers have taken steps to block third-party cookies and mitigate browser fingerprinting, novel tracking techniques that can bypass existing countermeasures continue to emerge. Since trackers need to share information from the client-side to the server-side through link decoration regardless of the tracking technique they employ, a promising orthogonal approach is to detect and sanitize tracking information in decorated links. To this end, we present PURL (pronounced purel-l), a machine-learning approach that leverages a cross-layer graph representation of webpage execution to safely and effectively sanitize link decoration. Our evaluation shows that PURL significantly outperforms existing countermeasures in terms of accuracy and reducing website breakage while being robust to common evasion techniques. PURL's deployment on a sample of top-million websites shows that link decoration is abused for tracking on nearly three-quarters of the websites, often to share cookies, email addresses, and fingerprinting information. △ Less

Submitted 6 March, 2024; v1 submitted 7 August, 2023; originally announced August 2023.

arXiv:2307.00143 [pdf, other]

Centauri: Practical Rowhammer Fingerprinting

Authors: Hari Venugopalan, Kaustav Goswami, Zainul Abi Din, Jason Lowe-Power, Samuel T. King, Zubair Shafiq

Abstract: Fingerprinters leverage the heterogeneity in hardware and software configurations to extract a device fingerprint. Fingerprinting countermeasures attempt to normalize these attributes such that they present a uniform fingerprint across different devices or present different fingerprints for the same device each time. We present Centauri, a Rowhammer fingerprinting approach that can build a unique… ▽ More Fingerprinters leverage the heterogeneity in hardware and software configurations to extract a device fingerprint. Fingerprinting countermeasures attempt to normalize these attributes such that they present a uniform fingerprint across different devices or present different fingerprints for the same device each time. We present Centauri, a Rowhammer fingerprinting approach that can build a unique and stable fingerprints even across devices with homogeneous or normalized/obfuscated hardware and software configurations. To this end, Centauri leverages the process variation in the underlying manufacturing process that gives rise to unique distributions of Rowhammer-induced bit flips across different DRAM modules. Centauri's design and implementation is able to overcome memory allocation constrains without requiring root privileges. Our evaluation on a test bed of about one hundred DRAM modules shows that system achieves 99.91% fingerprinting accuracy. Centauri's fingerprints are also stable with daily experiments over a period of 10 days revealing no loss in fingerprinting accuracy. We show that Centauri is efficient, taking as little as 9.92 seconds to extract a fingerprint. Centauri is the first practical Rowhammer fingerprinting approach that is able to extract unique and stable fingerprints efficiently and at-scale. △ Less

Submitted 30 June, 2023; originally announced July 2023.

arXiv:2302.01182 [pdf, other]

Blocking JavaScript without Breaking the Web: An Empirical Investigation

Authors: Abdul Haddi Amjad, Zubair Shafiq, Muhammad Ali Gulzar

Abstract: Modern websites heavily rely on JavaScript (JS) to implement legitimate functionality as well as privacy-invasive advertising and tracking. Browser extensions such as NoScript block any script not loaded by a trusted list of endpoints, thus hoping to block privacy-invasive scripts while avoiding breaking legitimate website functionality. In this paper, we investigate whether blocking JS on the web… ▽ More Modern websites heavily rely on JavaScript (JS) to implement legitimate functionality as well as privacy-invasive advertising and tracking. Browser extensions such as NoScript block any script not loaded by a trusted list of endpoints, thus hoping to block privacy-invasive scripts while avoiding breaking legitimate website functionality. In this paper, we investigate whether blocking JS on the web is feasible without breaking legitimate functionality. To this end, we conduct a large-scale measurement study of JS blocking on 100K websites. We evaluate the effectiveness of different JS blocking strategies in tracking prevention and functionality breakage. Our evaluation relies on quantitative analysis of network requests and resource loads as well as manual qualitative analysis of visual breakage. First, we show that while blocking all scripts is quite effective at reducing tracking, it significantly degrades functionality on approximately two-thirds of the tested websites. Second, we show that selective blocking of a subset of scripts based on a curated list achieves a better tradeoff. However, there remain approximately 15% `mixed` scripts, which essentially merge tracking and legitimate functionality and thus cannot be blocked without causing website breakage. Finally, we show that fine-grained blocking of a subset of JS methods, instead of scripts, reduces major breakage by 3.8$\times$ while providing the same level of tracking prevention. Our work highlights the promise and open challenges in fine-grained JS blocking for tracking prevention without breaking the web. △ Less

Submitted 23 March, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

Journal ref: petsymposium 2023

arXiv:2210.08136 [pdf, other]

A Utility-Preserving Obfuscation Approach for YouTube Recommendations

Authors: Jiang Zhang, Hadi Askari, Konstantinos Psounis, Zubair Shafiq

Abstract: Online content platforms optimize engagement by providing personalized recommendations to their users. These recommendation systems track and profile users to predict relevant content a user is likely interested in. While the personalized recommendations provide utility to users, the tracking and profiling that enables them poses a privacy issue because the platform might infer potentially sensiti… ▽ More Online content platforms optimize engagement by providing personalized recommendations to their users. These recommendation systems track and profile users to predict relevant content a user is likely interested in. While the personalized recommendations provide utility to users, the tracking and profiling that enables them poses a privacy issue because the platform might infer potentially sensitive user interests. There is increasing interest in building privacy-enhancing obfuscation approaches that do not rely on cooperation from online content platforms. However, existing obfuscation approaches primarily focus on enhancing privacy but at the same time they degrade the utility because obfuscation introduces unrelated recommendations. We design and implement De-Harpo, an obfuscation approach for YouTube's recommendation system that not only obfuscates a user's video watch history to protect privacy but then also denoises the video recommendations by YouTube to preserve their utility. In contrast to prior obfuscation approaches, De-Harpo adds a denoiser that makes use of a "secret" input (i.e., a user's actual watch history) as well as information that is also available to the adversarial recommendation system (i.e., obfuscated watch history and corresponding "noisy" recommendations). Our large-scale evaluation of De-Harpo shows that it outperforms the state-of-the-art by a factor of 2x in terms of preserving utility for the same level of privacy, while maintaining stealthiness and robustness to de-obfuscation. △ Less

Submitted 16 June, 2023; v1 submitted 14 October, 2022; originally announced October 2022.

arXiv:2210.06654 [pdf, other]

The Inventory is Dark and Full of Misinformation: Understanding the Abuse of Ad Inventory Pooling in the Ad-Tech Supply Chain

Authors: Yash Vekaria, Rishab Nithyanand, Zubair Shafiq

Abstract: Ad-tech enables publishers to programmatically sell their ad inventory to millions of demand partners through a complex supply chain. Bogus or low quality publishers can exploit the opaque nature of the ad-tech to deceptively monetize their ad inventory. In this paper, we investigate for the first time how misinformation sites subvert the ad-tech transparency standards and pool their ad inventory… ▽ More Ad-tech enables publishers to programmatically sell their ad inventory to millions of demand partners through a complex supply chain. Bogus or low quality publishers can exploit the opaque nature of the ad-tech to deceptively monetize their ad inventory. In this paper, we investigate for the first time how misinformation sites subvert the ad-tech transparency standards and pool their ad inventory with unrelated sites to circumvent brand safety protections. We find that a few major ad exchanges are disproportionately responsible for the dark pools that are exploited by misinformation websites. We further find evidence that dark pooling allows misinformation sites to deceptively sell their ad inventory to reputable brands. We conclude with a discussion of potential countermeasures such as better vetting of ad exchange partners, adoption of new ad-tech transparency standards that enable end-to-end validation of the ad-tech supply chain, as well as widespread deployment of independent audits like ours. △ Less

Submitted 14 October, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: To appear at IEEE Symposium on Security & Privacy (Oakland) 2024

arXiv:2208.12370 [pdf, other]

COOKIEGRAPH: Understanding and Detecting First-Party Tracking Cookies

Authors: Shaoor Munir, Sandra Siby, Umar Iqbal, Steven Englehardt, Zubair Shafiq, Carmela Troncoso

Abstract: As third-party cookie blocking is becoming the norm in browsers, advertisers and trackers have started to use first-party cookies for tracking. We conduct a differential measurement study on 10K websites with third-party cookies allowed and blocked. This study reveals that first-party cookies are used to store and exfiltrate identifiers to known trackers even when third-party cookies are blocked.… ▽ More As third-party cookie blocking is becoming the norm in browsers, advertisers and trackers have started to use first-party cookies for tracking. We conduct a differential measurement study on 10K websites with third-party cookies allowed and blocked. This study reveals that first-party cookies are used to store and exfiltrate identifiers to known trackers even when third-party cookies are blocked. As opposed to third-party cookie blocking, outright first-party cookie blocking is not practical because it would result in major functionality breakage. We propose CookieGraph, a machine learning-based approach that can accurately and robustly detect first-party tracking cookies. CookieGraph detects first-party tracking cookies with 90.20% accuracy, outperforming the state-of-the-art CookieBlock approach by 17.75%. We show that CookieGraph is fully robust against cookie name manipulation while CookieBlock's acuracy drops by 15.68%. While blocking all first-party cookies results in major breakage on 32% of the sites with SSO logins, and CookieBlock reduces it to 10%, we show that CookieGraph does not cause any major breakage on these sites. Our deployment of CookieGraph shows that first-party tracking cookies are used on 93.43% of the 10K websites. We also find that first-party tracking cookies are set by fingerprinting scripts. The most prevalent first-party tracking cookies are set by major advertising entities such as Google, Facebook, and TikTok. △ Less

Submitted 27 November, 2023; v1 submitted 25 August, 2022; originally announced August 2022.

arXiv:2206.13599 [pdf, other]

Nowhere to Hide: Detecting Obfuscated Fingerprinting Scripts

Authors: Ray Ngan, Surya Konkimalla, Zubair Shafiq

Abstract: As the web moves away from stateful tracking, browser fingerprinting is becoming more prevalent. Unfortunately, existing approaches to detect browser fingerprinting do not take into account potential evasion tactics such as code obfuscation. To address this gap, we investigate the robustness of a state-of-the-art fingerprinting detection approach against various off-the-shelf obfuscation tools. Ov… ▽ More As the web moves away from stateful tracking, browser fingerprinting is becoming more prevalent. Unfortunately, existing approaches to detect browser fingerprinting do not take into account potential evasion tactics such as code obfuscation. To address this gap, we investigate the robustness of a state-of-the-art fingerprinting detection approach against various off-the-shelf obfuscation tools. Overall, we find that the combination of static and dynamic analysis is robust against different types of obfuscation. While some obfuscators are able to induce false negatives in static analysis, dynamic analysis is still able detect these cases. Since obfuscation does not induce significant false positives, the combination of static and dynamic analysis is still able to accurately detect obfuscated fingerprinting scripts. △ Less

Submitted 27 June, 2022; originally announced June 2022.

arXiv:2206.04803 [pdf, other]

Detecting Anomalous Cryptocurrency Transactions: an AML/CFT Application of Machine Learning-based Forensics

Authors: Nadia Pocher, Mirko Zichichi, Fabio Merizzi, Muhammad Zohaib Shafiq, Stefano Ferretti

Abstract: In shaping the Internet of Money, the application of blockchain and distributed ledger technologies (DLTs) to the financial sector triggered regulatory concerns. Notably, while the user anonymity enabled in this field may safeguard privacy and data protection, the lack of identifiability hinders accountability and challenges the fight against money laundering and the financing of terrorism and pro… ▽ More In shaping the Internet of Money, the application of blockchain and distributed ledger technologies (DLTs) to the financial sector triggered regulatory concerns. Notably, while the user anonymity enabled in this field may safeguard privacy and data protection, the lack of identifiability hinders accountability and challenges the fight against money laundering and the financing of terrorism and proliferation (AML/CFT). As law enforcement agencies and the private sector apply forensics to track crypto transfers across ecosystems that are socio-technical in nature, this paper focuses on the growing relevance of these techniques in a domain where their deployment impacts the traits and evolution of the sphere. In particular, this work offers contextualized insights into the application of methods of machine learning and transaction graph analysis. Namely, it analyzes a real-world dataset of Bitcoin transactions represented as a directed graph network through various techniques. The modeling of blockchain transactions as a complex network suggests that the use of graph-based data analysis methods can help classify transactions and identify illicit ones. Indeed, this work shows that the neural network types known as Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT) are a promising AML/CFT solution. Notably, in this scenario GCN outperform other classic approaches and GAT are applied for the first time to detect anomalies in Bitcoin. Ultimately, the paper upholds the value of public-private synergies to devise forensic strategies conscious of the spirit of explainability and data openness. △ Less

Submitted 18 March, 2023; v1 submitted 7 June, 2022; originally announced June 2022.

arXiv:2204.10920 [pdf, other]

doi 10.1145/3618257.3624803

Tracking, Profiling, and Ad Targeting in the Alexa Echo Smart Speaker Ecosystem

Authors: Umar Iqbal, Pouneh Nikkhah Bahrami, Rahmadi Trimananda, Hao Cui, Alexander Gamero-Garrido, Daniel Dubois, David Choffnes, Athina Markopoulou, Franziska Roesner, Zubair Shafiq

Abstract: Smart speakers collect voice commands, which can be used to infer sensitive information about users. Given the potential for privacy harms, there is a need for greater transparency and control over the data collected, used, and shared by smart speaker platforms as well as third party skills supported on them. To bridge this gap, we build a framework to measure data collection, usage, and sharing b… ▽ More Smart speakers collect voice commands, which can be used to infer sensitive information about users. Given the potential for privacy harms, there is a need for greater transparency and control over the data collected, used, and shared by smart speaker platforms as well as third party skills supported on them. To bridge this gap, we build a framework to measure data collection, usage, and sharing by the smart speaker platforms. We apply our framework to the Amazon smart speaker ecosystem. Our results show that Amazon and third parties, including advertising and tracking services that are unique to the smart speaker ecosystem, collect smart speaker interaction data. We also find that Amazon processes smart speaker interaction data to infer user interests and uses those inferences to serve targeted ads to users. Smart speaker interaction also leads to ad targeting and as much as 30X higher bids in ad auctions, from third party advertisers. Finally, we find that Amazon's and third party skills' data practices are often not clearly disclosed in their policy documents. △ Less

Submitted 13 October, 2023; v1 submitted 22 April, 2022; originally announced April 2022.

Comments: Published at the ACM Internet Measurement Conference 2023

arXiv:2203.11849 [pdf, other]

A Girl Has A Name, And It's ... Adversarial Authorship Attribution for Deobfuscation

Authors: Wanyue Zhai, Jonathan Rusert, Zubair Shafiq, Padmini Srinivasan

Abstract: Recent advances in natural language processing have enabled powerful privacy-invasive authorship attribution. To counter authorship attribution, researchers have proposed a variety of rule-based and learning-based text obfuscation approaches. However, existing authorship obfuscation approaches do not consider the adversarial threat model. Specifically, they are not evaluated against adversarially… ▽ More Recent advances in natural language processing have enabled powerful privacy-invasive authorship attribution. To counter authorship attribution, researchers have proposed a variety of rule-based and learning-based text obfuscation approaches. However, existing authorship obfuscation approaches do not consider the adversarial threat model. Specifically, they are not evaluated against adversarially trained authorship attributors that are aware of potential obfuscation. To fill this gap, we investigate the problem of adversarial authorship attribution for deobfuscation. We show that adversarially trained authorship attributors are able to degrade the effectiveness of existing obfuscators from 20-30% to 5-10%. We also evaluate the effectiveness of adversarial training when the attributor makes incorrect assumptions about whether and which obfuscator was used. While there is a a clear degradation in attribution accuracy, it is noteworthy that this degradation is still at or above the attribution accuracy of the attributor that is not adversarially trained at all. Our results underline the need for stronger obfuscation approaches that are resistant to deobfuscation △ Less

Submitted 22 March, 2022; originally announced March 2022.

Comments: 9 pages, 7 figures, 3 tables, ACL 2022

arXiv:2203.11331 [pdf, other]

On The Robustness of Offensive Language Classifiers

Authors: Jonathan Rusert, Zubair Shafiq, Padmini Srinivasan

Abstract: Social media platforms are deploying machine learning based offensive language classification systems to combat hateful, racist, and other forms of offensive speech at scale. However, despite their real-world deployment, we do not yet comprehensively understand the extent to which offensive language classifiers are robust against adversarial attacks. Prior work in this space is limited to studying… ▽ More Social media platforms are deploying machine learning based offensive language classification systems to combat hateful, racist, and other forms of offensive speech at scale. However, despite their real-world deployment, we do not yet comprehensively understand the extent to which offensive language classifiers are robust against adversarial attacks. Prior work in this space is limited to studying robustness of offensive language classifiers against primitive attacks such as misspellings and extraneous spaces. To address this gap, we systematically analyze the robustness of state-of-the-art offensive language classifiers against more crafty adversarial attacks that leverage greedy- and attention-based word selection and context-aware embeddings for word replacement. Our results on multiple datasets show that these crafty adversarial attacks can degrade the accuracy of offensive language classifiers by more than 50% while also being able to preserve the readability and meaning of the modified text. △ Less

Submitted 21 March, 2022; originally announced March 2022.

Comments: 9 pages, 2 figures, Accepted at ACL 2022

ACM Class: I.2.7

arXiv:2203.10666 [pdf, other]

YouTube, The Great Radicalizer? Auditing and Mitigating Ideological Biases in YouTube Recommendations

Authors: Muhammad Haroon, Anshuman Chhabra, Xin Liu, Prasant Mohapatra, Zubair Shafiq, Magdalena Wojcieszak

Abstract: Recommendations algorithms of social media platforms are often criticized for placing users in "rabbit holes" of (increasingly) ideologically biased content. Despite these concerns, prior evidence on this algorithmic radicalization is inconsistent. Furthermore, prior work lacks systematic interventions that reduce the potential ideological bias in recommendation algorithms. We conduct a systematic… ▽ More Recommendations algorithms of social media platforms are often criticized for placing users in "rabbit holes" of (increasingly) ideologically biased content. Despite these concerns, prior evidence on this algorithmic radicalization is inconsistent. Furthermore, prior work lacks systematic interventions that reduce the potential ideological bias in recommendation algorithms. We conduct a systematic audit of YouTube's recommendation system using a hundred thousand sock puppets to determine the presence of ideological bias (i.e., are recommendations aligned with users' ideology), its magnitude (i.e., are users recommended an increasing number of videos aligned with their ideology), and radicalization (i.e., are the recommendations progressively more extreme). Furthermore, we design and evaluate a bottom-up intervention to minimize ideological bias in recommendations without relying on cooperation from YouTube. We find that YouTube's recommendations do direct users -- especially right-leaning users -- to ideologically biased and increasingly radical content on both homepages and in up-next recommendations. Our intervention effectively mitigates the observed bias, leading to more recommendations to ideologically neutral, diverse, and dissimilar content, yet debiasing is especially challenging for right-leaning users. Our systematic assessment shows that while YouTube recommendations lead to ideological bias, such bias can be mitigated through our intervention. △ Less

Submitted 24 March, 2022; v1 submitted 20 March, 2022; originally announced March 2022.

arXiv:2202.12872 [pdf, other]

AutoFR: Automated Filter Rule Generation for Adblocking

Authors: Hieu Le, Salma Elmalaki, Athina Markopoulou, Zubair Shafiq

Abstract: Adblocking relies on filter lists, which are manually curated and maintained by a community of filter list authors. Filter list curation is a laborious process that does not scale well to a large number of sites or over time. In this paper, we introduce AutoFR, a reinforcement learning framework to fully automate the process of filter rule creation and evaluation for sites of interest. We design a… ▽ More Adblocking relies on filter lists, which are manually curated and maintained by a community of filter list authors. Filter list curation is a laborious process that does not scale well to a large number of sites or over time. In this paper, we introduce AutoFR, a reinforcement learning framework to fully automate the process of filter rule creation and evaluation for sites of interest. We design an algorithm based on multi-arm bandits to generate filter rules that block ads while controlling the trade-off between blocking ads and avoiding visual breakage. We test AutoFR on thousands of sites and we show that it is efficient: it takes only a few minutes to generate filter rules for a site of interest. AutoFR is effective: it generates filter rules that can block 86% of the ads, as compared to 87% by EasyList, while achieving comparable visual breakage. Furthermore, AutoFR generates filter rules that generalize well to new sites. We envision that AutoFR can assist the adblocking community in filter rule generation at scale. △ Less

Submitted 7 March, 2023; v1 submitted 25 February, 2022; originally announced February 2022.

Comments: 16 pages with 13 figures, 3 tables, 1 algorithm. 3.5 pages of references. Appendices include 10 pages of appendices with 11 figures and 3 tables

arXiv:2112.01662 [pdf, other]

FP-Radar: Longitudinal Measurement and Early Detection of Browser Fingerprinting

Authors: Pouneh Nikkhah Bahrami, Umar Iqbal, Zubair Shafiq

Abstract: Browser fingerprinting is a stateless tracking technique that attempts to combine information exposed by multiple different web APIs to create a unique identifier for tracking users across the web. Over the last decade, trackers have abused several existing and newly proposed web APIs to further enhance the browser fingerprint. Existing approaches are limited to detecting a specific fingerprinting… ▽ More Browser fingerprinting is a stateless tracking technique that attempts to combine information exposed by multiple different web APIs to create a unique identifier for tracking users across the web. Over the last decade, trackers have abused several existing and newly proposed web APIs to further enhance the browser fingerprint. Existing approaches are limited to detecting a specific fingerprinting technique(s) at a particular point in time. Thus, they are unable to systematically detect novel fingerprinting techniques that abuse different web APIs. In this paper, we propose FP-Radar, a machine learning approach that leverages longitudinal measurements of web API usage on top-100K websites over the last decade, for early detection of new and evolving browser fingerprinting techniques. The results show that FP-Radar is able to early detect the abuse of newly introduced properties of already known (e.g., WebGL, Sensor) and as well as previously unknown (e.g., Gamepad, Clipboard) APIs for browser fingerprinting. To the best of our knowledge, FP-Radar is also the first to detect the abuse of the Visibility API for ephemeral fingerprinting in the wild. △ Less

Submitted 14 December, 2021; v1 submitted 2 December, 2021; originally announced December 2021.

Journal ref: Proceedings on Privacy Enhancing Technologies (2022)

arXiv:2111.05792 [pdf, other]

doi 10.14722/ndss.2022.23062

HARPO: Learning to Subvert Online Behavioral Advertising

Authors: Jiang Zhang, Konstantinos Psounis, Muhammad Haroon, Zubair Shafiq

Abstract: Online behavioral advertising, and the associated tracking paraphernalia, poses a real privacy threat. Unfortunately, existing privacy-enhancing tools are not always effective against online advertising and tracking. We propose Harpo, a principled learning-based approach to subvert online behavioral advertising through obfuscation. Harpo uses reinforcement learning to adaptively interleave real pa… ▽ More Online behavioral advertising, and the associated tracking paraphernalia, poses a real privacy threat. Unfortunately, existing privacy-enhancing tools are not always effective against online advertising and tracking. We propose Harpo, a principled learning-based approach to subvert online behavioral advertising through obfuscation. Harpo uses reinforcement learning to adaptively interleave real page visits with fake pages to distort a tracker's view of a user's browsing profile. We evaluate Harpo against real-world user profiling and ad targeting models used for online behavioral advertising. The results show that Harpo improves privacy by triggering more than 40% incorrect interest segments and 6x higher bid values. Harpo outperforms existing obfuscation tools by as much as 16x for the same overhead. Harpo is also able to achieve better stealthiness to adversarial detection than existing obfuscation tools. Harpo meaningfully advances the state-of-the-art in leveraging obfuscation to subvert online behavioral advertising △ Less

Submitted 23 November, 2021; v1 submitted 8 November, 2021; originally announced November 2021.

Comments: Accepted at NDSS'22

arXiv:2109.07028 [pdf, other]

Avengers Ensemble! Improving Transferability of Authorship Obfuscation

Authors: Muhammad Haroon, Fareed Zaffar, Padmini Srinivasan, Zubair Shafiq

Abstract: Stylometric approaches have been shown to be quite effective for real-world authorship attribution. To mitigate the privacy threat posed by authorship attribution, researchers have proposed automated authorship obfuscation approaches that aim to conceal the stylometric artefacts that give away the identity of an anonymous document's author. Recent work has focused on authorship obfuscation approac… ▽ More Stylometric approaches have been shown to be quite effective for real-world authorship attribution. To mitigate the privacy threat posed by authorship attribution, researchers have proposed automated authorship obfuscation approaches that aim to conceal the stylometric artefacts that give away the identity of an anonymous document's author. Recent work has focused on authorship obfuscation approaches that rely on black-box access to an attribution classifier to evade attribution while preserving semantics. However, to be useful under a realistic threat model, it is important that these obfuscation approaches work well even when the adversary's attribution classifier is different from the one used internally by the obfuscator. Unfortunately, existing authorship obfuscation approaches do not transfer well to unseen attribution classifiers. In this paper, we propose an ensemble-based approach for transferable authorship obfuscation. Our experiments show that if an obfuscator can evade an ensemble attribution classifier, which is based on multiple base attribution classifiers, it is more likely to transfer to different attribution classifiers. Our analysis shows that ensemble-based authorship obfuscation achieves better transferability because it combines the knowledge from each of the base attribution classifiers by essentially averaging their decision boundaries. △ Less

Submitted 8 October, 2021; v1 submitted 14 September, 2021; originally announced September 2021.

Comments: Submitted to PETS 2021

arXiv:2108.13923 [pdf, other]

TrackerSift: Untangling Mixed Tracking and Functional Web Resources

Authors: Abdul Haddi Amjad, Danial Saleem, Fareed Zaffar, Muhammad Ali Gulzar, Zubair Shafiq

Abstract: Trackers have recently started to mix tracking and functional resources to circumvent privacy-enhancing content blocking tools. Such mixed web resources put content blockers in a bind: risk breaking legitimate functionality if they act and risk missing privacy-invasive advertising and tracking if they do not. In this paper, we propose TrackerSift to progressively classify and untangle mixed web re… ▽ More Trackers have recently started to mix tracking and functional resources to circumvent privacy-enhancing content blocking tools. Such mixed web resources put content blockers in a bind: risk breaking legitimate functionality if they act and risk missing privacy-invasive advertising and tracking if they do not. In this paper, we propose TrackerSift to progressively classify and untangle mixed web resources (that combine tracking and legitimate functionality) at multiple granularities of analysis (domain, hostname, script, and method). Using TrackerSift, we conduct a large-scale measurement study of such mixed resources on 100K websites. We find that more than 17% domains, 48% hostnames, 6% scripts, and 9% methods observed in our crawls combine tracking and legitimate functionality. While mixed web resources are prevalent across all granularities, TrackerSift is able to attribute 98% of the script-initiated network requests to either tracking or functional resources at the finest method-level granularity. Our analysis shows that mixed resources at different granularities are typically served from CDNs or as inlined and bundled scripts, and that blocking them indeed results in breakage of legitimate functionality. Our results highlight opportunities for finer-grained content blocking to remove mixed resources without breaking legitimate functionality. △ Less

Submitted 29 September, 2021; v1 submitted 28 August, 2021; originally announced August 2021.

arXiv:2107.11309 [pdf, other]

WebGraph: Capturing Advertising and Tracking Information Flows for Robust Blocking

Authors: Sandra Siby, Umar Iqbal, Steven Englehardt, Zubair Shafiq, Carmela Troncoso

Abstract: Millions of web users directly depend on ad and tracker blocking tools to protect their privacy. However, existing ad and tracker blockers fall short because of their reliance on trivially susceptible advertising and tracking content. In this paper, we first demonstrate that the state-of-the-art machine learning based ad and tracker blockers, such as AdGraph, are susceptible to adversarial evasion… ▽ More Millions of web users directly depend on ad and tracker blocking tools to protect their privacy. However, existing ad and tracker blockers fall short because of their reliance on trivially susceptible advertising and tracking content. In this paper, we first demonstrate that the state-of-the-art machine learning based ad and tracker blockers, such as AdGraph, are susceptible to adversarial evasions deployed in real-world. Second, we introduce WebGraph, the first graph-based machine learning blocker that detects ads and trackers based on their action rather than their content. By building features around the actions that are fundamental to advertising and tracking - storing an identifier in the browser, or sharing an identifier with another tracker - WebGraph performs nearly as well as prior approaches, but is significantly more robust to adversarial evasions. In particular, we show that WebGraph achieves comparable accuracy to AdGraph, while significantly decreasing the success rate of an adversary from near-perfect under AdGraph to around 8% under WebGraph. Finally, we show that WebGraph remains robust to a more sophisticated adversary that uses evasion techniques beyond those currently deployed on the web. △ Less

Submitted 17 August, 2021; v1 submitted 23 July, 2021; originally announced July 2021.

arXiv:2106.01703 [pdf, other]

Fingerprinting Fine-tuned Language Models in the Wild

Authors: Nirav Diwan, Tanmoy Chakravorty, Zubair Shafiq

Abstract: There are concerns that the ability of language models (LMs) to generate high quality synthetic text can be misused to launch spam, disinformation, or propaganda. Therefore, the research community is actively working on developing approaches to detect whether a given text is organic or synthetic. While this is a useful first step, it is important to be able to further fingerprint the author LM to… ▽ More There are concerns that the ability of language models (LMs) to generate high quality synthetic text can be misused to launch spam, disinformation, or propaganda. Therefore, the research community is actively working on developing approaches to detect whether a given text is organic or synthetic. While this is a useful first step, it is important to be able to further fingerprint the author LM to attribute its origin. Prior work on fingerprinting LMs is limited to attributing synthetic text generated by a handful (usually < 10) of pre-trained LMs. However, LMs such as GPT2 are commonly fine-tuned in a myriad of ways (e.g., on a domain-specific text corpus) before being used to generate synthetic text. It is challenging to fingerprinting fine-tuned LMs because the universe of fine-tuned LMs is much larger in realistic scenarios. To address this challenge, we study the problem of large-scale fingerprinting of fine-tuned LMs in the wild. Using a real-world dataset of synthetic text generated by 108 different fine-tuned LMs, we conduct comprehensive experiments to demonstrate the limitations of existing fingerprinting approaches. Our results show that fine-tuning itself is the most effective in attributing the synthetic text generated by fine-tuned LMs. △ Less

Submitted 3 June, 2021; originally announced June 2021.

arXiv:2105.05381 [pdf, ps, other]

doi 10.1109/SP46215.2023.00109

Accuracy-Privacy Trade-off in Deep Ensemble: A Membership Inference Perspective

Authors: Shahbaz Rezaei, Zubair Shafiq, Xin Liu

Abstract: Deep ensemble learning has been shown to improve accuracy by training multiple neural networks and averaging their outputs. Ensemble learning has also been suggested to defend against membership inference attacks that undermine privacy. In this paper, we empirically demonstrate a trade-off between these two goals, namely accuracy and privacy (in terms of membership inference attacks), in deep ense… ▽ More Deep ensemble learning has been shown to improve accuracy by training multiple neural networks and averaging their outputs. Ensemble learning has also been suggested to defend against membership inference attacks that undermine privacy. In this paper, we empirically demonstrate a trade-off between these two goals, namely accuracy and privacy (in terms of membership inference attacks), in deep ensembles. Using a wide range of datasets and model architectures, we show that the effectiveness of membership inference attacks increases when ensembling improves accuracy. We analyze the impact of various factors in deep ensembles and demonstrate the root cause of the trade-off. Then, we evaluate common defenses against membership inference attacks based on regularization and differential privacy. We show that while these defenses can mitigate the effectiveness of membership inference attacks, they simultaneously degrade ensemble accuracy. We illustrate similar trade-off in more advanced and state-of-the-art ensembling techniques, such as snapshot ensembles and diversified ensemble networks. Finally, we propose a simple yet effective defense for deep ensembles to break the trade-off and, consequently, improve the accuracy and privacy, simultaneously. △ Less

Submitted 5 December, 2022; v1 submitted 11 May, 2021; originally announced May 2021.

arXiv:2102.04217 [pdf, other]

Understanding Underground Incentivized Review Services

Authors: Rajvardhan Oak, Zubair Shafiq

Abstract: While human factors in fraud have been studied by the HCI and security communities, most research has been directed to understanding either the victims' perspectives or prevention strategies, and not on fraudsters, their motivations and operation techniques. Additionally, the focus has been on a narrow set of problems: phishing, spam and bullying. In this work, we seek to understand review fraud o… ▽ More While human factors in fraud have been studied by the HCI and security communities, most research has been directed to understanding either the victims' perspectives or prevention strategies, and not on fraudsters, their motivations and operation techniques. Additionally, the focus has been on a narrow set of problems: phishing, spam and bullying. In this work, we seek to understand review fraud on e-commerce platforms through an HCI lens. Through surveys with real fraudsters (N=36 agents and N=38 reviewers), we uncover sophisticated recruitment, execution, and reporting mechanisms fraudsters use to scale their operation while resisting takedown attempts, including the use of AI tools like ChatGPT. We find that countermeasures that crack down on communication channels through which these services operate are effective in combating incentivized reviews. This research sheds light on the complex landscape of incentivized reviews, providing insights into the mechanics of underground services and their resilience to removal efforts. △ Less

Submitted 14 February, 2024; v1 submitted 20 January, 2021; originally announced February 2021.

Comments: This work has been accepted for presentation at CHI 2024

ACM Class: K.4.4

arXiv:2010.01497 [pdf, other]

doi 10.1145/3419394.3423662

Understanding Incentivized Mobile App Installs on Google Play Store

Authors: Shehroze Farooqi, Álvaro Feal, Tobias Lauinger, Damon McCoy, Zubair Shafiq, Narseo Vallina-Rodriguez

Abstract: "Incentivized" advertising platforms allow mobile app developers to acquire new users by directly paying users to install and engage with mobile apps (e.g., create an account, make in-app purchases). Incentivized installs are banned by the Apple App Store and discouraged by the Google Play Store because they can manipulate app store metrics (e.g., install counts, appearance in top charts). Yet, ma… ▽ More "Incentivized" advertising platforms allow mobile app developers to acquire new users by directly paying users to install and engage with mobile apps (e.g., create an account, make in-app purchases). Incentivized installs are banned by the Apple App Store and discouraged by the Google Play Store because they can manipulate app store metrics (e.g., install counts, appearance in top charts). Yet, many organizations still offer incentivized install services for Android apps. In this paper, we present the first study to understand the ecosystem of incentivized mobile app install campaigns in Android and its broader ramifications through a series of measurements. We identify incentivized install campaigns that require users to install an app and perform in-app tasks targeting manipulation of a wide variety of user engagement metrics (e.g., daily active users, user session lengths) and revenue. Our results suggest that these artificially inflated metrics can be effective in improving app store metrics as well as helping mobile app developers to attract funding from venture capitalists. Our study also indicates lax enforcement of the Google Play Store's existing policies to prevent these behaviors. It further motivates the need for stricter policing of incentivized install campaigns. Our proposed measurements can also be leveraged by the Google Play Store to identify potential policy violations. △ Less

Submitted 4 October, 2020; originally announced October 2020.

Journal ref: ACM Internet Measurement Conference (2020)

arXiv:2008.04480 [pdf, other]

Fingerprinting the Fingerprinters: Learning to Detect Browser Fingerprinting Behaviors

Authors: Umar Iqbal, Steven Englehardt, Zubair Shafiq

Abstract: Browser fingerprinting is an invasive and opaque stateless tracking technique. Browser vendors, academics, and standards bodies have long struggled to provide meaningful protections against browser fingerprinting that are both accurate and do not degrade user experience. We propose FP-Inspector, a machine learning based syntactic-semantic approach to accurately detect browser fingerprinting. We sh… ▽ More Browser fingerprinting is an invasive and opaque stateless tracking technique. Browser vendors, academics, and standards bodies have long struggled to provide meaningful protections against browser fingerprinting that are both accurate and do not degrade user experience. We propose FP-Inspector, a machine learning based syntactic-semantic approach to accurately detect browser fingerprinting. We show that FP-Inspector performs well, allowing us to detect 26% more fingerprinting scripts than the state-of-the-art. We show that an API-level fingerprinting countermeasure, built upon FP-Inspector, helps reduce website breakage by a factor of 2. We use FP-Inspector to perform a measurement study of browser fingerprinting on top-100K websites. We find that browser fingerprinting is now present on more than 10% of the top-100K websites and over a quarter of the top-10K websites. We also discover previously unreported uses of JavaScript APIs by fingerprinting scripts suggesting that they are looking to exploit APIs in new and unexpected ways. △ Less

Submitted 10 August, 2020; originally announced August 2020.

Comments: To appear in the Proceedings of the IEEE Symposium on Security & Privacy 2021

arXiv:2006.15794 [pdf, other]

CanaryTrap: Detecting Data Misuse by Third-Party Apps on Online Social Networks

Authors: Shehroze Farooqi, Maaz Musa, Zubair Shafiq, Fareed Zaffar

Abstract: Online social networks support a vibrant ecosystem of third-party apps that get access to personal information of a large number of users. Despite several recent high-profile incidents, methods to systematically detect data misuse by third-party apps on online social networks are lacking. We propose CanaryTrap to detect misuse of data shared with third-party apps. CanaryTrap associates a honeytoke… ▽ More Online social networks support a vibrant ecosystem of third-party apps that get access to personal information of a large number of users. Despite several recent high-profile incidents, methods to systematically detect data misuse by third-party apps on online social networks are lacking. We propose CanaryTrap to detect misuse of data shared with third-party apps. CanaryTrap associates a honeytoken to a user account and then monitors its unrecognized use via different channels after sharing it with the third-party app. We design and implement CanaryTrap to investigate misuse of data shared with third-party apps on Facebook. Specifically, we share the email address associated with a Facebook account as a honeytoken by installing a third-party app. We then monitor the received emails and use Facebook's ad transparency tool to detect any unrecognized use of the shared honeytoken. Our deployment of CanaryTrap to monitor 1,024 Facebook apps has uncovered multiple cases of misuse of data shared with third-party apps on Facebook including ransomware, spam, and targeted advertising. △ Less

Submitted 28 June, 2020; originally announced June 2020.

arXiv:2005.00702 [pdf, other]

A Girl Has A Name: Detecting Authorship Obfuscation

Authors: Asad Mahmood, Zubair Shafiq, Padmini Srinivasan

Abstract: Authorship attribution aims to identify the author of a text based on the stylometric analysis. Authorship obfuscation, on the other hand, aims to protect against authorship attribution by modifying a text's style. In this paper, we evaluate the stealthiness of state-of-the-art authorship obfuscation methods under an adversarial threat model. An obfuscator is stealthy to the extent an adversary fi… ▽ More Authorship attribution aims to identify the author of a text based on the stylometric analysis. Authorship obfuscation, on the other hand, aims to protect against authorship attribution by modifying a text's style. In this paper, we evaluate the stealthiness of state-of-the-art authorship obfuscation methods under an adversarial threat model. An obfuscator is stealthy to the extent an adversary finds it challenging to detect whether or not a text modified by the obfuscator is obfuscated - a decision that is key to the adversary interested in authorship attribution. We show that the existing authorship obfuscation methods are not stealthy as their obfuscated texts can be identified with an average F1 score of 0.87. The reason for the lack of stealthiness is that these obfuscators degrade text smoothness, as ascertained by neural language models, in a detectable manner. Our results highlight the need to develop stealthy authorship obfuscation methods that can better protect the identity of an author seeking anonymity. △ Less

Submitted 2 May, 2020; originally announced May 2020.

Comments: 9 pages, 4 figures, 2 tables, ACL 2020

arXiv:2001.10999 [pdf, other]

A4 : Evading Learning-based Adblockers

Authors: Shitong Zhu, Zhongjie Wang, Xun Chen, Shasha Li, Umar Iqbal, Zhiyun Qian, Kevin S. Chan, Srikanth V. Krishnamurthy, Zubair Shafiq

Abstract: Efforts by online ad publishers to circumvent traditional ad blockers towards regaining fiduciary benefits, have been demonstrably successful. As a result, there have recently emerged a set of adblockers that apply machine learning instead of manually curated rules and have been shown to be more robust in blocking ads on websites including social media sites such as Facebook. Among these, AdGraph… ▽ More Efforts by online ad publishers to circumvent traditional ad blockers towards regaining fiduciary benefits, have been demonstrably successful. As a result, there have recently emerged a set of adblockers that apply machine learning instead of manually curated rules and have been shown to be more robust in blocking ads on websites including social media sites such as Facebook. Among these, AdGraph is arguably the state-of-the-art learning-based adblocker. In this paper, we develop A4, a tool that intelligently crafts adversarial samples of ads to evade AdGraph. Unlike the popular research on adversarial samples against images or videos that are considered less- to un-restricted, the samples that A4 generates preserve application semantics of the web page, or are actionable. Through several experiments we show that A4 can bypass AdGraph about 60% of the time, which surpasses the state-of-the-art attack by a significant margin of 84.3%; in addition, changes to the visual layout of the web page due to these perturbations are imperceptible. We envision the algorithmic framework proposed in A4 is also promising in improving adversarial attacks against other learning-based web applications with similar requirements. △ Less

Submitted 29 January, 2020; originally announced January 2020.

Comments: 10 pages, 7 figures

arXiv:2001.08288 [pdf, ps, other]

Characterizing Smart Home IoT Traffic in the Wild

Authors: M. Hammad Mazhar, Zubair Shafiq

Abstract: As the smart home IoT ecosystem flourishes, it is imperative to gain a better understanding of the unique challenges it poses in terms of management, security, and privacy. Prior studies are limited because they examine smart home IoT devices in testbed environments or at a small scale. To address this gap, we present a measurement study of smart home IoT devices in the wild by instrumenting home… ▽ More As the smart home IoT ecosystem flourishes, it is imperative to gain a better understanding of the unique challenges it poses in terms of management, security, and privacy. Prior studies are limited because they examine smart home IoT devices in testbed environments or at a small scale. To address this gap, we present a measurement study of smart home IoT devices in the wild by instrumenting home gateways and passively collecting real-world network traffic logs from more than 200 homes across a large metropolitan area in the United States. We characterize smart home IoT traffic in terms of its volume, temporal patterns, and external endpoints along with focusing on certain security and privacy concerns. We first show that traffic characteristics reflect the functionality of smart home IoT devices such as smart TVs generating high volume traffic to content streaming services following diurnal patterns associated with human activity. While the smart home IoT ecosystem seems fragmented, our analysis reveals that it is mostly centralized due to its reliance on a few popular cloud and DNS services. Our findings also highlight several interesting security and privacy concerns in smart home IoT ecosystem such as the need to improve policy-based access control for IoT traffic, lack of use of application layer encryption, and prevalence of third-party advertising and tracking services. Our findings have important implications for future research on improving management, security, and privacy of the smart home IoT ecosystem. △ Less

Submitted 30 March, 2020; v1 submitted 22 January, 2020; originally announced January 2020.

Comments: 13 pages, to be published in IoTDI 2020

arXiv:1911.03447 [pdf, other]

The TV is Smart and Full of Trackers: Towards Understanding the Smart TV Advertising and Tracking Ecosystem

Authors: Janus Varmarken, Hieu Le, Anastasia Shuba, Zubair Shafiq, Athina Markopoulou

Abstract: Motivated by the growing popularity of smart TVs, we present a large-scale measurement study of smart TVs by collecting and analyzing their network traffic from two different vantage points. First, we analyze aggregate network traffic of smart TVs in-the-wild, collected from residential gateways of tens of homes and several different smart TV platforms, including Apple, Samsung, Roku, and Chromeca… ▽ More Motivated by the growing popularity of smart TVs, we present a large-scale measurement study of smart TVs by collecting and analyzing their network traffic from two different vantage points. First, we analyze aggregate network traffic of smart TVs in-the-wild, collected from residential gateways of tens of homes and several different smart TV platforms, including Apple, Samsung, Roku, and Chromecast. In addition to accessing video streaming and cloud services, we find that smart TVs frequently connect to well-known as well as platform-specific advertising and tracking services (ATS). Second, we instrument Roku and Amazon Fire TV, two popular smart TV platforms, by setting up a controlled testbed to systematically exercise the top-1000 apps on each platform, and analyze their network traffic at the granularity of the individual apps. We again find that smart TV apps connect to a wide range of ATS, and that the key players of the ATS ecosystems of the two platforms are different from each other and from that of the mobile platform. Third, we evaluate the (in)effectiveness of state-of-the-art DNS-based blocklists in filtering advertising and tracking traffic for smart TVs. We find that personally identifiable information (PII) is exfiltrated to platform-related Internet endpoints and third parties, and that blocklists are generally better at preventing exposure of PII to third parties than to platform-related endpoints. Our work demonstrates the segmentation of the smart TV ATS ecosystem across platforms and its differences from the mobile ATS ecosystem, thus motivating the need for designing privacy-enhancing tools specifically for each smart TV platform. △ Less

Submitted 8 November, 2019; originally announced November 2019.

arXiv:1907.07275 [pdf, other]

doi 10.2478/popets-2020-0001

Inferring Tracker-Advertiser Relationships in the Online Advertising Ecosystem using Header Bidding

Authors: John Cook, Rishab Nithyanand, Zubair Shafiq

Abstract: Online advertising relies on trackers and data brokers to show targeted ads to users. To improve targeting, different entities in the intricately interwoven online advertising and tracking ecosystems are incentivized to share information with each other through client-side or server-side mechanisms. Inferring data sharing between entities, especially when it happens at the server-side, is an impor… ▽ More Online advertising relies on trackers and data brokers to show targeted ads to users. To improve targeting, different entities in the intricately interwoven online advertising and tracking ecosystems are incentivized to share information with each other through client-side or server-side mechanisms. Inferring data sharing between entities, especially when it happens at the server-side, is an important and challenging research problem. In this paper, we introduce KASHF: a novel method to infer data sharing relationships between advertisers and trackers by studying how an advertiser's bidding behavior changes as we manipulate the presence of trackers. We operationalize this insight by training an interpretable machine learning model that uses the presence of trackers as features to predict the bidding behavior of an advertiser. By analyzing the machine learning model, we are able to infer relationships between advertisers and trackers irrespective of whether data sharing occurs at the client-side or the server-side. We are also able to identify several server-side data sharing relationships that are validated externally but are not detected by client-side cookie syncing. △ Less

Submitted 20 September, 2019; v1 submitted 16 July, 2019; originally announced July 2019.

Comments: 18 pages, 2 figures, Privacy Enhancing Technologies Symposium (2020)

ACM Class: H.3.5

arXiv:1805.09155 [pdf, other]

AdGraph: A Graph-Based Approach to Ad and Tracker Blocking

Authors: Umar Iqbal, Peter Snyder, Shitong Zhu, Benjamin Livshits, Zhiyun Qian, Zubair Shafiq

Abstract: User demand for blocking advertising and tracking online is large and growing. Existing tools, both deployed and described in research, have proven useful, but lack either the completeness or robustness needed for a general solution. Existing detection approaches generally focus on only one aspect of advertising or tracking (e.g. URL patterns, code structure), making existing approaches susceptibl… ▽ More User demand for blocking advertising and tracking online is large and growing. Existing tools, both deployed and described in research, have proven useful, but lack either the completeness or robustness needed for a general solution. Existing detection approaches generally focus on only one aspect of advertising or tracking (e.g. URL patterns, code structure), making existing approaches susceptible to evasion. In this work we present AdGraph, a novel graph-based machine learning approach for detecting advertising and tracking resources on the web. AdGraph differs from existing approaches by building a graph representation of the HTML structure, network requests, and JavaScript behavior of a webpage, and using this unique representation to train a classifier for identifying advertising and tracking resources. Because AdGraph considers many aspects of the context a network request takes place in, it is less susceptible to the single-factor evasion techniques that flummox existing approaches. We evaluate AdGraph on the Alexa top-10K websites, and find that it is highly accurate, able to replicate the labels of human-generated filter lists with 95.33% accuracy, and can even identify many mistakes in filter lists. We implement AdGraph as a modification to Chromium. AdGraph adds only minor overhead to page loading and execution, and is actually faster than stock Chromium on 42% of websites and AdBlock Plus on 78% of websites. Overall, we conclude that AdGraph is both accurate enough and performant enough for online use, breaking comparable or fewer websites than popular filter list based approaches. △ Less

Submitted 30 May, 2019; v1 submitted 21 May, 2018; originally announced May 2018.

Comments: To appear in the Proceedings of the IEEE Symposium on Security & Privacy, May 2020

arXiv:1707.00190 [pdf, other]

Measuring, Characterizing, and Detecting Facebook Like Farms

Authors: Muhammad Ikram, Lucky Onwuzurike, Shehroze Farooqi, Emiliano De Cristofaro, Arik Friedman, Guillaume Jourjon, Dali Kaafar, M. Zubair Shafiq

Abstract: Social networks offer convenient ways to seamlessly reach out to large audiences. In particular, Facebook pages are increasingly used by businesses, brands, and organizations to connect with multitudes of users worldwide. As the number of likes of a page has become a de-facto measure of its popularity and profitability, an underground market of services artificially inflating page likes, aka like… ▽ More Social networks offer convenient ways to seamlessly reach out to large audiences. In particular, Facebook pages are increasingly used by businesses, brands, and organizations to connect with multitudes of users worldwide. As the number of likes of a page has become a de-facto measure of its popularity and profitability, an underground market of services artificially inflating page likes, aka like farms, has emerged alongside Facebook's official targeted advertising platform. Nonetheless, there is little work that systematically analyzes Facebook pages' promotion methods. Aiming to fill this gap, we present a honeypot-based comparative measurement study of page likes garnered via Facebook advertising and from popular like farms. First, we analyze likes based on demographic, temporal, and social characteristics, and find that some farms seem to be operated by bots and do not really try to hide the nature of their operations, while others follow a stealthier approach, mimicking regular users' behavior. Next, we look at fraud detection algorithms currently deployed by Facebook and show that they do not work well to detect stealthy farms which spread likes over longer timespans and like popular pages to mimic regular users. To overcome their limitations, we investigate the feasibility of timeline-based detection of like farm accounts, focusing on characterizing content generated by Facebook accounts on their timelines as an indicator of genuine versus fake social activity. We analyze a range of features, grouped into two main categories: lexical and non-lexical. We find that like farm accounts tend to re-share content, use fewer words and poorer vocabulary, and more often generate duplicate comments and likes compared to normal users. Using relevant lexical and non-lexical features, we build a classifier to detect like farms accounts that achieves precision higher than 99% and 93% recall. △ Less

Submitted 4 July, 2017; v1 submitted 1 July, 2017; originally announced July 2017.

Comments: To appear in ACM Transactions on Privacy and Security (TOPS)

arXiv:1605.05841 [pdf, other]

doi 10.1145/1235

A First Look at Ad-block Detection: A New Arms Race on the Web

Authors: Muhammad Haris Mughees, Zhiyun Qian, Zubair Shafiq, Karishma Dash, Pan Hui

Abstract: The rise of ad-blockers is viewed as an economic threat by online publishers, especially those who primarily rely on ad- vertising to support their services. To address this threat, publishers have started retaliating by employing ad-block detectors, which scout for ad-blocker users and react to them by restricting their content access and pushing them to whitelist the website or disabling ad-bloc… ▽ More The rise of ad-blockers is viewed as an economic threat by online publishers, especially those who primarily rely on ad- vertising to support their services. To address this threat, publishers have started retaliating by employing ad-block detectors, which scout for ad-blocker users and react to them by restricting their content access and pushing them to whitelist the website or disabling ad-blockers altogether. The clash between ad-blockers and ad-block detectors has resulted in a new arms race on the web. In this paper, we present the first systematic measurement and analysis of ad-block detection on the web. We have designed and implemented a machine learning based tech- nique to automatically detect ad-block detection, and use it to study the deployment of ad-block detectors on Alexa top- 100K websites. The approach is promising with precision of 94.8% and recall of 93.1%. We characterize the spectrum of different strategies used by websites for ad-block detection. We find that most of publishers use fairly simple passive ap- proaches for ad-block detection. However, we also note that a few websites use third-party services, e.g. PageFair, for ad-block detection and response. The third-party services use active deception and other sophisticated tactics to de- tect ad-blockers. We also find that the third-party services can successfully circumvent ad-blockers and display ads on publisher websites. △ Less

Submitted 19 May, 2016; originally announced May 2016.

Comments: 12 pages, 12 figures

arXiv:1506.00506 [pdf, other]

Combating Fraud in Online Social Networks: Detecting Stealthy Facebook Like Farms

Authors: Muhammad Ikram, Lucky Onwuzurike, Shehroze Farooqi, Emiliano De Cristofaro, Arik Friedman, Guillaume Jourjon, Mohammad Ali Kaafar, M. Zubair Shafiq

Abstract: As businesses increasingly rely on social networking sites to engage with their customers, it is crucial to understand and counter reputation manipulation activities, including fraudulently boosting the number of Facebook page likes using like farms. To this end, several fraud detection algorithms have been proposed and some deployed by Facebook that use graph co-clustering to distinguish between… ▽ More As businesses increasingly rely on social networking sites to engage with their customers, it is crucial to understand and counter reputation manipulation activities, including fraudulently boosting the number of Facebook page likes using like farms. To this end, several fraud detection algorithms have been proposed and some deployed by Facebook that use graph co-clustering to distinguish between genuine likes and those generated by farm-controlled profiles. However, as we show in this paper, these tools do not work well with stealthy farms whose users spread likes over longer timespans and like popular pages, aiming to mimic regular users. We present an empirical analysis of the graph-based detection tools used by Facebook and highlight their shortcomings against more sophisticated farms. Next, we focus on characterizing content generated by social networks accounts on their timelines, as an indicator of genuine versus fake social activity. We analyze a wide range of features extracted from timeline posts, which we group into two main classes: lexical and non-lexical. We postulate and verify that like farm accounts tend to often re-share content, use fewer words and poorer vocabulary, and more often generate duplicate comments and likes compared to normal users. We extract relevant lexical and non-lexical features and and use them to build a classifier to detect like farms accounts, achieving significantly higher accuracy, namely, at least 99% precision and 93% recall. △ Less

Submitted 9 May, 2016; v1 submitted 1 June, 2015; originally announced June 2015.

arXiv:1505.01637 [pdf, other]

Characterizing Key Stakeholders in an Online Black-Hat Marketplace

Authors: Shehroze Farooqi, Muhammad Ikram, Emiliano De Cristofaro, Arik Friedman, Guillaume Jourjon, Mohamed Ali Kaafar, M. Zubair Shafiq, Fareed Zaffar

Abstract: Over the past few years, many black-hat marketplaces have emerged that facilitate access to reputation manipulation services such as fake Facebook likes, fraudulent search engine optimization (SEO), or bogus Amazon reviews. In order to deploy effective technical and legal countermeasures, it is important to understand how these black-hat marketplaces operate, shedding light on the services they of… ▽ More Over the past few years, many black-hat marketplaces have emerged that facilitate access to reputation manipulation services such as fake Facebook likes, fraudulent search engine optimization (SEO), or bogus Amazon reviews. In order to deploy effective technical and legal countermeasures, it is important to understand how these black-hat marketplaces operate, shedding light on the services they offer, who is selling, who is buying, what are they buying, who is more successful, why are they successful, etc. Toward this goal, in this paper, we present a detailed micro-economic analysis of a popular online black-hat marketplace, namely, SEOClerks.com. As the site provides non-anonymized transaction information, we set to analyze selling and buying behavior of individual users, propose a strategy to identify key users, and study their tactics as compared to other (non-key) users. We find that key users: (1) are mostly located in Asian countries, (2) are focused more on selling black-hat SEO services, (3) tend to list more lower priced services, and (4) sometimes buy services from other sellers and then sell at higher prices. Finally, we discuss the implications of our analysis with respect to devising effective economic and legal intervention strategies against marketplace operators and key users. △ Less

Submitted 4 April, 2017; v1 submitted 7 May, 2015; originally announced May 2015.

Comments: 12th IEEE/APWG Symposium on Electronic Crime Research (eCrime 2017)

arXiv:1409.2097 [pdf, other]

Paying for Likes? Understanding Facebook Like Fraud Using Honeypots

Authors: Emiliano De Cristofaro, Arik Friedman, Guillaume Jourjon, Mohamed Ali Kaafar, M. Zubair Shafiq

Abstract: Facebook pages offer an easy way to reach out to a very large audience as they can easily be promoted using Facebook's advertising platform. Recently, the number of likes of a Facebook page has become a measure of its popularity and profitability, and an underground market of services boosting page likes, aka like farms, has emerged. Some reports have suggested that like farms use a network of pro… ▽ More Facebook pages offer an easy way to reach out to a very large audience as they can easily be promoted using Facebook's advertising platform. Recently, the number of likes of a Facebook page has become a measure of its popularity and profitability, and an underground market of services boosting page likes, aka like farms, has emerged. Some reports have suggested that like farms use a network of profiles that also like other pages to elude fraud protection algorithms, however, to the best of our knowledge, there has been no systematic analysis of Facebook pages' promotion methods. This paper presents a comparative measurement study of page likes garnered via Facebook ads and by a few like farms. We deploy a set of honeypot pages, promote them using both methods, and analyze garnered likes based on likers' demographic, temporal, and social characteristics. We highlight a few interesting findings, including that some farms seem to be operated by bots and do not really try to hide the nature of their operations, while others follow a stealthier approach, mimicking regular users' behavior. △ Less

Submitted 4 October, 2014; v1 submitted 7 September, 2014; originally announced September 2014.

Comments: To appear in IMC 2014

arXiv:1302.2376 [pdf, other]

Modeling Morphology of Social Network Cascades

Authors: M. Zubair Shafiq, Alex X. Liu

Abstract: Cascades represent an important phenomenon across various disciplines such as sociology, economy, psychology, political science, marketing, and epidemiology. An important property of cascades is their morphology, which encompasses the structure, shape, and size. However, cascade morphology has not been rigorously characterized and modeled in prior literature. In this paper, we propose a Multi-orde… ▽ More Cascades represent an important phenomenon across various disciplines such as sociology, economy, psychology, political science, marketing, and epidemiology. An important property of cascades is their morphology, which encompasses the structure, shape, and size. However, cascade morphology has not been rigorously characterized and modeled in prior literature. In this paper, we propose a Multi-order Markov Model for the Morphology of Cascades ($M^4C$) that can represent and quantitatively characterize the morphology of cascades with arbitrary structures, shapes, and sizes. $M^4C$ can be used in a variety of applications to classify different types of cascades. To demonstrate this, we apply it to an unexplored but important problem in online social networks -- cascade size prediction. Our evaluations using real-world Twitter data show that $M^4C$ based cascade size prediction scheme outperforms the baseline scheme based on cascade graph features such as edge growth rate, degree distribution, clustering, and diameter. $M^4C$ based cascade size prediction scheme consistently achieves more than 90% classification accuracy under different experimental scenarios. △ Less

Submitted 10 February, 2013; originally announced February 2013.

Comments: 12 pages, technical report

Showing 1–43 of 43 results for author: Shafiq, Z