-
Scalable ATLAS pMSSM computational workflows using containerised REANA reusable analysis platform
Authors:
Marco Donadoni,
Matthew Feickert,
Lukas Heinrich,
Yang Liu,
Audrius Mečionis,
Vladyslav Moisieienkov,
Tibor Šimko,
Giordon Stark,
Marco Vidal García
Abstract:
In this paper we describe the development of a streamlined framework for large-scale ATLAS pMSSM reinterpretations of LHC Run-2 analyses using containerised computational workflows. The project is looking to assess the global coverage of BSM physics and requires running O(5k) computational workflows representing pMSSM model points. Following ATLAS Analysis Preservation policies, many analyses have been preserved as containerised Yadage workflows, and after validation were added to a curated selection for the pMSSM study. To run the workflows at scale, we utilised the REANA reusable analysis platform. We describe how the REANA platform was enhanced to ensure the best concurrent throughput by internal service scheduling changes. We discuss the scalability of the approach on Kubernetes clusters from 500 to 5000 cores. Finally, we demonstrate a possibility of using additional ad-hoc public cloud infrastructure resources by running the same workflows on the Google Cloud Platform.
Submitted 6 March, 2024;
originally announced March 2024.
-
Evaluating Bias and Noise Induced by the U.S. Census Bureau's Privacy Protection Methods
Authors:
Christopher T. Kenny,
Cory McCartan,
Shiro Kuriwaki,
Tyler Simko,
Kosuke Imai
Abstract:
The United States Census Bureau faces a difficult trade-off between the accuracy of Census statistics and the protection of individual information. We conduct the first independent evaluation of bias and noise induced by the Bureau's two main disclosure avoidance systems: the TopDown algorithm employed for the 2020 Census and the swapping algorithm implemented for the three previous Censuses. Our evaluation leverages the Noisy Measurement File (NMF) as well as two independent runs of the TopDown algorithm applied to the 2010 decennial Census. We find that the NMF contains too much noise to be directly useful, especially for Hispanic and multiracial populations. TopDown's post-processing dramatically reduces the NMF noise and produces data whose accuracy is similar to that of swapping. While the estimated errors for both TopDown and swapping algorithms are generally no greater than other sources of Census error, they can be relatively substantial for geographies with small total populations.
Submitted 10 February, 2024; v1 submitted 12 June, 2023;
originally announced June 2023.
-
Making Differential Privacy Work for Census Data Users
Authors:
Cory McCartan,
Tyler Simko,
Kosuke Imai
Abstract:
The U.S. Census Bureau collects and publishes detailed demographic data about Americans which are heavily used by researchers and policymakers. The Bureau has recently adopted the framework of differential privacy in an effort to improve confidentiality of individual census responses. A key output of this privacy protection system is the Noisy Measurement File (NMF), which is produced by adding random noise to tabulated statistics. The NMF is critical to understanding any errors introduced in the data, and performing valid statistical inference on published census data. Unfortunately, the current release format of the NMF is difficult to access and work with. We describe the process we use to transform the NMF into a usable format, and provide recommendations to the Bureau for how to release future versions of the NMF. These changes are essential for ensuring transparency of privacy measures and reproducibility of scientific research built on census data.
Submitted 7 October, 2023; v1 submitted 11 May, 2023;
originally announced May 2023.
-
Comment: The Essential Role of Policy Evaluation for the 2020 Census Disclosure Avoidance System
Authors:
Christopher T. Kenny,
Shiro Kuriwaki,
Cory McCartan,
Evan T. R. Rosenman,
Tyler Simko,
Kosuke Imai
Abstract:
In "Differential Perspectives: Epistemic Disconnects Surrounding the US Census Bureau's Use of Differential Privacy," boyd and Sarathy argue that empirical evaluations of the Census Disclosure Avoidance System (DAS), including our published analysis, failed to recognize how the benchmark data against which the 2020 DAS was evaluated is never a ground truth of population counts. In this commentary, we explain why policy evaluation, which was the main goal of our analysis, is still meaningful without access to a perfect ground truth. We also point out that our evaluation leveraged features specific to the decennial Census and redistricting data, such as block-level population invariance under swapping and voter file racial identification, better approximating a comparison with the ground truth. Lastly, we show that accurate statistical predictions of individual race based on the Bayesian Improved Surname Geocoding, while not a violation of differential privacy, substantially increase the disclosure risk of private information the Census Bureau sought to protect. We conclude by arguing that policy makers must confront a key trade-off between data utility and privacy protection, and an epistemic disconnect alone is insufficient to explain disagreements between policy choices.
Submitted 15 October, 2022;
originally announced October 2022.
-
Simulated redistricting plans for the analysis and evaluation of redistricting in the United States
Authors:
Cory McCartan,
Christopher T. Kenny,
Tyler Simko,
George Garcia III,
Kevin Wang,
Melissa Wu,
Shiro Kuriwaki,
Kosuke Imai
Abstract:
This article introduces the 50stateSimulations, a collection of simulated congressional districting plans and underlying code developed by the Algorithm-Assisted Redistricting Methodology (ALARM) Project. The 50stateSimulations allow for the evaluation of enacted and other congressional redistricting plans in the United States. While the use of redistricting simulation algorithms has become standard in academic research and court cases, any simulation analysis requires non-trivial efforts to combine multiple data sets, identify state-specific redistricting criteria, implement complex simulation algorithms, and summarize and visualize simulation outputs. We have developed a complete workflow that facilitates this entire process of simulation-based redistricting analysis for the congressional districts of all 50 states. The resulting 50stateSimulations include ensembles of simulated 2020 congressional redistricting plans and necessary replication data. We also provide the underlying code, which serves as a template for customized analyses. All data and code are free and publicly available. This article details the design, creation, and validation of the data.
Submitted 20 October, 2022; v1 submitted 21 June, 2022;
originally announced June 2022.
-
The Impact of the U.S. Census Disclosure Avoidance System on Redistricting and Voting Rights Analysis
Authors:
Christopher T. Kenny,
Shiro Kuriwaki,
Cory McCartan,
Evan Rosenman,
Tyler Simko,
Kosuke Imai
Abstract:
The US Census Bureau plans to protect the privacy of 2020 Census respondents through its Disclosure Avoidance System (DAS), which attempts to achieve differential privacy guarantees by adding noise to the Census microdata. By applying redistricting simulation and analysis methods to DAS-protected 2010 Census data, we find that the protected data are not of sufficient quality for redistricting purposes. We demonstrate that the injected noise makes it impossible for states to accurately comply with the One Person, One Vote principle. Our analysis finds that the DAS-protected data are biased against certain areas, depending on voter turnout and partisan and racial composition, and that these biases lead to large and unpredictable errors in the analysis of partisan and racial gerrymanders. Finally, we show that the DAS algorithm does not universally protect respondent privacy. Based on the names and addresses of registered voters, we are able to predict their race as accurately using the DAS-protected data as when using the 2010 Census data. Despite this, the DAS-protected data can still inaccurately estimate the number of majority-minority districts. We conclude with recommendations for how the Census Bureau should proceed with privacy protection for the 2020 Census.
Submitted 20 August, 2021; v1 submitted 28 May, 2021;
originally announced May 2021.
-
Use of Solr and Xapian in the Invenio document repository software
Authors:
Patrick O. Glauner,
Jan Iwaszkiewicz,
Jean-Yves Le Meur,
Tibor Simko
Abstract:
Invenio is a free comprehensive web-based document repository and digital library software suite originally developed at CERN. It can serve a variety of use cases from an institutional repository or digital library to a web journal. In order to fully use full-text documents for efficient search and ranking, Solr was integrated into Invenio through a generic bridge. Solr indexes extracted full-texts and most relevant metadata. Consequently, Invenio takes advantage of Solr's efficient search and word similarity ranking capabilities. In this paper, we first give an overview of Invenio, its capabilities and features. We then present our open source Solr integration as well as scalability challenges that arose for an Invenio-based multi-million record repository: the CERN Document Server. We also compare our Solr adapter to an alternative Xapian adapter using the same generic bridge. Both integrations are distributed with the Invenio package and ready to be used by the institutions using or adopting Invenio.
Submitted 1 October, 2013;
originally announced October 2013.