Search | arXiv e-print repository

Showing 1–50 of 99 results for author: Nelson, M L

  1. arXiv:2406.05933  [pdf, other]

    cs.CR

    A Relevance Model for Threat-Centric Ranking of Cybersecurity Vulnerabilities

    Authors: Corren McCoy, Ross Gore, Michael L. Nelson, Michele C. Weigle

    Abstract: The relentless process of tracking and remediating vulnerabilities is a top concern for cybersecurity professionals. The key challenge is trying to identify a remediation scheme specific to in-house, organizational objectives. Without a strategy, the result is a patchwork of fixes applied to a tide of vulnerabilities, any one of which could be the point of failure in an otherwise formidable defens…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: 24 pages, 8 figures, 14 tables

    ACM Class: K.6.5

  2. arXiv:2401.04887  [pdf, other]

    cs.DL

    Cited But Not Archived: Analyzing the Status of Code References in Scholarly Articles

    Authors: Emily Escamilla, Martin Klein, Talya Cooper, Vicky Rampin, Michele C. Weigle, Michael L. Nelson

    Abstract: One in five arXiv articles published in 2021 contained a URI to a Git Hosting Platform (GHP), which demonstrates the growing prevalence of GHP URIs in scholarly publications. However, GHP URIs are vulnerable to the same reference rot that plagues the Web at large. The disappearance of software hosting platforms, like Gitorious and Google Code, and the source code they contain threatens research re…

    Submitted 9 January, 2024; originally announced January 2024.

  3. arXiv:2307.14469  [pdf, other]

    cs.DL

    It's Not Just GitHub: Identifying Data and Software Sources Included in Publications

    Authors: Emily Escamilla, Lamia Salsabil, Martin Klein, Jian Wu, Michele C. Weigle, Michael L. Nelson

    Abstract: Paper publications are no longer the only form of research product. Due to recent initiatives by publication venues and funding institutions, open access datasets and software products are increasingly considered research products and URIs to these products are growing more prevalent in scholarly publications. However, as with all URIs, resources found on the live Web are not permanent. Archivists…

    Submitted 26 July, 2023; originally announced July 2023.

    Comments: 13 pages, 7 figures, pre-print of publication for Theory and Practice of Digital Libraries 2023

  4. arXiv:2306.08236  [pdf, other]

    cs.IR

    Extracting Information from Twitter Screenshots

    Authors: Tarannum Zaki, Michael L. Nelson, Michele C. Weigle

    Abstract: Screenshots are prevalent on social media as a common approach for information sharing. Users rarely verify before sharing a screenshot whether the post it contains is fake or real. Information sharing through fake screenshots can be highly responsible for misinformation and disinformation spread on social media. Our ultimate goal is to develop a tool that could take a screenshot of a tweet and pr…

    Submitted 14 June, 2023; originally announced June 2023.

  5. Coronal Heating as Determined by the Solar Flare Frequency Distribution Obtained by Aggregating Case Studies

    Authors: James Paul Mason, Alexandra Werth, Colin G. West, Allison A. Youngblood, Donald L. Woodraska, Courtney Peck, Kevin Lacjak, Florian G. Frick, Moutamen Gabir, Reema A. Alsinan, Thomas Jacobsen, Mohammad Alrubaie, Kayla M. Chizmar, Benjamin P. Lau, Lizbeth Montoya Dominguez, David Price, Dylan R. Butler, Connor J. Biron, Nikita Feoktistov, Kai Dewey, N. E. Loomis, Michal Bodzianowski, Connor Kuybus, Henry Dietrick, Aubrey M. Wolfe, et al. (977 additional authors not shown)

    Abstract: Flare frequency distributions represent a key approach to addressing one of the largest problems in solar and stellar physics: determining the mechanism that counter-intuitively heats coronae to temperatures that are orders of magnitude hotter than the corresponding photospheres. It is widely accepted that the magnetic field is responsible for the heating, but there are two competing mechanisms th…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: 1,002 authors, 14 pages, 4 figures, 3 tables, published by The Astrophysical Journal on 2023-05-09, volume 948, page 71

  6. arXiv:2305.01071  [pdf, other]

    cs.DL

    Right HTML, Wrong JSON: Challenges in Replaying Archived Webpages Built with Client-Side Rendering

    Authors: Michele C. Weigle, Michael L. Nelson, Sawood Alam, Mark Graham

    Abstract: Many web sites are transitioning how they construct their pages. The conventional model is where the content is embedded server-side in the HTML and returned to the client in an HTTP response. Increasingly, sites are moving to a model where the initial HTTP response contains only an HTML skeleton plus JavaScript that makes API calls to a variety of servers for the content (typically in JSON format…

    Submitted 1 May, 2023; originally announced May 2023.

    Comments: 20 pages, preprint version of paper accepted at the 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL)

  7. arXiv:2305.00546  [pdf, other]

    cs.IR cs.DL

    Making Changes in Webpages Discoverable: A Change-Text Search Interface for Web Archives

    Authors: Lesley Frew, Michael L. Nelson, Michele C. Weigle

    Abstract: Webpages change over time, and web archives hold copies of historical versions of webpages. Users of web archives, such as journalists, want to find and view changes on webpages over time. However, the current search interfaces for web archives do not support this task. For the web archives that include a full-text search feature, multiple versions of the same webpage that match the search query a…

    Submitted 30 April, 2023; originally announced May 2023.

    Comments: In Proceedings of JCDL 2023; 20 pages, 11 figures, 2 tables

    ACM Class: H.3.3; H.3.7

  8. arXiv:2212.05322  [pdf]

    cs.SI cs.CR

    Twitter DM Videos Are Accessible to Unauthenticated Users

    Authors: Michael L. Nelson

    Abstract: Videos shared in Twitter Direct Messages (DMs) have opaque URLs based on hashes of their content, but are otherwise available to unauthenticated HTTP users. These DM video URLs are thus hard to guess, but if they were somehow discovered, they are available to any user, including users without Twitter credentials (i.e., twitter.com specific HTTP Cookie or Authorization request headers). This includ…

    Submitted 22 December, 2022; v1 submitted 10 December, 2022; originally announced December 2022.

    Comments: 22 pages, 7 figures, v2 adds "available this way since 2016" and "http/https" discussion

    ACM Class: H.3.5

  9. arXiv:2212.00760  [pdf, other]

    cs.NI cs.DL

    Caching HTTP 404 Responses Eliminates Unnecessary Archival Replay Requests

    Authors: Kritika Garg, Himarsha R. Jayanetti, Sawood Alam, Michele C. Weigle, Michael L. Nelson

    Abstract: Upon replay, JavaScript on archived web pages can generate recurring HTTP requests that lead to unnecessary traffic to the web archive. In one example, an archived page averaged more than 1000 requests per minute. These requests are not visible to the user, so if a user leaves such an archived page open in a browser tab, they would be unaware that their browser is continuing to generate traffic to…

    Submitted 1 December, 2022; originally announced December 2022.

  10. arXiv:2211.09681  [pdf, other]

    cs.IR

    Did They Really Tweet That? Querying Fact-Checking Sites and Politwoops to Determine Tweet Misattribution

    Authors: Caleb Bradford, Michael L. Nelson

    Abstract: Screenshots of social media posts have become commonplace on social media sites. While screenshots definitely serve a purpose, their ubiquity enables the spread of fabricated screenshots of posts that were never actually made, thereby proliferating misattribution disinformation. With the motivation of detecting this type of disinformation, we researched developing methods of querying the Web for…

    Submitted 17 November, 2022; originally announced November 2022.

    Comments: 20 pages

  11. arXiv:2211.02188  [pdf, other]

    cs.DL

    Web Archiving as Entertainment

    Authors: Travis Reid, Michael L. Nelson, Michele C. Weigle

    Abstract: We want to make web archiving entertaining so that it can be enjoyed like a spectator sport. To this end, we have been working on a proof of concept that involves gamification of the web archiving process and integrating video games and web archiving. Our vision for this proof of concept involves a web archiving live stream and a gaming live stream. We are creating web archiving live streams that…

    Submitted 3 November, 2022; originally announced November 2022.

    Comments: This is an extended version of a paper from ICADL 2022. 20 pages and 10 figures

  12. arXiv:2209.08649  [pdf, other]

    cs.DL

    Creating Structure in Web Archives With Collections: Different Concepts From Web Archivists

    Authors: Himarsha R. Jayanetti, Shawn M. Jones, Martin Klein, Alex Osbourne, Paul Koerbin, Michael L. Nelson, Michele C. Weigle

    Abstract: As web archives' holdings grow, archivists subdivide them into collections so they are easier to understand and manage. In this work, we review the collection structures of eight web archive platforms: Archive-It, Conifer, the Croatian Web Archive (HAW), the Internet Archive's user account web archives, Library of Congress (LC), PANDORA, Trove, and the UK Web Archive (UKWA). We note a plethora o…

    Submitted 18 September, 2022; originally announced September 2022.

    Comments: 5 figures, 16 pages, accepted for publication at TPDL 2022

  13. Robots Still Outnumber Humans in Web Archives, But Less Than Before

    Authors: Himarsha R. Jayanetti, Kritika Garg, Sawood Alam, Michael L. Nelson, Michele C. Weigle

    Abstract: To identify robots and humans and analyze their respective access patterns, we used the Internet Archive's (IA) Wayback Machine access logs from 2012 and 2019, as well as Arquivo.pt's (Portuguese Web Archive) access logs from 2019. We identified user sessions in the access logs and classified those sessions as human or robot based on their browsing behavior. To better understand how users navigate…

    Submitted 26 August, 2022; originally announced August 2022.

  14. arXiv:2208.04895  [pdf, other]

    cs.DL

    The Rise of GitHub in Scholarly Publications

    Authors: Emily Escamilla, Martin Klein, Talya Cooper, Vicky Rampin, Michele C. Weigle, Michael L. Nelson

    Abstract: The definition of scholarly content has expanded to include the data and source code that contribute to a publication. While major archiving efforts to preserve conventional scholarly content, typically in PDFs (e.g., LOCKSS, CLOCKSS, Portico), are underway, no analogous effort has yet emerged to preserve the data and code referenced in those PDFs, particularly the scholarly code hosted online on…

    Submitted 9 August, 2022; originally announced August 2022.

    Comments: 4 figures, 15 pages, accepted for publication at TPDL 2022

  15. arXiv:2108.12092  [pdf, other]

    cs.DL

    Replaying Archived Twitter: When your bird is broken, will it bring you down?

    Authors: Kritika Garg, Himarsha R. Jayanetti, Sawood Alam, Michele C. Weigle, Michael L. Nelson

    Abstract: Historians and researchers trust web archives to preserve social media content that no longer exists on the live web. However, what we see on the live web and how it is replayed in the archive are not always the same. In this paper, we document and analyze the problems in archiving Twitter ever since Twitter forced the use of its new UI in June 2020. Most web archives were unable to archive the ne…

    Submitted 26 August, 2021; originally announced August 2021.

  16. arXiv:2108.05939  [pdf, other]

    cs.DL

    Where Did the Web Archive Go?

    Authors: Mohamed Aturban, Michael L. Nelson, Michele C. Weigle

    Abstract: To perform a longitudinal investigation of web archives and detect variations and changes in replaying individual archived pages, or mementos, we created a sample of 16,627 mementos from 17 public web archives. Over the course of our 14-month study (November 2017 - January 2019), we found that four web archives changed their base URIs and did not leave a machine-readable method of locating their…

    Submitted 12 August, 2021; originally announced August 2021.

    Comments: 18 pages

  17. arXiv:2108.03311  [pdf, other]

    cs.DL cs.IR

    Profiling Web Archival Voids for Memento Routing

    Authors: Sawood Alam, Michele C. Weigle, Michael L. Nelson

    Abstract: Prior work on web archive profiling was focused on Archival Holdings to describe what is present in an archive. This work defines and explores Archival Voids to establish a means to represent portions of URI spaces that are not present in a web archive. Archival Holdings and Archival Voids profiles can work independently or as complements to each other to maximize the Accuracy of Memento Aggregat…

    Submitted 6 August, 2021; originally announced August 2021.

    Comments: Accepted in JCDL 2021 (10 pages, 7 figures, 7 tables)

  18. arXiv:2107.02680  [pdf, other]

    cs.DL

    Garbage, Glitter, or Gold: Assigning Multi-dimensional Quality Scores to Social Media Seeds for Web Archive Collections

    Authors: Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson

    Abstract: From popular uprisings to pandemics, the Web is an essential source consulted by scientists and historians for reconstructing and studying past events. Unfortunately, the Web is plagued by reference rot which causes important Web resources to disappear. Web archive collections help reduce the costly effects of reference rot by saving Web resources that chronicle important stories/events before the…

    Submitted 6 July, 2021; originally announced July 2021.

    Comments: This is an extended version of the ACM/IEEE Joint Conference on Digital Libraries (JCDL2021) paper

  19. arXiv:2104.14041  [pdf, other]

    cs.DL

    What Did It Look Like: A service for creating website timelapses using the Memento framework

    Authors: Dhruv Patel, Alexander C. Nwala, Michael L. Nelson, Michele C. Weigle

    Abstract: Popular web pages are archived frequently, which makes it difficult to visualize the progression of the site through the years at web archives. The What Did It Look Like (WDILL) Twitter bot shows web page transitions by creating a timelapse of a given website using one archived copy from each calendar year. Originally implemented in 2015, we recently added new features to WDILL, such as date range…

    Submitted 28 April, 2021; originally announced April 2021.

    Comments: 11 pages

  20. It's All About The Cards: Sharing on Social Media Probably Encouraged HTML Metadata Growth

    Authors: Shawn M. Jones, Valentina Neblitt-Jones, Michele C. Weigle, Martin Klein, Michael L. Nelson

    Abstract: In a perfect world, all articles consistently contain sufficient metadata to describe the resource. We know this is not the reality, so we are motivated to investigate the evolution of the metadata that is present when authors and publishers supply their own. Because applying metadata takes time, we recognize that each news article author has a limited metadata budget with which to spend their tim…

    Submitted 8 April, 2021; originally announced April 2021.

    Comments: 10 pages, 10 figures, 3 tables

  21. Automatically Selecting Striking Images for Social Cards

    Authors: Shawn M. Jones, Michele C. Weigle, Martin Klein, Michael L. Nelson

    Abstract: To allow previewing a web page, social media platforms have developed social cards: visualizations consisting of vital information about the underlying resource. At a minimum, social cards often include features such as the web resource's title, text summary, striking image, and domain name. News and scholarly articles on the web are frequently subject to social card creation when being shared on…

    Submitted 8 March, 2021; originally announced March 2021.

    Comments: 10 pages, 5 figures, 10 tables

  22. Modeling Updates of Scholarly Webpages Using Archived Data

    Authors: Yasith Jayawardana, Alexander C. Nwala, Gavindya Jayawardena, Jian Wu, Sampath Jayarathna, Michael L. Nelson, C. Lee Giles

    Abstract: The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose an approach for modeling the dynamics of change in the web using archived copies of webpages. To evaluate its utility, we conduct a preliminary study on the sch…

    Submitted 6 December, 2020; originally announced December 2020.

    Comments: 12 pages, 2 appendix pages, 18 figures, to be published in Proceedings of IEEE Big Data 2020 - 5th Computational Archival Science (CAS) Workshop

  23. arXiv:2008.11680  [pdf, other]

    cs.DL

    A 25 Year Retrospective on D-Lib Magazine

    Authors: Michael L. Nelson, Herbert Van de Sompel

    Abstract: In July 1995, the first issue of D-Lib Magazine was published as an on-line, HTML-only, open access magazine, serving as the focal point for the then emerging digital library research community. In 2017 it ceased publication, in part due to the maturity of the community it served as well as the increasing availability of and competition from eprints, institutional repositories, conferences, social…

    Submitted 27 August, 2020; v1 submitted 26 August, 2020; originally announced August 2020.

    Comments: 44 pages, 29 figures. Minor fixes

    ACM Class: H.3.7

  24. arXiv:2008.00139  [pdf, other]

    cs.DL cs.HC cs.IR

    SHARI -- An Integration of Tools to Visualize the Story of the Day

    Authors: Shawn M. Jones, Alexander C. Nwala, Martin Klein, Michele C. Weigle, Michael L. Nelson

    Abstract: Tools such as Google News and Flipboard exist to convey daily news, but what about the past? In this paper, we describe how to combine several existing tools with web archive holdings to perform news analysis and visualization of the "biggest story" for a given date. StoryGraph clusters news articles together to identify a common news story. Hypercane leverages ArchiveNow to store URLs produced by…

    Submitted 31 July, 2020; originally announced August 2020.

    Comments: 19 pages, 16 figures, 1 Table

    ACM Class: H.3.7; H.3.6; H.3.4

    Journal ref: Presented at the Web Archiving and Digital Libraries 2020 Workshop

  25. arXiv:2008.00137  [pdf, other]

    cs.DL cs.HC cs.IR

    MementoEmbed and Raintale for Web Archive Storytelling

    Authors: Shawn M. Jones, Martin Klein, Michele C. Weigle, Michael L. Nelson

    Abstract: For traditional library collections, archivists can select a representative sample from a collection and display it in a featured physical or digital library space. Web archive collections may consist of thousands of archived pages, or mementos. How should an archivist display this sample to drive visitors to their collection? Search engines and social media platforms often represent web pages as…

    Submitted 31 July, 2020; originally announced August 2020.

    Comments: 54 pages, 5 tables, 46 figures

    ACM Class: H.3.7; H.3.6; H.3.4

    Journal ref: Presented at the Web Archiving and Digital Libraries 2020 Workshop

  26. arXiv:2006.02487  [pdf, other]

    cs.DL

    Visualizing Webpage Changes Over Time

    Authors: Abigail Mabe, Dhruv Patel, Maheedhar Gunnam, Surbhi Shankar, Mat Kelly, Sawood Alam, Michael L. Nelson, Michele C. Weigle

    Abstract: We report on the development of TMVis, a web service to provide visualizations of how individual webpages have changed over time. We leverage past research on summarizing collections of webpages with thumbnail-sized screenshots and on choosing a small number of representative past archived webpages from a large collection. We offer four visualizations: image grid, image slider, timeline, and anima…

    Submitted 3 June, 2020; originally announced June 2020.

    Comments: 13 pages

  27. arXiv:2003.09989  [pdf, other]

    cs.IR cs.CL cs.SI

    365 Dots in 2019: Quantifying Attention of News Sources

    Authors: Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson

    Abstract: We investigate the overlap of topics of online news articles from a variety of sources. To do this, we provide a platform for studying the news by measuring this overlap and scoring news stories according to the degree of attention in near-real time. This can enable multiple studies, including identifying topics that receive the most attention from news organizations and identifying slow news days…

    Submitted 22 March, 2020; originally announced March 2020.

    Comments: This is an extended version of the paper accepted at Computation + Journalism Symposium 2020, which has been postponed because of COVID-19

  28. arXiv:1908.02819  [pdf, other]

    cs.DL cs.IR

    Making Recommendations from Web Archives for "Lost" Web Pages

    Authors: Lulwah M. Alkwai, Michael L. Nelson, Michele C. Weigle

    Abstract: When a user requests a web page from a web archive, the user will typically either get an HTTP 200 if the page is available, or an HTTP 404 if the web page has not been archived. This is because web archives are typically accessed by URI lookup, and the response is binary: the archive either has the page or it does not, and the user will not know of other archived web pages that exist and are pote…

    Submitted 7 August, 2019; originally announced August 2019.

    Comments: 12 pages

  29. arXiv:1906.07141  [pdf, other]

    cs.DL

    Impact of HTTP Cookie Violations in Web Archives

    Authors: Sawood Alam, Plinio Vargas, Michele C. Weigle, Michael L. Nelson

    Abstract: Certain HTTP Cookies on certain sites can be a source of content bias in archival crawls. Accommodating Cookies at crawl time, but not utilizing them at replay time may cause cookie violations, resulting in defaced composite mementos that never existed on the live web. To address these issues, we propose that crawlers store Cookies with short expiration time and archival replay systems account for…

    Submitted 17 June, 2019; originally announced June 2019.

    Comments: Presented at WADL 2019 (http://fox.cs.vt.edu/wadl2019.html). Slides: https://www.slideshare.net/ibnesayeed/impact-of-http-cookie-violations-in-web-archives

  30. arXiv:1906.07104  [pdf, other]

    cs.NI cs.CR cs.CY cs.DL

    Supporting Web Archiving via Web Packaging

    Authors: Sawood Alam, Michele C. Weigle, Michael L. Nelson, Martin Klein, Herbert Van de Sompel

    Abstract: We describe challenges related to web archiving, replaying archived web resources, and verifying their authenticity. We show that Web Packaging has significant potential to help address these challenges and identify areas in which changes are needed in order to fully realize that potential.

    Submitted 17 June, 2019; originally announced June 2019.

    Comments: This is a position paper accepted at the ESCAPE Workshop 2019. https://www.iab.org/activities/workshops/escape-workshop/

  31. arXiv:1905.12607  [pdf, other]

    cs.DL

    MementoMap Framework for Flexible and Adaptive Web Archive Profiling

    Authors: Sawood Alam, Michele C. Weigle, Michael L. Nelson, Fernando Melo, Daniel Bicho, Daniel Gomes

    Abstract: In this work we propose MementoMap, a flexible and adaptive framework to efficiently summarize holdings of a web archive. We described a simple, yet extensible, file format suitable for MementoMap. We used the complete index of the Arquivo.pt comprising 5B mementos (archived web pages/files) to understand the nature and shape of its holdings. We generated MementoMaps with varying amounts of detail…

    Submitted 29 May, 2019; originally announced May 2019.

    Comments: In Proceedings of JCDL 2019; 13 pages, 9 tables, 13 figures, 3 code samples, and 1 equation

  32. arXiv:1905.12565  [pdf, other]

    cs.DL

    Archive Assisted Archival Fixity Verification Framework

    Authors: Mohamed Aturban, Sawood Alam, Michael L. Nelson, Michele C. Weigle

    Abstract: The number of public and private web archives has increased, and we implicitly trust content delivered by these archives. Fixity is checked to ensure an archived resource has remained unaltered since the time it was captured. Some web archives do not allow users to access fixity information and, more importantly, even if fixity information is available, it is provided by the same archive from whic…

    Submitted 29 May, 2019; originally announced May 2019.

    Comments: 16 pages

  33. arXiv:1905.12220  [pdf, other]

    cs.DL cs.IR

    Using Micro-collections in Social Media to Generate Seeds for Web Archive Collections

    Authors: Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson

    Abstract: In a Web plagued by disappearing resources, Web archive collections provide a valuable means of preserving Web resources important to the study of past events ranging from elections to disease outbreaks. These archived collections start with seed URIs (Uniform Resource Identifiers) hand-selected by curators. Curators produce high quality seeds by removing non-relevant URIs and adding URIs from cre…

    Submitted 29 May, 2019; originally announced May 2019.

    Comments: This is an extended version of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2019) full paper. Some figures have been enlarged, and appendices of additional figures included

  34. arXiv:1905.11342  [pdf, other]

    cs.DL cs.HC cs.SI

    Social Cards Probably Provide For Better Understanding Of Web Archive Collections

    Authors: Shawn M. Jones, Michele C. Weigle, Michael L. Nelson

    Abstract: Used by a variety of researchers, web archive collections have become invaluable sources of evidence. If a researcher is presented with a web archive collection that they did not create, how do they know what is inside so that they can use it for their own research? Search engine results and social media links are represented as surrogates, small easily digestible summaries of the underlying page.…

    Submitted 29 May, 2019; v1 submitted 27 May, 2019; originally announced May 2019.

    Comments: 58 pages, 53 figures

    ACM Class: H.3.7; H.3.6; H.3.5; H.5.2

  35. arXiv:1905.03836  [pdf, other]

    cs.DL

    Collecting 16K archived web pages from 17 public web archives

    Authors: Mohamed Aturban, Michael L. Nelson, Michele C. Weigle, Martin Klein, Herbert Van de Sompel

    Abstract: We document the creation of a data set of 16,627 archived web pages, or mementos, of 3,698 unique live web URIs (Uniform Resource Identifiers) from 17 public web archives. We used four different methods to collect the dataset. First, we used the Los Alamos National Laboratory (LANL) Memento Aggregator to collect mementos of an initial set of URIs obtained from four sources: (a) the Moz Top 500, (b…

    Submitted 9 May, 2019; originally announced May 2019.

    Comments: 21 pages

  36. arXiv:1806.09082  [pdf, other]

    cs.DL cs.IR cs.SI

    Measuring News Similarity Across Ten U.S. News Sites

    Authors: Grant C. Atkins, Alexander Nwala, Michele C. Weigle, Michael L. Nelson

    Abstract: News websites make editorial decisions about what stories to include on their website homepages and what stories to emphasize (e.g., large font size for main story). The emphasized stories on a news website are often highly similar to many other news websites (e.g., a terrorist event story). The selective emphasis of a top news story and the similarity of news across different news organizations ar…

    Submitted 1 July, 2018; v1 submitted 24 June, 2018; originally announced June 2018.

    Comments: This is an extended version of the paper to appear in the proceedings of the 15th International Conference on Digital Preservation (iPres 2018)

  37. The Many Shapes of Archive-It

    Authors: Shawn M. Jones, Alexander Nwala, Michele C. Weigle, Michael L. Nelson

    Abstract: Web archives, a key area of digital preservation, meet the needs of journalists, social scientists, historians, and government organizations. The use cases for these groups often require that they guide the archiving process themselves, selecting their own original resources, or seeds, and creating their own web archive collections. We focus on the collections within Archive-It, a subscription ser…

    Submitted 18 June, 2018; originally announced June 2018.

    Comments: 10 pages, 12 figures, to appear in the proceedings of the 15th International Conference on Digital Preservation (iPres 2018)

    ACM Class: H.3.7; H.3.1

  38. The Off-Topic Memento Toolkit

    Authors: Shawn M. Jones, Michele C. Weigle, Michael L. Nelson

    Abstract: Web archive collections are created with a particular purpose in mind. A curator selects seeds, or original resources, which are then captured by an archiving system and stored as archived web pages, or mementos. The systems that build web archive collections are often configured to revisit the same original resource multiple times. This is incredibly useful for understanding an unfolding news sto…

    Submitted 17 September, 2018; v1 submitted 18 June, 2018; originally announced June 2018.

    Comments: 10 pages, 14 figures, to appear in the proceedings of the 15th International Conference on Digital Preservation (iPres 2018)

    ACM Class: H.3.7; H.3.6; H.3.4

  39. A Framework for Aggregating Private and Public Web Archives

    Authors: Mat Kelly, Michael L. Nelson, Michele C. Weigle

    Abstract: Personal and private Web archives are proliferating due to the increase in the tools to create them and the realization that Internet Archive and other public Web archives are unable to capture personalized (e.g., Facebook) and private (e.g., banking) Web pages. We introduce a framework to mitigate issues of aggregation in private, personal, and public Web archives without compromising potential s…

    Submitted 3 June, 2018; originally announced June 2018.

    Comments: Preprint version of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2018) full paper, accessible at the DOI

  40. Scraping SERPs for Archival Seeds: It Matters When You Start

    Authors: Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson

    Abstract: Event-based collections are often started with a web search, but the search results you find on Day 1 may not be the same as those you find on Day 7. In this paper, we consider collections that originate from extracting URIs (Uniform Resource Identifiers) from Search Engine Result Pages (SERPs). Specifically, we seek to provide insight about the retrievability of URIs of news stories found on Goog…

    Submitted 25 May, 2018; originally announced May 2018.

    Comments: This is an extended version of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2018) full paper: https://doi.org/10.1145/3197026.3197056. Some of the figure numbers have changed

  41. arXiv:1712.03140  [pdf, other]

    cs.DL

    Difficulties of Timestamping Archived Web Pages

    Authors: Mohamed Aturban, Michael L. Nelson, Michele C. Weigle

    Abstract: We show that state-of-the-art services for creating trusted timestamps in blockchain-based networks do not adequately allow for timestamping of web pages. They accept data by value (e.g., images and text), but not by reference (e.g., URIs of web pages). Also, we discuss difficulties in repeatedly generating the same cryptographic hash value of an archived web page. We then introduce several requir…

    Submitted 8 December, 2017; originally announced December 2017.

    Comments: 27 pages

  42. arXiv:1708.05790  [pdf, other]

    cs.DL cs.SI

    University Twitter Engagement: Using Twitter Followers to Rank Universities

    Authors: Corren G. McCoy, Michael L. Nelson, Michele C. Weigle

    Abstract: We examine and rank a set of 264 U.S. universities extracted from the National Collegiate Athletic Association (NCAA) Division I membership and global lists published in U.S. News, Times Higher Education, Academic Ranking of World Universities, and Money Magazine. Our University Twitter Engagement (UTE) rank is based on the friend and extended follower network of primary and affiliated secondary T…

    Submitted 18 August, 2017; originally announced August 2017.

    Comments: 14 pages, 4 figures

  43. arXiv:1705.06218  [pdf, other]

    cs.DL

    Stories From the Past Web

    Authors: Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson

    Abstract: Archiving Web pages into themed collections is a method for ensuring these resources are available for posterity. Services such as Archive-It exist to allow institutions to develop, curate, and preserve collections of Web resources. Understanding the contents and boundaries of these archived collections is a challenge for most people, resulting in the paradox of the larger the collection, the har…

    Submitted 17 May, 2017; originally announced May 2017.

  44. Impact of URI Canonicalization on Memento Count

    Authors: Mat Kelly, Lulwah M. Alkwai, Michael L. Nelson, Michele C. Weigle, Herbert Van de Sompel

    Abstract: Quantifying the captures of a URI over time is useful for researchers to identify the extent to which a Web page has been archived. Memento TimeMaps provide a format to list mementos (URI-Ms) for captures along with brief metadata, like Memento-Datetime, for each URI-M. However, when some URI-Ms are dereferenced, they simply provide a redirect to a different URI-M (instead of a unique representati…

    Submitted 9 March, 2017; originally announced March 2017.

    Comments: 43 pages, 8 figures

  45. arXiv:1605.06154  [pdf, other]

    cs.DL

    Web Infrastructure to Support e-Journal Preservation (and More)

    Authors: Herbert Van de Sompel, David S. H. Rosenthal, Michael L. Nelson

    Abstract: E-journal preservation systems have to ingest millions of articles each year. Ingest, especially of the "long tail" of journals from small publishers, is the largest element of their cost. Cost is the major reason that archives contain less than half the content they should. Automation is essential to minimize these costs. This paper examines the potential for automation beyond the status quo base…

    Submitted 19 May, 2016; originally announced May 2016.

    Comments: 23 pages, 5 figures

    ACM Class: H.3.7

  46. arXiv:1601.05142  [pdf, other]

    cs.DL

    Adapting the Hypercube Model to Archive Deferred Representations and Their Descendants

    Authors: Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson

    Abstract: The web is today's primary publication medium, making web archiving an important activity for historical and analytical purposes. Web pages are increasingly interactive, resulting in pages that are increasingly difficult to archive. Client-side technologies (e.g., JavaScript) enable interactions that can potentially change the client-side state of a representation. We refer to representations that…

    Submitted 19 January, 2016; originally announced January 2016.

  47. arXiv:1512.06195  [pdf, other]

    cs.DL

    Quantifying Orphaned Annotations in Hypothes.is

    Authors: Mohamed Aturban, Michael L. Nelson, Michele C. Weigle

    Abstract: Web annotation has been receiving increased attention recently with the organization of the Open Annotation Collaboration and new tools for open annotation, such as Hypothes.is. We investigate the prevalence of orphaned annotations, where neither the live Web page nor an archived copy of the Web page contains the text that had previously been annotated in the Hypothes.is annotation system (contain…

    Submitted 19 December, 2015; originally announced December 2015.

  48. arXiv:1508.02315  [pdf, other]

    cs.DL cs.IR

    Archiving Deferred Representations Using a Two-Tiered Crawling Approach

    Authors: Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson

    Abstract: Web resources are increasingly interactive, resulting in resources that are increasingly difficult to archive. The archival difficulty is based on the use of client-side technologies (e.g., JavaScript) to change the client-side state of a representation after it has initially loaded. We refer to these representations as deferred representations. We can better archive deferred representations using…

    Submitted 10 August, 2015; originally announced August 2015.

    Comments: To appear at iPRES2015, 11 pages

    ACM Class: H.3.7

  49. arXiv:1506.06279  [pdf, other]

    cs.DL

    Avoiding Spoilers in Fan Wikis of Episodic Fiction

    Authors: Shawn M. Jones, Michael L. Nelson

    Abstract: A variety of fan-based wikis about episodic fiction (e.g., television shows, novels, movies) exist on the World Wide Web. These wikis provide a wealth of information about complex stories, but if readers are behind in their viewing they run the risk of encountering "spoilers" -- information that gives away key plot points before the intended time of the show's writers. Enterprising readers might b…

    Submitted 20 June, 2015; originally announced June 2015.

    Comments: 18 pages, 31 figures, 3 tables, 2 algorithms

    ACM Class: H.3.7

  50. Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

    Authors: Sawood Alam, Fateh ud din B Mehmood, Michael L. Nelson

    Abstract: We propose an approach to index raster images of dictionary pages which in turn would require very little manual effort to enable direct access to the appropriate pages of the dictionary for lookup. Accessibility is further improved by feedback and crowdsourcing that enables highlighting of the specific location on the page where the lookup word is found, annotation, digitization, and fielded sear…

    Submitted 3 September, 2014; originally announced September 2014.

    Comments: 11 pages, 5 images, 2 codes, 1 table

    ACM Class: H.3.3