Search | arXiv e-print repository

Showing 1–50 of 51 results for author: Skiena, S

Searching in archive cs.
  1. arXiv:2404.00500  [pdf, other]

    cs.CL math.AT

    The Shape of Word Embeddings: Recognizing Language Phylogenies through Topological Data Analysis

    Authors: Ondřej Draganov, Steven Skiena

    Abstract: Word embeddings represent language vocabularies as clouds of $d$-dimensional points. We investigate how information is conveyed by the general shape of these clouds, outside of representing the semantic meaning of each token. Specifically, we use the notion of persistent homology from topological data analysis (TDA) to measure the distances between language pairs from the shape of their unlabeled…

    Submitted 30 March, 2024; originally announced April 2024.

  2. arXiv:2311.06362  [pdf, other]

    cs.CL

    Word Definitions from Large Language Models

    Authors: Yunting Yin, Steven Skiena

    Abstract: Dictionary definitions are historically the arbitrator of what words mean, but this primacy has come under threat by recent progress in NLP, including word embeddings and generative models like ChatGPT. We present an exploratory study of the degree of alignment between word definitions from classical dictionaries and these newer computational artifacts. Specifically, we compare definitions from th…

    Submitted 10 November, 2023; originally announced November 2023.

  3. arXiv:2311.04020  [pdf, other]

    cs.CL

    Analyzing Film Adaptation through Narrative Alignment

    Authors: Tanzir Pial, Shahreen Salim, Charuta Pethe, Allen Kim, Steven Skiena

    Abstract: Novels are often adapted into feature films, but the differences between the two media usually require dropping sections of the source text from the movie script. Here we study this screen adaptation process by constructing narrative alignments using the Smith-Waterman local alignment algorithm coupled with SBERT embedding distance to quantify text similarity between scenes and book units. We use…

    Submitted 7 November, 2023; originally announced November 2023.

    Comments: 20 pages, 5 figures, 10 tables

  4. arXiv:2311.03627  [pdf, other]

    cs.CL

    GNAT: A General Narrative Alignment Tool

    Authors: Tanzir Pial, Steven Skiena

    Abstract: Algorithmic sequence alignment identifies similar segments shared between pairs of documents, and is fundamental to many NLP tasks. But it is difficult to recognize similarities between distant versions of narratives such as translations and retellings, particularly for summaries and abridgements which are much shorter than the original novels. We develop a general approach to narrative alignmen…

    Submitted 6 November, 2023; originally announced November 2023.

    Comments: 17 pages, 5 figures, 8 tables

  5. arXiv:2311.03614  [pdf, other]

    cs.CL

    STONYBOOK: A System and Resource for Large-Scale Analysis of Novels

    Authors: Charuta Pethe, Allen Kim, Rajesh Prabhakar, Tanzir Pial, Steven Skiena

    Abstract: Books have historically been the primary mechanism through which narratives are transmitted. We have developed a collection of resources for the large-scale analysis of novels, including: (1) an open source end-to-end NLP analysis pipeline for the annotation of novels into a standard XML format, (2) a collection of 49,207 distinct cleaned and annotated novels, and (3) a database with an associated…

    Submitted 6 November, 2023; originally announced November 2023.

    Comments: 8 pages, 12 figures

  6. arXiv:2310.06930  [pdf, other]

    cs.SD cs.LG eess.AS

    Prosody Analysis of Audiobooks

    Authors: Charuta Pethe, Yunting Yin, Steven Skiena

    Abstract: Recent advances in text-to-speech have made it possible to generate natural-sounding audio from text. However, audiobook narrations involve dramatic vocalizations and intonations by the reader, with greater reliance on emotions, dialogues, and descriptions in the narrative. Using our dataset of 93 aligned book-audiobook pairs, we present improved models for prosody prediction properties (pitch, vo…

    Submitted 10 October, 2023; originally announced October 2023.

  7. arXiv:2306.02102  [pdf, other]

    cs.DS

    Accelerating Personalized PageRank Vector Computation

    Authors: Zhen Chen, Xingzhi Guo, Baojian Zhou, Deqing Yang, Steven Skiena

    Abstract: Personalized PageRank Vectors are widely used as fundamental graph-learning tools for detecting anomalous spammers, learning graph embeddings, and training graph neural networks. The well-known local FwdPush algorithm approximates PPVs and has a sublinear rate of $O\big(\frac{1}{\alpha\epsilon}\big)$. A recent study found that when high precision is required, FwdPush is similar to the power iteration method,…

    Submitted 5 June, 2023; v1 submitted 3 June, 2023; originally announced June 2023.

  8. arXiv:2306.01528  [pdf, other]

    cs.CG cs.LG

    Does it pay to optimize AUC?

    Authors: Baojian Zhou, Steven Skiena

    Abstract: The Area Under the ROC Curve (AUC) is an important model metric for evaluating binary classifiers, and many algorithms have been proposed to optimize AUC approximately. It raises the question of whether the generally insignificant gains observed by previous studies are due to inherent limitations of the metric or the inadequate quality of optimization. To better understand the value of optimizin…

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: 16 pages, AAAI

  9. arXiv:2212.08578  [pdf, other]

    cs.LG cs.CY

    Provable Fairness for Neural Network Models using Formal Verification

    Authors: Giorgian Borca-Tasciuc, Xingzhi Guo, Stanley Bak, Steven Skiena

    Abstract: Machine learning models are increasingly deployed for critical decision-making tasks, making it important to verify that they do not contain gender or racial biases picked up from training data. Typical approaches to achieve fairness revolve around efforts to clean or curate training data, with post-hoc statistical evaluation of the fairness of the model on evaluation data. In contrast, we propose…

    Submitted 16 December, 2022; originally announced December 2022.

  10. arXiv:2211.01430  [pdf, other]

    cs.CL

    Hierarchies over Vector Space: Orienting Word and Graph Embeddings

    Authors: Xingzhi Guo, Steven Skiena

    Abstract: Word and graph embeddings are widely used in deep learning applications. We present a data structure that captures inherent hierarchical properties from an unordered flat embedding space, particularly a sense of direction between pairs of entities. Inspired by the notion of \textit{distributional generality}, our algorithm constructs an arborescence (a directed rooted tree) by inserting nodes in d…

    Submitted 2 November, 2022; originally announced November 2022.

  11. Subset Node Anomaly Tracking over Large Dynamic Graphs

    Authors: Xingzhi Guo, Baojian Zhou, Steven Skiena

    Abstract: Tracking a targeted subset of nodes in an evolving graph is important for many real-world applications. Existing methods typically focus on identifying anomalous edges or finding anomaly graph snapshots in a stream way. However, edge-oriented methods cannot quantify how individual nodes change over time while others need to maintain representations of the whole graph all time, thus computationally…

    Submitted 17 November, 2022; v1 submitted 19 May, 2022; originally announced May 2022.

    Comments: 9 pages + 2 pages supplement, accepted to 2022 ACM SIGKDD Research Track - fixed one notation typo

  12. arXiv:2204.10053  [pdf, other]

    cs.CG

    Time Window Frechet and Metric-Based Edit Distance for Passively Collected Trajectories

    Authors: Jiaxin Ding, Jie Gao, Steven Skiena

    Abstract: The advances of modern localization techniques and the wide spread of mobile devices have provided us great opportunities to collect and mine human mobility trajectories. In this work, we focus on passively collected trajectories, which are sequences of time-stamped locations that mobile entities visit. To analyse such trajectories, a crucial part is a measure of similarity between two trajectorie…

    Submitted 21 April, 2022; originally announced April 2022.

  13. arXiv:2110.11934  [pdf, other]

    cs.CL

    Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts

    Authors: Allen Kim, Charuta Pethe, Naoya Inoue, Steve Skiena

    Abstract: Substantial amounts of work are required to clean large collections of digitized books for NLP analysis, both because of the presence of errors in the scanned text and the presence of duplicate volumes in the corpora. In this paper, we consider the issue of deduplication in the presence of optical character recognition (OCR) errors. We present methods to handle these errors, evaluated on a collect…

    Submitted 22 October, 2021; originally announced October 2021.

    Comments: Accepted for Findings of EMNLP 2021

  14. arXiv:2106.01570  [pdf, other]

    cs.SI

    Subset Node Representation Learning over Large Dynamic Graphs

    Authors: Xingzhi Guo, Baojian Zhou, Steven Skiena

    Abstract: Dynamic graph representation learning is a task to learn node embeddings over dynamic networks, and has many important applications, including knowledge graphs, citation networks to social networks. Graphs of this type are usually large-scale but only a small subset of vertices are related in downstream tasks. Current methods are too expensive to this setting as the complexity is at best linear-de…

    Submitted 2 June, 2021; originally announced June 2021.

    Comments: 9 pages + 2 pages supplement, accepted to 2021 ACM SIGKDD

  15. Maximizing the Expected Value of a Lottery Ticket: How to Sell and When to Buy

    Authors: Allen Kim, Steven Skiena

    Abstract: Unusually large prize pools in lotteries like Mega Millions and Powerball attract additional bettors, which increases the likelihood that multiple winners will have to share the pool. Thus, the expected value of a lottery ticket decreases as the probability of collisions (two or more bettors with identical winning tickets) increase. We propose a way to increase the expected value of lottery ticket…

    Submitted 11 January, 2021; originally announced January 2021.

    Journal ref: CHANCE 33.1 (2020), 30-37

  16. arXiv:2011.04163  [pdf, other]

    cs.CL

    Chapter Captor: Text Segmentation in Novels

    Authors: Charuta Pethe, Allen Kim, Steven Skiena

    Abstract: Books are typically segmented into chapters and sections, representing coherent subnarratives and topics. We investigate the task of predicting chapter boundaries, as a proxy for the general task of segmenting long texts. We build a Project Gutenberg chapter segmentation data set of 9,126 English novels, using a hybrid approach combining neural inference and rule matching to recognize chapter titl…

    Submitted 8 November, 2020; originally announced November 2020.

    Comments: 11 pages, 10 figures, Accepted at EMNLP 2020 as a long paper

  17. arXiv:2011.04124  [pdf, other]

    cs.CL

    What time is it? Temporal Analysis of Novels

    Authors: Allen Kim, Charuta Pethe, Steven Skiena

    Abstract: Recognizing the flow of time in a story is a crucial aspect of understanding it. Prior work related to time has primarily focused on identifying temporal expressions or relative sequencing of events, but here we propose computationally annotating each line of a book with wall clock times, even in the absence of explicit time-descriptive phrases. To do so, we construct a data set of hourly time phr…

    Submitted 8 November, 2020; originally announced November 2020.

    Comments: EMNLP 2020

  18. arXiv:2010.08676  [pdf, other]

    stat.ME cs.DS

    Fast Spatial Autocorrelation

    Authors: Anar Amgalan, Lilianne R. Mujica-Parodi, Steven S. Skiena

    Abstract: Physical or geographic location proves to be an important feature in many data science models, because many diverse natural and social phenomenon have a spatial component. Spatial autocorrelation measures the extent to which locally adjacent observations of the same phenomenon are correlated. Although statistics like Moran's $I$ and Geary's $C$ are widely used to measure spatial autocorrelation, t…

    Submitted 16 October, 2020; originally announced October 2020.

    Comments: To be published in ICDM 2020

  19. arXiv:2009.10867  [pdf, other]

    cs.LG cs.AI stat.ML

    Online AUC Optimization for Sparse High-Dimensional Datasets

    Authors: Baojian Zhou, Yiming Ying, Steven Skiena

    Abstract: The Area Under the ROC Curve (AUC) is a widely used performance measure for imbalanced classification arising from many application domains where high-dimensional sparse data is abundant. In such cases, each $d$ dimensional sample has only $k$ non-zero features with $k \ll d$, and data arrives sequentially in a streaming form. Current online AUC optimization algorithms have high per-iteration cost…

    Submitted 22 September, 2020; originally announced September 2020.

    Comments: 20th IEEE International Conference on Data Mining

  20. arXiv:1909.04002  [pdf, other]

    cs.CL

    The Trumpiest Trump? Identifying a Subject's Most Characteristic Tweets

    Authors: Charuta Pethe, Steven Skiena

    Abstract: The sequence of documents produced by any given author varies in style and content, but some documents are more typical or representative of the source than others. We quantify the extent to which a given short text is characteristic of a specific person, using a dataset of tweets from fifteen celebrities. Such analysis is useful for generating excerpts of high-volume Twitter profiles, and underst…

    Submitted 9 September, 2019; originally announced September 2019.

    Comments: 11 pages, 4 figures. Accepted at EMNLP-IJCNLP 2019 as a long paper

  21. arXiv:1908.11512  [pdf, other]

    cs.SI cs.LG

    Fast and Accurate Network Embeddings via Very Sparse Random Projection

    Authors: Haochen Chen, Syed Fahad Sultan, Yingtao Tian, Muhao Chen, Steven Skiena

    Abstract: We present FastRP, a scalable and performant algorithm for learning distributed node representations in a graph. FastRP is over 4,000 times faster than state-of-the-art methods such as DeepWalk and node2vec, while achieving comparable or even better performance as evaluated on several real-world networks on various downstream tasks. We observe that most network embedding methods consist of two com…

    Submitted 29 August, 2019; originally announced August 2019.

    Comments: CIKM 2019 Long Paper

  22. arXiv:1905.04799  [pdf, other]

    cs.SI cs.CL

    The Secret Lives of Names? Name Embeddings from Social Media

    Authors: Junting Ye, Steven Skiena

    Abstract: Your name tells a lot about you: your gender, ethnicity and so on. It has been shown that name embeddings are more effective in representing names than traditional substring features. However, our previous name embedding model is trained on private email data and are not publicly accessible. In this paper, we explore learning name embeddings from public Twitter data. We argue that Twitter embeddin…

    Submitted 12 May, 2019; originally announced May 2019.

    Comments: 9 pages; accepted to 2019 ACM SIGKDD; dataset sharing: www.name-prism.com;

  23. Data Races and the Discrete Resource-time Tradeoff Problem with Resource Reuse over Paths

    Authors: Rathish Das, Shih-Yu Tsai, Sharmila Duppala, Jayson Lynch, Esther M. Arkin, Rezaul Chowdhury, Joseph S. B. Mitchell, Steven Skiena

    Abstract: A determinacy race occurs if two or more logically parallel instructions access the same memory location and at least one of them tries to modify its content. Races often lead to nondeterministic and incorrect program behavior. A data race is a special case of a determinacy race which can be eliminated by associating a mutual-exclusion lock or allowing atomic accesses to the memory location. Howev…

    Submitted 19 April, 2019; originally announced April 2019.

  24. arXiv:1903.07581  [pdf, other]

    cs.SI cs.IR

    MediaRank: Computational Ranking of Online News Sources

    Authors: Junting Ye, Steven Skiena

    Abstract: In the recent political climate, the topic of news quality has drawn attention both from the public and the academic communities. The growing distrust of traditional news media makes it harder to find a common base of accepted truth. In this work, we design and build MediaRank (www.media-rank.com), a fully automated system to rank over 50,000 online news sources around the world. MediaRank collect…

    Submitted 12 May, 2019; v1 submitted 18 March, 2019; originally announced March 2019.

    Comments: 9 pages. Demo: www.media-rank.com. Accepted to 2019 ACM SIGKDD

  25. arXiv:1809.05124  [pdf, other]

    cs.SI physics.soc-ph

    Enhanced Network Embeddings via Exploiting Edge Labels

    Authors: Haochen Chen, Xiaofei Sun, Yingtao Tian, Bryan Perozzi, Muhao Chen, Steven Skiena

    Abstract: Network embedding methods aim at learning low-dimensional latent representation of nodes in a network. While achieving competitive performance on a variety of network inference tasks such as node classification and link prediction, these methods treat the relations between nodes as a binary variable and ignore the rich semantics of edges. In this work, we attempt to learn network embeddings which…

    Submitted 13 September, 2018; originally announced September 2018.

    Comments: CIKM 2018

  26. arXiv:1809.03485  [pdf, other]

    cs.CL

    Multi-view Models for Political Ideology Detection of News Articles

    Authors: Vivek Kulkarni, Junting Ye, Steven Skiena, William Yang Wang

    Abstract: A news article's title, content and link structure often reveal its political ideology. However, most existing works on automatic political ideology detection only leverage textual cues. Drawing inspiration from recent advances in neural inference, we propose a novel attention based multi-view model to leverage cues from all of the above views to identify the ideology evinced by a news article. Ou…

    Submitted 10 September, 2018; originally announced September 2018.

    Comments: 10 pages. EMNLP 2018. Added copyright statement stating this is authors draft (also noticed and fixed issue with citation (spacing and readability))

  27. arXiv:1808.03726  [pdf, other]

    cs.CL cs.AI cs.LG

    Learning to Represent Bilingual Dictionaries

    Authors: Muhao Chen, Yingtao Tian, Haochen Chen, Kai-Wei Chang, Steven Skiena, Carlo Zaniolo

    Abstract: Bilingual word embeddings have been widely used to capture the similarity of lexical semantics in different human languages. However, many applications, such as cross-lingual semantic search and question answering, can be largely benefited from the cross-lingual correspondence between sentences and lexicons. To bridge this gap, we propose a neural embedding model that leverages bilingual dictionar…

    Submitted 6 September, 2019; v1 submitted 10 August, 2018; originally announced August 2018.

    Comments: CoNLL 2019

  28. arXiv:1808.02590  [pdf, other]

    cs.SI

    A Tutorial on Network Embeddings

    Authors: Haochen Chen, Bryan Perozzi, Rami Al-Rfou, Steven Skiena

    Abstract: Network embedding methods aim at learning low-dimensional latent representation of nodes in a network. These representations can be used as features for a wide range of tasks on graphs such as classification, clustering, link prediction, and visualization. In this survey, we give an overview of network embeddings by summarizing and categorizing recent advancements in this research field. We first…

    Submitted 7 August, 2018; originally announced August 2018.

    Comments: 23 pages, 6 figures

  29. arXiv:1806.06478  [pdf, other]

    cs.AI cs.CL

    Co-training Embeddings of Knowledge Graphs and Entity Descriptions for Cross-lingual Entity Alignment

    Authors: Muhao Chen, Yingtao Tian, Kai-Wei Chang, Steven Skiena, Carlo Zaniolo

    Abstract: Multilingual knowledge graph (KG) embeddings provide latent semantic representations of entities and structured knowledge with cross-lingual inferences, which benefit various knowledge-driven cross-lingual NLP tasks. However, precisely learning such cross-lingual inferences is usually hindered by the low coverage of entity alignment in many KGs. Since many multilingual KGs also provide literal des…

    Submitted 17 June, 2018; originally announced June 2018.

    Comments: To appear in IJCAI-18

  30. arXiv:1802.08786  [pdf, other]

    cs.LG cs.CL

    Syntax-Directed Variational Autoencoder for Structured Data

    Authors: Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, Le Song

    Abstract: Deep generative models have been enjoying success in modeling continuous data. However it remains challenging to capture the representations for discrete structures with formal grammars and semantics, e.g., computer programs and molecular structures. How to generate both syntactically and semantically correct data still remains largely an open problem. Inspired by the theory of compiler where the…

    Submitted 23 February, 2018; originally announced February 2018.

    Comments: to appear in ICLR 2018

  31. arXiv:1708.07903  [pdf, other]

    cs.SI cs.CL

    Nationality Classification Using Name Embeddings

    Authors: Junting Ye, Shuchu Han, Yifan Hu, Baris Coskun, Meizhu Liu, Hong Qin, Steven Skiena

    Abstract: Nationality identification unlocks important demographic information, with many applications in biomedical and sociological research. Existing name-based nationality classifiers use name substrings as features and are trained on small, unrepresentative sets of labeled names, typically extracted from Wikipedia. As a result, these methods achieve limited performance and cannot support fine-grained c…

    Submitted 25 August, 2017; originally announced August 2017.

    Comments: 10 pages, 9 figures, 4 table, accepted by CIKM 2017, Demo and free API: www.name-prism.com

  32. arXiv:1706.07845  [pdf, other]

    cs.SI

    HARP: Hierarchical Representation Learning for Networks

    Authors: Haochen Chen, Bryan Perozzi, Yifan Hu, Steven Skiena

    Abstract: We present HARP, a novel method for learning low dimensional embeddings of a graph's nodes which preserves higher-order structural features. Our proposed method achieves this by compressing the input graph prior to embedding it, effectively avoiding troublesome embedding configurations (i.e. local minima) which can pose problems to non-convex optimization. HARP works by finding a smaller graph whi…

    Submitted 16 November, 2017; v1 submitted 23 June, 2017; originally announced June 2017.

    Comments: To appear in AAAI 2018

  33. Latent Human Traits in the Language of Social Media: An Open-Vocabulary Approach

    Authors: Vivek Kulkarni, Margaret L. Kern, David Stillwell, Michal Kosinski, Sandra Matz, Lyle Ungar, Steven Skiena, H. Andrew Schwartz

    Abstract: Over the past century, personality theory and research has successfully identified core sets of characteristics that consistently describe and explain fundamental differences in the way people think, feel and behave. Such characteristics were derived through theory, dictionary analyses, and survey research using explicit self-reports. The availability of social media data spanning millions of user…

    Submitted 22 May, 2017; originally announced May 2017.

    Comments: In submission to PLOS One

  34. arXiv:1704.07427  [pdf, other]

    cs.CL

    Recognizing Descriptive Wikipedia Categories for Historical Figures

    Authors: Yanqing Chen, Steven Skiena

    Abstract: Wikipedia is a useful knowledge source that benefits many applications in language processing and knowledge representation. An important feature of Wikipedia is that of categories. Wikipedia pages are assigned different categories according to their contents as human-annotated labels which can be used in information retrieval, ad hoc search improvements, entity ranking and tag recommendations. How…

    Submitted 24 April, 2017; originally announced April 2017.

    Comments: 9 pages, 6 tables, 5 figures

  35. arXiv:1703.04746  [pdf, other]

    cs.DL

    Citation histories of papers: sometimes the rich get richer, sometimes they don't

    Authors: Michael J. Hazoglu, Vivek Kulkarni, Steven S. Skiena, Ken A. Dill

    Abstract: We describe a simple model of how a publication's citations change over time, based on pure-birth stochastic processes with a linear cumulative advantage effect. The model is applied to citation data from the Physical Review corpus provided by APS. Our model reveals that papers fall into three different clusters: papers that have rapid initial citations and ultimately high impact (fast-hi), fast t…

    Submitted 14 March, 2017; originally announced March 2017.

  36. arXiv:1611.06722  [pdf, other]

    cs.CL

    False-Friend Detection and Entity Matching via Unsupervised Transliteration

    Authors: Yanqing Chen, Steven Skiena

    Abstract: Transliterations play an important role in multilingual entity reference resolution, because proper names increasingly travel between languages in news and social media. Previous work associated with machine translation targets transliteration only single between language pairs, focuses on specific classes of entities (such as cities and celebrities) and relies on manual curation, which limits the…

    Submitted 21 November, 2016; originally announced November 2016.

    Comments: 11 Pages, ACL style

  37. arXiv:1605.03956  [pdf, other]

    cs.CL

    On the Convergent Properties of Word Embedding Methods

    Authors: Yingtao Tian, Vivek Kulkarni, Bryan Perozzi, Steven Skiena

    Abstract: Do word embeddings converge to learn similar things over different initializations? How repeatable are experiments with word embeddings? Are all word embedding techniques equally reliable? In this paper we propose evaluating methods for learning word representations by their consistency across initializations. We propose a measure to quantify the similarity of the learned word representations unde…

    Submitted 12 May, 2016; originally announced May 2016.

    Comments: RepEval @ ACL 2016

  38. arXiv:1605.02115  [pdf, other]

    cs.SI physics.soc-ph

    Don't Walk, Skip! Online Learning of Multi-scale Network Embeddings

    Authors: Bryan Perozzi, Vivek Kulkarni, Haochen Chen, Steven Skiena

    Abstract: We present Walklets, a novel approach for learning multiscale representations of vertices in a network. In contrast to previous works, these representations explicitly encode multiscale vertex relationships in a way that is analytically derivable. Walklets generates these multiscale relationships by subsampling short random walks on the vertices of a graph. By `skipping' over steps in each rando…

    Submitted 24 June, 2017; v1 submitted 6 May, 2016; originally announced May 2016.

    Comments: 8 pages, ASONAM'17

  39. arXiv:1510.06786  [pdf, other]

    cs.CL cs.IR cs.LG

    Freshman or Fresher? Quantifying the Geographic Variation of Internet Language

    Authors: Vivek Kulkarni, Bryan Perozzi, Steven Skiena

    Abstract: We present a new computational technique to detect and analyze statistically significant geographic variation in language. Our meta-analysis approach captures statistical properties of word usage across geographical regions and uses statistical methods to identify significant changes specific to regions. While previous approaches have primarily focused on lexical variation between regions, our met…

    Submitted 7 March, 2016; v1 submitted 22 October, 2015; originally announced October 2015.

    Comments: 11 pages (updated submission)

  40. arXiv:1411.3315  [pdf, other]

    cs.CL cs.IR cs.LG

    Statistically Significant Detection of Linguistic Change

    Authors: Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, Steven Skiena

    Abstract: We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word's meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change poi…

    Submitted 12 November, 2014; originally announced November 2014.

    Comments: 11 pages, 7 figures, 4 tables

    ACM Class: H.3.3; I.2.6

  41. arXiv:1410.3791  [pdf, other]

    cs.CL cs.LG

    POLYGLOT-NER: Massive Multilingual Named Entity Recognition

    Authors: Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, Steven Skiena

    Abstract: The increasing diversity of languages used on the web introduces a new level of complexity to Information Retrieval (IR) systems. We can no longer assume that textual content is written in one language or even the same language family. In this paper, we demonstrate how to build massive multilingual annotators with minimal human expertise and intervention. We describe a system that builds Named Ent…

    Submitted 14 October, 2014; originally announced October 2014.

    Comments: 9 pages, 4 figures, 5 tables

    ACM Class: I.2.7; I.2.6

  42. arXiv:1405.2622  [pdf, ps, other]

    cs.SI physics.soc-ph

    News-Based Group Modeling and Forecasting

    Authors: Wenbin Zhang, Steven Skiena

    Abstract: In this paper, we study news group modeling and forecasting methods using quantitative data generated by our large-scale natural language processing (NLP) text analysis system. A news group is a set of news entities, like top U.S. cities, governors, senators, golfers, or movie actors. Our fame distribution analysis of news groups shows that log-normal and power-law distributions generally could de…

    Submitted 11 May, 2014; originally announced May 2014.

    Comments: 10 pages

    ACM Class: H.2.8

  43. arXiv:1404.1521  [pdf, other]

    cs.LG cs.CL

    Exploring the power of GPU's for training Polyglot language models

    Authors: Vivek Kulkarni, Rami Al-Rfou', Bryan Perozzi, Steven Skiena

    Abstract: One of the major research trends currently is the evolution of heterogeneous parallel computing. GP-GPU computing is being widely used and several applications have been designed to exploit the massive parallelism that GP-GPU's have to offer. While GPU's have always been widely used in areas of computer vision for image processing, little has been done to investigate whether the massive parallelis…

    Submitted 15 April, 2014; v1 submitted 5 April, 2014; originally announced April 2014.

    Comments: version 2 (just corrected citation)

  44. DeepWalk: Online Learning of Social Representations

    Authors: Bryan Perozzi, Rami Al-Rfou, Steven Skiena

    Abstract: We present DeepWalk, a novel approach for learning latent representations of vertices in a network. These latent representations encode social relations in a continuous vector space, which is easily exploited by statistical models. DeepWalk generalizes recent advancements in language modeling and unsupervised feature learning (or deep learning) from sequences of words to graphs. DeepWalk uses loca…

    Submitted 27 June, 2014; v1 submitted 26 March, 2014; originally announced March 2014.

    Comments: 10 pages, 5 figures, 4 tables

    ACM Class: H.2.8; I.2.6; I.5.1

  45. arXiv:1403.1252  [pdf, other]

    cs.LG cs.CL cs.SI

    Inducing Language Networks from Continuous Space Word Representations

    Authors: Bryan Perozzi, Rami Al-Rfou, Vivek Kulkarni, Steven Skiena

    Abstract: Recent advancements in unsupervised feature learning have developed powerful latent representations of words. However, it is still not clear what makes one representation better than another and how we can learn the ideal representation. Understanding the structure of latent spaces attained is key to any future advancement in unsupervised learning. In this work, we introduce a new view of continuo…

    Submitted 27 June, 2014; v1 submitted 5 March, 2014; originally announced March 2014.

    Comments: 14 pages

  46. arXiv:1307.1662  [pdf, other]

    cs.CL cs.LG

    Polyglot: Distributed Word Representations for Multilingual NLP

    Authors: Rami Al-Rfou, Bryan Perozzi, Steven Skiena

    Abstract: Distributed word representations (word embeddings) have recently contributed to competitive performance in language modeling and several NLP tasks. In this work, we train word embeddings for more than 100 languages using their corresponding Wikipedias. We quantitatively demonstrate the utility of our word embeddings by using them as the sole features for training a part of speech tagger for a subs…

    Submitted 27 June, 2014; v1 submitted 5 July, 2013; originally announced July 2013.

    Comments: 10 pages, 2 figures, Proceedings of Conference on Computational Natural Language Learning CoNLL'2013

  47. arXiv:1301.3226  [pdf, ps, other]

    cs.LG cs.CL stat.ML

    The Expressive Power of Word Embeddings

    Authors: Yanqing Chen, Bryan Perozzi, Rami Al-Rfou, Steven Skiena

    Abstract: We seek to better understand the difference in quality of the several publicly released embeddings. We propose several tasks that help to distinguish the characteristics of different embeddings. Our evaluation of sentiment polarity and synonym/antonym relations shows that embeddings are able to capture surprisingly nuanced semantics even in the absence of sentence structure. Moreover, benchmarking…

    Submitted 29 May, 2013; v1 submitted 14 January, 2013; originally announced January 2013.

    Comments: submitted to ICML 2013, Deep Learning for Audio, Speech and Language Processing Workshop. 8 pages, 8 figures

  48. arXiv:1301.2857  [pdf, other]

    cs.CL

    SpeedRead: A Fast Named Entity Recognition Pipeline

    Authors: Rami Al-Rfou', Steven Skiena

    Abstract: Online content analysis employs algorithmic methods to identify entities in unstructured text. Both machine learning and knowledge-base approaches lie at the foundation of contemporary named entities extraction systems. However, the progress in deploying these approaches on web-scale has been hampered by the computational cost of NLP over massive text corpora. We present SpeedRead (SR), a nam…

    Submitted 13 January, 2013; originally announced January 2013.

    Comments: Long paper at COLING 2012

  49. arXiv:0802.4244  [pdf, ps, other]

    cs.NI cs.DS

    Call Admission Control Algorithm for pre-stored VBR video streams

    Authors: Christos Tryfonas, Dimitris Papamichail, Andrew Mehler, Steven Skiena

    Abstract: We examine the problem of accepting a new request for a pre-stored VBR video stream that has been smoothed using any of the smoothing algorithms found in the literature. The output of these algorithms is a piecewise constant-rate schedule for a Variable Bit-Rate (VBR) stream. The schedule guarantees that the decoder buffer does not overflow or underflow. The problem addressed in this paper is th…

    Submitted 28 February, 2008; originally announced February 2008.

    Comments: 12 pages, 9 figures, includes appendix

  50. The Lazy Bureaucrat Scheduling Problem

    Authors: Esther M. Arkin, Michael A. Bender, Joseph S. B. Mitchell, Steven S. Skiena

    Abstract: We introduce a new class of scheduling problems in which the optimization is performed by the worker (single ``machine'') who performs the tasks. A typical worker's objective is to minimize the amount of work he does (he is ``lazy''), or more generally, to schedule as inefficiently (in some sense) as possible. The worker is subject to the constraint that he must be busy when there is work that h…

    Submitted 26 October, 2002; originally announced October 2002.

    Comments: 19 pages, 2 figures, Latex. To appear, Information and Computation

    ACM Class: F.2.2; I.2.8