Search | arXiv e-print repository

SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations

Authors: Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, Hassan Sajjad

Abstract: Despite their remarkable successes, state-of-the-art large language models (LLMs), including vision-and-language models (VLMs) and unimodal language models (ULMs), fail to understand precise semantics. For example, semantically equivalent sentences expressed using different lexical compositions elicit diverging representations. The degree of this divergence and its impact on encoded semantics is n… ▽ More Despite their remarkable successes, state-of-the-art large language models (LLMs), including vision-and-language models (VLMs) and unimodal language models (ULMs), fail to understand precise semantics. For example, semantically equivalent sentences expressed using different lexical compositions elicit diverging representations. The degree of this divergence and its impact on encoded semantics is not very well understood. In this paper, we introduce the SUGARCREPE++ dataset to analyze the sensitivity of VLMs and ULMs to lexical and semantic alterations. Each sample in SUGARCREPE++ dataset consists of an image and a corresponding triplet of captions: a pair of semantically equivalent but lexically different positive captions and one hard negative caption. This poses a 3-way semantic (in)equivalence problem to the language models. We comprehensively evaluate VLMs and ULMs that differ in architecture, pre-training objectives and datasets to benchmark the performance of SUGARCREPE++ dataset. Experimental results highlight the difficulties of VLMs in distinguishing between lexical and semantic variations, particularly in object attributes and spatial relations. Although VLMs with larger pre-training datasets, model sizes, and multiple pre-training objectives achieve better performance on SUGARCREPE++, there is a significant opportunity for improvement. We show that all the models which achieve better performance on compositionality datasets need not perform equally well on SUGARCREPE++, signifying that compositionality alone may not be sufficient for understanding semantic and lexical alterations. Given the importance of the property that the SUGARCREPE++ dataset targets, it serves as a new challenge to the vision-and-language community. △ Less

Submitted 18 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

Comments: Added the dataset link to the abstract

MSC Class: 68T45; 68T50 ACM Class: I.2.7; I.2.10

arXiv:2404.16365 [pdf, other]

VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations

Authors: Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, Hassan Sajjad

Abstract: Despite their remarkable successes, state-of-the-art language models face challenges in grasping certain important semantic details. This paper introduces the VISLA (Variance and Invariance to Semantic and Lexical Alterations) benchmark, designed to evaluate the semantic and lexical understanding of language models. VISLA presents a 3-way semantic (in)equivalence task with a triplet of sentences a… ▽ More Despite their remarkable successes, state-of-the-art language models face challenges in grasping certain important semantic details. This paper introduces the VISLA (Variance and Invariance to Semantic and Lexical Alterations) benchmark, designed to evaluate the semantic and lexical understanding of language models. VISLA presents a 3-way semantic (in)equivalence task with a triplet of sentences associated with an image, to evaluate both vision-language models (VLMs) and unimodal language models (ULMs). An evaluation involving 34 VLMs and 20 ULMs reveals surprising difficulties in distinguishing between lexical and semantic variations. Spatial semantics encoded by language models also appear to be highly sensitive to lexical information. Notably, text encoders of VLMs demonstrate greater sensitivity to semantic and lexical variations than unimodal text encoders. Our contributions include the unification of image-to-text and text-to-text retrieval tasks, an off-the-shelf evaluation without fine-tuning, and assessing LMs' semantic (in)variance in the presence of lexical alterations. The results highlight strengths and weaknesses across diverse vision and unimodal language models, contributing to a deeper understanding of their capabilities. % VISLA enables a rigorous evaluation, shedding light on language models' capabilities in handling semantic and lexical nuances. Data and code will be made available at https://github.com/Sri-Harsha/visla_benchmark. △ Less

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.05496 [pdf, other]

Stability Mechanisms for Predictive Safety Filters

Authors: Elias Milios, Kim Peter Wabersich, Felix Berkel, Lukas Schwenkel

Abstract: Predictive safety filters enable the integration of potentially unsafe learning-based control approaches and humans into safety-critical systems. In addition to simple constraint satisfaction, many control problems involve additional stability requirements that may vary depending on the specific use case or environmental context. In this work, we address this problem by augmenting predictive safet… ▽ More Predictive safety filters enable the integration of potentially unsafe learning-based control approaches and humans into safety-critical systems. In addition to simple constraint satisfaction, many control problems involve additional stability requirements that may vary depending on the specific use case or environmental context. In this work, we address this problem by augmenting predictive safety filters with stability guarantees, ranging from bounded convergence to uniform asymptotic stability. The proposed framework extends well-known stability results from model predictive control (MPC) theory while supporting commonly used design techniques. As a result, straightforward extensions to dynamic trajectory tracking problems can be easily adapted, as outlined in this article. The practicality of the framework is demonstrated using an automotive advanced driver assistance scenario, involving a reference trajectory stabilization problem. △ Less

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2403.14895 [pdf, other]

Stance Reasoner: Zero-Shot Stance Detection on Social Media with Explicit Reasoning

Authors: Maksym Taranukhin, Vered Shwartz, Evangelos Milios

Abstract: Social media platforms are rich sources of opinionated content. Stance detection allows the automatic extraction of users' opinions on various topics from such content. We focus on zero-shot stance detection, where the model's success relies on (a) having knowledge about the target topic; and (b) learning general reasoning strategies that can be employed for new topics. We present Stance Reasoner,… ▽ More Social media platforms are rich sources of opinionated content. Stance detection allows the automatic extraction of users' opinions on various topics from such content. We focus on zero-shot stance detection, where the model's success relies on (a) having knowledge about the target topic; and (b) learning general reasoning strategies that can be employed for new topics. We present Stance Reasoner, an approach to zero-shot stance detection on social media that leverages explicit reasoning over background knowledge to guide the model's inference about the document's stance on a target. Specifically, our method uses a pre-trained language model as a source of world knowledge, with the chain-of-thought in-context learning approach to generate intermediate reasoning steps. Stance Reasoner outperforms the current state-of-the-art models on 3 Twitter datasets, including fully supervised models. It can better generalize across targets, while at the same time providing explicit and interpretable explanations for its predictions. △ Less

Submitted 21 March, 2024; originally announced March 2024.

Comments: Accepted to COLING 2024

arXiv:2403.12678 [pdf, other]

Empowering Air Travelers: A Chatbot for Canadian Air Passenger Rights

Authors: Maksym Taranukhin, Sahithya Ravi, Gabor Lukacs, Evangelos Milios, Vered Shwartz

Abstract: The Canadian air travel sector has seen a significant increase in flight delays, cancellations, and other issues concerning passenger rights. Recognizing this demand, we present a chatbot to assist passengers and educate them about their rights. Our system breaks a complex user input into simple queries which are used to retrieve information from a collection of documents detailing air travel regu… ▽ More The Canadian air travel sector has seen a significant increase in flight delays, cancellations, and other issues concerning passenger rights. Recognizing this demand, we present a chatbot to assist passengers and educate them about their rights. Our system breaks a complex user input into simple queries which are used to retrieve information from a collection of documents detailing air travel regulations. The most relevant passages from these documents are presented along with links to the original documents and the generated queries, enabling users to dissect and leverage the information for their unique circumstances. The system successfully overcomes two predominant challenges: understanding complex user inputs, and delivering accurate answers, free of hallucinations, that passengers can rely on for making informed decisions. A user study comparing the chatbot to a Google search demonstrated the chatbot's usefulness and ease of use. Beyond the primary goal of providing accurate and timely information to air passengers regarding their rights, we hope that this system will also enable further research exploring the tradeoff between the user-friendly conversational interface of chatbots and the accuracy of retrieval systems. △ Less

Submitted 19 March, 2024; originally announced March 2024.

Comments: under review

arXiv:2310.20558 [pdf, other]

Breaking the Token Barrier: Chunking and Convolution for Efficient Long Text Classification with BERT

Authors: Aman Jaiswal, Evangelos Milios

Abstract: Transformer-based models, specifically BERT, have propelled research in various NLP tasks. However, these models are limited to a maximum token limit of 512 tokens. Consequently, this makes it non-trivial to apply it in a practical setting with long input. Various complex methods have claimed to overcome this limit, but recent research questions the efficacy of these models across different classi… ▽ More Transformer-based models, specifically BERT, have propelled research in various NLP tasks. However, these models are limited to a maximum token limit of 512 tokens. Consequently, this makes it non-trivial to apply it in a practical setting with long input. Various complex methods have claimed to overcome this limit, but recent research questions the efficacy of these models across different classification tasks. These complex architectures evaluated on carefully curated long datasets perform at par or worse than simple baselines. In this work, we propose a relatively simple extension to vanilla BERT architecture called ChunkBERT that allows finetuning of any pretrained models to perform inference on arbitrarily long text. The proposed method is based on chunking token representations and CNN layers, making it compatible with any pre-trained BERT. We evaluate chunkBERT exclusively on a benchmark for comparing long-text classification models across a variety of tasks (including binary classification, multi-class classification, and multi-label classification). A BERT model finetuned using the ChunkBERT method performs consistently across long samples in the benchmark while utilizing only a fraction (6.25\%) of the original memory footprint. These findings suggest that efficient finetuning and inference can be achieved through simple modifications to pre-trained BERT models. △ Less

Submitted 31 October, 2023; originally announced October 2023.

Comments: 11 pages, 6 figures, submitted to NeurIPS 23

ACM Class: I.2.7

arXiv:2309.01015 [pdf, other]

MPTopic: Improving topic modeling via Masked Permuted pre-training

Authors: Xinche Zhang, Evangelos milios

Abstract: Topic modeling is pivotal in discerning hidden semantic structures within texts, thereby generating meaningful descriptive keywords. While innovative techniques like BERTopic and Top2Vec have recently emerged in the forefront, they manifest certain limitations. Our analysis indicates that these methods might not prioritize the refinement of their clustering mechanism, potentially compromising the… ▽ More Topic modeling is pivotal in discerning hidden semantic structures within texts, thereby generating meaningful descriptive keywords. While innovative techniques like BERTopic and Top2Vec have recently emerged in the forefront, they manifest certain limitations. Our analysis indicates that these methods might not prioritize the refinement of their clustering mechanism, potentially compromising the quality of derived topic clusters. To illustrate, Top2Vec designates the centroids of clustering results to represent topics, whereas BERTopic harnesses C-TF-IDF for its topic extraction.In response to these challenges, we introduce "TF-RDF" (Term Frequency - Relative Document Frequency), a distinctive approach to assess the relevance of terms within a document. Building on the strengths of TF-RDF, we present MPTopic, a clustering algorithm intrinsically driven by the insights of TF-RDF. Through comprehensive evaluation, it is evident that the topic keywords identified with the synergy of MPTopic and TF-RDF outperform those extracted by both BERTopic and Top2Vec. △ Less

Submitted 2 September, 2023; originally announced September 2023.

Comments: 12 pages, will submit to ECIR 2024

arXiv:2306.11832 [pdf, other]

QuOTeS: Query-Oriented Technical Summarization

Authors: Juan Ramirez-Orta, Eduardo Xamena, Ana Maguitman, Axel J. Soto, Flavia P. Zanoto, Evangelos Milios

Abstract: Abstract. When writing an academic paper, researchers often spend considerable time reviewing and summarizing papers to extract relevant citations and data to compose the Introduction and Related Work sections. To address this problem, we propose QuOTeS, an interactive system designed to retrieve sentences related to a summary of the research from a collection of potential references and hence ass… ▽ More Abstract. When writing an academic paper, researchers often spend considerable time reviewing and summarizing papers to extract relevant citations and data to compose the Introduction and Related Work sections. To address this problem, we propose QuOTeS, an interactive system designed to retrieve sentences related to a summary of the research from a collection of potential references and hence assist in the composition of new papers. QuOTeS integrates techniques from Query-Focused Extractive Summarization and High-Recall Information Retrieval to provide Interactive Query-Focused Summarization of scientific documents. To measure the performance of our system, we carried out a comprehensive user study where participants uploaded papers related to their research and evaluated the system in terms of its usability and the quality of the summaries it produces. The results show that QuOTeS provides a positive user experience and consistently provides query-focused summaries that are relevant, concise, and complete. We share the code of our system and the novel Query-Focused Summarization dataset collected during our experiments at https://github.com/jarobyte91/quotes. △ Less

Submitted 20 June, 2023; originally announced June 2023.

Comments: Accepted at ICDAR 2023

arXiv:2211.16752 [pdf, other]

DimenFix: A novel meta-dimensionality reduction method for feature preservation

Authors: Qiaodan Luo, Leonardo Christino, Fernando V Paulovich, Evangelos Milios

Abstract: Dimensionality reduction has become an important research topic as demand for interpreting high-dimensional datasets has been increasing rapidly in recent years. There have been many dimensionality reduction methods with good performance in preserving the overall relationship among data points when mapping them to a lower-dimensional space. However, these existing methods fail to incorporate the d… ▽ More Dimensionality reduction has become an important research topic as demand for interpreting high-dimensional datasets has been increasing rapidly in recent years. There have been many dimensionality reduction methods with good performance in preserving the overall relationship among data points when mapping them to a lower-dimensional space. However, these existing methods fail to incorporate the difference in importance among features. To address this problem, we propose a novel meta-method, DimenFix, which can be operated upon any base dimensionality reduction method that involves a gradient-descent-like process. By allowing users to define the importance of different features, which is considered in dimensionality reduction, DimenFix creates new possibilities to visualize and understand a given dataset. Meanwhile, DimenFix does not increase the time cost or reduce the quality of dimensionality reduction with respect to the base dimensionality reduction used. △ Less

Submitted 30 November, 2022; originally announced November 2022.

arXiv:2204.00585 [pdf, other]

A Theoretical Approach for Structuring and Analysing Knowledge Provenance for Visual Analytics

Authors: Leonardo Christino, Sima Rezaeipourfarsangi, Evangelos Milios, Fernando V. Paulovich

Abstract: The primary goal of Visual Analytics (VA) is to enable user-guided knowledge generation. Theoretical VA works to explain how the different aspects of a VA tool bring forth new insights through user interactivity, which itself can be captured through tracking methods for reproduction or evaluation. However, the process of automatically capturing the user's thought process, such as intent and insigh… ▽ More The primary goal of Visual Analytics (VA) is to enable user-guided knowledge generation. Theoretical VA works to explain how the different aspects of a VA tool bring forth new insights through user interactivity, which itself can be captured through tracking methods for reproduction or evaluation. However, the process of automatically capturing the user's thought process, such as intent and insights, and associating it with user's interaction events are largely ignored. Also, two forms of interactivity capture are typically ambiguous and intermixed: the temporal aspect, which indicates sequences of events, and the atemporal aspect, which explains the workflow as sequences of states within a state-space. In this work, we propose Visual Analytics Knowledge Graph (VAKG), a conceptual framework that brings VA modeling theory to practice through a novel Set-Theory formalization of knowledge modeling. By extracting such a model from a VA tool, VAKG structures a 4-way temporal knowledge graph that describes user behavior and its associated knowledge gain process. Such knowledge graphs can be populated manually or automatically during user analysis sessions, which can then be analyzed using graph analysis methods. VAKG is demonstrated by modeling and collecting Tableau and visual text-mining workflows, where comparative user satisfaction, tool efficacy, and overall workflow shortcomings can be extracted from the knowledge graph. △ Less

Submitted 27 October, 2023; v1 submitted 1 April, 2022; originally announced April 2022.

Comments: 14 pgs, submitted to Computer Graphics Forum 2023

ACM Class: I.2.4; I.2.6

arXiv:2109.06264 [pdf, other]

Post-OCR Document Correction with large Ensembles of Character Sequence-to-Sequence Models

Authors: Juan Ramirez-Orta, Eduardo Xamena, Ana Maguitman, Evangelos Milios, Axel J. Soto

Abstract: In this paper, we propose a novel method based on character sequence-to-sequence models to correct documents already processed with Optical Character Recognition (OCR) systems. The main contribution of this paper is a set of strategies to accurately process strings much longer than the ones used to train the sequence model while being sample- and resource-efficient, supported by thorough experimen… ▽ More In this paper, we propose a novel method based on character sequence-to-sequence models to correct documents already processed with Optical Character Recognition (OCR) systems. The main contribution of this paper is a set of strategies to accurately process strings much longer than the ones used to train the sequence model while being sample- and resource-efficient, supported by thorough experimentation. The strategy with the best performance involves splitting the input document in character n-grams and combining their individual corrections into the final output using a voting scheme that is equivalent to an ensemble of a large number of sequence models. We further investigate how to weigh the contributions from each one of the members of this ensemble. We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve a new state-of-the-art performance in five of them. Our code for post-OCR correction is shared at https://github.com/jarobyte91/post_ocr_correction. △ Less

Submitted 24 January, 2022; v1 submitted 13 September, 2021; originally announced September 2021.

arXiv:2106.03953 [pdf, other]

Neural Abstractive Unsupervised Summarization of Online News Discussions

Authors: Ignacio Tampe Palma, Marcelo Mendoza, Evangelos Milios

Abstract: Summarization has usually relied on gold standard summaries to train extractive or abstractive models. Social media brings a hurdle to summarization techniques since it requires addressing a multi-document multi-author approach. We address this challenging task by introducing a novel method that generates abstractive summaries of online news discussions. Our method extends a BERT-based architectur… ▽ More Summarization has usually relied on gold standard summaries to train extractive or abstractive models. Social media brings a hurdle to summarization techniques since it requires addressing a multi-document multi-author approach. We address this challenging task by introducing a novel method that generates abstractive summaries of online news discussions. Our method extends a BERT-based architecture, including an attention encoding that fed comments' likes during the training stage. To train our model, we define a task which consists of reconstructing high impact comments based on popularity (likes). Accordingly, our model learns to summarize online discussions based on their most relevant comments. Our novel approach provides a summary that represents the most relevant aspects of a news item that users comment on, incorporating the social context as a source of information to summarize texts in online social networks. Our model is evaluated using ROUGE scores between the generated summary and each comment on the thread. Our model, including the social attention encoding, significantly outperforms both extractive and abstractive summarization methods based on such evaluation. △ Less

Submitted 18 June, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

arXiv:2104.05741 [pdf, other]

Active learning for medical code assignment

Authors: Martha Dais Ferreira, Michal Malyska, Nicola Sahar, Riccardo Miotto, Fernando Paulovich, Evangelos Milios

Abstract: Machine Learning (ML) is widely used to automatically extract meaningful information from Electronic Health Records (EHR) to support operational, clinical, and financial decision-making. However, ML models require a large number of annotated examples to provide satisfactory results, which is not possible in most healthcare scenarios due to the high cost of clinician-labeled data. Active Learning (… ▽ More Machine Learning (ML) is widely used to automatically extract meaningful information from Electronic Health Records (EHR) to support operational, clinical, and financial decision-making. However, ML models require a large number of annotated examples to provide satisfactory results, which is not possible in most healthcare scenarios due to the high cost of clinician-labeled data. Active Learning (AL) is a process of selecting the most informative instances to be labeled by an expert to further train a supervised algorithm. We demonstrate the effectiveness of AL in multi-label text classification in the clinical domain. In this context, we apply a set of well-known AL methods to help automatically assign ICD-9 codes on the MIMIC-III dataset. Our results show that the selection of informative instances provides satisfactory classification with a significantly reduced training set (8.3\% of the total instances). We conclude that AL methods can significantly reduce the manual annotation cost while preserving model performance. △ Less

Submitted 12 April, 2021; originally announced April 2021.

Comments: It was accepted in the ACM CHIL 2021 workshop track

arXiv:2104.02604 [pdf, other]

doi 10.1093/bib/bbab365

Using Molecular Embeddings in QSAR Modeling: Does it Make a Difference?

Authors: María Virginia Sabando, Ignacio Ponzoni, Evangelos E. Milios, Axel J. Soto

Abstract: With the consolidation of deep learning in drug discovery, several novel algorithms for learning molecular representations have been proposed. Despite the interest of the community in developing new methods for learning molecular embeddings and their theoretical benefits, comparing molecular embeddings with each other and with traditional representations is not straightforward, which in turn hinde… ▽ More With the consolidation of deep learning in drug discovery, several novel algorithms for learning molecular representations have been proposed. Despite the interest of the community in developing new methods for learning molecular embeddings and their theoretical benefits, comparing molecular embeddings with each other and with traditional representations is not straightforward, which in turn hinders the process of choosing a suitable representation for QSAR modeling. A reason behind this issue is the difficulty of conducting a fair and thorough comparison of the different existing embedding approaches, which requires numerous experiments on various datasets and training scenarios. To close this gap, we reviewed the literature on methods for molecular embeddings and reproduced three unsupervised and two supervised molecular embedding techniques recently proposed in the literature. We compared these five methods concerning their performance in QSAR scenarios using different classification and regression datasets. We also compared these representations to traditional molecular representations, namely molecular descriptors and fingerprints. As opposed to the expected outcome, our experimental setup consisting of over 25,000 trained models and statistical tests revealed that the predictive performance using molecular embeddings did not significantly surpass that of traditional representations. While supervised embeddings yielded competitive results compared to those using traditional molecular representations, unsupervised embeddings tended to perform worse than traditional representations. Our results highlight the need for conducting a careful comparison and analysis of the different embedding techniques prior to using them in drug design tasks, and motivate a discussion about the potential of molecular embeddings in computer-aided drug design. △ Less

Submitted 28 July, 2021; v1 submitted 20 March, 2021; originally announced April 2021.

Journal ref: Briefings in Bioinformatics, Volume 23, Issue 1, January 2022, bbab365

arXiv:2008.10022 [pdf]

doi 10.1007/s41666-021-00111-w

COVID-19 Pandemic: Identifying Key Issues using Social Media and Natural Language Processing

Authors: Oladapo Oyebode, Chinenye Ndulue, Dinesh Mulchandani, Banuchitra Suruliraj, Ashfaq Adib, Fidelia Anulika Orji, Evangelos Milios, Stan Matwin, Rita Orji

Abstract: The COVID-19 pandemic has affected people's lives in many ways. Social media data can reveal public perceptions and experience with respect to the pandemic, and also reveal factors that hamper or support efforts to curb global spread of the disease. In this paper, we analyzed COVID-19-related comments collected from six social media platforms using Natural Language Processing (NLP) techniques. We… ▽ More The COVID-19 pandemic has affected people's lives in many ways. Social media data can reveal public perceptions and experience with respect to the pandemic, and also reveal factors that hamper or support efforts to curb global spread of the disease. In this paper, we analyzed COVID-19-related comments collected from six social media platforms using Natural Language Processing (NLP) techniques. We identified relevant opinionated keyphrases and their respective sentiment polarity (negative or positive) from over 1 million randomly selected comments, and then categorized them into broader themes using thematic analysis. Our results uncover 34 negative themes out of which 17 are economic, socio-political, educational, and political issues. 20 positive themes were also identified. We discuss the negative issues and suggest interventions to tackle them based on the positive themes and research evidence. △ Less

Submitted 23 August, 2020; originally announced August 2020.

Comments: 12 pages, 7 figures, 3 tables

Journal ref: Journal of Healthcare Informatics Research. 2022

arXiv:2007.01379 [pdf, other]

Detecting Ongoing Events Using Contextual Word and Sentence Embeddings

Authors: Mariano Maisonnave, Fernando Delbianco, Fernando Tohmé, Ana Maguitman, Evangelos Milios

Abstract: This paper introduces the Ongoing Event Detection (OED) task, which is a specific Event Detection task where the goal is to detect ongoing event mentions only, as opposed to historical, future, hypothetical, or other forms or events that are neither fresh nor current. Any application that needs to extract structured information about ongoing events from unstructured texts can take advantage of an… ▽ More This paper introduces the Ongoing Event Detection (OED) task, which is a specific Event Detection task where the goal is to detect ongoing event mentions only, as opposed to historical, future, hypothetical, or other forms or events that are neither fresh nor current. Any application that needs to extract structured information about ongoing events from unstructured texts can take advantage of an OED system. The main contribution of this paper are the following: (1) it introduces the OED task along with a dataset manually labeled for the task; (2) it presents the design and implementation of an RNN model for the task that uses BERT embeddings to define contextual word and contextual sentence embeddings as attributes, which to the best of our knowledge were never used before for detecting ongoing events in news; (3) it presents an extensive empirical evaluation that includes (i) the exploration of different architectures and hyperparameters, (ii) an ablation test to study the impact of each attribute, and (iii) a comparison with a replication of a state-of-the-art model. The results offer several insights into the importance of contextual embeddings and indicate that the proposed approach is effective in the OED task, outperforming the baseline models. △ Less

Submitted 5 February, 2021; v1 submitted 2 July, 2020; originally announced July 2020.

arXiv:2001.11631 [pdf, ps, other]

Enhancement of Short Text Clustering by Iterative Classification

Authors: Md Rashadul Hasan Rakib, Norbert Zeh, Magdalena Jankowska, Evangelos Milios

Abstract: Short text clustering is a challenging task due to the lack of signal contained in such short texts. In this work, we propose iterative classification as a method to b o ost the clustering quality (e.g., accuracy) of short texts. Given a clustering of short texts obtained using an arbitrary clustering algorithm, iterative classification applies outlier removal to obtain outlier-free clusters. Then… ▽ More Short text clustering is a challenging task due to the lack of signal contained in such short texts. In this work, we propose iterative classification as a method to b o ost the clustering quality (e.g., accuracy) of short texts. Given a clustering of short texts obtained using an arbitrary clustering algorithm, iterative classification applies outlier removal to obtain outlier-free clusters. Then it trains a classification algorithm using the non-outliers based on their cluster distributions. Using the trained classification model, iterative classification reclassifies the outliers to obtain a new set of clusters. By repeating this several times, we obtain a much improved clustering of texts. Our experimental results show that the proposed clustering enhancement method not only improves the clustering quality of different clustering methods (e.g., k-means, k-means--, and hierarchical clustering) but also outperforms the state-of-the-art short text clustering methods on several short text datasets by a statistically significant margin. △ Less

Submitted 30 January, 2020; originally announced January 2020.

Comments: 30 pages, 2 figures

arXiv:1702.03470 [pdf, ps, other]

Vector Embedding of Wikipedia Concepts and Entities

Authors: Ehsan Sherkat, Evangelos Milios

Abstract: Using deep learning for different machine learning tasks such as image classification and word embedding has recently gained many attentions. Its appealing performance reported across specific Natural Language Processing (NLP) tasks in comparison with other approaches is the reason for its popularity. Word embedding is the task of mapping words or phrases to a low dimensional numerical vector. In… ▽ More Using deep learning for different machine learning tasks such as image classification and word embedding has recently gained many attentions. Its appealing performance reported across specific Natural Language Processing (NLP) tasks in comparison with other approaches is the reason for its popularity. Word embedding is the task of mapping words or phrases to a low dimensional numerical vector. In this paper, we use deep learning to embed Wikipedia Concepts and Entities. The English version of Wikipedia contains more than five million pages, which suggest its capability to cover many English Entities, Phrases, and Concepts. Each Wikipedia page is considered as a concept. Some concepts correspond to entities, such as a person's name, an organization or a place. Contrary to word embedding, Wikipedia Concepts Embedding is not ambiguous, so there are different vectors for concepts with similar surface form but different mentions. We proposed several approaches and evaluated their performance based on Concept Analogy and Concept Similarity tasks. The results show that proposed approaches have the performance comparable and in some cases even higher than the state-of-the-art methods. △ Less

Submitted 11 February, 2017; originally announced February 2017.

arXiv:1611.06950 [pdf, ps, other]

Statistical Learning for OCR Text Correction

Authors: Jie Mei, Aminul Islam, Yajing Wu, Abidalrahman Moh'd, Evangelos E. Milios

Abstract: The accuracy of Optical Character Recognition (OCR) is crucial to the success of subsequent applications used in text analyzing pipeline. Recent models of OCR post-processing significantly improve the quality of OCR-generated text, but are still prone to suggest correction candidates from limited observations while insufficiently accounting for the characteristics of OCR errors. In this paper, we… ▽ More The accuracy of Optical Character Recognition (OCR) is crucial to the success of subsequent applications used in text analyzing pipeline. Recent models of OCR post-processing significantly improve the quality of OCR-generated text, but are still prone to suggest correction candidates from limited observations while insufficiently accounting for the characteristics of OCR errors. In this paper, we show how to enlarge candidate suggestion space by using external corpus and integrating OCR-specific features in a regression approach to correct OCR-generated errors. The evaluation results show that our model can correct 61.5% of the OCR-errors (considering the top 1 suggestion) and 71.5% of the OCR-errors (considering the top 3 suggestions), for cases where the theoretical correction upper-bound is 78%. △ Less

Submitted 21 November, 2016; originally announced November 2016.

arXiv:1311.5978 [pdf, other]

Event Evolution Tracking from Streaming Social Posts

Authors: Pei Lee, Laks V. S. Lakshmanan, Evangelos E. Milios

Abstract: Online social post streams such as Twitter timelines and forum discussions have emerged as important channels for information dissemination. They are noisy, informal, and surge quickly. Real life events, which may happen and evolve every minute, are perceived and circulated in post streams by social users. Intuitively, an event can be viewed as a dense cluster of posts with a life cycle sharing th… ▽ More Online social post streams such as Twitter timelines and forum discussions have emerged as important channels for information dissemination. They are noisy, informal, and surge quickly. Real life events, which may happen and evolve every minute, are perceived and circulated in post streams by social users. Intuitively, an event can be viewed as a dense cluster of posts with a life cycle sharing the same descriptive words. There are many previous works on event detection from social streams. However, there has been surprisingly little work on tracking the evolution patterns of events, e.g., birth/death, growth/decay, merge/split, which we address in this paper. To define a tracking scope, we use a sliding time window, where old posts disappear and new posts appear at each moment. Following that, we model a social post stream as an evolving network, where each social post is a node, and edges between posts are constructed when the post similarity is above a threshold. We propose a framework which summarizes the information in the stream within the current time window as a ``sketch graph'' composed of ``core'' posts. We develop incremental update algorithms to handle highly dynamic social streams and track event evolution patterns in real time. Moreover, we visualize events as word clouds to aid human perception. Our evaluation on a real data set consisting of 5.2 million posts demonstrates that our method can effectively track event dynamics in the whole life cycle from very large volumes of social streams on the fly. △ Less

Submitted 23 November, 2013; originally announced November 2013.

Showing 1–20 of 20 results for author: Milios, E