-
SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations
Authors:
Sri Harsha Dumpala,
Aman Jaiswal,
Chandramouli Sastry,
Evangelos Milios,
Sageev Oore,
Hassan Sajjad
Abstract:
Despite their remarkable successes, state-of-the-art large language models (LLMs), including vision-and-language models (VLMs) and unimodal language models (ULMs), fail to understand precise semantics. For example, semantically equivalent sentences expressed using different lexical compositions elicit diverging representations. The degree of this divergence and its impact on encoded semantics are not well understood. In this paper, we introduce the SUGARCREPE++ dataset to analyze the sensitivity of VLMs and ULMs to lexical and semantic alterations. Each sample in the SUGARCREPE++ dataset consists of an image and a corresponding triplet of captions: a pair of semantically equivalent but lexically different positive captions and one hard negative caption. This poses a 3-way semantic (in)equivalence problem to the language models. We comprehensively evaluate VLMs and ULMs that differ in architecture, pre-training objectives and datasets to benchmark performance on the SUGARCREPE++ dataset. Experimental results highlight the difficulties of VLMs in distinguishing between lexical and semantic variations, particularly in object attributes and spatial relations. Although VLMs with larger pre-training datasets, model sizes, and multiple pre-training objectives achieve better performance on SUGARCREPE++, there is a significant opportunity for improvement. We show that models that achieve better performance on compositionality datasets do not necessarily perform equally well on SUGARCREPE++, signifying that compositionality alone may not be sufficient for understanding semantic and lexical alterations. Given the importance of the property that the SUGARCREPE++ dataset targets, it serves as a new challenge to the vision-and-language community.
Submitted 18 June, 2024; v1 submitted 16 June, 2024;
originally announced June 2024.
-
VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations
Authors:
Sri Harsha Dumpala,
Aman Jaiswal,
Chandramouli Sastry,
Evangelos Milios,
Sageev Oore,
Hassan Sajjad
Abstract:
Despite their remarkable successes, state-of-the-art language models face challenges in grasping certain important semantic details. This paper introduces the VISLA (Variance and Invariance to Semantic and Lexical Alterations) benchmark, designed to evaluate the semantic and lexical understanding of language models. VISLA presents a 3-way semantic (in)equivalence task with a triplet of sentences associated with an image, to evaluate both vision-language models (VLMs) and unimodal language models (ULMs). An evaluation involving 34 VLMs and 20 ULMs reveals surprising difficulties in distinguishing between lexical and semantic variations. Spatial semantics encoded by language models also appear to be highly sensitive to lexical information. Notably, text encoders of VLMs demonstrate greater sensitivity to semantic and lexical variations than unimodal text encoders. Our contributions include the unification of image-to-text and text-to-text retrieval tasks, an off-the-shelf evaluation without fine-tuning, and assessing LMs' semantic (in)variance in the presence of lexical alterations. The results highlight strengths and weaknesses across diverse vision and unimodal language models, contributing to a deeper understanding of their capabilities. Data and code will be made available at https://github.com/Sri-Harsha/visla_benchmark.
Submitted 25 April, 2024;
originally announced April 2024.
-
Stability Mechanisms for Predictive Safety Filters
Authors:
Elias Milios,
Kim Peter Wabersich,
Felix Berkel,
Lukas Schwenkel
Abstract:
Predictive safety filters enable the integration of potentially unsafe learning-based control approaches and humans into safety-critical systems. In addition to simple constraint satisfaction, many control problems involve additional stability requirements that may vary depending on the specific use case or environmental context. In this work, we address this problem by augmenting predictive safety filters with stability guarantees, ranging from bounded convergence to uniform asymptotic stability. The proposed framework extends well-known stability results from model predictive control (MPC) theory while supporting commonly used design techniques. As a result, straightforward extensions to dynamic trajectory tracking problems can be easily adapted, as outlined in this article. The practicality of the framework is demonstrated using an automotive advanced driver assistance scenario, involving a reference trajectory stabilization problem.
Submitted 8 April, 2024;
originally announced April 2024.
-
Stance Reasoner: Zero-Shot Stance Detection on Social Media with Explicit Reasoning
Authors:
Maksym Taranukhin,
Vered Shwartz,
Evangelos Milios
Abstract:
Social media platforms are rich sources of opinionated content. Stance detection allows the automatic extraction of users' opinions on various topics from such content. We focus on zero-shot stance detection, where the model's success relies on (a) having knowledge about the target topic; and (b) learning general reasoning strategies that can be employed for new topics. We present Stance Reasoner, an approach to zero-shot stance detection on social media that leverages explicit reasoning over background knowledge to guide the model's inference about the document's stance on a target. Specifically, our method uses a pre-trained language model as a source of world knowledge, with the chain-of-thought in-context learning approach to generate intermediate reasoning steps. Stance Reasoner outperforms the current state-of-the-art models on 3 Twitter datasets, including fully supervised models. It can better generalize across targets, while at the same time providing explicit and interpretable explanations for its predictions.
Submitted 21 March, 2024;
originally announced March 2024.
-
Empowering Air Travelers: A Chatbot for Canadian Air Passenger Rights
Authors:
Maksym Taranukhin,
Sahithya Ravi,
Gabor Lukacs,
Evangelos Milios,
Vered Shwartz
Abstract:
The Canadian air travel sector has seen a significant increase in flight delays, cancellations, and other issues concerning passenger rights. Recognizing this demand, we present a chatbot to assist passengers and educate them about their rights. Our system breaks a complex user input into simple queries which are used to retrieve information from a collection of documents detailing air travel regulations. The most relevant passages from these documents are presented along with links to the original documents and the generated queries, enabling users to dissect and leverage the information for their unique circumstances. The system successfully overcomes two predominant challenges: understanding complex user inputs, and delivering accurate answers, free of hallucinations, that passengers can rely on for making informed decisions. A user study comparing the chatbot to a Google search demonstrated the chatbot's usefulness and ease of use. Beyond the primary goal of providing accurate and timely information to air passengers regarding their rights, we hope that this system will also enable further research exploring the tradeoff between the user-friendly conversational interface of chatbots and the accuracy of retrieval systems.
Submitted 19 March, 2024;
originally announced March 2024.
-
Breaking the Token Barrier: Chunking and Convolution for Efficient Long Text Classification with BERT
Authors:
Aman Jaiswal,
Evangelos Milios
Abstract:
Transformer-based models, specifically BERT, have propelled research in various NLP tasks. However, these models are limited to a maximum of 512 input tokens, which makes them non-trivial to apply in practical settings with long inputs. Various complex methods have claimed to overcome this limit, but recent research questions the efficacy of these models across different classification tasks. These complex architectures, evaluated on carefully curated long datasets, perform on par with or worse than simple baselines. In this work, we propose a relatively simple extension to the vanilla BERT architecture called ChunkBERT that allows finetuning of any pretrained model to perform inference on arbitrarily long text. The proposed method is based on chunking token representations and CNN layers, making it compatible with any pre-trained BERT. We evaluate ChunkBERT exclusively on a benchmark for comparing long-text classification models across a variety of tasks (including binary classification, multi-class classification, and multi-label classification). A BERT model finetuned using the ChunkBERT method performs consistently across long samples in the benchmark while utilizing only a fraction (6.25%) of the original memory footprint. These findings suggest that efficient finetuning and inference can be achieved through simple modifications to pre-trained BERT models.
Submitted 31 October, 2023;
originally announced October 2023.
-
MPTopic: Improving topic modeling via Masked Permuted pre-training
Authors:
Xinche Zhang,
Evangelos Milios
Abstract:
Topic modeling is pivotal in discerning hidden semantic structures within texts, thereby generating meaningful descriptive keywords. While innovative techniques like BERTopic and Top2Vec have recently emerged at the forefront, they manifest certain limitations. Our analysis indicates that these methods might not prioritize the refinement of their clustering mechanism, potentially compromising the quality of derived topic clusters. To illustrate, Top2Vec designates the centroids of clustering results to represent topics, whereas BERTopic harnesses C-TF-IDF for its topic extraction. In response to these challenges, we introduce "TF-RDF" (Term Frequency - Relative Document Frequency), a distinctive approach to assess the relevance of terms within a document. Building on the strengths of TF-RDF, we present MPTopic, a clustering algorithm intrinsically driven by the insights of TF-RDF. Through comprehensive evaluation, it is evident that the topic keywords identified with the synergy of MPTopic and TF-RDF outperform those extracted by both BERTopic and Top2Vec.
Submitted 2 September, 2023;
originally announced September 2023.
-
QuOTeS: Query-Oriented Technical Summarization
Authors:
Juan Ramirez-Orta,
Eduardo Xamena,
Ana Maguitman,
Axel J. Soto,
Flavia P. Zanoto,
Evangelos Milios
Abstract:
When writing an academic paper, researchers often spend considerable time reviewing and summarizing papers to extract relevant citations and data to compose the Introduction and Related Work sections. To address this problem, we propose QuOTeS, an interactive system designed to retrieve sentences related to a summary of the research from a collection of potential references and hence assist in the composition of new papers. QuOTeS integrates techniques from Query-Focused Extractive Summarization and High-Recall Information Retrieval to provide Interactive Query-Focused Summarization of scientific documents. To measure the performance of our system, we carried out a comprehensive user study where participants uploaded papers related to their research and evaluated the system in terms of its usability and the quality of the summaries it produces. The results show that QuOTeS provides a positive user experience and consistently provides query-focused summaries that are relevant, concise, and complete. We share the code of our system and the novel Query-Focused Summarization dataset collected during our experiments at https://github.com/jarobyte91/quotes.
Submitted 20 June, 2023;
originally announced June 2023.
-
DimenFix: A novel meta-dimensionality reduction method for feature preservation
Authors:
Qiaodan Luo,
Leonardo Christino,
Fernando V Paulovich,
Evangelos Milios
Abstract:
Dimensionality reduction has become an important research topic as demand for interpreting high-dimensional datasets has been increasing rapidly in recent years. There have been many dimensionality reduction methods with good performance in preserving the overall relationship among data points when mapping them to a lower-dimensional space. However, these existing methods fail to incorporate the difference in importance among features.
To address this problem, we propose a novel meta-method, DimenFix, which can be operated upon any base dimensionality reduction method that involves a gradient-descent-like process. By allowing users to define the importance of different features, which is considered in dimensionality reduction, DimenFix creates new possibilities to visualize and understand a given dataset. Meanwhile, DimenFix does not increase the time cost or reduce the quality of dimensionality reduction with respect to the base dimensionality reduction used.
Submitted 30 November, 2022;
originally announced November 2022.
-
A Theoretical Approach for Structuring and Analysing Knowledge Provenance for Visual Analytics
Authors:
Leonardo Christino,
Sima Rezaeipourfarsangi,
Evangelos Milios,
Fernando V. Paulovich
Abstract:
The primary goal of Visual Analytics (VA) is to enable user-guided knowledge generation. Theoretical VA works to explain how the different aspects of a VA tool bring forth new insights through user interactivity, which itself can be captured through tracking methods for reproduction or evaluation. However, the process of automatically capturing the user's thought process, such as intent and insights, and associating it with the user's interaction events is largely ignored. Also, two forms of interactivity capture are typically ambiguous and intermixed: the temporal aspect, which indicates sequences of events, and the atemporal aspect, which explains the workflow as sequences of states within a state-space. In this work, we propose Visual Analytics Knowledge Graph (VAKG), a conceptual framework that brings VA modeling theory to practice through a novel Set-Theory formalization of knowledge modeling. By extracting such a model from a VA tool, VAKG structures a 4-way temporal knowledge graph that describes user behavior and its associated knowledge gain process. Such knowledge graphs can be populated manually or automatically during user analysis sessions, which can then be analyzed using graph analysis methods. VAKG is demonstrated by modeling and collecting Tableau and visual text-mining workflows, where comparative user satisfaction, tool efficacy, and overall workflow shortcomings can be extracted from the knowledge graph.
Submitted 27 October, 2023; v1 submitted 1 April, 2022;
originally announced April 2022.
-
Post-OCR Document Correction with large Ensembles of Character Sequence-to-Sequence Models
Authors:
Juan Ramirez-Orta,
Eduardo Xamena,
Ana Maguitman,
Evangelos Milios,
Axel J. Soto
Abstract:
In this paper, we propose a novel method based on character sequence-to-sequence models to correct documents already processed with Optical Character Recognition (OCR) systems. The main contribution of this paper is a set of strategies to accurately process strings much longer than the ones used to train the sequence model while being sample- and resource-efficient, supported by thorough experimentation. The strategy with the best performance involves splitting the input document into character n-grams and combining their individual corrections into the final output using a voting scheme that is equivalent to an ensemble of a large number of sequence models. We further investigate how to weigh the contributions from each one of the members of this ensemble. We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve a new state-of-the-art performance in five of them. Our code for post-OCR correction is shared at https://github.com/jarobyte91/post_ocr_correction.
Submitted 24 January, 2022; v1 submitted 13 September, 2021;
originally announced September 2021.
-
Neural Abstractive Unsupervised Summarization of Online News Discussions
Authors:
Ignacio Tampe Palma,
Marcelo Mendoza,
Evangelos Milios
Abstract:
Summarization has usually relied on gold standard summaries to train extractive or abstractive models. Social media brings a hurdle to summarization techniques since it requires addressing a multi-document multi-author approach. We address this challenging task by introducing a novel method that generates abstractive summaries of online news discussions. Our method extends a BERT-based architecture, including an attention encoding that is fed comments' likes during the training stage. To train our model, we define a task which consists of reconstructing high impact comments based on popularity (likes). Accordingly, our model learns to summarize online discussions based on their most relevant comments. Our novel approach provides a summary that represents the most relevant aspects of a news item that users comment on, incorporating the social context as a source of information to summarize texts in online social networks. Our model is evaluated using ROUGE scores between the generated summary and each comment on the thread. Our model, including the social attention encoding, significantly outperforms both extractive and abstractive summarization methods based on such evaluation.
Submitted 18 June, 2021; v1 submitted 7 June, 2021;
originally announced June 2021.
-
Active learning for medical code assignment
Authors:
Martha Dais Ferreira,
Michal Malyska,
Nicola Sahar,
Riccardo Miotto,
Fernando Paulovich,
Evangelos Milios
Abstract:
Machine Learning (ML) is widely used to automatically extract meaningful information from Electronic Health Records (EHR) to support operational, clinical, and financial decision-making. However, ML models require a large number of annotated examples to provide satisfactory results, which is not possible in most healthcare scenarios due to the high cost of clinician-labeled data. Active Learning (AL) is a process of selecting the most informative instances to be labeled by an expert to further train a supervised algorithm. We demonstrate the effectiveness of AL in multi-label text classification in the clinical domain. In this context, we apply a set of well-known AL methods to help automatically assign ICD-9 codes on the MIMIC-III dataset. Our results show that the selection of informative instances provides satisfactory classification with a significantly reduced training set (8.3% of the total instances). We conclude that AL methods can significantly reduce the manual annotation cost while preserving model performance.
Submitted 12 April, 2021;
originally announced April 2021.
-
Using Molecular Embeddings in QSAR Modeling: Does it Make a Difference?
Authors:
María Virginia Sabando,
Ignacio Ponzoni,
Evangelos E. Milios,
Axel J. Soto
Abstract:
With the consolidation of deep learning in drug discovery, several novel algorithms for learning molecular representations have been proposed. Despite the interest of the community in developing new methods for learning molecular embeddings and their theoretical benefits, comparing molecular embeddings with each other and with traditional representations is not straightforward, which in turn hinders the process of choosing a suitable representation for QSAR modeling. A reason behind this issue is the difficulty of conducting a fair and thorough comparison of the different existing embedding approaches, which requires numerous experiments on various datasets and training scenarios. To close this gap, we reviewed the literature on methods for molecular embeddings and reproduced three unsupervised and two supervised molecular embedding techniques recently proposed in the literature. We compared these five methods concerning their performance in QSAR scenarios using different classification and regression datasets. We also compared these representations to traditional molecular representations, namely molecular descriptors and fingerprints. As opposed to the expected outcome, our experimental setup consisting of over 25,000 trained models and statistical tests revealed that the predictive performance using molecular embeddings did not significantly surpass that of traditional representations. While supervised embeddings yielded competitive results compared to those using traditional molecular representations, unsupervised embeddings tended to perform worse than traditional representations. Our results highlight the need for conducting a careful comparison and analysis of the different embedding techniques prior to using them in drug design tasks, and motivate a discussion about the potential of molecular embeddings in computer-aided drug design.
Submitted 28 July, 2021; v1 submitted 20 March, 2021;
originally announced April 2021.
-
COVID-19 Pandemic: Identifying Key Issues using Social Media and Natural Language Processing
Authors:
Oladapo Oyebode,
Chinenye Ndulue,
Dinesh Mulchandani,
Banuchitra Suruliraj,
Ashfaq Adib,
Fidelia Anulika Orji,
Evangelos Milios,
Stan Matwin,
Rita Orji
Abstract:
The COVID-19 pandemic has affected people's lives in many ways. Social media data can reveal public perceptions and experience with respect to the pandemic, and also reveal factors that hamper or support efforts to curb global spread of the disease. In this paper, we analyzed COVID-19-related comments collected from six social media platforms using Natural Language Processing (NLP) techniques. We identified relevant opinionated keyphrases and their respective sentiment polarity (negative or positive) from over 1 million randomly selected comments, and then categorized them into broader themes using thematic analysis. Our results uncover 34 negative themes out of which 17 are economic, socio-political, educational, and political issues. 20 positive themes were also identified. We discuss the negative issues and suggest interventions to tackle them based on the positive themes and research evidence.
Submitted 23 August, 2020;
originally announced August 2020.
-
Detecting Ongoing Events Using Contextual Word and Sentence Embeddings
Authors:
Mariano Maisonnave,
Fernando Delbianco,
Fernando Tohmé,
Ana Maguitman,
Evangelos Milios
Abstract:
This paper introduces the Ongoing Event Detection (OED) task, which is a specific Event Detection task where the goal is to detect ongoing event mentions only, as opposed to historical, future, hypothetical, or other forms of events that are neither fresh nor current. Any application that needs to extract structured information about ongoing events from unstructured texts can take advantage of an OED system. The main contributions of this paper are the following: (1) it introduces the OED task along with a dataset manually labeled for the task; (2) it presents the design and implementation of an RNN model for the task that uses BERT embeddings to define contextual word and contextual sentence embeddings as attributes, which to the best of our knowledge were never used before for detecting ongoing events in news; (3) it presents an extensive empirical evaluation that includes (i) the exploration of different architectures and hyperparameters, (ii) an ablation test to study the impact of each attribute, and (iii) a comparison with a replication of a state-of-the-art model. The results offer several insights into the importance of contextual embeddings and indicate that the proposed approach is effective in the OED task, outperforming the baseline models.
Submitted 5 February, 2021; v1 submitted 2 July, 2020;
originally announced July 2020.
-
Enhancement of Short Text Clustering by Iterative Classification
Authors:
Md Rashadul Hasan Rakib,
Norbert Zeh,
Magdalena Jankowska,
Evangelos Milios
Abstract:
Short text clustering is a challenging task due to the lack of signal contained in such short texts. In this work, we propose iterative classification as a method to boost the clustering quality (e.g., accuracy) of short texts. Given a clustering of short texts obtained using an arbitrary clustering algorithm, iterative classification applies outlier removal to obtain outlier-free clusters. Then it trains a classification algorithm using the non-outliers based on their cluster distributions. Using the trained classification model, iterative classification reclassifies the outliers to obtain a new set of clusters. By repeating this several times, we obtain a much improved clustering of texts. Our experimental results show that the proposed clustering enhancement method not only improves the clustering quality of different clustering methods (e.g., k-means, k-means--, and hierarchical clustering) but also outperforms the state-of-the-art short text clustering methods on several short text datasets by a statistically significant margin.
Submitted 30 January, 2020;
originally announced January 2020.
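The loop described in the abstract above (remove outliers, train on non-outliers, reclassify outliers, repeat) can be illustrated with a minimal sketch. The abstract does not specify the outlier detector or the classifier, so everything concrete here is an assumption: outliers are taken to be the points farthest from their cluster centroid, the "classifier" is a nearest-centroid rule, and the names `iterative_classification`, `keep_fraction`, and `rounds` are hypothetical.

```python
import math
from collections import defaultdict


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def iterative_classification(vectors, labels, rounds=5, keep_fraction=0.8):
    """Refine an initial clustering by repeatedly reclassifying outliers.

    Assumptions (not from the paper): distance-to-centroid outlier removal
    and a nearest-centroid classifier trained on the non-outliers.
    """
    labels = list(labels)
    for _ in range(rounds):
        clusters = defaultdict(list)
        for i, lab in enumerate(labels):
            clusters[lab].append(i)

        centroids, outliers = {}, []
        for lab, idxs in clusters.items():
            c = centroid([vectors[i] for i in idxs])
            # Rank members by distance to the cluster centroid; the farthest
            # ones are treated as outliers (assumed criterion).
            ranked = sorted(idxs, key=lambda i: math.dist(vectors[i], c))
            n_keep = max(1, int(len(ranked) * keep_fraction))
            # "Training": recompute the centroid from the non-outliers only.
            centroids[lab] = centroid([vectors[i] for i in ranked[:n_keep]])
            outliers.extend(ranked[n_keep:])

        # Reclassify only the outliers with the trained model.
        changed = False
        for i in outliers:
            best = min(centroids, key=lambda lab: math.dist(vectors[i], centroids[lab]))
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels


points = [[0, 0], [0, 1], [1, 0], [0.5, 0.5],
          [10, 10], [10, 11], [11, 10], [10.5, 10.5]]
initial = [0, 0, 0, 0, 1, 1, 1, 0]  # last point starts in the wrong cluster
refined = iterative_classification(points, initial)
```

In this toy run the mislabeled point is flagged as an outlier of cluster 0 and moved to cluster 1. For real short texts, the vectors would come from some text representation and the classifier would likely be a stronger model than this nearest-centroid stand-in.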
-
Vector Embedding of Wikipedia Concepts and Entities
Authors:
Ehsan Sherkat,
Evangelos Milios
Abstract:
Using deep learning for different machine learning tasks such as image classification and word embedding has recently gained much attention. Its appealing performance reported across specific Natural Language Processing (NLP) tasks in comparison with other approaches is the reason for its popularity. Word embedding is the task of mapping words or phrases to a low-dimensional numerical vector. In this paper, we use deep learning to embed Wikipedia Concepts and Entities. The English version of Wikipedia contains more than five million pages, which suggests its capability to cover many English Entities, Phrases, and Concepts. Each Wikipedia page is considered as a concept. Some concepts correspond to entities, such as a person's name, an organization or a place. Contrary to word embedding, Wikipedia Concepts Embedding is not ambiguous, so there are different vectors for concepts with similar surface form but different mentions. We propose several approaches and evaluate their performance based on Concept Analogy and Concept Similarity tasks. The results show that the proposed approaches achieve performance comparable to, and in some cases even higher than, the state-of-the-art methods.
Submitted 11 February, 2017;
originally announced February 2017.
-
Statistical Learning for OCR Text Correction
Authors:
Jie Mei,
Aminul Islam,
Yajing Wu,
Abidalrahman Moh'd,
Evangelos E. Milios
Abstract:
The accuracy of Optical Character Recognition (OCR) is crucial to the success of subsequent applications in a text analysis pipeline. Recent models of OCR post-processing significantly improve the quality of OCR-generated text, but are still prone to suggest correction candidates from limited observations while insufficiently accounting for the characteristics of OCR errors. In this paper, we show how to enlarge the candidate suggestion space by using an external corpus and integrating OCR-specific features in a regression approach to correct OCR-generated errors. The evaluation results show that our model can correct 61.5% of the OCR errors (considering the top 1 suggestion) and 71.5% of the OCR errors (considering the top 3 suggestions), for cases where the theoretical correction upper bound is 78%.
Submitted 21 November, 2016;
originally announced November 2016.
-
Event Evolution Tracking from Streaming Social Posts
Authors:
Pei Lee,
Laks V. S. Lakshmanan,
Evangelos E. Milios
Abstract:
Online social post streams such as Twitter timelines and forum discussions have emerged as important channels for information dissemination. They are noisy, informal, and surge quickly. Real-life events, which may happen and evolve every minute, are perceived and circulated in post streams by social users. Intuitively, an event can be viewed as a dense cluster of posts with a life cycle sharing the same descriptive words. There are many previous works on event detection from social streams. However, there has been surprisingly little work on tracking the evolution patterns of events, e.g., birth/death, growth/decay, merge/split, which we address in this paper. To define a tracking scope, we use a sliding time window, where old posts disappear and new posts appear at each moment. Following that, we model a social post stream as an evolving network, where each social post is a node, and edges between posts are constructed when the post similarity is above a threshold. We propose a framework which summarizes the information in the stream within the current time window as a "sketch graph" composed of "core" posts. We develop incremental update algorithms to handle highly dynamic social streams and track event evolution patterns in real time. Moreover, we visualize events as word clouds to aid human perception. Our evaluation on a real data set consisting of 5.2 million posts demonstrates that our method can effectively track event dynamics in the whole life cycle from very large volumes of social streams on the fly.
Submitted 23 November, 2013;
originally announced November 2013.
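The sliding-window post network described in the abstract above (posts as nodes, edges when similarity exceeds a threshold, old posts expiring) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the similarity measure, so Jaccard similarity over word sets is assumed, and the class `SlidingPostGraph` and its parameters are hypothetical. The sketch graph of core posts and the evolution-pattern tracking are not covered.

```python
from collections import deque


def jaccard(a, b):
    """Jaccard similarity of two word sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class SlidingPostGraph:
    """Incrementally maintained similarity graph over a time window."""

    def __init__(self, window, threshold):
        self.window = window        # window length, in timestamp units
        self.threshold = threshold  # minimum similarity for an edge
        self.posts = deque()        # (timestamp, post_id, words), oldest first
        self.edges = {}             # post_id -> set of neighbouring post ids

    def add(self, timestamp, post_id, text):
        # Expire posts that have slid out of the window, with their edges.
        while self.posts and self.posts[0][0] <= timestamp - self.window:
            _, old_id, _ = self.posts.popleft()
            for neighbour in self.edges.pop(old_id, set()):
                self.edges[neighbour].discard(old_id)

        # Connect the new post to every sufficiently similar live post.
        words = set(text.lower().split())
        self.edges[post_id] = set()
        for _, other_id, other_words in self.posts:
            if jaccard(words, other_words) > self.threshold:
                self.edges[post_id].add(other_id)
                self.edges[other_id].add(post_id)
        self.posts.append((timestamp, post_id, words))


graph = SlidingPostGraph(window=10, threshold=0.3)
graph.add(0, "a", "earthquake hits city")
graph.add(1, "b", "earthquake hits the city center")
graph.add(2, "c", "new phone released today")
edges_before = {k: set(v) for k, v in graph.edges.items()}
graph.add(11, "d", "city earthquake aftermath")  # posts "a" and "b" expire
```

Each insertion here compares the new post against every post still in the window, which is linear in the window size. A system handling millions of posts would need an index to limit these comparisons.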