Goran Glavas, Marc Franco-Salvador, Simone P. Ponzetto, Paolo Rosso , 19 Jan 2018
Recognizing semantically similar sentences or paragraphs across languages is benecial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named entity recognition) that for many languages (or language pairs) do not exist. In contrast, we propose an unsupervised and a very resource-light approach for measuring semantic similarity between texts in dierent languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via the linear translation model. We then align words according to the similarity of their vectors in the bilingual embedding space and investigate dierent unsupervised measures of semantic similarity exploiting bilingual embeddings and word alignments. Requiring only a limited-size set of word translation pairs between the languages, the proposed approach is applicable to virtually any pair of languages for which there exists a suciently large corpus, required to learn monolingual word embeddings. Experimental results on three dierent datasets for measuring semantic textual similarity show that our simple resource-light approach reaches performance close to that of supervised and resource-intensive methods, displaying stability across dierent language pairs. Furthermore, we evaluate the proposed method on two extrinsic tasks, namely extraction of parallel sentences from comparable corpora and cross-lingual plagiarism detection, and show that it yields performance comparable to those of complex resource-intensive state-of-the-art models for the respective tasks.
Marc Franco-Salvador, Greg Kondrak, Paolo Rosso , 06 Sep 2017
The objective of Native Language Identification is to determine the native language of the author of a text that he or she wrote in another language. By contrast, Language Variety Identification aims at classifying texts representing different varieties of a single language. We postulate that both tasks may be reduced to a single objective, which is to identify the language variety of the text. We design a general approach that combines string kernels and word embeddings, which capture different characteristics of texts. The results of our experiments show that the approach achieves excellent results on both tasks, without any task-specific adaptations.
Francisco Rangel, Marc Franco-Salvador, and Paolo Rosso , 30 Mai 2017
Language variety identification aims at labelling texts in a native language (e.g. Spanish, Portuguese, English) with its specific variation (e.g. Argentina, Chile, Mexico, Peru, Spain; Brazil, Portugal; UK, US). In this work we propose a low dimensionality representation (LDR) to address this task with five different varieties of Spanish: Argentina, Chile, Mexico, Peru and Spain. We compare our LDR method with common state-of-the-art representations and show an increase in accuracy of 35%. Furthermore, we compare LDR with two reference distributed representation models. Experimental results show competitive performance while dramatically reducing the dimensionality — and increasing the big data suitability — to only 6 features per variety. Additionally, we analyse the behaviour of the employed machine learning algorithms and the most discriminating features. Finally, we employ an alternative dataset to test the robustness of our low dimensionality representation with another set of similar languages
Marc Franco , 28 Mai 2017
The polarity classification task aims at automatically identifying whether a subjective text is positive or negative. When the target domain is different from those where a model was trained, we refer to a cross-domain setting. That setting usually implies the use of a domain adaptation method. In this work, we study the single and cross-domain
Marc Franco-Salvador, Nataliia Plotnikova, Neha Pawar, Yassine Benajiba , 01 Jan 2017
Author profiling aims at identifying the authors’ traits on the basis of their sociolect aspect, that is, how language is shared by them. This work describes the system submitted by Symanto Research for the PAN 2017 Author Profiling Shared Task. The current edition is focused on language variety and gender identification on Twitter. We address these tasks by exploiting the morphology and semantics of the words. For that purpose, we generate embeddings of the authors’ text based on subword character n-grams. These representations are classified using deep averaging networks. Experimental results show competitive performance in the evaluated author profiling tasks.
Vivek Datla, Sadid A. Hasan, Joey Liu, Yassine Benajiba, Kathy Leey, Ashequl Qadir, Aaditya Prakashz, Oladimeji Farri , 01 Jan 2016
In this paper, we describe our system and results of our participation in the Live-QA track of the Text Retrieval Conference(TREC) 2016. The Live-QA task involves real user questions, extracted from the stream of most recent questions submitted to the Yahoo Answers (YA) site, which have not yet been answered by humans. These questions are pushed to the participants via a socket connection, and the systems are needed to provide an answer which is less than 1000 characters length in less than 60 seconds. The answers given by the system are evaluated by human experts in terms of accuracy, readability, and preciseness. Our strategy for answering the questions include question decomposition, question relatedness identification, and answer generation.