Research Papers

A Resource-Light Method for Cross-Lingual Semantic Textual Similarity

Goran Glavas, Marc Franco-Salvador, Simone P. Ponzetto, Paolo Rosso, 19 Jan 2018

Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named entity recognition) that for many languages (or language pairs) do not exist. In contrast, we propose an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via a linear translation model. We then align words according to the similarity of their vectors in the bilingual embedding space and investigate different unsupervised measures of semantic similarity exploiting bilingual embeddings and word alignments. Requiring only a limited-size set of word translation pairs between the languages, the proposed approach is applicable to virtually any pair of languages for which there exists a sufficiently large corpus, required to learn monolingual word embeddings. Experimental results on three different datasets for measuring semantic textual similarity show that our simple resource-light approach reaches performance close to that of supervised and resource-intensive methods, displaying stability across different language pairs. Furthermore, we evaluate the proposed method on two extrinsic tasks, namely extraction of parallel sentences from comparable corpora and cross-lingual plagiarism detection, and show that it yields performance comparable to that of complex resource-intensive state-of-the-art models for the respective tasks.
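The core pipeline of the abstract above can be sketched in a few lines: learn a linear map from a small seed dictionary of translation pairs, project one language's embeddings into the other's space, and score sentence similarity by greedily aligning words via cosine similarity. This is a minimal illustration, not the paper's exact formulation; the function names and the least-squares fit are assumptions.

```python
import numpy as np

def learn_projection(X_src, Y_tgt):
    """Least-squares linear translation model W such that X_src @ W ≈ Y_tgt,
    learned from a limited-size seed dictionary of word translation pairs
    (rows of X_src and Y_tgt are embeddings of translation-equivalent words)."""
    W, *_ = np.linalg.lstsq(X_src, Y_tgt, rcond=None)
    return W

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def greedy_align_similarity(sent_a, sent_b, W):
    """Project sentence A's word vectors into B's space, align each word
    to its most similar counterpart, and average the alignment scores."""
    sims = [max(cosine(v @ W, u) for u in sent_b) for v in sent_a]
    return sum(sims) / len(sims)
```

With exactly translated embedding spaces (one an orthogonal rotation of the other), the learned map recovers the rotation and aligned sentence pairs score near 1.0.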

Bridging the Native Language and Language Variety Identification Tasks

Marc Franco-Salvador, Greg Kondrak, Paolo Rosso, 06 Sep 2017

The objective of Native Language Identification is to determine the native language of the author of a text that he or she wrote in another language. By contrast, Language Variety Identification aims at classifying texts representing different varieties of a single language. We postulate that both tasks may be reduced to a single objective, which is to identify the language variety of the text. We design a general approach that combines string kernels and word embeddings, which capture different characteristics of texts. The results of our experiments show that the approach achieves excellent results on both tasks, without any task-specific adaptations.
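One common string kernel for text classification of this kind is the character p-spectrum kernel: the inner product of character n-gram count vectors, cosine-normalized. The sketch below illustrates that idea only; it is an assumption that this is the family of kernels intended, and the function names are hypothetical.

```python
from collections import Counter
import math

def ngram_profile(text, n=3):
    """Character n-gram counts: the feature map of the p-spectrum kernel."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def spectrum_kernel(a, b, n=3):
    """Cosine-normalized p-spectrum string kernel between two strings:
    the dot product of their character n-gram count vectors."""
    pa, pb = ngram_profile(a, n), ngram_profile(b, n)
    dot = sum(pa[g] * pb[g] for g in pa)
    na = math.sqrt(sum(c * c for c in pa.values()))
    nb = math.sqrt(sum(c * c for c in pb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Such kernels pick up spelling-level cues (e.g. "colour" vs. "color") that word embeddings smooth over, which is why the two representations are complementary.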

A Low Dimensionality Representation for Language Variety Identification

Francisco Rangel, Marc Franco-Salvador, and Paolo Rosso, 30 May 2017

Language variety identification aims at labelling texts in a native language (e.g. Spanish, Portuguese, English) with its specific variation (e.g. Argentina, Chile, Mexico, Peru, Spain; Brazil, Portugal; UK, US). In this work we propose a low dimensionality representation (LDR) to address this task with five different varieties of Spanish: Argentina, Chile, Mexico, Peru and Spain. We compare our LDR method with common state-of-the-art representations and show an increase in accuracy of 35%. Furthermore, we compare LDR with two reference distributed representation models. Experimental results show competitive performance while dramatically reducing the dimensionality, and increasing the big data suitability, to only 6 features per variety. Additionally, we analyse the behaviour of the employed machine learning algorithms and the most discriminating features. Finally, we employ an alternative dataset to test the robustness of our low dimensionality representation with another set of similar languages.
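One plausible reading of "6 features per variety" is that each variety contributes a handful of summary statistics over class-specific term weights, so a document is represented by only 6 × (number of varieties) numbers. The sketch below is a hypothetical reconstruction of that idea; the particular six statistics and the weighting scheme are assumptions, not the paper's exact definition.

```python
import statistics

def ldr_features(doc_tokens, class_term_weights):
    """Low-dimensionality sketch: for each variety, summarize the
    class-specific weights of the document's terms with six statistics,
    yielding exactly 6 features per variety."""
    feats = []
    for variety in sorted(class_term_weights):
        weights = class_term_weights[variety]
        w = [weights[t] for t in doc_tokens if t in weights] or [0.0]
        feats += [
            sum(w) / len(w),        # average weight
            statistics.pstdev(w),   # spread of weights
            min(w),                 # minimum weight
            max(w),                 # maximum weight
            statistics.median(w),   # median weight
            sum(1 for t in doc_tokens if t in weights) / len(doc_tokens),  # coverage
        ]
    return feats
```

Because the feature count grows with the number of varieties rather than the vocabulary size, the representation stays tiny even on large corpora, which is the "big data suitability" the abstract refers to.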

Single and Cross-domain Polarity Classification using String Kernels

Marc Franco, 28 May 2017

The polarity classification task aims at automatically identifying whether a subjective text is positive or negative. When the target domain is different from those where a model was trained, we refer to a cross-domain setting. That setting usually implies the use of a domain adaptation method. In this work, we study the single and cross-domain

Subword-based deep averaging networks for author profiling in social media

Marc Franco-Salvador, Nataliia Plotnikova, Neha Pawar, Yassine Benajiba, 01 Jan 2017

Author profiling aims at identifying authors' traits on the basis of their sociolect aspect, that is, how language is shared within the group they belong to. This work describes the system submitted by Symanto Research for the PAN 2017 Author Profiling Shared Task. The current edition is focused on language variety and gender identification on Twitter. We address these tasks by exploiting the morphology and semantics of the words. For that purpose, we generate embeddings of the authors' text based on subword character n-grams. These representations are classified using deep averaging networks. Experimental results show competitive performance in the evaluated author profiling tasks.
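A deep averaging network is conceptually simple: embed the input units (here, subword character n-grams), average the embeddings into one text vector, and feed it through a small feed-forward classifier. The sketch below illustrates that architecture; the dimensions, initialization, and class names are illustrative assumptions, not the submitted system's configuration.

```python
import numpy as np

def subword_ngrams(word, n=3):
    """Character n-grams of a word padded with boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

class DeepAveragingNet:
    """Sketch of a deep averaging network: average subword n-gram
    embeddings into a text vector, then apply two dense layers
    and a softmax over the output classes."""
    def __init__(self, vocab, dim=16, hidden=8, classes=2, seed=0):
        rng = np.random.default_rng(seed)
        self.emb = {g: rng.normal(scale=0.1, size=dim) for g in vocab}
        self.W1 = rng.normal(scale=0.1, size=(dim, hidden))
        self.W2 = rng.normal(scale=0.1, size=(hidden, classes))

    def forward(self, text):
        grams = [g for w in text.split() for g in subword_ngrams(w)]
        vecs = [self.emb[g] for g in grams if g in self.emb]
        avg = np.mean(vecs, axis=0) if vecs else np.zeros(len(self.W1))
        h = np.tanh(avg @ self.W1)          # hidden layer
        logits = h @ self.W2
        e = np.exp(logits - logits.max())   # stable softmax
        return e / e.sum()
```

Averaging over subword units rather than whole words makes the model robust to the creative spelling and out-of-vocabulary tokens typical of Twitter text.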

Open Domain Real-Time Question Answering Based on Semantic and Syntactic Question Similarity

Vivek Datla, Sadid A. Hasan, Joey Liu, Yassine Benajiba, Kathy Lee, Ashequl Qadir, Aaditya Prakash, Oladimeji Farri, 01 Jan 2016

In this paper, we describe our system and the results of our participation in the Live-QA track of the Text Retrieval Conference (TREC) 2016. The Live-QA task involves real user questions, extracted from the stream of most recent questions submitted to the Yahoo Answers (YA) site, which have not yet been answered by humans. These questions are pushed to the participants via a socket connection, and the systems must provide an answer of fewer than 1000 characters in less than 60 seconds. The answers given by the system are evaluated by human experts in terms of accuracy, readability, and preciseness. Our strategy for answering the questions includes question decomposition, question relatedness identification, and answer generation.