Our research revolves around natural language processing, machine learning, text analytics (aka text mining), and data mining. We pursue quite diverse research topics, ranging from data analysis-oriented ones, such as information extraction and semantic search, to NLP-oriented ones, such as computational semantics and argumentation mining. We mostly work with English, Croatian, and German data. Many of our research efforts involve cooperation with partners from industry and academia. Equally important, we strive to involve students in our research as much as possible.

We make most software and resources produced as part of our research freely available to the research community. Visit this page to browse our software and resources.

Information extraction

Information extraction uses natural language processing to automatically extract structured information, which can then be analyzed on par with structured data using data mining and knowledge discovery techniques. Within our Knowledge Discovery in Textual Data project, we worked on methods for the extraction of event-oriented information from news stories. We introduced a novel, graph-based event extraction method (see our NLE paper), and have shown that graph-based event extraction can be used for efficient event-centered information retrieval (see our ESWA paper), event identification (see our ACL 2013 paper), as well as multi-document summarization. We also explored the interesting problem of automatic extraction of spatio-temporal event hierarchies (see our LREC 2014 paper and our TextGraphs 2014 paper). Most recently, we developed MINERAL, a system for extraction of disease and disorder mentions from clinical text (see our paper to appear at SemEval 2015 or the online demo).

In the realm of language technologies for Croatian, we developed CroNER, a state-of-the-art Named Entity Recognizer for Croatian (see our Informatica paper and the online demo). Building on CroNER, and in cooperation with our industry partners, we built a system for information extraction from police reports, meant to serve as a basis for a car insurance fraud system (see project description). We also explored the extraction of events and temporal expressions from Croatian texts (see our IS-JT 2012 paper). Recently, we developed a named entity recognizer for Croatian tweets (see our IS-JT 2014 paper) and a HeidelTime module for temporal expression extraction for Croatian (see our IS-JT 2014 paper).

Semantic search

Semantic search is an umbrella term for methods that enhance information retrieval with natural language semantics, ranging from morphological analysis and synonym detection to concept indexing and faceted search.

In our multiple-award-winning CADIAL project, we developed a semantic search engine for accessing Croatian legislative documents. To improve search performance and user experience, the engine leverages EuroVoc descriptors, semantic descriptors from the multilingual EuroVoc thesaurus, which our system assigns automatically to more than 20,000 documents (see our book on the CADIAL project and our SPLeT 2014 paper). The semantic search engine enables cross-lingual search (English/Croatian) and is now part of the European Commission N-Lex portal, a common gateway to National Law of EU Member States.

Another topic we looked into is the retrieval of frequently asked questions (FAQs). In many scenarios, FAQ retrieval can meet users’ information needs more directly. We developed a prototype FAQ retrieval system for Croatian (see our BSNLP 2013 paper), and are now continuing the research in that direction.

Computational semantics

Computational Semantics, standing proudly at the crossroads of Linguistics and Computer Science, studies how to automate the construction of meaning representations. The notion of meaning is of major importance for, among other things, the automatic processing of natural language.

We are currently pursuing three Computational Semantics research topics. First are the models for measuring semantic textual similarity (STS) – a task that is useful for many downstream NLP problems. We developed an STS system that hit the top ranks in the SemEval 2012 shared evaluation task and became one of the de facto standards for this task (see our *SEM 2012 paper). Along the same lines, we recently developed an STS system for tweets (paper to appear at SemEval 2015).

Our second research direction is distributional semantics. Distributional semantics has emerged as a popular and practical approach to computational semantics, in which the meaning of linguistic expressions is modeled based on the statistical analysis of word co-occurrences in large corpora. To model the meaning of Croatian words, we explored the (now classic) word-based models (see our IS-JT 2012 paper and TSD 2011 paper), syntax-based distributional semantic models (see our ACL 2013 paper), and the more recent neural network-based models (see our IS-JT 2014 paper).
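As a rough illustration of the word-based idea described above, the following minimal sketch builds co-occurrence count vectors from a toy corpus and compares words with cosine similarity. The corpus, the window size, and the function names (`cooccurrence_vectors`, `cosine`) are assumptions made for this example; it is not the model from the cited papers, which work with large corpora and more refined weighting.

```python
from collections import Counter, defaultdict
import math


def cooccurrence_vectors(sentences, window=2):
    """Count, for every word, the words that occur within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[w] for w, count in u.items() if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the mouse".split(),
    "the dog chases the ball".split(),
    "the car needs fuel".split(),
]

vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_fuel = cosine(vectors["cat"], vectors["fuel"])
print(sim_cat_dog, sim_cat_fuel)
```

In this toy setting, "cat" and "dog" share most of their contexts and therefore come out as more similar to each other than "cat" and "fuel", which share none.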

Our third research direction within the Computational Semantics purview is the extraction and modeling of multi-word expressions. We explored how lexical association measures can be extended to arbitrary-length expressions (see our CS&L paper) and how they can be induced automatically by means of genetic programming (see our ACL 2008 paper). We also explored the use of distributional semantics for predicting the transparency of Croatian multi-word expressions, a task of practical relevance for applications such as machine translation and information retrieval (see our IS-JT 2014 paper).
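To give a flavor of what a lexical association measure does, here is a minimal sketch that scores adjacent word pairs with pointwise mutual information (PMI). PMI is chosen only as a familiar example and covers just the two-word case; the measures, their extensions to longer expressions, and the genetic programming approach in the cited papers are not reproduced here. The function name `pmi_scores` and the toy text are assumptions for illustration.

```python
from collections import Counter
import math


def pmi_scores(tokens):
    """Score each adjacent word pair by pointwise mutual information (log base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi            # relative frequency of the pair
        p_x = unigrams[w1] / n_uni     # relative frequency of the first word
        p_y = unigrams[w2] / n_uni     # relative frequency of the second word
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy text (hypothetical data, for illustration only).
tokens = "new york is big and new york is old and the big city is the old city".split()
scores = pmi_scores(tokens)

# List candidate pairs from the most to the least strongly associated.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Pairs whose words co-occur more often than their individual frequencies would suggest, such as "new york" in the toy text, receive higher scores than incidental pairs such as "is the".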

Computational morphology

Computational Morphology deals with theories and techniques for the analysis and synthesis of word forms. Since morphological processing is generally recognized as an important step for many NLP tasks, computational morphology is of practical relevance for many languages. This is the case not only for inflectionally rich Slavic languages, such as Croatian, but also for languages with less prominent inflectional morphology, such as German.

Our earlier research focused on formal models of Croatian morphology. We proposed novel models for both inflectional morphology (see our FASSBL 2008 paper) and derivational morphology (see our FASSBL 2010 paper) that exploit the notion of higher-order functions. Building on these findings, we proposed a method for the acquisition of inflectional lexica for information retrieval (see our IP&M paper) and a model for predicting the inflectional paradigms of unknown Croatian words (see our Slovenščina 2.0 paper).

Our recent research focuses on derivational morphology and the semantics of derivation. In collaboration with Heidelberg University and Stuttgart University, we have worked on a method for the induction of large-scale derivational morphology resources for German and Croatian (see our ACL 2013 paper and LREC 2014 paper, respectively). We explored how derivational knowledge provided by such resources can be used to improve similarity predictions in syntax-based distributional spaces (see our ACL 2013 paper). Our recent research activities focus more on the semantic aspect of derivation, such as the prediction of semantic relatedness between derivationally related words (see our COLING 2014 paper).

Text classification and keyphrase extraction

Text classification (aka document categorization) is about assigning predefined categories to documents based on their content. Keyphrase extraction, in contrast, is about extracting informative phrases from the text of the document. Both techniques are indispensable for document management, search, and analysis, especially in the case of large or rapidly growing document archives.

Our primary research focus within this topic was the hierarchical multi-label classification of legal documents based on the EuroVoc thesaurus. We began with the development of a computer-aided document indexing system (see our CIT paper), backed up by suitable multi-word expression statistics (see the other CIT paper). Within the CADIAL project, and in cooperation with KU Leuven, we developed a EuroVoc classifier for Croatian documents (see our SPLeT 2014 paper). On top of this, we developed a semantic search engine that leverages this information to provide semantic search functionality (see our FASSBL 2008 paper).

Keyphrase extraction for Croatian has been a recurring research topic for some time now. We have experimented with supervised (see our FASSBL 2010 paper and InFuture 2009 paper) and unsupervised methods (see our TSD 2011 paper). Recently, we have explored the use of genetic programming for learning keyphrase extraction measures (see our BSNLP 2013 paper).

Sentiment analysis & argumentation mining

Sentiment Analysis (aka Opinion Mining) refers to methods for analyzing the subjectivity of text, including the general attitude towards a given topic as well as more fine-grained opinions towards certain people or product aspects. The advent of social media has escalated the interest in sentiment analysis, with applications ranging from brand analysis to social studies.

We have worked on sentiment analysis for both English and Croatian. For English, we explored a hybrid (combined supervised and semi-supervised) approach for sentiment lexicon acquisition (see our Hybrid2012 paper). Similarly, we experimented with a semi-supervised lexicon acquisition method for Croatian (see our TSD 2012 paper). We also experimented with aspect-oriented sentiment analysis of user reviews in Croatian (see our BSNLP 2013 paper) and with a recently proposed neural network-based model for capturing the sentiment of Croatian phrases (see our IS-JT 2014 paper).

Lately we have shifted our research focus to combining opinion mining with argumentation mining. Argumentation mining aims to detect and structure argumentation in free text. Our research goal here is to go beyond conventional opinion mining and use argumentation mining to discover the reasons why people hold certain opinions and take a certain stance. As a first step toward argument-based opinion mining, we have explored the task of argument recognition in online discussions (see our ArgMining 2014 paper).

Language technologies for Croatian

Language technology (aka human language technology) refers to computer tools and methods for analyzing and producing text and speech. Despite recent advances in the development of language technologies for Croatian, Croatian still belongs to the group of under-resourced languages. At TakeLab we have made it our priority to contribute to the development of state-of-the-art language technologies for Croatian, targeting in particular the tools and resources required for semantic text analysis, semantic search, and digital content management.

Probably our two most prominent results in this vein are MOLEX, a morphological lexicon for Croatian (see our IP&M paper and tool page), and CroNER, a state-of-the-art Named Entity Recognizer for Croatian (see our Informatica paper and the online demo).

Morphological analysis of Croatian has been our research preoccupation since day one. We proposed novel models for both inflectional morphology (see our FASSBL 2008 paper) and derivational morphology (see our FASSBL 2010 paper). We developed a model for predicting the inflectional paradigms of unknown Croatian words (see our Slovenščina 2.0 paper). Furthermore, we developed DerivBase.hr, a large-coverage derivational resource for Croatian (see our LREC 2014 paper).

Another research direction we deem relevant for the development of Croatian language technologies is distributional semantics. We modeled the meaning of Croatian words using word-based models (see our IS-JT 2012 paper and BSNLP 2011 paper), syntax-based distributional semantic models (see our ACL 2013 paper), and the more recent neural network-based models (see our IS-JT 2014 paper). Our datasets are freely available from this page.

In addition, we developed a bunch of small but practical language technology tools for Croatian, such as TermeX, a tool for collocation extraction (see our CICLing 2009 paper and tool page), a diacritics restoration tool (see our InFuture 2009 paper and online demo), a corpus aligner (see our LREC 2010 paper), an OCR error correction tool (see our CECIS 2010 paper and FASSBL 2010 paper), and a sentence segmenter (see our TSD 2012 paper).