David Dukić

Diamonds are made under pressure 💎

Research Interests

My main research interests include:

  • Transfer learning
  • Information extraction
  • In-context learning
  • Semantic textual similarity
  • Author profiling

Check out my diamonds below 👇

Publications

TakeLab Retriever: AI-Driven Search Engine for Articles from Croatian News Outlets

David Dukić, Marin Petričević, Sven Ćurković, Jan Šnajder

arXiv preprint

TakeLab Retriever is an AI-driven search engine designed to discover, collect, and semantically analyze news articles from Croatian news outlets. It offers a unique perspective on the history and current landscape of Croatian online news media, making it an essential tool for researchers seeking to uncover trends, patterns, and correlations that general-purpose search engines cannot provide. TakeLab Retriever utilizes cutting-edge natural language processing (NLP) methods, enabling users to sift through articles by named entities, phrases, and topics via the web application. This technical report is divided into two parts: the first explains how TakeLab Retriever is used, while the second provides a detailed account of its design. In the second part, we also address the software engineering challenges involved and propose solutions for developing a microservice-based semantic search engine capable of handling over ten million news articles published over the past two decades.
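
To make the idea concrete, here is a minimal sketch (not the actual TakeLab Retriever codebase or API; all names here are hypothetical) of how a semantic news search service can combine NER-based filtering with embedding-similarity ranking:

    # Hypothetical sketch: filter articles by a named entity, then rank the
    # candidates by cosine similarity between a query embedding and
    # precomputed document embeddings.
    from dataclasses import dataclass, field

    import numpy as np


    @dataclass
    class Article:
        title: str
        embedding: np.ndarray                            # precomputed document embedding
        entities: set[str] = field(default_factory=set)  # NER output for this article


    def search(query_vec: np.ndarray, entity: str, index: list[Article], k: int = 5):
        """Keep articles mentioning `entity`, rank them by cosine similarity."""
        candidates = [a for a in index if entity in a.entities]
        scored = sorted(
            candidates,
            key=lambda a: float(
                query_vec @ a.embedding
                / (np.linalg.norm(query_vec) * np.linalg.norm(a.embedding))
            ),
            reverse=True,
        )
        return scored[:k]

In a microservice architecture like the one the report describes, the NER filtering and the embedding index would plausibly sit behind separate services; this sketch collapses them into one process for brevity.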

Are ELECTRA’s Sentence Embeddings Beyond Repair? The Case of Semantic Textual Similarity

Ivan Rep, David Dukić, Jan Šnajder

In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9159–9169, Miami, Florida, USA. Association for Computational Linguistics.

While BERT produces high-quality sentence embeddings, its pre-training computational cost is a significant drawback. In contrast, ELECTRA provides a cost-effective pre-training objective and downstream task performance improvements, but worse sentence embeddings. The community tacitly stopped utilizing ELECTRA’s sentence embeddings for semantic textual similarity (STS). We notice a significant drop in performance for the ELECTRA discriminator’s last layer in comparison to prior layers. We explore this drop and propose a way to repair the embeddings using a novel truncated model fine-tuning (TMFT) method. TMFT improves the Spearman correlation coefficient by over 8 points while increasing parameter efficiency on the STS Benchmark. We extend our analysis to various model sizes, languages, and two other tasks. Further, we discover the surprising efficacy of ELECTRA’s generator model, which performs on par with BERT, using significantly fewer parameters and a substantially smaller embedding size. Finally, we observe boosts by combining TMFT with word similarity or domain adaptive pre-training.
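
As a rough illustration of the TMFT idea, the following sketch, assuming the Hugging Face transformers library, truncates the ELECTRA discriminator at an earlier layer and mean-pools its token states into sentence embeddings; the layer index and pooling choice are illustrative, not the paper's exact recipe:

    # Truncated model sketch: keep only the first k transformer blocks, so
    # fine-tuning and inference use an earlier layer's representations
    # instead of the degraded last-layer embeddings.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/electra-base-discriminator")
    model = AutoModel.from_pretrained("google/electra-base-discriminator")

    k = 9  # illustrative truncation point (ELECTRA-base has 12 layers)
    model.encoder.layer = model.encoder.layer[:k]  # drop everything above layer k

    def embed(sentences: list[str]) -> torch.Tensor:
        batch = tokenizer(sentences, padding=True, return_tensors="pt")
        out = model(**batch).last_hidden_state        # now the output of layer k
        mask = batch["attention_mask"].unsqueeze(-1)  # zero out padding tokens
        return (out * mask).sum(1) / mask.sum(1)      # mean pooling

    a, b = embed(["A man plays guitar.", "Someone is playing an instrument."])
    print(torch.cosine_similarity(a, b, dim=0))

The truncated stack is then fine-tuned end to end on an STS objective, which is also where the parameter-efficiency gain comes from: the layers above the truncation point are simply gone.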

Closed-domain event extraction for hard news event monitoring: a systematic study

David Dukić, Filip Karlo Došilović, Domagoj Pluščec, Jan Šnajder

PeerJ Computer Science

News event monitoring systems allow real-time monitoring of a large number of events reported in the news, including the urgent and critical events comprising the so-called hard news. These systems heavily rely on natural language processing (NLP) to perform automatic event extraction at scale. While state-of-the-art event extraction models are readily available, integrating them into a news event monitoring system is not as straightforward as it seems due to practical issues related to model selection, robustness, and scale. To address this gap, we present a study on the practical use of event extraction models for news event monitoring. Our study focuses on the key task of closed-domain main event extraction (CDMEE), which aims to determine the type of the story’s main event and extract its arguments from the text. We evaluate a range of state-of-the-art NLP models for this task, including those based on pre-trained language models. Aiming at a more realistic evaluation than done in the literature, we introduce a new dataset manually labeled with event types and their arguments. Additionally, we assess the scalability of CDMEE models and analyze the trade-off between accuracy and inference speed. Our results give insights into the performance of state-of-the-art NLP models on the CDMEE task and provide recommendations for developing effective, robust, and scalable news event monitoring systems.
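
The following is a minimal sketch of a two-stage CDMEE-style pipeline, assuming Hugging Face transformers pipelines; the event types are hypothetical and argument extraction is only approximated with off-the-shelf zero-shot classification and NER, so this illustrates the task structure rather than the models evaluated in the paper:

    # Stage 1: classify the story's main event type.
    # Stage 2: extract argument spans from the text.
    from transformers import pipeline

    EVENT_TYPES = ["earthquake", "armed conflict", "industrial accident"]  # hypothetical
    classify = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    tag = pipeline("token-classification", model="dslim/bert-base-NER",
                   aggregation_strategy="simple")

    text = "A magnitude 6.4 earthquake struck Petrinja, Croatia, on 29 December 2020."
    event_type = classify(text, EVENT_TYPES)["labels"][0]        # top-ranked type
    arguments = [(e["word"], e["entity_group"]) for e in tag(text)]
    print(event_type, arguments)

The accuracy-versus-inference-speed trade-off the study analyzes shows up exactly here: both stages run a forward pass per article, so model size directly bounds the monitoring system's throughput.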

Looking Right is Sometimes Right: Investigating the Capabilities of Decoder-only LLMs for Sequence Labeling

David Dukić, Jan Šnajder

In Findings of the Association for Computational Linguistics: ACL 2024, pages 14168–14181, Bangkok, Thailand. Association for Computational Linguistics.

Pre-trained language models based on masked language modeling (MLM) excel in natural language understanding (NLU) tasks. While fine-tuned MLM-based encoders consistently outperform causal language modeling decoders of comparable size, recent decoder-only large language models (LLMs) perform on par with smaller MLM-based encoders. Although their performance improves with scale, LLMs fall short of achieving state-of-the-art results in information extraction (IE) tasks, many of which are formulated as sequence labeling (SL). We hypothesize that LLMs’ poor SL performance stems from causal masking, which prevents the model from attending to tokens on the right of the current token. Yet, how exactly and to what extent LLMs’ performance on SL can be improved remains unclear. We explore techniques for improving the SL performance of open LLMs on IE tasks by applying layer-wise removal of the causal mask (CM) during LLM fine-tuning. This approach yields performance gains competitive with state-of-the-art SL models, matching or outperforming the results of CM removal from all blocks. Our findings hold for diverse SL tasks, demonstrating that open LLMs with layer-dependent CM removal outperform strong MLM-based encoders and even instruction-tuned LLMs.
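
Here is a minimal pure-PyTorch sketch of the underlying mechanism, not the paper's implementation: apply the causal mask only in selected layers of a decoder stack, leaving the remaining layers bidirectional so tokens can attend to their right context during sequence-labeling fine-tuning:

    import torch
    import torch.nn as nn


    class Block(nn.Module):
        def __init__(self, d_model: int, n_heads: int, causal: bool):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)
            self.causal = causal  # True = keep the causal mask in this layer

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            mask = None
            if self.causal:
                t = x.size(1)
                # True above the diagonal = future positions are blocked
                mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
            out, _ = self.attn(x, x, x, attn_mask=mask)
            return self.norm(x + out)


    # Illustrative layer-wise choice: keep the causal mask in the bottom half
    # of the stack and remove it from the top half.
    n_layers, d = 8, 64
    blocks = nn.ModuleList(
        Block(d, n_heads=4, causal=(i < n_layers // 2)) for i in range(n_layers)
    )

    x = torch.randn(2, 10, d)  # (batch, tokens, hidden)
    for block in blocks:
        x = block(x)
    print(x.shape)             # per-token states for a sequence-labeling head

Which layers to unmask is exactly the layer-dependent choice the paper explores; the half-and-half split above is only a placeholder.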

Leveraging Open Information Extraction for More Robust Domain Transfer of Event Trigger Detection

David Dukić, Kiril Gashteovski, Goran Glavaš, Jan Šnajder

In Findings of the Association for Computational Linguistics: EACL 2024, pages 1197–1213, St. Julian’s, Malta. Association for Computational Linguistics.

Event detection is a crucial information extraction task in many domains, such as Wikipedia or news. The task typically relies on trigger detection (TD) – identifying token spans in the text that evoke specific events. While the notion of triggers should ideally be universal across domains, domain transfer for TD from high- to low-resource domains results in significant performance drops. We address the problem of negative transfer in TD by coupling triggers between domains using subject-object relations obtained from a rule-based open information extraction (OIE) system. We demonstrate that OIE relations injected through multi-task training can act as mediators between triggers in different domains, enhancing zero- and few-shot TD domain transfer and reducing performance drops, in particular when transferring from a high-resource source domain (Wikipedia) to a low(er)-resource target domain (news). Additionally, we combine this improved transfer with masked language modeling on the target domain, observing further TD transfer gains. Finally, we demonstrate that the gains are robust to the choice of the OIE system.
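
A minimal sketch of such a multi-task setup, assuming Hugging Face transformers: a shared encoder feeds two token-classification heads, one for trigger detection and one for OIE subject-object relation tags, trained with a joint loss. The backbone, label counts, and equal loss weighting are assumptions for illustration, not the paper's configuration:

    import torch.nn as nn
    from transformers import AutoModel


    class MultiTaskTagger(nn.Module):
        def __init__(self, n_td_labels: int, n_oie_labels: int):
            super().__init__()
            self.encoder = AutoModel.from_pretrained("roberta-base")  # assumed backbone
            hidden = self.encoder.config.hidden_size
            self.td_head = nn.Linear(hidden, n_td_labels)    # trigger BIO tags
            self.oie_head = nn.Linear(hidden, n_oie_labels)  # OIE subject-object tags
            self.loss_fn = nn.CrossEntropyLoss()             # label -100 ignored by default

        def forward(self, input_ids, attention_mask, td_labels, oie_labels):
            states = self.encoder(
                input_ids, attention_mask=attention_mask
            ).last_hidden_state                              # (batch, tokens, hidden)
            td_loss = self.loss_fn(self.td_head(states).transpose(1, 2), td_labels)
            oie_loss = self.loss_fn(self.oie_head(states).transpose(1, 2), oie_labels)
            return td_loss + oie_loss                        # equal weighting: an assumption

Because both heads share the encoder, the OIE relation signal shapes the representations the trigger head sees, which is the sense in which the relations act as cross-domain mediators.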
