TakeLab

Jan Šnajder

Explicabo voluptatem mollitia et repellat qui dolorum quasi

Write your content here...

Publications

Context Parametrization with Compositional Adapters

Josip Jukić, Martin Tutek, Jan Šnajder

arXiv preprint

PDF Cite

Large language models (LLMs) often seamlessly adapt to new tasks through in-context learning (ICL) or supervised fine-tuning (SFT). However, both of these approaches face key limitations: ICL is inefficient when handling many demonstrations, and SFT incurs training overhead while sacrificing flexibility. Mapping instructions or demonstrations from context directly into adapter parameters offers an appealing alternative. While prior work explored generating adapters based on a single input context, it has overlooked the need to integrate multiple chunks of information. To address this gap, we introduce CompAS, a meta-learning framework that translates context into adapter parameters with a compositional structure. Adapters generated this way can be merged algebraically, enabling instructions, demonstrations, or retrieved passages to be seamlessly combined without reprocessing long prompts. Critically, this approach yields three benefits: lower inference cost, robustness to long-context instability, and establishes a principled solution when input exceeds the model’s context window. Furthermore, CompAS encodes information into adapter parameters in a reversible manner, enabling recovery of input context through a decoder, facilitating safety and security. Empirical results on diverse multiple-choice and extractive question answering tasks show that CompAS outperforms ICL and prior generator-based methods, especially when scaling to more inputs. Our work establishes composable adapter generation as a practical and efficient alternative for scaling LLM deployment.

Disentangling Latent Shifts of In-Context Learning with Weak Supervision

Josip Jukić, Jan Šnajder

NeurIPS 2025

PDF Cite

In-context learning (ICL) enables large language models to perform few-shot learning by conditioning on labeled examples in the prompt. Despite its flexibility, ICL suffers from instability -- especially as prompt length increases with more demonstrations. To address this, we treat ICL as a source of weak supervision and propose a parameter-efficient method that disentangles demonstration-induced latent shifts from those of the query. An ICL-based teacher generates pseudo-labels on unlabeled queries, while a student predicts them using only the query input, updating a lightweight adapter. This captures demonstration effects in a compact, reusable form, enabling efficient inference while remaining composable with new demonstrations. Although trained on noisy teacher outputs, the student often outperforms its teacher through pseudo-label correction and coverage expansion, consistent with the weak-to-strong generalization effect. Empirically, our method improves generalization, stability, and efficiency across both in-domain and out-of-domain tasks, surpassing standard ICL and prior disentanglement methods.

Supervised In-Context Fine-Tuning for Generative Sequence Labeling

David Dukić, Goran Glavaš, Jan Šnajder

arXiv preprint

PDF Cite

Sequence labeling (SL) tasks, where labels are assigned to tokens, are abundant in NLP (e.g., named entity recognition and aspect-based sentiment analysis). Owing to the intuition that they require bidirectional context, SL tasks are commonly tackled with encoder-only models. Recent work also shows that removing the causal mask in fine-tuning enables decoder-based LLMs to become effective token classifiers. Less work, however, focused on (supervised) generative SL, a more natural setting for causal LLMs. Due to their rapid scaling, causal LLMs applied to SL are expected to outperform encoders, whose own development has stagnated. In this work, we propose supervised in-context fine-tuning (SIFT) for generative SL. SIFT casts SL tasks as constrained response generation, natural to LLMs, combining in-context learning (ICL) from demonstrations with supervised fine-tuning. SIFT considerably outperforms both ICL and decoder-as-encoder fine-tuning baselines on a range of standard SL tasks. We further find that although long context hinders the performance of generative SL in both ICL and SIFT, this deficiency can be mitigated by removing the instruction, as instructions are shown to be largely unnecessary for achieving strong SL performance with SIFT. Our findings highlight strengths and limitations of SL with LLMs, underscoring the importance of a response-based generative task formulation for effective SL performance.

From Robustness to Improved Generalization and Calibration in Pre-trained Language Models

Josip Jukić, Jan Šnajder

Transactions of the Association for Computational Linguistics, 13:264–280.

PDF Cite

Enforcing representation smoothness in pre-trained language models (PLMs) through Jacobian and Hessian regularization provides an effective approach for enhancing both robustness and generalization. Although such regularization methods have proven effective in computer vision, their application in natural language processing, where PLM inputs are derived from a discrete domain, poses unique challenges. We introduce JacHess, a regularization approach for PLMs that minimizes the norms of the Jacobian and Hessian matrices in intermediate representations, using embeddings as substitutes for discrete token inputs. JacHess supports dual-mode regularization, alternating between fine-tuning with labeled data and regularization with unlabeled data. We evaluate JacHess on the GLUE benchmark and demonstrate that it consistently and significantly improves in-distribution generalization and enhances performance under domain shift. Across diverse PLMs, JacHess outperforms comparable representation-based regularization methods and unregularized fine-tuning, while also improving model calibration. Our findings, coupled with a computationally efficient estimator for the Jacobian and Hessian norms, position JacHess as a robust and widely applicable solution for enhancing PLM performance.

TakeLab Retriever: AI-Driven Search Engine for Articles from Croatian News Outlets

David Dukić, Marin Petričević, Sven Ćurković, Jan Šnajder

arXiv preprint

PDF Cite

TakeLab Retriever is an AI-driven search engine designed to discover, collect, and semantically analyze news articles from Croatian news outlets. It offers a unique perspective on the history and current landscape of Croatian online news media, making it an essential tool for researchers seeking to uncover trends, patterns, and correlations that general-purpose search engines cannot provide. TakeLab retriever utilizes cutting-edge natural language processing (NLP) methods, enabling users to sift through articles using named entities, phrases, and topics through the web application. This technical report is divided into two parts: the first explains how TakeLab Retriever is utilized, while the second provides a detailed account of its design. In the second part, we also address the software engineering challenges involved and propose solutions for developing a microservice-based semantic search engine capable of handling over ten million news articles published over the past two decades.

Are ELECTRA’s Sentence Embeddings Beyond Repair? The Case of Semantic Textual Similarity

Ivan Rep, David Dukić, Jan Šnajder

In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9159–9169, Miami, Florida, USA. Association for Computational Linguistics.

PDF Cite

While BERT produces high-quality sentence embeddings, its pre-training computational cost is a significant drawback. In contrast, ELECTRA provides a cost-effective pre-training objective and downstream task performance improvements, but worse sentence embeddings. The community tacitly stopped utilizing ELECTRA’s sentence embeddings for semantic textual similarity (STS). We notice a significant drop in performance for the ELECTRA discriminator’s last layer in comparison to prior layers. We explore this drop and propose a way to repair the embeddings using a novel truncated model fine-tuning (TMFT) method. TMFT improves the Spearman correlation coefficient by over 8 points while increasing parameter efficiency on the STS Benchmark. We extend our analysis to various model sizes, languages, and two other tasks. Further, we discover the surprising efficacy of ELECTRA’s generator model, which performs on par with BERT, using significantly fewer parameters and a substantially smaller embedding size. Finally, we observe boosts by combining TMFT with word similarity or domain adaptive pre-training.

Claim Check-Worthiness Detection: How Well do LLMs Grasp Annotation Guidelines?

Laura Majer and Jan Šnajder.

In Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER), pages 245–263, Miami, Florida, USA. Association for Computational Linguistics.

PDF Cite

The rising threat of disinformation underscores the need to fully or partially automate the fact-checking process. Identifying text segments requiring fact-checking is known as claim detection (CD) and claim check-worthiness detection (CW), the latter incorporating complex domain-specific criteria of worthiness and often framed as a ranking task. Zero- and few-shot LLM prompting is an attractive option for both tasks, as it bypasses the need for labeled datasets and allows verbalized claim and worthiness criteria to be directly used for prompting. We evaluate the LLMs’ predictive accuracy on five CD/CW datasets from diverse domains, using corresponding annotation guidelines in prompts. We examine two key aspects: (1) how to best distill factuality and worthiness criteria into a prompt, and (2) how much context to provide for each claim. To this end, we experiment with different levels of prompt verbosity and varying amounts of contextual information given to the model. We additionally evaluate the top-performing models with ranking metrics, resembling prioritization done by fact-checkers. Our results show that optimal prompt verbosity varies, meta-data alone adds more performance boost than co-text, and confidence scores can be directly used to produce reliable check-worthiness rankings.

Closed-domain event extraction for hard news event monitoring: a systematic study

David Dukić, Filip Karlo Došilović, Domagoj Pluščec, Jan Šnajder

PeerJ Computer Science

PDF Cite

News event monitoring systems allow real-time monitoring of a large number of events reported in the news, including the urgent and critical events comprising the so-called hard news. These systems heavily rely on natural language processing (NLP) to perform automatic event extraction at scale. While state-of-the-art event extraction models are readily available, integrating them into a news event monitoring system is not as straightforward as it seems due to practical issues related to model selection, robustness, and scale. To address this gap, we present a study on the practical use of event extraction models for news event monitoring. Our study focuses on the key task of closed-domain main event extraction (CDMEE), which aims to determine the type of the story’s main event and extract its arguments from the text. We evaluate a range of state-of-the-art NLP models for this task, including those based on pre-trained language models. Aiming at a more realistic evaluation than done in the literature, we introduce a new dataset manually labeled with event types and their arguments. Additionally, we assess the scalability of CDMEE models and analyze the trade-off between accuracy and inference speed. Our results give insights into the performance of state-of-the-art NLP models on the CDMEE task and provide recommendations for developing effective, robust, and scalable news event monitoring systems.

Looking Right is Sometimes Right: Investigating the Capabilities of Decoder-only LLMs for Sequence Labeling

David Dukić, Jan Šnajder

In Findings of the Association for Computational Linguistics: ACL 2024, pages 14168–14181, Bangkok, Thailand. Association for Computational Linguistics.

PDF Cite

Pre-trained language models based on masked language modeling (MLM) excel in natural language understanding (NLU) tasks. While fine-tuned MLM-based encoders consistently outperform causal language modeling decoders of comparable size, recent decoder-only large language models (LLMs) perform on par with smaller MLM-based encoders. Although their performance improves with scale, LLMs fall short of achieving state-of-the-art results in information extraction (IE) tasks, many of which are formulated as sequence labeling (SL). We hypothesize that LLMs’ poor SL performance stems from causal masking, which prevents the model from attending to tokens on the right of the current token. Yet, how exactly and to what extent LLMs’ performance on SL can be improved remains unclear. We explore techniques for improving the SL performance of open LLMs on IE tasks by applying layer-wise removal of the causal mask (CM) during LLM fine-tuning. This approach yields performance gains competitive with state-of-the-art SL models, matching or outperforming the results of CM removal from all blocks. Our findings hold for diverse SL tasks, demonstrating that open LLMs with layer-dependent CM removal outperform strong MLM-based encoders and even instruction-tuned LLMs.

LLMs for Targeted Sentiment in News Headlines: Exploring the Descriptive-Prescriptive Dilemma

Jana Juroš, Laura Majer and Jan Šnajder.

In Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, pages 329–343, Bangkok, Thailand. Association for Computational Linguistics.

PDF Cite

News headlines often evoke sentiment by intentionally portraying entities in particular ways, making targeted sentiment analysis (TSA) of headlines a worthwhile but difficult task. Due to its subjectivity, creating TSA datasets can involve various annotation paradigms, from descriptive to prescriptive, either encouraging or limiting subjectivity. LLMs are a good fit for TSA due to their broad linguistic and world knowledge and in-context learning abilities, yet their performance depends on prompt design. In this paper, we compare the accuracy of state-of-the-art LLMs and fine-tuned encoder models for TSA of news headlines using descriptive and prescriptive datasets across several languages. Exploring the descriptive–prescriptive continuum, we analyze how performance is affected by prompt prescriptiveness, ranging from plain zero-shot to elaborate few-shot prompts. Finally, we evaluate the ability of LLMs to quantify uncertainty via calibration error and comparison to human label variation. We find that LLMs outperform fine-tuned encoders on descriptive datasets, while calibration and F1-score generally improve with increased prescriptiveness, yet the optimal level varies.

Out-of-Distribution Detection by Leveraging Between-Layer Transformation Smoothness

Fran Jelenić, Josip Jukić, Martin Tutek, Mate Puljiz, Jan Šnajder

The Twelfth International Conference on Learning Representations (ICLR 2024)

PDF Cite

Effective out-of-distribution (OOD) detection is crucial for reliable machine learning models, yet most current methods are limited in practical use due to requirements like access to training data or intervention in training. We present a novel method for detecting OOD data in Transformers based on transformation smoothness between intermediate layers of a network (BLOOD), which is applicable to pre-trained models without access to training data. BLOOD utilizes the tendency of between-layer representation transformations of in-distribution (ID) data to be smoother than the corresponding transformations of OOD data, a property that we also demonstrate empirically. We evaluate BLOOD on several text classification tasks with Transformer networks and demonstrate that it outperforms methods with comparable resource requirements. Our analysis also suggests that when learning simpler tasks, OOD data transformations maintain their original sharpness, whereas sharpness increases with more complex tasks.

Leveraging Open Information Extraction for More Robust Domain Transfer of Event Trigger Detection

David Dukić, Kiril Gashteovski, Goran Glavaš, Jan Šnajder

In Findings of the Association for Computational Linguistics: EACL 2024, pages 1197–1213, St. Julian’s, Malta. Association for Computational Linguistics.

PDF Cite

Event detection is a crucial information extraction task in many domains, such as Wikipedia or news. The task typically relies on trigger detection (TD) – identifying token spans in the text that evoke specific events. While the notion of triggers should ideally be universal across domains, domain transfer for TD from high- to low-resource domains results in significant performance drops. We address the problem of negative transfer in TD by coupling triggers between domains using subject-object relations obtained from a rule-based open information extraction (OIE) system. We demonstrate that OIE relations injected through multi-task training can act as mediators between triggers in different domains, enhancing zero- and few-shot TD domain transfer and reducing performance drops, in particular when transferring from a high-resource source domain (Wikipedia) to a low(er)-resource target domain (news). Additionally, we combine this improved transfer with masked language modeling on the target domain, observing further TD transfer gains. Finally, we demonstrate that the gains are robust to the choice of the OIE system.

Parameter-Efficient Language Model Tuning with Active Learning in Low-Resource Settings

Josip Jukić, Jan Šnajder

In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5061–5074, Singapore. Association for Computational Linguistics.

PDF Cite

Pre-trained language models (PLMs) have ignited a surge in demand for effective fine-tuning techniques, particularly in low-resource domains and languages. Active learning (AL), a set of algorithms designed to decrease labeling costs by minimizing label complexity, has shown promise in confronting the labeling bottleneck. In parallel, adapter modules designed for parameter-efficient fine-tuning (PEFT) have demonstrated notable potential in low-resource settings. However, the interplay between AL and adapter-based PEFT remains unexplored. We present an empirical study of PEFT behavior with AL in low-resource settings for text classification tasks. Our findings affirm the superiority of PEFT over full-fine tuning (FFT) in low-resource settings and demonstrate that this advantage persists in AL setups. We further examine the properties of PEFT and FFT through the lens of forgetting dynamics and instance-level representations, where we find that PEFT yields more stable representations of early and middle layers compared to FFT. Our research underscores the synergistic potential of AL and PEFT in low-resource settings, paving the way for advancements in efficient and effective fine-tuning.

Smooth Sailing: Improving Active Learning for Pre-trained Language Models with Representation Smoothness Analysis

Josip Jukić, Jan Šnajder

In Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD), pages 11–24, Gothenburg, Sweden. Association for Computational Linguistics.

PDF Cite

Developed to alleviate prohibitive labeling costs, active learning (AL) methods aim to reduce label complexity in supervised learning. While recent work has demonstrated the benefit of using AL in combination with large pre-trained language models (PLMs), it has often overlooked the practical challenges that hinder the effectiveness of AL. We address these challenges by leveraging representation smoothness analysis to ensure AL is feasible, that is, both effective and practicable. Firstly, we propose an early stopping technique that does not require a validation set – often unavailable in realistic AL conditions – and observe significant improvements over random sampling across multiple datasets and AL methods. Further, we find that task adaptation improves AL, whereas standard short fine-tuning in AL does not provide improvements over random sampling. Our work demonstrates the usefulness of representation smoothness analysis for AL and introduces an AL stopping criterion that reduces label complexity.

Easy to Decide, Hard to Agree: Reducing Disagreements Between Saliency Methods

Josip Jukić, Martin Tutek, Jan Šnajder

In Findings of the Association for Computational Linguistics: ACL 2023, pages 9147–9162, Toronto, Canada. Association for Computational Linguistics.

PDF Cite

A popular approach to unveiling the black box of neural NLP models is to leverage saliency methods, which assign scalar importance scores to each input component. A common practice for evaluating whether an interpretability method is faithful has been to use evaluation-by-agreement – if multiple methods agree on an explanation, its credibility increases. However, recent work has found that saliency methods exhibit weak rank correlations even when applied to the same model instance and advocated for alternative diagnostic methods. In our work, we demonstrate that rank correlation is not a good fit for evaluating agreement and argue that Pearson-r is a better-suited alternative. We further show that regularization techniques that increase faithfulness of attention explanations also increase agreement between saliency methods. By connecting our findings to instance categories based on training dynamics, we show that the agreement of saliency method explanations is very low for easy-to-learn instances. Finally, we connect the improvement in agreement across instance categories to local representation space statistics of instances, paving the way for work on analyzing which intrinsic model properties improve their predisposition to interpretability methods.

On Dataset Transferability in Active Learning for Transformers

Fran Jelenić, Josip Jukić, Nina Drobac, Jan Šnajder

In Findings of the Association for Computational Linguistics: ACL 2023, pages 2282–2295, Toronto, Canada. Association for Computational Linguistics.

PDF Cite

Active learning (AL) aims to reduce labeling costs by querying the examples most beneficial for model learning. While the effectiveness of AL for fine-tuning transformer-based pre-trained language models (PLMs) has been demonstrated, it is less clear to what extent the AL gains obtained with one model transfer to others. We consider the problem of transferability of actively acquired datasets in text classification and investigate whether AL gains persist when a dataset built using AL coupled with a specific PLM is used to train a different PLM. We link the AL dataset transferability to the similarity of instances queried by the different PLMs and show that AL methods with similar acquisition sequences produce highly transferable datasets regardless of the models used. Additionally, we show that the similarity of acquisition sequences is influenced more by the choice of the AL method than the choice of the model.

ALANNO: An Active Learning Annotation System for Mortals

Josip Jukić, Fran Jelenić, Miroslav Bićanić, Jan Šnajder

In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 228–235, Dubrovnik, Croatia. Association for Computational Linguistics.

PDF Cite

Supervised machine learning has become the cornerstone of today’s data-driven society, increasing the need for labeled data. However, the process of acquiring labels is often expensive and tedious. One possible remedy is to use active learning (AL) – a special family of machine learning algorithms designed to reduce labeling costs. Although AL has been successful in practice, a number of practical challenges hinder its effectiveness and are often overlooked in existing AL annotation tools. To address these challenges, we developed ALANNO, an open-source annotation system for NLP tasks equipped with features to make AL effective in real-world annotation projects. ALANNO facilitates annotation management in a multi-annotator setup and supports a variety of AL methods and underlying models, which are easily configurable and extensible.

Target Two Birds With One SToNe: Entity-Level Sentiment and Tone Analysis in Croatian News Headlines

Ana Barić, Laura Majer, David Dukić, Marijana Grbeša-Zenzerović, and Jan Šnajder

In Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023), pages 78–85, Dubrovnik, Croatia. Association for Computational Linguistics.

PDF Cite

Sentiment analysis is often used to examine how different actors are portrayed in the media, and analysis of news headlines is of particular interest due to their attention-grabbing role. We address the task of entity-level sentiment analysis from Croatian news headlines. We frame the task as targeted sentiment analysis (TSA), explicitly differentiating between sentiment toward a named entity and the overall tone of the headline. We describe SToNe, a new dataset for this task with sentiment and tone labels. We implement several neural benchmark models, utilizing single- and multi-task training, and show that TSA can benefit from tone information. Finally, we gauge the difficulty of this task by leveraging dataset cartography.

You Are What You Talk About: Inducing Evaluative Topics for Personality Analysis

Josip Jukić, Iva Vukojević, Jan Šnajder

Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3986–3999, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

PDF Cite

Expressing attitude or stance toward entities and concepts is an integral part of human behavior and personality. Recently, evaluative language data has become more accessible with social media’s rapid growth, enabling large-scale opinion analysis. However, surprisingly little research examines the relationship between personality and evaluative language. To bridge this gap, we introduce the notion of evaluative topics, obtained by applying topic models to pre-filtered evaluative text from social media. We then link evaluative topics to individual text authors to build their evaluative profiles. We apply evaluative profiling to Reddit comments labeled with personality scores and conduct an exploratory study on the relationship between evaluative topics and Big Five personality facets, aiming for a more interpretable, facet-level analysis. Finally, we validate our approach by observing correlations consistent with prior research in personality psychology.

Personality adjectives in the digital world: A natural language processing study of big five adjectives and their usage on reddit

Iva Vukojević, Irina Masnikosa, Matej Gjurković, Nina Drobac, Ana Butković, Martina Lozić, Denis Bratko, Jan Šnajder

Journal of Research in Personality

PDF Cite

Psycholexical studies explore the intricate interplay between language and personality traits, focusing on trait representation in language. One aspect of such representation is the frequency of personality adjective usage. This study examines how linguistic and trait-label properties of personality adjectives relate to their usage frequency. Utilizing a corpus from the social media platform Reddit, we employ natural language processing to analyze Big Five adjectives in person-descriptions. Our results show that trait-label properties exhibit different patterns when considered together rather than separately from linguistic properties—for instance, prefixal composition nullifies the expected effect of polarity on frequency. These findings highlight the importance of considering both linguistic and trait-label properties when assessing the usage of personality adjectives.

SIMPA: statement-to-item matching personality assessment from text

Matej Gjurković, Iva Vukojević, Jan Šnajder

Future generation computer systems

PDF Cite

Automated text-based personality assessment (ATBPA) methods can analyze large amounts of text data and identify nuanced linguistic personality cues. However, current approaches lack the interpretability, explainability, and validity offered by standard questionnaire instruments. To address these weaknesses, we propose an approach that combines questionnaire-based and text-based approaches to personality assessment. Our Statement-to-Item Matching Personality Assessment (SIMPA) framework uses natural language processing methods to detect self-referencing descriptions of personality in a target’s text and utilizes these descriptions for personality assessment. The core of the framework is the notion of a trait-constrained semantic similarity between the target’s freely expressed statements and questionnaire items. The conceptual basis is provided by the realistic accuracy model (RAM), which describes the process of accurate personality judgments and which we extend with a feedback loop mechanism to improve the accuracy of judgments. We present a simple proof-of-concept implementation of SIMPA for ATBPA on the social media site Reddit. We show how the framework can be used directly for unsupervised estimation of a target’s Big 5 scores and indirectly to produce features for a supervised ATBPA model, demonstrating state-of-the-art results for the personality prediction task on Reddit.

Pandora talks: Personality and demographics on reddit

Matej Gjurković, Mladen Karan, Iva Vukojevic, Mihaela Bošnjak, Jan Šnajder

In Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media, Association for Computational Linguistics

PDF Cite

Personality and demographics are important variables in social sciences and computational sociolinguistics. However, datasets with both personality and demographic labels are scarce. To address this, we present PANDORA, the first dataset of Reddit comments of 10k users partially labeled with three personality models and demographics (age, gender, and location), including 1.6k users labeled with the well-established Big 5 personality model. We showcase the usefulness of this dataset on three experiments, where we leverage the more readily available data from other personality models to predict the Big 5 traits, analyze gender classification biases arising from psycho-demographic variables, and carry out a confirmatory and exploratory analysis based on psychological theories. Finally, we present benchmark prediction models for all personality and demographic variables.