TakeLab

We are a creative research group specializing in natural language processing.


About Us

TakeLab is an academic research group at the Faculty of Electrical Engineering and Computing in Zagreb, Croatia, focused on advancing artificial intelligence, machine learning, and natural language processing (NLP). Our work centers on large language models (LLMs), with a commitment to refining methods for language comprehension and analyzing complex, unstructured data.

  • Advancing LLM research, with a focus on enhancing their generalization, robustness, and interpretability.
  • Creating representation learning techniques to improve semantic and contextual understanding in computational systems.
  • Exploring computational social science, using data-driven methods to study social interactions and societal trends.

Our research focuses on multiple aspects of representation learning, seeking a deeper understanding of the internal workings of LLMs. We also engage in interdisciplinary work within computational social science, utilizing NLP tools to analyze large datasets that reveal insights into human behavior, communication patterns, and evolving societal trends.


Latest Research

Explore our recent research studies. Select a publication to read more about it.

TakeLab Retriever: AI-Driven Search Engine for Articles from Croatian News Outlets

David Dukić, Marin Petričević, Sven Ćurković, Jan Šnajder

arXiv preprint

TakeLab Retriever is an AI-driven search engine designed to discover, collect, and semantically analyze news articles from Croatian news outlets. It offers a unique perspective on the history and current landscape of Croatian online news media, making it an essential tool for researchers seeking to uncover trends, patterns, and correlations that general-purpose search engines cannot provide. TakeLab Retriever utilizes cutting-edge natural language processing (NLP) methods, enabling users to sift through articles using named entities, phrases, and topics via the web application. This technical report is divided into two parts: the first explains how TakeLab Retriever is utilized, while the second provides a detailed account of its design. In the second part, we also address the software engineering challenges involved and propose solutions for developing a microservice-based semantic search engine capable of handling over ten million news articles published over the past two decades.


Are ELECTRA’s Sentence Embeddings Beyond Repair? The Case of Semantic Textual Similarity

Ivan Rep, David Dukić, Jan Šnajder

In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9159–9169, Miami, Florida, USA. Association for Computational Linguistics.

While BERT produces high-quality sentence embeddings, its pre-training computational cost is a significant drawback. In contrast, ELECTRA provides a cost-effective pre-training objective and downstream task performance improvements, but worse sentence embeddings. The community tacitly stopped utilizing ELECTRA’s sentence embeddings for semantic textual similarity (STS). We notice a significant drop in performance for the ELECTRA discriminator’s last layer in comparison to prior layers. We explore this drop and propose a way to repair the embeddings using a novel truncated model fine-tuning (TMFT) method. TMFT improves the Spearman correlation coefficient by over 8 points while increasing parameter efficiency on the STS Benchmark. We extend our analysis to various model sizes, languages, and two other tasks. Further, we discover the surprising efficacy of ELECTRA’s generator model, which performs on par with BERT, using significantly fewer parameters and a substantially smaller embedding size. Finally, we observe boosts by combining TMFT with word similarity or domain adaptive pre-training.
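For readers curious how the TMFT idea might look in practice, the sketch below pools sentence embeddings from an intermediate ELECTRA layer and regresses their cosine similarity onto gold STS scores. The checkpoint name, the pooled layer index, and the mean-pooling regression setup are illustrative assumptions, not the exact configuration from the paper.

    # Minimal TMFT-style sketch: pool an intermediate ELECTRA layer instead of the
    # last one and fine-tune it for STS. Checkpoint and layer index are assumptions.
    import torch
    from transformers import AutoModel, AutoTokenizer

    MODEL_NAME = "google/electra-base-discriminator"  # assumed checkpoint
    POOL_LAYER = 8                                    # hypothetical truncation layer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    encoder = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)

    def embed(sentences):
        """Mean-pool token states from an intermediate discriminator layer."""
        batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        hidden = encoder(**batch).hidden_states[POOL_LAYER]   # (batch, seq, dim)
        mask = batch["attention_mask"].unsqueeze(-1).float()
        return (hidden * mask).sum(1) / mask.sum(1)

    def sts_loss(sentences_a, sentences_b, gold_scores):
        """Regress cosine similarity of the pooled embeddings onto gold STS scores."""
        sim = torch.cosine_similarity(embed(sentences_a), embed(sentences_b))
        return torch.nn.functional.mse_loss(sim, gold_scores)

In an actual truncated fine-tuning run, the transformer blocks above the pooled layer would be dropped, which is where the parameter-efficiency gain comes from.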


Closed-domain event extraction for hard news event monitoring: a systematic study

David Dukić​​, Filip Karlo Došilović​, Domagoj Pluščec, Jan Šnajder

PeerJ Computer Science

News event monitoring systems allow real-time monitoring of a large number of events reported in the news, including the urgent and critical events comprising the so-called hard news. These systems heavily rely on natural language processing (NLP) to perform automatic event extraction at scale. While state-of-the-art event extraction models are readily available, integrating them into a news event monitoring system is not as straightforward as it seems due to practical issues related to model selection, robustness, and scale. To address this gap, we present a study on the practical use of event extraction models for news event monitoring. Our study focuses on the key task of closed-domain main event extraction (CDMEE), which aims to determine the type of the story’s main event and extract its arguments from the text. We evaluate a range of state-of-the-art NLP models for this task, including those based on pre-trained language models. Aiming at a more realistic evaluation than done in the literature, we introduce a new dataset manually labeled with event types and their arguments. Additionally, we assess the scalability of CDMEE models and analyze the trade-off between accuracy and inference speed. Our results give insights into the performance of state-of-the-art NLP models on the CDMEE task and provide recommendations for developing effective, robust, and scalable news event monitoring systems.


Looking Right is Sometimes Right: Investigating the Capabilities of Decoder-only LLMs for Sequence Labeling

David Dukić, Jan Šnajder

In Findings of the Association for Computational Linguistics: ACL 2024, pages 14168–14181, Bangkok, Thailand. Association for Computational Linguistics.

Pre-trained language models based on masked language modeling (MLM) excel in natural language understanding (NLU) tasks. While fine-tuned MLM-based encoders consistently outperform causal language modeling decoders of comparable size, recent decoder-only large language models (LLMs) perform on par with smaller MLM-based encoders. Although their performance improves with scale, LLMs fall short of achieving state-of-the-art results in information extraction (IE) tasks, many of which are formulated as sequence labeling (SL). We hypothesize that LLMs’ poor SL performance stems from causal masking, which prevents the model from attending to tokens on the right of the current token. Yet, how exactly and to what extent LLMs’ performance on SL can be improved remains unclear. We explore techniques for improving the SL performance of open LLMs on IE tasks by applying layer-wise removal of the causal mask (CM) during LLM fine-tuning. This approach yields performance gains competitive with state-of-the-art SL models, matching or outperforming the results of CM removal from all blocks. Our findings hold for diverse SL tasks, demonstrating that open LLMs with layer-dependent CM removal outperform strong MLM-based encoders and even instruction-tuned LLMs.
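As a rough illustration of layer-dependent causal-mask removal, the toy PyTorch stack below keeps the causal mask only in its lower blocks and lets the upper blocks attend bidirectionally before a per-token labeling head. The layer sizes and the split between causal and bidirectional blocks are illustrative assumptions, not the configuration studied in the paper.

    # Toy decoder-style stack for sequence labeling in which only the first
    # `causal_blocks` layers keep the causal mask; the rest attend bidirectionally.
    import torch
    import torch.nn as nn

    class Block(nn.Module):
        def __init__(self, dim, heads, causal):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            self.causal = causal

        def forward(self, x):
            mask = None
            if self.causal:
                seq_len = x.size(1)
                # True entries block attention to tokens right of the current one.
                mask = torch.triu(
                    torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
                )
            attn_out, _ = self.attn(x, x, x, attn_mask=mask)
            x = x + attn_out
            return x + self.ff(x)

    class SequenceLabeler(nn.Module):
        def __init__(self, vocab_size, num_labels, dim=256, heads=4, depth=6, causal_blocks=3):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.blocks = nn.ModuleList(
                Block(dim, heads, causal=(i < causal_blocks)) for i in range(depth)
            )
            self.head = nn.Linear(dim, num_labels)  # per-token label logits

        def forward(self, token_ids):
            x = self.embed(token_ids)
            for block in self.blocks:
                x = block(x)
            return self.head(x)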


Disentangling Latent Shifts of In-Context Learning Through Self-Training

Josip Jukić, Jan Šnajder

arXiv preprint

In-context learning (ICL) has become essential in natural language processing, particularly with autoregressive large language models capable of learning from demonstrations provided within the prompt. However, ICL faces challenges with stability and long contexts, especially as the number of demonstrations grows, leading to poor generalization and inefficient inference. To address these issues, we introduce STICL (Self-Training ICL), an approach that disentangles the latent shifts of demonstrations from the latent shift of the query through self-training. STICL employs a teacher model to generate pseudo-labels and trains a student model using these labels, encoded in an adapter module. The student model exhibits weak-to-strong generalization, progressively refining its predictions over time. Our empirical results show that STICL improves generalization and stability, consistently outperforming traditional ICL methods and other disentangling strategies across both in-domain and out-of-domain data.
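The self-training loop behind this idea can be sketched in a few lines: a teacher, prompted with demonstrations, pseudo-labels unlabeled queries, and a student that sees only the query is updated through a small adapter. The toy modules below are illustrative assumptions (the actual setup builds on a frozen LLM backbone), not the paper's implementation.

    # Sketch of a self-training step: the teacher pseudo-labels queries (using ICL
    # demonstrations internally); the student, which never sees the demonstrations,
    # is trained on those labels through a trainable adapter.
    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """Bottleneck adapter; only these weights (and the head) are updated."""
        def __init__(self, dim, bottleneck=32):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.up = nn.Linear(bottleneck, dim)

        def forward(self, h):
            return h + self.up(torch.relu(self.down(h)))

    class ToyStudent(nn.Module):
        """Stand-in for a frozen backbone plus adapter and classification head."""
        def __init__(self, dim, num_labels):
            super().__init__()
            self.backbone = nn.Linear(dim, dim)   # placeholder for the frozen LLM
            self.adapter = Adapter(dim)
            self.head = nn.Linear(dim, num_labels)
            for p in self.backbone.parameters():
                p.requires_grad = False

        def forward(self, x):
            return self.head(self.adapter(self.backbone(x)))

    def self_train_step(teacher_logits_fn, student, queries, optimizer):
        """teacher_logits_fn stands in for the teacher LLM prompted with demonstrations."""
        with torch.no_grad():
            pseudo_labels = teacher_logits_fn(queries).argmax(dim=-1)
        loss = nn.functional.cross_entropy(student(queries), pseudo_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()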


Projects

Explore our projects.

Retriever

TakeLab Retriever is a platform that collects articles and their metadata from Croatian news outlets and performs real-time text mining on them.

Alanno

We created a powerful annotation platform powered by active learning and designed to support a wide range of machine learning and deep learning models.

PsyTxt

With this project, we aim to set the ground for a truly interdisciplinary perspective on computational personality research by developing datasets and models for personality prediction and analysis based on online textual interactions.


Teaching

We take great pride and care in teaching the things we're good at and that inspire us. We design our courses around key topics in artificial intelligence, machine learning, NLP, and IR that we deem relevant for our students' career success and professional development. Here's a list of the courses we currently offer at the Faculty of Electrical Engineering and Computing, University of Zagreb.

Intro to AI

An introductory course covering fundamental concepts and techniques in artificial intelligence.

Machine Learning 1

A foundational course in machine learning, covering key algorithms and their underlying mechanisms.

Text Analysis and Retrieval

Examines modern approaches to text analysis and retrieval, grounded in fundamental principles.

Selected Topics in Natural Language Processing

Advanced topics in natural language processing, covering current research and applications.


News

Stay up to date with the latest news and updates from TakeLab.


Team

Get to know the people behind the work.

Ana Barić

Caporegime


David Dukić

Caporegime

Diamonds are made under pressure 💎

Iva Vukojević

Caporegime


Jan Šnajder

Don


Josip Jukić

Caporegime

It’s no coincidence a 90° angle is called the right one.

Laura Majer

Caporegime


Martin Tutek

Sottocapo


Matej Gjurković

Consigliere
