TakeLab

We are a creative research group specializing in natural language processing.


About Us

TakeLab is an academic research group at the Faculty of Electrical Engineering and Computing in Zagreb, Croatia, focused on advancing artificial intelligence, machine learning, and natural language processing (NLP). Our work centers on large language models (LLMs), with a commitment to refining methods for language comprehension and analyzing complex, unstructured data.

  • Advancing LLM research, with a focus on enhancing their generalization, robustness, and interpretability.
  • Creating representation learning techniques to improve semantic and contextual understanding in computational systems.
  • Exploring computational social science, using data-driven methods to study social interactions and societal trends.

Our research focuses on multiple aspects of representation learning, seeking a deeper understanding of the internal workings of LLMs. We also engage in interdisciplinary work within computational social science, utilizing NLP tools to analyze large datasets that reveal insights into human behavior, communication patterns, and evolving societal trends.


Latest Research

Explore our recent research studies. Select a publication to read more about it.

TakeLab Retriever: AI-Driven Search Engine for Articles from Croatian News Outlets

David Dukić, Marin Petričević, Sven Ćurković, Jan Šnajder

arXiv preprint

TakeLab Retriever is an AI-driven search engine designed to discover, collect, and semantically analyze news articles from Croatian news outlets. It offers a unique perspective on the history and current landscape of Croatian online news media, making it an essential tool for researchers seeking to uncover trends, patterns, and correlations that general-purpose search engines cannot provide. TakeLab Retriever utilizes cutting-edge natural language processing (NLP) methods, enabling users to sift through articles using named entities, phrases, and topics via the web application. This technical report is divided into two parts: the first explains how TakeLab Retriever is utilized, while the second provides a detailed account of its design. In the second part, we also address the software engineering challenges involved and propose solutions for developing a microservice-based semantic search engine capable of handling over ten million news articles published over the past two decades.


Are ELECTRA’s Sentence Embeddings Beyond Repair? The Case of Semantic Textual Similarity

Ivan Rep, David Dukić, Jan Šnajder

In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9159–9169, Miami, Florida, USA. Association for Computational Linguistics.

While BERT produces high-quality sentence embeddings, its pre-training computational cost is a significant drawback. In contrast, ELECTRA provides a cost-effective pre-training objective and downstream task performance improvements, but worse sentence embeddings. The community tacitly stopped utilizing ELECTRA’s sentence embeddings for semantic textual similarity (STS). We notice a significant drop in performance for the ELECTRA discriminator’s last layer in comparison to prior layers. We explore this drop and propose a way to repair the embeddings using a novel truncated model fine-tuning (TMFT) method. TMFT improves the Spearman correlation coefficient by over 8 points while increasing parameter efficiency on the STS Benchmark. We extend our analysis to various model sizes, languages, and two other tasks. Further, we discover the surprising efficacy of ELECTRA’s generator model, which performs on par with BERT, using significantly fewer parameters and a substantially smaller embedding size. Finally, we observe boosts by combining TMFT with word similarity or domain adaptive pre-training.
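For readers curious how the TMFT idea might look in practice, the sketch below pools sentence embeddings from an intermediate ELECTRA layer and regresses their cosine similarity onto gold STS scores. The checkpoint name, the pooled layer index, and the mean-pooling regression setup are illustrative assumptions, not the exact configuration from the paper.

    # Minimal TMFT-style sketch: pool an intermediate ELECTRA layer instead of the
    # last one and fine-tune it for STS. Checkpoint and layer index are assumptions.
    import torch
    from transformers import AutoModel, AutoTokenizer

    MODEL_NAME = "google/electra-base-discriminator"  # assumed checkpoint
    POOL_LAYER = 8                                    # hypothetical truncation layer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    encoder = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)

    def embed(sentences):
        """Mean-pool token states from an intermediate discriminator layer."""
        batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        hidden = encoder(**batch).hidden_states[POOL_LAYER]   # (batch, seq, dim)
        mask = batch["attention_mask"].unsqueeze(-1).float()
        return (hidden * mask).sum(1) / mask.sum(1)

    def sts_loss(sentences_a, sentences_b, gold_scores):
        """Regress cosine similarity of the pooled embeddings onto gold STS scores."""
        sim = torch.cosine_similarity(embed(sentences_a), embed(sentences_b))
        return torch.nn.functional.mse_loss(sim, gold_scores)

In an actual truncated fine-tuning run, the transformer blocks above the pooled layer would be dropped, which is where the parameter-efficiency gain comes from.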


Closed-domain event extraction for hard news event monitoring: a systematic study

David Dukić​​, Filip Karlo Došilović​, Domagoj Pluščec, Jan Šnajder

PeerJ Computer Science

News event monitoring systems allow real-time monitoring of a large number of events reported in the news, including the urgent and critical events comprising the so-called hard news. These systems heavily rely on natural language processing (NLP) to perform automatic event extraction at scale. While state-of-the-art event extraction models are readily available, integrating them into a news event monitoring system is not as straightforward as it seems due to practical issues related to model selection, robustness, and scale. To address this gap, we present a study on the practical use of event extraction models for news event monitoring. Our study focuses on the key task of closed-domain main event extraction (CDMEE), which aims to determine the type of the story’s main event and extract its arguments from the text. We evaluate a range of state-of-the-art NLP models for this task, including those based on pre-trained language models. Aiming at a more realistic evaluation than done in the literature, we introduce a new dataset manually labeled with event types and their arguments. Additionally, we assess the scalability of CDMEE models and analyze the trade-off between accuracy and inference speed. Our results give insights into the performance of state-of-the-art NLP models on the CDMEE task and provide recommendations for developing effective, robust, and scalable news event monitoring systems.


Looking Right is Sometimes Right: Investigating the Capabilities of Decoder-only LLMs for Sequence Labeling

David Dukić, Jan Šnajder

In Findings of the Association for Computational Linguistics: ACL 2024, pages 14168–14181, Bangkok, Thailand. Association for Computational Linguistics.

Pre-trained language models based on masked language modeling (MLM) excel in natural language understanding (NLU) tasks. While fine-tuned MLM-based encoders consistently outperform causal language modeling decoders of comparable size, recent decoder-only large language models (LLMs) perform on par with smaller MLM-based encoders. Although their performance improves with scale, LLMs fall short of achieving state-of-the-art results in information extraction (IE) tasks, many of which are formulated as sequence labeling (SL). We hypothesize that LLMs’ poor SL performance stems from causal masking, which prevents the model from attending to tokens on the right of the current token. Yet, how exactly and to what extent LLMs’ performance on SL can be improved remains unclear. We explore techniques for improving the SL performance of open LLMs on IE tasks by applying layer-wise removal of the causal mask (CM) during LLM fine-tuning. This approach yields performance gains competitive with state-of-the-art SL models, matching or outperforming the results of CM removal from all blocks. Our findings hold for diverse SL tasks, demonstrating that open LLMs with layer-dependent CM removal outperform strong MLM-based encoders and even instruction-tuned LLMs.
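As a rough illustration of layer-dependent causal-mask removal, the toy PyTorch stack below keeps the causal mask only in its lower blocks and lets the upper blocks attend bidirectionally before a per-token labeling head. The layer sizes and the split between causal and bidirectional blocks are illustrative assumptions, not the configuration studied in the paper.

    # Toy decoder-style stack for sequence labeling in which only the first
    # `causal_blocks` layers keep the causal mask; the rest attend bidirectionally.
    import torch
    import torch.nn as nn

    class Block(nn.Module):
        def __init__(self, dim, heads, causal):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            self.causal = causal

        def forward(self, x):
            mask = None
            if self.causal:
                seq_len = x.size(1)
                # True entries block attention to tokens right of the current one.
                mask = torch.triu(
                    torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
                )
            attn_out, _ = self.attn(x, x, x, attn_mask=mask)
            x = x + attn_out
            return x + self.ff(x)

    class SequenceLabeler(nn.Module):
        def __init__(self, vocab_size, num_labels, dim=256, heads=4, depth=6, causal_blocks=3):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.blocks = nn.ModuleList(
                Block(dim, heads, causal=(i < causal_blocks)) for i in range(depth)
            )
            self.head = nn.Linear(dim, num_labels)  # per-token label logits

        def forward(self, token_ids):
            x = self.embed(token_ids)
            for block in self.blocks:
                x = block(x)
            return self.head(x)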


Disentangling Latent Shifts of In-Context Learning Through Self-Training

Josip Jukić, Jan Šnajder

arXiv preprint

In-context learning (ICL) has become essential in natural language processing, particularly with autoregressive large language models capable of learning from demonstrations provided within the prompt. However, ICL faces challenges with stability and long contexts, especially as the number of demonstrations grows, leading to poor generalization and inefficient inference. To address these issues, we introduce STICL (Self-Training ICL), an approach that disentangles the latent shifts of demonstrations from the latent shift of the query through self-training. STICL employs a teacher model to generate pseudo-labels and trains a student model using these labels, encoded in an adapter module. The student model exhibits weak-to-strong generalization, progressively refining its predictions over time. Our empirical results show that STICL improves generalization and stability, consistently outperforming traditional ICL methods and other disentangling strategies across both in-domain and out-of-domain data.
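The self-training loop behind this idea can be sketched in a few lines: a teacher, prompted with demonstrations, pseudo-labels unlabeled queries, and a student that sees only the query is updated through a small adapter. The toy modules below are illustrative assumptions (the actual setup builds on a frozen LLM backbone), not the paper's implementation.

    # Sketch of a self-training step: the teacher pseudo-labels queries (using ICL
    # demonstrations internally); the student, which never sees the demonstrations,
    # is trained on those labels through a trainable adapter.
    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """Bottleneck adapter; only these weights (and the head) are updated."""
        def __init__(self, dim, bottleneck=32):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.up = nn.Linear(bottleneck, dim)

        def forward(self, h):
            return h + self.up(torch.relu(self.down(h)))

    class ToyStudent(nn.Module):
        """Stand-in for a frozen backbone plus adapter and classification head."""
        def __init__(self, dim, num_labels):
            super().__init__()
            self.backbone = nn.Linear(dim, dim)   # placeholder for the frozen LLM
            self.adapter = Adapter(dim)
            self.head = nn.Linear(dim, num_labels)
            for p in self.backbone.parameters():
                p.requires_grad = False

        def forward(self, x):
            return self.head(self.adapter(self.backbone(x)))

    def self_train_step(teacher_logits_fn, student, queries, optimizer):
        """teacher_logits_fn stands in for the teacher LLM prompted with demonstrations."""
        with torch.no_grad():
            pseudo_labels = teacher_logits_fn(queries).argmax(dim=-1)
        loss = nn.functional.cross_entropy(student(queries), pseudo_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()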


Projects

Explore our projects.

Retriever

TakeLab Retriever is a platform that collects articles and their metadata from Croatian news outlets and performs real-time text mining on them.

Alanno

We created a powerful annotation platform powered by active learning and designed to support a wide range of machine learning and deep learning models.

PsyTxt

With this project, we aim to set the ground for a truly interdisciplinary perspective on computational personality research by developing datasets and models for personality prediction and analysis based on online textual interactions.


Teaching

We take great pride and care in teaching the things we're good at and that inspire us. We design our courses around key topics in artificial intelligence, machine learning, NLP, and IR that we deem relevant for our students' career success and professional development. Here's a list of the courses we currently offer at the Faculty of Electrical Engineering and Computing, University of Zagreb.

Intro to AI

An introductory course covering fundamental concepts and techniques in artificial intelligence.

Machine Learning 1

A foundational course in machine learning, covering key algorithms and their underlying mechanisms.

Text Analysis and Retrieval

Examines modern approaches to text analysis and retrieval, grounded in fundamental principles.

Selected Topics in Natural Language Processing

Advanced topics in natural language processing, covering current research and applications.


News

Stay up to date with the latest news and updates from TakeLab.


Team

Get to know the people behind the work.

Ana Barić

Caporegime


David Dukić

Caporegime

Diamonds are made under pressure 💎

Iva Vukojević

Caporegime


Jan Šnajder

Don


Josip Jukić

Caporegime

It’s no coincidence a 90° angle is called the right one.

Laura Majer

Caporegime


Martin Tutek

Sottocapo


Matej Gjurković

Consigliere
