Ahmad Dawar Hakimi

Output Vector Editing for Memorization Mitigation in Large Language Models

Ahmad Dawar Hakimi, Kaiwei Lei, Isabelle Augenstein, Hinrich Schütze

arXiv preprint, 2026

Abstract

Large language models memorize and reproduce sequences from their training data, creating privacy, copyright, and security risks. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages; the output vector is what writes to the residual stream and, through superposition, encodes multiple features. We propose output vector editing, a constrained-optimization weight edit that locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged. Evaluating on four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), we center on OLMo-7B (whose open weights and pretraining corpus enable systematic mining) and mine 6831 memorized sequences, achieving up to 87.9% suppression. The 2.7× gap over zero ablation on the same located neurons shows the suppression comes from the output-vector edit, not localization alone. Four edit modes span a spectrum from aggressive suppression to minimal redirection; in ensemble they cover 96.5% of memorized sequences, while our recommended single-mode configuration reaches 81.5% with no catastrophic locality failures. We further identify a mechanistic boundary at ~14% of sequences unreachable by MLP-only editing; while these failures are not attention-driven overall, ablating the top contributing attention heads recovers 60–64% of them, with stronger recovery on continuations that copy tokens from the prefix, positioning attention as a complementary fallback rather than a primary mechanism. Edit mode ordering and the success-locality trade-off transfer across all four models, with success rates scaling with model size rather than family.
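
The contrast the abstract draws between zeroing an activation and editing an output vector can be illustrated with a toy two-neuron layer. This is only a minimal sketch, not the paper's constrained-optimization procedure: the helper `mlp_write`, the toy numbers, and the hand-picked `delta` are all hypothetical and chosen for illustration.

```python
def mlp_write(activations, output_vectors):
    """Toy residual-stream contribution of an MLP layer: sum_i a_i * v_i."""
    dim = len(output_vectors[0])
    return [
        sum(a * v[d] for a, v in zip(activations, output_vectors))
        for d in range(dim)
    ]

# Hypothetical toy values: two neurons writing into a 2-dimensional stream.
activations = [2.0, 0.5]
output_vectors = [
    [1.0, 0.0],  # neuron 0 writes along direction 0
    [0.0, 1.0],  # neuron 1 writes along direction 1
]

baseline = mlp_write(activations, output_vectors)

# Zero ablation: neuron 0's activation is forced to zero,
# so everything it would have written disappears.
ablated = mlp_write([0.0, activations[1]], output_vectors)

# Output-vector edit: the activation is left untouched; only the vector
# that neuron 0 writes is nudged. `delta` is an assumption for this sketch,
# whereas the paper derives its edit via constrained optimization.
delta = [-0.5, 0.5]
edited_vectors = [
    [v + d for v, d in zip(output_vectors[0], delta)],
    output_vectors[1],
]
edited = mlp_write(activations, edited_vectors)

print(baseline, ablated, edited)
```

In the toy example the ablated layer loses neuron 0's contribution entirely, while the edited layer keeps the same activations and redirects part of what neuron 0 writes toward the other direction.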

interpretability llms output vector editing memorization mitigation LLM memorization neuron editing MLP neurons weight editing residual stream mechanistic interpretability

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

Ahmad Dawar Hakimi, Lea Hirlimann, Isabelle Augenstein, Hinrich Schütze

arXiv preprint, 2026

Abstract

Instruction-tuned LLMs can annotate thousands of instances from a short prompt at negligible cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be labelled at once? We investigate both questions on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labelled, 5,000 human-annotated), comparing seven annotation strategies across four encoders to detect anti-immigrant hostility. A classifier trained on 25,974 GPT-5.2 labels ($43) achieves comparable F1-Macro to one trained on 3,800 human annotations ($316). Active learning offers little advantage over random sampling in our pre-enriched pool and delivers lower F1 than full LLM annotation at the same cost. However, comparable aggregate F1 masks a systematic difference in error structure: LLM-trained classifiers over-predict the positive class relative to the human gold standard. This divergence concentrates in topically ambiguous discussions where the distinction between anti-immigrant hostility and policy critique is most subtle, suggesting that annotation strategy should be guided not by aggregate F1 alone but by the error profile acceptable for the target application.

active-learning llms active learning LLM annotation human vs LLM annotation hostility detection anti-immigrant classification TikTok comments German political discourse annotation cost weak supervision

On Relation-Specific Neurons in Large Language Models

Yihong Liu*, Runsheng Chen*, Lea Hirlimann, Ahmad Dawar Hakimi, Mingyang Wang, Amir Hossein Kargaran, Sascha Rothe, François Yvon, Hinrich Schütze

EMNLP 2025

Abstract

In large language models (LLMs), certain neurons can store distinct pieces of knowledge learned during pretraining. While factual knowledge typically appears as a combination of relations and entities, it remains unclear whether some neurons focus on a relation itself – independent of any entity. We hypothesize such neurons detect a relation in the input text and guide generation involving such a relation. To investigate this, we study the Llama-2 family on a chosen set of relations, with a statistics-based method. Our experiments demonstrate the existence of relation-specific neurons. We measure the effect of selectively deactivating candidate neurons specific to relation r on the LLM's ability to handle (1) facts involving relation r and (2) facts involving a different relation r' ≠ r. With respect to their capacity for encoding relation information, we give evidence for the following three properties of relation-specific neurons. (i) Neuron cumulativity: Multiple neurons jointly contribute to processing facts involving relation r, with no single neuron fully encoding a fact in r on its own. (ii) Neuron versatility: Neurons can be shared across multiple closely related as well as less related relations. In addition, some relation neurons transfer across languages. (iii) Neuron interference: Deactivating neurons specific to one relation can improve LLMs' factual recall performance for facts of other relations.

interpretability llms relation-specific neurons knowledge neurons mechanistic interpretability factual knowledge neuron interference cross-lingual neurons fact storage

Time Course MechInterp: Analyzing the Evolution of Components and Knowledge in Large Language Models

Ahmad Dawar Hakimi, Ali Modarressi, Philipp Wicke, Hinrich Schütze

Findings of ACL 2025

Abstract

Understanding how large language models (LLMs) acquire and store factual knowledge is crucial for enhancing their interpretability, reliability, and efficiency. In this work, we analyze the evolution of factual knowledge representation in the OLMo-7B model by tracking the roles of its Attention Heads and Feed Forward Networks (FFNs) over the course of pre-training. We classify these components into four roles — general, entity, relation-answer, and fact-answer specific — and examine their stability and transitions. Our results show that LLMs initially depend on broad, general-purpose components, which later specialize as training progresses. Once the model reliably predicts answers, some components are repurposed, suggesting an adaptive learning process. Notably, answer-specific attention heads display the highest turnover, whereas FFNs remain stable, continually refining stored knowledge. Furthermore, our probing experiments reveal that location-based relations converge to high accuracy earlier in training than name-based relations, highlighting how task complexity shapes acquisition dynamics. These insights offer a mechanistic view of knowledge formation in LLMs and have implications for model pruning, optimization, and transparency.

interpretability llms training dynamics training trajectory mechanistic interpretability circuits knowledge acquisition factual recall pre-training analysis

BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods

Philipp Mondorf, Mingyang Wang, Sebastian Gerstner, Ahmad Dawar Hakimi, Yihong Liu, Leonor Veloso, Shijia Zhou, Hinrich Schütze, Barbara Plank

BlackboxNLP Workshop @ EMNLP 2025

Abstract

The Circuit Localization track of the Mechanistic Interpretability Benchmark (MIB) evaluates methods for localizing circuits within large language models (LLMs), i.e., subnetworks responsible for specific task behaviors. In this work, we investigate whether ensembling two or more circuit localization methods can improve performance. We explore two variants: parallel and sequential ensembling. In parallel ensembling, we combine attribution scores assigned to each edge by different methods — e.g., by averaging or taking the minimum or maximum value. In the sequential ensemble, we use edge attribution scores obtained via EAP-IG as a warm start for a more expensive but more precise circuit identification method, namely edge pruning. We observe that both approaches yield notable gains on the benchmark metrics, leading to a more precise circuit identification approach. Finally, we find that taking a parallel ensemble over various methods, including the sequential ensemble, achieves the best results. We evaluate our approach in the BlackboxNLP 2025 MIB Shared Task, comparing ensemble scores to official baselines across multiple model–task combinations.
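
The parallel variant described above, combining per-edge attribution scores by averaging or taking the minimum or maximum, might look roughly like the sketch below. The function names, the edge labels, and the scores are hypothetical; the sketch assumes each method yields a dictionary from edge to score and does not reproduce the shared-task implementation.

```python
from statistics import mean

def parallel_ensemble(score_maps, combine="mean"):
    """Combine per-edge attribution scores from several methods.

    score_maps: list of dicts mapping an edge to a score, one dict per method.
    combine: "mean", "min", or "max".
    """
    combiners = {"mean": mean, "min": min, "max": max}
    if combine not in combiners:
        raise ValueError(f"unknown combine mode: {combine}")
    reduce_fn = combiners[combine]
    # Keep only edges that every method has scored.
    shared = set(score_maps[0]).intersection(*score_maps[1:])
    return {edge: reduce_fn([m[edge] for m in score_maps]) for edge in shared}

def top_k_edges(scores, k):
    """Return the k edges with the largest absolute combined score."""
    return sorted(scores, key=lambda e: abs(scores[e]), reverse=True)[:k]

# Hypothetical scores from two attribution methods on three edges.
method_a = {("a0.h1", "m1"): 0.75, ("a0.h2", "m1"): 0.125, ("m0", "a1.h0"): 0.5}
method_b = {("a0.h1", "m1"): 0.25, ("a0.h2", "m1"): 0.375, ("m0", "a1.h0"): 0.25}

averaged = parallel_ensemble([method_a, method_b], combine="mean")
circuit = top_k_edges(averaged, k=2)
print(circuit)
```

Restricting the ensemble to edges scored by every method is a design choice of this sketch; an implementation could instead treat missing edges as having score zero.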

interpretability llms circuit localization MIB benchmark mechanistic interpretability benchmark ensemble methods edge pruning circuit identification subnetwork analysis

Citance-Contextualized Summarization of Scientific Papers

Shahbaz Syed*, Ahmad Dawar Hakimi*, Khalid Al-Khatib, Martin Potthast

Findings of EMNLP 2023

Abstract

Current approaches to automatic summarization of scientific papers generate informative summaries in the form of abstracts. However, abstracts are not intended to show the relationship between a paper and the references cited in it. We propose a new contextualized summarization approach that can generate an informative summary conditioned on a given sentence containing the citation of a reference (a so-called "citance"). This summary outlines the content of the cited paper relevant to the citation location. Thus, our approach extracts and models the citances of a paper, retrieves relevant passages from cited papers, and generates abstractive summaries tailored to each citance. We evaluate our approach using Webis-Context-SciSumm-2023, a new dataset containing 540K computer science papers and 4.6M citances therein.

summarization citance scientific paper summarization citation-aware summarization contextualized summarization abstractive summarization retrieval-augmented summarization scholarly NLP

Explorative Visual Analysis of Rap Music

Christofer Meinecke, Ahmad Dawar Hakimi, Stefan Jänicke

Information, 13(1), 2022

Abstract

Detecting references and similarities in music lyrics can be a difficult task. Crowdsourced knowledge platforms such as Genius can help in this process through user-annotated information about the artist and the song but fail to include visualizations to help users find similarities and structures on a higher and more abstract level. We propose a prototype to compute similarities between rap artists based on word embedding of their lyrics crawled from Genius. Furthermore, the artists and their lyrics can be analyzed using an explorative visualization system applying multiple visualization methods to support domain-specific tasks.

visualization rap music lyrics analysis word embeddings music similarity Genius lyrics exploratory visualization computational musicology hip-hop

Casting the Same Sentiment Classification Problem

Erik Körner, Ahmad Dawar Hakimi, Gerhard Heyer, Martin Potthast

Findings of EMNLP 2021

Abstract

We introduce and study a problem variant of sentiment analysis, namely the "same sentiment classification problem", where, given a pair of texts, the task is to determine if they have the same sentiment, disregarding the actual sentiment polarity. Among other things, our goal is to enable a more topic-agnostic sentiment classification. We study the problem using the Yelp business review dataset, demonstrating how sentiment data needs to be prepared for this task, and then carry out sequence pair classification using the BERT language model. In a series of experiments, we achieve an accuracy above 83% for category subsets across topics, and 89% on average.
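
The task setup, deciding whether two texts share a sentiment regardless of which sentiment it is, can be illustrated by how pair labels are derived from polarity-labelled texts. The following is a minimal sketch with made-up reviews; the function name and the exhaustive pairing are assumptions and not the data preparation used in the paper.

```python
from itertools import combinations

def make_same_sentiment_pairs(reviews):
    """Turn (text, polarity) items into (text_a, text_b, same) triples.

    The label is 1 when both texts share a polarity and 0 otherwise;
    the polarity itself is dropped from the output.
    """
    pairs = []
    for (text_a, pol_a), (text_b, pol_b) in combinations(reviews, 2):
        pairs.append((text_a, text_b, int(pol_a == pol_b)))
    return pairs

# Hypothetical reviews with binary polarity labels.
reviews = [
    ("Great pizza, friendly staff.", "positive"),
    ("The mechanic fixed my car quickly.", "positive"),
    ("Cold food and a long wait.", "negative"),
]

pairs = make_same_sentiment_pairs(reviews)
for text_a, text_b, same in pairs:
    print(same, "|", text_a, "|", text_b)
```

Each triple can then be fed to a sequence pair classifier, which sees only the two texts and the same/different label.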

sentiment same sentiment classification sentiment analysis BERT sequence pair classification

On Classifying Whether Two Texts Are on the Same Side of an Argument

Erik Körner, Gregor Wiedemann, Ahmad Dawar Hakimi, Gerhard Heyer, Martin Potthast

EMNLP 2021

Abstract

To ease the difficulty of argument stance classification, the task of same side stance classification (S3C) has been proposed. In contrast to actual stance classification, which requires a substantial amount of domain knowledge to identify whether an argument is in favor or against a certain issue, it is argued that, for S3C, only argument similarity within stances needs to be learned to successfully solve the task. We evaluate several transformer-based approaches on the dataset of the recent S3C shared task, followed by an in-depth evaluation and error analysis of our model and the task's hypothesis. We show that, although we achieve state-of-the-art results, our model fails to generalize both within as well as across topics and domains when adjusting the sampling strategy of the training and test set to a more adversarial scenario. Our evaluation shows that current state-of-the-art approaches cannot determine same side stance by considering only domain-independent linguistic similarity features, but appear to require domain knowledge and semantic inference, too.

argument-mining same side stance classification argument mining stance classification transformer models cross-topic generalization argument similarity

The Road Map to FAME: A Framework for Mining and Formal Evaluation of Arguments

Ringo Baumann, Gregor Wiedemann, Maximilian Heinrich, Ahmad Dawar Hakimi, Gerhard Heyer

Datenbank-Spektrum, 20(2): 107–113, 2020

Abstract

Two different perspectives on argumentation have been pursued in computer science research, namely approaches of argument mining in natural language processing on the one hand, and formal argument evaluation on the other hand. So far these research areas are largely independent and unrelated. This article introduces the agenda of our recently started project "FAME – A framework for argument mining and evaluation". The main project idea is to link the two perspectives on argumentation and their respective research agendas by employing controlled natural language as a convenient form of intermediate knowledge representation. Our goal is to develop a framework which integrates argument mining and formal argument evaluation to study patterns of empirical argumentation usage. If successful, this combination will allow for new types of queries to be answered by argumentation retrieval systems and large-scale content analysis.

argument-mining argument mining formal argumentation FAME framework controlled natural language argument evaluation knowledge representation argumentation retrieval

Research Focus

Mechanistic Interpretability Research

Active Learning

Large Language Models
