Top 5 Papers
Recent, Relevant Research Papers & Reports
Each item has a one-line summary and a direct link (click the title).
These selections are current as of Oct 2, 2025.
AI Safety / Alignment
An alignment safety case sketch based on debate — outlines how debate between AI systems can surface failures and expose safety-relevant reasoning. (arXiv)
A New and Fragile Opportunity for AI Safety — analysis of emerging vulnerabilities and new pathways for safety research in reasoning models, including chain-of-thought (CoT) monitoring concerns. (arXiv)
Findings from a pilot Anthropic–OpenAI alignment evaluation — public findings from a cross-evaluation of model misalignment tendencies such as sycophancy, self-preservation, and susceptibility to misuse. (Alignment Science Blog)
Bias Protections & Debiasing
BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Models — new benchmark and evaluation suite for comparing debiasing methods (very recent). (arXiv)
Mitigating Biases in Language Models via Bias Unlearning — proposes "BiasUnlearn", shows strong debiasing with preserved LM performance (very recent). (arXiv)
Bias and Fairness in Large Language Models: A Survey (Computational Linguistics, MIT Press) — authoritative survey of bias evaluation and mitigation methods. (MIT Press Direct)
Simulating a Bias Mitigation Scenario in Large Language Models — experimental study comparing pre-, in-, and post-model mitigation strategies; a toy post-model check of this kind is sketched below. (arXiv)
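As a deliberately simplified illustration of the post-model side of these comparisons (my own sketch, not taken from any paper above), the snippet below runs a counterfactual check: the same prompt template is filled with different demographic terms and the spread of a scorer's outputs is measured. The template, group list, and scoring function are hypothetical placeholders.

```python
from itertools import combinations
from typing import Callable, List

def counterfactual_gap(template: str, groups: List[str],
                       score: Callable[[str], float]) -> float:
    """Largest pairwise difference in scores across demographic substitutions."""
    scores = {g: score(template.format(group=g)) for g in groups}
    return max(abs(scores[a] - scores[b]) for a, b in combinations(groups, 2))

# Example with a stub scorer; a real study would plug in a trained
# toxicity/sentiment classifier or the LM's own log-probabilities.
gap = counterfactual_gap(
    "The {group} applicant was described by the reviewer as",
    ["male", "female", "nonbinary"],
    score=lambda text: 0.0,  # placeholder scorer
)
print(f"max counterfactual score gap: {gap:.3f}")
```

A large gap flags a prompt family where mitigation (whichever stage it is applied at) has not equalized the model's behavior across groups.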
Data Safety — Privacy, Poisoning, Federated Learning
Poisoning Prevention in Federated Learning and Differential Privacy via Stateful Proofs of Execution — defenses against poisoning during DP/FL training (a toy clipping-and-median defense is sketched after this list). (arXiv)
Poisoning Federated Learning under the Cover of Differential Privacy (DP-Poison) — attack demonstrating how DP can be abused by adversaries (ACM proceedings). (ACM Digital Library)
On Evaluating the Poisoning Robustness of Federated / LDP protocols — assesses how robust local differential privacy (LDP) protocols are to poisoning and which metrics are used to evaluate that robustness. (arXiv)
Federated Learning: A Survey on Privacy-Preserving Methods — up-to-date survey of FL privacy techniques, trusted execution environment (TEE) uses, and DP tradeoffs. (arXiv)
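The attack and defense papers in this cluster revolve around a small set of mechanics. As a purely illustrative aid (not drawn from any of the cited works), the sketch below combines two common ingredients: clipping each client's update norm (the same primitive DP training uses) and aggregating with a coordinate-wise median so a single outlier update cannot dominate the result.

```python
import numpy as np

def clip_update(update: np.ndarray, max_norm: float) -> np.ndarray:
    """Scale a client update so its L2 norm is at most max_norm (DP-style clipping)."""
    norm = float(np.linalg.norm(update))
    return update if norm <= max_norm else update * (max_norm / norm)

def robust_aggregate(client_updates, max_norm: float = 1.0) -> np.ndarray:
    """Coordinate-wise median of clipped updates; extreme (possibly poisoned) values get less pull."""
    clipped = np.stack([clip_update(u, max_norm) for u in client_updates])
    return np.median(clipped, axis=0)

# Example: nine honest clients plus one client submitting a wildly scaled update.
rng = np.random.default_rng(0)
honest = [rng.normal(0.0, 0.1, size=5) for _ in range(9)]
poisoned = [np.full(5, 50.0)]
print(robust_aggregate(honest + poisoned, max_norm=1.0))
```

The DP-Poison line of work above shows why this is not a solved problem: clipping and noise can also give adversarial updates cover, which is exactly the tension the robustness evaluations examine.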
Benefits of Small Language Models (SLMs) vs LLMs
What is the Role of Small Models in the LLM Era: A Survey — detailed survey arguing where SLMs are preferable (cost, latency, privacy, deployability). (arXiv)
A Survey of Small Language Models — earlier/companion survey exploring architectures, on-device inference, and constrained deployments. (arXiv)
PiFi: integrating frozen LLM layers into SLMs — a hybrid method that inserts frozen layers from an LLM into a small model to gain LLM-like strengths at lower cost, a practical demonstration of SLM/LLM hybrid benefits. (arXiv)
A Survey on Small Language Models in the Era of Large Language Models — ACM publication summarizing SLM roles in assisting LLMs via proxy tuning, guard models, and efficiency gains; a toy guard-model routing pattern is sketched below. (ACM Digital Library)
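The "guard model" and efficiency arguments in these surveys reduce to a routing pattern. The sketch below is a hypothetical rendering of it, where `slm_risk`, `slm_answer`, and `llm_answer` are placeholder callables rather than real APIs: a small model screens and answers the cheap, low-risk cases and only escalates ambiguous ones to a large model.

```python
from typing import Callable

def guarded_answer(
    prompt: str,
    slm_risk: Callable[[str], float],     # small model: risk score in [0, 1]
    slm_answer: Callable[[str], str],     # small model: cheap answer path
    llm_answer: Callable[[str], str],     # large model: expensive fallback
    refuse_above: float = 0.9,
    escalate_above: float = 0.5,
) -> str:
    """Small model screens every request; only ambiguous ones reach the large model."""
    risk = slm_risk(prompt)
    if risk >= refuse_above:
        return "Refused by guard model."
    if risk >= escalate_above:
        return llm_answer(prompt)   # uncertain or sensitive: escalate
    return slm_answer(prompt)       # common case: stay on the small, cheap path

# Example with stub callables standing in for real model endpoints.
print(guarded_answer("Summarize this meeting note.",
                     slm_risk=lambda p: 0.1,
                     slm_answer=lambda p: "(small-model summary)",
                     llm_answer=lambda p: "(large-model answer)"))
```

The cost, latency, and privacy advantages the surveys describe come from keeping the common case on the small-model path; the thresholds are deployment-specific knobs.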
Quick notes (important)
“Latest” papers above include very recent entries (some posted within days/weeks of Oct 2, 2025). Where we flagged a submission as “very recent,” you’ll find the arXiv date or publisher date on the linked page. (arXiv)
For bias and data-poisoning, the field moves quickly — new attacks and defenses appear frequently; check benchmark papers (e.g., BiasFreeBench) for up-to-date comparisons. (arXiv)
Top 5 Papers
Selected Peer-Reviewed / Conference / arXiv (research-style)
SafeScientist: Toward Risk-Aware Scientific Discoveries by LLM Agents, by Kunlun Zhu et al. (2025) — introduces a safety-aware AI scientist framework with a new benchmark of risky scientific tasks. SafeScientist (arXiv)
Position: AI Safety Must Embrace an Antifragile Perspective, by Ming Jin & Hyunin Lee (2025) — a position paper advocating that AI systems’ safety should improve under stress and uncertainty. Position: AI Safety Must Embrace an Antifragile Perspective (arXiv)
Towards Provable Probabilistic Safety for Scalable Embodied AI Systems, by Linxuan He et al. (2025) — defines a probabilistic safety boundary framework for embodied AI systems tackling rare failure modes. Towards Provable Probabilistic Safety (arXiv)
The BIG Argument for AI Safety Cases, by Ibrahim Habli et al. (2025) — presents a structured “safety case” methodology (Balanced, Integrated, Grounded) tailored to AI systems. The BIG Argument for AI Safety Cases (arXiv)
Small Language Models are the Future of Agentic AI, by P. Belcak et al. (2025) — argues that small models will power the next wave of AI “agents” because of low cost, modularity, and safety affordances. Small Language Models are the Future of Agentic AI (PDF)
Three-Sentence Summaries (for the five papers above)
SafeScientist: Toward Risk-Aware Scientific Discoveries by LLM Agents
SafeScientist is an LLM agent pipeline designed to detect and refuse high-risk or unethical scientific tasks, using layers of monitoring (prompt, tool, collaboration, ethical review). It introduces SciSafetyBench, a benchmark of 240 risky scientific tasks across domains and 120 tool-related risk tasks, to evaluate safety in scientific agent behavior. In experiments, it improves safety (refusing or mitigating unsafe tasks) by ~35% over baseline systems, while preserving productive output.
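The "layers of monitoring" description maps onto a simple pattern of sequential checks. The sketch below is a loose, hypothetical rendering of that idea; the stage names follow the summary, but the check functions are placeholders, not SafeScientist's actual components.

```python
from typing import Callable, List, Optional, Tuple

# Each check inspects the proposed task and returns None (pass) or a refusal reason.
Check = Callable[[str], Optional[str]]

def run_task(task: str, checks: List[Tuple[str, Check]],
             execute: Callable[[str], str]) -> str:
    """Execute the task only if every monitoring stage passes; otherwise refuse with the stage's reason."""
    for stage, check in checks:
        reason = check(task)
        if reason is not None:
            return f"Refused at {stage} stage: {reason}"
    return execute(task)

# Example with placeholder checks standing in for prompt, tool, and ethics review stages.
checks = [
    ("prompt review", lambda t: "dual-use wording" if "synthesize" in t else None),
    ("tool review",   lambda t: None),
    ("ethics review", lambda t: None),
]
print(run_task("Plan a benign materials experiment.", checks, execute=lambda t: "(task executed)"))
```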
Position: AI Safety Must Embrace an Antifragile Perspective
This position paper argues that typical robustness tests are brittle: systems often fail when faced with rare or shifted distributions. Instead, the authors propose that AI safety research should aim to make systems antifragile — the system’s safety “margin” improves under stress or exposure to rare cases. They offer conceptual guidelines, critique static evaluation paradigms, and advocate evolving, adaptive safety benchmarks.
Towards Provable Probabilistic Safety for Scalable Embodied AI Systems
In embodied AI (robotics and agents interacting with the physical world), exhaustive verification is infeasible due to enormous state spaces and rare corner-case failures. This work introduces a probabilistic safety guarantee framework: specify a tolerated risk threshold, then statistically bound the failure probability below it. The authors show how to combine statistical methods and system analysis to certify that the risk stays below the threshold, enabling safer large-scale deployment.
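As a rough illustration of what "statistically bound the failure probability" can look like in practice (a toy example of my own, not the paper's machinery), the sketch below computes an exact one-sided Clopper-Pearson upper bound on the failure rate from i.i.d. test runs and compares it against a hypothetical tolerated risk threshold.

```python
from scipy.stats import beta

def failure_rate_upper_bound(n_trials: int, n_failures: int,
                             confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson upper confidence bound on the true failure probability."""
    if n_failures >= n_trials:
        return 1.0
    # The upper limit is the `confidence` quantile of Beta(k + 1, n - k).
    return float(beta.ppf(confidence, n_failures + 1, n_trials - n_failures))

# Example: 3,000 simulated runs with zero observed failures.
bound = failure_rate_upper_bound(3_000, 0)   # ~1e-3 at 95% confidence (the "rule of three")
print(f"95% upper bound on failure rate: {bound:.2e}")
print("within tolerated risk (1e-2)?", bound <= 1e-2)   # hypothetical threshold
```

The catch, and the reason the paper needs more than a binomial bound, is that real deployments are not i.i.d. trials; rare corner cases are exactly the regimes plain sampling undercounts.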
The BIG Argument for AI Safety Cases
The authors adapt the idea of safety cases (widely used in the engineering of critical systems) to AI, with a specific instantiation called BIG (Balanced, Integrated, Grounded). They argue AI systems require holistic reasoning about hazards, traceability, context, and emergent behaviors beyond narrow correctness. They illustrate how BIG safety arguments can structure assurances across system layers (model, deployment, human oversight).
Small Language Models are the Future of Agentic AI
Belcak et al. argue that small LMs (SLMs) are more suitable than massive LLMs for powering agentic systems (i.e., systems that plan and act) because of cost, latency, modular composition, and trust/safety advantages. They show that in many agent applications, the incremental capability gained by scaling to huge model sizes yields diminishing returns relative to the overhead. They further suggest architectures and techniques (e.g., modular tool invocation, verification layers) that let SLMs scale in practice.