# NeurIPS 2023

### New Orleans, LA

# NeurIPS in Numbers

- 3,218 main track papers
- 3,116 workshop papers
- 21,582 authors
- 15,760 participants

### Institution by Main Track Paper Count

# Awards

## Test of Time

### Distributed Representations of Words and Phrases and their Compositionality by Tomas Mikolov et al.

The authors reflect on learnings from the decade-old Word2Vec paper to see if they hold true today: semi-supervised learning is key to language understanding (true); tokenization solves nuanced problems (true); treating language as sequences of dense vectors is powerful (true).

## Outstanding Main Track Papers

### Are Emergent Abilities of Large Language Models a Mirage? by Rylan Schaeffer et al.

Large language models’ emergent abilities may actually be a result of metric choices. Particularly, smooth, continuous, predictable changes in model performance, as scale increases, can appear sharp and unpredictable if the researcher chooses a non-linear or discontinuous metric.
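
As a rough illustration of the metric effect, the sketch below uses made-up per-token accuracies and assumes token errors are independent; it is not a reproduction of the paper's experiments. Per-token accuracy rises steadily with scale, yet exact match over a multi-token answer stays near zero and then climbs steeply.

```python
# Hypothetical per-token accuracies for models of increasing scale (illustrative values).
per_token_accuracy = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
answer_length = 10  # assumed number of tokens that must all be correct


def exact_match(p, length):
    """Probability that all `length` tokens are correct, assuming independent token errors."""
    return p ** length


exact_match_scores = [exact_match(p, answer_length) for p in per_token_accuracy]

for p, em in zip(per_token_accuracy, exact_match_scores):
    print(f"per-token accuracy {p:.2f} -> exact match {em:.4f}")
```

The same underlying improvement looks gradual under the linear metric and abrupt under the all-or-nothing one.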

### Privacy Auditing with One (1) Training Run by Thomas Steinke et al.

Privacy auditing usually estimates lower bounds on the privacy parameters of differentially private algorithms by training many times on datasets that differ by one record. Instead, this approach trains once: coin flips randomly include or exclude data points, and the bounds are inferred by guessing which points were included.

## Outstanding Main Track Runners-Up

### Scaling Data-Constrained Language Models by Niklas Muennighoff et al.

Scaling large language models is data-constrained (papers and books are estimated to be exhausted by 2024). Several strategies can help: repeat data for ~4 epochs for large language models, fill ~60% of the data budget with Python code and reduce the use of perplexity filters.

### Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafael Rafailov et al.

Direct Preference Optimization is a method to fine-tune language models from human feedback without reinforcement learning, which can be complex and unstable. A mapping between reward functions and optimal policies can reduce this to a classification problem on human preferences.
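
The classification view can be sketched as a per-pair loss. This is a minimal version that assumes sequence log-probabilities under the policy and a frozen reference model are already available; the function name, argument names, and numbers are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities (natural log)."""
    # Implicit rewards: beta times the log-ratio between policy and reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Binary classification loss -log(sigmoid(margin)), in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities for one (chosen, rejected) pair.
loss_at_reference = dpo_loss(-12.0, -12.0, -12.0, -12.0)
loss_after_update = dpo_loss(-10.0, -14.0, -12.0, -12.0)
print(loss_at_reference, loss_after_update)
```

When the policy equals the reference, the loss is log 2. It falls as the policy raises the chosen response relative to the rejected one.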

## Outstanding Datasets and Benchmark Track Papers

### ClimSim: A large multi-scale dataset for hybrid physics-ML climate emulation by Sungduk Yu et al.

The ClimSim dataset encourages the development of ~1-km scale climate models using hybrid ML-physics techniques. It consists of 5.7B pairs of multivariate inputs (local climate states) and targets (cloud formation processes) generated by the E3SM-MMF climate simulator.

### DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models by Boxin Wang et al.

DecodingTrust is a trustworthiness evaluation for large language models and includes benchmarks from 8 perspectives: toxicity, stereotypes, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness.

# Invited Talks

### NextGenAI: The Delusion of Scaling and the Future of Generative AI by Björn Ommer

The co-author of the Stable Diffusion paper talks about the scaling limits and the need for improved compute and data efficiency. The future of generative models lies in combining different architectures and using retrieval augmentation to offload detail from the main model.

### The Many Faces of Responsible AI by Lora Aroyo

Rater disagreement is an important signal caused by content ambiguity and raters’ diverse perspectives. Human annotation should be allowed to capture the natural variance and bias in the problem and model quality metrics should reflect this.

### Coherence statistics, self-generated experience and why young humans are much smarter than current AI. by Linda Smith

How do babies rapidly learn across many domains from sparse data? They control the input by moving their heads (short-term), form episodes of connected experience at task time (medium-term) and change the curriculum according to developmental time (long-term).

### Sketching: core tools, learning-augmentation, and adaptive robustness by Jelani Nelson

A sketch is a compressed representation of a dataset that is still useful for downstream tasks. This talk discusses some core sketching tools (CountSketch, Random Projections and Matrix Sketching) as well as more recent advancements such as learning-augmentation and adaptive robustness.
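
To make the idea concrete, here is a minimal CountSketch for frequency estimation. It is an illustrative sketch rather than code from the talk, and it assumes a salted BLAKE2 digest is an acceptable stand-in for the pairwise-independent hash functions used in the analysis.

```python
import hashlib
from statistics import median


class CountSketch:
    """Minimal CountSketch: `depth` rows of `width` signed counters."""

    def __init__(self, depth=5, width=64):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _bucket_and_sign(self, row, key):
        # Assumption: a salted hash replaces the pairwise-independent hash family.
        digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        bucket = (value >> 1) % self.width
        sign = 1 if value & 1 else -1
        return bucket, sign

    def add(self, key, count=1):
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, key)
            self.table[row][bucket] += sign * count

    def estimate(self, key):
        # Each row gives one signed estimate; the median damps collisions.
        row_estimates = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, key)
            row_estimates.append(sign * self.table[row][bucket])
        return median(row_estimates)


sketch = CountSketch()
for _ in range(5):
    sketch.add("neurips")
print(sketch.estimate("neurips"))  # prints 5: no other keys have been added
```

The table size is fixed no matter how many distinct keys arrive, which is the compression; the cost is that estimates become approximate once keys collide.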

### Beyond Scaling Panel with Alexander Rush, Aakanksha Chowdhery, Angela Fan, Percy Liang

Aakanksha Chowdhery (trained PaLM at Google), Angela Fan (trained Llama-2 at Meta) and Percy Liang (Associate Professor at Stanford) sit down to discuss a range of topics including Architectures/Engineering, Data/Alignment, Evaluation/Transparency and Creators/Contributors.

### Systems for Foundation Models, and Foundation Models for Systems by Christopher Ré

This talk looks at new ways of efficiently building foundation models. Inspired by database research, Flash Attention reduces the IO cost by processing multiple query blocks at a time. Beyond the Transformer, architectures based on signal processing, like S4, show promise.
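
Flash Attention's tiling relies on an online softmax that rescales partial results as each new block arrives. The sketch below shows that trick for a single query in plain Python. It is not the actual GPU kernel, and the function names and toy inputs are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def naive_attention(q, keys, values):
    """Softmax attention for one query, materializing every score at once."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def blockwise_attention(q, keys, values, block_size=2):
    """Same result, visiting keys/values one block at a time with running statistics."""
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [dot(q, k) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale what has been accumulated so far to the new running maximum.
        rescale = math.exp(running_max - new_max)
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


# Hypothetical toy inputs.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
naive = naive_attention(q, keys, values)
blockwise = blockwise_attention(q, keys, values)
print(naive, blockwise)
```

Because only one block of scores exists at a time, the full attention matrix never has to be written to slow memory.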

### Online Reinforcement Learning in Digital Health Interventions by Susan Murphy

HeartSteps is a physical activity coach with an activity suggester. Through different iterations of HeartSteps, various challenges of online reinforcement learning are addressed including actions that operate on different timescales, and learning while modifying the algorithm.

# Oral Sessions

## Reinforcement Learning

### Ordering-based Conditions for Global Convergence of Policy Gradient Methods by Jincheng Mei et al.

Softmax Policy Gradient and Natural Policy Gradient can achieve global convergence with non-zero approximation errors. Non-zero approximation errors do not characterize global convergence for these algorithms. Linear realizability does not imply global convergence for Softmax PG.

### When Demonstrations meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning by Siliang Zeng et al.

Offline inverse reinforcement learning aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent. A bi-level optimization formulation of the reward estimation task is proposed.

### Online RL in Linearly \(q^\pi\)-Realizable MDPs Is as Easy as in Linear MDPs If You Learn What to Ignore by Gellert Weisz et al.

The difference between linearly \(q^\pi\)-realizable and linear MDPs is characterized: the former may have “zero-range” states. An algorithm is derived that solves \(q^\pi\)-realizable MDPs using samples that are bounded by polynomial parameters.

### When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment by Tianwei Ni et al.

Reinforcement learning algorithms need to learn memory and credit assignment, both of which involve modeling long-term dependencies, a task which suits Transformers. Empirical results suggest that Transformers improve memory but not credit assignment.

### Bridging RL Theory and Practice with the Effective Horizon by Cassidy Laidlaw et al.

BRIDGE, a new dataset for developing practical bounds for deep learning MDPs, leads to the Effective Horizon measure, correlating deep reinforcement learning success with the optimality of greedy actions under random policy Q functions.

### Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafael Rafailov et al. 🥈

Direct Preference Optimization is a method to fine-tune language models from human feedback without reinforcement learning, which can be complex and unstable. A mapping between reward functions and optimal policies can reduce this to a classification problem on human preferences.

### MetaBox: A Benchmark Platform for Meta-Black-Box Optimization with Reinforcement Learning by Zeyuan Ma et al.

MetaBox, a new benchmark platform, facilitates the development and evaluation of MetaBBO-RL methods, which use meta-level reinforcement learning to reduce the need for manual fine-tuning of low-level black-box optimizers.

## Datasets & Benchmarks

### LeanDojo: Theorem Proving with Retrieval-Augmented Language Models by Kaiyu Yang et al.

LeanDojo is an open-source playground for Lean, a proof assistant. The playground includes ReProver, an LLM-based prover augmented with premises retrieved from a vast math library, and a benchmark consisting of 98,734 theorems and proofs extracted from Lean’s math library.

### OpenAssistant Conversations — Democratizing Large Language Model Alignment by Andreas Köpf et al.

OpenAssistant Conversations is a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 complete and fully annotated conversation trees. This dataset aims to democratize research on large-scale alignment.

### DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models by Boxin Wang et al. 🥇

DecodingTrust is a trustworthiness evaluation for large language models and includes benchmarks from 8 perspectives: toxicity, stereotypes, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness.

## Tractable Models

### How to Turn Your Knowledge Graph Embeddings into Generative Models by Lorenzo Loconte et al.

Knowledge graph embedding (KGE) models have issues with prediction confidence measurement, constraint satisfaction guarantees and entity scaling. To address this, KGEs can be turned into generative models called GeKCs through the use of probabilistic circuits.

### Exact Bayesian Inference on Discrete Models via Probability Generating Functions: A Probabilistic Programming Approach by Fabian Zaiser et al.

Exact Bayesian inference is possible for a large class of discrete models. Genfer takes a probabilistic program and, by translating it to a probability generating function, makes it possible to extract the posterior probability masses and moments using automatic differentiation.
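
The sketch below shows only how masses and moments are read off a probability generating function \(G(x) = \sum_k P(X = k)\,x^k\), using finite distributions stored as coefficient lists. It is a toy illustration with hypothetical helper names, not Genfer, and it handles neither conditioning nor automatic differentiation.

```python
def pgf_product(p, q):
    """Multiply two PGFs given as coefficient lists (index k holds P(X = k))."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def pgf_mean(coeffs):
    """E[X] = G'(1) = sum_k k * P(X = k)."""
    return sum(k * c for k, c in enumerate(coeffs))


def pgf_variance(coeffs):
    """Var[X] = G''(1) + G'(1) - G'(1)^2."""
    second_factorial_moment = sum(k * (k - 1) * c for k, c in enumerate(coeffs))
    mean = pgf_mean(coeffs)
    return second_factorial_moment + mean - mean ** 2


coin = [0.5, 0.5]  # PGF of a fair coin: G(x) = 0.5 + 0.5 x
two_coins = pgf_product(coin, coin)  # PGF of the sum of two independent coins
print(two_coins, pgf_mean(two_coins), pgf_variance(two_coins))
```

The coefficients are the probability masses and the derivatives at 1 give the moments, which is why differentiating a PGF is enough to recover both.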

### Characteristic Circuits by Zhongjie Yu et al.

Characteristic circuits are a family of tractable probabilistic models, providing a unified formalization of distributions over heterogeneous data in the spectral domain, that facilitate efficient probabilistic inference even when no closed-form density function is available.

## Deep Learning Theory

### Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization by Kaiyue Wen et al.

One proposed explanation for why over-parameterized neural networks generalize is the flatness of the minimizers they reach. However, flattest models that do not generalize are found, implying that the reason lies elsewhere.

### Abide by the Law and Follow the Flow: Conservation Laws for Gradient Flows by Sibylle Marcotte et al.

Some properties of optimization initialization are retained by trained over-parameterized models. This paper defines how these quantities are conserved, and provides algorithms for computing the maximal number of independent conservation laws.

### A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning by Alicia Curth et al.

The relationship between model complexity and prediction error (assumed to be U-shaped) needs revision after the success of over-parametrized neural networks. An additional regime where a second descent in test error occurs seems to be present in non-neural network models also.

## Efficient Learning

### Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture by Dan Fu et al.

Monarch Mixer (M2) is a new architecture that can scale sub-quadratically along sequence length and model dimension. M2 is applied to three domains: non-causal BERT-style language modeling, ViT-style image classification, and causal GPT-style language modeling.

### QLoRA: Efficient Finetuning of Quantized LLMs by Tim Dettmers et al.

QLoRA is a finetuning approach that reduces memory usage by backpropagating gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). It introduces several innovations: 4-bit NormalFloat, Double Quantization and Paged Optimizers.

### Scaling Data-Constrained Language Models by Niklas Muennighoff et al. 🥈

Scaling large language models is data-constrained (papers and books are estimated to be exhausted by 2024). Several strategies can help: repeat data for ~4 epochs for large language models, fill ~60% of the data budget with Python code and reduce the use of perplexity filters.

### Bridging Discrete and Backpropagation: Straight-Through and Beyond by Liyuan Liu et al.

Backpropagation is limited to computing gradients for continuous variables. A novel approach approximates the gradient of parameters involved in generating discrete latent variables.

## Objects / Neuroscience / Vision

### Rotating Features for Object Discovery by Sindy Löwe et al.

The binding problem concerns how the brain represents and connects objects within a fixed network of neural connections. An alternative to discrete slot-based methods is Rotating Features, an approach that learns continuous and distributed object-centric representations.

### Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment by Royi Rassin et al.

Text-conditioned image generation models often incorrectly associate entities and their visual attributes. SynGen is an approach that syntactically analyses the prompt and uses a novel loss function that encourages cross-attention maps to agree with linguistic bindings.

### Additive Decoders for Latent Variables Identification and Cartesian-Product Extrapolation by Sébastien Lachapelle et al.

Additive decoders, suitable for decomposing images into object-specific elements, enable precise latent variable identification and innovative image generation, with minimal assumptions on latent factor distribution.

### Emergence of Shape Bias in Convolutional Neural Networks through Activation Sparsity by Tianqin Li et al.

Deep learning models for object recognition are biased towards texture, while human visual systems are biased towards shape. Enforcing a sparse coding constraint using a non-differential top-k operation can introduce shape bias into the network.

## Causality

### Learning Linear Causal Representations from Interventions under General Nonlinear Mixing by Simon Buchholz et al.

Causal representations can be learnt in a general setting (Gaussian latent distribution but general mixing function) given unknown single-node interventions on the latent variables.

### A Measure-Theoretic Axiomatisation of Causality by Junhyung Park et al.

Kolmogorov’s measure-theoretic axiomatisation of probability is a good starting point towards an axiomatisation of causality.

### Conformal Meta-learners for Predictive Inference of Individual Treatment Effects by Ahmed Alaa et al.

Conformal meta-learners is a general framework for issuing predictive intervals for individual treatment effects (ITEs) in machine learning by applying the standard conformal prediction (CP) procedure on top of conditional average treatment effect (CATE) meta-learners.
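
The mechanics of split conformal prediction applied to CATE estimates can be sketched as follows. It assumes absolute-residual conformity scores between pseudo-outcomes and CATE predictions on a held-out calibration set. All names and numbers are hypothetical, and the conditions under which such intervals are valid for ITEs are not checked here.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def ite_interval(cate_prediction, calibration_scores, alpha=0.1):
    """Symmetric interval around a CATE prediction."""
    radius = conformal_quantile(calibration_scores, alpha)
    return cate_prediction - radius, cate_prediction + radius


# Hypothetical calibration data: pseudo-outcomes and CATE predictions.
pseudo_outcomes = [1.0, 0.0, 2.5, -0.5, 1.5, 3.0, 0.5, 2.0, 1.0, 4.0]
cate_predictions = [1.5, 0.5, 2.0, 0.5, 1.0, 2.0, 1.0, 1.0, 1.5, 2.0]
calibration_scores = [abs(y - f) for y, f in zip(pseudo_outcomes, cate_predictions)]

print(ite_interval(1.2, calibration_scores, alpha=0.25))
```

The interval width comes entirely from the calibration residuals, so any CATE meta-learner can be plugged in without changing the procedure.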

### Causal Normalizing Flows: From Theory to Practice by Adrián Javaloy et al.

Causal normalizing flows are shown to capture the underlying causal data-generating process using non-linear ICA.

## Privacy

### Nearly Tight Bounds For Differentially Private Multiway Cut by Mina Dalirrooyfard et al.

The min \(s\)-\(t\) cut problem involves removing the smallest number of graph edges so that two terminals are disconnected. Tight bounds are given for this problem under differential privacy, and a differentially private algorithm for the multiway \(k\)-cut problem is presented.

### Privacy Auditing with One (1) Training Run by Thomas Steinke et al. 🥇

Privacy auditing usually estimates lower bounds on the privacy parameters of differentially private algorithms by training many times on datasets that differ by one record. Instead, this approach trains once: coin flips randomly include or exclude data points, and the bounds are inferred by guessing which points were included.

### Private Everlasting Prediction by Moni Naor et al.

Private everlasting prediction, which protects the privacy of both the training set and ongoing queries, works for all concept classes with finite VC dimension.

### User-Level Differential Privacy With Few Examples Per User by Badih Ghazi et al.

The example-scarce regime (each user has only a few examples) can be handled for approximate-DP using a generic transformation of any item-level DP algorithm to a user-level algorithm, while for pure-DP, a technique can adapt the exponential mechanism to the user-level setting.

## Neuro

### Brain Diffusion for Visual Exploration: Cortical Discovery using Large Scale Generative Models by Andrew Luo et al.

Brain Diffusion for Visual Exploration (BrainDiVE) attempts to synthesize images predicted to activate a given brain region using paired natural images and fMRI recordings, by combining large-scale diffusion models with brain-guided image synthesis.

### Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity by Zijiao Chen et al.

Mind-Video attempts to reconstruct human vision from fMRI brain data. It does this using masked brain modeling, multimodal contrastive learning with spatio-temporal attention, and co-training with an augmented Stable Diffusion model incorporating network temporal inflation.

### Human-like Few-Shot Learning via Bayesian Reasoning over Natural Language by Kevin Ellis

An inductive learning model that can efficiently learn a broad range of concepts like a human can be implemented as a Bayesian reasoning process, where a language model proposes candidate hypotheses in natural language, which are then re-weighed by a prior and a likelihood.

## NLP/Tools

### ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings by Shibo Hao et al.

Instead of requiring fine-tuning or prompting, ToolkenGPT learns tool embeddings so that calling a tool works like predicting a token when solving complex tasks. Each tool is represented as a vector embedding, and when one is triggered during text generation the model enters a special function mode.

### Toolformer: Language Models Can Teach Themselves to Use Tools by Timo Schick et al.

Toolformer is a language model trained to decide which APIs to call, when to call them with what arguments, and how to use the results when predicting future tokens. A range of tools are included: a calculator, a Q&A system, a search engine, a translation system, and a calendar.

### Learning Transformer Programs by Dan Friedman et al.

Transformer Programs are modified Transformer models that can be trained using gradient-based optimization and automatically converted into discrete, human-readable programs that are mechanistically interpretable by design.

## Diffusion Models

### Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation by Diederik Kingma and Ruiqi Gao

On the surface, it seems that state-of-the-art diffusion models are optimized with objectives that are very different from Evidence Lower Bound (ELBO) objectives, when in fact these objectives are equal to a weighted integral of ELBOs over different noise levels.

### Entropic Neural Optimal Transport via Diffusion Processes by Nikita Gushchin et al.

A new neural algorithm computes the entropic optimal transport (EOT) plan between probability distributions that are accessible by samples. The end-to-end algorithm consists of a single learning step, has fast inference and permits small entropy regularization coefficients.

### DiffuseBot: Breeding Soft Robots With Physics-Augmented Generative Diffusion Models by Tsun-Hsuan Johnson Wang et al.

DiffuseBot is a physics-augmented diffusion model that generates soft robot morphologies. It augments the diffusion process with a physical dynamical simulation and uses a co-design procedure that jointly optimizes physical design and control.

## Optimization

### A Single-Loop Accelerated Extra-Gradient Difference Algorithm with Improved Complexity Bounds for Constrained Minimax Optimization by Yuanyuan Liu et al.

An extra-gradient difference acceleration algorithm for solving constrained nonconvex-nonconcave (NC-NC) minimax problems is proposed. An extra-gradient difference step obtains the quasi-cocoercivity property and momentum acceleration is used in the dual accelerating update step.

### Smoothing the Landscape Boosts the Signal for SGD: Optimal Sample Complexity for Learning Single Index Models by Alex Damian et al.

This work closes the gap between the upper and lower bounds by showing that online SGD on a smoothed loss learns a single index model with respect to the isotropic Gaussian distribution in \(d\) dimensions with \(n \gtrsim d^{k^\star/2}\) samples.

### Generalizing Nonlinear ICA Beyond Structural Sparsity by Yujia Zheng and Kun Zhang

Nonlinear independent component analysis (ICA) aims to uncover true latent sources from observable nonlinear mixtures. A set of new nonlinear ICA identifiability results overcome the current bijectivity assumptions and sparsity and source independence constraints.

### Fine-Tuning Language Models with Just Forward Passes by Sadhika Malladi et al.

Backpropagation when fine-tuning language models requires a prohibitively large amount of memory. A memory-efficient zeroth-order optimizer (MeZO) can estimate gradients using only two forward passes and fine-tune language models with the same memory footprint as inference.
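
A minimal sketch of the two-forward-pass idea follows, assuming a toy quadratic loss in place of a language model and a flat list of floats in place of model weights. The perturbation is regenerated from a seed instead of being stored, so the extra memory stays constant. Function names and hyperparameters are illustrative.

```python
import random


def perturb(params, seed, scale):
    """Add scale * z to params in place, regenerating z ~ N(0, 1) from the seed."""
    rng = random.Random(seed)
    for i in range(len(params)):
        params[i] += scale * rng.gauss(0.0, 1.0)


def mezo_step(params, loss_fn, seed, lr=0.01, eps=1e-3):
    """One zeroth-order SGD step using two forward passes and no stored gradients."""
    perturb(params, seed, +eps)        # theta + eps * z
    loss_plus = loss_fn(params)
    perturb(params, seed, -2 * eps)    # theta - eps * z
    loss_minus = loss_fn(params)
    perturb(params, seed, +eps)        # restore theta
    projected_grad = (loss_plus - loss_minus) / (2 * eps)
    # Regenerate the same z from the seed and step along it.
    perturb(params, seed, -lr * projected_grad)


def quadratic_loss(params):
    # Toy stand-in for a language-model loss (assumption for illustration).
    return sum(p * p for p in params)


params = [1.0, -2.0, 0.5, 3.0, -1.5]
initial_loss = quadratic_loss(params)
for step in range(200):
    mezo_step(params, quadratic_loss, seed=step)
final_loss = quadratic_loss(params)
print(initial_loss, final_loss)
```

Each step needs only loss values, so nothing beyond the parameters themselves has to be kept in memory.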

## Datasets & Benchmarks

### Mesogeos: A multi-purpose dataset for data-driven wildfire modeling in the Mediterranean by Spyridon Kondylatos et al.

The Mesogeos dataset encourages short-term wildfire danger forecasting and final burned area estimation given the ignition point. It consists of wildfire drivers and historical records of wildfire ignitions and burned areas in a grid of 1km × 1km × 1-day resolution.

### ClimSim: A large multi-scale dataset for hybrid physics-ML climate emulation by Sungduk Yu et al. 🥇

The ClimSim dataset encourages the development of ~1-km scale climate models using hybrid ML-physics techniques. It consists of 5.7B pairs of multivariate inputs (local climate states) and targets (cloud formation processes) generated by the E3SM-MMF climate simulator.

### Quilt-1M: One Million Image-Text Pairs for Histopathology by Wisdom Ikezogwo et al.

QUILT is a medical vision-language dataset, consisting of 802,144 image and text pairs, collected from educational YouTube videos by expert clinicians in the field of histopathology (the microscopic examination of tissue).

### BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks by Stephanie Milani et al.

BASALT Evaluation and Demonstrations Dataset (BEDD) is a resource for algorithm development and performance assessment in Minecraft. It consists of 26M image-action pairs from nearly 14K videos of human players completing BASALT tasks in Minecraft.

## Chain of Thought / Reasoning

### Why think step by step? Reasoning emerges from the locality of experience by Ben Prystawski et al.

Chain-of-thought reasoning (CoT), where intermediate steps are generated, is useful in language models when overlapping local variable clusters strongly influence each other. It enables local inference chaining to estimate relationships not seen together in training.

### Tree of Thoughts: Deliberate Problem Solving with Large Language Models by Shunyu Yao et al.

Tree of Thoughts (ToT) enables exploration over coherent text units (thoughts) that are intermediate steps toward problem solving. The technique helps language models make decisions through consideration of different reasoning paths, self-evaluation and look-ahead/backtracking.

### Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection by Yu Bai et al.

A comprehensive statistical theory for transformers’ in-context learning (ICL), or prompting, capabilities is given, showing they can implement and adaptively select various machine learning algorithms in context.

### Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective by Guhao Feng et al.

Circuit complexity theory shows that bounded-depth Transformers can directly produce correct answers for basic arithmetic/equation tasks only when model size grows super-polynomially with input length. Using chain-of-thought with constant-size Transformers solves both tasks.

## Graph Neural Networks / Invariance

### Going beyond persistent homology using persistent homology by Johanna Immonen et al.

Color-separating sets resolve the problem of identifying the class of message-passing graph neural networks (MP-GNNs) that persistent homology (PH) can recognize. Graphs are distinguished using the persistence of their connected components, from filtering vertex and edge colors.

### Clifford Group Equivariant Neural Networks by David Ruhe et al.

Clifford Group Equivariant Neural Networks are a novel approach for constructing \(\mathrm{O}(n)\)- and \(\mathrm{E}(n)\)-equivariant models.

### Evaluating Post-hoc Explanations for Graph Neural Networks via Robustness Analysis by Junfeng Fang et al.

OOD-resistant Adversarial Robustness (OAR) is a new evaluation metric for explaining graph neural networks, that avoids the out-of-distribution (OOD) issues plaguing current explanation methods, which work by feeding the explanatory subgraph and measuring output difference.

## Privacy / Fairness

### Students Parrot Their Teachers: Membership Inference on Model Distillation by Matthew Jagielski et al.

This work presents membership inference attacks which show that model distillation only provides limited privacy leakage reduction. Attacks on private datasets can succeed even if the target model is never queried on actual training points.

### Rethinking Bias Mitigation: Fairer Architectures Make for Fairer Face Recognition by Samuel Dooley et al.

Face recognition systems exhibit bias across socio-demographic dimensions. Model biases, usually thought to arise from biased training data, may instead be inherent to the neural network architectures themselves. To test this, a neural architecture and hyperparameter search for fairness is performed.

### Ethical Considerations for Responsible Data Curation by Jerone Andrews et al.

This work presents proactive, domain-specific recommendations for curating human-centric computer vision (HCCV) datasets that address purpose, privacy and consent, and diversity concerns.

## Probability / Sampling

### Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent by Jihao Andreas Lin et al.

Stochastic gradient descent can be a computationally efficient way to solve the linear systems in Gaussian processes, avoiding the cubic cost in dataset size and the sensitivity to conditioning.

### A Rigorous Link between Deep Ensembles and (Variational) Bayesian Methods by Veit David Wild et al.

A rigorous link between Bayesian, Variational Bayesian, and Deep Ensemble methods is established by reformulating the non-convex optimisation problem typically encountered in deep learning as a convex optimisation in the space of probability measures.

### Optimizing Solution-Samplers for Combinatorial Problems: The Landscape of Policy-Gradient Method by Constantine Caramanis et al.

A framework is developed analyzing the effectiveness of deep neural networks to tackle combinatorial problems as a solution generator successively improved using gradient-based methods. The framework examines the expressiveness, tractability and optimization landscape benignness.

## Vision

### Visual Instruction Tuning by Haotian Liu et al.

LLaVA: Large Language and Vision Assistant is an end-to-end large multimodal model connecting a vision encoder and an LLM for visual and language understanding. The model is instruction tuned on multimodal language-image instruction-following data generated by text-only GPT-4.

### EgoEnv: Human-centric environment representations from egocentric video by Tushar Nagarajan et al.

Egocentric video can be linked to the environment by learning representations that are predictive of the camera-wearer’s local surroundings. These models are trained using videos from simulated fully observable 3D environments and tested on human-captured real-world videos.

### DataComp: In search of the next generation of multimodal datasets by Samir Yitzhak Gadre et al.

DataComp is a testbed for multimodal dataset experiments centered around a candidate pool of 12.8B image-text pairs from Common Crawl. Users create new datasets by filtering and curating data sources, evaluating them by testing the trained models on 38 downstream test sets.

### Siamese Masked Autoencoders by Agrim Gupta et al.

Siamese Masked Autoencoders (SiamMAE) is an extension of Masked Autoencoders for learning visual correspondence from videos. It operates on video frame pairs and asymmetrically masks 95% of the patches in future frames, encouraging the learning of object-centric representations.

### Image Captioners Are Scalable Vision Learners Too by Michael Tschannen et al.

Contrastive pretraining on image-text pairs is thought to be a superior pretraining strategy to image captioning. After matching training data, compute, and model capacity, image captioning alone is surprisingly effective, even surpassing contrastive pretraining in some tasks.

### The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation by Saurabh Saxena et al.

Denoising diffusion probabilistic models excel in estimating optical flow and monocular depth, without task-specific architectures and loss functions. With self-supervised pre-training and infilling and denoising diffusion training, these models achieve SOTA results.

### Spatial-frequency channels, shape bias, and adversarial robustness by Ajay Subramanian et al.

In neuroscience, critical band masking is used to reveal the spatial frequency data used for object recognition. Using it as a task for network-human comparison, the network channel is more than twice as wide as the human channel, with adversarial training worsening it.

## Large Language Models

### Are Emergent Abilities of Large Language Models a Mirage? by Rylan Schaeffer et al. 🥇

Large language models’ emergent abilities may actually be a result of metric choices. Particularly, smooth, continuous, predictable changes in model performance, as scale increases, can appear sharp and unpredictable if the researcher chooses a non-linear or discontinuous metric.

### Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models by Guillermo Ortiz-Jimenez et al.

Task arithmetic, a cost-effective method for editing pre-trained models, improves/forgets tasks by adding/subtracting task vectors (the difference between fine-tuned and pre-trained weights). This works by weight disentanglement, where distinct weight space directions govern separate, localized regions in function space.
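
Basic task arithmetic can be sketched with plain dictionaries of scalar weights. This is a toy example with hypothetical names and values; it leaves out the tangent-space (linearized) fine-tuning that the paper studies.

```python
def task_vector(pretrained, finetuned):
    """Element-wise difference between fine-tuned and pre-trained weights."""
    return {name: finetuned[name] - pretrained[name] for name in pretrained}


def apply_task_vectors(pretrained, vectors, scale=1.0):
    """Add (scale > 0) or subtract (scale < 0) task vectors from the pre-trained weights."""
    edited = dict(pretrained)
    for vector in vectors:
        for name, delta in vector.items():
            edited[name] += scale * delta
    return edited


# Hypothetical scalar "weights"; real models hold tensors under each name.
pretrained = {"w1": 1.0, "w2": -0.5}
finetuned_a = {"w1": 1.5, "w2": -0.5}
finetuned_b = {"w1": 1.0, "w2": 0.25}

tau_a = task_vector(pretrained, finetuned_a)
tau_b = task_vector(pretrained, finetuned_b)
multi_task = apply_task_vectors(pretrained, [tau_a, tau_b])
forget_a = apply_task_vectors(pretrained, [tau_a], scale=-1.0)
print(multi_task, forget_a)
```

In this toy case the two task vectors touch different weights, so adding both changes each weight exactly as its own fine-tuning run did. Real task vectors overlap, which is where weight disentanglement matters.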

### The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks by Ziqian Zhong et al.

Can neural networks rediscover known algorithms when trained on well-understood tasks? Using the modular addition task, models are able to learn both the known Clock algorithm and an unknown, but comprehensible, Pizza algorithm.
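
A simplified, single-frequency version of the Clock idea can be written by hand: operands are embedded as points on a circle, angles are composed with trigonometric identities, and the answer is read out by alignment. Trained networks use learned embeddings and several frequencies; the function below is an illustrative assumption, not extracted model weights.

```python
import math


def clock_add(a, b, modulus, frequency=1):
    """Compute (a + b) mod modulus by composing rotations on a circle."""
    w = 2 * math.pi * frequency / modulus
    # Embed each operand as a point on the unit circle.
    cos_a, sin_a = math.cos(w * a), math.sin(w * a)
    cos_b, sin_b = math.cos(w * b), math.sin(w * b)
    # Angle-addition identities give the embedding of a + b.
    cos_sum = cos_a * cos_b - sin_a * sin_b
    sin_sum = sin_a * cos_b + cos_a * sin_b

    # Read out the candidate c whose embedding best aligns with the sum.
    def alignment(c):
        return cos_sum * math.cos(w * c) + sin_sum * math.sin(w * c)

    return max(range(modulus), key=alignment)


print(clock_add(9, 7, modulus=13))  # prints 3, since (9 + 7) mod 13 = 3
```

The readout works because the alignment equals \(\cos(w(a + b - c))\), which peaks exactly when \(c \equiv a + b\) modulo the modulus, provided the frequency is coprime with it.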

### Jailbroken: How Does LLM Safety Training Fail? by Alexander Wei et al.

Jailbreak design can be guided by two failure modes: competing objectives (when a model’s capabilities and safety goals conflict) and mismatched generalization (safety training fails to generalize to a domain for which capabilities exist).

## Theory

### Optimal Learners for Realizable Regression: PAC Learning and Online Learning by Idan Attias et al.

This study advances the understanding of statistical complexity in realizable regression, introducing new dimensions and learners that qualitatively and quantitatively define learnability in both PAC and online settings.

### Random Cuts are Optimal for Explainable k-Medians by Konstantin Makarychev and Liren Shan

This work shows that the RandomCoordinateCut algorithm gives the optimal competitive ratio for explainable \(k\)-medians in \(\ell_1\). A tight analysis of the algorithm is provided and its competitive ratio is proven to be upper bounded by \(2\ln k+2\).

### Tester-Learners for Halfspaces: Universal Algorithms by Aravind Gollakota et al.

This work introduces the first tester-learner for halfspaces succeeding universally over a wide class of structured distributions, running in fully polynomial time, guaranteeing that the learner achieves error \(O(\mathrm{opt}) + \epsilon\) on any accepted labeled distribution.

### Improved Algorithms for Stochastic Linear Bandits Using Tail Bounds for Martingale Mixtures by Hamish Flynn et al.

This study introduces improved algorithms with worst-case regret guarantees for the stochastic linear bandit problem, utilizing a novel tail bound for adaptive martingale mixtures to construct confidence sequences which are suitable for stochastic bandits.

# Other Highlights

### Neural Scaling Laws Panel with Nick Bostrom, Yoshua Bengio, Yann LeCun, Max Tegmark, Percy Liang, Julia Bossmann, Jenia Jitsev, Nora Belrose, Irina Rish, Ethan Caballero, Quintin Pope, and Joscha Bach

Members of the panel take turns answering several questions: Should powerful AI be made public? When and on what did you change your mind recently?

### GraphCast: Global Weather Prediction with Rémi Lam

GraphCast predicts the Earth’s surface and atmospheric weather, ten days ahead, at high spatial resolution. It significantly outperforms the industry deterministic gold standard (HRES).

### Isomorphic Labs with Sergei Yakneen and Max Jaderberg

DeepMind’s drug discovery spin-off is thinking about: How far can architectures (GNNs, SSMs, Transformers) be scaled? How do diffusion models and flow matching methods connect to the underlying physics? Are there golden self-supervised tasks for atoms, energies and proteins?

### In Defense of Zero Imputation for Tabular Deep Learning by John Van Ness and Madeleine Udell

Multi-layer perceptrons (MLPs) paired with zero imputation perform as well on real-world data as more powerful deep impute-then-predict neural network models (such as NeuMiss, supMIWAE and GRAPE) that allow for joint optimization of imputations and predictions.

### Teaching Arithmetic to Small Transformers by Nayoung Lee et al.

Small transformers, trained from random initialization, can efficiently learn arithmetic operations and elementary functions using the next-token prediction objective if simple formatting changes are made to the training data.

### From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces by Peter Shaw et al.

Digital agents for graphical user interfaces usually rely on HTML. Pix2Act relies solely on pixel-based input and selects actions corresponding to basic mouse and keyboard functions. It uses a pre-trained image-to-text Transformer model tuned with Behavioral Cloning and MCTS.

### Counterfactual Memorization in Neural Language Models by Chiyuan Zhang et al.

Larger models result in more memorization, leading to privacy, copyright, and generalization issues. Counterfactual memorization characterizes how a model’s predictions change if a document is omitted during training, since memorized examples are ones that commonly occur.

Timothy Leung