Transformers in the Dark: Navigating Unknown Search Spaces via Bandit Feedback
Abstract
We investigate whether Transformer architectures can approximate search algorithms without external components. To this end, we introduce a framework called unknown tree search with bandit feedback, in which tree extensions and feedback signals are provided externally to allow controlled evaluation. We show that Transformers are expressive enough to implement distinct search strategies, can be trained from scratch to approximate those strategies, and generalize to longer horizons and deeper trees. Additionally, fine-tuning pretrained language models on search trajectories enhances their search capabilities.