In this paper we present a system that allows users to “touch”, grab, and manipulate sounds in mid-air. Furthermore, arbitrary objects can be made to appear to emit sound. We use spatial sound reproduction for sound rendering and computer vision for tracking. With our approach, sounds can be heard from anywhere in the room and always appear to originate from the same (possibly moving) position, regardless of the listener’s position. We demonstrate that direct “touch” interaction with sound is an interesting alternative to indirect interaction mediated through controllers or visual interfaces. We show that sound localization is surprisingly accurate (11.5 cm), even in the presence of distractors. We propose leveraging the ventriloquist effect to further increase localization accuracy. Finally, we demonstrate how the affordances of real objects can create synergies between auditory and visual feedback. As an application of the system, we built a spatial music mixing room.