We propose a hierarchical approach to the design of gesture-to-sound mappings, with the goal of taking into account multilevel time structures in both gesture and sound processes. This allows for the integration of temporal mapping strategies, complementing mapping systems based on instantaneous relationships between gesture and sound synthesis parameters. Specifically, we propose using Hierarchical Hidden Markov Models to model gesture input, with a flexible structure that can be authored by the user. Moreover, some parameters can be adjusted through a learning phase. We show examples of gesture segmentation based on this approach, considering several phases such as preparation, attack, sustain, and release. Finally, we describe an application, developed in Max/MSP, that illustrates the use of accelerometer-based sensors to control phase vocoder synthesis techniques based on this approach.
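
To make the idea of phase-based gesture segmentation concrete, the following is a minimal sketch, not the implementation described here. It assumes a single flat, left-to-right hidden Markov model over the four phases named above, which is a simplification of the hierarchical model; a one-dimensional motion-energy feature derived from an accelerometer; and Gaussian emissions. The names `PHASES`, `EMISSIONS`, `log_gauss`, and `segment`, and all numeric values, are hypothetical and chosen only for illustration.

```python
import math

PHASES = ["preparation", "attack", "sustain", "release"]

# Hypothetical (mean, std) of a 1-D motion-energy feature for each phase.
# Illustrative values only; in practice these could be set by the user
# or adjusted in a learning phase.
EMISSIONS = [(0.1, 0.1), (1.0, 0.2), (0.5, 0.1), (0.2, 0.1)]


def log_gauss(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def segment(observations, stay=0.8):
    """Viterbi decoding for a left-to-right HMM over PHASES.

    Each phase either repeats (probability `stay`) or advances to the
    next phase (probability 1 - `stay`); the last phase always repeats.
    Returns one phase label per observation.
    """
    n = len(PHASES)
    neg_inf = float("-inf")
    # The gesture is assumed to start in the first phase.
    delta = [log_gauss(observations[0], *EMISSIONS[0])] + [neg_inf] * (n - 1)
    back = []
    for x in observations[1:]:
        new_delta, pointers = [], []
        for j in range(n):
            stay_score = delta[j] + (0.0 if j == n - 1 else math.log(stay))
            move_score = delta[j - 1] + math.log(1 - stay) if j > 0 else neg_inf
            if move_score > stay_score:
                best, prev = move_score, j - 1
            else:
                best, prev = stay_score, j
            new_delta.append(best + log_gauss(x, *EMISSIONS[j]))
            pointers.append(prev)
        delta = new_delta
        back.append(pointers)
    # Backtrack from the most likely final phase.
    state = max(range(n), key=lambda j: delta[j])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [PHASES[s] for s in path]


if __name__ == "__main__":
    energy = [0.1, 0.1, 1.0, 1.0, 0.5, 0.5, 0.5, 0.2, 0.2]
    for value, phase in zip(energy, segment(energy)):
        print(f"{value:.1f} -> {phase}")
```

Because the transition structure only allows a phase to repeat or advance, the decoded labels form contiguous segments in a fixed order. A hierarchical model would add a higher level that selects among several such segment sequences, which this sketch does not attempt.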