With the rapid development of artificial intelligence technology, multimodal robots are playing an increasingly important role in preschool children’s education, entertainment, and daily life. Existing studies have focused primarily on how single sensory cues of robots affect children’s perception, while systematic research on multisensory integration effects remains limited. To explore how robots’ multimodal features jointly influence children’s emotional preferences and visual attention, 318 children aged 4 to 6 years were recruited for an eye-tracking experiment. The experiment adopted a 2 (appearance: humanoid vs. animal-like) × 3 (voice guidance: male voice, female voice, or none) × 2 (gesture guidance: present vs. absent) mixed factorial design. Robot appearance and behavioral features (voice and gesture guidance) served as independent variables, and children’s emotional preferences and eye-tracking indicators served as dependent variables. The results showed that, for appearance features, subjective preference ratings did not differ significantly between humanoid and animal-like robots. However, humanoid robots attracted longer total fixation duration, more fixations, and shorter first-fixation latency than animal-like robots, indicating an advantage on attention-related measures. Children were more readily attracted to humanoid robots during the initial stage of visual contact, and anthropomorphic design was more effective in sustaining their attention. For behavioral features, robots with gesture guidance received significantly higher subjective preference ratings than those without gestures, and also elicited longer total fixation duration and more fixations.
Robots with female voices received slightly higher subjective preference ratings than those with male voices, and both were rated significantly higher than robots without voice. Robots with male voices had slightly longer total fixation duration than those with female voices, and both had significantly longer total fixation duration than robots without voice. Fixation counts did not differ significantly between male- and female-voice robots, but both attracted significantly more fixations than robots without voice. Overall, robots with gesture guidance and voice (especially female voice) performed better in subjective ratings and visual attention allocation, suggesting that behavioral features substantially enhanced children’s emotional preferences and interactive experiences. Furthermore, the effects of appearance and behavioral features on children’s emotional preferences and visual attention were relatively independent, with no significant interaction effects observed. These findings clarify how robot appearance and behavioral features influence preschool children’s emotional preferences and visual attention, and provide empirical evidence for designing child-oriented robots that align with users’ emotional needs.
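
As a rough illustration of the 2 × 3 × 2 factorial design described above, the following sketch enumerates the 12 resulting robot conditions. The factor keys and the helper name `enumerate_conditions` are hypothetical labels chosen for this example, not identifiers from the study, and the sketch makes no assumption about which factors were varied between or within participants.

```python
from itertools import product

# Hypothetical factor labels; the level names follow the design stated in the abstract.
FACTORS = {
    "appearance": ["humanoid", "animal-like"],
    "voice": ["male", "female", "none"],
    "gesture": ["present", "absent"],
}


def enumerate_conditions(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


conditions = enumerate_conditions(FACTORS)
print(len(conditions))  # 2 * 3 * 2 = 12 cells in the design
for condition in conditions:
    print(condition)
```

Each printed dictionary corresponds to one cell of the design, for example a humanoid robot with a female voice and no gesture guidance.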