On the cocktail-party problem: Do children use their exquisite hearing at frequencies above 8 kHz?

In complex listening environments, such as daycares, playgrounds and classrooms, target speech and competing sounds interact before reaching the ear, making it challenging for listeners to understand the target speech—a phenomenon commonly referred to as the "cocktail party problem" (Cherry and Taylor, 1954). Understanding the target talker mixed with multiple sound sources from different directions requires a combination of peripheral encoding, central auditory processing, and cognitive factors, collectively known as auditory scene analysis (Alain et al., 2001; Bregman, 1990). Once information is transmitted from the auditory periphery to the brain, central auditory mechanisms identify and segregate the target from competing sources in order to obtain perceptually meaningful elements, a process known as stream segregation. Additionally, selective attention and cognitive factors such as linguistic knowledge further assist in managing the complexity of the auditory scene, allowing listeners to focus on the target speech signal amidst the competing sounds.

Auditory stream segregation relies on acoustic cues generated by factors such as talker sex and spatial separation, which help listeners distinguish target speech from competing sounds (Bregman, 1990; Bronkhorst, 2015; Darwin and Carlyon, 1995; Yost, 1997). Adults effectively use differences in vocal characteristics and spatial cues to separate speech streams, enhancing speech perception in complex environments like cocktail-party scenarios (Arbogast et al., 2002; Brungart, 2001; Brungart et al., 2001; Freyman et al., 1999; Kidd et al., 1998; Rennies et al., 2019; Rennies and Kidd, 2018). In contrast, children face greater challenges in noisy settings, requiring larger acoustic differences to segregate sounds compared to adults (Litovsky, 2005; Wightman and Kistler, 2005). Although this need for larger differences diminishes with age, children continue to struggle more with segregating speech in the presence of competing talkers than with stationary noise (Buss et al., 2017; Calandruccio et al., 2020; Corbin et al., 2016; Goldsworthy and Markle, 2019; Johnstone and Litovsky, 2006; Leibold and Buss, 2013). The development of auditory stream segregation abilities is prolonged, taking up to 17 years to fully mature, even when both talker-sex and spatial cues are available (Brown et al., 2010; Cameron et al., 2011, 2006).

The ability to use talker sex cues, such as variations in fundamental frequency (F0) and vocal tract length (Darwin et al., 2003), to segregate speech streams develops gradually throughout childhood and continues into adolescence (Corbin et al., 2016; Wightman and Kistler, 2005). Infants aged 16 months or younger do not benefit from a mismatch between target and masker sex (Leibold and Buss, 2013; Newman and Morini, 2017). However, by 30 months, children demonstrate improved word recognition when the target voice is female, and the masker's voice is male, compared to when both voices are female (Newman and Morini, 2017). Wightman and Kistler (2005) found that children as young as 4 years old can benefit from a target-masker sex mismatch when the target speech is male, but their performance does not reach adult-like levels even at 16 years of age. Similarly, Leibold et al. (2018) reported that children (aged 5–10 years) and adults benefit from a sex mismatch in two-talker masker conditions, yet substantial differences in speech-in-speech thresholds remain between these age groups. In addition, Flaherty et al. (2019) observed that children younger than 13 derive less benefit from talker-sex cues compared to older children and adults, while those younger than 7 years gain no advantage, even with large F0 separations that are easily discriminable in quiet (Zaltz et al., 2020).

Spatial cues also contribute to speech segregation. Children as young as three can use spatial cues, such as monaural head shadow effects and interaural differences, to improve speech perception (Ching et al., 2011; Garadat and Litovsky, 2007). However, the spatial release from masking (SRM) observed in children is generally smaller than in adults, and the age at which SRM matures appears to depend on listening conditions, including the complexity of the target speech and its similarity to the maskers (Ching et al., 2011; Garadat and Litovsky, 2007; Jones and Litovsky, 2008; Litovsky, 2005; Misurelli and Litovsky, 2015, 2012; Murphy et al., 2011). For instance, SRM reaches maturity by around 5 years of age when the target consists of monosyllabic or bisyllabic words but is delayed when sentences serve as the target speech (Brown et al., 2010; Cameron et al., 2006; Litovsky, 2005; Misurelli and Litovsky, 2012). Recent studies highlight the role of hearing sensitivity above 8 kHz (extended high frequencies, EHFs) in processing talker-sex and spatial cues in adults. Jain et al. (2022) evaluated auditory stream segregation using a rapid sequence of vowel stimuli differing in F0 and found that individuals with impaired hearing in the EHF range required larger F0 differences for effective stream segregation. In terms of spatial cues, Monson et al. (2019) found that limiting access to EHFs by low-pass filtering speech at 8 kHz led to poorer (elevated) speech recognition thresholds (SRTs) for co-located maskers, with the effect being more pronounced when the target talker's head orientation deviated from the listener. Similarly, Trine and Monson (2020) reported that adding noise-vocoded EHF cues to speech low-pass filtered at 8 kHz improved speech recognition in co-located conditions, with further improvements when spectral information was included. Monson et al. (2023) found that speech recognition was significantly better when the target speech contained EHF energy and the masker was low-pass filtered at 8 kHz, with this effect being more pronounced for two-talker maskers. A detailed analysis revealed that EHFs in the target provide crucial phonetic information and segregation cues, whereas EHFs in the masker do not offer similar benefits. In addition, Ananthanarayana et al. (2024) demonstrated that the contribution of EHF cues increased when the masker was not aligned with the listener's head orientation.
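The EHF manipulation common to several of the studies above—removing energy above the conventional audiometric range by low-pass filtering at 8 kHz—can be illustrated with a short signal-processing sketch. This is a minimal illustration, not the cited authors' actual code; the `remove_ehf` helper name and the Butterworth filter order are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def remove_ehf(signal, fs, cutoff=8000.0, order=8):
    """Low-pass filter a waveform to remove extended high-frequency
    (EHF) content above `cutoff` Hz (zero-phase, so no added delay)."""
    sos = butter(order, cutoff, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

# Example: a 1-s test signal at 44.1 kHz containing a 4-kHz component
# (within the conventional range) and a 12-kHz component (EHF).
fs = 44100
t = np.arange(fs) / fs
speechlike = np.sin(2 * np.pi * 4000 * t) + np.sin(2 * np.pi * 12000 * t)
filtered = remove_ehf(speechlike, fs)

# After filtering, the 12-kHz (EHF) component is strongly attenuated
# while the 4-kHz component is preserved.
spectrum = np.abs(np.fft.rfft(filtered))
freqs = np.fft.rfftfreq(len(filtered), 1 / fs)
```

The zero-phase `sosfiltfilt` call applies the filter forward and backward, doubling the effective attenuation slope without shifting the waveform in time.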

Children are exquisitely sensitive to EHFs compared to adults (Lee et al., 2012; Mishra et al., 2022; Valiente et al., 2014) and benefit from EHF cues when listening in stationary noise, suggesting that they may also derive significant benefits from EHF cues in complex listening environments. While existing research indicates that EHF information aids children in stationary noise conditions (Mishra et al., 2022), the potential role of these cues in more dynamic scenarios, such as cocktail-party listening, remains largely unexplored. This gap is particularly notable given that the ability to segregate auditory streams in environments with multiple competing talkers is a critical skill for effective communication and learning in children. Recent studies provide some evidence supporting the importance of EHF cues for speech segregation in children. For instance, Petley et al. (2021), in the context of listening disorders, identified a significant interaction between EHF hearing loss and talker-sex cues, implying that EHF deficits could impair the ability to differentiate between speakers based on voice characteristics. Similarly, Flaherty et al. (2021), employing a methodology similar to that of Monson et al. (2019), demonstrated that school-age children (5–18 years; n = 39) with normal EHF hearing showed a significant improvement in SRTs (∼1.6 dB) when listening to broadband speech compared to speech that was low-pass filtered at 8 kHz. Further supporting this notion, Braza et al. (2022) demonstrated that masker projection significantly affected SRTs in children aged 5 to 7 years with normal EHF hearing (n = 15), using co-located and spatially separated speech maskers that were either facing towards or away from the listener. The targets and maskers in both of these studies were female voices.
Collectively, these findings raise compelling questions about the broader implications of EHF sensitivity in children, particularly in real-world listening situations involving multiple competing talkers. Understanding the extent to which children use EHF cues for stream segregation could offer valuable insights into auditory development and inform strategies for managing hearing loss that extends beyond the conventional audiometric range, as well as conditions such as listening disorders.

To date, few studies have examined the role of EHFs in speech stream segregation in children (Braza et al., 2022; Flaherty et al., 2021; Petley et al., 2021), with only one study examining the effects of both talker-sex and spatial cues, and then only in isolation (Petley et al., 2021). In real-world listening environments, these cues often occur simultaneously, emphasizing the need for a more comprehensive understanding of their combined effect. The aim of the present study was to investigate the role of EHF audibility in speech stream segregation in children. It was hypothesized that enhanced sensitivity to EHFs would support stream segregation, thereby improving the ability to distinguish target speech from competing talkers. The study employed co-located, spatially separated, and talker- and masker-sex-matched and mismatched conditions. Previous studies in children have typically used sentence-based tasks (Braza et al., 2022; Flaherty et al., 2021; Petley et al., 2021), which require significant cognitive and linguistic abilities. In contrast, the present study utilized digit triplets to assess stream segregation. This method offers an advantage when testing younger participants, as it minimizes cognitive load while still providing meaningful data. Unlike sentence tasks, digit triplets involve simple, phonetically distinct numbers (0–10), which children typically learn by age 3 or 4 (Silver et al., 2022). The use of a number keyboard further simplifies the response process, minimizing the need for verbal language. This approach enables reliable testing in children as young as 4 years old, addressing the challenges of conducting speech perception assessments with younger populations (Koopmans et al., 2018; Moore et al., 2019). The present study expanded upon the work of Brown et al. (2010), who measured sentence recognition in the presence of masker speech, with conditions in which the target and masker were either of the same or different sex, and either co-located or spatially separated. Unlike Brown et al. (2010), the present study tested a wider age range of children, used digit triplets as target stimuli presented via loudspeakers, and included EHF hearing thresholds as a predictor of performance.
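The logic of an adaptive digit-triplet test of the kind described above can be sketched in a few lines. This is a hypothetical illustration only—the present study's actual adaptive rule, step size, vocabulary, and scoring may differ, and `present_trial` stands in for the stimulus-presentation and keyboard-response interface. A simple 1-down/1-up rule on signal-to-noise ratio (SNR) converges toward the SNR at which triplets are reported correctly about half the time:

```python
import random

DIGITS = list(range(11))  # number vocabulary 0-10, as described above

def run_digit_triplet_track(present_trial, start_snr=0.0, step=2.0, n_trials=20):
    """1-down/1-up adaptive track estimating a speech recognition
    threshold (SRT) for digit triplets in a competing-sound background.

    `present_trial(triplet, snr)` plays the triplet at the given SNR and
    returns the listener's three-number response (hypothetical interface).
    """
    snr = start_snr
    reversals = []          # SNRs at which the track changed direction
    last_correct = None
    for _ in range(n_trials):
        triplet = random.sample(DIGITS, 3)
        response = present_trial(triplet, snr)
        correct = response == triplet
        if last_correct is not None and correct != last_correct:
            reversals.append(snr)
        # Correct response -> harder (lower SNR); incorrect -> easier.
        snr += -step if correct else step
        last_correct = correct
    # Estimate the SRT as the mean SNR at the reversal points.
    return sum(reversals) / len(reversals) if reversals else snr
```

Because the response set is closed and small, a correct/incorrect score per trial is all the procedure needs, which is what makes the task tractable for children as young as 4.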
