Psychometricians and experimental psychologists inhabit radically different worlds (Borsboom et al., 2009; Cronbach, 1957). The former explore domains where performance varies substantially across participants, looking for ever better ways to reliably detect those individual differences. The latter, in contrast, crave robust effects that, ideally, will be observed in every single person, often with little or no variation from one participant to the next. These two species rarely venture into each other’s habitat, and in the exceptional cases when this happens, the outcome is not always a happy one.
Although concerns about the suitability of experimental tasks for studying individual differences have long been recognized in fields such as intelligence and aging research, other areas – such as inhibition or implicit learning and memory (Rouder & Haaf, 2019; Rouder et al., 2023b) – are only now beginning to address these challenges. Unfortunately, although these tasks are known to yield large effects at the group level, they tend to be less sensitive to individual differences (Hedge et al., 2018). Consequently, attempts to use these tasks to study individual differences have sometimes resulted in contradictory findings and failed replications (e.g., Karr et al., 2018, Paap and Greenberg, 2013, Rey-Mermet et al., 2019, Ross et al., 2015, Von Bastian et al., 2020). One potential reason for this is that, when experimenters enter the realm of psychometrics, they commit to certain assumptions without evaluating whether those assumptions hold in their data. A clear example is submitting variables to analyses that implicitly assume they have been measured with little or no measurement noise (Vadillo et al., 2020, Vadillo et al., 2022). On the occasions when this assumption has been evaluated, the psychometric properties of the measures provided by these tasks, although sometimes acceptable (e.g., Viviani et al., 2024), often leave much to be desired (e.g., Draheim et al., 2019, Enkavi et al., 2019, Garre-Frutos et al., 2024, Hedge et al., 2018, Hernández-Gutiérrez et al., 2025, Paap and Sawi, 2016, Rothkirch et al., 2022, Vadillo et al., 2022, Yaron et al., 2024).
Measurement noise, also known as measurement error, refers to the random variability, inherent in the measurement process, that cannot be attributed to the effect intended to be measured. Note that this noise is inversely related to reliability and is not intrinsic to the task; rather, it is also a function of the sample and the administration conditions under which the measure was obtained. Other things being equal, the larger the trial sample (i.e., the number of trials), the more the trial-level noise averages out and the smaller the measurement noise in the resulting score. However, assuming that a large number of trials will necessarily yield adequate reliability is risky, so it is always advisable to estimate the measurement noise empirically with methods such as split-half coefficients or signal-to-noise ratios (Rouder et al., 2023a).
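For concreteness, the sketch below (in Python, with entirely hypothetical data and parameter values) illustrates one common way to obtain such an estimate: a permutation-based split-half correlation of trial-level scores, stepped up to the full test length with the Spearman-Brown formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-trial scores: one row per participant, one column per trial.
# Stable individual differences (SD = 20) are buried under much larger
# trial-level noise (SD = 100), as is typical of RT-based scores.
n_subjects, n_trials = 60, 80
true_scores = rng.normal(0, 20, size=(n_subjects, 1))
trial_scores = true_scores + rng.normal(0, 100, size=(n_subjects, n_trials))


def split_half_reliability(scores, n_splits=1000, rng=rng):
    """Permutation-based split-half reliability with Spearman-Brown correction."""
    n_trials = scores.shape[1]
    estimates = []
    for _ in range(n_splits):
        perm = rng.permutation(n_trials)
        half_a = scores[:, perm[:n_trials // 2]].mean(axis=1)
        half_b = scores[:, perm[n_trials // 2:]].mean(axis=1)
        r = np.corrcoef(half_a, half_b)[0, 1]
        estimates.append(2 * r / (1 + r))  # Spearman-Brown step-up to full length
    return float(np.mean(estimates))


print(f"Estimated split-half reliability: {split_half_reliability(trial_scores):.2f}")
```

With these hypothetical values, the theoretical reliability of the 80-trial mean is about .76, and the permutation-based estimate should converge on a similar figure; with fewer trials or noisier responses, it drops quickly.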
The study of unconscious learning and memory processes provides many examples of strong inferences drawn with insufficient consideration of psychometric requirements. A common argument to support the claim that learning and memory can operate unconsciously is to report evidence that an improvement in task performance is uncorrelated with participants’ awareness of the regularities driving performance (e.g., Colagiuri and Livesey, 2016, Jiang et al., 2018, Salvador et al., 2018, Soto et al., 2011). For instance, Soto et al. (2011) found a null correlation between performance in a working memory task and awareness of the cues’ visibility, which was taken as evidence that working memory can operate with unconscious representations.
The intuition behind these analyses is that if participants’ awareness and learning are uncorrelated, they must be based on different processes or representations: learning must be based on something that leaves no trace on awareness measures. However, an alternative and far simpler explanation for the lack of correlation is that either or both measures are contaminated by substantial amounts of measurement noise, which attenuates the observed correlation between them (Hunter & Schmidt, 2015; Spearman, 1904). If this is indeed the case, the two measures will not correlate with each other, even if, at the latent level, they tap into the same cognitive processes.
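In Spearman’s (1904) classical formulation, the correlation we can expect to observe is the latent correlation attenuated by the square root of the product of the two reliabilities:

$$r_{\text{observed}} \;=\; \rho_{\text{true}} \, \sqrt{r_{xx'}\, r_{yy'}}$$

For instance, with a true correlation of .60 and reliabilities of .50 and .40 (values chosen purely for illustration), the expected observed correlation shrinks to .60 × √.20 ≈ .27, a value that small samples will often fail to distinguish from zero.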
One cannot discriminate between these two hypotheses (i.e., learning is unconscious vs. learning is conscious but obscured by excessive measurement noise) without some estimate of the internal consistency of the dependent variables involved in the analysis. Sadly, reliabilities are seldom reported in these experiments (Vadillo et al., 2022) and, on the rare occasions when they are estimated, they often turn out to be low enough to cast doubt on the appropriateness of the analyses (e.g., Anderson and Kim, 2019, Arnon, 2020, Bogaerts et al., 2018, Erickson et al., 2016, Kalra et al., 2019, Kaufman et al., 2010, Siegelman and Frost, 2015, Smyth and Shanks, 2008, Vadillo et al., 2020, Vadillo et al., 2022, Vadillo et al., 2024, West et al., 2018, Yaron et al., 2024).
To make things worse, several tasks used in the study of unconscious learning and memory apply this logic to single-trial measures of awareness (such as probability cuing, e.g., Jiang et al., 2015, Vadillo et al., 2020; attentional capture, e.g., Adams & Gaspelin, 2020; and distractor suppression in the additional singleton task, Wang and Theeuwes, 2018a, Wang and Theeuwes, 2018b, Wang and Theeuwes, 2018c), commonly administered at the end of the experiment. Although these designs typically involve large trial samples to ensure low measurement noise for search times, no comparable level of confidence is sought for the awareness measure. The problem is further exacerbated by the fact that the internal consistency of a single measurement is, by definition, undefined, so its reliability cannot be estimated with conventional methods (such as split-half reliability).
A related concept, more extensively discussed in the literature, is sampling noise. Like measurement noise, sampling noise refers to the random variability, inherent in the participant sampling process, that cannot be attributed to the population effect intended to be estimated. This is why sample size is one of the main considerations when designing experiments: the larger the sample, the smaller the sampling noise in the estimated parameter. Despite its importance, there is still no well-established routine for adequately justifying sample sizes (Lakens, 2022), leading to samples that may not be sufficiently powered to detect the effects of interest – in this case, an interaction between awareness and learning. A common argument for testing a specific number of participants is that the sample size is similar to those of previous studies in the same literature, but this does not guarantee that those sample sizes were sufficiently informative. For instance, in the domain of location probability learning, which is the main focus of the present study, median sample sizes are typically between 16 and 24.1 With these samples, the minimum detectable correlation with 90% power is .705 for contextual and probabilistic cuing tasks, and .601 for the additional singleton task.2 Given that some amount of measurement noise is a very reasonable assumption for the measures taken in these tasks, there are good reasons to expect that the empirical correlations involving them will often be substantially smaller than these best-case values.
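As a rough sanity check on these figures, the minimum detectable correlation for a given sample size can be approximated with Fisher’s z transformation, as in the sketch below; exact power calculators (e.g., G*Power) use slightly different methods, so the approximation reproduces the values above only approximately, and the function name and defaults are ours.

```python
from math import sqrt, tanh

from scipy.stats import norm


def min_detectable_r(n, alpha=0.05, power=0.90):
    """Smallest correlation detectable with the requested power (two-sided test),
    based on the Fisher z approximation to the sampling distribution of r."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return tanh((z_alpha + z_beta) / sqrt(n - 3))


for n in (16, 24):
    print(f"n = {n:2d}: minimum detectable r ~ {min_detectable_r(n):.3f}")
```

For n = 16 and n = 24, this approximation yields roughly .72 and .61, in line with the values reported above.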
It is worth noting that measurement and sampling noise have related but different impacts on statistical analyses. Measurement noise induces a systematic attenuation of the effect, whereas sampling noise produces unsystematic error in its estimation. Ensuring a large sample size alone, without addressing the measurement noise that comes from having only one trial for the awareness test, is particularly risky; we could end up being very confident about an effect that is, in fact, substantially attenuated (Rouder et al., 2023b).
In the present study we illustrate the scope of these problems – measurement and sampling noise – in the domain of location probability learning by adopting a modeling-based approach. This approach will allow us to achieve two goals: first, to provide a model-based estimate of how much measurement and sampling noise there is in empirical data from these tasks; second, to find out how often researchers would mistakenly conclude that learning is unconscious when the data are generated under the alternative assumption (learning is conscious) with varying amounts of measurement and sampling noise in participants’ responses. In sum, this study serves as a timely illustration for experimental psychologists of the risks of neglecting basic psychometric requirements when assessing individual differences.
Numerous studies have argued that our ability to guide attention is strongly influenced by unconscious learning processes that enable us to detect and memorize statistical regularities in our environment (Chun and Jiang, 1998, Chun and Turk-Browne, 2008, Ferrante et al., 2018, Geng and Behrmann, 2005, Jiang, 2018, Gaspelin and Luck, 2018, Krishnan et al., 2022, Stilwell et al., 2019, Turk-Browne et al., 2005). While much of this research is concerned with how we learn to attend to locations where a target is likely to appear, many studies have made the complementary claim that, with sufficient practice, we can unconsciously memorize and ignore locations in a scene where salient but irrelevant distractors typically appear. This phenomenon is commonly studied using the additional singleton task (e.g., Di Caro et al., 2019, Gao and Theeuwes, 2020, Gao and Theeuwes, 2022, Lin et al., 2021, Van Moorselaar and Theeuwes, 2021, Wang and Theeuwes, 2018a, Wang and Theeuwes, 2018b, Wang and Theeuwes, 2018c). This increasingly popular location probability learning task serves as a perfect example for the present purposes.
In a typical setting, participants are instructed to find a target with a particular shape among a series of shapes (e.g., a diamond among circles) and report the orientation of the line inside the target (see Fig. 1). Participants are warned that on some trials the display includes a salient distractor presented in a different color (e.g., a red circle among green figures, the singleton). Response times (RTs) reveal that visual search is slower when this singleton is present (e.g., Theeuwes, 1992). Presumably this happens because participants adopt a strategy of detecting any stimulus with a unique feature, even if this leads them to pay attention to irrelevant stimuli as well (i.e., singleton detection mode; Adams & Gaspelin, 2020; Bacon & Egeth, 1994). Crucially, if the singleton distractor appears frequently at a particular location throughout the experiment, participants are able to memorize this regularity and suppress attention to that particular location (e.g., Vicente-Conesa et al., 2023, Wang and Theeuwes, 2018a, Wang and Theeuwes, 2018b, Wang and Theeuwes, 2018c). This suppression effect is revealed by the fact that, as learning progresses, visual search for the target becomes memory-guided: RTs are faster when the singleton appears at the high-probability (HP) location than when it appears at any other low-probability (LP) location. In essence, participants can memorize the HP location and learn not to be distracted by the singleton when it appears there. In fact, research on visual statistical learning has shaped current theorizing on memory systems (e.g., the role of the hippocampus, Covington et al., 2018). Fig. 1 illustrates two possible displays of the additional singleton task for one participant.
At the end of these experiments, participants are informed that the distractor appeared more frequently at a certain location and are then asked to recall and report this location (i.e., the awareness test). Although the suppression effect is very robust in RTs, many participants seem to be unable to correctly select the HP location (e.g., van Moorselaar & Theeuwes, 2021). When this happens, authors infer that location probability learning was implicit. In other studies, researchers remove those participants who correctly selected the HP location – the aware ones3 – from the analyses and observe that the suppression effect is still significant in the remainder (e.g., Di Caro et al., 2019). Again, this is taken as evidence that learning was unconscious, in the sense that awareness is not necessary for the effect to occur.
A third strategy consists of dividing the sample between those participants who correctly selected the HP location and those who did not – the aware and the unaware ones – and comparing their suppression effects. This is usually done through a two-way mixed ANOVA on RTs with one within-subjects factor that assesses suppression, the distractor location (HP or LP), and one between-subjects factor, accuracy in the awareness test (aware or unaware participants).4 Obtaining a non-significant interaction between these factors has typically been interpreted as evidence that location probability learning is unconscious (e.g., Failing et al., 2019, Gao and Theeuwes, 2022, Lin et al., 2021). In other words, if the distractor suppression effect, as measured in RTs, is not related to correctly responding to the awareness test, this is taken as evidence that the suppression effect and awareness must be generated by unrelated (or weakly related) latent processes (as Rouder & Haaf, 2019, suggest for inhibition).5 At first glance, the logic of these analyses seems completely reasonable. However, almost inadvertently, we have left experimental logic behind and entered the realm of individual differences – not because we are now interested in each individual’s capacity for unconscious probabilistic learning, but because we are dividing the group based on an individual-level score (i.e., awareness accuracy). It is here that psychometric properties reign and, therefore, here that we need to address them to guarantee valid inferences.
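To make this shift explicit, note that the critical awareness × distractor-location interaction in this 2 × 2 mixed design is equivalent to comparing, between the two groups, each participant’s suppression score (mean RT on LP-distractor trials minus mean RT on HP-distractor trials). The sketch below illustrates this equivalence with simulated per-participant means; all numbers are purely hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical per-participant mean RTs (in ms) for distractor-at-HP and
# distractor-at-LP trials, split by accuracy on the awareness test.
n_aware, n_unaware = 12, 12
hp_aware, lp_aware = rng.normal(760, 60, n_aware), rng.normal(820, 60, n_aware)
hp_unaware, lp_unaware = rng.normal(780, 60, n_unaware), rng.normal(810, 60, n_unaware)

# Each participant's suppression score is the LP minus HP mean RT. Comparing
# these scores between groups is equivalent to testing the awareness x location
# interaction of the 2 x 2 mixed ANOVA (the squared t equals the interaction F).
suppression_aware = lp_aware - hp_aware
suppression_unaware = lp_unaware - hp_unaware
t, p = stats.ttest_ind(suppression_aware, suppression_unaware)
print(f"Interaction as a t-test on suppression scores: t = {t:.2f}, p = {p:.3f}")
```

Framing the interaction as a between-groups comparison of individual suppression scores makes it clear that the test inherits all the psychometric properties of those scores.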
An astute reader may have noticed that claiming support for a theory based on a non-significant result is scientifically unsound: As Carl Sagan (1977) said, “absence of evidence is not evidence of absence” (p. 7). Yes, a non-significant interaction can be interpreted as evidence for the claim that location probability learning is independent of awareness, but have we discounted alternative plausible scenarios in which a non-significant interaction is expected to occur? In other words, let’s put ourselves in the opposite scenario and imagine that RTs and awareness are driven by a common underlying process. Would we have any reason to expect a non-significant interaction in that case too?
Imagine we are omniscient and know that there is a true interaction effect, that is, the true suppression effect in aware participants is indeed greater than in unaware ones. From our privileged point of view, the correct claim would be that “location probability learning is conscious” or, more precisely, that there is a true relationship between distractor suppression and awareness. Now imagine two research teams that want to check this interaction with empirical data. The first team has collected data and is preparing to analyse them. Unfortunately, these researchers encountered difficulty in recruitment and, although participants completed a large number of trials, the study ends up with a small sample size. As a consequence of their lack of statistical power to detect the effect, the researchers obtain an empirically non-significant interaction. The second research team, with much better access to participants, has obtained their measures with a sufficient sample size. However, these participants had less time available and so only completed a small number of trials, increasing the measurement noise of their dependent variables. As has been well known for decades now, measurement noise tends to attenuate true effects (Hunter and Schmidt, 2015, Lord et al., 1968). Unfortunately, the measurement noise in this team’s data was sufficient to attenuate the actual interaction effect to the point that they, too, obtain an empirically non-significant interaction.
What claim might both teams risk drawing from those non-significant interactions? That the suppression effect is unrelated to awareness responses and, thus, that there is evidence supporting the claim that location probability learning is unconscious. From our omniscient perspective, we know that both teams would be making an erroneous claim, and we can alert them to the need to first minimize the sources of noise affecting their design (insufficient sample size: sampling noise) and their measures (insufficient reliability: measurement noise).
We should not underestimate the potential impact this could be having on current theorizing. For instance, the whole idea that selection history is a “new” attentional control mechanism, different from bottom-up or top-down attention (Awh et al., 2012), is almost entirely built on the assumption that these effects are unconscious (and therefore not top-down; Giménez-Fernández et al., 2023).
Unfortunately, we are not omniscient, but we can adopt the privileged position of data simulation, where the ground truth is known. To simulate data, we first need a model of how our dependent variables are distributed as a function of the independent variables manipulated in the experiment. In this article, we present a statistical model that describes both RTs and awareness responses for each participant in the additional singleton task. Given the difficulty of estimating reliability from experimental data (impossible for the single-trial awareness measure), the model allows us to simulate data from a single-process theory (i.e., assuming that learning is conscious) while varying sampling and measurement noise.
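To give a flavour of this approach (a schematic sketch, not the model reported below; all parameter values and distributional choices are hypothetical), a single-process account can be simulated by giving each participant one latent learning strength that simultaneously drives the RT suppression effect and the probability of answering the single-trial awareness question correctly.

```python
import numpy as np

rng = np.random.default_rng(2)


def simulate_experiment(n_subjects=24, n_trials_per_cond=100, n_locations=4):
    """Single-process sketch: one latent learning strength per participant drives
    both the RT suppression effect and the one-shot awareness response."""
    # Latent learning strength, in ms of RT benefit (hypothetical distribution).
    learning = rng.gamma(shape=2.0, scale=15.0, size=n_subjects)

    # Trial-level RTs: HP-distractor trials are sped up in proportion to each
    # participant's learning strength; everything else is trial-level noise.
    rt_hp = rng.normal(800 - learning[:, None], 120, (n_subjects, n_trials_per_cond))
    rt_lp = rng.normal(800, 120, (n_subjects, n_trials_per_cond))
    suppression = rt_lp.mean(axis=1) - rt_hp.mean(axis=1)

    # Single-trial awareness response: the probability of picking the HP location
    # grows with the same latent strength, starting from chance (1 / n_locations).
    chance = 1 / n_locations
    p_correct = chance + (1 - chance) * (1 - np.exp(-learning / 30))
    aware = rng.random(n_subjects) < p_correct

    return suppression, aware


suppression, aware = simulate_experiment()
print(f"{aware.sum()} aware vs. {(~aware).sum()} unaware participants")
print(f"Mean suppression, aware:   {suppression[aware].mean():.1f} ms")
print(f"Mean suppression, unaware: {suppression[~aware].mean():.1f} ms")
```

Under this single-process account, whether the simulated awareness × location interaction reaches significance depends heavily on the number of participants and on the number of trials contributing to each RT mean, which is precisely the question addressed by the simulations reported below.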
This modeling exercise will serve as an informative illustration of how neglecting basic psychometric requirements, such as a sufficient sample size and minimal measurement noise, can undermine the statistical power to detect effects of interest. Note that our goal is not to find the most accurate underlying model of the real cognitive processes involved in the additional singleton task or in location probability learning. Instead, what we will try to show is how an empirical result often found in the literature, and interpreted as evidence for a certain theory, can be generated from a contrasting theory plus some near-ubiquitous statistical artifacts. Therefore, our aim is to encourage researchers working on this and related tasks and research questions to assess meticulously whether their data fulfil the crucial requirements for drawing valid conclusions about individual differences in the experimental context.