This work presents a series of experiments that compare the performance of human speech recognition (HSR) and automatic speech recognition (ASR). The goal of this line of research is to learn from the differences between HSR and ASR and to use this knowledge to incorporate new signal processing strategies from the human auditory system into automatic classifiers. A database of noisy nonsense utterances is used for both HSR and ASR experiments, with a focus on the influence of intrinsic variation (arising from changes in speaking rate, effort, and style). A standard ASR system is found to reach the human performance level only when the signal-to-noise ratio is increased by 15 dB, which can be seen as the human–machine gap for speech recognition at a sub-lexical level. The sources of intrinsic variation are found to severely degrade phoneme recognition scores in both HSR and ASR. A comparison of utterances produced at different speaking rates indicates that temporal cues are not optimally exploited in ASR, which results in a strong increase in vowel confusions. Alternative feature extraction methods that take into account temporal and spectro-temporal modulations of speech signals are discussed.