Meaning through corpora
The seeds of semantic change are found in synchronic meaning variation in everyday discourse. Although many differences in meaning and usage between different people are obvious, still more of this variation occurs below the threshold of our awareness, and we fail to notice the ways in which language is changing all around us. Only in exceptional circumstances do we become aware of the extent of the variation in meaning and usage. Stubbs (2001) reports the story of Monica Baldwin, who entered a strict convent shut off from the outside world in 1913. She spent the next 28 years in complete isolation, without access even to newspapers or radio, the other members of the convent providing her only company. In her autobiography, Baldwin records the words and expressions she did not understand after she left the convent in 1941. These included cocktail, hard boiled, have a hunch, nosey parker, it’s your funeral, jazz, close-up, streamlining, plus fours, cutie, robot, parking, Hollywood, believe it or not and striptease. Some of these expressions may, of course, now be unfamiliar to contemporary readers.
Because of the vast amount of data involved, studying this language-wide variation in depth is not feasible without the use of computers. The field of corpus linguistics uses computers to store and analyse corpora (Latin: ‘bodies’; singular corpus), collections of large amounts of text. There are several large corpora of written and spoken English, such as the British National Corpus, the Cobuild corpus, the London–Lund corpus of British English (LLC), the Lancaster-Oslo/Bergen corpus (LOB) and the Australian Corpus of English. Corpora of various kinds are also available for many other languages, including Japanese, Chinese and most European languages. The wide availability of these corpora means that the methods of corpus linguistics are replicable: a statistical analysis of a corpus can be repeated and confirmed by any number of researchers.
Corpora are useful for semantic analysis because they can reveal unsuspected patterns of collocation, or regular word combination. The extent to which discourse consists of predictable word-sequences is easy to underestimate. In general, corpus study has shown that many words (called nodes in corpus linguistics) have fairly predictable patterns of collocation within a given span (the number of words taken into account before and after the node). Take brightly as an example. Stubbs (2001: 81) reports that a search of the Cobuild corpus shows that brightly occurs 1,467 times. In 26% of these occurrences, it occurs within four words of coloured. This is conventionally represented in the following way:

Such a high degree of collocational predictability isn’t uncommon. Roughly four per cent of a sample of the headwords in the Cobuild corpus fell into a category in which the most frequent collocate occurred with the node in at least twenty per cent of cases (Stubbs 2001: 81). Here are some examples:

For another twenty per cent of all nodes, the top collocate was found with the node in between ten and twenty per cent of the node’s occurrences, while for forty per cent of nodes, the top collocate was recorded in between five and ten per cent of cases.
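The node-and-span counting behind figures like these can be sketched in a few lines of Python. This is only a minimal illustration, not Stubbs's own procedure: the function name `collocate_shares`, the whitespace tokenisation and the four toy sentences are all assumptions made for the example. For each word found within `span` tokens of the node, the function returns the proportion of the node's occurrences that the word accompanies.

```python
from collections import Counter


def collocate_shares(sentences, node, span=4):
    """Count the node and the share of its occurrences each collocate joins.

    `sentences` is a list of token lists; windows never cross a sentence
    boundary. A collocate is counted at most once per node occurrence.
    """
    hits = Counter()
    node_count = 0
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            node_count += 1
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            hits.update(set(left + right) - {node})
    if node_count == 0:
        return 0, {}
    return node_count, {w: c / node_count for w, c in hits.items()}


# Toy sentences invented for illustration; not drawn from any real corpus.
sentences = [s.split() for s in (
    "the brightly coloured birds sang",
    "a brightly lit room",
    "she wore brightly coloured scarves",
    "the sun shone brightly today",
)]

node_count, shares = collocate_shares(sentences, "brightly", span=4)
print(node_count, shares["coloured"])  # 4 0.5
```

In this toy sample the node brightly occurs four times and coloured falls within the span in two of them, giving a share of 50%. Ranking collocates by these shares and keeping the highest gives the 'top collocate' figure discussed above.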
If one considers variant forms of the collocates, these scores often go up impressively. Stubbs (2001: 83) gives the following examples, in which the combined proportion of the collocates’ occurrences, given as the final percentage, accounts for a significant proportion of the nodes’ occurrences:

Thus, almost one in five occurrences of cheering is collocated with crowd(s), while more than two in five occurrences of resemblance occur with a form of the verb bear. And once one takes into account semantically related words (synonyms, hyponyms, hyperonyms, etc.), the figures skyrocket (Stubbs 2001: 83):

Findings like these, according to Stubbs, ‘show that there is a level of organization beyond lexis and syntax, which is only starting to be systematically studied, and which is not reducible to any other level of organization’ (2001: 97).
Studies of collocation can give surprising results. In an earlier study, Stubbs found that nearly 80% of the 38,000 occurrences of cause (both the noun and the verb) in a 120-million-word corpus were paired with clearly negative collocates in the span of three words before or after. The most frequent collocates are the following:

On the evidence of this corpus, cause is not used neutrally, as most speakers would probably guess, but has a strong tendency to be associated with negative events. This tendency is not yet strong enough to count as a connotation of cause, but it constitutes a striking regularity which would come as a surprise to most speakers. Simply introspecting about the meaning of cause would be unlikely to reveal the collocational tendencies uncovered by the corpus search.
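A measurement of this kind, the share of a word's occurrences that have an evaluatively marked collocate nearby, might be computed roughly as follows. The sketch rests on several assumptions: the function name `marked_share`, the hand-picked set of 'negative' words and the toy sentences are all invented for illustration. In a real study the classification of collocates as negative is the analyst's judgement and would need to be justified separately.

```python
def marked_share(sentences, node_forms, marked, span=3):
    """Share of node occurrences with at least one `marked` word in the span.

    `node_forms` lists the inflected forms treated as the same node;
    `marked` is the set of collocates judged (by the analyst) to be negative.
    """
    total = 0
    flagged = 0
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok not in node_forms:
                continue
            total += 1
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            if marked.intersection(window):
                flagged += 1
    return flagged / total if total else 0.0


# Hypothetical inputs, invented for illustration only.
cause_forms = {"cause", "causes", "caused"}
negative = {"damage", "problems", "death", "trouble"}
toy = [s.split() for s in (
    "smoking can cause serious damage",
    "the delay caused problems for everyone",
    "what causes the tide to turn",
    "the cause of death was unclear",
)]

share = marked_share(toy, cause_forms, negative, span=3)
print(share)  # 0.75
```

Three of the four toy occurrences have a word from the negative set within three words of the node, so the function returns 0.75. Noun and verb forms are pooled here simply by listing them in `cause_forms`.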
The situation with cause is not unusual. Stubbs comments that ‘[a]ll of the most frequent content words in the language are involved in [collocational] patterning. This is not a peripheral phenomenon (collocations are not an idiosyncratic feature of just a few words), but a central part of communicative competence’ (2001: 96). Another example of this situation comes from Channell (2000). Consider regime. Intuitively, one would say that it simply refers to a ruling political administration. Channell discovered, however, that the most frequent collocates of the word in the British Cobuild corpus were military, communist, ancien, Nazi, Soviet, Vichy, fascist, present and Iraqi. Channell comments that these are words ‘which from a British perspective represent those types of government which are generally disapproved of’ (2000: 46). Regime, in other words, seems to have a tendency to occur in unfavourable contexts. Native English speakers would not necessarily have predicted this result through merely introspecting about the word’s meaning. Channell also investigated the phrase roam the streets. There are 113 occurrences of this in the Bank of English corpus, with the subjects prostitutes, vagrant children, armed men, mobs, looters, right-wing youth gangs and neo-Nazis, vandals, wild dogs and bigots (Channell 2000: 53). The activities associated in the corpus with roam the streets included searching for food, attacking people, stoning cars, randomly beating people, burning and looting and rioting. This collocation is, then, typically associated with activities that are dangerous, threatening and censured. Again, this is not a result that is available through mere introspection. Channell predicts on the basis of these data that the negative evaluation associated with roam in these collocations will extend to all uses of the verb, and become one of its regular connotations.
Partington (2004) examined the English adverbs completely, entirely, totally and utterly. These share a large number of collocates with each other, and, as a group, share very few collocates with apparently broadly synonymous adverbs like perfectly or absolutely. Partington reports some interesting patterns. Utterly, for instance, modifies items that ‘almost invariably express either the general sense of “absence of a quality” or some kind of “change of state”’ (2004: 147), such as helpless, useless, unable, forgotten; changed, different; failed, ruined and destroyed. Only two of the collocates of utterly had positive connotations: pleasant and clear. Totally also had many ‘absence’ or ‘lack of’ collocates, such as bald, exempt, incapable, irrelevant, lost, oblivious, uneducated, unemployed, unexpected, unknown, unpredictable, unsuited, ignored, excluded, unfamiliar, blind, ignorant, meaningless, unaware, unable, vanished, naked and without. Similar patterns of collocation were found for completely and entirely.
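The idea of adverbs 'sharing' collocates can be made concrete as a comparison of sets. The sketch below is an assumption-laden illustration rather than Partington's method: it uses only the sample collocates quoted in the paragraph above, and the overlap measure (the Jaccard index, the size of the intersection divided by the size of the union) is one common choice among several. Because the quoted lists are selections, the resulting number says nothing about the true degree of overlap in the corpus.

```python
def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Sample collocates as quoted in the text above, not full corpus lists.
utterly = {
    "helpless", "useless", "unable", "forgotten", "changed", "different",
    "failed", "ruined", "destroyed", "pleasant", "clear",
}
totally = {
    "bald", "exempt", "incapable", "irrelevant", "lost", "oblivious",
    "uneducated", "unemployed", "unexpected", "unknown", "unpredictable",
    "unsuited", "ignored", "excluded", "unfamiliar", "blind", "ignorant",
    "meaningless", "unaware", "unable", "vanished", "naked", "without",
}

shared = utterly & totally
overlap = jaccard(utterly, totally)
print(shared)  # {'unable'}
```

With full collocate lists for each adverb, the same comparison could be run pairwise across completely, entirely, totally and utterly, and against perfectly or absolutely, to see which adverbs pattern together.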
Speakers are mostly unaware of these sorts of patterns. As Channell observes (2000: 54), ‘it is disturbing to discover that important aspects of the use of lexical items are not open to conscious reflection’. The regularities of use demonstrated by Stubbs, Partington and Channell are clearly robust enough to warrant linguists’ attention, but they are hard to come to grips with theoretically. Specifically, the regularities of use revealed by corpus study do not seem to fit neatly into the categories of either an expression’s literal meaning or its connotation (see 1.4.2). The differences between synonymous adverb intensifiers demonstrated by Partington operate among words with near-identical literal meanings. Perhaps, you might think, that shows they are connotational: perhaps utterly, for example, has a connotation ‘absence of a quality’ or ‘change of state’. But this suggestion is clearly not plausible: most connotations are fairly stable aspects of an expression’s meaning which are hard to cancel. The correlations we have seen in this section are mostly not like this. It isn’t a connotation of utterly that it refer to absences of a quality: we can say things like the meal was utterly perfect without the slightest feeling of clash. Similarly, it is not a connotation of cause that it be associated with negative occurrences. Nevertheless, corpus data demonstrate that these words show these associations in a significant proportion of cases. This raises the questions of just what, on the speaker level, causes these patterns, and of how they are to be described linguistically. As Channell points out, to talk of the collocational facts discussed here as facts about ‘meaning’ is to use that term in a non-standard sense.