COHA Corpus
Corpus of Historical American English (1810-2000)
Corpus Overview
The Corpus of Historical American English (COHA) is a balanced corpus of historical American English, containing over 472 million words of text from 1810-2000. COHA is the largest corpus of historical American English and enables diachronic (historical) analysis of language change over nearly two centuries.
For comprehensive documentation, corpus design details, and research applications, see the official COHA website.
Version Note: The COHA corpus available on this platform is an older version that was purchased for offline use. This version covers 1810-2000 and is separate from the continuously updated online version available on the official COHA website.
Key Statistics
| Metric | Value |
|---|---|
| Total Words | ~472 million |
| Time Period | 1810-2000 |
| Genres | 5 (see below) |
| Texts | ~116,000 |
| Language | Historical American English |
Genre Distribution
COHA is balanced across five major genres:
| Code | Genre | Description |
|---|---|---|
| ACAD | Academic | Journal articles, books, reports |
| FIC | Fiction | Novels, short stories, plays |
| MAG | Magazine | News magazines, lifestyle |
| NEWS | News | Newspapers, news websites |
| SPOK | Spoken | TV shows, movies, transcripts |
Time Periods
The corpus is organized by decade. The distribution reflects the availability of digitized texts over time:
| Decade | Years | Approximate Tokens |
|---|---|---|
| 1810s | 1810-1819 | ~1.4M |
| 1820s | 1820-1829 | ~8.1M |
| 1830s | 1830-1839 | ~16.1M |
| 1840s | 1840-1849 | ~18.7M |
| 1850s-1980s | 1850-1989 | ~19-30M per decade |
| 1990s | 1990-1999 | ~33.0M |
| 2000 | 2000 | ~34.8M (single year) |
Note: Early decades (1810s-1840s) have fewer tokens due to limited availability of digitized historical texts. The corpus size increases over time as more material becomes available. The 1990s and 2000 have the most tokens, reflecting the greater availability of digital texts from this period.
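Because decade sizes differ so much, raw hit counts are not directly comparable across decades. A minimal sketch of per-million normalization follows, using the approximate token counts from the table above. The hit counts are hypothetical values chosen for illustration, not real COHA results, and `per_million` is an illustrative helper, not part of the platform.

```python
# Approximate decade sizes in tokens, taken from the table above.
DECADE_TOKENS = {
    "1810s": 1_400_000,
    "1820s": 8_100_000,
    "1830s": 16_100_000,
    "1840s": 18_700_000,
    "1990s": 33_000_000,
}


def per_million(raw_hits: int, decade: str) -> float:
    """Return the frequency of raw_hits per million tokens in the given decade."""
    return raw_hits / DECADE_TOKENS[decade] * 1_000_000


# Hypothetical hit counts, for illustration only (not real COHA results).
example_hits = {"1820s": 81, "1990s": 165}

for decade, hits in example_hits.items():
    print(f"{decade}: {hits} hits = {per_million(hits, decade):.1f} per million")
```

In this made-up case the 1990s have about twice as many raw hits, yet the 1820s have twice the relative frequency, which is why normalized values are the safer basis for diachronic comparison.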
Available Attributes
The corpus includes the following searchable attributes:
| Attribute | Description | Example |
|---|---|---|
| word | Exact word form | [word="the"] |
| lemma | Base form (lemma) | [lemma="run"] |
| tag | POS tag (CLAWS7, lowercase) | [tag="nn.*"] |
| doc.year | Publication year | Filter: 1850-1900 |
| doc.genre | Genre category | Filter: ACAD, FIC, etc. |
POS Tagging
COHA uses the CLAWS7 tagset (stored in lowercase). Common tags include:
- nn.* - Nouns (nn, nn1, nn2, nnt1, etc.)
- v.* - Verbs (vvd, vvg, vvn, vbz, etc.)
- jj.* - Adjectives (jj, jjr, jjt)
- rr.* - Adverbs (rr, rrr, rrt)
See the CLAWS7 tagset documentation for complete reference.
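To see which tags a pattern such as nn.* covers, the matching can be imitated with Python regular expressions. This sketch assumes that the platform matches a pattern against the whole attribute value, as Sketch Engine CQL does; that is an assumption about this platform, and `matching_tags` is an illustrative helper.

```python
import re

# A few CLAWS7 tags in lowercase, as stored in the corpus.
SAMPLE_TAGS = ["nn1", "nn2", "nnt1", "vvd", "vvg", "vbz", "jj", "jjr", "rr", "rrr"]


def matching_tags(pattern: str, tags: list[str]) -> list[str]:
    """Return the tags that the pattern matches in full (anchored at both ends)."""
    return [tag for tag in tags if re.fullmatch(pattern, tag)]


print(matching_tags("nn.*", SAMPLE_TAGS))  # all noun tags in the sample
print(matching_tags("v.*", SAMPLE_TAGS))   # all verb tags, including forms of "be"
print(matching_tags("vv.*", SAMPLE_TAGS))  # lexical verbs only
```

A narrower prefix such as vv.* excludes tags like vbz, which is useful when auxiliary forms would otherwise dominate the results.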
Diachronic Analysis
COHA’s primary strength is enabling diachronic (historical) analysis of language change. The corpus is organized by decade, allowing you to track how words, phrases, and grammatical patterns change over time.
Timeline Feature
Use the Frequency feature with decade filters to visualize language change:
- Run a query: [lemma="technology"]
- Use the Frequency panel to see distribution by doc.year
- Observe how frequency increases dramatically in the 20th century
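The same per-decade view can be approximated outside the interface. The sketch below groups hits by decade, assuming each hit carries its publication year under `hit["metadata"]["year"]`, as in the Python example under Programmatic Access. The sample hits are hypothetical, and `hits_per_decade` is an illustrative helper.

```python
from collections import Counter


def hits_per_decade(hits: list[dict]) -> dict[int, int]:
    """Count hits per decade, keyed by the decade's first year (e.g. 1950)."""
    counts = Counter()
    for hit in hits:
        year = int(hit["metadata"]["year"])
        counts[year // 10 * 10] += 1
    return dict(sorted(counts.items()))


# Hypothetical hits shaped like the entries of results["kwic"].
sample_hits = [
    {"metadata": {"year": 1895}},
    {"metadata": {"year": 1951}},
    {"metadata": {"year": 1958}},
    {"metadata": {"year": 1992}},
    {"metadata": {"year": 1995}},
    {"metadata": {"year": 1999}},
]

for decade, count in hits_per_decade(sample_hits).items():
    print(f"{decade}s: {count}")
```

These are raw counts. For a fair comparison across decades, divide each count by the decade's token total from the Time Periods table.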
Tracking Language Change
Example: track the emergence of a new term. Run [lemma="internet"], then filter by decade to see when internet first appears (1990s) and how its frequency increases.
CQL Query Examples
Basic Word Search
Find all occurrences of a word, for example [word="society"].
Lemma Search (All Forms)
Find all inflected forms of a word across two centuries, for example [lemma="develop"].
This finds develop, develops, developed, and developing across all time periods. The noun development has its own lemma and needs a separate query.
Temporal Queries
Search within a specific time period:
- Filter by year range (e.g., 1850-1900)
- Run: [lemma="railroad"]
Compare usage across different decades to observe semantic shifts and frequency changes.
Genre-Specific Historical Analysis
Compare genre usage over time:
- Select FIC (fiction) genre filter
- Filter by year range (e.g., 1900-1950)
- Run: [lemma="automobile"]
Compare with NEWS genre to see how fiction vs. news usage differs historically.
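When results are retrieved through the API, the same comparison can be sketched on the client side. The function below keeps hits within a year range and, optionally, one genre. It assumes each hit exposes `metadata["year"]` (as in the Python example under Programmatic Access) and a `metadata["genre"]` key; the genre key is an assumption about the response, and the sample hits are hypothetical.

```python
def filter_hits(hits, start_year, end_year, genre=None):
    """Keep hits published in [start_year, end_year], optionally in one genre."""
    selected = []
    for hit in hits:
        meta = hit["metadata"]
        if not start_year <= int(meta["year"]) <= end_year:
            continue
        # The "genre" key is assumed; adjust it to the actual response schema.
        if genre is not None and meta.get("genre") != genre:
            continue
        selected.append(hit)
    return selected


# Hypothetical hits for [lemma="automobile"], for illustration only.
sample_hits = [
    {"metadata": {"year": 1895, "genre": "FIC"}},
    {"metadata": {"year": 1912, "genre": "FIC"}},
    {"metadata": {"year": 1925, "genre": "NEWS"}},
    {"metadata": {"year": 1948, "genre": "FIC"}},
    {"metadata": {"year": 1963, "genre": "NEWS"}},
]

fic = filter_hits(sample_hits, 1900, 1950, genre="FIC")
news = filter_hits(sample_hits, 1900, 1950, genre="NEWS")
print(f"FIC 1900-1950: {len(fic)} hits, NEWS 1900-1950: {len(news)} hits")
```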
Combining Attributes
Find plural nouns in a specific time period, for example [lemma="city" & tag="nn2"].
Filter by 1900-2000 to see urbanization reflected in language.
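When queries are generated in code, combined attribute tests can be assembled rather than typed by hand. The helper below is a hypothetical convenience, not part of the platform. It joins attribute tests with `&` inside one token and does no escaping, so values containing double quotes are out of scope.

```python
def token(**attrs: str) -> str:
    """Build one CQL token expression, joining attribute tests with '&'."""
    if not attrs:
        return "[]"  # an empty token matches any word
    tests = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{tests}]"


print(token(lemma="city", tag="nn2"))
print(token(lemma="think", tag="vvd"))
```

Keyword arguments keep their order, so the attributes appear in the query in the order they were passed.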
Programmatic Access (API)
Query the corpus programmatically via the REST API:
Example Request (curl)
```bash
curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"technology\"]",
    "corpus_id": "coha",
    "limit": 50
  }'
```

Example Request (Python)

```python
import requests

# Send the CQL query to the local API endpoint.
response = requests.post(
    "http://localhost:8000/cql/run",
    json={
        "cql": '[lemma="technology"]',
        "corpus_id": "coha",
        "limit": 50,
    },
)
response.raise_for_status()  # fail early on HTTP errors
results = response.json()

print(f"Found {results['total']} matches")
# Show the year and context of the first five hits.
for hit in results["kwic"][:5]:
    print(f"{hit['metadata']['year']}: {hit['kwic']}")
```

Quick Reference
| Goal | CQL Query |
|---|---|
| Exact word | [word="society"] |
| All forms (lemma) | [lemma="develop"] |
| Plural nouns only | [lemma="city" & tag="nn2"] |
| Past tense verbs | [lemma="think" & tag="vvd"] |
| Adjacent words | [lemma="steam"] [lemma="engine"] |
| Up to 3 words in between | [lemma="rail"] []{0,3} [lemma="road"] |
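The last two rows differ only in the gap allowed between the two tokens. A small sketch of building such sequences in Python follows; `within` is a hypothetical helper, not part of the platform.

```python
def within(first: str, second: str, max_gap: int) -> str:
    """Join two CQL token expressions, allowing up to max_gap tokens between them."""
    if max_gap < 0:
        raise ValueError("max_gap must be non-negative")
    if max_gap == 0:
        return f"{first} {second}"  # strictly adjacent tokens
    return f"{first} []{{0,{max_gap}}} {second}"


print(within('[lemma="steam"]', '[lemma="engine"]', 0))
print(within('[lemma="rail"]', '[lemma="road"]', 3))
```

The resulting strings can be passed as the `cql` value in the request examples above.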
Further Reading
- COHA Official Website - Full documentation, design principles, and research applications
- Sketch Engine CQL Documentation - Complete CQL reference
- CLAWS7 Tagset - Part-of-speech tag reference