COHA Corpus

Corpus of Historical American English (1810-2000)

Corpus Overview

The Corpus of Historical American English (COHA) is a balanced corpus of historical American English, containing over 472 million words of text from 1810-2000. COHA is the largest corpus of historical American English and enables diachronic (historical) analysis of language change over nearly two centuries.

Note

For comprehensive documentation, corpus design details, and research applications, see the official COHA website.

Important

Version Note: The COHA corpus available on this platform is an older version that was purchased for offline use. This version covers 1810-2000 and is separate from the continuously updated online version available on the official COHA website.

Key Statistics

Metric	Value
Total Words	~472 million
Time Period	1810-2000
Genres	5 (see below)
Texts	~116,000
Language	Historical American English

Genre Distribution

COHA is balanced across five major genres:

Code	Genre	Description
ACAD	Academic	Journal articles, books, reports
FIC	Fiction	Novels, short stories, plays
MAG	Magazine	News magazines, lifestyle
NEWS	News	Newspapers, news websites
SPOK	Spoken	TV shows, movies, transcripts

Time Periods

The corpus is organized by decade. The distribution reflects the availability of digitized texts over time:

Decade	Years	Approximate Tokens
1810s	1810-1819	~1.4M
1820s	1820-1829	~8.1M
1830s	1830-1839	~16.1M
1840s	1840-1849	~18.7M
1850s-1980s	1850-1989	~19-30M per decade
1990s	1990-1999	~33.0M
2000	2000	~34.8M (single year)

Note: Early decades (1810s-1840s) have fewer tokens because fewer digitized historical texts are available. Token counts grow over time, with the 1990s and 2000 the largest, so use normalized frequencies (e.g., per million words) rather than raw counts when comparing decades.
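One way to put counts from different decades on a common scale is to convert them to occurrences per million tokens. The sketch below uses the approximate decade sizes from the table above; the raw counts are invented placeholders for illustration, not real COHA results, and the function name is hypothetical.

```python
# Approximate decade sizes in tokens, taken from the table above.
DECADE_TOKENS = {
    "1810s": 1_400_000,
    "1820s": 8_100_000,
    "1830s": 16_100_000,
    "1840s": 18_700_000,
    "1990s": 33_000_000,
}


def per_million(raw_count: int, decade: str) -> float:
    """Convert a raw hit count into occurrences per million tokens."""
    return raw_count / DECADE_TOKENS[decade] * 1_000_000


# Hypothetical raw counts (placeholders, not real query results).
example_counts = {"1810s": 14, "1990s": 330}

for decade, count in example_counts.items():
    print(f"{decade}: {per_million(count, decade):.1f} per million")
```

In this made-up case both decades come out at the same rate, even though the raw counts differ by more than a factor of 20.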

Available Attributes

The corpus includes the following searchable attributes:

Attribute	Description	Example
`word`	Exact word form	[word="the"]
`lemma`	Base form (lemma)	[lemma="run"]
`tag`	POS tag (CLAWS7, lowercase)	[tag="nn.*"]
`doc.year`	Publication year	Filter: 1850-1900
`doc.genre`	Genre category	Filter: ACAD, FIC, etc.

POS Tagging

COHA uses the CLAWS7 tagset (stored in lowercase). Common tags include:

nn.* - Common nouns (nn1, nn2, nnt1, etc.); proper nouns use np.* (np1, np2)
v.* - Verbs (vvd, vvg, vvn, vbz, etc.)
jj.* - Adjectives (jj, jjr, jjt)
rr.* - Adverbs (rr, rrr, rrt)

See the CLAWS7 tagset documentation for complete reference.
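The tag patterns above are regular expressions. In CQL implementations such as CQP and Sketch Engine, a pattern has to match the whole attribute value; assuming this platform behaves the same way, Python's `re.fullmatch` is a reasonable stand-in for checking which tags a pattern would select. The tag list is only a small sample of lowercase CLAWS7 tags, and the helper name is hypothetical.

```python
import re

# A small sample of lowercase CLAWS7 tags (not the full tagset).
SAMPLE_TAGS = ["nn1", "nn2", "nnt1", "np1", "vvd", "vvg", "vbz", "jj", "jjr", "rr", "rrr"]


def matching_tags(pattern: str, tags=SAMPLE_TAGS) -> list[str]:
    """Return the tags that the pattern matches in full (anchored match)."""
    return [t for t in tags if re.fullmatch(pattern, t)]


print(matching_tags("nn.*"))  # common nouns only; np1 is not selected
print(matching_tags("v.*"))
print(matching_tags("jj.*"))
```

Note that `nn.*` does not pick up proper-noun tags such as `np1`; a pattern like `n.*` would be needed to cover both.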

Diachronic Analysis

COHA’s primary strength is enabling diachronic (historical) analysis of language change. The corpus is organized by decade, allowing you to track how words, phrases, and grammatical patterns change over time.

Timeline Feature

Use the Frequency feature with decade filters to visualize language change:

Run a query: [lemma="technology"]
Use the Frequency panel to see distribution by doc.year
Observe how frequency increases dramatically in the 20th century
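A similar decade-level view can be approximated offline from API results. This is a minimal sketch, assuming each hit carries a `metadata` object with a `year` field, as in the Python example under Programmatic Access; the function names are hypothetical and the sample hits are made up.

```python
from collections import Counter


def decade_of(year: int) -> str:
    """Map a publication year to its decade label, e.g. 1957 -> '1950s'."""
    return f"{year // 10 * 10}s"


def hits_per_decade(hits: list[dict]) -> dict[str, int]:
    """Count hits per decade, returned in chronological order."""
    counts = Counter(decade_of(int(hit["metadata"]["year"])) for hit in hits)
    return dict(sorted(counts.items()))


# Made-up hits in the assumed response shape.
sample_hits = [
    {"metadata": {"year": 1957}, "kwic": "..."},
    {"metadata": {"year": 1952}, "kwic": "..."},
    {"metadata": {"year": 1994}, "kwic": "..."},
]

print(hits_per_decade(sample_hits))
```

These are raw counts, so they should be read against the decade sizes listed under Time Periods.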

Tracking Language Change

Example: Track the emergence of new terms:

[lemma="internet"]

Filter by decade to see when internet first appears (1990s) and how its frequency increases.

CQL Query Examples

Basic Word Search

Find all occurrences of a word:

[word="society"]

Lemma Search (All Forms)

Find all forms of a word across two centuries:

[lemma="develop"]

This finds develop, develops, developed, and developing across all time periods. Derived words such as development have their own lemma and need a separate query.

Temporal Queries

Search within a specific time period:

Filter by decade (e.g., 1850-1900)
Run: [lemma="railroad"]

Compare usage across different decades to observe semantic shifts and frequency changes.
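When results come from the API, the same restriction can also be applied client-side. The sketch below assumes hits expose `metadata` with a `year` field, as in the Python example under Programmatic Access; the function name is hypothetical and the sample data is made up. Whether the API accepts a server-side year filter is not documented here, so none is assumed.

```python
def in_year_range(hits: list[dict], start: int, end: int) -> list[dict]:
    """Keep hits whose publication year lies in [start, end], inclusive."""
    return [h for h in hits if start <= int(h["metadata"]["year"]) <= end]


# Made-up hits in the assumed response shape.
sample_hits = [
    {"metadata": {"year": 1845}, "kwic": "the railroad company"},
    {"metadata": {"year": 1872}, "kwic": "a new railroad line"},
    {"metadata": {"year": 1931}, "kwic": "railroad workers"},
]

selected = in_year_range(sample_hits, 1850, 1900)
print([h["metadata"]["year"] for h in selected])
```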

Genre-Specific Historical Analysis

Compare genre usage over time:

Select FIC (fiction) genre filter
Filter by decade (e.g., 1900-1950)
Run: [lemma="automobile"]

Repeat the query with the NEWS genre filter to compare fiction and news usage over the same period.

Combining Attributes

Find plural nouns in a specific time period:

[lemma="city" & tag="nn2"]

Filter by 1900-2000 to see urbanization reflected in language.
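When queries are generated in code, combined constraints like the one above can be assembled from attribute/value pairs. The helper below is a hypothetical convenience, not part of the platform's API; it does no escaping, so it assumes the values contain no double quotes.

```python
def cql_token(**constraints: str) -> str:
    """Build one CQL token from attribute=value pairs joined with '&'."""
    parts = [f'{attr}="{value}"' for attr, value in constraints.items()]
    return "[" + " & ".join(parts) + "]"


print(cql_token(lemma="city", tag="nn2"))
print(cql_token(word="society"))
```

Keyword arguments keep their order, so the attributes appear in the query in the order they are passed.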

Programmatic Access (API)

Query the corpus programmatically via the REST API:

Example Request (curl)

curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"technology\"]",
    "corpus_id": "coha",
    "limit": 50
  }'

Example Request (Python)

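import requests

# Send the CQL query to the local API endpoint.
response = requests.post(
    "http://localhost:8000/cql/run",
    json={
        "cql": '[lemma="technology"]',
        "corpus_id": "coha",
        "limit": 50,
    },
    timeout=30,
)
response.raise_for_status()  # Fail early on HTTP errors.

results = response.json()
print(f"Found {results['total']} matches")

# Show the first five hits with their publication year.
for hit in results["kwic"][:5]:
    print(f"{hit['metadata']['year']}: {hit['kwic']}")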

Quick Reference

Goal	CQL Query
Exact word	[word="society"]
All forms (lemma)	[lemma="develop"]
Plural nouns only	[lemma="city" & tag="nn2"]
Past tense verbs	[lemma="think" & tag="vvd"]
Adjacent words	[lemma="steam"] [lemma="engine"]
Words within 3	[lemma="rail"] []{0,3} [lemma="road"]
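The last two rows of the table follow one pattern: two tokens with an optional gap of unconstrained tokens between them. A small sketch of a helper that builds such sequences is shown below; the function name is hypothetical and the helper only formats strings, it does not validate the CQL.

```python
def within(first: str, second: str, max_gap: int) -> str:
    """Join two CQL tokens, allowing up to max_gap arbitrary tokens between them."""
    if max_gap < 0:
        raise ValueError("max_gap must be non-negative")
    if max_gap == 0:
        # Adjacent tokens: no gap expression needed.
        return f"{first} {second}"
    return f"{first} []{{0,{max_gap}}} {second}"


print(within('[lemma="steam"]', '[lemma="engine"]', 0))
print(within('[lemma="rail"]', '[lemma="road"]', 3))
```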
