COHA Corpus

Corpus of Historical American English (1810-2000)

Corpus Overview

The Corpus of Historical American English (COHA) is a balanced corpus of historical American English, containing roughly 472 million words of text from 1810 to 2000. COHA is the largest corpus of historical American English and supports diachronic (historical) analysis of language change across nearly two centuries.

Note

For comprehensive documentation, corpus design details, and research applications, see the official COHA website.

Important

Version Note: The COHA corpus available on this platform is an older version that was purchased for offline use. This version covers 1810-2000 and is separate from the continuously updated online version available on the official COHA website.

Key Statistics

Metric       Value
Total Words  ~472 million
Time Period  1810-2000
Genres       5 (see below)
Texts        ~116,000
Language     Historical American English

Genre Distribution

COHA is balanced across five major genres:

Code  Genre     Description
ACAD  Academic  Journal articles, books, reports
FIC   Fiction   Novels, short stories, plays
MAG   Magazine  News magazines, lifestyle
NEWS  News      Newspapers, news websites
SPOK  Spoken    TV shows, movies, transcripts

Time Periods

The corpus is organized by decade. The distribution reflects the availability of digitized texts over time:

Decade       Years      Approximate Tokens
1810s        1810-1819  ~1.4M
1820s        1820-1829  ~8.1M
1830s        1830-1839  ~16.1M
1840s        1840-1849  ~18.7M
1850s-1980s  Various    ~19-30M per decade
1990s        1990-1999  ~33.0M
2000         2000       ~34.8M (single year)

Note: Early decades (1810s-1840s) have fewer tokens due to limited availability of digitized historical texts. The corpus size increases over time as more material becomes available. The 1990s and 2000 have the most tokens, reflecting the greater availability of digital texts from this period.
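
Because the decades differ so much in size, raw hit counts are not directly comparable across time. A common remedy is to normalize counts to hits per million tokens. The sketch below uses the approximate decade sizes from the table above; the function name and the raw counts are invented for illustration and are not real COHA results.

```python
# Approximate decade sizes in tokens, taken from the table above.
DECADE_TOKENS = {
    "1810s": 1_400_000,
    "1820s": 8_100_000,
    "1830s": 16_100_000,
    "1840s": 18_700_000,
    "1990s": 33_000_000,
}


def per_million(raw_count: int, decade: str) -> float:
    """Convert a raw hit count into hits per million tokens for one decade."""
    return raw_count / DECADE_TOKENS[decade] * 1_000_000


# Hypothetical raw counts, for illustration only (not real COHA results).
example_counts = {"1810s": 14, "1990s": 330}
for decade, count in example_counts.items():
    print(f"{decade}: {per_million(count, decade):.1f} per million")
```

In this made-up case both decades come out at about 10 hits per million, even though the raw counts differ by more than a factor of twenty.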


Available Attributes

The corpus includes the following searchable attributes:

Attribute  Description                  Example
word       Exact word form              [word="the"]
lemma      Base form (lemma)            [lemma="run"]
tag        POS tag (CLAWS7, lowercase)  [tag="nn.*"]
doc.year   Publication year             Filter: 1850-1900
doc.genre  Genre category               Filter: ACAD, FIC, etc.

POS Tagging

COHA uses the CLAWS7 tagset (stored in lowercase). Common tags include:

  • nn.* - Common nouns (nn, nn1, nn2, nnt1, etc.)
  • v.* - Verbs (vvd, vvg, vvn, vbz, etc.)
  • jj.* - Adjectives (jj, jjr, jjt)
  • rr.* - Adverbs (rr, rrr, rrt)

See the CLAWS7 tagset documentation for complete reference.
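
To preview which tags a pattern such as nn.* would select, the pattern can be tried against a handful of tags in Python. This is only a sketch: it assumes the query engine matches the pattern against the whole tag value, and Python's regex dialect may differ from the engine's in edge cases. The sample tag list and the function name are chosen for illustration.

```python
import re

# A few CLAWS7 tags in the lowercase form used by this corpus.
SAMPLE_TAGS = ["nn1", "nn2", "nnt1", "np1", "vvd", "vbz", "jj", "jjr", "rr", "rrr"]


def matching_tags(pattern: str, tags: list[str]) -> list[str]:
    """Return the tags that the pattern matches in full."""
    return [tag for tag in tags if re.fullmatch(pattern, tag)]


print(matching_tags("nn.*", SAMPLE_TAGS))  # ['nn1', 'nn2', 'nnt1']
print(matching_tags("v.*", SAMPLE_TAGS))   # ['vvd', 'vbz']
print(matching_tags("jj.*", SAMPLE_TAGS))  # ['jj', 'jjr']
```

Note that nn.* does not select proper nouns such as np1; under the same assumption, a pattern like n.* would cover both.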


Diachronic Analysis

COHA’s primary strength is enabling diachronic (historical) analysis of language change. The corpus is organized by decade, allowing you to track how words, phrases, and grammatical patterns change over time.

Timeline Feature

Use the Frequency feature with decade filters to visualize language change:

  1. Run a query: [lemma="technology"]
  2. Use the Frequency panel to see distribution by doc.year
  3. Observe how frequency increases dramatically in the 20th century
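
The decade breakdown from the steps above can also be approximated outside the interface once the publication years of the hits are available. A minimal sketch, assuming the years have already been collected into a list of integers; the function names and sample years are hypothetical.

```python
from collections import Counter


def decade_label(year: int) -> str:
    """Map a publication year to its decade label, e.g. 1957 -> '1950s'."""
    return f"{year // 10 * 10}s"


def hits_per_decade(years: list[int]) -> dict[str, int]:
    """Count hits per decade, returned in chronological order."""
    counts = Counter(decade_label(y) for y in years)
    return dict(sorted(counts.items()))


# Hypothetical publication years of six hits, for illustration only.
sample_years = [1848, 1923, 1957, 1958, 1991, 1999]
print(hits_per_decade(sample_years))
# {'1840s': 1, '1920s': 1, '1950s': 2, '1990s': 2}
```

These are raw counts; since decade sizes vary, they should be normalized before comparing decades.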

Tracking Language Change

Example: Track the emergence of new terms:

[lemma="internet"]

Filter by decade to see when internet first appears (1990s) and how its frequency increases.


CQL Query Examples

Lemma Search (All Forms)

Find all forms of a word across two centuries:

[lemma="develop"]

This finds the inflected forms develop, develops, developed, and developing across all time periods. Derived words such as development have their own lemma and need a separate query.

Temporal Queries

Search within a specific time period:

  1. Filter by year range (e.g., 1850-1900)
  2. Run: [lemma="railroad"]

Compare usage across different decades to observe semantic shifts and frequency changes.
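
When results have been retrieved without a filter, the same restriction can be applied afterwards on the client side. The sketch below assumes each hit is a dictionary with a metadata year field, the shape used in the Python API example in this document; the helper name and the sample hits are hypothetical.

```python
def in_year_range(hits: list[dict], start: int, end: int) -> list[dict]:
    """Keep hits whose metadata year lies in start..end (inclusive)."""
    # int() tolerates years delivered as strings; that format is an assumption.
    return [h for h in hits if start <= int(h["metadata"]["year"]) <= end]


# Hypothetical hits, for illustration only (not real COHA results).
sample_hits = [
    {"metadata": {"year": 1834}, "kwic": "... the new railroad ..."},
    {"metadata": {"year": "1872"}, "kwic": "... a railroad company ..."},
    {"metadata": {"year": 1925}, "kwic": "... railroad strike ..."},
]

early = in_year_range(sample_hits, 1850, 1900)
print(len(early), "of", len(sample_hits), "hits fall in 1850-1900")
```

Running the function twice with different ranges gives two subsets whose concordance lines can be read side by side.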

Genre-Specific Historical Analysis

Compare genre usage over time:

  1. Select FIC (fiction) genre filter
  2. Filter by year range (e.g., 1900-1950)
  3. Run: [lemma="automobile"]

Repeat the search with the NEWS genre filter to compare fiction and news usage over the same period.

Combining Attributes

Find plural nouns in a specific time period:

[lemma="city" & tag="nn2"]

Filter by 1900-2000 to see urbanization reflected in language.
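
When queries are generated in a script, combined-attribute tokens like the one above can be assembled from keyword arguments. This is a small hypothetical helper, not part of the platform; it does no escaping, so values containing double quotes would need extra handling.

```python
def cql_token(**constraints: str) -> str:
    """Build one CQL token, e.g. cql_token(lemma="city", tag="nn2")."""
    if not constraints:
        return "[]"  # an empty token matches any single word
    parts = [f'{attr}="{value}"' for attr, value in constraints.items()]
    return "[" + " & ".join(parts) + "]"


print(cql_token(lemma="city", tag="nn2"))  # [lemma="city" & tag="nn2"]
print(cql_token(word="society"))           # [word="society"]
```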


Programmatic Access (API)

Query the corpus programmatically via the REST API:

Example Request (curl)

curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"technology\"]",
    "corpus_id": "coha",
    "limit": 50
  }'

Example Request (Python)

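import requests

# Send the CQL query to the local API endpoint.
response = requests.post(
    "http://localhost:8000/cql/run",
    json={
        "cql": '[lemma="technology"]',
        "corpus_id": "coha",
        "limit": 50,
    },
    timeout=30,
)
response.raise_for_status()  # stop early on HTTP errors

results = response.json()
print(f"Found {results['total']} matches")
# Print the year and concordance line of the first five hits.
for hit in results["kwic"][:5]:
    print(f"{hit['metadata']['year']}: {hit['kwic']}")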

Quick Reference

Goal                    CQL Query
Exact word              [word="society"]
All forms (lemma)       [lemma="develop"]
Plural nouns only       [lemma="city" & tag="nn2"]
Past tense verbs        [lemma="think" & tag="vvd"]
Adjacent words          [lemma="steam"] [lemma="engine"]
Up to 3 words between   [lemma="rail"] []{0,3} [lemma="road"]
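
The last two rows follow one pattern: two tokens separated by a bounded number of arbitrary tokens. A hypothetical helper that produces such sequences, shown only as a sketch:

```python
def within(left: str, right: str, max_gap: int) -> str:
    """Join two CQL tokens, allowing up to max_gap arbitrary tokens between."""
    if max_gap < 0:
        raise ValueError("max_gap must be non-negative")
    if max_gap == 0:
        return f"{left} {right}"  # strictly adjacent tokens
    return f"{left} []{{0,{max_gap}}} {right}"


print(within('[lemma="steam"]', '[lemma="engine"]', 0))
# [lemma="steam"] [lemma="engine"]
print(within('[lemma="rail"]', '[lemma="road"]', 3))
# [lemma="rail"] []{0,3} [lemma="road"]
```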

Further Reading