COCA Corpus

Corpus of Contemporary American English (1990-2012)

Corpus Overview

The Corpus of Contemporary American English (COCA) is a balanced corpus of contemporary American English, containing roughly 414 million words of text from 1990 to 2012. COCA is one of the largest freely available corpora of American English and is widely used for linguistic research, lexicography, and language teaching.

Note

For comprehensive documentation, corpus design details, and research applications, see the official COCA website.

Important

Version Note: The COCA corpus available on this platform is an older version that was purchased for offline use. This version covers 1990-2012 and is separate from the continuously updated online version available on the official COCA website.

Key Statistics

Metric	Value
Total Words	~414 million
Time Period	1990-2012
Genres	5 (see below)
Texts	~189,000
Language	American English

Genre Distribution

COCA is balanced across five major genres:

Code	Genre	Description
ACAD	Academic	Journal articles, books, reports
FIC	Fiction	Novels, short stories, plays
MAG	Magazine	News magazines, lifestyle
NEWS	News	Newspapers, news websites
SPOK	Spoken	Transcripts of TV and radio programs

Available Attributes

The corpus includes the following searchable attributes:

Attribute	Description	Example
`word`	Exact word form	[word="the"]
`lemma`	Base form (lemma)	[lemma="run"]
`tag`	POS tag (CLAWS7, lowercase)	[tag="nn.*"]
`doc.year`	Publication year	Filter: 2000-2010
`doc.genre`	Genre category	Filter: ACAD, FIC, etc.

POS Tagging

COCA uses the CLAWS7 tagset (stored in lowercase). Common tags include:

nn.* - Common nouns (nn, nn1, nn2, nnt1, etc.); proper nouns use np.* (np, np1, np2)
v.* - Verbs (vvd, vvg, vvn, vbz, etc.)
jj.* - Adjectives (jj, jjr, jjt)
rr.* - Adverbs (rr, rrr, rrt)

See the CLAWS7 tagset documentation for complete reference.
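
To see which tags a pattern such as nn.* covers, the patterns can be tried against a few tag strings in plain Python. This is only a rough sketch: it assumes the corpus engine matches the regular expression against the whole attribute value, as CQL engines commonly do, and the tag list is a small hand-picked sample rather than the full CLAWS7 set. The helper name `matching_tags` is made up for this illustration.

```python
import re

# A small hand-picked sample of lowercase CLAWS7 tags (not the full tagset).
SAMPLE_TAGS = ["nn1", "nn2", "nnt1", "np1", "vvd", "vvg", "vbz", "jj", "jjr", "rr", "rrr"]


def matching_tags(pattern, tags=SAMPLE_TAGS):
    """Return the tags that a regex pattern matches in full."""
    compiled = re.compile(pattern)
    return [tag for tag in tags if compiled.fullmatch(tag)]


for pattern in ("nn.*", "v.*", "jj.*", "rr.*"):
    print(pattern, "->", matching_tags(pattern))
```

Note that a proper-noun tag such as np1 is not covered by nn.*; a separate pattern such as np.* would be needed for it.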

CQL Query Examples

Basic Word Search

Find all occurrences of a word:

[word="technology"]

Lemma Search (All Forms)

Find all forms of a word:

[lemma="develop"]

This finds: develop, develops, developed, developing. Derived words such as development have their own lemma and are not matched.

Genre-Specific Queries

Search within a specific genre:

Select ACAD in the genre filter to search academic texts only
Run: [lemma="research"]

Compare usage across genres by running the same query with different genre filters.
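
Because the genre sections need not be exactly the same size, raw hit counts are easier to compare once they are normalized to a rate per million words. The sketch below shows the arithmetic only. The hit counts and section sizes are made-up placeholders, not actual COCA figures, and `per_million` is a hypothetical helper name.

```python
def per_million(hits, section_words):
    """Normalize a raw hit count to occurrences per million words."""
    if section_words <= 0:
        raise ValueError("section_words must be positive")
    return hits * 1_000_000 / section_words


# Placeholder numbers for illustration only (not real COCA counts or sizes).
example = {
    "ACAD": {"hits": 40_000, "words": 80_000_000},
    "FIC": {"hits": 4_000, "words": 80_000_000},
}

rates = {
    genre: per_million(data["hits"], data["words"])
    for genre, data in example.items()
}
for genre, rate in rates.items():
    print(f"{genre}: {rate:.1f} per million words")
```

In practice the hit counts would come from running the same query under each genre filter, and the section sizes from the corpus documentation.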

Temporal Analysis

Track language change over time:

Filter by year range (e.g., 1990-2000 vs 2001-2012)
Run: [lemma="internet"]

Observe how terms like internet, smartphone, social media emerge and increase in frequency.

Combining Attributes

Find plural noun forms of a lemma:

[lemma="study" & tag="nn2"]

Find past tense forms of a verb:

[lemma="think" & tag="vvd"]
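
When queries are generated from a script, combined-attribute tokens like the ones above can be assembled from keyword arguments. The following is a minimal sketch, not part of the platform: `cql_token` is a hypothetical helper, and it simply rejects values containing a double quote rather than guessing how the engine expects them to be escaped.

```python
def cql_token(**attrs):
    """Build one CQL token expression, e.g. [lemma="study" & tag="nn2"]."""
    if not attrs:
        return "[]"  # an empty token matches any single word
    parts = []
    for name, value in attrs.items():
        if '"' in value:
            # Escaping rules are engine-specific, so refuse instead of guessing.
            raise ValueError(f"double quote not supported in value: {value!r}")
        parts.append(f'{name}="{value}"')
    return "[" + " & ".join(parts) + "]"


print(cql_token(lemma="study", tag="nn2"))
print(cql_token(lemma="think", tag="vvd"))
```

The attribute order in the output follows the order of the keyword arguments, so the generated strings stay predictable.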

Programmatic Access (API)

Query the corpus programmatically via the REST API:

Example Request (curl)

curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"technology\"]",
    "corpus_id": "coca",
    "limit": 50
  }'

Example Request (Python)

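import requests

# Send the CQL query to the local API endpoint as a JSON body.
response = requests.post(
    "http://localhost:8000/cql/run",
    json={
        "cql": '[lemma="technology"]',
        "corpus_id": "coca",
        "limit": 50,
    },
    timeout=30,
)
response.raise_for_status()  # stop early on an HTTP error status

results = response.json()
print(f"Found {results['total']} matches")
# Print the first five concordance (KWIC) lines.
for hit in results["kwic"][:5]:
    print(hit["kwic"])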

Quick Reference

Goal	CQL Query
Exact word	[word="technology"]
All forms (lemma)	[lemma="develop"]
Plural nouns only	[lemma="study" & tag="nn2"]
Past tense verbs	[lemma="think" & tag="vvd"]
Adjacent words	[lemma="social"] [lemma="media"]
Up to 3 words in between	[lemma="artificial"] []{0,3} [lemma="intelligence"]
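
The two multi-word patterns in the table differ only in the gap allowed between tokens, so both can be produced by one small function. This is an illustrative sketch: `lemma_sequence` and its `max_gap` parameter are hypothetical names, and the function only builds the query string without contacting the corpus.

```python
def lemma_sequence(lemmas, max_gap=0):
    """Join lemma tokens into a CQL sequence.

    max_gap is the largest number of arbitrary tokens allowed between
    neighbouring lemmas; 0 means the words must be adjacent.
    """
    if not lemmas:
        raise ValueError("need at least one lemma")
    if max_gap < 0:
        raise ValueError("max_gap must not be negative")
    tokens = [f'[lemma="{lemma}"]' for lemma in lemmas]
    separator = " " if max_gap == 0 else f" []{{0,{max_gap}}} "
    return separator.join(tokens)


print(lemma_sequence(["social", "media"]))
print(lemma_sequence(["artificial", "intelligence"], max_gap=3))
```

The resulting strings can be pasted into the query box or sent as the cql field of an API request.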
