COCA Corpus
Corpus of Contemporary American English (1990-2012)
Corpus Overview
The Corpus of Contemporary American English (COCA) is a balanced corpus of American English containing over 414 million words of text from 1990-2012. COCA is one of the largest freely available corpora of American English and is widely used for linguistic research, lexicography, and language teaching.
For comprehensive documentation, corpus design details, and research applications, see the official COCA website.
Version Note: The COCA corpus available on this platform is an older version that was purchased for offline use. This version covers 1990-2012 and is separate from the continuously updated online version available on the official COCA website.
Key Statistics
| Metric | Value |
|---|---|
| Total Words | ~414 million |
| Time Period | 1990-2012 |
| Genres | 5 (see below) |
| Texts | ~189,000 |
| Language | American English |
Genre Distribution
COCA is balanced across five major genres:
| Code | Genre | Description |
|---|---|---|
| ACAD | Academic | Journal articles, books, reports |
| FIC | Fiction | Novels, short stories, plays |
| MAG | Magazine | News magazines, lifestyle |
| NEWS | News | Newspapers, news websites |
| SPOK | Spoken | TV shows, movies, transcripts |
Available Attributes
The corpus includes the following searchable attributes:
| Attribute | Description | Example |
|---|---|---|
| word | Exact word form | [word="the"] |
| lemma | Base form (lemma) | [lemma="run"] |
| tag | POS tag (CLAWS7, lowercase) | [tag="nn.*"] |
| doc.year | Publication year | Filter: 2000-2010 |
| doc.genre | Genre category | Filter: ACAD, FIC, etc. |
POS Tagging
COCA uses the CLAWS7 tagset (stored in lowercase). Common tags include:
- nn.* - Nouns (nn1, nn2, nnt1, etc.)
- v.* - Verbs (vvd, vvg, vvn, vbz, etc.)
- jj.* - Adjectives (jj, jjr, jjt)
- rr.* - Adverbs (rr, rrr, rrt)
See the CLAWS7 tagset documentation for complete reference.
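To make the wildcard patterns concrete, the sketch below checks a few lowercase CLAWS7 tags against them with Python's `re` module. It assumes the query engine matches a pattern against the whole attribute value, as Sketch Engine-style CQL does; the helper name `tags_matching` and the sample tag list are made up for illustration.

```python
import re

# A handful of lowercase CLAWS7 tags (illustrative sample, not the full tagset).
SAMPLE_TAGS = ["nn1", "nn2", "nnt1", "vvd", "vvg", "vbz", "jj", "jjr", "rr", "rrr"]


def tags_matching(pattern, tags=SAMPLE_TAGS):
    """Return the tags that the pattern matches in full (assumed CQL behaviour)."""
    regex = re.compile(pattern)
    return [tag for tag in tags if regex.fullmatch(tag)]


print(tags_matching("nn.*"))  # ['nn1', 'nn2', 'nnt1']
print(tags_matching("v.*"))   # ['vvd', 'vvg', 'vbz']
print(tags_matching("jj.*"))  # ['jj', 'jjr']
print(tags_matching("rr.*"))  # ['rr', 'rrr']
```

Note that `v.*` also matches forms of be, have, and do (for example vbz), so a narrower pattern such as `vv.*` may be preferable when only lexical verbs are wanted.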
CQL Query Examples
Basic Word Search
Find all occurrences of a word:
[word="technology"]
Lemma Search (All Forms)
Find all forms of a word:
[lemma="develop"]
This finds all inflected forms: develop, develops, developed, developing. Derived words such as development have their own lemma and are not matched.
Genre-Specific Queries
Search within a specific genre:
- Select ACAD in the genre filter to search academic texts only
- Run: [lemma="research"]
Compare usage across genres by running the same query with different genre filters.
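Raw hit counts are not directly comparable when the sections being compared differ in size, so comparisons are usually made on frequencies per million words. The arithmetic is sketched below; the function name is hypothetical, and the counts and section sizes are placeholder numbers, not real COCA figures.

```python
def per_million(hits, section_words):
    """Normalize a raw hit count to occurrences per million words."""
    if section_words <= 0:
        raise ValueError("section_words must be positive")
    return hits * 1_000_000 / section_words


# Placeholder values for illustration only (not real COCA counts or sizes).
example_hits = {"ACAD": 12_000, "FIC": 900}
example_sizes = {"ACAD": 80_000_000, "FIC": 75_000_000}

for genre, hits in example_hits.items():
    rate = per_million(hits, example_sizes[genre])
    print(f"{genre}: {rate:.1f} per million words")
```

The same normalization applies when comparing year ranges of different lengths.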
Temporal Analysis
Track language change over time:
- Filter by year range (e.g., 1990-2000 vs 2001-2012)
- Run: [lemma="internet"]
Observe how terms like internet, smartphone, social media emerge and increase in frequency.
Combining Attributes
Find plural nouns:
[lemma="study" & tag="nn2"]
Find past tense verbs:
[lemma="think" & tag="vvd"]
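Queries that combine attributes can also be assembled in code. The following is a minimal sketch, not part of the platform: `cql_token` is a hypothetical helper that joins attribute conditions with `&`, and it rejects values containing a double quote rather than guessing at the engine's escaping rules.

```python
def cql_token(conditions):
    """Build one CQL token, e.g. [lemma="study" & tag="nn2"], from a dict."""
    if not conditions:
        return "[]"  # an empty token matches any single word
    parts = []
    for attribute, value in conditions.items():
        if '"' in value:
            # Escaping rules are engine-specific, so refuse instead of guessing.
            raise ValueError(f"double quote not supported in value: {value!r}")
        parts.append(f'{attribute}="{value}"')
    return "[" + " & ".join(parts) + "]"


print(cql_token({"lemma": "study", "tag": "nn2"}))  # [lemma="study" & tag="nn2"]
print(cql_token({"lemma": "think", "tag": "vvd"}))  # [lemma="think" & tag="vvd"]
```

A dict is used instead of keyword arguments so that the conditions keep a predictable order and the attribute names are passed as plain strings.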
Programmatic Access (API)
Query the corpus programmatically via the REST API:
Example Request (curl)
```bash
curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"technology\"]",
    "corpus_id": "coca",
    "limit": 50
  }'
```
Example Request (Python)
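```python
import requests

# Send the CQL query to the local API as a JSON body.
response = requests.post(
    "http://localhost:8000/cql/run",
    json={
        "cql": '[lemma="technology"]',
        "corpus_id": "coca",
        "limit": 50,
    },
)
response.raise_for_status()  # stop early on an HTTP error status
results = response.json()

print(f"Found {results['total']} matches")
# Show the first five concordance lines.
for hit in results["kwic"][:5]:
    print(hit["kwic"])
```
Quick Reference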
| Goal | CQL Query |
|---|---|
| Exact word | [word="technology"] |
| All forms (lemma) | [lemma="develop"] |
| Plural nouns only | [lemma="study" & tag="nn2"] |
| Past tense verbs | [lemma="think" & tag="vvd"] |
| Adjacent words | [lemma="social"] [lemma="media"] |
| Up to 3 words in between | [lemma="artificial"] []{0,3} [lemma="intelligence"] |
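The multi-token patterns in the table can also be composed in code and sent to the API. The sketch below uses only the standard library; the endpoint and the payload fields (`cql`, `corpus_id`, `limit`) are taken from the request examples in the API section, while the helper names `within`, `build_payload`, and `run_query` are hypothetical.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/cql/run"  # endpoint from the API examples


def within(first, second, max_gap=3):
    """Join two CQL tokens, allowing up to max_gap arbitrary tokens between them."""
    if max_gap < 0:
        raise ValueError("max_gap must be non-negative")
    if max_gap == 0:
        return f"{first} {second}"  # strictly adjacent tokens
    return f"{first} []{{0,{max_gap}}} {second}"


def build_payload(cql, corpus_id="coca", limit=50):
    """Return the JSON body used by the request examples."""
    return {"cql": cql, "corpus_id": corpus_id, "limit": limit}


def run_query(cql, limit=50, timeout=30):
    """POST the query and return the decoded JSON response (needs a running server)."""
    body = json.dumps(build_payload(cql, limit=limit)).encode("utf-8")
    request = urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


query = within('[lemma="artificial"]', '[lemma="intelligence"]')
print(query)
print(json.dumps(build_payload(query)))
```

Only `within` and `build_payload` run here; `run_query` is defined but not called, since it requires the local server to be available.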
Further Reading
- COCA Official Website - Full documentation, design principles, and research applications
- Sketch Engine CQL Documentation - Complete CQL reference
- CLAWS7 Tagset - Part-of-speech tag reference