COCA Corpus

Corpus of Contemporary American English (1990-2012)

Corpus Overview

The Corpus of Contemporary American English (COCA) is a balanced corpus of contemporary American English, containing over 414 million words of text from 1990-2012. COCA is one of the largest freely available corpora of American English and is widely used for linguistic research, lexicography, and language teaching.

Note

For comprehensive documentation, corpus design details, and research applications, see the official COCA website.

Important

Version Note: The COCA corpus available on this platform is an older version that was purchased for offline use. This version covers 1990-2012 and is separate from the continuously updated online version available on the official COCA website.

Key Statistics

Metric        Value
Total Words   ~414 million
Time Period   1990-2012
Genres        5 (see below)
Texts         ~189,000
Language      American English

Genre Distribution

COCA is balanced across five major genres:

Code   Genre      Description
ACAD   Academic   Journal articles, books, reports
FIC    Fiction    Novels, short stories, plays
MAG    Magazine   News magazines, lifestyle
NEWS   News       Newspapers, news websites
SPOK   Spoken     TV shows, movies, transcripts

Available Attributes

The corpus includes the following searchable attributes:

Attribute   Description                   Example
word        Exact word form               [word="the"]
lemma       Base form (lemma)             [lemma="run"]
tag         POS tag (CLAWS7, lowercase)   [tag="nn.*"]
doc.year    Publication year              Filter: 2000-2010
doc.genre   Genre category                Filter: ACAD, FIC, etc.

POS Tagging

COCA uses the CLAWS7 tagset (stored in lowercase). Common tags include:

  • nn.* - Common nouns (nn, nn1, nn2, nnt1, etc.); proper nouns use np.* (np1, np2)
  • v.* - Verbs (vvd, vvg, vvn, vbz, etc.)
  • jj.* - Adjectives (jj, jjr, jjt)
  • rr.* - Adverbs (rr, rrr, rrt)

See the CLAWS7 tagset documentation for complete reference.
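
To see which tags a pattern such as nn.* covers before running a query, the pattern can be tried locally against a list of tags. The sketch below assumes the query engine matches the pattern against the whole tag value, which is typical of CQP-style engines but is not confirmed here for this platform. The tag list is a small hand-picked sample of CLAWS7 tags, not the full tagset, and tags_matching is a hypothetical helper name.

```python
import re

# A small, hand-picked sample of lowercase CLAWS7 tags (not the full tagset).
SAMPLE_TAGS = ["nn1", "nn2", "nnt1", "np1", "vvd", "vvg", "vbz", "jj", "jjr", "rr", "rrr"]


def tags_matching(pattern, tags=SAMPLE_TAGS):
    """Return the tags that the pattern matches in full (assumed engine behavior)."""
    compiled = re.compile(pattern)
    return [tag for tag in tags if compiled.fullmatch(tag)]


for pattern in ("nn.*", "v.*", "jj.*", "rr.*"):
    print(pattern, "->", tags_matching(pattern))
```

Note that np1 (a proper noun tag) is not covered by nn.*, so a query for all nouns would need a broader pattern such as n.*.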


CQL Query Examples

Lemma Search (All Forms)

Find all forms of a word:

[lemma="develop"]

This finds the inflected forms of the verb: develop, develops, developed, developing. Derived words such as development have their own lemma and are not matched.

Genre-Specific Queries

Search within a specific genre:

  1. Select ACAD in the genre filter to search academic texts only
  2. Run: [lemma="research"]

Compare usage across genres by running the same query with different genre filters.
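
Raw hit counts are only comparable across genres if the genre sub-corpora are the same size, and this page does not list per-genre word counts. A common workaround is to normalize each count to occurrences per million words. The sketch below shows the arithmetic; per_million is a hypothetical helper, and the numbers in the usage line are placeholders for illustration, not real COCA counts.

```python
def per_million(hits, total_words):
    """Normalize a raw hit count to occurrences per million words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return hits * 1_000_000 / total_words


# Placeholder numbers for illustration only, not real COCA counts.
rate = per_million(hits=500, total_words=80_000_000)
print(f"{rate:.2f} per million words")
```

Substitute the hit total reported for each genre filter and the word count of that genre to obtain comparable rates.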

Temporal Analysis

Track language change over time:

  1. Filter by non-overlapping year ranges (e.g., 1990-2000 vs 2001-2012)
  2. Run: [lemma="internet"]

Observe how terms like internet, smartphone, social media emerge and increase in frequency.
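
For a finer-grained view than two periods, the corpus time span can be split into consecutive, non-overlapping year ranges, and the same query run once per range. The following is a minimal sketch of that splitting step only; year_bins is a hypothetical helper, and how each range is passed to the year filter depends on the interface in use.

```python
def year_bins(start, end, width):
    """Split the inclusive range start..end into consecutive, non-overlapping bins."""
    if width < 1:
        raise ValueError("width must be at least 1")
    if start > end:
        raise ValueError("start must not be after end")
    bins = []
    low = start
    while low <= end:
        high = min(low + width - 1, end)  # the last bin may be shorter
        bins.append((low, high))
        low = high + 1
    return bins


for low, high in year_bins(1990, 2012, 5):
    print(f"{low}-{high}")
```

Because the last bin may cover fewer years than the others, comparing normalized frequencies rather than raw counts is advisable.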

Combining Attributes

Find plural forms of the noun study:

[lemma="study" & tag="nn2"]

Find past tense forms of the verb think:

[lemma="think" & tag="vvd"]
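
When queries are generated from a script, a small helper can assemble the attribute constraints into a single token expression. The sketch below is one possible way to do this; cql_token is a hypothetical name, and it rejects values containing double quotes rather than guessing at the engine's escaping rules.

```python
def cql_token(**attrs):
    """Build one CQL token expression from attribute=value pairs."""
    if not attrs:
        return "[]"  # an empty token expression matches any token
    parts = []
    for name, value in attrs.items():
        if '"' in value:
            raise ValueError(f"double quote not supported in value: {value!r}")
        parts.append(f'{name}="{value}"')
    return "[" + " & ".join(parts) + "]"


print(cql_token(lemma="study", tag="nn2"))
print(cql_token(lemma="think", tag="vvd"))
```

Keyword arguments keep their order, so the generated expression lists the constraints in the order they were given.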

Programmatic Access (API)

Query the corpus programmatically via the REST API:

Example Request (curl)

curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"technology\"]",
    "corpus_id": "coca",
    "limit": 50
  }'

Example Request (Python)

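import requests

# Send the CQL query to the local API endpoint.
response = requests.post(
    "http://localhost:8000/cql/run",
    json={
        "cql": '[lemma="technology"]',
        "corpus_id": "coca",
        "limit": 50,
    },
    timeout=30,  # avoid hanging indefinitely if the server is unreachable
)
response.raise_for_status()  # fail early on HTTP errors (4xx/5xx)

results = response.json()
print(f"Found {results['total']} matches")
# Print the first five concordance (KWIC) lines.
for hit in results["kwic"][:5]:
    print(hit["kwic"])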

Quick Reference

Goal                        CQL Query
Exact word                  [word="technology"]
All forms (lemma)           [lemma="develop"]
Plural noun forms only      [lemma="study" & tag="nn2"]
Past tense verb forms       [lemma="think" & tag="vvd"]
Adjacent words              [lemma="social"] [lemma="media"]
Up to 3 words in between    [lemma="artificial"] []{0,3} [lemma="intelligence"]
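
The last two rows of the table differ only in the optional gap between the two tokens, so both can be produced by one helper. The following is a minimal sketch; lemma_sequence is a hypothetical name, and it assumes plain lemma values without double quotes.

```python
def lemma_sequence(first, second, max_gap=0):
    """Build a two-lemma CQL query, allowing up to max_gap tokens in between."""
    if max_gap < 0:
        raise ValueError("max_gap must not be negative")
    first_token = f'[lemma="{first}"]'
    second_token = f'[lemma="{second}"]'
    if max_gap == 0:
        return f"{first_token} {second_token}"  # strictly adjacent tokens
    # []{0,n} matches between 0 and n arbitrary tokens
    return f"{first_token} []{{0,{max_gap}}} {second_token}"


print(lemma_sequence("social", "media"))
print(lemma_sequence("artificial", "intelligence", max_gap=3))
```

The resulting string can be used as the cql value in the API requests shown earlier on this page.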

Further Reading