enRed Corpus
English Reddit Corpus — Regional Varieties (2010-2024)
English Reddit corpus: 5 subreddits with 1.25B words, 34.3M documents. Full dependency parsing and NER are enabled. Includes SNA metadata (parent_id, link_id, score) for network analysis.
Corpus Overview
The enRed (English Reddit Corpus) contains English-language posts and comments from regional English subreddits, providing access to contemporary informal written English discourse from the UK (Wales, Scotland, Northern Ireland, and general UK) and the United States.
The corpus enables studies of regional English varieties, comparing British Isles English (Welsh, Scottish, Northern Irish, general British) with American English.
Language detection uses fastText (lid.176.bin model), which reports ISO 639-1 codes. The vast majority of content is English (en); occasional misclassified content falls below the 1% threshold applied in frequency displays and is therefore hidden there.
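The effect of the 1% display threshold can be pictured with a small sketch. It assumes the threshold applies to each language's share of documents, which the description above does not spell out; the function name `visible_languages` and the counts are hypothetical, not corpus data.

```python
from collections import Counter


def visible_languages(lang_counts: Counter, threshold: float = 0.01) -> dict:
    """Return each language's share of documents, keeping shares >= threshold."""
    total = sum(lang_counts.values())
    if total == 0:
        return {}
    shares = {lang: count / total for lang, count in lang_counts.items()}
    return {lang: share for lang, share in shares.items() if share >= threshold}


# Hypothetical counts of detected languages (ISO 639-1 codes), not corpus data.
counts = Counter({"en": 9_900, "cy": 60, "gd": 25, "de": 15})
shown = visible_languages(counts)
```

With these made-up counts only `en` clears the threshold, so the rarely detected languages would not appear in a frequency display.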
The corpus is processed with spaCy en_core_web_lg (the large English model) using the full pipeline: tokenization, POS tagging (Penn Treebank), lemmatization, dependency parsing, and named entity recognition.
Key Statistics
| Metric | Value |
|---|---|
| Total Words | 1.25 billion |
| Documents | 34,300,000 |
| Subreddits | 5 |
| Time Period | 2010-2024 |
| Language | English |
| spaCy Model | en_core_web_lg |
Subreddits Included
| Subreddit | Region | Documents | Description |
|---|---|---|---|
| Wales | Wales | 472,000 | Welsh English community |
| Scotland | Scotland | 4,500,000 | Scottish English community |
| northernireland | Northern Ireland | 3,500,000 | Northern Irish English community |
| AskUK | United Kingdom | 15,000,000 | General British English Q&A |
| AskAnAmerican | United States | 10,800,000 | American English Q&A community |
Data Source
The corpus is derived from:
- Academic Torrents: Reddit archive dumps via academictorrents.com
- fastText Language ID: Comments filtered using lid.176.bin model
Available Attributes
The corpus includes the following searchable attributes:
Token Attributes
| Attribute | Description | Example |
|---|---|---|
| word | Exact word form | [word="the"] |
| lemma | Base form (lemma) | [lemma="think"] |
| tag | POS tag (Penn Treebank) | [tag="NN"] |
| head_n | Head position (1-based, 0 for root) | Dependency parsing |
| head | Head lemma | Dependency parsing |
| deprel | Dependency relation (UD) | [deprel="nsubj"] |
| ent_type | Named entity type (GPE, PERSON, ORG, etc.) | [ent_type="GPE"] |
| ent_iob | NER IOB tag (B=begin, I=inside, O=outside) | [ent_iob="B"] |
| morph | Morphological features (UD FEATS format) | [morph=".*Number=Plur.*"] |
Note: The corpus uses Penn Treebank POS tags (fine-grained, e.g., NN, VBD, JJ) rather than Universal Dependencies coarse tags.
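The head_n attribute stores a 1-based position, with 0 marking the root. The sketch below shows one way to resolve it to the head token. It assumes positions are counted within the sentence, which this page does not state; the `Token` type, the `head_of` helper, and the example parse are hypothetical.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    word: str
    deprel: str
    head_n: int  # 1-based position of the head; 0 marks the root


def head_of(sentence: list, index: int) -> Optional[Token]:
    """Return the head of the token at 0-based `index`, or None for the root."""
    head_n = sentence[index].head_n
    if head_n == 0:
        return None
    return sentence[head_n - 1]  # convert 1-based position to 0-based index


# Hypothetical parse of "She likes haggis", not corpus output.
sentence = [
    Token("She", "nsubj", 2),
    Token("likes", "ROOT", 0),
    Token("haggis", "dobj", 2),
]
pairs = [(tok.word, getattr(head_of(sentence, i), "word", None))
         for i, tok in enumerate(sentence)]
```

The only subtlety is the off-by-one conversion: a head_n of 2 points at the second token, which is index 1 in a Python list.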
Document Attributes
| Attribute | Description | Example |
|---|---|---|
| doc.id | Reddit post/comment ID | Filter by specific post |
| doc.subreddit | Subreddit name | Filter: Wales, Scotland, northernireland, AskUK, AskAnAmerican |
| doc.author | Reddit username | Filter by author |
| doc.date | Publication date (ISO format) | Filter by date range |
| doc.year | Publication year | Filter by year |
| doc.doc_type | Type: comment or submission | Filter by type |
| doc.permalink | Full URL to source | Link back to Reddit |
| doc.parent_id | Parent post/comment ID | Reply tree reconstruction |
| doc.link_id | Thread ID | Group comments by thread |
| doc.score | Vote score | Engagement filtering |
| doc.lang | Detected language (ISO 639-1) | Filter by language |
SNA Metadata
The corpus includes Social Network Analysis fields for studying Reddit discourse structure:
- parent_id: Links comments to their parent (enables reply tree reconstruction)
- link_id: Groups all comments in a thread together
- score: Reddit vote score for engagement analysis
- permalink: Direct link to source for verification
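Reply tree reconstruction from parent_id can be sketched as follows. The sketch assumes that parent_id and link_id carry Reddit's type prefixes (t1_ for comments, t3_ for submissions) while id does not; whether this corpus stores them that way is not stated here. The helper names and records are hypothetical.

```python
from collections import defaultdict


def strip_prefix(fullname: str) -> str:
    """Drop a Reddit type prefix such as 't1_' (comment) or 't3_' (submission)."""
    return fullname.split("_", 1)[1] if "_" in fullname else fullname


def build_reply_tree(docs: list) -> dict:
    """Map each parent ID to the IDs of its direct replies."""
    children = defaultdict(list)
    for doc in docs:
        parent = doc.get("parent_id")
        if parent:  # submissions have no parent
            children[strip_prefix(parent)].append(doc["id"])
    return dict(children)


def thread_depth(children: dict, root_id: str) -> int:
    """Length of the longest reply chain below `root_id` (0 if no replies)."""
    replies = children.get(root_id, [])
    return 0 if not replies else 1 + max(thread_depth(children, r) for r in replies)


# Hypothetical records shaped like the document attributes above.
docs = [
    {"id": "s1", "doc_type": "submission", "parent_id": None, "link_id": None},
    {"id": "c1", "doc_type": "comment", "parent_id": "t3_s1", "link_id": "t3_s1"},
    {"id": "c2", "doc_type": "comment", "parent_id": "t1_c1", "link_id": "t3_s1"},
    {"id": "c3", "doc_type": "comment", "parent_id": "t3_s1", "link_id": "t3_s1"},
]
tree = build_reply_tree(docs)
```

The resulting parent-to-children mapping is enough for measures such as thread depth or replies per comment; grouping by link_id first keeps each thread separate in a larger export.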
CQL Query Examples
Basic Word Search
Find all occurrences of a word, in any inflected form:
[lemma="think"]
Find an exact word form:
[word="Scottish"]
Named Entity Recognition
The spaCy English model uses NER labels such as PERSON, ORG, GPE, LOC, DATE, MONEY, and PERCENT.
Find all person names:
[ent_type="PERSON"]
Find all geopolitical entities (countries, cities, etc.):
[ent_type="GPE"]
Find all organizations:
[ent_type="ORG"]
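Because ent_type is a per-token attribute, a multi-word entity is spread over several tokens, and ent_iob marks where each one begins. The sketch below shows one way to group exported tokens back into entity spans; the `entity_spans` helper and the token triples are hypothetical, not corpus output.

```python
def entity_spans(tokens: list) -> list:
    """Group (word, ent_iob, ent_type) triples into (entity text, type) pairs."""
    spans, current, current_type = [], [], None
    for word, iob, ent_type in tokens:
        if iob == "B" or (iob == "I" and not current):
            # A new entity starts; close any entity still open.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [word], ent_type
        elif iob == "I":
            current.append(word)
        else:  # "O": outside any entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


# Hypothetical token annotations, not corpus output.
tokens = [
    ("I", "O", ""),
    ("moved", "O", ""),
    ("from", "O", ""),
    ("New", "B", "GPE"),
    ("York", "I", "GPE"),
    ("to", "O", ""),
    ("Cardiff", "B", "GPE"),
]
spans = entity_spans(tokens)
```

Counting spans rather than tokens avoids counting "New York" twice when comparing how often place names occur across subreddits.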
Part-of-Speech Queries
The Penn Treebank tagset uses fine-grained tags.
Find all singular nouns:
[tag="NN"]
Find all plural nouns:
[tag="NNS"]
Find all verbs (past tense):
[tag="VBD"]
Find all adjectives:
[tag="JJ"]
Find all proper nouns:
[tag="NNP"]
Common POS Tag Patterns
| Pattern | Meaning |
|---|---|
| NN.* | All noun forms |
| VB.* | All verb forms |
| JJ.* | All adjective forms |
| RB.* | All adverb forms |
| PRP.* | All pronouns |
| NNP.* | All proper nouns |
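To see which tags each pattern covers, the patterns can be tried against a list of Penn Treebank tags. This sketch assumes the query engine matches a regular expression against the whole attribute value, as CQL engines commonly do; the engine behind this corpus is not named here, so treat that as an assumption. The tag list is a subset chosen for illustration.

```python
import re

# A subset of Penn Treebank tags, enough to illustrate the patterns.
PTB_TAGS = ["NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
            "JJ", "JJR", "JJS", "RB", "RBR", "RBS", "PRP", "PRP$", "MD", "IN"]


def tags_matching(pattern: str) -> list:
    """Return the tags that the pattern matches in full (anchored match)."""
    return [tag for tag in PTB_TAGS if re.fullmatch(pattern, tag)]


nouns = tags_matching("NN.*")
proper = tags_matching("NNP.*")
pronouns = tags_matching("PRP.*")
verbs = tags_matching("VB.*")
```

Two details stand out: NN.* also matches the proper noun tags NNP and NNPS, and VB.* does not match the modal tag MD.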
Morphological Features
Find plural nouns:
[morph=".*Number=Plur.*"]
Find third person verbs:
[morph=".*Person=3.*"]
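A morph value in UD FEATS format packs several Feature=Value pairs into one string separated by `|`, which is why a morph query usually needs a regular expression rather than an exact match. The sketch below parses such a string; `parse_feats` is a hypothetical helper, and the example string is the kind of value spaCy typically assigns to a verb like "thinks".

```python
def parse_feats(feats: str) -> dict:
    """Parse a UD FEATS string such as 'Number=Sing|Person=3' into a dict."""
    if not feats or feats == "_":
        return {}  # empty or underscore means no features
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Illustrative feature string, not taken from the corpus.
feats = parse_feats("Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")
is_third_person = feats.get("Person") == "3"
```

Parsing exported values this way makes it straightforward to filter on one feature while ignoring the others.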
Dependency Relations
Find subjects:
[deprel="nsubj"]
Find direct objects:
[deprel="dobj"]
Find prepositional objects:
[deprel="pobj"]
Combined Queries
Find verbs followed by a noun, with up to 3 tokens in between:
[tag="VB.*"] []{0,3} [tag="NN.*"]
Find adjective + noun combinations:
[tag="JJ.*"] [tag="NN.*"]
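When many query variants are needed, combined queries can be assembled from parts. The helpers below are a hypothetical sketch, not part of any corpus tooling. They assume standard CQL syntax: `&` joins constraints on one token, `[]` matches any token, and `{0,n}` is a repetition range, read here as up to n intervening tokens.

```python
def token(**constraints: str) -> str:
    """Build one CQL token expression, e.g. token(tag="NN.*") -> [tag="NN.*"]."""
    if not constraints:
        return "[]"  # matches any token
    parts = []
    for attr, value in constraints.items():
        escaped = value.replace('"', '\\"')  # escape embedded double quotes
        parts.append(f'{attr}="{escaped}"')
    return "[" + " & ".join(parts) + "]"


def within(first: str, second: str, max_gap: int) -> str:
    """Allow up to `max_gap` arbitrary tokens between two token expressions."""
    return f"{first} []{{0,{max_gap}}} {second}"


verb_then_noun = within(token(tag="VB.*"), token(tag="NN.*"), 3)
adj_noun = f'{token(tag="JJ.*")} {token(tag="NN.*")}'
think_as_verb = token(lemma="think", tag="VB.*")
```

Generating the strings in code keeps quoting consistent, which matters once the query is embedded in a JSON request body.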
Filter by Subreddit
Search within a specific subreddit:
- Use the metadata filter to select a subreddit (e.g., Scotland)
- Run your query
Regional Variation Studies
The corpus enables comparative studies across regional English varieties:
| Region | Subreddit | Features |
|---|---|---|
| Wales | Wales | Welsh English, bilingual influences |
| Scotland | Scotland | Scottish English, Scots vocabulary |
| Northern Ireland | northernireland | Northern Irish English, Ulster Scots |
| United Kingdom | AskUK | General British English, Q&A format |
| United States | AskAnAmerican | American English, Q&A format |
Example research questions:
- Lexical differences between British and American English
- Regional discourse markers across UK varieties
- Spelling differences (colour/color, realise/realize)
- Code-switching patterns with Welsh, Scots, or Irish
- Comparative analysis of Celtic Fringe vs. general British English
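Because the subreddits differ greatly in size, comparisons like those above are usually made on normalized rather than raw frequencies. The sketch below computes occurrences per million words. The hit counts and subcorpus sizes are hypothetical, since this page gives per-subreddit document counts but not word counts.

```python
def per_million(hits: int, corpus_words: int) -> float:
    """Normalize a raw hit count to occurrences per million words."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return hits / corpus_words * 1_000_000


# Hypothetical hit counts and subcorpus sizes; the real per-subreddit word
# counts are not given in this document.
samples = {
    "Scotland": {"hits": 1_200, "words": 150_000_000},
    "AskAnAmerican": {"hits": 900, "words": 400_000_000},
}
rates = {name: per_million(s["hits"], s["words"]) for name, s in samples.items()}
```

With these made-up figures the smaller subcorpus has the higher relative frequency even though the raw counts are close, which is the effect normalization is meant to expose.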
Programmatic Access (API)
Query the corpus programmatically via the REST API:
Example Request (curl)
curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"think\"]",
    "corpus_id": "enRed",
    "limit": 50
  }'
Quick Reference
| Goal | CQL Query |
|---|---|
| Exact word | [word="Scottish"] |
| All forms (lemma) | [lemma="think"] |
| Adjacent words | [lemma="I"] [lemma="think"] |
| Person names | [ent_type="PERSON"] |
| Organizations | [ent_type="ORG"] |
| Locations (GPE) | [ent_type="GPE"] |
| Singular nouns | [tag="NN"] |
| Plural nouns | [tag="NNS"] |
| Past tense verbs | [tag="VBD"] |
| Adjectives | [tag="JJ"] |
| Subjects | [deprel="nsubj"] |
| Direct objects | [deprel="dobj"] |
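Any query from this table can be sent through the REST API described under Programmatic Access. The sketch below mirrors the curl request using only the Python standard library. The endpoint and the fields cql, corpus_id, and limit come from that example; the helper names are hypothetical, and the assumption that the reply is JSON is not confirmed on this page.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/cql/run"  # endpoint taken from the curl example


def build_request(cql: str, corpus_id: str = "enRed",
                  limit: int = 50) -> urllib.request.Request:
    """Build (but do not send) a POST request with a JSON body."""
    payload = json.dumps({"cql": cql, "corpus_id": corpus_id, "limit": limit})
    return urllib.request.Request(
        API_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(cql: str, **kwargs):
    """Send the request and decode the reply, assuming it is JSON."""
    with urllib.request.urlopen(build_request(cql, **kwargs)) as response:
        return json.load(response)


# Building the request needs no running server; run_query() does.
request = build_request('[lemma="think"]')
```

Letting json.dumps serialize the body takes care of escaping the double quotes inside the CQL string, which the curl example has to do by hand.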
References
- Academic Torrents Reddit Archive: academictorrents.com
- fastText Language ID: fasttext.cc
- spaCy English Model: en_core_web_lg
- Penn Treebank Tagset: upenn.edu