enRed Corpus

English Reddit Corpus — Regional Varieties (2010-2024)

Current Status

English Reddit corpus: 5 subreddits with 1.25B words, 34.3M documents. Full dependency parsing and NER are enabled. Includes SNA metadata (parent_id, link_id, score) for network analysis.

Corpus Overview

The enRed (English Reddit Corpus) contains English-language posts and comments from regional English subreddits, providing access to contemporary informal written English discourse from the UK (Wales, Scotland, Northern Ireland, and general UK) and the United States.

The corpus enables studies of regional English varieties, comparing British Isles English (Welsh, Scottish, Northern Irish, general British) with American English.

Language detection uses fastText (lid.176.bin model) with ISO 639-1 codes. The vast majority of content is English (en). The occasional misclassified document falls below the 1% threshold applied in frequency displays, so other language codes do not appear there.
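The effect of a 1% display threshold can be pictured as a share filter over per-language document counts. The sketch below is only an illustration under assumptions: the function name `visible_languages`, the "at or above" comparison, and the sample counts are all hypothetical and not taken from the corpus tooling.

```python
from collections import Counter


def visible_languages(lang_counts, threshold=0.01):
    """Return {lang: share} for languages at or above the display threshold.

    Assumption: the threshold applies to each language's share of documents.
    """
    total = sum(lang_counts.values())
    if total == 0:
        return {}
    return {
        lang: count / total
        for lang, count in lang_counts.items()
        if count / total >= threshold
    }


# Hypothetical counts, for illustration only (not corpus statistics).
sample = Counter({"en": 9900, "cy": 60, "gd": 40})
shares = visible_languages(sample)
```

With these made-up counts only `en` survives the filter, because the other two codes each account for less than 1% of the total.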

The corpus is processed with spaCy en_core_web_lg (the large English model) using the full pipeline: tokenization, POS tagging (Penn Treebank), lemmatization, dependency parsing, and named entity recognition.

Key Statistics

Metric	Value
Total Words	1.25 billion
Documents	34,300,000
Subreddits	5
Time Period	2010-2024
Language	English
spaCy Model	en_core_web_lg

Subreddits Included

Subreddit	Region	Documents	Description
Wales	Wales	472,000	Welsh English community
Scotland	Scotland	4,500,000	Scottish English community
northernireland	Northern Ireland	3,500,000	Northern Irish English community
AskUK	United Kingdom	15,000,000	General British English Q&A
AskAnAmerican	United States	10,800,000	American English Q&A community
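As a quick consistency check, the per-subreddit document counts in the table can be summed and compared with the headline figure of 34.3M documents. The counts below are copied from the table and appear to be rounded, so the derived shares are approximate.

```python
# Document counts as listed in the table above (rounded figures).
documents = {
    "Wales": 472_000,
    "Scotland": 4_500_000,
    "northernireland": 3_500_000,
    "AskUK": 15_000_000,
    "AskAnAmerican": 10_800_000,
}

total = sum(documents.values())

# Share of the corpus contributed by each subreddit, in percent.
shares = {name: round(100 * n / total, 1) for name, n in documents.items()}

# UK-based subreddits versus the single US subreddit.
uk_total = total - documents["AskAnAmerican"]
```

The sum comes to 34,272,000, which rounds to the stated 34.3M. The four UK subreddits together hold roughly twice as many documents as AskAnAmerican, which is worth keeping in mind when comparing raw frequencies across regions.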

Data Source

The corpus is derived from:

Academic Torrents: Reddit archive dumps via academictorrents.com
fastText Language ID: Comments filtered using lid.176.bin model

Available Attributes

The corpus includes the following searchable attributes:

Token Attributes

Attribute	Description	Example
`word`	Exact word form	[word="the"]
`lemma`	Base form (lemma)	[lemma="think"]
`tag`	POS tag (Penn Treebank)	[tag="NN"]
`head_n`	Head position (1-based, 0 for root)	[head_n="0"]
`head`	Head lemma	[head="think"]
`deprel`	Dependency relation (spaCy English labels, e.g., nsubj, dobj, pobj)	[deprel="nsubj"]
`ent_type`	Named entity type (GPE, PERSON, ORG, etc.)	[ent_type="GPE"]
`ent_iob`	NER IOB tag (B=begin, I=inside, O=outside)	[ent_iob="B"]
`morph`	Morphological features (UD FEATS format)	[morph=".*Number=Plur.*"]

Note: The corpus uses Penn Treebank POS tags (fine-grained, e.g., NN, VBD, JJ) rather than Universal Dependencies coarse tags.
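To make the attribute names concrete, the sketch below maps one spaCy token to the token attributes listed above. It is a guess at the correspondence, not the corpus's actual export code: `token_row` is a hypothetical helper, and it assumes that `head_n` counts positions within the sentence and that a root token is stored with `head_n` 0. The spaCy attributes it reads (`text`, `lemma_`, `tag_`, `head`, `i`, `dep_`, `ent_type_`, `ent_iob_`, `morph`) are real.

```python
def token_row(token, sent_start):
    """Map one spaCy-style token to the corpus's token attributes.

    `sent_start` is the document index of the first token in the sentence.
    Assumption: head_n is 1-based within the sentence, 0 for the root.
    """
    is_root = token.head.i == token.i  # in spaCy, a root is its own head
    return {
        "word": token.text,
        "lemma": token.lemma_,
        "tag": token.tag_,
        "head_n": 0 if is_root else token.head.i - sent_start + 1,
        "head": token.head.lemma_,
        "deprel": token.dep_,
        "ent_type": token.ent_type_,
        "ent_iob": token.ent_iob_,
        "morph": str(token.morph),
    }


# Usage with spaCy (requires the model to be installed):
#   import spacy
#   nlp = spacy.load("en_core_web_lg")
#   for sent in nlp("I think so.").sents:
#       rows = [token_row(t, sent.start) for t in sent]
```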

Document Attributes

Attribute	Description	Typical use
`doc.id`	Reddit post/comment ID	Filter by specific post
`doc.subreddit`	Subreddit name	Filter: Wales, Scotland, northernireland, AskUK, AskAnAmerican
`doc.author`	Reddit username	Filter by author
`doc.date`	Publication date (ISO format)	Filter by date range
`doc.year`	Publication year	Filter by year
`doc.doc_type`	Type: comment or submission	Filter by type
`doc.permalink`	Full URL to source	Link back to Reddit
`doc.parent_id`	Parent post/comment ID	Reply tree reconstruction
`doc.link_id`	Thread ID	Group comments by thread
`doc.score`	Vote score	Engagement filtering
`doc.lang`	Detected language (ISO 639-1)	Filter by language

SNA Metadata

The corpus includes Social Network Analysis fields for studying Reddit discourse structure:

parent_id: Links comments to their parent (enables reply tree reconstruction)
link_id: Groups all comments in a thread together
score: Reddit vote score for engagement analysis
permalink: Direct link to source for verification
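The fields above are enough to rebuild reply trees offline. The following is a minimal sketch, assuming documents have been exported as dictionaries keyed by the attribute names without the `doc.` prefix, and that `parent_id` and `link_id` may carry Reddit's type prefixes (`t1_` for comments, `t3_` for submissions). Whether the corpus stores the prefixes is not stated here, so the code strips them only when present. The sample records are invented.

```python
from collections import defaultdict


def strip_prefix(fullname):
    """Drop a Reddit type prefix such as 't1_' or 't3_' if present."""
    head, sep, tail = fullname.partition("_")
    return tail if sep and head in {"t1", "t3"} else fullname


def build_reply_tree(docs):
    """Return {thread_id: {parent_id: [child comment ids]}}."""
    threads = defaultdict(lambda: defaultdict(list))
    for doc in docs:
        if doc["doc_type"] != "comment":
            continue  # submissions are thread roots, not replies
        thread = strip_prefix(doc["link_id"])
        parent = strip_prefix(doc["parent_id"])
        threads[thread][parent].append(doc["id"])
    return threads


def depth(children, node):
    """Number of reply levels below `node`."""
    return max((1 + depth(children, c) for c in children.get(node, [])), default=0)


# Invented records, for illustration only.
docs = [
    {"id": "s1", "doc_type": "submission", "link_id": "", "parent_id": ""},
    {"id": "c1", "doc_type": "comment", "link_id": "t3_s1", "parent_id": "t3_s1"},
    {"id": "c2", "doc_type": "comment", "link_id": "t3_s1", "parent_id": "t1_c1"},
    {"id": "c3", "doc_type": "comment", "link_id": "t3_s1", "parent_id": "t1_c1"},
]
tree = build_reply_tree(docs)
```

Here `tree["s1"]` maps the submission to its top-level comment `c1`, and `c1` to its two replies. The same parent-to-children mapping can be handed to a graph library for network measures.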

CQL Query Examples

Basic Word Search

Find all occurrences of a word:

[lemma="think"]

Find exact word form:

[word="Scottish"]

Named Entity Recognition

The spaCy English model assigns NER labels such as PERSON, ORG, GPE, LOC, DATE, MONEY, and PERCENT.

Find all person names:

[ent_type="PERSON"]

Find all geopolitical entities (countries, cities, etc.):

[ent_type="GPE"]

Find all organizations:

[ent_type="ORG"]
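A query on `ent_type` alone returns every token inside an entity, so a two-word name yields two hits. The `ent_iob` attribute lets you treat multiword entities as units. In CQL this could plausibly be written as `[ent_iob="B" & ent_type="GPE"] [ent_iob="I" & ent_type="GPE"]*`, assuming the engine supports the `*` quantifier on tokens. The Python sketch below shows the same grouping logic on exported tokens. The function name and the hand-tagged sample sentence are illustrative, not real corpus output.

```python
def entity_spans(tokens):
    """Group (word, ent_iob, ent_type) triples into (text, label) spans."""
    spans, current, label = [], [], None
    for word, iob, ent_type in tokens:
        if iob == "B" or (iob == "I" and not current):
            # A new entity starts; close any entity still open.
            if current:
                spans.append((" ".join(current), label))
            current, label = [word], ent_type
        elif iob == "I":
            current.append(word)  # continue the open entity
        else:
            # "O": outside any entity, so close the open one if there is one.
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans


# Hand-tagged example, for illustration only.
tokens = [
    ("I", "O", ""), ("moved", "O", ""), ("to", "O", ""),
    ("Northern", "B", "GPE"), ("Ireland", "I", "GPE"),
    ("from", "O", ""), ("Cardiff", "B", "GPE"), (".", "O", ""),
]
spans = entity_spans(tokens)
```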

Part-of-Speech Queries

The Penn Treebank tagset is fine-grained: each tag encodes both word class and form (for example, NN vs. NNS).

Find all singular nouns:

[tag="NN"]

Find all plural nouns:

[tag="NNS"]

Find all past-tense verbs:

[tag="VBD"]

Find all base-form adjectives (use JJ.* to include comparatives and superlatives):

[tag="JJ"]

Find all singular proper nouns (use NNP.* to include plurals):

[tag="NNP"]

Common POS Tag Patterns

Pattern	Meaning
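`NN.*`	All noun forms, common and proper (NN, NNS, NNP, NNPS)
`VB.*`	All verb forms (modals are tagged MD and are not matched)
`JJ.*`	All adjective forms
`RB.*`	All adverb forms
`PRP.*`	Personal and possessive pronouns (PRP, PRP$)
`NNP.*`	All proper nouns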
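To see exactly which tags each pattern covers, the patterns can be tried against a list of Penn Treebank tags. The sketch below assumes the query engine matches a regular expression against the whole attribute value, as CQP-style engines usually do; `re.fullmatch` mimics that behaviour. The tag list is a subset chosen for illustration.

```python
import re

# A subset of Penn Treebank tags, enough to exercise the patterns above.
PENN_TAGS = [
    "NN", "NNS", "NNP", "NNPS",
    "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD",
    "JJ", "JJR", "JJS",
    "RB", "RBR", "RBS", "WRB",
    "PRP", "PRP$", "WP",
]


def tags_matching(pattern):
    """Return the tags that the pattern matches in full."""
    return [tag for tag in PENN_TAGS if re.fullmatch(pattern, tag)]


noun_tags = tags_matching("NN.*")
pronoun_tags = tags_matching("PRP.*")
verb_tags = tags_matching("VB.*")
```

Two things stand out: `NN.*` also picks up proper nouns (NNP, NNPS), and `PRP.*` covers personal and possessive pronouns but not wh-pronouns (WP). Modals (MD) fall outside `VB.*`.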

Morphological Features

Find plural nouns:

[tag="NN.*" & morph=".*Number=Plur.*"]

Find third person verbs:

[tag="VB.*" & morph=".*Person=3.*"]
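The `morph` attribute holds a whole UD FEATS string such as `Number=Sing|Person=3|Tense=Pres|VerbForm=Fin`, so a query for one feature needs `.*` on both sides to allow for the other features. The sketch below illustrates this with Python's `re.fullmatch`, on the assumption that the query engine matches patterns against the entire attribute value. The two FEATS strings are typical spaCy output for a plural noun and a third-person singular present verb.

```python
import re


def matches(pattern, feats):
    """True if the pattern matches the entire FEATS string."""
    return re.fullmatch(pattern, feats) is not None


plural_noun = "Number=Plur"
third_person_verb = "Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"

# A bare feature only matches when it is the whole value.
bare_hits_noun = matches("Number=Plur", plural_noun)
bare_hits_verb = matches("Person=3", third_person_verb)

# Wrapping the feature in .* finds it anywhere in the string.
wrapped_hits_noun = matches(".*Number=Plur.*", plural_noun)
wrapped_hits_verb = matches(".*Person=3.*", third_person_verb)
```

The bare pattern succeeds for the noun only because `Number=Plur` happens to be its entire feature string. For the verb, the bare `Person=3` fails while the wrapped form succeeds.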

Dependency Relations

Find subjects:

[deprel="nsubj"]

Find direct objects:

[deprel="dobj"]

Find prepositional objects:

[deprel="pobj"]

Combined Queries

Find verbs followed by a noun, with up to three intervening tokens:

[tag="VB.*"] []{0,3} [tag="NN.*"]

Find adjective + noun combinations:

[tag="JJ.*"] [tag="NN.*"]
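When many such queries are generated in a script, building them from parts avoids bracket and quoting mistakes. The helpers below are a hypothetical convenience layer (none of `token`, `gap`, or `sequence` belongs to the corpus tooling) and they only produce the CQL constructs already shown in this section.

```python
def token(**constraints):
    """Build one CQL token, e.g. token(tag="JJ.*") -> [tag="JJ.*"]."""
    if not constraints:
        return "[]"  # matches any token
    parts = []
    for attr, value in constraints.items():
        if '"' in value:
            # Quote escaping rules depend on the engine, so refuse rather than guess.
            raise ValueError(f"double quote not supported in value: {value!r}")
        parts.append(f'{attr}="{value}"')
    return "[" + " & ".join(parts) + "]"


def gap(low, high):
    """Between `low` and `high` arbitrary tokens."""
    return f"[]{{{low},{high}}}"


def sequence(*tokens):
    """Join token expressions into one query."""
    return " ".join(tokens)


verb_then_noun = sequence(token(tag="VB.*"), gap(0, 3), token(tag="NN.*"))
adj_noun = sequence(token(tag="JJ.*"), token(tag="NN.*"))
```

The two resulting strings are identical to the hand-written queries in this section.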

Filter by Subreddit

Search within a specific subreddit:

Use the metadata filter to select a subreddit (e.g., Scotland)
Run your query

Regional Variation Studies

The corpus enables comparative studies across regional English varieties:

Region	Subreddit	Features
Wales	Wales	Welsh English, bilingual influences
Scotland	Scotland	Scottish English, Scots vocabulary
Northern Ireland	northernireland	Northern Irish English, Ulster Scots
United Kingdom	AskUK	General British English, Q&A format
United States	AskAnAmerican	American English, Q&A format

Example research questions:

Lexical differences between British and American English
Regional discourse markers across UK varieties
Spelling differences (colour/color, realise/realize)
Code-switching patterns with Welsh, Scots, or Irish
Comparative analysis of Celtic Fringe vs. general British English

Programmatic Access (API)

Query the corpus programmatically via the REST API:

Example Request (curl)

curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"think\"]",
    "corpus_id": "enRed",
    "limit": 50
  }'
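The same request can be built in Python with the standard library. This sketch reuses the URL, headers, and JSON fields from the curl example. The function name `build_request` is an illustrative choice, and the response format is not documented here, so the commented-out call simply decodes whatever JSON comes back.

```python
import json
import urllib.request


def build_request(cql, corpus_id="enRed", limit=50,
                  url="http://localhost:8000/cql/run"):
    """Build (but do not send) the POST request from the curl example."""
    payload = {"cql": cql, "corpus_id": corpus_id, "limit": limit}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request('[lemma="think"]')

# Sending requires a running server at the URL above:
# with urllib.request.urlopen(request) as response:
#     result = json.loads(response.read().decode("utf-8"))
```

Using `json.dumps` for the body takes care of escaping the double quotes inside the CQL string, which the curl version has to do by hand.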

Quick Reference

Goal	CQL Query
Exact word	[word="Scottish"]
All forms (lemma)	[lemma="think"]
Adjacent words	[lemma="I"] [lemma="think"]
Person names	[ent_type="PERSON"]
Organizations	[ent_type="ORG"]
Locations (GPE)	[ent_type="GPE"]
Singular nouns	[tag="NN"]
Plural nouns	[tag="NNS"]
Past tense verbs	[tag="VBD"]
Adjectives	[tag="JJ"]
Subjects	[deprel="nsubj"]
Direct objects	[deprel="dobj"]

References

Academic Torrents Reddit Archive: academictorrents.com
fastText Language ID: fasttext.cc
spaCy English Model: en_core_web_lg
Penn Treebank Tagset: upenn.edu