enRed Corpus

English Reddit Corpus — Regional Varieties (2010-2024)

Note: Current Status

English Reddit corpus: 5 subreddits with 1.25B words, 34.3M documents. Full dependency parsing and NER are enabled. Includes SNA metadata (parent_id, link_id, score) for network analysis.

Corpus Overview

The enRed (English Reddit Corpus) contains English-language posts and comments from regional English subreddits, providing access to contemporary informal written English discourse from the UK (Wales, Scotland, Northern Ireland, and general UK) and the United States.

The corpus enables studies of regional English varieties, comparing British Isles English (Welsh, Scottish, Northern Irish, general British) with American English.

Language detection uses the fastText lid.176.bin model, which assigns each document an ISO 639-1 code. The vast majority of content is English (en); frequency displays apply a 1% threshold, which filters out the occasional misclassified content.
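As a rough illustration of the 1% display threshold, the sketch below strips the `__label__` prefix that fastText puts on its predictions and drops languages whose share of documents falls below the cutoff. The function names, the "at or above" comparison, and the sample counts are assumptions made for this example; the corpus's own filtering code is not shown in this document.

```python
from collections import Counter

LABEL_PREFIX = "__label__"  # prefix fastText puts on predicted labels


def label_to_code(label: str) -> str:
    """Strip the fastText label prefix, e.g. '__label__en' -> 'en'."""
    return label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label


def filter_by_share(labels, threshold=0.01):
    """Return {code: share} for languages at or above the threshold share."""
    counts = Counter(label_to_code(label) for label in labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.items() if n / total >= threshold}


# Made-up sample: 995 English documents and 5 labelled as Welsh (cy).
sample = ["__label__en"] * 995 + ["__label__cy"] * 5
shares = filter_by_share(sample)
print(shares)
```

With this sample only `en` survives, because `cy` accounts for 0.5% of the documents.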

The corpus is processed with the full spaCy en_core_web_lg (large English model) pipeline: tokenization, POS tagging (Penn Treebank), lemmatization, dependency parsing, and named entity recognition.

Key Statistics

| Metric | Value |
|---|---|
| Total Words | 1.25 billion |
| Documents | 34,300,000 |
| Subreddits | 5 |
| Time Period | 2010-2024 |
| Language | English |
| spaCy Model | en_core_web_lg |

Subreddits Included

| Subreddit | Region | Documents | Description |
|---|---|---|---|
| Wales | Wales | 472,000 | Welsh English community |
| Scotland | Scotland | 4,500,000 | Scottish English community |
| northernireland | Northern Ireland | 3,500,000 | Northern Irish English community |
| AskUK | United Kingdom | 15,000,000 | General British English Q&A |
| AskAnAmerican | United States | 10,800,000 | American English Q&A community |

Data Source

The corpus is derived from:


Available Attributes

The corpus includes the following searchable attributes:

Token Attributes

| Attribute | Description | Example |
|---|---|---|
| word | Exact word form | [word="the"] |
| lemma | Base form (lemma) | [lemma="think"] |
| tag | POS tag (Penn Treebank) | [tag="NN"] |
| head_n | Head position (1-based, 0 for root) | Dependency parsing |
| head | Head lemma | Dependency parsing |
| deprel | Dependency relation (spaCy English labels) | [deprel="nsubj"] |
| ent_type | Named entity type (GPE, PERSON, ORG, etc.) | [ent_type="GPE"] |
| ent_iob | NER IOB tag (B=begin, I=inside, O=outside) | [ent_iob="B"] |
| morph | Morphological features (UD FEATS format) | [morph=".*Number=Plur.*"] |

Note: The corpus uses fine-grained Penn Treebank POS tags (e.g., NN, VBD, JJ) rather than the coarse Universal Dependencies tags (e.g., NOUN, VERB, ADJ).

Document Attributes

| Attribute | Description | Example |
|---|---|---|
| doc.id | Reddit post/comment ID | Filter by specific post |
| doc.subreddit | Subreddit name | Filter: Wales, Scotland, northernireland, AskUK, AskAnAmerican |
| doc.author | Reddit username | Filter by author |
| doc.date | Publication date (ISO format) | Filter by date range |
| doc.year | Publication year | Filter by year |
| doc.doc_type | Type: comment or submission | Filter by type |
| doc.permalink | Full URL to source | Link back to Reddit |
| doc.parent_id | Parent post/comment ID | Reply tree reconstruction |
| doc.link_id | Thread ID | Group comments by thread |
| doc.score | Vote score | Engagement filtering |
| doc.lang | Detected language (ISO 639-1) | Filter by language |

SNA Metadata

The corpus includes Social Network Analysis fields for studying Reddit discourse structure:

  • parent_id: Links comments to their parent (enables reply tree reconstruction)
  • link_id: Groups all comments in a thread together
  • score: Reddit vote score for engagement analysis
  • permalink: Direct link to source for verification
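The reply-tree reconstruction mentioned above might look like the following minimal sketch. It assumes documents have been exported as dictionaries with `id`, `parent_id`, and `link_id` keys, and that parent IDs may carry Reddit's type prefixes (`t1_` for comments, `t3_` for submissions). The export format, the helper names, and the sample records are hypothetical.

```python
from collections import defaultdict


def strip_prefix(fullname: str) -> str:
    """Drop a Reddit type prefix such as 't1_' or 't3_' if present."""
    head, sep, tail = fullname.partition("_")
    return tail if sep and head in {"t1", "t3"} else fullname


def build_reply_tree(docs):
    """Map each parent ID to the IDs of its direct replies."""
    children = defaultdict(list)
    for doc in docs:
        parent = doc.get("parent_id")
        if parent:  # submissions have no parent
            children[strip_prefix(parent)].append(doc["id"])
    return dict(children)


def thread_depth(children, root_id):
    """Length of the longest reply chain below root_id (0 if no replies)."""
    replies = children.get(root_id, [])
    return 0 if not replies else 1 + max(thread_depth(children, r) for r in replies)


# Hypothetical records: one submission and three comments in its thread.
docs = [
    {"id": "abc", "parent_id": None, "link_id": "t3_abc"},
    {"id": "c1", "parent_id": "t3_abc", "link_id": "t3_abc"},
    {"id": "c2", "parent_id": "t1_c1", "link_id": "t3_abc"},
    {"id": "c3", "parent_id": "t3_abc", "link_id": "t3_abc"},
]
tree = build_reply_tree(docs)
print(tree)
print(thread_depth(tree, "abc"))
```

The recursive depth function is fine for a sketch; very deep threads would call for an iterative traversal instead.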

CQL Query Examples

Named Entity Recognition

The spaCy English model uses NER labels such as PERSON, ORG, GPE, LOC, DATE, MONEY, and PERCENT.

Find all person names:

[ent_type="PERSON"]

Find all geopolitical entities (countries, cities, etc.):

[ent_type="GPE"]

Find all organizations:

[ent_type="ORG"]
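Because ent_type is a token-level attribute, a query such as [ent_type="GPE"] matches every token of a multi-word entity separately. The sketch below shows one way the ent_iob tags could be used to regroup tokens into whole entity spans after export. The tuple layout and the sample tokens are assumptions for illustration, not corpus output.

```python
def entity_spans(tokens):
    """Group (word, ent_iob, ent_type) triples into (text, label) entity spans."""
    spans, current, label = [], [], None
    for word, iob, ent_type in tokens:
        if iob == "B" or (iob == "I" and not current):
            # A new entity starts (a stray "I" is treated as a start too).
            if current:
                spans.append((" ".join(current), label))
            current, label = [word], ent_type
        elif iob == "I":
            current.append(word)
        else:  # "O": outside any entity
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans


# Hand-made example tokens.
tokens = [
    ("I", "O", ""),
    ("moved", "O", ""),
    ("from", "O", ""),
    ("New", "B", "GPE"),
    ("York", "I", "GPE"),
    ("to", "O", ""),
    ("Cardiff", "B", "GPE"),
]
spans = entity_spans(tokens)
print(spans)
```

Within CQL itself, combining the two attributes, for example [ent_type="GPE" & ent_iob="B"], should match only the first token of each entity, which is one way to count entities rather than tokens.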

Part-of-Speech Queries

The Penn Treebank tagset is fine-grained: a single tag encodes the word class together with features such as number or tense.

Find all singular nouns:

[tag="NN"]

Find all plural nouns:

[tag="NNS"]

Find all past-tense verbs:

[tag="VBD"]

Find all adjectives:

[tag="JJ"]

Find all proper nouns:

[tag="NNP"]

Common POS Tag Patterns

| Pattern | Meaning |
|---|---|
| NN.* | All noun forms |
| VB.* | All verb forms |
| JJ.* | All adjective forms |
| RB.* | All adverb forms |
| PRP.* | All pronouns |
| NNP.* | All proper nouns |
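To check which tags a pattern covers before running a query, the patterns can be tried against a list of Penn Treebank tags. This sketch uses Python's `re.fullmatch` as a stand-in for the query engine's regex matching, on the assumption that the engine matches a pattern against the whole attribute value; the tag list is a partial selection chosen for the example.

```python
import re

# A partial list of Penn Treebank tags, enough to exercise the patterns.
PENN_TAGS = ["NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
             "JJ", "JJR", "JJS", "RB", "RBR", "RBS", "PRP", "PRP$"]


def tags_matching(pattern, tags=PENN_TAGS):
    """Return the tags that the pattern matches in full."""
    return [tag for tag in tags if re.fullmatch(pattern, tag)]


noun_tags = tags_matching("NN.*")
proper_tags = tags_matching("NNP.*")
pronoun_tags = tags_matching("PRP.*")
print(noun_tags)
print(proper_tags)
print(pronoun_tags)
```

Note that NN.* also covers the proper-noun tags NNP and NNPS; a pattern such as NNS? would restrict the match to common nouns.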

Morphological Features

Find plural nouns:

[tag="NN.*" & morph=".*Number=Plur.*"]

Find third person verbs:

[tag="VB.*" & morph=".*Person=3.*"]
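The morph attribute holds all of a token's features in one string, with Feature=Value pairs joined by `|` in UD FEATS format, so a query on a single feature needs wildcards on either side (for example .*Person=3.*). For exported results, a small parser like the hypothetical one below turns that string into a dictionary; the sample value is illustrative.

```python
def parse_feats(morph: str) -> dict:
    """Parse a UD FEATS string such as 'Number=Sing|Person=3' into a dict."""
    if not morph or morph == "_":  # "_" marks an empty FEATS value in CoNLL-U
        return {}
    return dict(pair.split("=", 1) for pair in morph.split("|"))


# Illustrative value for a third-person singular present-tense verb.
feats = parse_feats("Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")
print(feats["Person"])
print(feats.get("Number"))
```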

Dependency Relations

Find subjects:

[deprel="nsubj"]

Find direct objects:

[deprel="dobj"]

Find prepositional objects:

[deprel="pobj"]
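The head_n attribute makes it possible to link each match back to its governing word. The sketch below assumes head_n counts token positions within the sentence (1-based, with 0 marking the root) and that a sentence has been exported as a list of dictionaries. The export layout and the hand-annotated sentence are assumptions for this example.

```python
def attach_heads(sentence):
    """Pair each token with its head word, using 1-based head_n (0 = root)."""
    pairs = []
    for token in sentence:
        head_n = token["head_n"]
        head_word = "ROOT" if head_n == 0 else sentence[head_n - 1]["word"]
        pairs.append((token["word"], token["deprel"], head_word))
    return pairs


# Hand-annotated example sentence; values are illustrative, not corpus output.
sentence = [
    {"word": "She", "deprel": "nsubj", "head_n": 2},
    {"word": "likes", "deprel": "ROOT", "head_n": 0},
    {"word": "tea", "deprel": "dobj", "head_n": 2},
]
pairs = attach_heads(sentence)
for word, deprel, head in pairs:
    print(f"{word} --{deprel}--> {head}")
```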

Combined Queries

Find a verb followed by a noun, with up to three intervening tokens:

[tag="VB.*"] []{0,3} [tag="NN.*"]

Find adjective + noun combinations:

[tag="JJ.*"] [tag="NN.*"]

Filter by Subreddit

Search within a specific subreddit:

  1. Use the metadata filter to select a subreddit (e.g., Scotland)
  2. Run your query

Regional Variation Studies

The corpus enables comparative studies across regional English varieties:

| Region | Subreddit | Features |
|---|---|---|
| Wales | Wales | Welsh English, bilingual influences |
| Scotland | Scotland | Scottish English, Scots vocabulary |
| Northern Ireland | northernireland | Northern Irish English, Ulster Scots |
| United Kingdom | AskUK | General British English, Q&A format |
| United States | AskAnAmerican | American English, Q&A format |

Example research questions:

  • Lexical differences between British and American English
  • Regional discourse markers across UK varieties
  • Spelling differences (colour/color, realise/realize)
  • Code-switching patterns with Welsh, Scots, or Irish
  • Comparative analysis of Celtic Fringe vs. general British English
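Since the subreddits differ greatly in size, raw hit counts are not directly comparable across them. A common approach is to normalise counts to occurrences per million words, sketched below. The hit counts and word totals are made up for illustration and are not corpus results; per-subreddit word totals are not given in this document.

```python
def per_million(hits: int, corpus_words: int) -> float:
    """Normalise a raw hit count to occurrences per million words."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return hits / corpus_words * 1_000_000


# Made-up figures for illustration only; they are not corpus results.
counts = {
    "AskUK": {"hits": 1200, "words": 600_000_000},
    "AskAnAmerican": {"hits": 900, "words": 300_000_000},
}
rates = {name: per_million(c["hits"], c["words"]) for name, c in counts.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f} per million words")
```

In this invented case the smaller subcorpus has fewer raw hits but a higher relative frequency, which is exactly the situation normalisation is meant to expose.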

Programmatic Access (API)

Query the corpus programmatically via the REST API:

Example Request (curl)

curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"think\"]",
    "corpus_id": "enRed",
    "limit": 50
  }'
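The same request can be issued from Python using only the standard library. This sketch mirrors the URL and the JSON fields of the curl example; it assumes the server replies with JSON, since the response format is not documented here, and the function names are chosen for this example.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/cql/run"  # endpoint from the curl example


def build_request(cql, corpus_id="enRed", limit=50):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {"cql": cql, "corpus_id": corpus_id, "limit": limit}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(cql, **kwargs):
    """Send the query and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(build_request(cql, **kwargs), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request needs no running server; run_query() does.
request = build_request('[lemma="think"]')
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Letting `json.dumps` serialise the payload avoids the manual quote escaping needed in the shell version.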

Quick Reference

| Goal | CQL Query |
|---|---|
| Exact word | [word="Scottish"] |
| All forms (lemma) | [lemma="think"] |
| Adjacent words | [lemma="I"] [lemma="think"] |
| Person names | [ent_type="PERSON"] |
| Organizations | [ent_type="ORG"] |
| Locations (GPE) | [ent_type="GPE"] |
| Singular nouns | [tag="NN"] |
| Plural nouns | [tag="NNS"] |
| Past tense verbs | [tag="VBD"] |
| Adjectives | [tag="JJ"] |
| Subjects | [deprel="nsubj"] |
| Direct objects | [deprel="dobj"] |
