German Reddit Corpus

GeRedE — German Reddit (2010-2018)

Current Status

FAU-filtered corpus: 11 subreddits with 198M words, 6.2M documents. Full dependency parsing and NER are enabled.

Corpus Overview

GeRedE (German Reddit Exchanges) is a corpus of German-language posts and comments from Reddit. It provides access to contemporary, informal written German as used in online communities.

This corpus is based on the GeRedE project by FAU Erlangen-Nürnberg, which identified German-language threads from Reddit archives. Comments are filtered using FAU’s pre-computed thread language scores (threshold: score >= 0.1) to ensure German-only content.
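The thresholding step described above can be sketched as follows. This is a minimal illustration, not the project's actual filtering code: the `Comment` fields, the thread IDs, and the shape of the score table are hypothetical, and only the threshold of 0.1 comes from the description above.

```python
from dataclasses import dataclass

GERMAN_SCORE_THRESHOLD = 0.1  # threshold stated in the corpus description


@dataclass
class Comment:
    # Hypothetical minimal record; the real data carries more fields.
    comment_id: str
    thread_id: str
    body: str


def filter_german(comments, thread_scores, threshold=GERMAN_SCORE_THRESHOLD):
    """Keep comments whose thread has a language score >= threshold.

    Threads that are missing from the score table are dropped.
    """
    return [
        c for c in comments
        if thread_scores.get(c.thread_id, float("-inf")) >= threshold
    ]


# Made-up scores and comments for illustration only.
thread_scores = {"t3_aaa": 0.92, "t3_bbb": 0.04, "t3_ccc": 0.1}
comments = [
    Comment("c1", "t3_aaa", "Das ist ein Test."),
    Comment("c2", "t3_bbb", "This one is English."),
    Comment("c3", "t3_ccc", "Genau an der Schwelle."),
    Comment("c4", "t3_zzz", "Thread ohne Score."),
]

kept = filter_german(comments, thread_scores)
print([c.comment_id for c in kept])
```

Because the comparison is `>=`, a thread that scores exactly 0.1 is kept, which matches the threshold as stated.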

The corpus was processed with spaCy's de_core_news_lg (large German model) using the full pipeline: tokenization, POS tagging, lemmatization, dependency parsing, and named entity recognition.

Key Statistics

Metric	Value
Total Words	198 million
Documents	~6,200,000 comments
Subreddits	11
Time Period	2010-2018
Language	German
spaCy Model	de_core_news_lg

Subreddits Included

Subreddit	FAU Match Rate	Description
Austria	91.5%	Austrian community
rocketbeans	99.4%	German gaming/streaming
de_IAmA	99.3%	German Ask Me Anything
Finanzen	96.8%	German finance discussions
ich_iel	92.5%	German memes (ich_irl)
wien	52.6%	Vienna community
FragReddit	99.7%	German Q&A community
einfach_posten	99.3%	Casual German posting
de_EDV	99.5%	German IT/tech discussions
Fahrrad	99.4%	German cycling community
VeganDE	99.1%	German vegan community

Data Source

The corpus is derived from:

Arctic Shift API: Historical Reddit comments (2010-2018) via arctic-shift.photon-reddit.com
FAU GeRedE metadata: Thread language scores from the fau-klue/german-reddit-korpus repository (389K German threads identified)

Available Attributes

The corpus includes the following searchable attributes:

Token Attributes

Attribute	Description	Example
`word`	Exact word form	[word="der"]
`lemma`	Base form (lemma)	[lemma="sein"]
`tag`	POS tag (STTS - Stuttgart-Tübingen Tagset)	[tag="NN.*"]
`head_n`	Head position (1-based, 0 for root)	[head_n="0"]
`head`	Head lemma	[head="sein"]
`deprel`	Dependency relation	[deprel="sb"]
`ent_type`	Named entity type (PER, ORG, LOC, MISC)	[ent_type="PER"]
`ent_iob`	NER IOB tag (B=begin, I=inside, O=outside)	[ent_iob="B"]
`morph`	Morphological features (UD FEATS format)	[morph=".*Case=Nom.*"]

Note: All attributes are fully populated. The corpus was processed with spaCy’s de_core_news_lg model with full dependency parsing and named entity recognition.

Document Attributes

Attribute	Description	Example
`doc.id`	Document ID	Filter by specific post
`doc.subreddit`	Subreddit name	Filter: ich_iel, Finanzen, etc.
`doc.author`	Reddit username	Filter by author
`doc.date`	Publication date	Filter by date range
`doc.year`	Publication year	Filter by year

CQL Query Examples

Basic Word Search

Find all occurrences of a word:

[lemma="sein"]

Find exact word form:

[word="ist"]

Named Entity Recognition

German spaCy uses these NER labels: PER, ORG, LOC, MISC

Find all person names:

[ent_type="PER"]

Find all organizations:

[ent_type="ORG"]

Find all locations:

[ent_type="LOC"]

Find miscellaneous entities:

[ent_type="MISC"]
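Since `ent_type` and `ent_iob` are stored per token, a multi-word name such as "Angela Merkel" shows up as several matching tokens. The sketch below shows one way to regroup tokens into entity spans from the IOB tags. It assumes tokens are available as `(word, ent_iob, ent_type)` tuples, which is a hypothetical export format, and the sample sentence is hand-annotated for illustration.

```python
def entity_spans(tokens):
    """Group (word, ent_iob, ent_type) tuples into (ent_type, text) spans."""
    spans = []
    current_words, current_type = [], None

    def flush():
        # Close the entity currently being collected, if any.
        if current_words:
            spans.append((current_type, " ".join(current_words)))

    for word, iob, ent_type in tokens:
        starts_new = iob == "B" or (iob == "I" and current_type != ent_type)
        if starts_new:
            flush()
            current_words, current_type = [word], ent_type
        elif iob == "I":
            current_words.append(word)
        else:  # "O": token is outside any entity
            flush()
            current_words, current_type = [], None
    flush()
    return spans


# Hand-annotated example sentence (not taken from the corpus).
tokens = [
    ("Angela", "B", "PER"),
    ("Merkel", "I", "PER"),
    ("besucht", "O", ""),
    ("Wien", "B", "LOC"),
    (".", "O", ""),
]
print(entity_spans(tokens))
# [('PER', 'Angela Merkel'), ('LOC', 'Wien')]
```

To count entities rather than entity tokens directly in CQL, restricting a query to `ent_iob="B"` should have a similar effect, since each entity has exactly one beginning token.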

Part-of-Speech Queries

Find all common nouns (STTS tagset):

[tag="NN"]

Find all finite full verbs:

[tag="VVFIN"]

See the STTS tagset documentation for the full list of tags.

Morphological Features

Find nominative case:

[morph=".*Case=Nom.*"]

Find plural nouns:

[tag="NN" & morph=".*Number=Plur.*"]
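The `morph` attribute holds a whole feature string such as `Case=Nom|Gender=Masc|Number=Sing`, so a query for a single feature needs `.*` on both sides. The following sketch illustrates why, under the assumption (common in CQL engines, but not confirmed for this installation) that a regular expression must match the entire attribute value. The helper names are hypothetical.

```python
import re


def matches_attribute(pattern, value):
    """Mimic whole-value regex matching (assumed CQL behaviour)."""
    return re.fullmatch(pattern, value) is not None


def parse_feats(morph):
    """Split a UD FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if not morph or morph == "_":
        return {}
    return dict(item.split("=", 1) for item in morph.split("|"))


morph = "Case=Nom|Gender=Masc|Number=Sing"

print(matches_attribute("Case=Nom", morph))      # False: pattern must cover the whole value
print(matches_attribute(".*Case=Nom.*", morph))  # True
print(parse_feats(morph)["Number"])              # Sing
```

When working with exported results, parsing the feature string into a dictionary as in `parse_feats` is usually more robust than further regex matching.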

Dependency Relations

German spaCy uses TIGER/TüBa-D/Z dependency labels (not Universal Dependencies).

Find subjects:

[deprel="sb"]

Find accusative objects (direct objects):

[deprel="oa"]

Find dative objects (indirect objects):

[deprel="da"]

Find modifiers:

[deprel="mo"]

Find noun kernel elements (e.g., determiners and attributive adjectives attached to the head noun of an NP):

[deprel="nk"]
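The `head_n`, `head`, and `deprel` attributes together encode the dependency tree. As a rough sketch of how they relate, the code below resolves each token's head from `head_n` and collects subject-verb pairs. It assumes that `head_n` is a 1-based position within the sentence (with 0 marking the root) and that tokens are available as dictionaries. Both the data layout and the hand-annotated sentence are hypothetical.

```python
def head_of(sentence, index):
    """Return the head token of sentence[index], or None for the root.

    `index` is 0-based; `head_n` is 1-based with 0 marking the root.
    """
    head_n = sentence[index]["head_n"]
    return None if head_n == 0 else sentence[head_n - 1]


def subject_verb_pairs(sentence):
    """Collect (subject lemma, head lemma) for every token labelled 'sb'."""
    pairs = []
    for i, token in enumerate(sentence):
        head = head_of(sentence, i)
        if token["deprel"] == "sb" and head is not None:
            pairs.append((token["lemma"], head["lemma"]))
    return pairs


# Hand-annotated example: "Der Hund schläft"
sentence = [
    {"word": "Der", "lemma": "der", "deprel": "nk", "head_n": 2},
    {"word": "Hund", "lemma": "Hund", "deprel": "sb", "head_n": 3},
    {"word": "schläft", "lemma": "schlafen", "deprel": "ROOT", "head_n": 0},
]

print(subject_verb_pairs(sentence))
# [('Hund', 'schlafen')]
```

Because `head` already stores the head lemma on each token, the same pairs can be retrieved in a single query by combining `deprel` and `head` in one token expression, without resolving positions.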

Combined Queries

Find a finite verb followed by a noun, with at most three tokens in between:

[tag="VVFIN"] []{0,3} [tag="NN"]
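To make the gap semantics concrete, here is a small sketch that applies the same pattern to a list of POS tags. It enumerates every span that satisfies the pattern. An actual CQL engine may report only one match per start position depending on its matching strategy, so treat this as an illustration of `[]{0,3}` rather than a reimplementation of the engine.

```python
def find_matches(tags, first="VVFIN", last="NN", max_gap=3):
    """Return (start, end) index pairs for: first []{0,max_gap} last."""
    matches = []
    for start, tag in enumerate(tags):
        if tag != first:
            continue
        # A gap of 0..max_gap tokens puts `end` between start+1 and start+max_gap+1.
        for end in range(start + 1, min(start + max_gap + 2, len(tags))):
            if tags[end] == last:
                matches.append((start, end))
    return matches


# Tags for "Ich kaufe ein neues Fahrrad in der Stadt"
tags = ["PPER", "VVFIN", "ART", "ADJA", "NN", "APPR", "ART", "NN"]

print(find_matches(tags))
# [(1, 4)]
```

The second noun ("Stadt") is not matched because five tokens separate it from the verb, which exceeds the allowed gap of three.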

Filter by Subreddit

Search within a specific subreddit:

Use the metadata filter to select a subreddit (e.g., ich_iel)
Run: [lemma="deutsch"]

Programmatic Access (API)

Query the corpus programmatically via the REST API:

Example Request (curl)

curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"sein\"]",
    "corpus_id": "german_reddit",
    "limit": 50
  }'
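The same request can be issued from Python using only the standard library. The sketch below mirrors the curl call above (endpoint, headers, and the `cql`, `corpus_id`, and `limit` fields). The response format is not documented here, so `run_query` simply assumes the server returns JSON and hands it back unchanged. The function names are illustrative.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/cql/run"  # endpoint from the curl example


def build_request(cql, corpus_id="german_reddit", limit=50, url=API_URL):
    """Build the POST request without sending it."""
    payload = json.dumps(
        {"cql": cql, "corpus_id": corpus_id, "limit": limit}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(cql, **kwargs):
    """Send the query and return the decoded response.

    Assumption: the API answers with a JSON body; its schema is not
    specified in this document, so it is returned as-is.
    """
    with urllib.request.urlopen(build_request(cql, **kwargs), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Inspect the request locally; call run_query(...) once the server is running.
request = build_request('[lemma="sein"]')
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Using `json.dumps` for the body avoids the manual quote escaping that the shell version needs for CQL strings.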

Quick Reference

Goal	CQL Query
Exact word	[word="der"]
All forms (lemma)	[lemma="sein"]
Adjacent words	[lemma="ich"] [lemma="bin"]
Person names	[ent_type="PER"]
Organizations	[ent_type="ORG"]
Locations	[ent_type="LOC"]
Nouns	[tag="NN"]
Verbs (finite)	[tag="VVFIN"]
Adjectives	[tag="ADJA"] or [tag="ADJD"]
Subjects	[deprel="sb"]
Direct objects	[deprel="oa"]
Nominative case	[morph=".*Case=Nom.*"]

References

Original Paper: Blombach, Andreas, Natalie Dykes, Philipp Heinrich, Besim Kabashi, and Thomas Proisl (2020). “A Corpus of German Reddit Exchanges (GeRedE).” Proceedings of LREC 2020. ACL Anthology
FAU Repository: fau-klue/german-reddit-korpus
STTS Tagset: Stuttgart-Tübingen Tagset Documentation