German Reddit Corpus

GeRedE — German Reddit (2010-2018)

Note: Current Status

FAU-filtered corpus: 11 subreddits with 198M words, 6.2M documents. Full dependency parsing and NER are enabled.

Corpus Overview

GeRedE (the German Reddit Corpus) contains German-language posts and comments from Reddit. It gives access to contemporary, informal written German as used in online communities.

This corpus is based on the GeRedE project by FAU Erlangen-Nürnberg, which identified German-language threads from Reddit archives. Comments are filtered using FAU’s pre-computed thread language scores (threshold: score >= 0.1) to ensure German-only content.
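
The thresholding step can be pictured as follows. This is a minimal sketch, not the project's actual filtering code: it assumes each comment carries a thread ID and that FAU's scores are available as a mapping from thread ID to score. The `thread_id` and `body` field names and the sample data are hypothetical.

```python
from typing import Iterable, Iterator, Mapping

# Threshold taken from the description above (score >= 0.1).
MIN_THREAD_SCORE = 0.1


def keep_german_comments(
    comments: Iterable[dict],
    thread_scores: Mapping[str, float],
    threshold: float = MIN_THREAD_SCORE,
) -> Iterator[dict]:
    """Yield comments whose thread score is at or above the threshold.

    The "thread_id" key and the shape of `thread_scores` are assumptions
    made for illustration; threads without a score are dropped.
    """
    for comment in comments:
        score = thread_scores.get(comment["thread_id"])
        if score is not None and score >= threshold:
            yield comment


# Hypothetical sample data.
scores = {"t1": 0.93, "t2": 0.04, "t3": 0.10}
comments = [
    {"thread_id": "t1", "body": "Das ist ein Test."},
    {"thread_id": "t2", "body": "This one is English."},
    {"thread_id": "t3", "body": "Grenzfall."},
    {"thread_id": "t4", "body": "Kein Score vorhanden."},
]
kept = list(keep_german_comments(comments, scores))
print([c["thread_id"] for c in kept])
```

Because the score is computed per thread, a single comment in another language can still pass the filter if its thread as a whole scores high enough.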

The corpus is processed with spaCy's de_core_news_lg (the large German model), running the full pipeline: tokenization, POS tagging, lemmatization, dependency parsing, and named entity recognition.
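
To make the link between the spaCy pipeline and the corpus attributes concrete, the sketch below maps one spaCy token to the attribute names used in this corpus. It is an illustration, not the actual indexing code: the helper names are hypothetical, and numbering `head_n` relative to the sentence start is an assumption.

```python
def token_to_row(tok, sent_start: int) -> dict:
    """Map one spaCy token to the attribute names used in this corpus.

    `sent_start` is the index of the first token of the sentence; counting
    head_n from the sentence start is an assumption.
    """
    is_root = tok.head.i == tok.i  # spaCy marks the root as its own head
    return {
        "word": tok.text,
        "lemma": tok.lemma_,
        "tag": tok.tag_,  # fine-grained tag (STTS for the German models)
        "head_n": 0 if is_root else tok.head.i - sent_start + 1,
        "head": tok.head.lemma_,
        "deprel": tok.dep_,
        "ent_type": tok.ent_type_,
        "ent_iob": tok.ent_iob_,
        "morph": str(tok.morph),  # e.g. "Case=Nom|Number=Sing"
    }


def annotate(text: str) -> list[dict]:
    """Run the full pipeline; needs spaCy and de_core_news_lg installed."""
    import spacy  # imported lazily so token_to_row works without spaCy

    nlp = spacy.load("de_core_news_lg")
    doc = nlp(text)
    return [token_to_row(t, sent.start) for sent in doc.sents for t in sent]
```

Calling `annotate("Ich bin in Wien.")` would return one dictionary per token, provided the model has been installed (for example with `python -m spacy download de_core_news_lg`).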

Key Statistics

| Metric | Value |
| --- | --- |
| Total Words | 198 million |
| Documents | ~6,200,000 comments |
| Subreddits | 11 |
| Time Period | 2010-2018 |
| Language | German |
| spaCy Model | de_core_news_lg |

Subreddits Included

| Subreddit | FAU Match Rate | Description |
| --- | --- | --- |
| Austria | 91.5% | Austrian community |
| rocketbeans | 99.4% | German gaming/streaming |
| de_IAmA | 99.3% | German Ask Me Anything |
| Finanzen | 96.8% | German finance discussions |
| ich_iel | 92.5% | German memes (ich_irl) |
| wien | 52.6% | Vienna community |
| FragReddit | 99.7% | German Q&A community |
| einfach_posten | 99.3% | Casual German posting |
| de_EDV | 99.5% | German IT/tech discussions |
| Fahrrad | 99.4% | German cycling community |
| VeganDE | 99.1% | German vegan community |

Data Source

The corpus is derived from the German-language Reddit threads identified by the GeRedE project at FAU Erlangen-Nürnberg.

Available Attributes

The corpus includes the following searchable attributes:

Token Attributes

| Attribute | Description | Example |
| --- | --- | --- |
| word | Exact word form | [word="der"] |
| lemma | Base form (lemma) | [lemma="sein"] |
| tag | POS tag (STTS - Stuttgart-Tübingen Tagset) | [tag="NN.*"] |
| head_n | Head position (1-based, 0 for root) | [head_n="0"] |
| head | Head lemma | [head="sein"] |
| deprel | Dependency relation | [deprel="sb"] |
| ent_type | Named entity type (PER, ORG, LOC, MISC) | [ent_type="PER"] |
| ent_iob | NER IOB tag (B=begin, I=inside, O=outside) | [ent_iob="B"] |
| morph | Morphological features (UD FEATS format) | [morph=".*Case=Nom.*"] |

Note: All attributes are fully populated. The corpus was processed with spaCy’s de_core_news_lg model with full dependency parsing and named entity recognition.
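
When queries are generated by a script, a small helper can assemble token expressions from these attributes. The following is a minimal sketch; `cql_token` and `cql_sequence` are hypothetical helpers, not part of any corpus API. Rejecting double quotes instead of escaping them is a simplification, because the engine's escaping rules are not documented here.

```python
def cql_token(**constraints: str) -> str:
    """Build one CQL token expression, e.g. [lemma="sein" & tag="VAFIN"].

    Keyword names are the token attributes listed above. Values are passed
    through unchanged, so regular expressions such as "NN.*" work as-is.
    """
    if not constraints:
        return "[]"  # an empty expression matches any token
    parts = []
    for attr, value in constraints.items():
        if '"' in value:
            # Escaping rules are engine-specific, so refuse instead of guessing.
            raise ValueError(f"double quote not supported in value: {value!r}")
        parts.append(f'{attr}="{value}"')
    return "[" + " & ".join(parts) + "]"


def cql_sequence(*tokens: str) -> str:
    """Join token expressions into a query over adjacent tokens."""
    return " ".join(tokens)


query = cql_sequence(cql_token(lemma="ich"), cql_token(lemma="sein", tag="VAFIN"))
print(query)
```

The printed query combines two constraints on the second token with `&`, the same operator used in the morphology examples.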

Document Attributes

| Attribute | Description | Example |
| --- | --- | --- |
| doc.id | Document ID | Filter by specific post |
| doc.subreddit | Subreddit name | Filter: Finanzen, ich_iel, etc. |
| doc.author | Reddit username | Filter by author |
| doc.date | Publication date | Filter by date range |
| doc.year | Publication year | Filter by year |

CQL Query Examples

Named Entity Recognition

The German spaCy models use these NER labels: PER, ORG, LOC, MISC.

Find all person names:

[ent_type="PER"]

Find all organizations:

[ent_type="ORG"]

Find all locations:

[ent_type="LOC"]

Find miscellaneous entities:

[ent_type="MISC"]
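
Because ent_type is a token-level attribute, a query such as [ent_type="PER"] matches each token of a multi-word name separately. The ent_iob attribute makes it possible to group tokens back into whole entities. The sketch below shows that grouping on exported results; the input format (word, ent_type, ent_iob triples) and the sample sentence are hypothetical.

```python
def entity_spans(tokens):
    """Group (word, ent_type, ent_iob) triples into (text, ent_type) spans."""
    spans, current, current_type = [], [], None
    for word, ent_type, ent_iob in tokens:
        if ent_iob == "B" or (ent_iob == "I" and not current):
            # A new entity starts; close the previous one if it is still open.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [word], ent_type
        elif ent_iob == "I":
            current.append(word)  # continue the open entity
        else:  # "O": outside any entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


# Hypothetical annotated sentence.
sentence = [
    ("Angela", "PER", "B"), ("Merkel", "PER", "I"),
    ("besucht", "", "O"),
    ("Wien", "LOC", "B"),
    (".", "", "O"),
]
print(entity_spans(sentence))
```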

Part-of-Speech Queries

Find common nouns (STTS tag NN):

[tag="NN"]

See the STTS tagset documentation for the full list of tags.

Find finite full verbs:

[tag="VVFIN"]

Morphological Features

Find nominative case:

[morph=".*Case=Nom.*"]

Find plural nouns:

[tag="NN" & morph=".*Number=Plur.*"]
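
A morph value is a single string of pipe-separated features (UD FEATS format), so a pattern has to allow for other features on either side of the one being searched. The sketch below illustrates this with Python's `re` module. It assumes that the query engine matches a regular expression against the entire attribute value, which is common for CQL engines but not confirmed here; the sample value is hypothetical.

```python
import re


def has_feature(morph: str, feature: str) -> bool:
    """Check a UD FEATS string such as "Case=Nom|Number=Sing" for one feature."""
    return feature in morph.split("|")


def matches_whole_value(pattern: str, morph: str) -> bool:
    """Emulate a query regex that must match the entire attribute value.

    Whole-value matching is an assumption about the query engine.
    """
    return re.fullmatch(pattern, morph) is not None


morph = "Case=Nom|Gender=Masc|Number=Sing"  # hypothetical token value
print(has_feature(morph, "Case=Nom"))
print(matches_whole_value("Case=Nom", morph))      # other features follow
print(matches_whole_value(".*Case=Nom.*", morph))  # wildcards on both sides
```

Under that assumption, only the pattern with `.*` on both sides finds the feature regardless of its position in the string.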

Dependency Relations

German spaCy uses TIGER/TüBa-D/Z dependency labels (not Universal Dependencies).

Find subjects:

[deprel="sb"]

Find accusative objects (direct objects):

[deprel="oa"]

Find dative objects (indirect objects):

[deprel="da"]

Find modifiers:

[deprel="mo"]

Find noun kernel elements (tokens inside a noun phrase, such as determiners and attributive adjectives attached to their noun):

[deprel="nk"]
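
The deprel and head_n attributes together describe the parse tree, so exported results can be used to recover pairs such as subject and governing verb. The following is a rough sketch under two assumptions: results are available as one list of dictionaries per sentence, and head_n counts from the start of the sentence. The sample parse is hypothetical.

```python
def subject_verb_pairs(sentence):
    """Return (subject_word, head_word) pairs for tokens with deprel "sb".

    `sentence` is a list of dicts with "word", "deprel", and "head_n".
    head_n is 1-based and 0 for the root; treating it as sentence-relative
    is an assumption.
    """
    pairs = []
    for token in sentence:
        if token["deprel"] == "sb" and token["head_n"] > 0:
            head = sentence[token["head_n"] - 1]  # convert to a 0-based index
            pairs.append((token["word"], head["word"]))
    return pairs


# Hypothetical parse of "Der Hund jagt die Katze".
parsed = [
    {"word": "Der", "deprel": "nk", "head_n": 2},
    {"word": "Hund", "deprel": "sb", "head_n": 3},
    {"word": "jagt", "deprel": "ROOT", "head_n": 0},
    {"word": "die", "deprel": "nk", "head_n": 5},
    {"word": "Katze", "deprel": "oa", "head_n": 3},
]
print(subject_verb_pairs(parsed))
```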

Combined Queries

Find a finite full verb followed by a common noun, with at most three tokens in between:

[tag="VVFIN"] []{0,3} [tag="NN"]
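
The empty brackets `[]` match any token, and `{0,3}` repeats them zero to three times. To show what the query accepts, the sketch below applies the same logic to a plain list of STTS tags. It is only an emulation for illustration; how the real engine reports overlapping matches may differ, and the tag sequence is hypothetical.

```python
def verb_noun_matches(tags, max_gap=3):
    """Find (verb_index, noun_index) pairs with at most `max_gap` tokens between.

    Mirrors the sequence [tag="VVFIN"] []{0,3} [tag="NN"] over a list of tags.
    """
    matches = []
    for i, tag in enumerate(tags):
        if tag != "VVFIN":
            continue
        # The noun may sit 1 to max_gap + 1 positions after the verb.
        for j in range(i + 1, min(i + max_gap + 2, len(tags))):
            if tags[j] == "NN":
                matches.append((i, j))
    return matches


# Hypothetical tag sequence for "Ich kaufe morgen ein neues Fahrrad ."
tags = ["PPER", "VVFIN", "ADV", "ART", "ADJA", "NN", "$."]
print(verb_noun_matches(tags))
print(verb_noun_matches(tags, max_gap=2))
```

With three tokens between verb and noun, the sentence matches at the default gap but no longer matches when the gap is limited to two.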

Filter by Subreddit

Search within a specific subreddit:

  1. Use the metadata filter to select a subreddit (e.g., Finanzen)
  2. Run: [lemma="deutsch"]

Programmatic Access (API)

Query the corpus programmatically via the REST API:

Example Request (curl)

curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"sein\"]",
    "corpus_id": "german_reddit",
    "limit": 50
  }'
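
The same request can be sent from Python using only the standard library. This is a sketch based on the curl example above: the URL and payload fields are taken from it, while the helper names are hypothetical. That the server replies with JSON is an assumption, and since the response schema is not described here, the result is returned without further processing.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/cql/run"  # endpoint from the curl example


def build_request(cql: str, corpus_id: str = "german_reddit", limit: int = 50):
    """Build the POST request that the curl example sends."""
    payload = {"cql": cql, "corpus_id": corpus_id, "limit": limit}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(cql: str, **kwargs):
    """Send the query and decode the response body.

    A JSON response is assumed; its structure is not documented here.
    """
    with urllib.request.urlopen(build_request(cql, **kwargs)) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not contact the server.
request = build_request('[lemma="sein"]')
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Using `json.dumps` for the payload avoids the manual quote escaping needed inside the shell string. Calling `run_query('[lemma="sein"]')` requires the API server to be running locally.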

Quick Reference

| Goal | CQL Query |
| --- | --- |
| Exact word | [word="der"] |
| All forms (lemma) | [lemma="sein"] |
| Adjacent words | [lemma="ich"] [lemma="bin"] |
| Person names | [ent_type="PER"] |
| Organizations | [ent_type="ORG"] |
| Locations | [ent_type="LOC"] |
| Nouns (common) | [tag="NN"] |
| Verbs (finite) | [tag="VVFIN"] |
| Adjectives | [tag="ADJA"] or [tag="ADJD"] |
| Subjects | [deprel="sb"] |
| Direct objects | [deprel="oa"] |
| Nominative case | [morph=".*Case=Nom.*"] |
