German Reddit Corpus
GeRedE — German Reddit (2010-2018)
FAU-filtered corpus: 11 subreddits with 198M words, 6.2M documents. Full dependency parsing and NER are enabled.
Corpus Overview
The GeRedE (German Reddit Corpus) contains German-language posts and comments from Reddit, providing access to contemporary informal written German discourse from online communities.
This corpus is based on the GeRedE project by FAU Erlangen-Nürnberg, which identified German-language threads from Reddit archives. Comments are filtered using FAU’s pre-computed thread language scores (threshold: score >= 0.1) to ensure German-only content.
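The thread-level filter described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Comment` fields, the `thread_scores` mapping, and the example scores are hypothetical and not the project's actual schema. Only the threshold of 0.1 comes from the corpus description.

```python
from dataclasses import dataclass

GERMAN_SCORE_THRESHOLD = 0.1  # threshold stated in the corpus description


@dataclass
class Comment:
    # Hypothetical field names, chosen for illustration only.
    comment_id: str
    thread_id: str
    body: str


def filter_german(comments, thread_scores, threshold=GERMAN_SCORE_THRESHOLD):
    """Keep comments whose thread has a language score >= threshold.

    Threads missing from thread_scores get a score of 0.0 and are dropped
    (an assumption of this sketch).
    """
    return [
        c for c in comments
        if thread_scores.get(c.thread_id, 0.0) >= threshold
    ]


comments = [
    Comment("c1", "t1", "Servus, wie geht's?"),
    Comment("c2", "t2", "Hello there"),
    Comment("c3", "t3", "Unbekannter Thread"),
]
thread_scores = {"t1": 0.93, "t2": 0.02}  # made-up scores

kept = filter_german(comments, thread_scores)
print([c.comment_id for c in kept])
```

Because the score belongs to the thread rather than the comment, every comment in a thread is kept or dropped together.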
The corpus is processed with spaCy's de_core_news_lg (large German model), running the full pipeline: tokenization, POS tagging, lemmatization, dependency parsing, and named entity recognition.
Key Statistics
| Metric | Value |
|---|---|
| Total Words | 198 million |
| Documents | ~6,200,000 comments |
| Subreddits | 11 |
| Time Period | 2010-2018 |
| Language | German |
| spaCy Model | de_core_news_lg |
Subreddits Included
| Subreddit | FAU Match Rate | Description |
|---|---|---|
| Austria | 91.5% | Austrian community |
| rocketbeans | 99.4% | German gaming/streaming |
| de_IAmA | 99.3% | German Ask Me Anything |
| Finanzen | 96.8% | German finance discussions |
| ich_iel | 92.5% | German memes (ich_irl) |
| wien | 52.6% | Vienna community |
| FragReddit | 99.7% | German Q&A community |
| einfach_posten | 99.3% | Casual German posting |
| de_EDV | 99.5% | German IT/tech discussions |
| Fahrrad | 99.4% | German cycling community |
| VeganDE | 99.1% | German vegan community |
Data Source
The corpus is derived from:
- Arctic Shift API: Historical Reddit comments (2010-2018) via arctic-shift.photon-reddit.com
- FAU GeRedE metadata: Thread language scores from the fau-klue/german-reddit-korpus repository (389K German threads identified)
Available Attributes
The corpus includes the following searchable attributes:
Token Attributes
| Attribute | Description | Example |
|---|---|---|
| word | Exact word form | [word="der"] |
| lemma | Base form (lemma) | [lemma="sein"] |
| tag | POS tag (STTS - Stuttgart-Tübingen Tagset) | [tag="NN.*"] |
| head_n | Head position (1-based, 0 for root) | Dependency parsing |
| head | Head lemma | Dependency parsing |
| deprel | Dependency relation | [deprel="sb"] |
| ent_type | Named entity type (PER, ORG, LOC, MISC) | [ent_type="PER"] |
| ent_iob | NER IOB tag (B=begin, I=inside, O=outside) | [ent_iob="B"] |
| morph | Morphological features (UD FEATS format) | [morph=".*Case=Nom.*"] |
Note: All attributes are fully populated. The corpus was processed with spaCy’s de_core_news_lg model with full dependency parsing and named entity recognition.
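To illustrate how head_n and head relate, here is a minimal sketch that resolves a token's head from its 1-based head position. It assumes that head_n counts positions within the sentence and that tokens are available as plain dictionaries. Both are assumptions for illustration, and the toy sentence is annotated by hand rather than taken from the corpus.

```python
def head_of(sentence, index):
    """Return the head token of sentence[index], or None for the root.

    head_n is treated as a 1-based position within the sentence,
    with 0 marking the root (assumption of this sketch).
    """
    head_n = sentence[index]["head_n"]
    if head_n == 0:
        return None
    return sentence[head_n - 1]


# "Ich lese Bücher" with hand-written annotations
sentence = [
    {"word": "Ich", "lemma": "ich", "deprel": "sb", "head_n": 2},
    {"word": "lese", "lemma": "lesen", "deprel": "ROOT", "head_n": 0},
    {"word": "Bücher", "lemma": "Buch", "deprel": "oa", "head_n": 2},
]

for i, tok in enumerate(sentence):
    head = head_of(sentence, i)
    # The head attribute would hold this lemma; the root has no head.
    print(tok["word"], tok["deprel"], head["lemma"] if head else None)
```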
Document Attributes
| Attribute | Description | Example |
|---|---|---|
| doc.id | Document ID | Filter by specific post |
| doc.subreddit | Subreddit name | Filter: Austria, ich_iel, etc. |
| doc.author | Reddit username | Filter by author |
| doc.date | Publication date | Filter by date range |
| doc.year | Publication year | Filter by year |
CQL Query Examples
Basic Word Search
Find all forms of a lemma: [lemma="sein"]
Find an exact word form: [word="der"]
Named Entity Recognition
The German spaCy model uses these NER labels: PER, ORG, LOC, MISC.
Find all person names: [ent_type="PER"]
Find all organizations: [ent_type="ORG"]
Find all locations: [ent_type="LOC"]
Find miscellaneous entities: [ent_type="MISC"]
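A query on ent_type matches individual tokens, so a multi-word name yields one hit per token. The sketch below shows how the ent_iob and ent_type attributes can be combined to rebuild whole entity spans. The input format (a list of word, ent_iob, ent_type triples) is an assumption for illustration, and the example tokens are hand-annotated.

```python
def entity_spans(tokens):
    """Group (word, ent_iob, ent_type) triples into (text, ent_type) spans."""
    spans = []
    current_words, current_type = [], None
    for word, iob, ent_type in tokens:
        if iob == "B" or (iob == "I" and current_type != ent_type):
            # Start a new entity; a stray "I" is treated like "B".
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [word], ent_type
        elif iob == "I":
            # Continue the entity that is currently open.
            current_words.append(word)
        else:
            # "O": close any open entity.
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans


tokens = [
    ("Angela", "B", "PER"),
    ("Merkel", "I", "PER"),
    ("besucht", "O", ""),
    ("Wien", "B", "LOC"),
    (".", "O", ""),
]
print(entity_spans(tokens))
```

To count entities rather than tokens in a query, restricting to the first token of each span with [ent_iob="B"] is a common approach.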
Part-of-Speech Queries
Find all common nouns (STTS tagset; proper nouns are tagged NE): [tag="NN"]
Find all verbs: [tag="V.*"]
See the STTS tagset documentation for the full list.
Morphological Features
Find nominative case: [morph=".*Case=Nom.*"]
Find plural nouns: [tag="NN" & morph=".*Number=Plur.*"]
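Morphology queries wrap the feature in `.*` on both sides because the morph value is a full UD FEATS string such as `Case=Nom|Gender=Masc|Number=Sing`, and CQL engines typically match the regular expression against the whole attribute value. The sketch below imitates that behaviour with Python's `re.fullmatch`. Whole-value matching is an assumption about the query engine, and the FEATS string is a made-up example.

```python
import re


def cql_value_matches(pattern, value):
    """Imitate whole-value regex matching, as assumed for CQL attribute tests."""
    return re.fullmatch(pattern, value) is not None


feats = "Case=Nom|Gender=Masc|Number=Sing"  # illustrative UD FEATS value

print(cql_value_matches("Case=Nom", feats))         # bare feature, no wildcards
print(cql_value_matches(".*Case=Nom.*", feats))     # feature anywhere in the value
print(cql_value_matches(".*Number=Plur.*", feats))  # feature not present
```

Under this assumption the bare pattern fails, because it would have to cover the entire FEATS string, while the wildcard form succeeds.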
Dependency Relations
German spaCy uses TIGER/TüBa-D/Z dependency labels (not Universal Dependencies).
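Find subjects: [deprel="sb"]
Find accusative objects (direct objects): [deprel="oa"]
Find dative objects (indirect objects): [deprel="da"]
Find modifiers: [deprel="mo"]
Find noun kernel elements (determiners, adjectives, and nouns inside an NP): [deprel="nk"]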
Combined Queries
Find verbs followed by a noun, with at most three tokens in between: [tag="V.*"] []{0,3} [tag="NN"]
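Combined queries like this one can also be assembled programmatically. The helpers below are hypothetical (they are not part of any corpus tooling) and assume the engine accepts `&` inside a token and `[]{m,n}` for a gap of arbitrary tokens, which is common CQL syntax. Values are inserted as-is, so double quotes inside a value would need escaping first.

```python
def token(**attrs):
    """Build one CQL token, e.g. token(lemma="sein") -> [lemma="sein"]."""
    if not attrs:
        return "[]"  # matches any token
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conditions}]"


def gap(max_tokens, min_tokens=0):
    """Build a gap of min_tokens to max_tokens arbitrary tokens."""
    return f"[]{{{min_tokens},{max_tokens}}}"


def sequence(*parts):
    """Join token expressions into one query, separated by spaces."""
    return " ".join(parts)


# A verb, then at most three arbitrary tokens, then a common noun.
query = sequence(token(tag="V.*"), gap(3), token(tag="NN"))
print(query)

# Several conditions on the same token are joined with "&".
print(token(lemma="sein", tag="VAFIN"))
```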
Filter by Subreddit
Search within a specific subreddit:
- Use the metadata filter to select a subreddit (e.g., Finanzen)
- Run: [lemma="deutsch"]
Programmatic Access (API)
Query the corpus programmatically via the REST API:
Example Request (curl)
curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"sein\"]",
    "corpus_id": "german_reddit",
    "limit": 50
  }'
Quick Reference
| Goal | CQL Query |
|---|---|
| Exact word | [word="der"] |
| All forms (lemma) | [lemma="sein"] |
| Adjacent words | [lemma="ich"] [lemma="bin"] |
| Person names | [ent_type="PER"] |
| Organizations | [ent_type="ORG"] |
| Locations | [ent_type="LOC"] |
| Nouns | [tag="NN"] |
| Verbs (finite) | [tag="VVFIN"] |
| Adjectives | [tag="ADJA"] or [tag="ADJD"] |
| Subjects | [deprel="sb"] |
| Direct objects | [deprel="oa"] |
| Nominative case | [morph=".*Case=Nom.*"] |
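Any query from the table above can be sent to the REST endpoint shown in the curl example. The following is a minimal Python sketch using only the standard library. The URL, corpus_id, and limit values are taken from that example. The helper names are hypothetical, and the response is assumed to be JSON, since its schema is not documented here.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/cql/run"  # endpoint from the curl example


def build_request(cql, corpus_id="german_reddit", limit=50):
    """Build the POST request without sending it."""
    payload = {"cql": cql, "corpus_id": corpus_id, "limit": limit}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(cql, **kwargs):
    """Send the query and return the decoded JSON response (assumed format)."""
    with urllib.request.urlopen(build_request(cql, **kwargs), timeout=30) as response:
        return json.load(response)


request = build_request('[lemma="sein"]')
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `run_query('[lemma="sein"]')` requires the API server to be running locally. Letting `json.dumps` serialize the payload avoids the manual quote escaping needed in the shell version.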
References
- Original Paper: Blombach, Andreas, et al. (2020) “A Corpus of German Reddit Exchanges (GeRedE).” Proceedings of LREC 2020. ACL Anthology
- FAU Repository: fau-klue/german-reddit-korpus
- STTS Tagset: Stuttgart-Tübingen Tagset Documentation