German Reddit (deRed)
German Reddit Corpus — Dialect, Regional & Meme Research (2018-2025)
14 subreddits with 51M words and 1.9M documents. Features GlotLID dialect detection (Bavarian, Alemannic, Swiss German), regional author flairs for location research, and dedicated meme communities (ich_iel, aeiou) for internet language studies.
Corpus Overview
deRed (German Reddit) is a corpus of German-language Reddit posts and comments designed for dialect research, regional variation studies, and meme/internet language analysis. It features automatic dialect detection using GlotLID v3 from LMU Munich, enabling studies of:
- Dialect variation: Bavarian, Alemannic, Swiss German, Low German
- Regional identity: Author flairs indicating Austrian states, German regions
- Meme language: ich_iel and aeiou communities with image metadata
- Register variation: From formal Q&A to creative internet discourse
- Temporal patterns: Full timeline 2018-2025
Key Statistics
| Metric | Value |
|---|---|
| Total Words | ~51,000,000 |
| Documents | 1,914,012 |
| Comments | 1,779,884 (93%) |
| Submissions | 134,128 (7%) |
| Subreddits | 14 |
| Time Period | 2018-2025 |
| Language | German |
| spaCy Model | de_core_news_lg |
| Dialect Detection | GlotLID v3 |
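The comment and submission shares follow directly from the document counts. A quick consistency check in Python, using only the figures from the table above (the variable names are illustrative):

```python
# Document counts taken from the Key Statistics table.
comments = 1_779_884
submissions = 134_128
documents = 1_914_012

# The two document types should add up to the corpus total.
assert comments + submissions == documents

# Shares rounded to whole percentages, as in the table.
comment_share = round(100 * comments / documents)
submission_share = round(100 * submissions / documents)
print(f"comments: {comment_share}%, submissions: {submission_share}%")
```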
Corpus Size Roadmap
| Size | Subreddits | Tokens | Status |
|---|---|---|---|
| nano | 5 | ~2M | Test corpus |
| mini | 14 | 61M | Available |
| medium | 14+ | ~400M | Planned |
| large | 14+ | ~2B | Planned |
Subreddits Included
Austrian Communities
| Subreddit | Size | Description |
|---|---|---|
| Austria | 7.7M tokens | Main Austrian subreddit. Politics, culture |
| aeiou | 2.2M tokens | Austrian meme community. Dialect-heavy humor |
| wien | 40K docs | Vienna community. Urban Austrian German |
| graz | 29K docs | Graz/Styria. Styrian dialect influence |
German Communities
| Subreddit | Size | Description |
|---|---|---|
| de | 27K docs | Main German subreddit. News, politics |
| ich_iel | 33K docs | German memes (“ich im echten Leben”) |
| FragReddit | 30K docs | German Q&A. Conversational register |
| de_IAmA | 34K docs | German AMA. Interview format |
| de_EDV | 25K docs | IT/tech community. Technical register |
| Finanzen | 17K docs | Finance community. Formal register |
Bavarian Communities
| Subreddit | Size | Description |
|---|---|---|
| Munich | 10K docs | Munich/Upper Bavaria. Urban Bavarian |
| bavaria | 24K docs | Bavaria state. Regional Bavarian identity |
| fcbayern | 22K docs | FC Bayern Munich fans. Sports register |
| 1860Munich | 58K | TSV 1860 Munich fans. Local sports community |
Dialect Detection (GlotLID)
The corpus uses GlotLID v3 from LMU Munich for dialect-aware language detection, enabling comparative studies of German varieties.
Dialect Distribution
| Language Group | Tokens | Percentage | Description |
|---|---|---|---|
| de_standard | 7,754,498 | 77.5% | Standard High German |
| en | 1,546,335 | 15.5% | English (code-switching) |
| de_bavarian | 345,018 | 3.4% | Bavarian/Austrian dialect |
| low_conf | 215,644 | 2.2% | Low confidence detection |
| other | 109,931 | 1.1% | Other languages |
| de_alemannic | 25,021 | 0.25% | Alemannic/Swiss German |
| de_low | 2,001 | 0.02% | Low German (Plattdeutsch) |
| de_related | 1,552 | 0.01% | Related varieties |
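The percentages can be recomputed from the token counts. The counts in the table sum to exactly 10,000,000 tokens, so the shares are relative to that total rather than to the full corpus size. A small Python sketch with illustrative variable names; recomputed values can differ from the table in the last digit, depending on whether figures were rounded or truncated:

```python
# Token counts per language group, copied from the Dialect Distribution table.
tokens = {
    "de_standard": 7_754_498,
    "en": 1_546_335,
    "de_bavarian": 345_018,
    "low_conf": 215_644,
    "other": 109_931,
    "de_alemannic": 25_021,
    "de_low": 2_001,
    "de_related": 1_552,
}

total = sum(tokens.values())

# Share of each group as a percentage of all labelled tokens.
shares = {group: 100 * count / total for group, count in tokens.items()}

# Print groups from largest to smallest share.
for group, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{group:<13} {tokens[group]:>10,} {share:6.2f}%")
```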
Dialect Queries
Filter by detected dialect:
Compare standard vs. dialectal negation:
vs.
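One way to assemble such queries is sketched below in Python. It assumes the `within <doc .../>` filter shown in the Quick Reference. The helper `cql_within` is hypothetical, and pairing Bavarian "ned" with standard "nicht" follows the lexical variation example under Research Applications.

```python
def cql_within(token_pattern: str, attr: str, value: str) -> str:
    """Restrict a CQL token pattern to documents where attr equals value."""
    return f'{token_pattern} within <doc {attr}="{value}"/>'


# Any token in documents labelled as Bavarian.
bavarian_tokens = cql_within("[]", "lang_group", "de_bavarian")

# Dialectal vs. standard negation, each restricted to its own language group.
dialect_negation = cql_within('[word="ned"]', "lang_group", "de_bavarian")
standard_negation = cql_within('[word="nicht"]', "lang_group", "de_standard")

for query in (bavarian_tokens, dialect_negation, standard_negation):
    print(query)
```

Comparing the hit counts of the last two queries, normalized by the size of each language group, would give a first impression of how strongly "ned" marks dialectal text.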
Available Attributes
Token Attributes
| Attribute | Description | Example |
|---|---|---|
| word | Exact word form | [word="Grüß"] |
| lemma | Base form (lemma) | [lemma="gehen"] |
| tag | POS tag (STTS) | [tag="VVFIN"] |
| head_n | Head position (dependency) | Dependency parsing |
| head | Head lemma | Dependency parsing |
| deprel | Dependency relation | [deprel="sb"] |
| ent_type | Named entity type | [ent_type="PER"] |
| ent_iob | NER IOB tag | [ent_iob="B"] |
| morph | Morphological features | [morph=".*Number=Plur.*"] |
Document Attributes
| Attribute | Description | Research Use |
|---|---|---|
| doc.subreddit | Subreddit name | Community filtering |
| doc.author | Reddit username | Individual variation |
| doc.author_flair | User’s location flair | Regional variation research |
| doc.post_flair | Post category flair | Topic filtering |
| doc.date | Publication date (ISO) | Temporal filtering |
| doc.year | Publication year | Diachronic analysis |
| doc.doc_type | comment or submission | Content type filtering |
| doc.lang_group | Dialect grouping | Dialect research |
| doc.lang_conf | Detection confidence | Quality filtering |
| doc.is_image | Image post flag | Meme research |
| doc.title | Submission title | Topic analysis |
| doc.score | Vote score | Engagement filtering |
Linguistic Annotations
The corpus includes full spaCy annotations with the German STTS tagset.
Annotation Statistics
| Feature | Count | Description |
|---|---|---|
| Subject deps | 4,573,404 | [deprel="sb"] subjects |
| Person entities | 737,575 | [ent_type="PER"] names |
| Location entities | ~200K | [ent_type="LOC"] places |
| Org entities | ~150K | [ent_type="ORG"] organizations |
Dependency Queries
Find subjects:
Find objects:
Named Entity Queries
Find person names:
Find locations:
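The dependency and entity queries all share the shape [attribute="value"]. The Python sketch below builds them with a hypothetical helper, `token_query`. The values `sb`, `PER`, and `LOC` come from this document; the object labels `oa` (accusative object) and `da` (dative object) are an assumption based on the TIGER scheme and should be checked against the corpus before use.

```python
def token_query(attr: str, *values: str) -> str:
    """Build a single-token CQL pattern; several values become a regex alternation."""
    alternation = "|".join(values)
    return f'[{attr}="{alternation}"]'


queries = {
    "subjects": token_query("deprel", "sb"),
    "objects": token_query("deprel", "oa", "da"),  # assumed TIGER object labels
    "persons": token_query("ent_type", "PER"),
    "locations": token_query("ent_type", "LOC"),
}

for goal, query in queries.items():
    print(f"{goal}: {query}")
```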
Research Applications
Dialect Research Questions
- Lexical variation: Bavarian “ned” vs. standard “nicht”
- Austrian markers: Discourse particles (Oida, eh, halt)
- Regional greetings: Servus, Grüß Gott, Moin distribution
- Code-switching: Dialect ↔︎ standard alternation
Bavarian Relativizer wo
A distinctive Bavarian feature is the use of wo as a relative clause marker, often combined with der/die/das:
| Example | Translation |
|---|---|
| Der oane der wo beim Starmania gsungen hat | The one who sang at Starmania |
| die wo denken das es echt is | those who think it’s real |
| Der Sturm der wo des kloane Ding ausradiert | The storm that wiped out the little thing |
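The construction can be approximated as a d-pronoun immediately followed by wo. The Python sketch below runs a rough pattern over the three examples above. The regular expression and the list of pronoun forms (including the dialectal spelling "des") are illustrative assumptions, not the corpus's own query, and a surface pattern like this will also catch some sequences that are not relative clauses.

```python
import re

# A d-pronoun directly followed by "wo"; the list of forms is an assumption.
RELATIVIZER = re.compile(r"\b(der|die|das|des|dem|den)\s+wo\b", re.IGNORECASE)

examples = [
    "Der oane der wo beim Starmania gsungen hat",
    "die wo denken das es echt is",
    "Der Sturm der wo des kloane Ding ausradiert",
]

# Collect the first pronoun + "wo" sequence found in each example.
hits = [RELATIVIZER.search(text).group(0) for text in examples]
print(hits)
```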
Regional Identity Studies
The author flair data enables:
- Austrian regional identity: State-based variation
- Urban vs. rural: Vienna vs. other states
- Cross-border comparison: Austrian vs. German communities
Register Variation
Comparing formal (r/Finanzen) vs. informal (r/ich_iel) communities reveals systematic register differences. Frequencies are given per million words (pmw):
| Feature | r/Finanzen | r/ich_iel | Ratio |
|---|---|---|---|
| allerdings (however) | 260 pmw | 76 pmw | 3.4× higher in r/Finanzen |
| jedoch (however) | 138 pmw | — | Formal only |
| ne? (tag question) | 3.5 pmw | 13.3 pmw | 3.8× higher in r/ich_iel |
Formal connectors (allerdings, jedoch) cluster in financial discussion; discourse particles (ne?) mark informal register.
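The ratios in the table are simple quotients of the per-million-word rates. A short Python sketch of both steps, the normalization and the comparison; the helper names are illustrative:

```python
def per_million(count: int, corpus_words: int) -> float:
    """Normalize a raw frequency to occurrences per million words (pmw)."""
    return count * 1_000_000 / corpus_words


def ratio(higher_pmw: float, lower_pmw: float) -> float:
    """How many times more frequent an item is in one community than in the other."""
    return higher_pmw / lower_pmw


# Rates taken from the register table.
allerdings = ratio(260, 76)   # r/Finanzen vs. r/ich_iel
ne_tag = ratio(13.3, 3.5)     # r/ich_iel vs. r/Finanzen

print(f"allerdings: {allerdings:.1f}x, ne?: {ne_tag:.1f}x")
```

Normalizing to pmw before dividing matters because the two subreddits differ in size; raw counts would mostly reflect which community has more text.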
Individual Variation (Idiolect)
Author-level analysis reveals individual style preferences. Example: intensifier choice with gut (“good”):
| Author | sehr gut | ziemlich gut | n |
|---|---|---|---|
| YMK1234 | 22% | 3% | 93 |
| LolaRuns | 6% | 12% | 223 |
Meme Language Analysis
The ich_iel (German) and aeiou (Austrian) meme communities are ideal for internet language research:
- Creative orthography: Deliberate “germanization” of English words (e.g., “Hochwähli” for upvote)
- Youth language: Slang evolution, neologisms, in-group markers
- Multimodal analysis: The is_image attribute identifies image posts for meme studies
- Humor linguistics: Wordplay, puns, cultural references
- Code-switching: German-English alternation in meme contexts
Community-Specific Markers
| Community | Phrase | Frequency | Notes |
|---|---|---|---|
| ich_iel | Hurensohn | 7,121 | In-group term |
| ich_iel | Sprich Deutsch | 365 | Anti-English meme |
| ich_iel | meine Kerle | 165 | Wednesday frog meme |
| aeiou | ned | 13,546 | Bavarian negation |
| aeiou | Oida | 5,046 | Austrian interjection |
| aeiou | Piefke | 714 | Term for Germans |
Meme Research Queries
Find image posts in meme communities:
Study “ich_iel” naming convention variations:
Find germanized English:
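The first two tasks can also be approximated outside the query interface. The Python sketch below assumes documents have been exported as dictionaries carrying the attributes documented above. The export format, the sample records, and the regular expression are all hypothetical, and reading "naming convention variations" as a changed separator between "ich" and "iel" is an assumption. The is_image flag is compared as the string "True", matching the Quick Reference.

```python
import re

MEME_SUBREDDITS = {"ich_iel", "aeiou"}

# "ich", a short separator (underscore, emoji, punctuation, ...), then "iel".
ICH_IEL_VARIANT = re.compile(r"^ich(.{1,3})iel$", re.IGNORECASE)


def is_meme_image(doc: dict) -> bool:
    """True for image posts from one of the meme communities."""
    return doc.get("subreddit") in MEME_SUBREDDITS and doc.get("is_image") == "True"


def title_separator(title: str):
    """Return the separator used in an ich_iel-style title, or None."""
    match = ICH_IEL_VARIANT.match(title.strip())
    return match.group(1) if match else None


# Invented sample records, for illustration only.
docs = [
    {"subreddit": "ich_iel", "is_image": "True", "title": "ich_iel"},
    {"subreddit": "ich_iel", "is_image": "False", "title": "ich-iel"},
    {"subreddit": "Finanzen", "is_image": "True", "title": "ETF Frage"},
]

image_memes = [d for d in docs if is_meme_image(d)]
separators = [title_separator(d["title"]) for d in docs]
print(len(image_memes), separators)
```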
Quick Reference
| Goal | CQL Query |
|---|---|
| Exact word | [word="Oida"] |
| All forms (lemma) | [lemma="gehen"] |
| Person names | [ent_type="PER"] |
| Locations | [ent_type="LOC"] |
| Subjects | [deprel="sb"] |
| Finite verbs | [tag="VVFIN"] |
| Bavarian dialect | within <doc lang_group="de_bavarian"/> |
| Vienna users | within <doc author_flair="Wien"/> |
| Austrian subreddit | within <doc subreddit="Austria"/> |
| Image posts | within <doc is_image="True"/> |
Data Sources & References
- Academic Torrents: Reddit archive dumps (June 2005 - December 2024)
- GlotLID v3: cis-lmu/glotlid — LMU Munich dialect detection
- spaCy German Model: de_core_news_lg — Large German NLP model
- STTS Tagset: Stuttgart-Tübingen Tagset
- TIGER Dependency Scheme: German dependency annotation standard