German Reddit (deRed)

German Reddit Corpus — Dialect, Regional & Meme Research (2018-2025)

Note: Current Version: deRed_mini

14 subreddits with 51M words and 1.9M documents. Features GlotLID dialect detection (Bavarian, Alemannic, Swiss German), regional author flairs for location research, and dedicated meme communities (ich_iel, aeiou) for internet language studies.

Corpus Overview

deRed (German Reddit) is a corpus of German-language Reddit posts and comments designed for dialect research, regional variation studies, and meme/internet language analysis. It features automatic dialect detection using GlotLID v3 from LMU Munich, enabling studies of:

  • Dialect variation: Bavarian, Alemannic, Swiss German, Low German
  • Regional identity: Author flairs indicating Austrian states, German regions
  • Meme language: ich_iel and aeiou communities with image metadata
  • Register variation: From formal Q&A to creative internet discourse
  • Temporal patterns: Full timeline 2018-2025

Key Statistics

Metric | Value
Total Words | ~51,000,000
Documents | 1,914,012
Comments | 1,779,884 (93%)
Submissions | 134,128 (7%)
Subreddits | 14
Time Period | 2018-2025
Language | German
spaCy Model | de_core_news_lg
Dialect Detection | GlotLID v3

Corpus Size Roadmap

Size | Subreddits | Tokens | Status
nano | 5 | ~2M | Test corpus
mini | 14 | 61M | Available
medium | 14+ | ~400M | Planned
large | 14+ | ~2B | Planned

Subreddits Included

Austrian Communities

Subreddit | Size (tokens, or documents where marked "docs") | Description
Austria | 7.7M | Main Austrian subreddit. Politics, culture
aeiou | 2.2M | Austrian meme community. Dialect-heavy humor
wien | 40K docs | Vienna community. Urban Austrian German
graz | 29K docs | Graz/Styria. Styrian dialect influence

German Communities

Subreddit | Size (documents) | Description
de | 27K docs | Main German subreddit. News, politics
ich_iel | 33K docs | German memes (“ich im echten Leben”)
FragReddit | 30K docs | German Q&A. Conversational register
de_IAmA | 34K docs | German AMA. Interview format
de_EDV | 25K docs | IT/tech community. Technical register
Finanzen | 17K docs | Finance community. Formal register

Bavarian Communities

Subreddit | Size (tokens, or documents where marked "docs") | Description
Munich | 10K docs | Munich/Upper Bavaria. Urban Bavarian
bavaria | 24K docs | Bavaria state. Regional Bavarian identity
fcbayern | 22K docs | FC Bayern Munich fans. Sports register
1860Munich | 58K | TSV 1860 Munich fans. Local sports community

Dialect Detection (GlotLID)

The corpus uses GlotLID v3 from LMU Munich for dialect-aware language detection, enabling comparative studies of German varieties.
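
As a rough illustration of how detector output could be turned into coarse groups, the sketch below maps GlotLID-style labels (an ISO 639-3 code plus a script, for example __label__bar_Latn) to group names like the lang_group values used in this corpus. The mapping table, the function name lang_group, and the 0.5 confidence threshold are assumptions made for illustration; the grouping rules of the actual corpus pipeline are not documented here, and the de_related group is left out because its members are not specified.

```python
# Hypothetical mapping from GlotLID labels to the lang_group values listed
# in this documentation. The groupings and the default threshold are
# assumptions for illustration, not the corpus pipeline itself.
GROUPS = {
    "deu_Latn": "de_standard",   # Standard German
    "eng_Latn": "en",            # English
    "bar_Latn": "de_bavarian",   # Bavarian
    "gsw_Latn": "de_alemannic",  # Swiss German / Alemannic
    "nds_Latn": "de_low",        # Low German
}


def lang_group(label: str, confidence: float, threshold: float = 0.5) -> str:
    """Map one (label, confidence) prediction to a coarse language group."""
    if confidence < threshold:
        return "low_conf"
    code = label.removeprefix("__label__")  # Python 3.9+
    return GROUPS.get(code, "other")


print(lang_group("__label__bar_Latn", 0.93))
print(lang_group("__label__deu_Latn", 0.31))
print(lang_group("__label__fra_Latn", 0.99))
```

Checking the confidence first means that an uncertain prediction is never counted as a dialect hit, which is one plausible reason to keep a separate low-confidence group.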

Dialect Distribution

Language Group | Tokens | Percentage | Description
de_standard | 7,754,498 | 77.5% | Standard High German
en | 1,546,335 | 15.5% | English (code-switching)
de_bavarian | 345,018 | 3.4% | Bavarian/Austrian dialect
low_conf | 215,644 | 2.2% | Low confidence detection
other | 109,931 | 1.1% | Other languages
de_alemannic | 25,021 | 0.25% | Alemannic/Swiss German
de_low | 2,001 | 0.02% | Low German (Plattdeutsch)
de_related | 1,552 | 0.01% | Related varieties
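
The percentages in the table can be recomputed from the token counts. The short sketch below does this with the figures copied from the table; the variable names are illustrative. Note that the counts sum to exactly 10,000,000 tokens, so the shares are relative to that total rather than to the ~51M words of the full corpus. The source does not say why.

```python
# Token counts per language group, copied from the table above.
counts = {
    "de_standard": 7_754_498,
    "en": 1_546_335,
    "de_bavarian": 345_018,
    "low_conf": 215_644,
    "other": 109_931,
    "de_alemannic": 25_021,
    "de_low": 2_001,
    "de_related": 1_552,
}

total = sum(counts.values())

# Share of each group as a percentage of all counted tokens.
shares = {group: 100 * n / total for group, n in counts.items()}

for group, pct in shares.items():
    print(f"{group:<13}{counts[group]:>10,}{pct:>8.2f}%")
print(f"{'total':<13}{total:>10,}")
```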

Dialect Queries

Filter by detected dialect:

[lemma="gehen"] within <doc lang_group="de_bavarian"/>

Compare standard vs. dialectal negation:

[word="nicht"] within <doc lang_group="de_standard"/>

vs.

[word="ned"|word="net"] within <doc lang_group="de_bavarian"/>

Regional Author Flairs

Author flairs enable geographic filtering for regional variation research. Users select their own location flair, so the flairs provide self-reported regional labels that can serve as a ground-truth proxy.

Austrian States (Top Flairs)

Author Flair | Tokens | Region
Wien | 645,023 | Vienna
Steiermark | 275,871 | Styria
Oberösterreich | 247,676 | Upper Austria
Niederösterreich | 181,232 | Lower Austria
Burgenland | 112,912 | Burgenland
Salzburg | 80,795 | Salzburg
Tirol | 62,009 | Tyrol
Kärnten | 53,564 | Carinthia
Vorarlberg | 28,901 | Vorarlberg

Regional Queries

Find posts from Vienna users:

[lemma="sein"] within <doc author_flair="Wien"/>

Compare Austrian regions:

[word="Oida"] within <doc author_flair="Wien"/>

vs.

[word="Oida"] within <doc author_flair="Steiermark"/>
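
When the same token query has to be run for several flairs, the restriction can be generated rather than typed by hand. The sketch below is a minimal helper that only builds query strings in the syntax used on this page; the function name within_doc is hypothetical, and how the queries are submitted to a search interface is left open.

```python
def within_doc(token_query: str, **attrs: str) -> str:
    """Append a `within <doc .../>` restriction to a CQL token query.

    Several attributes are joined with " & ", mirroring the combined
    subreddit/is_image restriction shown elsewhere in this documentation.
    """
    conds = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"{token_query} within <doc {conds}/>"


# One query per Austrian state flair (a subset of the flair table).
states = ["Wien", "Steiermark", "Oberösterreich"]
queries = [within_doc('[word="Oida"]', author_flair=state) for state in states]

for query in queries:
    print(query)
```

The helper does not escape quotes or regex metacharacters in attribute values, so it is only suitable for plain values such as the flair names above.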

Available Attributes

Token Attributes

Attribute | Description | Example
word | Exact word form | [word="Grüß"]
lemma | Base form (lemma) | [lemma="gehen"]
tag | POS tag (STTS) | [tag="VVFIN"]
head_n | Head position (dependency) | Dependency parsing
head | Head lemma | Dependency parsing
deprel | Dependency relation | [deprel="sb"]
ent_type | Named entity type | [ent_type="PER"]
ent_iob | NER IOB tag | [ent_iob="B"]
morph | Morphological features | [morph=".*Number=Plur.*"]

Document Attributes

Attribute | Description | Research Use
doc.subreddit | Subreddit name | Community filtering
doc.author | Reddit username | Individual variation
doc.author_flair | User’s location flair | Regional variation research
doc.post_flair | Post category flair | Topic filtering
doc.date | Publication date (ISO) | Temporal filtering
doc.year | Publication year | Diachronic analysis
doc.doc_type | comment or submission | Content type filtering
doc.lang_group | Dialect grouping | Dialect research
doc.lang_conf | Detection confidence | Quality filtering
doc.is_image | Image post flag | Meme research
doc.title | Submission title | Topic analysis
doc.score | Vote score | Engagement filtering

Linguistic Annotations

The corpus includes full spaCy annotations with the German STTS tagset.

Annotation Statistics

Feature | Count | Description
Subject deps | 4,573,404 | [deprel="sb"] subjects
Person entities | 737,575 | [ent_type="PER"] names
Location entities | ~200K | [ent_type="LOC"] places
Org entities | ~150K | [ent_type="ORG"] organizations

Dependency Queries

Find subjects:

[deprel="sb"]

Find objects:

[deprel="oa"]

Named Entity Queries

Find person names:

[ent_type="PER"]

Find locations:

[ent_type="LOC"]

Common STTS Tags

Pattern | Meaning
N.* | All nouns
V.* | All verbs
VVFIN | Finite full verbs
ADJ.* | All adjectives
ADV | Adverbs
ART | Articles
APPR | Prepositions
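
To see which tags a pattern such as N.* selects, the patterns can be tried against a handful of STTS tags. The sketch below assumes that the query engine matches a pattern against the whole attribute value, as is common for CQL implementations; re.fullmatch imitates that behavior. The tag list is a small sample rather than the full tagset, and the function name is illustrative.

```python
import re

# A small sample of STTS tags (not the full tagset).
STTS_SAMPLE = ["NN", "NE", "VVFIN", "VAFIN", "VVINF",
               "ADJA", "ADJD", "ADV", "ART", "APPR"]


def matching_tags(pattern: str, tags: list[str] = STTS_SAMPLE) -> list[str]:
    """Return the tags that a CQL-style pattern would select.

    Assumption: the pattern must match the entire tag, not a substring.
    """
    return [tag for tag in tags if re.fullmatch(pattern, tag)]


for pattern in ["N.*", "V.*", "ADJ.*", "ADV"]:
    print(pattern, "->", matching_tags(pattern))
```

A looser pattern such as AD.* would also pick up ADV alongside the adjective tags, which is why the table uses ADJ.* for adjectives.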

Research Applications

Dialect Research Questions

  • Lexical variation: Bavarian “ned” vs. standard “nicht”
  • Austrian markers: Discourse particles (Oida, eh, halt)
  • Regional greetings: Servus, Grüß Gott, Moin distribution
  • Code-switching: Dialect ↔ standard alternation

Bavarian Relativizer wo

A distinctive Bavarian feature is the use of wo as a relative clause marker, often combined with der/die/das:

Example | Translation
Der oane der wo beim Starmania gsungen hat | The one who sang at Starmania
die wo denken das es echt is | those who think it’s real
Der Sturm der wo des kloane Ding ausradiert | The storm that wiped out the little thing

[word="wo"] within

Regional Identity Studies

The author flair data enables:

  • Austrian regional identity: State-based variation
  • Urban vs. rural: Vienna vs. other states
  • Cross-border comparison: Austrian vs. German communities

Register Variation

Comparing formal (r/Finanzen) vs. informal (r/ich_iel) communities reveals systematic register differences:

Feature | r/Finanzen | r/ich_iel | Ratio
allerdings (however) | 260 pmw | 76 pmw | 3.4× more formal
jedoch (however) | 138 pmw | | Formal-only
ne? (tag question) | 3.5 pmw | 13.3 pmw | 3.8× more informal

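Formal connectors (allerdings, jedoch) cluster in financial discussion, while discourse particles (ne?) mark the informal register. Frequencies are given per million words (pmw), so communities of different sizes can be compared directly.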

[word="allerdings"] within
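
The register comparison rests on two small calculations: normalizing raw hits to hits per million words, and dividing one normalized frequency by the other. The sketch below shows both. The raw hit count and subcorpus size in the first call are hypothetical values chosen only to illustrate the arithmetic; the ratios are recomputed from the pmw values in the table.

```python
def per_million(hits: int, subcorpus_words: int) -> float:
    """Normalize a raw hit count to hits per million words (pmw)."""
    return hits * 1_000_000 / subcorpus_words


def ratio(pmw_a: float, pmw_b: float) -> float:
    """How many times more frequent a feature is in A than in B."""
    return pmw_a / pmw_b


# Hypothetical raw counts, used only to show the normalization step.
print(per_million(130, 500_000))  # 260.0

# Ratios recomputed from the pmw values in the table.
print(round(ratio(260, 76), 1))    # allerdings: r/Finanzen vs. r/ich_iel
print(round(ratio(13.3, 3.5), 1))  # ne?: r/ich_iel vs. r/Finanzen
```

No ratio is computed for jedoch because the table gives no r/ich_iel value for it.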

Individual Variation (Idiolect)

Author-level analysis reveals individual style preferences. Example: intensifier choice with gut (“good”):

Author | sehr gut | ziemlich gut | n
YMK1234 | 22% | 3% | 93
LolaRuns | 6% | 12% | 223

[tag="ADV" & head="gut"] within
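
Percentages over small samples can hide very low absolute counts. The sketch below converts the shares in the table back into approximate hit counts, assuming that n is the total number of adverb + gut hits for each author (the source does not define n explicitly). The function name is illustrative.

```python
# Shares and sample sizes copied from the table above.
authors = {
    "YMK1234": {"n": 93, "sehr gut": 0.22, "ziemlich gut": 0.03},
    "LolaRuns": {"n": 223, "sehr gut": 0.06, "ziemlich gut": 0.12},
}


def implied_hits(share: float, n: int) -> int:
    """Approximate hit count behind a rounded percentage."""
    return round(share * n)


for name, row in authors.items():
    for phrase in ("sehr gut", "ziemlich gut"):
        print(name, phrase, implied_hits(row[phrase], row["n"]))
```

Under that assumption, the 3% figure for YMK1234 corresponds to roughly three hits, so differences of this size are best read as tendencies rather than firm contrasts.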

Meme Language Analysis

The ich_iel (German) and aeiou (Austrian) meme communities are ideal for internet language research:

  • Creative orthography: Deliberate “germanization” of English words (e.g., “Hochwähli” for upvote)
  • Youth language: Slang evolution, neologisms, in-group markers
  • Multimodal analysis: is_image attribute identifies image posts for meme studies
  • Humor linguistics: Wordplay, puns, cultural references
  • Code-switching: German-English alternation in meme contexts

Community-Specific Markers

Community | Phrase | Frequency | Notes
ich_iel | Hurensohn | 7,121 | In-group term
ich_iel | Sprich Deutsch | 365 | Anti-English meme
ich_iel | meine Kerle | 165 | Wednesday frog meme
aeiou | ned | 13,546 | Bavarian negation
aeiou | Oida | 5,046 | Austrian interjection
aeiou | Piefke | 714 | Term for Germans

Meme Research Queries

Find image posts in meme communities:

[word=".*"] within <doc subreddit="ich_iel" & is_image="True"/>

Study “ich_iel” naming convention variations:

[word=".*iel"]

Find germanized English:

[word="Hochwähli"|word="Runterwähli"|word="Wiederpfosten"]
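
Alternation queries like the one above grow quickly as more variants are added. A minimal sketch for generating them from a word list follows; the helper name any_word is hypothetical, and it only produces the query string in the syntax used on this page.

```python
def any_word(*forms: str) -> str:
    """Build one CQL token that matches any of the given word forms."""
    return "[" + "|".join(f'word="{form}"' for form in forms) + "]"


print(any_word("Hochwähli", "Runterwähli", "Wiederpfosten"))
print(any_word("ned", "net"))
```

The forms are inserted as-is, so a form containing regex metacharacters or double quotes would need escaping before use.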

Quick Reference

Goal | CQL Query
Exact word | [word="Oida"]
All forms (lemma) | [lemma="gehen"]
Person names | [ent_type="PER"]
Locations | [ent_type="LOC"]
Subjects | [deprel="sb"]
Finite verbs | [tag="VVFIN"]
Bavarian dialect | within <doc lang_group="de_bavarian"/>
Vienna users | within <doc author_flair="Wien"/>
Austrian subreddit | within <doc subreddit="Austria"/>
Image posts | within <doc is_image="True"/>

Data Sources & References

  • Academic Torrents: Reddit archive dumps (June 2005 - December 2024)
  • GlotLID v3: cis-lmu/glotlid — LMU Munich dialect detection
  • spaCy German Model: de_core_news_lg — Large German NLP model
  • STTS Tagset: Stuttgart-Tübingen Tagset
  • TIGER Dependency Scheme: German dependency annotation standard