German Reddit Dialects (deRedDial)

German Reddit Dialects Corpus — Regional Variation Research (2010-2024)

Note: Current Version: deRedDial_mid

The current version covers 17 subreddits with ~145M words and ~4M documents. It features dual language identification: GlotLID (dialect detection) + fastText (baseline comparison). It covers the Bavarian/Austrian, Standard German, Swiss/Alemannic, and Northern German dialect zones.

Corpus Overview

deRedDial (German Reddit Dialects) is a large-scale corpus of German-language Reddit posts and comments specifically designed for dialect and regional variation research. It features dual language identification systems enabling comparative studies of:

  • Dialect variation: Bavarian, Alemannic, Swiss German, Low German
  • Regional zones: Austrian, Bavarian, Swiss, Northern German communities
  • LID comparison: GlotLID v3 (dialect-aware) vs fastText (baseline)
  • Register variation: From formal Q&A to casual memes
  • Temporal patterns: Full timeline 2010-2024

Key Statistics

Metric             Value
Total Words        ~145,000,000
Documents          ~4,000,000
Comments           ~3.7M (93%)
Submissions        ~300K (7%)
Subreddits         17
Time Period        2010-2024
Language           German
spaCy Model        de_core_news_lg
Dialect Detection  GlotLID v3 + fastText

Corpus Size Variants

Size  Tokens  Status
nano  ~2.5M   Test corpus
mini  ~25M    Demo corpus
mid   174M    Available

Subreddits Included

Austrian Communities

Subreddit  Description
Austria    Main Austrian subreddit. Politics, culture
aeiou      Austrian meme community. Dialect-heavy humor
wien       Vienna community. Urban Austrian German
graz       Graz/Styria. Styrian dialect influence

German Communities

Subreddit   Description
de          Main German subreddit. News, politics
ich_iel     German memes (“ich im echten Leben”)
FragReddit  German Q&A. Conversational register
de_IAmA     German AMA. Interview format
de_EDV      IT/tech community. Technical register
Finanzen    Finance community. Formal register

Bavarian Communities

Subreddit   Description
Munich      Munich/Upper Bavaria. Urban Bavarian
bavaria     Bavaria state. Regional Bavarian identity
fcbayern    FC Bayern Munich fans. Sports register
1860Munich  TSV 1860 Munich fans. Local sports community

Swiss/Alemannic & Northern German

Subreddit    Description
Switzerland  Swiss German community
Hamburg      Northern German / Low German influence
Berlin       Berlin German community

Dual Language Identification

The corpus features two independent language identification systems, enabling comparative dialect research.

GlotLID (Dialect Detection)

GlotLID v3 from LMU Munich provides fine-grained dialect detection:

lang_group    Description
de_standard   Standard High German
de_bavarian   Bavarian/Austrian dialect
de_alemannic  Alemannic/Swiss German
de_low        Low German (Plattdeutsch)
de_related    Related varieties
en            English (code-switching)
low_conf      Low confidence detection
other         Other languages
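The grouping in the table above can be pictured as a small post-processing step over raw GlotLID output. The following is only a sketch under stated assumptions: GlotLID emits fastText-style labels such as `__label__bar_Latn`, but the exact label-to-group mapping and the confidence cutoff used to build this corpus are not documented here, so `GROUPS`, `to_lang_group`, and `min_confidence=0.5` are hypothetical.

```python
# Hypothetical mapping from GlotLID labels (ISO 639-3 code + script) to the
# lang_group values in the table above. The corpus's real mapping may differ.
GROUPS = {
    "deu_Latn": "de_standard",
    "bar_Latn": "de_bavarian",
    "gsw_Latn": "de_alemannic",
    "nds_Latn": "de_low",
    "ltz_Latn": "de_related",  # assumption: one example of a related variety
    "eng_Latn": "en",
}


def to_lang_group(label, confidence, min_confidence=0.5):
    """Map one raw (label, confidence) prediction to a lang_group value.

    min_confidence is an assumed cutoff, not the corpus's documented value.
    """
    if confidence < min_confidence:
        return "low_conf"
    code = label.replace("__label__", "", 1)  # strip the fastText-style prefix
    return GROUPS.get(code, "other")


print(to_lang_group("__label__bar_Latn", 0.91))
print(to_lang_group("__label__deu_Latn", 0.30))
print(to_lang_group("__label__fra_Latn", 0.99))
```

Checking confidence first means a weak prediction is never reported as a dialect, which matches the separate low_conf group in the table.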

fastText (Baseline)

fastText provides standard language detection for comparison:

lang_ft_group  Description
de             German
en             English
other          Other languages

Dialect Queries

Filter by GlotLID dialect:

[lemma="gehen"] within <doc lang_group="de_bavarian"/>

Compare GlotLID vs fastText classification:

[word="ned"] within <doc lang_group="de_bavarian" & lang_ft_group="de"/>

Find Alemannic content:

[word=".*"] within <doc lang_group="de_alemannic"/>
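When many such queries are needed (for example, one per dialect group), it can help to generate the query strings programmatically. The sketch below only builds CQL strings in the shape shown above; the helper names `doc_filter` and `cql` are hypothetical, and nothing here talks to a concordancer.

```python
def doc_filter(**attrs):
    """Build a `within <doc .../>` clause from document attribute values."""
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"within <doc {conditions}/>"


def cql(token_query, **attrs):
    """Combine a token-level query with an optional document-level restriction."""
    if not attrs:
        return token_query
    return f"{token_query} {doc_filter(**attrs)}"


# Reproduce the first two example queries from this section.
bavarian_gehen = cql('[lemma="gehen"]', lang_group="de_bavarian")
bavarian_but_ft_german = cql('[word="ned"]', lang_group="de_bavarian", lang_ft_group="de")

print(bavarian_gehen)
print(bavarian_but_ft_german)
```

Keyword arguments keep their order, so the generated restriction lists attributes in the order they were passed.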

Available Attributes

Token Attributes

Attribute  Description                 Example
word       Exact word form             [word="Griaß"]
lemma      Base form (lemma)           [lemma="gehen"]
tag        POS tag (STTS)              [tag="VVFIN"]
head_n     Head position (dependency)  Dependency parsing
head       Head lemma                  Dependency parsing
deprel     Dependency relation         [deprel="sb"]
ent_type   Named entity type           [ent_type="PER"]
ent_iob    NER IOB tag                 [ent_iob="B"]
morph      Morphological features      [morph=".*Number=Plur.*"]

Document Attributes

Attribute          Description                 Research Use
doc.subreddit      Subreddit name              Community filtering
doc.author         Reddit username             Individual variation
doc.author_flair   User’s location flair       Regional variation research
doc.post_flair     Post category flair         Topic filtering
doc.date           Publication date (ISO)      Temporal filtering
doc.year           Publication year            Diachronic analysis
doc.doc_type       comment or submission       Content type filtering
doc.lang_group     GlotLID dialect grouping    Dialect research
doc.lang_ft_group  fastText language grouping  LID comparison
doc.is_image       Image post flag             Meme research

Linguistic Annotations

The corpus includes full spaCy annotations with the German STTS tagset.

Dependency Queries

Find subjects:

[deprel="sb"]

Find objects:

[deprel="oa"]

Named Entity Queries

Find person names:

[ent_type="PER"]

Find locations:

[ent_type="LOC"]

Common STTS Tags

Pattern  Meaning
N.*      All nouns
V.*      All verbs
VVFIN    Finite full verbs
ADJ.*    All adjectives
ADV      Adverbs
ART      Articles
APPR     Prepositions
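To see which tags a pattern from the table covers, the matching can be imitated outside the corpus tool. This is a minimal sketch, assuming the query engine matches a pattern against the whole tag (as Sketch Engine-style CQL does), which `re.fullmatch` emulates; `STTS_SAMPLE` is only a small illustrative subset of the STTS tagset, and `matching_tags` is a hypothetical helper.

```python
import re

# Illustrative subset of STTS tags, not the full tagset.
STTS_SAMPLE = ["NN", "NE", "VVFIN", "VAFIN", "VVINF", "ADJA", "ADJD", "ADV", "ART", "APPR"]


def matching_tags(pattern, tags=STTS_SAMPLE):
    """Return the tags that the pattern matches in full (anchored at both ends)."""
    return [tag for tag in tags if re.fullmatch(pattern, tag)]


print(matching_tags("N.*"))    # ['NN', 'NE']
print(matching_tags("V.*"))    # ['VVFIN', 'VAFIN', 'VVINF']
print(matching_tags("ADJ.*"))  # ['ADJA', 'ADJD']
print(matching_tags("ADV"))    # ['ADV']
```

Because the match is anchored, the plain pattern `ADV` does not pick up longer tags that merely start with those letters.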

Research Applications

Dialect Research Questions

  • Lexical variation: Bavarian “ned” vs. standard “nicht”
  • Austrian markers: Discourse particles (Oida, eh, halt)
  • Regional greetings: Servus, Grüß Gott, Moin distribution
  • Code-switching: Dialect ↔︎ standard alternation
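For lexical-variation questions like these, raw hit counts usually need normalising before subcorpora of different sizes can be compared. The sketch below shows two common measures, frequency per million tokens and the share of a dialect variant among all realisations of a variable. The function names are hypothetical and the numbers in the usage lines are made up purely for illustration; they are not corpus results.

```python
def per_million(hits, corpus_size):
    """Relative frequency: hits per one million tokens of the (sub)corpus."""
    return hits * 1_000_000 / corpus_size


def variant_share(variant_hits, standard_hits):
    """Share of the dialect variant among all realisations of the variable."""
    total = variant_hits + standard_hits
    return variant_hits / total if total else 0.0


# Made-up counts, for illustration only: hits for "ned" and "nicht"
# in a hypothetical 2.5M-token subcorpus.
ned_hits, nicht_hits, subcorpus_tokens = 250, 750, 2_500_000

print(per_million(ned_hits, subcorpus_tokens))
print(variant_share(ned_hits, nicht_hits))
```

The variant share is independent of subcorpus size, which makes it convenient for comparing communities such as the Austrian and Bavarian subreddits.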

LID Comparison Studies

Compare dialect detection accuracy between GlotLID and fastText:

  • Identify dialect-specific content that GlotLID flags but the fastText grouping (de, en, other) cannot distinguish
  • Evaluate false positive rates for dialect classification
  • Study what linguistic features trigger dialect detection
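One way to approach these comparisons is to cross-tabulate the two document attributes. The following sketch assumes the `lang_group` and `lang_ft_group` values have been exported as one pair per document; the export format, the helper names, and the sample records are all hypothetical and do not reflect actual corpus counts.

```python
from collections import Counter

# Hypothetical export: one (lang_group, lang_ft_group) pair per document.
docs = [
    ("de_bavarian", "de"),
    ("de_bavarian", "other"),
    ("de_standard", "de"),
    ("de_alemannic", "other"),
    ("en", "en"),
    ("de_standard", "de"),
]


def cross_tab(pairs):
    """Count how often each GlotLID group co-occurs with each fastText group."""
    return Counter(pairs)


def share_ft_german(pairs, glot_group):
    """Share of documents in a GlotLID group that fastText labels as German."""
    ft_labels = [ft for glot, ft in pairs if glot == glot_group]
    if not ft_labels:
        return 0.0
    return sum(ft == "de" for ft in ft_labels) / len(ft_labels)


table = cross_tab(docs)
for (glot, ft), count in sorted(table.items()):
    print(glot, ft, count)

print(share_ft_german(docs, "de_bavarian"))
```

Cells where a dialect group pairs with fastText `other` are natural candidates for manual inspection, since they show where the baseline did not recognise the text as German at all.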

Regional Identity Studies

The author flair data enables:

  • Austrian regional identity: State-based variation
  • Urban vs. rural: Vienna vs. other states
  • Cross-border comparison: Austrian vs. German communities

Quick Reference

Goal                CQL Query
Exact word          [word="Oida"]
All forms (lemma)   [lemma="gehen"]
Person names        [ent_type="PER"]
Locations           [ent_type="LOC"]
Subjects            [deprel="sb"]
Finite verbs        [tag="VVFIN"]
Bavarian dialect    within <doc lang_group="de_bavarian"/>
Alemannic dialect   within <doc lang_group="de_alemannic"/>
Austrian subreddit  within <doc subreddit="Austria"/>
Image posts         within <doc is_image="True"/>

Data Sources & References

  • Academic Torrents: Reddit archive dumps (June 2005 - December 2024)
  • GlotLID v3: cis-lmu/glotlid — LMU Munich dialect detection
  • fastText Language ID: fasttext.cc — Baseline LID
  • spaCy German Model: de_core_news_lg — Large German NLP model
  • STTS Tagset: Stuttgart-Tübingen Tagset
  • TIGER Dependency Scheme: German dependency annotation standard