German Reddit Dialects (deRedDial)
German Reddit Dialects Corpus — Regional Variation Research (2010-2024)
17 subreddits with 145M words and ~4M documents. Features dual language identification: GlotLID (dialect detection) + fastText (baseline comparison). Covers Bavarian/Austrian, Standard German, Swiss/Alemannic, and Northern German dialect zones.
Corpus Overview
deRedDial (German Reddit Dialects) is a large-scale corpus of German-language Reddit posts and comments specifically designed for dialect and regional variation research. It features dual language identification systems enabling comparative studies of:
- Dialect variation: Bavarian, Alemannic, Swiss German, Low German
- Regional zones: Austrian, Bavarian, Swiss, Northern German communities
- LID comparison: GlotLID v3 (dialect-aware) vs fastText (baseline)
- Register variation: From formal Q&A to casual memes
- Temporal patterns: Full timeline 2010-2024
Key Statistics
| Metric | Value |
|---|---|
| Total Words | ~145,000,000 |
| Documents | ~4,000,000 |
| Comments | ~3.7M (93%) |
| Submissions | ~300K (7%) |
| Subreddits | 17 |
| Time Period | 2010-2024 |
| Language | German |
| spaCy Model | de_core_news_lg |
| Language Identification | GlotLID v3 (dialect-aware) + fastText (baseline) |
Corpus Size Variants
| Size | Tokens | Status |
|---|---|---|
| nano | ~2.5M | Test corpus |
| mini | ~25M | Demo corpus |
| mid | 174M | Available |
Subreddits Included
Austrian Communities
| Subreddit | Description |
|---|---|
| Austria | Main Austrian subreddit. Politics, culture |
| aeiou | Austrian meme community. Dialect-heavy humor |
| wien | Vienna community. Urban Austrian German |
| graz | Graz/Styria. Styrian dialect influence |
German Communities
| Subreddit | Description |
|---|---|
| de | Main German subreddit. News, politics |
| ich_iel | German memes (“ich im echten Leben”) |
| FragReddit | German Q&A. Conversational register |
| de_IAmA | German AMA. Interview format |
| de_EDV | IT/tech community. Technical register |
| Finanzen | Finance community. Formal register |
Bavarian Communities
| Subreddit | Description |
|---|---|
| Munich | Munich/Upper Bavaria. Urban Bavarian |
| bavaria | Bavaria state. Regional Bavarian identity |
| fcbayern | FC Bayern Munich fans. Sports register |
| 1860Munich | TSV 1860 Munich fans. Local sports community |
Swiss/Alemannic & Northern German
| Subreddit | Description |
|---|---|
| Switzerland | Swiss German community |
| Hamburg | Northern German / Low German influence |
| Berlin | Berlin German community |
Dual Language Identification
The corpus features two independent language identification systems, enabling comparative dialect research.
GlotLID (Dialect Detection)
GlotLID v3 from LMU Munich provides fine-grained dialect detection:
| lang_group | Description |
|---|---|
| de_standard | Standard High German |
| de_bavarian | Bavarian/Austrian dialect |
| de_alemannic | Alemannic/Swiss German |
| de_low | Low German (Plattdeutsch) |
| de_related | Related varieties |
| en | English (code-switching) |
| low_conf | Low confidence detection |
| other | Other languages |
fastText (Baseline)
fastText provides standard language detection for comparison:
| lang_ft_group | Description |
|---|---|
| de | German |
| en | English |
| other | Other languages |
Dialect Queries
Filter by GlotLID dialect:
Compare GlotLID vs fastText classification:
Find Alemannic content:
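The three queries named above can be sketched by combining a token pattern with the `within <doc .../>` restriction listed in the Quick Reference. The Python helper below only assembles query strings. The function name `within_doc`, the example word "ned", and the assumption that several document attributes may be combined inside one `<doc/>` tag are illustrative and may need adjusting to the query interface the corpus is served through.

```python
def within_doc(token_query: str, **doc_attrs: str) -> str:
    """Restrict a token-level CQL query to documents with the given attributes."""
    attrs = " ".join(f'{name}="{value}"' for name, value in doc_attrs.items())
    return f"{token_query} within <doc {attrs}/>"


# Filter by GlotLID dialect: the form "ned" in documents grouped as Bavarian.
bavarian_ned = within_doc('[word="ned"]', lang_group="de_bavarian")

# Compare GlotLID vs fastText: documents that GlotLID groups as Bavarian
# while fastText groups them as plain German (attribute combination is assumed).
glotlid_only = within_doc("[]", lang_group="de_bavarian", lang_ft_group="de")

# Find Alemannic content: any token in documents grouped as Alemannic.
alemannic_any = within_doc("[]", lang_group="de_alemannic")

for query in (bavarian_ned, glotlid_only, alemannic_any):
    print(query)
```

The empty pattern `[]` matches any token, so the second and third queries return every token position in the selected documents; replacing it with a specific pattern narrows the result.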
Available Attributes
Token Attributes
| Attribute | Description | Example |
|---|---|---|
| word | Exact word form | [word="Griaß"] |
| lemma | Base form (lemma) | [lemma="gehen"] |
| tag | POS tag (STTS) | [tag="VVFIN"] |
| head_n | Head position (dependency) | Dependency parsing |
| head | Head lemma | Dependency parsing |
| deprel | Dependency relation | [deprel="sb"] |
| ent_type | Named entity type | [ent_type="PER"] |
| ent_iob | NER IOB tag | [ent_iob="B"] |
| morph | Morphological features | `[morph=".*Number=Plur.*"]` |
Document Attributes
| Attribute | Description | Research Use |
|---|---|---|
| doc.subreddit | Subreddit name | Community filtering |
| doc.author | Reddit username | Individual variation |
| doc.author_flair | User’s location flair | Regional variation research |
| doc.post_flair | Post category flair | Topic filtering |
| doc.date | Publication date (ISO) | Temporal filtering |
| doc.year | Publication year | Diachronic analysis |
| doc.doc_type | comment or submission | Content type filtering |
| doc.lang_group | GlotLID dialect grouping | Dialect research |
| doc.lang_ft_group | fastText language grouping | LID comparison |
| doc.is_image | Image post flag | Meme research |
Linguistic Annotations
The corpus is fully annotated with spaCy (de_core_news_lg): part-of-speech tags follow the German STTS tagset and dependency relations follow the TIGER scheme.
Dependency Queries
Find subjects:
Find objects:
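A minimal sketch of the subject and object queries, written as a Python helper that builds single-token CQL patterns. The subject label `sb` comes from the attribute table. The object labels `oa` (accusative object) and `da` (dative object) are TIGER labels assumed to be present in this corpus, and the helper name `token` is hypothetical.

```python
def token(**attrs: str) -> str:
    """Build one CQL token pattern; several constraints are joined with '&'."""
    constraints = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{constraints}]"


# Find subjects (TIGER label "sb").
subjects = token(deprel="sb")

# Find objects: accusative only, or accusative and dative via regex alternation.
accusative_objects = token(deprel="oa")
objects = token(deprel="oa|da")

# Narrow subjects to common nouns by adding an STTS tag constraint.
noun_subjects = token(deprel="sb", tag="NN")

for query in (subjects, accusative_objects, objects, noun_subjects):
    print(query)
```

Attribute values in CQL are regular expressions, which is why `oa|da` covers both object types in a single pattern.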
Named Entity Queries
Find person names:
Find locations:
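The entity queries can be sketched in two forms: a single-token pattern, as in the Quick Reference, and a span pattern that uses `ent_iob` to match a whole multi-token name. The span form assumes the usual IOB convention (`B` opens an entity, `I` continues it) and that the query engine supports the `*` repetition operator. The function name `entity_span` is hypothetical.

```python
def entity_span(ent_type: str) -> str:
    """CQL pattern for a complete entity: one B token, then any number of I tokens."""
    begin = f'[ent_type="{ent_type}" & ent_iob="B"]'
    inside = f'[ent_type="{ent_type}" & ent_iob="I"]*'
    return f"{begin} {inside}"


# Single-token queries: every token that belongs to an entity of that type.
person_tokens = '[ent_type="PER"]'
location_tokens = '[ent_type="LOC"]'

# Span queries: one hit per entity mention rather than one hit per token.
person_names = entity_span("PER")
locations = entity_span("LOC")

for query in (person_tokens, location_tokens, person_names, locations):
    print(query)
```

The single-token form counts a two-word name as two hits, so the span form is the safer choice when counting entity mentions.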
Research Applications
Dialect Research Questions
- Lexical variation: Bavarian “ned” vs. standard “nicht”
- Austrian markers: Discourse particles (Oida, eh, halt)
- Regional greetings: Servus, Grüß Gott, Moin distribution
- Code-switching: Dialect ↔︎ standard alternation
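For lexical variation questions such as "ned" vs. "nicht", raw hit counts are not comparable across subreddits of different sizes, so they are usually normalized. The sketch below shows two common measures: hits per million tokens and the share of the dialect variant among both variants. All counts are placeholders, not results from the corpus, and the names `per_million`, `variant_share`, `subreddit_a`, and `subreddit_b` are hypothetical.

```python
def per_million(hits: int, corpus_tokens: int) -> float:
    """Normalize a raw hit count to hits per million tokens."""
    return hits * 1_000_000 / corpus_tokens


def variant_share(variant_hits: int, standard_hits: int) -> float:
    """Proportion of the dialect variant among dialect + standard forms."""
    total = variant_hits + standard_hits
    return variant_hits / total if total else 0.0


# Placeholder counts for illustration only (NOT taken from deRedDial).
counts = {
    "subreddit_a": {"ned": 300, "nicht": 700, "tokens": 2_000_000},
    "subreddit_b": {"ned": 50, "nicht": 9_950, "tokens": 20_000_000},
}

for name, c in counts.items():
    rate = per_million(c["ned"], c["tokens"])
    share = variant_share(c["ned"], c["nicht"])
    print(f"{name}: {rate:.1f} per million, share of 'ned' = {share:.3f}")
```

The variant share is often the more informative measure for dialect studies, because it controls for how often negation occurs at all in a given community.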
LID Comparison Studies
Compare how GlotLID and fastText classify the same documents:
- Identify dialect-specific content that GlotLID detects and fastText misses
- Evaluate false positive rates for dialect classification
- Study which linguistic features trigger dialect detection
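One way to approach such a comparison is to cross-tabulate the two document attributes, `lang_group` and `lang_ft_group`, over exported document metadata. The sketch below assumes the metadata is available as a list of dictionaries. The toy records are invented for illustration and are not corpus data, and the names `cross_tabulate`, `dialect_seen_as_german`, and `DIALECT_GROUPS` are hypothetical.

```python
from collections import Counter

# Toy metadata records for illustration only (NOT taken from deRedDial).
records = [
    {"lang_group": "de_bavarian", "lang_ft_group": "de"},
    {"lang_group": "de_bavarian", "lang_ft_group": "other"},
    {"lang_group": "de_standard", "lang_ft_group": "de"},
    {"lang_group": "de_alemannic", "lang_ft_group": "other"},
    {"lang_group": "en", "lang_ft_group": "en"},
]

# GlotLID groups treated as dialect labels in this sketch.
DIALECT_GROUPS = {"de_bavarian", "de_alemannic", "de_low"}


def cross_tabulate(docs):
    """Count documents for each (GlotLID group, fastText group) pair."""
    return Counter((d["lang_group"], d["lang_ft_group"]) for d in docs)


def dialect_seen_as_german(table):
    """Share of GlotLID dialect documents that fastText groups as plain 'de'."""
    dialect_total = sum(n for (g, _), n in table.items() if g in DIALECT_GROUPS)
    as_de = sum(
        n for (g, ft), n in table.items() if g in DIALECT_GROUPS and ft == "de"
    )
    return as_de / dialect_total if dialect_total else 0.0


table = cross_tabulate(records)
for (glotlid, fasttext), n in sorted(table.items()):
    print(f"{glotlid:>13} / {fasttext:<5} {n}")
print(f"dialect docs grouped as 'de' by fastText: {dialect_seen_as_german(table):.2f}")
```

A cross-tabulation like this shows where the two systems disagree but not which one is right; judging false positives still requires manually checked samples.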
Regional Identity Studies
The author flair data enables:
- Austrian regional identity: State-based variation
- Urban vs. rural: Vienna vs. other states
- Cross-border comparison: Austrian vs. German communities
Quick Reference
| Goal | CQL Query |
|---|---|
| Exact word | [word="Oida"] |
| All forms (lemma) | [lemma="gehen"] |
| Person names | [ent_type="PER"] |
| Locations | [ent_type="LOC"] |
| Subjects | [deprel="sb"] |
| Finite verbs | [tag="VVFIN"] |
| Bavarian dialect | within <doc lang_group="de_bavarian"/> |
| Alemannic dialect | within <doc lang_group="de_alemannic"/> |
| Austrian subreddit | within <doc subreddit="Austria"/> |
| Image posts | within <doc is_image="True"/> |
Data Sources & References
- Academic Torrents: Reddit archive dumps (June 2005 - December 2024)
- GlotLID v3: cis-lmu/glotlid — LMU Munich dialect detection
- fastText Language ID: fasttext.cc — Baseline LID
- spaCy German Model: de_core_news_lg — Large German NLP model
- STTS Tagset: Stuttgart-Tübingen Tagset
- TIGER Dependency Scheme: German dependency annotation standard