Center for Information and Language Processing (CIS), LMU
The wants and needs of linguists:
Needs and interest also at other LMU institutes:
maybe also: Center for Information and Language Processing (CIS)
Goal: collaborative, modern, sustainable corpus-linguistic work.
First step: developing an app to explore potential.
https://www.wuerschinger.org/mcl
Username: lipp@lmu.de
Password: Sch3ll!ng1860
Backend: NoSketch Engine
React frontend

NoSketch Engine frontend

Reddit: Social media platform with topical communities (subreddits) (Baumgartner et al. 2020).
174M tokens from 17 subreddits (2008–2024)
| Subreddit | Tokens |
|---|---|
| r/de | ~33M |
| r/Finanzen | ~27M |
| r/Austria | ~19M |
| r/de_EDV | ~16M |
| r/FragReddit | ~14M |
| r/Switzerland | ~13M |
| r/wien | ~12M |
| r/Munich | ~11M |
| r/fcbayern | ~8M |
| r/hamburg | ~6M |
| r/FinanzenAT | ~6M |
| r/graz | ~4M |
| r/BUENZLI | ~2M |
| r/ich_iel | ~2M |
| r/aeiou | ~1M |
| r/okoidawappler | ~1M |
| r/bavaria | ~1M |
All features of Sketch Engine:
Excluded: Word Sketches, Ngrams, and Thesaurus
Additional features:
NLP annotation with spaCy de_core_news_lg (Honnibal et al. 2020):
| Attribute | Query | Finds |
|---|---|---|
| word | `[word="München"]` | München (exact form) |
| lemma | `[lemma="sein"]` | ist, sind, war, waren, bin … |
| tag | `[tag="VERB"]` | all verbs |
| deprel | `[deprel="nsubj"]` | subjects: ich, es, er, München … |
| morph | `[morph=".*Case=Dat.*"]` | dative: dem, der, mir, dir … |
| ent_type | `[ent_type="PER"]` | persons: Söder, Merkel, Scholz … |
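The attribute queries above can also be assembled programmatically. A minimal sketch, assuming attribute values contain no double quotes; the helper names `cql_token` and `cql_sequence` are hypothetical and not part of the app:

```python
def cql_token(**constraints: str) -> str:
    """Build one CQL token expression, e.g. [lemma="sein"]."""
    if not constraints:
        return "[]"  # CQL for "any token"
    # Several constraints on the same token are joined with "&".
    parts = [f'{attr}="{value}"' for attr, value in constraints.items()]
    return "[" + " & ".join(parts) + "]"


def cql_sequence(*tokens: str) -> str:
    """Join token expressions into a query over adjacent tokens."""
    return " ".join(tokens)


# A dative "Mensch" followed directly by a finite full verb.
query = cql_sequence(
    cql_token(lemma="Mensch", deprel="da"),
    cql_token(tag="VVFIN"),
)
print(query)
```

The resulting string can be pasted into the query field as is.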
27 document attributes:
subreddit, author, date, year
lang: ISO code (de, gsw, bar, en)
lang_group: language family (german, swiss_german, bavarian)
lang_conf: confidence (0–1)
parent_id: parent comment → reply networks
link_id: thread ID → discussion structures
score: upvotes → influence weighting
controversiality, gilded, num_comments
title, post_flair (Politics, Meme, Question)
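How `parent_id` yields a reply network can be sketched as follows. This assumes each comment also carries its own `id` field and that `parent_id` keeps Reddit's type prefixes (`t1_` for comments, `t3_` for submissions); the sample comments are invented for illustration:

```python
from collections import Counter


def reply_edges(comments):
    """Count directed reply edges (replying author -> parent author)."""
    author_by_id = {c["id"]: c["author"] for c in comments}
    edges = Counter()
    for c in comments:
        parent = c["parent_id"]
        # Only comment-to-comment replies form an edge; top-level
        # comments point at the submission ("t3_") and are skipped.
        if not parent.startswith("t1_"):
            continue
        parent_author = author_by_id.get(parent[3:])
        if parent_author is None:
            continue  # parent comment is not in the corpus
        edges[(c["author"], parent_author)] += 1
    return edges


# Invented sample data, not taken from the corpus.
comments = [
    {"id": "c1", "author": "anna", "parent_id": "t3_p1"},
    {"id": "c2", "author": "ben", "parent_id": "t1_c1"},
    {"id": "c3", "author": "anna", "parent_id": "t1_c2"},
    {"id": "c4", "author": "ben", "parent_id": "t1_c1"},
    {"id": "c5", "author": "cem", "parent_id": "t1_zz"},
]
edges = reply_edges(comments)
print(edges)
```

The edge counts can serve directly as weights in a graph library, and `score` could replace the plain count for influence weighting.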




[lemma="Mensch" & deprel="da"] [tag="VVFIN"] in the GeRedE corpus:









Language identification with fastText (Joulin et al. 2017):
| Language | Tokens | % |
|---|---|---|
| German (de) | ~130M | 74.58% |
| English (en) | ~41M | 23.28% |
| (unidentified) | ~3M | 1.98% |
| French (fr) | ~119K | 0.07% |
| Dutch (nl) | ~42K | 0.02% |
| Italian (it) | ~34K | 0.02% |
| Portuguese (pt) | ~29K | 0.02% |
Less than 0.01% for the following languages:
Polish (pl), Luxembourgish (lb), Slovenian (sl), Spanish (es), Slovak (sk), Hungarian (hu)
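The token shares above come from labelling each text and summing its tokens per language. A minimal sketch of that aggregation, assuming whitespace tokenization and a stand-in `identify` function; in practice this would wrap a language identifier such as fastText, whose `predict` returns labels of the form `__label__de`:

```python
from collections import Counter


def token_shares(docs, identify):
    """Return {lang: (token_count, percent)} sorted by token count."""
    counts = Counter()
    for text in docs:
        # Whitespace tokens are an assumption; the corpus uses spaCy.
        counts[identify(text)] += len(text.split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: (n, 100 * n / total) for lang, n in counts.most_common()}


def toy_identify(text):
    """Hypothetical stand-in for a real language identifier."""
    return "en" if "the" in text.lower().split() else "de"


docs = ["das ist gut", "the cat sat on the mat", "ich auch"]
shares = token_shares(docs, toy_identify)
for lang, (n, pct) in shares.items():
    print(f"{lang}: {n} tokens ({pct:.2f}%)")
```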
Problem: fastText labels dialect text as standard German (de):
Bavarian (r/aeiou, r/okoidawappler):
→ fastText: de ✗
Swiss German (r/BUENZLI):
→ fastText: de ✗
Using GlotLID (Kargaran et al. 2023) for fine-grained dialect identification:
| Variety | Tokens | % | Notes |
|---|---|---|---|
| Standard German | ~127M | 72.90% | r/de, r/Finanzen |
| Alemannic (Swiss) | ~2M | 1.10% | r/BUENZLI |
| Bavarian | ~433K | 0.25% | r/okoidawappler, r/aeiou |
| Low German | ~6K | <0.01% | r/hamburg |
Bavarian (r/aeiou, r/okoidawappler):
→ fastText: de ✗ | GlotLID: de_bavarian ✓
Swiss German (r/BUENZLI):
→ fastText: de ✗ | GlotLID: de_alemannic ✓
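One way to combine the two identifiers is a second pass over everything the coarse model labels `de`. This is a sketch under assumptions, not necessarily the pipeline used here: both identifiers are stand-in functions returning `(label, confidence)`, the fine labels are assumed to be already mapped to the ISO codes of the `lang` attribute, and the `min_conf` threshold is hypothetical:

```python
# Mapping from the `lang` attribute to `lang_group`.
LANG_GROUP = {"de": "german", "gsw": "swiss_german", "bar": "bavarian"}


def two_stage_label(text, coarse, fine, min_conf=0.5):
    """Refine a coarse 'de' label with a dialect-aware identifier."""
    lang, conf = coarse(text)
    if lang == "de":
        fine_lang, fine_conf = fine(text)
        # Accept the fine label only if it is a known variety and
        # the identifier is reasonably confident.
        if fine_lang in LANG_GROUP and fine_conf >= min_conf:
            lang, conf = fine_lang, fine_conf
    return {"lang": lang, "lang_group": LANG_GROUP.get(lang), "lang_conf": conf}


def toy_coarse(text):
    """Stand-in for the coarse identifier: everything is 'de'."""
    return ("de", 0.9)


def toy_fine(text):
    """Stand-in for the dialect identifier, keyed on one Bavarian word."""
    return ("bar", 0.8) if "ned" in text.split() else ("de", 0.95)


bavarian = two_stage_label("des woaß i hoid ned", toy_coarse, toy_fine)
standard = two_stage_label("das weiß ich halt nicht", toy_coarse, toy_fine)
print(bavarian)
print(standard)
```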
spaCy de_core_news_lg (Honnibal et al. 2020) — trained on Standard German, struggles with dialects:
Standard German → spaCy:
| Word | Lemma | |
|---|---|---|
| nicht | nicht | ✓ |
| was | was | ✓ |
| halt | halt | ✓ |
Bavarian → spaCy:
| Word | Lemma | Expected | |
|---|---|---|---|
| ned | ned | nicht | ✗ |
| wos | wos | was | ✗ |
| hoid | hoid | halt | ✗ |
Standard German → spaCy:
das weiß ich halt nicht
| Word | Tag | |
|---|---|---|
| das | PDS | ✓ |
| weiß | VVFIN | ✓ |
| ich | PPER | ✓ |
| halt | ADV | ✓ |
| nicht | PTKNEG | ✓ |
Bavarian → spaCy:
des woaß i hoid ned
| Word | Tag | Expected | |
|---|---|---|---|
| des | PDS | PDS | ✓ |
| woaß | NE | VVFIN | ✗ |
| i | NE | PPER | ✗ |
| hoid | NE | ADV | ✗ |
| ned | FM | PTKNEG | ✗ |
Problem: Dialect words tagged as NE (proper noun) or FM (foreign-language material)
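The share of NE and FM tags can serve as a rough warning signal for text the tagger does not understand. A minimal sketch using the two tagged sentences above; treating these two tags as "suspicious" is a heuristic assumed here, not an established metric:

```python
SUSPICIOUS_TAGS = {"NE", "FM"}  # STTS: proper noun, foreign-language material


def suspicious_rate(tagged):
    """Fraction of tokens tagged NE or FM in a list of (word, tag) pairs."""
    if not tagged:
        return 0.0
    hits = sum(1 for _, tag in tagged if tag in SUSPICIOUS_TAGS)
    return hits / len(tagged)


# Tagger output as shown in the two tables above.
standard = [("das", "PDS"), ("weiß", "VVFIN"), ("ich", "PPER"),
            ("halt", "ADV"), ("nicht", "PTKNEG")]
bavarian = [("des", "PDS"), ("woaß", "NE"), ("i", "NE"),
            ("hoid", "NE"), ("ned", "FM")]

print(suspicious_rate(standard))
print(suspicious_rate(bavarian))
```

Genuine proper nouns also count as hits, so the rate is only meaningful in comparison across texts.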
Standard German:
| Word | Head | DepRel | |
|---|---|---|---|
| nicht | möchte | ng (negation) | ✓ |
| ich | möchte | sb (subject) | ✓ |
| begegnen | möchte | oc (inf. compl.) | ✓ |
Bavarian:
| Word | POS | DepRel | Expected | |
|---|---|---|---|---|
| mecht | PROPN | ROOT | AUX | ✗ |
| i | X | punct | PRON/sb | ✗ |
| ned | X | oa (acc.) | ng | ✗ |
Problem: ned parsed as accusative object — tree structure completely broken.


Best models:
| Model | POS | UAS | LAS |
|---|---|---|---|
| UDPipe | 80.29 | 79.60 | 65.79 |
| mBERT | 78.74 | 66.38 | 54.96 |
| GBERT | 74.68 | 62.67 | 50.57 |
Our replication:
| Model | POS | UAS | LAS |
|---|---|---|---|
| UDPipe | 80.25 | 79.77 | 64.78 |
| mBERT | 79.75 | 65.78 | 54.09 |
| GBERT | 73.64 | 61.19 | 48.48 |
| spaCy de_lg | 39.94 | 24.60 | 11.73 |
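For reference, the three scores are token-level percentages: POS is tag accuracy, UAS counts tokens with the correct head, and LAS counts tokens with the correct head and the correct relation label. A minimal sketch with invented toy annotations; the tuple layout `(pos, head, deprel)` is an assumption for illustration:

```python
def evaluate(gold, pred):
    """Compute POS accuracy, UAS and LAS (in percent) for aligned tokens.

    Each token is a (pos, head_index, deprel) tuple.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted tokens must be aligned")
    n = len(gold)
    if n == 0:
        return {"POS": 0.0, "UAS": 0.0, "LAS": 0.0}
    pos = sum(g[0] == p[0] for g, p in zip(gold, pred))
    uas = sum(g[1] == p[1] for g, p in zip(gold, pred))
    # LAS requires both the head and the relation label to match.
    las = sum(g[1] == p[1] and g[2] == p[2] for g, p in zip(gold, pred))
    return {"POS": 100 * pos / n, "UAS": 100 * uas / n, "LAS": 100 * las / n}


# Toy values for a three-token sentence (head 0 = root).
gold = [("AUX", 0, "root"), ("PRON", 1, "nsubj"), ("PART", 1, "advmod")]
pred = [("PROPN", 0, "root"), ("PRON", 1, "nsubj"), ("X", 1, "obj")]
scores = evaluate(gold, pred)
print(scores)
```

Since LAS adds a condition to UAS, it can never exceed it, which matches every row in the tables.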
Using MaChAmp (van der Goot et al. 2021) with Modern mBERT (Marone et al. 2025) and Modern GBERT (Wunderle et al. 2025):
| Model | POS | UAS | LAS |
|---|---|---|---|
| mBERT | 79.75 | 65.78 | 54.09 |
| Modern mBERT | 78.19 | 61.60 | 49.28 |
| GBERT | 73.64 | 61.19 | 48.48 |
| Modern GBERT | 67.66 | 54.70 | 40.29 |
→ Newer transformers seem to perform worse on Bavarian?
Is this a tokenization problem?
| Word | mBERT | ModernGBERT |
|---|---|---|
| wos | wos | wo + ##s |
| gsagt | g + ##sagt | g + ##sa + ##gt |
| kemma | ke + ##mma | ke + ##mm + ##a |
→ German-specific models fragment Bavarian more aggressively
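The fragmentation can be quantified as fertility, the average number of subword pieces per word. A minimal sketch over the three segmentations shown above; with a real tokenizer the pieces would come from its tokenize step instead of being hard-coded:

```python
# Segmentations copied from the table above.
PIECES = {
    "mBERT": {
        "wos": ["wos"],
        "gsagt": ["g", "##sagt"],
        "kemma": ["ke", "##mma"],
    },
    "ModernGBERT": {
        "wos": ["wo", "##s"],
        "gsagt": ["g", "##sa", "##gt"],
        "kemma": ["ke", "##mm", "##a"],
    },
}


def fertility(segmentations):
    """Average number of subword pieces per word."""
    total_pieces = sum(len(pieces) for pieces in segmentations.values())
    return total_pieces / len(segmentations)


for model, segmentations in PIECES.items():
    print(f"{model}: {fertility(segmentations):.2f} pieces per word")
```

Three words are far too few for a reliable estimate; the same function applies unchanged to a larger word list.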
Adding 1k Bavarian tokens (ned, wos, hoid, oiwei, dahoam, gsagt, …)
+ Domain-Adaptive Pre-Training (Gururangan et al. 2020) on the Bavarian Reddit data:
| Model | POS | UAS | LAS |
|---|---|---|---|
| Modern mBERT | 78.19 | 61.60 | 49.28 |
| + DAPT | 80.46 | 65.82 | 54.71 |
| Δ | +2.27 | +4.22 | +5.43 |
| Modern GBERT | 67.66 | 54.70 | 40.29 |
| + DAPT | 71.16 | 57.06 | 44.01 |
| Δ | +3.50 | +2.36 | +3.72 |
→ DAPT improves dialect handling, but the overall results are still underwhelming.
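Why adding Bavarian tokens helps can be illustrated with a toy greedy longest-match tokenizer in the WordPiece style. This is a simplified sketch with an invented mini-vocabulary, not the tokenizer of either model. With Hugging Face models the corresponding steps would be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, after which the new embeddings still have to be trained, for example during DAPT:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first segmentation with '##' continuations."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches at this position
        pieces.append(piece)
        start = end
    return pieces


# Invented mini-vocabulary for illustration.
vocab = {"g", "##sa", "##gt", "ke", "##mm", "##a", "wo", "##s"}
before = wordpiece("gsagt", vocab)

# Vocabulary extension: the dialect word becomes a single token.
extended = vocab | {"gsagt", "wos"}
after = wordpiece("gsagt", extended)

print(before, "->", after)
```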
Even with DAPT, models still struggle with Bavarian:
Lemmatization
| Input | Output | Expected | |
|---|---|---|---|
| koa | koa | kein | ✗ |
| nix | nix | nichts | ✗ |
| oba | oba | aber | ✗ |
| heid | heid | heute | ✗ |
POS Tagging
| Input | Output | Expected | |
|---|---|---|---|
| fei | ADV | PART | ✗ |
| geh | VERB | PART | ✗ |
| scho | ADJ | ADV | ✗ |
| oba | NOUN | CCONJ | ✗ |
Dependencies
| Input | Output | Expected | |
|---|---|---|---|
| i | nsubj | nsubj | ✓ |
| fei | advmod | discourse | ✗ |
| begenen | ROOT | xcomp | ✗ |
What is needed?
Linguistics and Computational Linguistics meet at the level of corpus data.
→ an application interface for collaboration between humans and machines


An interface between people and institutes at LMU.

Social Network Analysis
Subreddit Network