Munich Corpus Lab

Center for Information and Language Processing (CIS), LMU

Quirin Würschinger, LMU Munich | February 3, 2026
q.wuerschinger@lmu.de

Project Context

Motivation and Needs

Challenges and needs of linguists:

  • Expensive licenses for corpus platforms
  • Using existing corpora outside of platforms/tools
  • Creating corpora: e.g., modern data sources like Reddit, YouTube, and WhatsApp
  • Analyzing corpora with current methods from NLP and LLMs (e.g., dependency parsing, topic modelling, sentiment analysis, embeddings, classification)
  • Research data management: sharing, publishing, and archiving corpus linguistic data

Munich Corpus Lab

Needs and interest also at other LMU institutes:

  • Linguistics: English, German, Slavic, Romance Studies
  • Language Teaching: corpus-based language pedagogy, learner corpora
  • Phonetics: Institute for Phonetics and Speech Processing (IPS)

Possibly also: Center for Information and Language Processing (CIS)

Goal: collaborative, modern, sustainable corpus-linguistic work.

  • Providing and integrating corpora for research and teaching
  • Modern data sources: Reddit, YouTube, web corpora
  • Integration of AI and NLP methods: dependency parsing, topic modelling, semantic analysis with embeddings, LLM classifications
  • A shared application interface for linguists across philologies and methodological disciplines (e.g., CIS and IPS)

First step: developing an app to explore potential.

The Munich Corpus Lab App

https://www.wuerschinger.org/mcl
Username: lipp@lmu.de
Password: Sch3ll!ng1860

Backend: NoSketch Engine

React frontend

NoSketch Engine frontend

Current corpora on the Munich Corpus Lab

  • COHA (Davies 2012): Corpus of Historical American English, 1820–2019, 475M tokens. Genre-balanced: newspapers, magazines, fiction, academic texts.
  • COCA (Davies 2010): Corpus of Contemporary American English, 1990–present, 560M tokens. Genre-balanced: newspapers, magazines, fiction, academic texts, TV/film.
  • Stream (Benker): English YouTube transcripts, three channels (Entertainment, Commentary), ~15M tokens.
  • ICE-Bahamas (Hackert 2010): International Corpus of English – Bahamas component, 50K tokens. Variety research in Caribbean English.
  • UniPlans (Kersten): English strategic plans from 50 universities worldwide, 2020–2024, 383K tokens. Institutional discourse and higher education policy.
  • enRed: English Reddit discussions from regional subreddits (r/Wales, r/Scotland, r/northernireland, r/AskUK, r/AskAnAmerican), 1.47B tokens, 2010–2024. Variety research.
  • ruRed: Russian Reddit discussions (r/Pikabu, r/AskARussian), 215M tokens. Informal online communication.
  • fRed: French Reddit discussions from Quebec, France, and Belgium (r/Quebec, r/rance, r/Belgium2), 272M tokens.
  • skRed: Slovak Reddit discussions from 5 subreddits (r/Slovakia, r/Slovensko, etc.), 67M tokens. UDPipe annotation.
  • GeRedE (FAU): German Reddit discussions from 11 subreddits (Austria, ich_iel, Finanzen, etc.), 51M tokens, 2010–2018. Informal online communication.

A German Dialect Reddit Corpus

Overview

Reddit: social media platform with topical communities (subreddits) (Baumgartner et al. 2020).

174M tokens from 17 subreddits (2008–2024)

Subreddit        Tokens
r/de             ~33M
r/Finanzen       ~27M
r/Austria        ~19M
r/de_EDV         ~16M
r/FragReddit     ~14M
r/Switzerland    ~13M
r/wien           ~12M
r/Munich         ~11M
r/fcbayern       ~8M
r/hamburg        ~6M
r/FinanzenAT     ~6M
r/graz           ~4M
r/BUENZLI        ~2M
r/ich_iel        ~2M
r/aeiou          ~1M
r/okoidawappler  ~1M
r/bavaria        ~1M

Overview of features

Core features known from Sketch Engine:

  • Query Builder
  • Concordances
  • Linguistic attributes
    • Sentence segmentation
    • Lemmatization
    • POS tagging
  • Frequency analysis
  • Collocation analysis

Not included: Word Sketches, N-grams, and Thesaurus

Additional features:

  • Linguistic attributes
    • Dependency Parsing
    • Morphological Parsing
    • Named Entity Recognition
  • Semantic analysis based on embeddings
  • Classification of corpus data with LLMs
  • Social Network Analysis
  • Full data export
  • API for Python and R
  • Documentation for features and corpora

Linguistic Features

NLP annotation with spaCy de_core_news_lg (Honnibal et al. 2020):

Attribute  Query                   Finds
word       [word="München"]        München (exact form)
lemma      [lemma="sein"]          ist, sind, war, waren, bin
tag        [tag="VERB"]            all verbs
deprel     [deprel="nsubj"]        subjects: ich, es, er, München
morph      [morph=".*Case=Dat.*"]  dative: dem, der, mir, dir
ent_type   [ent_type="PER"]        persons: Söder, Merkel, Scholz
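
The attribute queries above can be read as conditions on annotated tokens. The following is a minimal sketch of that idea, not the app's actual query engine: it assumes that each quoted value is a regular expression that must match the whole attribute value, and the names `TOKENS`, `parse_query`, and `find` are hypothetical.

```python
import re

# Toy annotated tokens. Attribute names follow the table above;
# the values are illustrative, not corpus output.
TOKENS = [
    {"word": "München", "lemma": "München", "deprel": "nsubj"},
    {"word": "ist", "lemma": "sein", "deprel": "cop"},
    {"word": "schön", "lemma": "schön", "deprel": "ROOT"},
    {"word": "waren", "lemma": "sein", "deprel": "cop"},
]

# One condition inside the brackets: attribute="regex"
CONDITION_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def parse_query(query):
    """Parse a single-token query like '[lemma="sein" & deprel="cop"]'."""
    query = query.strip()
    if not (query.startswith("[") and query.endswith("]")):
        raise ValueError(f"expected one [...] token query, got {query!r}")
    conditions = {}
    # Simplification: '&' must not occur inside a quoted value.
    for part in query[1:-1].split("&"):
        match = CONDITION_RE.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"cannot parse condition {part!r}")
        attribute, pattern = match.groups()
        conditions[attribute] = re.compile(pattern)
    return conditions


def find(query, tokens):
    """Return the word forms of all tokens that satisfy every condition."""
    conditions = parse_query(query)
    return [
        token["word"]
        for token in tokens
        if all(
            regex.fullmatch(token.get(attribute, "")) is not None
            for attribute, regex in conditions.items()
        )
    ]


print(find('[lemma="sein"]', TOKENS))
print(find('[word="Mün.*"]', TOKENS))
```

Because matching is anchored to the whole value in this sketch, a partial pattern such as `[lemma="sei"]` finds nothing; a trailing `.*` is needed for prefix searches.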

Metadata

27 document attributes:

  • Basic: subreddit, author, date, year
  • Language (GlotLID; Kargaran et al. 2023):
    • lang: ISO code (de, gsw, bar, en)
    • lang_group: language family (german, swiss_german, bavarian)
    • lang_conf: confidence (0–1)
  • Social Network Analysis:
    • parent_id: parent comment → reply networks
    • link_id: thread ID → discussion structures
    • score: upvotes → influence weighting
  • Engagement: controversiality, gilded, num_comments
  • Context: title, post_flair (Politics, Meme, Question)
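
As a rough illustration of how `parent_id` can yield a reply network, the sketch below counts directed author-to-author reply edges. It assumes that `parent_id` carries Reddit's type prefix (`t1_` for a comment, `t3_` for a submission); the comment records and the function name `reply_edges` are invented for the example.

```python
from collections import Counter

# Toy comments. Field names follow the metadata list above; values are made up.
# Assumption: parent_id is "<type prefix>_<id>", with t1 = comment, t3 = submission.
COMMENTS = [
    {"id": "c1", "author": "anna", "parent_id": "t3_p1", "link_id": "t3_p1", "score": 12},
    {"id": "c2", "author": "ben", "parent_id": "t1_c1", "link_id": "t3_p1", "score": 3},
    {"id": "c3", "author": "anna", "parent_id": "t1_c2", "link_id": "t3_p1", "score": 5},
    {"id": "c4", "author": "cara", "parent_id": "t1_c1", "link_id": "t3_p1", "score": 1},
    {"id": "c5", "author": "ben", "parent_id": "t1_c3", "link_id": "t3_p1", "score": 2},
]


def reply_edges(comments):
    """Count directed (replier, replied-to) author pairs."""
    author_of = {c["id"]: c["author"] for c in comments}
    edges = Counter()
    for comment in comments:
        kind, _, parent = comment["parent_id"].partition("_")
        if kind != "t1":
            continue  # top-level comment: it answers the post, not a comment
        target = author_of.get(parent)
        if target is None or target == comment["author"]:
            continue  # parent not in the sample, or a self-reply
        edges[(comment["author"], target)] += 1
    return edges


for (source, target), weight in sorted(reply_edges(COMMENTS).items()):
    print(f"{source} -> {target}: {weight}")
```

Edge weights here are plain reply counts. Weighting by `score`, as suggested in the list above, would replace the `+= 1` with a score-based increment.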

Query Builder

Concordance View

Frequency Analysis

Word forms for [lemma="denken"] in the GeRedE corpus

Frequency of [word="eh"] across communities in the GeRedE corpus

[lemma="virus"] in the Russian Reddit corpus
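
Cross-community comparisons such as the one for [word="eh"] are only meaningful when raw hits are normalized by subcorpus size. A minimal sketch of per-million normalization follows; the subcorpus sizes are the approximate subreddit token counts from the corpus overview, while the hit counts are invented placeholders and not query results.

```python
def per_million(hits, corpus_tokens):
    """Relative frequency: hits per one million tokens."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return hits / corpus_tokens * 1_000_000


# "tokens": approximate sizes from the overview.
# "hits": hypothetical placeholder counts, NOT real query results.
SUBCORPORA = {
    "r/Austria": {"tokens": 19_000_000, "hits": 9_500},
    "r/de": {"tokens": 33_000_000, "hits": 3_300},
    "r/hamburg": {"tokens": 6_000_000, "hits": 300},
}


def relative_frequencies(subcorpora):
    """Map each subcorpus name to its frequency per million tokens."""
    return {
        name: round(per_million(data["hits"], data["tokens"]), 1)
        for name, data in subcorpora.items()
    }


for name, freq in relative_frequencies(SUBCORPORA).items():
    print(f"{name}: {freq} per million tokens")
```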

Dependencies

Named Entities

[ent_type="GPE" & deprel="dobj"] in the UniPlans corpus

Semantic analysis

[lemma="mouse"] in the COCA

LLM classification

Causative alternation for [lemma="break"] in the COCA

[word="Putin"] in the Russian Reddit corpus

Social Network Analysis

Subreddit Network

In the deRed corpus.

User Interaction Network (r/Austria)

In the deRed corpus.

Challenges in the area of Computational Linguistics

Language identification

Languages

Language identification with fastText (Joulin et al. 2017):

Language        Tokens  %
German (de)     ~130M   74.58%
English (en)    ~41M    23.28%
(unidentified)  ~3M     1.98%
French (fr)     ~119K   0.07%
Dutch (nl)      ~42K    0.02%
Italian (it)    ~34K    0.02%
Portuguese (pt) ~29K    0.02%

Less than 0.01% for the following languages:

Polish (pl), Luxembourgish (lb), Slovenian (sl), Spanish (es), Slovak (sk), Hungarian (hu)
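
Token shares like those in the table can be derived from per-document language labels. The sketch below shows one possible aggregation. How unidentified text was actually determined is not stated here, so the confidence threshold `min_conf`, the field name `n_tokens`, and the toy documents are all assumptions.

```python
from collections import defaultdict

# Toy documents; "lang" and "lang_conf" mirror the metadata attributes,
# "n_tokens" is a hypothetical field holding the document length.
DOCS = [
    {"lang": "de", "lang_conf": 0.99, "n_tokens": 60},
    {"lang": "en", "lang_conf": 0.95, "n_tokens": 30},
    {"lang": "de", "lang_conf": 0.30, "n_tokens": 10},
]


def token_shares(docs, min_conf=0.5):
    """Percentage of tokens per language label.

    Documents below min_conf (an assumed threshold) count as unidentified.
    """
    totals = defaultdict(int)
    for doc in docs:
        label = doc["lang"] if doc["lang_conf"] >= min_conf else "(unidentified)"
        totals[label] += doc["n_tokens"]
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: -item[1])
    return {label: round(100 * count / grand_total, 2) for label, count in ranked}


for label, share in token_shares(DOCS).items():
    print(f"{label}: {share}%")
```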

Dialects in the corpus

Problem: fastText labels dialect text as standard German (de):

Bavarian (r/aeiou, r/okoidawappler):

  • wennst as ned saufst kannst a ned mitreden
  • Muaßt hoid a Gambrinus drinkn

→ fastText: de

Swiss German (r/BUENZLI):

  • Wer dä Lappe nöd ehrt isch dä Schnegg nöd wert
  • Weiss eigetlech öpper wie das z’Bärn funktioniert?

→ fastText: de

Using GlotLID (Kargaran et al. 2023) for fine-grained dialect identification:

Variety            Tokens  %       Notes
Standard German    ~127M   72.90%  r/de, r/Finanzen
Alemannic (Swiss)  ~2M     1.10%   r/BUENZLI
Bavarian           ~433K   0.25%   r/okoidawappler, r/aeiou
Low German         ~6K     <0.01%  r/hamburg

For the example sentences above, GlotLID assigns dialect labels:

  • Bavarian (r/aeiou, r/okoidawappler) → fastText: de | GlotLID: de_bavarian
  • Swiss German (r/BUENZLI) → fastText: de | GlotLID: de_alemannic

Lemmatization

spaCy de_core_news_lg (Honnibal et al. 2020) is trained on Standard German and struggles with dialects:

Standard German → spaCy:

Word Lemma
nicht nicht
was was
halt halt

Bavarian → spaCy:

Word Lemma Expected
ned ned nicht
wos wos was
hoid hoid halt

POS tagging

Standard German → spaCy:

das weiß ich halt nicht

Word Tag
das PDS
weiß VVFIN
ich PPER
halt ADV
nicht PTKNEG

Bavarian → spaCy:

des woaß i hoid ned

Word Tag Expected
des PDS PDS
woaß NE VVFIN
i NE PPER
hoid NE ADV
ned FM PTKNEG

Problem: dialect words are tagged as NE (proper noun) or FM (foreign-language material).

Dependency Parsing

Standard German:

Word      Head    DepRel
nicht     möchte  ng (negation)
ich       möchte  sb (subject)
begegnen  möchte  oc (inf. compl.)

Bavarian:

Word   POS    DepRel     Expected
mecht  PROPN  ROOT       AUX
i      X      punct      PRON/sb
ned    X      oa (acc.)  ng

Problem: ned is parsed as an accusative object (oa) and i as punctuation, so the dependency tree is broken.

MaiBaam (Blaschke et al. 2024)

Best models:

Model   POS    UAS    LAS
UDPipe  80.29  79.60  65.79
mBERT   78.74  66.38  54.96
GBERT   74.68  62.67  50.57

Our replication:

Model        POS    UAS    LAS
UDPipe       80.25  79.77  64.78
mBERT        79.75  65.78  54.09
GBERT        73.64  61.19  48.48
spaCy de_lg  39.94  24.60  11.73
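
For readers less familiar with the metrics: UAS is the percentage of tokens with the correct head, and LAS additionally requires the correct dependency label. A minimal sketch follows, assuming gold tokenization (one prediction per gold token); the toy sentence is invented and the function name is hypothetical.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) in percent.

    gold and pred are equally long lists of (head_index, deprel) pairs,
    one pair per token. Gold tokenization is assumed.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    labeled_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100 * head_hits / n, 100 * labeled_hits / n


# Invented four-token example: head index 0 marks the root.
GOLD = [(0, "root"), (1, "nsubj"), (1, "advmod"), (1, "obj")]
PRED = [(0, "root"), (1, "nsubj"), (1, "obj"), (2, "obj")]

uas, las = attachment_scores(GOLD, PRED)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

In the toy example, the third token has the right head but the wrong label, so it counts toward UAS only. This is why LAS can never exceed UAS.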

Have newer transformers become better at this?

Using MaChAmp (van der Goot et al. 2021) with Modern mBERT (Marone et al. 2025) and Modern GBERT (Wunderle et al. 2025):

Model         POS    UAS    LAS
mBERT         79.75  65.78  54.09
Modern mBERT  78.19  61.60  49.28
GBERT         73.64  61.19  48.48
Modern GBERT  67.66  54.70  40.29

→ Newer transformers seem to perform worse on Bavarian?

Is this a tokenization problem?

Word   mBERT       ModernGBERT
wos    wos         wo + ##s
gsagt  g + ##sagt  g + ##sa + ##gt
kemma  ke + ##mma  ke + ##mm + ##a

→ German-specific models fragment Bavarian more aggressively
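
Fragmentation can be quantified as fertility, the average number of subword pieces per word. The sketch below computes it for the three segmentations listed above; three words are far too few for a real estimate, so this only illustrates the calculation.

```python
# Segmentations copied from the table above.
SEGMENTATIONS = {
    "mBERT": {
        "wos": ["wos"],
        "gsagt": ["g", "##sagt"],
        "kemma": ["ke", "##mma"],
    },
    "ModernGBERT": {
        "wos": ["wo", "##s"],
        "gsagt": ["g", "##sa", "##gt"],
        "kemma": ["ke", "##mm", "##a"],
    },
}


def fertility(segmented_words):
    """Average number of subword pieces per word."""
    if not segmented_words:
        raise ValueError("need at least one word")
    pieces = sum(len(parts) for parts in segmented_words.values())
    return pieces / len(segmented_words)


for model, words in SEGMENTATIONS.items():
    print(f"{model}: {fertility(words):.2f} pieces per word")
```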

Can we improve this by adding Bavarian tokens and domain adaptation?

Adding 1k Bavarian tokens (ned, wos, hoid, oiwei, dahoam, gsagt, …)

+ Domain-Adaptive Pre-Training (Gururangan et al. 2020) on the Bavarian Reddit data:

Model         POS    UAS    LAS
Modern mBERT  78.19  61.60  49.28
  + DAPT      80.46  65.82  54.71
  Δ           +2.27  +4.22  +5.43
Modern GBERT  67.66  54.70  40.29
  + DAPT      71.16  57.06  44.01
  Δ           +3.50  +2.36  +3.72

→ DAPT improves dialect handling, but the overall results are still underwhelming.
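
How the roughly 1k added tokens were chosen is not described here. One plausible selection heuristic, shown purely as an assumption, is to take the most frequent dialect word forms that the existing tokenizer splits into several pieces. `select_new_tokens` and `toy_tokenize` are hypothetical names, and the toy tokenizer merely stands in for a real subword tokenizer.

```python
from collections import Counter


def select_new_tokens(words, tokenize, max_new=1000, min_freq=2):
    """Pick frequent word forms that the tokenizer fragments.

    words:    iterable of word forms from the dialect data
    tokenize: callable mapping one word to a list of subword pieces
    """
    counts = Counter(word.lower() for word in words)
    candidates = [
        (word, freq)
        for word, freq in counts.items()
        if freq >= min_freq and len(tokenize(word)) > 1
    ]
    # Most frequent first; ties broken alphabetically for reproducibility.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in candidates[:max_new]]


def toy_tokenize(word, vocab=frozenset({"des", "i", "ich"})):
    """Stand-in tokenizer: known words stay whole, others split in half."""
    if word in vocab or len(word) < 2:
        return [word]
    middle = len(word) // 2
    return [word[:middle], "##" + word[middle:]]


WORDS = "des woaß i hoid ned ned hoid ned wos".split()
print(select_new_tokens(WORDS, toy_tokenize))
```

With a Hugging Face model, such a list would typically be passed to `tokenizer.add_tokens(...)`, followed by `model.resize_token_embeddings(len(tokenizer))` before continued pre-training. Whether that matches the setup used here is not stated.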

Problems

Even with DAPT, models still struggle with Bavarian:

Lemmatization

Input Output Expected
koa koa kein
nix nix nichts
oba oba aber
heid heid heute

POS Tagging

Input Output Expected
fei ADV PART
geh VERB PART
scho ADJ ADV
oba NOUN CCONJ

Dependencies

Input Output Expected
i nsubj nsubj
fei advmod discourse
begenen ROOT xcomp

What is needed?

  • better tokenization?
  • more high quality data?
  • more high quality annotations?

Annotation

Linguistics and Computational Linguistics meet at the level of corpus data.

  • Language models can help linguists with annotation and with tackling linguistic research questions (Weissweiler, Köksal, and Schütze 2024).
  • Linguistic annotations and insights can inform NLP approaches and model development.

→ an application interface for collaboration between humans and machines

An interface between people and institutes at LMU.

Next Steps

Integration of Additional Corpora

  • Project Gutenberg: Multilingual public domain fiction, 70k+ books in 60+ languages. Historical and literary language research.
  • German Reddit Corpus: Expansion to comprehensive coverage (10M+ documents) in collaboration with Prof. Bülow
  • ICE Bahamas: Expansion to comprehensive coverage as part of Prof. Hackert’s DFG project (Hackert 2010)
  • BNC 2014: British National Corpus, spoken language (Love et al. 2017)
  • TV/Movie Corpus (Davies 2021): English film and TV subtitles, 325M words, 1950–2018
  • TokPisin: Tok Pisin corpus in collaboration with Prof. Hackert (Papua New Guinea creole language)
  • YouTube Corpus: Automatic transcription and IPA annotation in collaboration with IPS (WebMAUS: Kisler, Schiel, and Sloetjes (2012))
  • EEBO: Early English Books Online, 765M words, 1470s–1690s (Text Creation Partnership 2015)
  • ARCHER: A Representative Corpus of Historical English Registers, 1600–1999 (Biber, Finegan, and Atkinson 1994)

Features in Development

  • Annotation Module: Annotation of concordances with human-machine collaboration and active learning.
  • RAG: Retrieval-Augmented Generation for Digital Humanities research questions.
  • Topic Modelling: Topic modelling of corpus data with LLMs
  • Social Network Analysis UI: Visualization of reply networks, author interactions
  • Dependency Query Builder: Interactive UI for syntactic queries
  • Semantic Tags: Categorization by semantic fields (Wmatrix: Rayson (2008))
  • Transcription: Automatic transcription for YouTube and audio data (WebMAUS: Kisler, Schiel, and Sloetjes (2012))
  • AI Mode: LLM-copilot for corpus analyses

Conclusion

  • We see strong potential for a joint Corpus Lab serving several linguistics institutes.
  • There is also interest from other institutes, such as Digital Humanities, Phonetics (IPS), and Language Teaching.
  • For sustainable project development, long-term resources will be needed for:
    • Development & maintenance of the app (pre-processing, features, corpora, etc.),
    • infrastructure (server, authentication system, databases, etc.),
    • and services for processing new data and analyses.

Discussion

  • How could the CIS benefit from such a Corpus Lab? Where do you see potential synergies?
  • Are there specific methods or types of linguistic data that would be of particular interest?
  • Ideas for further development and institutional organization?

References

Baumgartner, Jason, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. “The Pushshift Reddit Dataset.” Proceedings of the International AAAI Conference on Web and Social Media 14 (1): 830–39. https://doi.org/10.1609/icwsm.v14i1.7347.
Biber, Douglas, Edward Finegan, and Dwight Atkinson. 1994. “ARCHER and Its Challenges: Compiling and Exploring a Representative Corpus of Historical English Registers.” In Creating and Using English Language Corpora: Papers from the 14th International Conference on English Language Research on Computerized Corpora. Amsterdam: Rodopi.
Blaschke, Verena, Barbara Kovačić, Siyao Peng, Hinrich Schütze, and Barbara Plank. 2024. “MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank.” In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 10921–38. Torino, Italia: ELRA; ICCL. https://aclanthology.org/2024.lrec-main.953/.
Davies, Mark. 2010. “The Corpus of Contemporary American English as the First Reliable Monitor Corpus of English.” Literary and Linguistic Computing 25 (4): 447–64. https://doi.org/10.1093/llc/fqq018.
———. 2012. “Expanding Horizons in Historical Linguistics with the 400-Million Word Corpus of Historical American English.” Corpora 7: 121–57. https://doi.org/10.3366/cor.2012.0024.
———. 2021. “The TV and Movies Corpora: Design, Construction, and Use.” International Journal of Corpus Linguistics 26 (1): 10–37. https://doi.org/10.1075/ijcl.00035.dav.
Goot, Rob van der, Ahmet Üstün, Alan Ramponi, Ibrahim Sharaf, and Barbara Plank. 2021. “Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-Task Learning in NLP.” In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 176–97. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.eacl-demos.22.
Gururangan, Suchin, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8342–60. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.740.
Hackert, Stephanie. 2010. “ICE Bahamas: Why and How?” ICAME Journal 34: 41–53. https://www.ice-corpora.uzh.ch/en/joinice/Teams/iceba.html.
Honnibal, Matthew, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. “spaCy: Industrial-Strength Natural Language Processing in Python.” https://doi.org/10.5281/zenodo.1212303.
Joulin, Armand, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. “Bag of Tricks for Efficient Text Classification.” In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 427–31. Valencia, Spain: Association for Computational Linguistics. https://aclanthology.org/E17-2068/.
Kargaran, Amir Hossein, Ayyoob Imani, François Yvon, and Hinrich Schütze. 2023. “GlotLID: Language Identification for Low-Resource Languages.” In Findings of the Association for Computational Linguistics: EMNLP 2023, 6155–6218. Singapore: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-emnlp.410.
Kisler, Thomas, Florian Schiel, and Han Sloetjes. 2012. “Signal Processing via Web Services: The Use Case WebMAUS.” In Proceedings Digital Humanities 2012, 30–34. Hamburg, Germany.
Love, Robbie, Claire Dembry, Andrew Hardie, Vaclav Brezina, and Tony McEnery. 2017. “The Spoken BNC2014: Designing and Building a Spoken Corpus of Everyday Conversations.” International Journal of Corpus Linguistics 22 (3): 319–44. https://doi.org/10.1075/ijcl.22.3.02lov.
Marone, Marc, Orion Weller, William Fleshman, Eugene Yang, Dawn Lawrie, and Benjamin Van Durme. 2025. “mmBERT: A Modern Multilingual Encoder with Annealed Language Learning.” arXiv Preprint arXiv:2509.06888. https://arxiv.org/abs/2509.06888.
Rayson, Paul. 2008. “From Key Words to Key Semantic Domains.” International Journal of Corpus Linguistics 13 (4): 519–49. https://doi.org/10.1075/ijcl.13.4.06ray.
Text Creation Partnership. 2015. “Early English Books Online – Text Creation Partnership (EEBO-TCP).” Oxford; Ann Arbor, Michigan. https://textcreationpartnership.org/tcp-texts/eebo-tcp-early-english-books-online/.
Weissweiler, Leonie, Abdullatif Köksal, and Hinrich Schütze. 2024. “Hybrid Human-LLM Corpus Construction and LLM Evaluation for Rare Linguistic Phenomena.” arXiv Preprint arXiv:2403.06965. https://arxiv.org/abs/2403.06965.
Wunderle, Julia, Anton Ehrmanntraut, Jan Pfister, Fotis Jannidis, and Andreas Hotho. 2025. “New Encoders for German Trained from Scratch: Comparing ModernGBERT with Converted LLM2Vec Models.” arXiv Preprint arXiv:2505.13136. https://arxiv.org/abs/2505.13136.