Munich Corpus Lab — Documentation

Munich Corpus Lab (MCL) is an interactive platform for querying linguistic corpora using the Corpus Query Language (CQL). This documentation provides guides for linguists and researchers to effectively use the platform.

Presentations

  • 14 Jan 2026: Graduate School Language & Literature, LMU Munich
  • 21 Jan 2026: German Linguistics, LMU Munich
  • 27 Jan 2026: Slavic Linguistics, LMU Munich
  • 3 Feb 2026: Center for Information and Language Processing (CIS), LMU Munich

Available Corpora

Corpus | Description | Words
enRed | English Reddit: Regional Varieties (2010-2024) | ~1.25B
COHA | Corpus of Historical American English (1820-2019) | ~472M
COCA | Corpus of Contemporary American English (1990-2012) | ~414M
fRed | French Reddit corpus (2010-2024) | ~236M
GeRedE | German Reddit corpus (2010-2018) | ~198M
ruRed | Russian Reddit corpus (2013-2024) | ~174M
deRedDial | German Reddit Dialects: 17 subreddits (2010-2024) | ~145M
skRed | Slovak Reddit corpus (2008-2024) | ~56M
deRed | German Reddit with dialect detection (2018-2025) | ~51M
Stream | YouTube streaming transcripts (2017-2025) | ~14M
UniPlans | UK University Strategic Plans (2020-2040) | ~334K
ICE Bahamas | Bahamian newspaper press reports (2002-2007) | ~43K

Features

NoSketch Engine

MCL provides all features of NoSketch Engine, the open-source version of Sketch Engine:

  • Query Builder — Visual CQL query construction
  • Concordances — KWIC display with context
  • Frequency analysis — Word lists, distributions by metadata
  • Collocation analysis — Statistical measures (logDice, MI, T-score)
  • Linguistic attributes — Lemmatization, POS tagging
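
The three collocation measures named above can be computed from four counts: the co-occurrence frequency of node and collocate, the frequency of each word on its own, and the corpus size. The sketch below follows the formulas commonly documented for Sketch Engine; the function names and the counts are made up for illustration, and the platform's own output may differ in details such as the collocation window.

```python
import math

def log_dice(f_xy, f_x, f_y):
    """logDice = 14 + log2(2 * f_xy / (f_x + f_y)); does not depend on corpus size."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

def mi_score(f_xy, f_x, f_y, n):
    """MI = log2(f_xy * N / (f_x * f_y)); tends to favour rare, exclusive pairs."""
    return math.log2(f_xy * n / (f_x * f_y))

def t_score(f_xy, f_x, f_y, n):
    """T-score = (f_xy - f_x * f_y / N) / sqrt(f_xy); tends to favour frequent pairs."""
    expected = f_x * f_y / n  # co-occurrences expected by chance
    return (f_xy - expected) / math.sqrt(f_xy)

# Hypothetical counts: the pair occurs 8 times, each word 16 times,
# in a corpus of 1024 tokens.
f_xy, f_x, f_y, n = 8, 16, 16, 1024
print("logDice:", log_dice(f_xy, f_x, f_y))
print("MI:     ", mi_score(f_xy, f_x, f_y, n))
print("T-score:", round(t_score(f_xy, f_x, f_y, n), 3))
```

Because logDice ignores corpus size, it is usually the safer choice when comparing collocations across corpora as different in size as those listed above.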

Advanced NLP Features

Additional annotations via spaCy:

  • Dependency parsing — Syntactic relations (deprel, head)
  • Morphological analysis — Case, number, tense (morph)
  • Named Entity Recognition — Persons, organizations, locations (ent_type)

AI-Powered Analysis

  • Semantic clustering — Sentence embeddings + UMAP visualization + HDBSCAN clustering for exploring word senses and usage patterns
  • LLM classification — Classify concordance lines with custom variables using GPT-4o
  • Annotation module — Human-machine collaboration for corpus annotation (LID, POS tagging)

Data Export & API

  • Full data export — CSV/XLSX with flexible sampling
  • API access — Programmatic access for Python and R
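
This section does not document the API itself, so the following is only a sketch of what programmatic access from Python might look like. It assumes that the deployment exposes the standard NoSketch Engine (Bonito) JSON endpoint `concordance` with the parameters `corpname`, `q`, `pagesize` and `format`; the base URL and the corpus identifier `uniplans` are placeholders, and authentication is not handled. The code only builds the request URL and sends nothing.

```python
from urllib.parse import urlencode

# Placeholder address: replace with the URL of the actual MCL deployment.
BASE_URL = "https://mcl.example.org/bonito/run.cgi"

def concordance_url(corpus, cql, page_size=20, base_url=BASE_URL):
    """Build a concordance request URL for a CQL query (assumed Bonito parameters)."""
    params = {
        "corpname": corpus,     # corpus identifier (hypothetical here)
        "q": "q" + cql,         # Bonito marks a CQL query with a leading "q"
        "pagesize": page_size,  # number of concordance lines per page
        "format": "json",
    }
    return f"{base_url}/concordance?{urlencode(params)}"

url = concordance_url("uniplans", '[lemma="strategic"] [lemma="plan"]')
print(url)
```

The resulting URL can be passed to any HTTP client once the real host, corpus identifier and login mechanism are known.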

Platform Architecture

MCL consists of two complementary components:

NoSketch Engine (NoSkE)

The NoSketch Engine Crystal UI is the primary interface for corpus queries. It provides:

  • Built-in CQL query builder with visual token construction
  • Concordance views with KWIC (keyword in context) display
  • Collocation analysis with statistical measures (logDice, MI, T-score)
  • Frequency distributions by word form, POS, and metadata
  • Word list and N-gram tools

NoSkE is an open-source corpus manager based on the Sketch Engine architecture, optimized for fast queries on large corpora.

Access: NoSkE Crystal UI

React App

The custom React application extends NoSkE with additional research features:

  • Semantic clustering — sentence embeddings + UMAP visualization + HDBSCAN clustering
  • LLM classification — classify concordance lines with custom variables
  • Enhanced exports — CSV/XLSX with flexible sampling options
  • Unified authentication — single login for both interfaces
  • User management — admin panel for account management

The custom app uses NoSkE as its backend query engine, ensuring consistent results across both interfaces.

Access: Custom App

Why Two Components?

This architecture leverages NoSkE’s battle-tested corpus query capabilities while adding modern NLP features (embeddings, clustering) that require custom implementation. Users can choose the interface that best fits their workflow.

What is CQL?

Corpus Query Language (CQL) is a powerful query language for searching linguistic corpora. Each query searches for tokens (words) matching specified attributes.

Query Examples

Query | Description | Finds
[word="partnership"] | Exact word form | partnership only
[lemma="run"] | All forms of a lemma | run, runs, running, ran
[pos="NN.*"] | Part-of-speech pattern | All nouns (singular, plural, proper)
[lemma="strategic" & pos="JJ"] | Combined attributes | strategic as adjective only
[lemma="strategic"] [lemma="plan"] | Adjacent words | strategic plan, strategic plans
[lemma="international"] []{0,2} [lemma="partnership"] | Words within N tokens | international partnership, international research partnership
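
The queries above are plain strings, so they are easy to generate from a script, for example when the same pattern has to be run for many lemmas. The helper functions below are hypothetical and not part of MCL or NoSketch Engine. They only assemble query text and do not escape quotation marks inside values.

```python
def token(**attrs):
    """Build one CQL token; called without arguments it yields the wildcard []."""
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conditions}]"

def gap(min_tokens, max_tokens):
    """Build a run of arbitrary tokens, e.g. []{0,2}."""
    return f"[]{{{min_tokens},{max_tokens}}}"

def sequence(*parts):
    """Join tokens and gaps into one query; adjacent parts match adjacent tokens."""
    return " ".join(parts)

print(token(lemma="strategic", pos="JJ"))
print(sequence(token(lemma="international"), gap(0, 2), token(lemma="partnership")))
```

The two printed lines reproduce the "Combined attributes" and "Words within N tokens" queries from the table.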

Common Attributes

Attribute | Description | Example | Finds
word | Exact word form | [word="partnerships"] | partnerships only
lemma | Base form (lemma) | [lemma="run"] | run, runs, running, ran
tag | Part-of-speech tag | [tag="VERB"] | All verbs
deprel | Dependency relation | [deprel="nsubj"] | Subjects: I, he, it, the company
morph | Morphological features | [morph=".*Case=Dat.*"] | Dative case: dem, der, mir
ent_type | Named entity type | [ent_type="PER"] | Persons: Biden, Merkel, Scholz
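
The morph example is wrapped in `.*` on both sides because, as in Sketch Engine's CQL, a pattern is assumed here to match the entire attribute value rather than a substring of it. The toy matcher below imitates that behaviour with Python's `re.fullmatch` on three hand-written tokens; the annotations are illustrative and were not produced by the MCL pipeline.

```python
import re

# Hypothetical annotated tokens for the phrase "mit dem Hund".
TOKENS = [
    {"word": "mit", "lemma": "mit", "morph": ""},
    {"word": "dem", "lemma": "der", "morph": "Case=Dat|Gender=Masc|Number=Sing"},
    {"word": "Hund", "lemma": "Hund", "morph": "Case=Dat|Gender=Masc|Number=Sing"},
]

def matches(token, **constraints):
    """True if every pattern matches the whole value of its attribute."""
    return all(
        re.fullmatch(pattern, token.get(attr, "")) is not None
        for attr, pattern in constraints.items()
    )

def find(tokens, **constraints):
    """Return the word forms of all tokens that satisfy the constraints."""
    return [t["word"] for t in tokens if matches(t, **constraints)]

print(find(TOKENS, morph=".*Case=Dat.*"))  # ['dem', 'Hund']
print(find(TOKENS, morph="Case=Dat"))      # [] because the pattern must cover the whole value
print(find(TOKENS, lemma="der"))           # ['dem']
```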

See the UniPlans documentation for detailed query tutorials.
