Baumgartner, Jason, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020.
“The Pushshift Reddit Dataset.” Proceedings of the International AAAI Conference on Web and Social Media 14 (1): 830–39.
https://doi.org/10.1609/icwsm.v14i1.7347.
Biber, Douglas, Edward Finegan, and Dwight Atkinson. 1994. “ARCHER and Its Challenges: Compiling and Exploring a Representative Corpus of Historical English Registers.” In Creating and Using English Language Corpora: Papers from the 14th International Conference on English Language Research on Computerized Corpora. Amsterdam: Rodopi.
Blaschke, Verena, Barbara Kovačić, Siyao Peng, Hinrich Schütze, and Barbara Plank. 2024.
“MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank.” In
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 10921–38. Torino, Italia: ELRA; ICCL.
https://aclanthology.org/2024.lrec-main.953/.
Blombach, Andreas, Natalie Dykes, Philipp Heinrich, Besim Kabashi, and Thomas Proisl. 2020.
“A Corpus of German Reddit Exchanges (GeRedE).” In
Proceedings of the Twelfth Language Resources and Evaluation Conference, 6310–16. Marseille, France: European Language Resources Association.
https://aclanthology.org/2020.lrec-1.774/.
Bruckmaier, Elisabeth, and Stephanie Hackert. 2011. “Bahamian Standard English: A First Approach.” English World-Wide 32: 174–205.
Davies, Mark. 2008.
“The Corpus of Contemporary American English (COCA).” https://www.english-corpora.org/coca/.
———. 2010a.
“The Corpus of Contemporary American English as the First Reliable Monitor Corpus of English.” Literary and Linguistic Computing 25 (4): 447–64.
https://doi.org/10.1093/llc/fqq018.
———. 2010b.
“The Corpus of Historical American English (COHA).” https://www.english-corpora.org/coha/.
———. 2012.
“Expanding Horizons in Historical Linguistics with the 400-Million Word Corpus of Historical American English.” Corpora 7: 121–57.
https://doi.org/10.3366/cor.2012.0024.
———. 2021.
“The TV and Movies Corpora: Design, Construction, and Use.” International Journal of Corpus Linguistics 26 (1): 10–37.
https://doi.org/10.1075/ijcl.00035.dav.
Garside, Roger. 1987. “The CLAWS Word-Tagging System.” In The Computational Analysis of English: A Corpus-Based Approach, edited by Roger Garside, Geoffrey Leech, and Geoffrey Sampson. London: Longman.
Goot, Rob van der, Ahmet Üstün, Alan Ramponi, Ibrahim Sharaf, and Barbara Plank. 2021.
“Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-Task Learning in NLP.” In
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 176–97. Online: Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021.eacl-demos.22.
Gururangan, Suchin, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020.
“Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks.” In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8342–60. Online: Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.acl-main.740.
Hackert, Stephanie. 2004. Urban Bahamian Creole: System and Variation. Amsterdam, Philadelphia: John Benjamins.
———. 2010.
“ICE Bahamas: Why and How?” ICAME Journal 34: 41–53.
https://www.ice-corpora.uzh.ch/en/joinice/Teams/iceba.html.
Hackert, Stephanie, and Magnus Huber. 2007. “Gullah in the Diaspora: Historical and Linguistic Evidence from the Bahamas.” Diachronica 24: 279–325.
Hofmann, Valentin, Pratyusha R. Kalluri, Dan Jurafsky, and Sharese King. 2024.
“AI Generates Covertly Racist Decisions about People Based on Their Dialect.” Nature 633 (8028): 147–54.
https://doi.org/10.1038/s41586-024-07856-5.
Hofmann, Valentin, Hinrich Schütze, and Janet Pierrehumbert. 2022.
“The Reddit Politosphere: A Large-Scale Text and Network Resource of Online Political Discourse.” Proceedings of the International AAAI Conference on Web and Social Media 16 (1): 349–60.
https://doi.org/10.1609/icwsm.v16i1.19377.
Hofmann, Valentin, Leonie Weissweiler, David Mortensen, Hinrich Schütze, and Janet Pierrehumbert. 2025.
“Derivational Morphology Reveals Analogical Generalization in Large Language Models.” Proceedings of the National Academy of Sciences 122 (2).
https://doi.org/10.1073/pnas.2423232122.
Honnibal, Matthew, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020.
“spaCy: Industrial-Strength Natural Language Processing in Python.” https://doi.org/10.5281/zenodo.1212303.
Joulin, Armand, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017.
“Bag of Tricks for Efficient Text Classification.” In
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 427–31. Valencia, Spain: Association for Computational Linguistics.
https://aclanthology.org/E17-2068/.
Kargaran, Amir Hossein, Ayyoob Imani, François Yvon, and Hinrich Schütze. 2023.
“GlotLID: Language Identification for Low-Resource Languages.” In
Findings of the Association for Computational Linguistics: EMNLP 2023, 6155–6218. Singapore: Association for Computational Linguistics.
https://doi.org/10.18653/v1/2023.findings-emnlp.410.
Kisler, Thomas, Florian Schiel, and Han Sloetjes. 2012. “Signal Processing via Web Services: The Use Case WebMAUS.” In Proceedings Digital Humanities 2012, 30–34. Hamburg, Germany.
Love, Robbie, Claire Dembry, Andrew Hardie, Vaclav Brezina, and Tony McEnery. 2017.
“The Spoken BNC2014: Designing and Building a Spoken Corpus of Everyday Conversations.” International Journal of Corpus Linguistics 22 (3): 319–44.
https://doi.org/10.1075/ijcl.22.3.02lov.
Ma, Bolei, Yuting Li, Wei Zhou, Ziwei Gong, Yang Janet Liu, Katja Jasinskaja, Annemarie Friedrich, Julia Hirschberg, Frauke Kreuter, and Barbara Plank. 2025.
“Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges.” In
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 8679–96. Vienna, Austria.
https://doi.org/10.18653/v1/2025.acl-long.425.
Marone, Marc, Orion Weller, William Fleshman, Eugene Yang, Dawn Lawrie, and Benjamin Van Durme. 2025.
“mmBERT: A Modern Multilingual Encoder with Annealed Language Learning.” arXiv Preprint arXiv:2509.06888.
https://arxiv.org/abs/2509.06888.
Mayhew, Stephen, Terra Blevins, Shuheng Liu, Barbara Plank, et al. 2024.
“Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark.” In
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 4322–37.
https://doi.org/10.18653/v1/2024.naacl-long.243.
Oenbring, Raymond A. 2010. “Corpus Linguistic Studies of Standard Bahamian English: A Comparative Study of Newspaper Usage.” The International Journal of Bahamian Studies 16: 51–62.
Plank, Barbara. 2022.
“The ‘Problem’ of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation.” In
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 10671–82.
https://doi.org/10.18653/v1/2022.emnlp-main.731.
Plank, Barbara, Anders Søgaard, and Yoav Goldberg. 2016.
“Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss.” In
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 412–18.
https://doi.org/10.18653/v1/P16-2067.
Ramponi, Alan, and Barbara Plank. 2020.
“Neural Unsupervised Domain Adaptation in NLP—a Survey.” In
Proceedings of the 28th International Conference on Computational Linguistics (COLING), 6838–55.
https://doi.org/10.18653/v1/2020.coling-main.603.
Rayson, Paul. 2008.
“From Key Words to Key Semantic Domains.” International Journal of Corpus Linguistics 13 (4): 519–49.
https://doi.org/10.1075/ijcl.13.4.06ray.
Rychlý, Pavel. 2007. “Manatee/Bonito – a Modular Corpus Manager.” In 1st Workshop on Recent Advances in Slavonic Natural Language Processing (RASLAN), 65–70. Brno, Czech Republic: Masaryk University.
Schmid, Helmut. 1994. “Probabilistic Part-of-Speech Tagging Using Decision Trees.” In Proceedings of the International Conference on New Methods in Language Processing, 44–49. Manchester, UK.
Text Creation Partnership. 2015.
“Early English Books Online – Text Creation Partnership (EEBO-TCP).” Oxford; Ann Arbor, Michigan.
https://textcreationpartnership.org/tcp-texts/eebo-tcp-early-english-books-online/.
Warner, Benjamin, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, et al. 2024.
“Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference.” arXiv Preprint arXiv:2412.13663.
https://arxiv.org/abs/2412.13663.
Weber-Genzel, Leon, Siyao Peng, Marie-Catherine De Marneffe, and Barbara Plank. 2024.
“VariErr NLI: Separating Annotation Error from Human Label Variation.” In
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL).
https://doi.org/10.18653/v1/2024.acl-long.123.
Weissweiler, Leonie, Valentin Hofmann, Abdullatif Köksal, and Hinrich Schütze. 2023.
“Explaining Pretrained Language Models’ Understanding of Linguistic Structures Using Construction Grammar.” Frontiers in Artificial Intelligence 6.
https://doi.org/10.3389/frai.2023.1225791.
Weissweiler, Leonie, Abdullatif Köksal, and Hinrich Schütze. 2024.
“Hybrid Human-LLM Corpus Construction and LLM Evaluation for Rare Linguistic Phenomena.” arXiv Preprint arXiv:2403.06965.
https://arxiv.org/abs/2403.06965.
Wunderle, Julia, Anton Ehrmanntraut, Jan Pfister, Fotis Jannidis, and Andreas Hotho. 2025.
“New Encoders for German Trained from Scratch: Comparing ModernGBERT with Converted LLM2Vec Models.” arXiv Preprint arXiv:2505.13136.
https://arxiv.org/abs/2505.13136.