Slovak Reddit (skRed)

Slovak Reddit Corpus — Multilingual Slavic Communities (2008-2024)

Note: Current Version: skRed

Five subreddits, 56M words, 1.67M documents. GlotLID language-family detection supports research on West Slavic content (Slovak, Czech, Polish) and on code-switching into English and other languages. Every document carries full dependency parses and metadata for Social Network Analysis (SNA).

Corpus Overview

skRed (Slovak Reddit) is a corpus of posts and comments from Slovak Reddit communities, designed for Slavic linguistics research and Social Network Analysis. It uses GlotLID v3 from LMU Munich for automatic language detection, enabling studies of:

  • West Slavic content: Slovak, Czech, Polish detection
  • Code-switching: Slavic ↔︎ English alternation patterns
  • Social Network Analysis: Reply chains, thread structure, user networks
  • Temporal patterns: Full timeline 2008-2024
  • Dependency parsing: Full syntactic trees via UDPipe

Key Statistics

Metric              Value
Total Words         56,158,841
Documents           1,668,382
Comments            1,627,791 (97.6%)
Submissions         40,597 (2.4%)
Subreddits          5
Time Period         2008-2024
Language            Slovak (primary)
NLP Model           UDPipe Slovak
Language Detection  GlotLID v3

Corpus Size Roadmap

Size  Subreddits  Tokens  Status
nano  5           927K    Test corpus
mini  5           4.2M    Demo corpus
full  5           67.3M   Available

Subreddits Included

Subreddit     Est. Docs  Description
r/slovakia    ~1.63M     Main Slovak subreddit. News, politics, culture
r/bratislava  ~30K       Capital city community. Urban topics
r/kosice      ~1K        Second largest city. Eastern Slovakia
r/slovensko   ~300       Alternative Slovak community
r/Slovak      ~200       Slovak language learning/discussion

The r/slovakia subreddit dominates (~98%) as the primary Slovak Reddit community. Smaller subreddits are included for Social Network Analysis across communities.

Language Distribution

The corpus uses GlotLID v3 from LMU Munich for language family detection. All content is included regardless of language; users filter via lang_group.

Language Family Statistics

lang_group    Documents  Percentage  Description
west_slavic   1,314,004  78.8%       Slovak, Czech, Polish
germanic      181,604    10.9%       English, German (code-switching)
low_conf      84,023     5.0%        Low confidence detection
other         63,327     3.8%        Other languages
south_slavic  18,233     1.1%        Bulgarian, Serbian, Croatian
romance       5,464      0.3%        French, Spanish, Italian
uralic        1,560      0.1%        Hungarian (neighboring language)
east_slavic   167        <0.1%       Russian, Ukrainian
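Because all content is kept and filtering is left to the user, it can help to know how large a combined subset is before querying it. The following is a minimal sketch, not part of the corpus tooling: it copies the document counts from the table above and computes the share of any combination of lang_group values. The names LANG_GROUP_DOCS, SLAVIC_GROUPS, and share are made up for this illustration.

```python
# Document counts per lang_group, copied from the table above.
LANG_GROUP_DOCS = {
    "west_slavic": 1_314_004,
    "germanic": 181_604,
    "low_conf": 84_023,
    "other": 63_327,
    "south_slavic": 18_233,
    "romance": 5_464,
    "uralic": 1_560,
    "east_slavic": 167,
}

# Hypothetical grouping: every Slavic branch taken together.
SLAVIC_GROUPS = ("west_slavic", "south_slavic", "east_slavic")


def share(groups, counts=LANG_GROUP_DOCS):
    """Return (documents, percentage of all documents) for the given groups."""
    total = sum(counts.values())
    docs = sum(counts[g] for g in groups)
    return docs, 100.0 * docs / total


slavic_docs, slavic_pct = share(SLAVIC_GROUPS)
print(f"All Slavic: {slavic_docs:,} documents ({slavic_pct:.1f}%)")
```

The per-group counts add up to the 1,668,382 documents reported under Key Statistics, so the percentages are taken over the whole corpus.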

Language Queries

Filter by language family:

[lemma="byť"] within

Compare Slavic vs Germanic content:

[word="ja"] within

Find all Slavic content (West + South + East):

[word=".*"] within

Available Attributes

Token Attributes (Positional)

Attribute  Description                        Example
word       Exact word form                    [word="Slovensko"]
lemma      Base form (lemma)                  [lemma="byť"]
tag        POS tag (Universal Dependencies)   [tag="NOUN"]
head_n     Head position (1-indexed, 0=root)  Dependency tree reconstruction
head       Head lemma                         Dependency parsing
deprel     Dependency relation                [deprel="nsubj"]
ent_type   Named entity type                  [ent_type="PER"]
ent_iob    NER IOB tag                        [ent_iob="B"]
morph      Morphological features             [morph=".*Case=Nom.*"]
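The morph attribute bundles several features into one string, so queries on it need a regular expression that matches one feature anywhere in that string. The sketch below assumes the value follows the Universal Dependencies FEATS convention (Feature=Value pairs joined by "|"); that storage format is an assumption, not something this page states. The helper names parse_morph and morph_regex are hypothetical.

```python
import re


def parse_morph(morph):
    """Split a UD-style feature string such as 'Case=Nom|Number=Sing' into a dict."""
    if not morph or morph == "_":
        return {}
    return dict(pair.split("=", 1) for pair in morph.split("|") if "=" in pair)


def morph_regex(feature, value):
    """Build a regex that matches one Feature=Value pair anywhere in the string."""
    return f".*{re.escape(feature)}={re.escape(value)}.*"


# Assumed example value for a neuter nominative singular noun.
features = parse_morph("Case=Nom|Gender=Neut|Number=Sing")
query = f'[morph="{morph_regex("Case", "Nom")}"]'
print(features["Case"], query)
```

The leading and trailing `.*` matter: without them the pattern would only match a value consisting of exactly that one feature.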

Document Attributes (27 total)

Category    Attributes
Basic       id, subreddit, author, date, year, doc_type
SNA         permalink, parent_id, link_id, score
Image       post_hint, is_self, is_image, domain, url
Language    lang, lang_raw, lang_group, lang_conf
Flair       author_flair, post_flair, title
Engagement  edited, controversiality, gilded, over_18, num_comments

Dependency Parsing

The corpus includes full dependency information enabling syntactic tree reconstruction (e.g., for displaCy visualization).

Dependency Tree Format

Field Description
head_n Head token index (1-indexed, 0=root)
deprel Dependency relation label

Example sentence: Slovensko je krásne

Position  word        head_n  deprel   Tree
1         Slovensko   3       nsubj    Slovensko ──nsubj──▶ krásne
2         je          3       cop      je ──cop──▶ krásne
3         krásne      0       ROOT     krásne (ROOT)
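To make the head_n convention concrete, here is a minimal sketch of rebuilding a tree from (word, head_n, deprel) rows. The sentence and its labels are illustrative and were not taken from the corpus; build_tree and render are hypothetical helper names. The only thing the code relies on is what the format table states: positions are 1-indexed and a head_n of 0 marks the root.

```python
from collections import defaultdict

# Illustrative rows (word, head_n, deprel); not taken from the corpus.
SENTENCE = [
    ("Bratislava", 2, "nsubj"),
    ("leží", 0, "ROOT"),
    ("na", 4, "case"),
    ("Dunaji", 2, "obl"),
]


def build_tree(rows):
    """Map each head position to the positions of its dependents (1-indexed)."""
    children = defaultdict(list)
    for position, (_word, head_n, _deprel) in enumerate(rows, start=1):
        children[head_n].append(position)
    return children


def render(rows, children, head=0, depth=0):
    """Return indented lines, one per token, walking the tree from the root."""
    lines = []
    for position in children.get(head, []):
        word, _head_n, deprel = rows[position - 1]
        lines.append("  " * depth + f"{word} ({deprel})")
        lines.extend(render(rows, children, position, depth + 1))
    return lines


tree = build_tree(SENTENCE)
print("\n".join(render(SENTENCE, tree)))
```

The same children mapping can be converted into the word and arc lists that a tree visualizer expects.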

Dependency Queries

Find nominal subjects:

[deprel="nsubj"]

Find direct objects:

[deprel="obj"]

Find copula + subject patterns:

[lemma="byť"] [deprel="nsubj"]

Social Network Analysis (SNA)

The corpus includes fields enabling Social Network Analysis across Reddit communities.

SNA Fields

Field      Description                       Use Case
parent_id  Parent comment/post ID (t1_/t3_)  Reply chain construction
link_id    Thread ID (t3_)                   Thread grouping
author     Reddit username                   User networks
score      Upvote count                      Influence weighting
subreddit  Community name                    Cross-community analysis

Note: t1_ prefix indicates a comment parent, t3_ prefix indicates a submission parent.
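The prefix convention can be handled with a small helper. This is a sketch under the assumption that parent_id and link_id are stored as full Reddit-style identifiers including the prefix; the IDs in the usage lines are invented, and parent_kind and is_top_level are hypothetical names.

```python
def parent_kind(parent_id):
    """Classify an ID by its type prefix and return (kind, bare id)."""
    prefix, _, base = parent_id.partition("_")
    kinds = {"t1": "comment", "t3": "submission"}
    if prefix not in kinds or not base:
        raise ValueError(f"unexpected parent_id: {parent_id!r}")
    return kinds[prefix], base


def is_top_level(parent_id, link_id):
    """A comment is top-level when it replies directly to the thread's submission."""
    return parent_id == link_id


# Invented IDs for illustration only.
print(parent_kind("t1_def456"))
print(is_top_level("t3_abc123", "t3_abc123"))
```

Top-level comments are the points where a reply chain attaches to its submission, so separating them out is usually the first step before walking chains.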

Network Construction

With these fields, you can build:

  1. Reply Networks: parent_id links comments to parents (directed graph)
  2. Thread Networks: link_id groups all comments in a submission
  3. Author Networks: Co-occurrence in threads (undirected graph)
  4. Influence Analysis: score as edge/node weight
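The first three constructions can be sketched with the standard library alone. The records below are invented and only use attribute names from the SNA table; whether the corpus id attribute carries the t1_ prefix is an assumption made here so that parent_id values can be looked up directly. Using score as an edge weight is one possible reading of item 4, not a prescribed method.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical records; field names follow the SNA table, values are invented.
COMMENTS = [
    {"id": "t1_a", "author": "anna", "parent_id": "t3_x", "link_id": "t3_x", "score": 5},
    {"id": "t1_b", "author": "boris", "parent_id": "t1_a", "link_id": "t3_x", "score": 2},
    {"id": "t1_c", "author": "anna", "parent_id": "t1_b", "link_id": "t3_x", "score": 1},
    {"id": "t1_d", "author": "cyril", "parent_id": "t3_y", "link_id": "t3_y", "score": 7},
]


def reply_edges(records):
    """Directed (replier, parent author) edges, weighted by summed reply score."""
    author_of = {r["id"]: r["author"] for r in records}
    edges = Counter()
    for r in records:
        parent_author = author_of.get(r["parent_id"])
        # Skip parents outside the record set (e.g. submissions) and self-replies.
        if parent_author is not None and parent_author != r["author"]:
            edges[(r["author"], parent_author)] += r["score"]
    return edges


def cooccurrence_edges(records):
    """Undirected author pairs, counted once per thread both posted in."""
    authors_by_thread = defaultdict(set)
    for r in records:
        authors_by_thread[r["link_id"]].add(r["author"])
    edges = Counter()
    for authors in authors_by_thread.values():
        for pair in combinations(sorted(authors), 2):
            edges[pair] += 1
    return edges


print(dict(reply_edges(COMMENTS)))
print(dict(cooccurrence_edges(COMMENTS)))
```

Both functions return plain edge-weight mappings, which can be loaded into a graph library of choice. Scores can be negative, so summed weights may need clipping or rescaling before they are used as influence measures.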

Linguistic Annotations

Annotation Statistics

Query                            Hits    Description
[word="ja"]                      32,248  First-person singular pronoun ("I")
[word="je"]                      58,578  Third-person singular "is"
[lemma="byť"]                    33,164  Copula "to be"
[lemma="mať"]                    27,316  "to have"
[deprel="nsubj"]                 32,977  Nominal subjects
[deprel="obj"]                   71,099  Direct objects
[deprel="advmod"]                30,606  Adverbial modifiers
[lemma="byť"] [deprel="nsubj"]   4,384   Copula + subject pattern

Common Universal Dependencies Tags

Tag Meaning
NOUN Nouns
VERB Verbs
ADJ Adjectives
ADV Adverbs
PROPN Proper nouns
ADP Adpositions
PRON Pronouns
DET Determiners

Quick Reference

Goal                  CQL Query
Exact word            [word="Slovensko"]
All forms (lemma)     [lemma="byť"]
Person names          [ent_type="PER"]
Locations             [ent_type="LOC"]
Subjects              [deprel="nsubj"]
Nouns                 [tag="NOUN"]
West Slavic content   within
All Slavic content    within
Main subreddit        within
From 2023             within

Research Applications

Polysemy Analysis: strana

The Slovak word strana demonstrates clear polysemy that can be analyzed with semantic clustering:

Sense            Example             Context
Spatial side     na druhej strane    “on the other side”
Page             čiernobielu stranu  “black-and-white page”
Political party  politická strana    “political party”

Workflow:

  1. Query: [lemma="strana"] (57,717 hits)
  2. Run Semantic analysis with 200 samples
  3. Clustering reveals 2 distinct sense groups (Silhouette: 0.409)
  4. Use LLM classification to label senses automatically
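The silhouette value in step 3 measures how well separated the clusters are: for each sample it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), and averages (b - a) / max(a, b) over all samples. Values near 1 indicate tight, well-separated groups; values near 0 indicate overlap. The sketch below is a plain-Python version of that standard formula on toy 2-D points. It is not the platform's implementation, and the embedding model, distance metric, and clustering algorithm used for the reported 0.409 are not stated on this page.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy stand-ins for sentence embeddings of two senses.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(f"well separated: {good:.3f}, mixed labels: {bad:.3f}")
```

Labelling the same points with a grouping that ignores their geometry gives a much lower score, which is the behaviour a moderate value such as 0.409 should be read against.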

Grammaticalization: Reflexive sa

The reflexive marker sa shows grammaticalization patterns visible through dependency analysis:

[lemma="sa"]

Distribution by dependency relation:

Deprel     Percentage  Function              Example
expl:pv    ~88%        Inherently reflexive  vyjadriť sa (to express oneself)
expl:pass  ~5%         Reflexive passive     sa riešilo (was discussed)
obj        ~1%         True reflexive        myjem sa (I wash myself)

Key finding: Only ~1% are true reflexives; the vast majority are grammaticalized (inherently reflexive verbs or passive constructions). This pattern parallels Russian -ся, Czech se, and Polish się.
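A distribution like the one above can be recomputed from any list of deprel values attached to the hits of a query. This is a minimal sketch: the 20-item sample is made up for illustration and does not reproduce the corpus figures, and deprel_distribution is a hypothetical helper name. Real input would be the deprel column of an exported concordance.

```python
from collections import Counter


def deprel_distribution(deprels):
    """Return {deprel: percentage of hits}, ordered from most to least frequent."""
    counts = Counter(deprels)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.most_common()}


# Made-up sample of 20 hits for [lemma="sa"].
sample = ["expl:pv"] * 17 + ["expl:pass"] * 2 + ["obj"]
for label, pct in deprel_distribution(sample).items():
    print(f"{label:10s} {pct:5.1f}%")
```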

Diachronic Analysis

Track frequency changes over time:

[lemma="strana"] within

Use frequency analysis by year to observe trends. For example, strana increases continuously from 2013 to 2023, with an acceleration around 2019-2020 (political events).
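When comparing years, raw hit counts can mislead, because the amount of text per year is unlikely to be constant across 2008-2024. A common remedy is to normalize to hits per million tokens. The sketch below assumes per-year hit counts and per-year token totals are available from the frequency tool; all numbers are invented and per_million is a hypothetical helper name.

```python
def per_million(hits_by_year, tokens_by_year):
    """Relative frequency (hits per million tokens) for each year with token data."""
    return {
        year: 1_000_000 * hits / tokens_by_year[year]
        for year, hits in sorted(hits_by_year.items())
        if tokens_by_year.get(year)
    }


# Invented counts: raw hits triple, but the yearly subcorpus also doubles.
hits = {2019: 300, 2020: 900}
tokens = {2019: 2_000_000, 2020: 4_000_000}
rates = per_million(hits, tokens)
print(rates)
```

In this toy case the raw count triples while the normalized rate rises by half, so only part of the apparent growth reflects a change in usage.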

Data Sources & References