Slovak Reddit (skRed)
Slovak Reddit Corpus — Multilingual Slavic Communities (2008-2024)
5 subreddits with 56M words and 1.67M documents. Features GlotLID language family detection enabling research on West Slavic content (Slovak, Czech, Polish) alongside code-switching to English and other languages. Full dependency parsing and SNA metadata for social network analysis.
Corpus Overview
skRed (Slovak Reddit) is a corpus of Slovak-language Reddit posts and comments designed for Slavic linguistics research and Social Network Analysis. It features automatic language detection using GlotLID v3 from LMU Munich, enabling studies of:
- West Slavic content: Slovak, Czech, Polish detection
- Code-switching: Slavic ↔︎ English alternation patterns
- Social Network Analysis: Reply chains, thread structure, user networks
- Temporal patterns: Full timeline 2008-2024
- Dependency parsing: Full syntactic trees via UDPipe
Key Statistics
| Metric | Value |
|---|---|
| Total Words | 56,158,841 |
| Documents | 1,668,382 |
| Comments | 1,627,791 (97.6%) |
| Submissions | 40,597 (2.4%) |
| Subreddits | 5 |
| Time Period | 2008-2024 |
| Language | Slovak (primary) |
| NLP Model | UDPipe Slovak |
| Language Detection | GlotLID v3 |
Corpus Size Roadmap
| Size | Subreddits | Tokens | Status |
|---|---|---|---|
| nano | 5 | 927K | Test corpus |
| mini | 5 | 4.2M | Demo corpus |
| full | 5 | 67.3M | Available |
Subreddits Included
| Subreddit | Est. Docs | Description |
|---|---|---|
| r/slovakia | ~1.63M | Main Slovak subreddit. News, politics, culture |
| r/bratislava | ~30K | Capital city community. Urban topics |
| r/kosice | ~1K | Second largest city. Eastern Slovakia |
| r/slovensko | ~300 | Alternative Slovak community |
| r/Slovak | ~200 | Slovak language learning/discussion |
The r/slovakia subreddit dominates (~98%) as the primary Slovak Reddit community. Smaller subreddits are included for Social Network Analysis across communities.
Language Distribution
The corpus uses GlotLID v3 from LMU Munich for language family detection. All content is included regardless of language; users filter via lang_group.
Language Family Statistics
| lang_group | Documents | Percentage | Description |
|---|---|---|---|
| `west_slavic` | 1,314,004 | 78.8% | Slovak, Czech, Polish |
| `germanic` | 181,604 | 10.9% | English, German (code-switching) |
| `low_conf` | 84,023 | 5.0% | Low confidence detection |
| `other` | 63,327 | 3.8% | Other languages |
| `south_slavic` | 18,233 | 1.1% | Bulgarian, Serbian, Croatian |
| `romance` | 5,464 | 0.3% | French, Spanish, Italian |
| `uralic` | 1,560 | 0.1% | Hungarian (neighboring language) |
| `east_slavic` | 167 | <0.1% | Russian, Ukrainian |
Language Queries
The lang_group attribute supports three typical query types:
- Filter by language family
- Compare Slavic vs Germanic content
- Find all Slavic content (West + South + East)
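The query types listed above can be sketched with a small helper that appends a `within` restriction to a CQL query. This is a minimal sketch under assumptions: the structure name `doc` and the helper `within_doc` are hypothetical (the corpus may name its document structure differently), and it assumes structure attribute values are matched as regular expressions, as is common in CQL implementations. The `lang_group` values come from the table above.

```python
def within_doc(query: str, **attrs: str) -> str:
    """Restrict a CQL query to documents whose attributes match the given values."""
    # ASSUMPTION: the document structure is called "doc"; adjust to the
    # structure name the corpus actually uses.
    conditions = " ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"{query} within <doc {conditions} />"


base = '[lemma="byť"]'

# Filter by language family
west = within_doc(base, lang_group="west_slavic")

# Compare Slavic vs Germanic content: run the same query under each restriction
comparison = {
    group: within_doc(base, lang_group=group)
    for group in ("west_slavic", "germanic")
}

# All Slavic content: one regular expression covering the three Slavic groups
all_slavic = within_doc(base, lang_group="(west|south|east)_slavic")

print(west)
print(all_slavic)
```

Running the same base query under each restriction and comparing hit counts (ideally normalized by the size of each group) gives the Slavic vs Germanic comparison.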
Available Attributes
Token Attributes (Positional)
| Attribute | Description | Example |
|---|---|---|
| `word` | Exact word form | `[word="Slovensko"]` |
| `lemma` | Base form (lemma) | `[lemma="byť"]` |
| `tag` | POS tag (Universal Dependencies) | `[tag="NOUN"]` |
| `head_n` | Head position (1-indexed, 0=root) | Dependency tree reconstruction |
| `head` | Head lemma | Dependency parsing |
| `deprel` | Dependency relation | `[deprel="nsubj"]` |
| `ent_type` | Named entity type | `[ent_type="PER"]` |
| `ent_iob` | NER IOB tag | `[ent_iob="B"]` |
| `morph` | Morphological features | `[morph=".*Case=Nom.*"]` |
Document Attributes (27 total)
| Category | Attributes |
|---|---|
| Basic | id, subreddit, author, date, year, doc_type |
| SNA | permalink, parent_id, link_id, score |
| Image | post_hint, is_self, is_image, domain, url |
| Language | lang, lang_raw, lang_group, lang_conf |
| Flair | author_flair, post_flair, title |
| Engagement | edited, controversiality, gilded, over_18, num_comments |
Dependency Parsing
The corpus includes full dependency information enabling syntactic tree reconstruction (e.g., for displaCy visualization).
Dependency Tree Format
| Field | Description |
|---|---|
| `head_n` | Head token index (1-indexed, 0=root) |
| `deprel` | Dependency relation label |
Example sentence: Slovensko je krásne

| Position | word | head_n | deprel | Tree |
|---|---|---|---|---|
| 1 | Slovensko | 2 | nsubj | Slovensko ──nsubj──▶ je |
| 2 | je | 0 | ROOT | je (ROOT) |
| 3 | krásne | 2 | acomp | krásne ──acomp──▶ je |
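The example sentence above can be turned back into a tree using only `head_n` and `deprel`. The following is a minimal sketch, assuming the token rows have already been exported as `(position, word, head_n, deprel)` tuples; the function names `build_tree` and `render` are illustrative and not part of any corpus tooling.

```python
from collections import defaultdict

# (position, word, head_n, deprel) rows copied from the example sentence
tokens = [
    (1, "Slovensko", 2, "nsubj"),
    (2, "je", 0, "ROOT"),
    (3, "krásne", 2, "acomp"),
]


def build_tree(rows):
    """Return (root_position, children), where children maps a head
    position to a list of (dependent_position, deprel) pairs."""
    children = defaultdict(list)
    root = None
    for position, _word, head_n, deprel in rows:
        if head_n == 0:  # head_n == 0 marks the root token
            root = position
        else:
            children[head_n].append((position, deprel))
    return root, dict(children)


def render(rows, root, children):
    """Return the tree as indented lines, one token per line."""
    words = {position: word for position, word, _head, _rel in rows}
    lines = []

    def visit(position, label, depth):
        lines.append("  " * depth + f"{words[position]} ({label})")
        for child, deprel in children.get(position, []):
            visit(child, deprel, depth + 1)

    visit(root, "ROOT", 0)
    return lines


root, children = build_tree(tokens)
tree_lines = render(tokens, root, children)
print("\n".join(tree_lines))
```

The same `(head, dependent, label)` triples are what arc-based visualizers need, so the `children` mapping is a convenient intermediate form.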
Dependency Queries
Find nominal subjects: `[deprel="nsubj"]`
Find direct objects: `[deprel="obj"]`
Find copula + subject patterns: `[lemma="byť"] [deprel="nsubj"]`
Linguistic Annotations
Annotation Statistics
| Query | Hits | Description |
|---|---|---|
| [word="ja"] | 32,248 | First person pronoun |
| [word="je"] | 58,578 | Third person "is" |
| [lemma="byť"] | 33,164 | Copula "to be" |
| [lemma="mať"] | 27,316 | "to have" |
| [deprel="nsubj"] | 32,977 | Nominal subjects |
| [deprel="obj"] | 71,099 | Direct objects |
| [deprel="advmod"] | 30,606 | Adverbial modifiers |
| [lemma="byť"] [deprel="nsubj"] | 4,384 | Copula + subject pattern |
Quick Reference
| Goal | CQL Query |
|---|---|
| Exact word | [word="Slovensko"] |
| All forms (lemma) | [lemma="byť"] |
| Person names | [ent_type="PER"] |
| Locations | [ent_type="LOC"] |
| Subjects | [deprel="nsubj"] |
| Nouns | [tag="NOUN"] |
| West Slavic content | within |
| All Slavic content | within |
| Main subreddit | within |
| From 2023 | within |
Research Applications
Polysemy Analysis: strana
The Slovak word strana demonstrates clear polysemy that can be analyzed with semantic clustering:
| Sense | Example | Translation |
|---|---|---|
| Spatial side | na druhej strane | “on the other side” |
| Page | čiernobielu stranu | “black-and-white page” |
| Political party | politická strana | “political party” |
Workflow:
- Query: `[lemma="strana"]` (57,717 hits)
- Run Semantic analysis with 200 samples
- Clustering reveals 2 distinct sense groups (Silhouette: 0.409)
- Use LLM classification to label senses automatically
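The silhouette value reported in the workflow measures how well separated the clusters are, on a scale from -1 to 1. The sketch below implements the standard silhouette definition in plain Python to show what the number means. It is an illustration only: the source does not say which distance metric or embedding the Semantic analysis uses, so Euclidean distance and the toy 2-D points here are assumptions.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance of the point to itself is 0, hence len(own) - 1)
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy points, NOT corpus embeddings: two tight groups far apart
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
print(round(score, 3))
```

Values near 1 indicate compact, well-separated clusters; a value around 0.4, as reported for strana, suggests sense groups that are distinguishable but overlap.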
Grammaticalization: Reflexive sa
The reflexive marker sa shows grammaticalization patterns visible through dependency analysis:
Distribution by dependency relation:
| Deprel | Percentage | Function | Example |
|---|---|---|---|
| `expl:pv` | ~88% | Inherently reflexive | vyjadriť sa (to express oneself) |
| `expl:pass` | ~5% | Reflexive passive | sa riešilo (was discussed) |
| `obj` | ~1% | True reflexive | myjem sa (I wash myself) |
Key finding: Only ~1% are true reflexives; the vast majority are grammaticalized (inherently reflexive verbs or passive constructions). This pattern parallels Russian -ся, Czech se, and Polish się.
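A distribution like the one above can be computed from any export that lists the `deprel` value of each sa token. This is a minimal sketch, assuming such a list is already available; the sample values are made up for illustration and are not corpus counts.

```python
from collections import Counter


def deprel_distribution(deprels):
    """Map each dependency relation to its percentage share, most frequent first."""
    counts = Counter(deprels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {deprel: 100 * n / total for deprel, n in counts.most_common()}


# Made-up toy sample, NOT corpus data
sample = ["expl:pv"] * 7 + ["expl:pass"] * 2 + ["obj"]
shares = deprel_distribution(sample)
for deprel, share in shares.items():
    print(f"{deprel}: {share:.1f}%")
```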
Diachronic Analysis
Track frequency changes over time using frequency analysis by year. For example, strana shows a continuous increase from 2013 to 2023, with acceleration around 2019-2020 (political events).
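Because the amount of text differs from year to year, raw hit counts are easier to compare once normalized by yearly corpus size. The sketch below shows one common normalization, hits per million tokens. The function name and all numbers are placeholders for illustration, not corpus counts.

```python
def per_million(hits_by_year, tokens_by_year):
    """Relative frequency (hits per million tokens) for each year with a known, non-zero size."""
    return {
        year: 1_000_000 * hits / tokens_by_year[year]
        for year, hits in sorted(hits_by_year.items())
        if tokens_by_year.get(year)
    }


# Placeholder numbers for illustration only
hits = {2019: 50, 2020: 120}
sizes = {2019: 1_000_000, 2020: 2_000_000}
relative = per_million(hits, sizes)
print(relative)
```

In this toy example the raw count more than doubles between the two years, but the relative frequency rises only from 50 to 60 per million, because the second year contains twice as much text.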
Data Sources & References
- Academic Torrents: Reddit archive dumps
- Arctic Shift API: Supplementary data retrieval
- GlotLID v3: cis-lmu/glotlid — LMU Munich language detection
- UDPipe Slovak: ufal/udpipe — Slovak NLP model
- Universal Dependencies: universaldependencies.org/sk
Social Network Analysis (SNA)
The corpus includes fields enabling Social Network Analysis across Reddit communities.
SNA Fields
- parent_id
- link_id
- author
- score
- subreddit

Note: the t1_ prefix indicates a comment parent; the t3_ prefix indicates a submission parent.
Network Construction
With these fields, you can build:
- parent_id links comments to parents (directed graph)
- link_id groups all comments in a submission
- score as edge/node weight
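The network construction described above can be sketched with the standard library alone. This is a minimal sketch under assumptions: the records are hypothetical dictionaries using the corpus attribute names, and each record's `id` is assumed to be the bare comment id, so that a `parent_id` of `t1_<id>` points at the comment with that id. The function name `build_reply_graph` is illustrative.

```python
from collections import defaultdict


def build_reply_graph(comments):
    """Return (user_edges, threads) from comment records.

    user_edges maps (replier, parent_author) to the number of replies.
    threads maps link_id to the list of comment ids in that submission.
    """
    # ASSUMPTION: "id" holds the bare comment id without the t1_ prefix.
    author_of = {c["id"]: c["author"] for c in comments}
    user_edges = defaultdict(int)
    threads = defaultdict(list)
    for c in comments:
        threads[c["link_id"]].append(c["id"])
        kind, _, parent = c["parent_id"].partition("_")
        # t1_ parents are comments: add a directed user-to-user edge.
        # t3_ parents are submissions: top-level comments add no edge here.
        if kind == "t1" and parent in author_of:
            user_edges[(c["author"], author_of[parent])] += 1
    return dict(user_edges), dict(threads)


# Hypothetical records for illustration
comments = [
    {"id": "c1", "author": "anna", "parent_id": "t3_s1", "link_id": "t3_s1", "score": 5},
    {"id": "c2", "author": "boris", "parent_id": "t1_c1", "link_id": "t3_s1", "score": 2},
    {"id": "c3", "author": "anna", "parent_id": "t1_c2", "link_id": "t3_s1", "score": 1},
]
edges, threads = build_reply_graph(comments)
print(edges)
print(threads)
```

Edges are weighted by reply count here; summing `score` instead of adding 1 would give the score-weighted variant mentioned in the list.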