Slovak Reddit (skRed)

Slovak Reddit Corpus — Multilingual Slavic Communities (2008-2024)

Current Version: skRed

The corpus covers 5 subreddits with 56M words and 1.67M documents. GlotLID language family detection supports research on West Slavic content (Slovak, Czech, Polish) as well as code-switching to English and other languages. Every document carries full dependency parsing and metadata for social network analysis (SNA).

Corpus Overview

skRed (Slovak Reddit) is a corpus of Slovak-language Reddit posts and comments designed for Slavic linguistics research and Social Network Analysis. It features automatic language detection using GlotLID v3 from LMU Munich, enabling studies of:

West Slavic content: Slovak, Czech, Polish detection
Code-switching: Slavic ↔︎ English alternation patterns
Social Network Analysis: Reply chains, thread structure, user networks
Temporal patterns: Full timeline 2008-2024
Dependency parsing: Full syntactic trees via UDPipe

Key Statistics

Metric	Value
Total Words	56,158,841
Documents	1,668,382
Comments	1,627,791 (97.6%)
Submissions	40,597 (2.4%)
Subreddits	5
Time Period	2008-2024
Language	Slovak (primary)
NLP Model	UDPipe Slovak
Language Detection	GlotLID v3

Corpus Size Roadmap

Size	Subreddits	Tokens	Status
nano	5	927K	Test corpus
mini	5	4.2M	Demo corpus
full	5	67.3M	Available

Subreddits Included

Subreddit	Est. Docs	Description
r/slovakia	~1.63M	Main Slovak subreddit. News, politics, culture
r/bratislava	~30K	Capital city community. Urban topics
r/kosice	~1K	Second largest city. Eastern Slovakia
r/slovensko	~300	Alternative Slovak community
r/Slovak	~200	Slovak language learning/discussion

The r/slovakia subreddit accounts for ~98% of all documents, as it is the primary Slovak Reddit community. The smaller subreddits are included to support Social Network Analysis across communities.

Language Distribution

The corpus uses GlotLID v3 from LMU Munich for language family detection. All content is included regardless of language; users filter via lang_group.

Language Family Statistics

`lang_group`	Documents	Percentage	Description
`west_slavic`	1,314,004	78.8%	Slovak, Czech, Polish
`germanic`	181,604	10.9%	English, German (code-switching)
`low_conf`	84,023	5.0%	Low confidence detection
`other`	63,327	3.8%	Other languages
`south_slavic`	18,233	1.1%	Bulgarian, Serbian, Croatian
`romance`	5,464	0.3%	French, Spanish, Italian
`uralic`	1,560	0.1%	Hungarian (neighboring language)
`east_slavic`	167	<0.1%	Russian, Ukrainian
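
The percentages in this table follow directly from the document counts. As a minimal sketch in Python, using only the figures from the table (the dictionary and helper names are illustrative, not part of the corpus tooling), each share and the combined share of the three Slavic families can be recomputed like this:

```python
# Document counts per lang_group, copied from the table above.
LANG_GROUP_DOCS = {
    "west_slavic": 1_314_004,
    "germanic": 181_604,
    "low_conf": 84_023,
    "other": 63_327,
    "south_slavic": 18_233,
    "romance": 5_464,
    "uralic": 1_560,
    "east_slavic": 167,
}


def share(groups, counts=LANG_GROUP_DOCS):
    """Return the percentage of all documents that fall in the given groups."""
    total = sum(counts.values())
    return 100 * sum(counts[g] for g in groups) / total


total_docs = sum(LANG_GROUP_DOCS.values())
slavic_share = share(["west_slavic", "south_slavic", "east_slavic"])

print(total_docs)              # 1668382, matching the Documents figure
print(round(slavic_share, 1))  # 79.9
```

The group counts add up exactly to the 1,668,382 documents listed under Key Statistics, so every document belongs to exactly one `lang_group`.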

Language Queries

Filter by language family:

[lemma="byť"] within

Compare Slavic vs Germanic content:

[word="ja"] within

Find all Slavic content (West + South + East):

[word=".*"] within

Available Attributes

Token Attributes (Positional)

Attribute	Description	Example
`word`	Exact word form	[word="Slovensko"]
`lemma`	Base form (lemma)	[lemma="byť"]
`tag`	POS tag (Universal Dependencies)	[tag="NOUN"]
`head_n`	Head position (1-indexed, 0=root)	Dependency tree reconstruction
`head`	Head lemma	Dependency parsing
`deprel`	Dependency relation	[deprel="nsubj"]
`ent_type`	Named entity type	[ent_type="PER"]
`ent_iob`	NER IOB tag	[ent_iob="B"]
`morph`	Morphological features	[morph=".*Case=Nom.*"]
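
When working with exported tokens outside the query interface, the `morph` value usually needs to be split into individual features. The sketch below assumes that `morph` follows the Universal Dependencies FEATS convention (`Feature=Value` pairs joined by `|`); this page does not state the separator, so treat that as an assumption. The function name `parse_morph` is illustrative.

```python
def parse_morph(morph: str) -> dict[str, str]:
    """Split a FEATS-style string such as 'Case=Nom|Number=Sing' into a dict.

    Assumes UD-style 'Feature=Value' pairs joined by '|'.
    """
    if not morph or morph == "_":  # "_" marks an empty feature set in CoNLL-U
        return {}
    return dict(pair.split("=", 1) for pair in morph.split("|"))


feats = parse_morph("Case=Nom|Gender=Neut|Number=Sing")
print(feats["Case"])  # Nom
```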

Document Attributes (27 total)

Category	Attributes
Basic	id, subreddit, author, date, year, doc_type
SNA	permalink, parent_id, link_id, score
Image	post_hint, is_self, is_image, domain, url
Language	lang, lang_raw, lang_group, lang_conf
Flair	author_flair, post_flair, title
Engagement	edited, controversiality, gilded, over_18, num_comments

Dependency Parsing

The corpus includes full dependency information enabling syntactic tree reconstruction (e.g., for displaCy visualization).

Dependency Tree Format

Field	Description
`head_n`	Head token index (1-indexed, 0=root)
`deprel`	Dependency relation label

Example sentence: Slovensko je krásne

Position  word        head_n  deprel   Tree
1         Slovensko   2       nsubj    Slovensko ──nsubj──▶ je
2         je          0       ROOT     je (ROOT)
3         krásne      2       acomp    krásne ──acomp──▶ je
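
As a sketch of how a tree can be rebuilt from these two fields, the Python snippet below takes the rows of the example sentence and derives the root and the dependents of each head. The tuple layout and function names are illustrative; only `head_n` and `deprel` come from the corpus.

```python
from collections import defaultdict

# (position, word, head_n, deprel) rows from the example sentence above.
TOKENS = [
    (1, "Slovensko", 2, "nsubj"),
    (2, "je", 0, "ROOT"),
    (3, "krásne", 2, "acomp"),
]


def build_tree(tokens):
    """Return (root_position, children), where children maps head -> dependents."""
    children = defaultdict(list)
    root = None
    for pos, _word, head_n, deprel in tokens:
        if head_n == 0:  # head_n == 0 marks the sentence root
            root = pos
        else:
            children[head_n].append((pos, deprel))
    return root, dict(children)


def arcs(tokens):
    """List (dependent, deprel, head) word triples for every non-root token."""
    words = {pos: word for pos, word, _head, _rel in tokens}
    return [(w, rel, words[h]) for _pos, w, h, rel in tokens if h != 0]


root, children = build_tree(TOKENS)
print(root)         # 2
print(children[2])  # [(1, 'nsubj'), (3, 'acomp')]
print(arcs(TOKENS))
```

Because `head_n` is 1-indexed, it can be used directly as a key into a position-to-word mapping, and the arc list is the kind of input a tree visualizer needs.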

Dependency Queries

Find nominal subjects:

[deprel="nsubj"]

Find direct objects:

[deprel="obj"]

Find copula + subject patterns:

[lemma="byť"] [deprel="nsubj"]

Social Network Analysis (SNA)

The corpus includes fields enabling Social Network Analysis across Reddit communities.

SNA Fields

Field	Description	Use Case
`parent_id`	Parent comment/post ID (t1_/t3_)	Reply chain construction
`link_id`	Thread ID (t3_)	Thread grouping
`author`	Reddit username	User networks
`score`	Net score (upvotes minus downvotes)	Influence weighting
`subreddit`	Community name	Cross-community analysis

Note: t1_ prefix indicates a comment parent, t3_ prefix indicates a submission parent.

Network Construction

With these fields, you can build:

Reply Networks: parent_id links comments to parents (directed graph)
Thread Networks: link_id groups all comments in a submission
Author Networks: Co-occurrence in threads (undirected graph)
Influence Analysis: score as edge/node weight
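
The constructions listed above can be sketched with the Python standard library alone. The records below are invented for illustration; only the field names come from the SNA table. The sketch also assumes that `id` carries the same `t1_`/`t3_` prefix as `parent_id`, so the two can be matched directly. If the corpus stores bare IDs, the prefix would need to be stripped first.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; field names follow the SNA table, values are made up.
DOCS = [
    {"id": "t3_a1", "author": "anna", "parent_id": None, "link_id": "t3_a1", "score": 10},
    {"id": "t1_c1", "author": "boris", "parent_id": "t3_a1", "link_id": "t3_a1", "score": 4},
    {"id": "t1_c2", "author": "cyril", "parent_id": "t1_c1", "link_id": "t3_a1", "score": 2},
]


def parent_kind(parent_id):
    """Classify a parent by its prefix: t1_ = comment, t3_ = submission."""
    prefix = parent_id.split("_", 1)[0]
    return {"t1": "comment", "t3": "submission"}.get(prefix, "unknown")


def reply_edges(docs):
    """Directed (replier, parent author) edges, weighted by summed reply score."""
    author_of = {d["id"]: d["author"] for d in docs}
    edges = defaultdict(int)
    for d in docs:
        parent = d["parent_id"]
        if parent in author_of:  # skips submissions and parents missing from the data
            edges[(d["author"], author_of[parent])] += d["score"]
    return dict(edges)


def cooccurrence_edges(docs):
    """Undirected edges between authors who appear in the same thread."""
    authors_by_thread = defaultdict(set)
    for d in docs:
        authors_by_thread[d["link_id"]].add(d["author"])
    edges = defaultdict(int)
    for authors in authors_by_thread.values():
        for a, b in combinations(sorted(authors), 2):
            edges[(a, b)] += 1  # counts the number of shared threads
    return dict(edges)


print(reply_edges(DOCS))
print(cooccurrence_edges(DOCS))
```

Using the reply's score as the edge weight is one possible choice; counting replies or weighting nodes instead of edges would work with the same fields.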

Linguistic Annotations

Annotation Statistics

Query	Hits	Description
[word="ja"]	32,248	First-person pronoun "I"
[word="je"]	58,578	Third-person singular "is"
[lemma="byť"]	33,164	Copula "to be"
[lemma="mať"]	27,316	Verb "to have"
[deprel="nsubj"]	32,977	Nominal subjects
[deprel="obj"]	71,099	Direct objects
[deprel="advmod"]	30,606	Adverbial modifiers
[lemma="byť"] [deprel="nsubj"]	4,384	Copula + subject pattern

Common Universal Dependencies Tags

Tag	Meaning
`NOUN`	Nouns
`VERB`	Verbs
`ADJ`	Adjectives
`ADV`	Adverbs
`PROPN`	Proper nouns
`ADP`	Adpositions
`PRON`	Pronouns
`DET`	Determiners

Quick Reference

Goal	CQL Query
Exact word	[word="Slovensko"]
All forms (lemma)	[lemma="byť"]
Person names	[ent_type="PER"]
Locations	[ent_type="LOC"]
Subjects	[deprel="nsubj"]
Nouns	[tag="NOUN"]
West Slavic content	within
All Slavic content	within
Main subreddit	within
From 2023	within

Research Applications

Polysemy Analysis: strana

The Slovak word strana demonstrates clear polysemy that can be analyzed with semantic clustering:

Sense	Example	Context
Spatial side	na druhej strane	“on the other side”
Page	čiernobielu stranu	“black-and-white page”
Political party	politická strana	“political party”

Workflow:

Query: [lemma="strana"] (57,717 hits)
Run Semantic analysis with 200 samples
Clustering reveals 2 distinct sense groups (Silhouette: 0.409)
Use LLM classification to label senses automatically
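
For readers unfamiliar with the silhouette value reported in the clustering step, the following Python sketch computes the mean silhouette coefficient for a labelled set of points. It is a toy illustration under stated assumptions: the corpus tool presumably clusters contextual embeddings of the sampled hits, whereas this example uses invented 2D points and Euclidean distance, and it expects at least two clusters.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Invented toy data: two tight groups that are far apart.
POINTS = [(0, 0), (0, 1), (5, 5), (5, 6)]
good = silhouette_score(POINTS, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(POINTS, [0, 1, 0, 1])   # labels cut across the groups
print(good > bad)  # True
```

Values close to 1 indicate well-separated clusters and values near 0 indicate overlapping ones, so a score of 0.409 suggests moderately distinct sense groups.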

Grammaticalization: Reflexive sa

The reflexive marker sa shows grammaticalization patterns visible through dependency analysis:

[lemma="sa"]

Distribution by dependency relation:

Deprel	Percentage	Function	Example
`expl:pv`	~88%	Inherently reflexive	vyjadriť sa (to express oneself)
`expl:pass`	~5%	Reflexive passive	sa riešilo (was discussed)
`obj`	~1%	True reflexive	myjem sa (I wash myself)

Key finding: Only ~1% are true reflexives; the vast majority are grammaticalized (inherently reflexive verbs or passive constructions). This pattern parallels Russian -ся, Czech se, and Polish się.
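
A distribution like the one in the table can be derived from the `deprel` values of the query hits. The Python sketch below shows the counting step on a synthetic sample built to mirror the rounded percentages above. The sample is not corpus data, and the `other` label is a placeholder for the remaining relations that the table does not break down.

```python
from collections import Counter


def deprel_distribution(deprels):
    """Return {deprel: percentage}, ordered from most to least frequent."""
    counts = Counter(deprels)
    total = sum(counts.values())
    return {rel: round(100 * n / total, 1) for rel, n in counts.most_common()}


# Synthetic sample of 100 deprel values for hits of [lemma="sa"].
sample = ["expl:pv"] * 88 + ["expl:pass"] * 5 + ["obj"] * 1 + ["other"] * 6

distribution = deprel_distribution(sample)
print(distribution["expl:pv"])  # 88.0
print(distribution["obj"])      # 1.0
```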

Diachronic Analysis

Track frequency changes over time:

[lemma="strana"] within

Use frequency analysis by year to observe trends. For example, strana shows a continuous increase from 2013 to 2023, with an acceleration around 2019-2020 that coincides with political events.
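
Since the amount of text per year is unlikely to be constant over the corpus timeline, yearly hit counts are easier to compare once they are normalized by the size of each year's subcorpus. The Python sketch below computes hits per million tokens. All numbers in it are invented placeholders, and the function name is illustrative.

```python
def per_million(hits_by_year, tokens_by_year):
    """Return {year: hits per million tokens} for years with a non-empty subcorpus."""
    return {
        year: 1_000_000 * hits_by_year.get(year, 0) / tokens_by_year[year]
        for year in sorted(tokens_by_year)
        if tokens_by_year[year] > 0
    }


# Placeholder figures, not taken from the corpus.
hits = {2019: 50, 2020: 120}
tokens = {2019: 1_000_000, 2020: 2_000_000}

relative = per_million(hits, tokens)
print(relative)  # {2019: 50.0, 2020: 60.0}
```

In this toy case the raw count more than doubles while the relative frequency rises by only 20%, which is why a trend should be read from normalized values rather than raw hits.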

Data Sources & References

Academic Torrents: Reddit archive dumps
Arctic Shift API: Supplementary data retrieval
GlotLID v3: cis-lmu/glotlid — LMU Munich language detection
UDPipe Slovak: ufal/udpipe — Slovak NLP model
Universal Dependencies: universaldependencies.org/sk