Russian Reddit Corpus

ruRed — Russian Reddit (2013-2024)

Current Status

The corpus currently covers 2 subreddits, with 174M words in 8.2M documents. Full dependency parsing and NER are enabled.

Corpus Overview

ruRed (the Russian Reddit Corpus) contains Russian-language posts and comments from Reddit, giving access to contemporary informal written Russian from online communities.

The corpus is filtered with fastText language identification (the lid.176.bin model) to keep Russian-only content. Comments and submissions whose language-identification confidence falls below 0.5 are excluded.
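
The filtering rule described above could be sketched roughly as follows. This is a minimal sketch, not the pipeline actually used to build the corpus: the names `load_fasttext_predictor` and `is_russian` are hypothetical, and the predictor is passed in as an argument so the decision logic can be read without the model file. It assumes the `fasttext` Python package, whose `predict` method returns labels of the form `__label__ru` together with their probabilities.

```python
from typing import Callable, Tuple

# A predictor maps one line of text to (language code, confidence).
Predictor = Callable[[str], Tuple[str, float]]


def load_fasttext_predictor(model_path: str = "lid.176.bin") -> Predictor:
    """Wrap a fastText language-ID model (assumes the `fasttext` package)."""
    import fasttext  # imported lazily; only needed when a real model is used

    model = fasttext.load_model(model_path)

    def predict(text: str) -> Tuple[str, float]:
        labels, probs = model.predict(text, k=1)
        return labels[0].replace("__label__", ""), float(probs[0])

    return predict


def is_russian(text: str, predict: Predictor, threshold: float = 0.5) -> bool:
    """Keep a document only if it is labelled Russian with enough confidence."""
    # fastText predicts one line at a time, so collapse newlines first.
    line = " ".join(text.split())
    if not line:
        return False
    lang, conf = predict(line)
    return lang == "ru" and conf >= threshold
```

With the real model, usage would look like `is_russian(comment_body, load_fasttext_predictor())`, where `comment_body` stands for the text of one Reddit comment or submission.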

The corpus is processed with spaCy's ru_core_news_lg (the large Russian model), running the full pipeline: tokenization, POS tagging, lemmatization, dependency parsing, and named entity recognition.

Key Statistics

Metric	Value
Total Words	174 million
Documents	~8,200,000
Subreddits	2
Time Period	2013-2024
Language	Russian
spaCy Model	ru_core_news_lg

Subreddits Included

Subreddit	Documents	Tokens	Description
Pikabu	~6,000,000	116M	Russian humor/entertainment
AskARussian	~2,200,000	99M	Q&A about Russia and Russians

Data Source

The corpus is derived from:

Academic Torrents: Reddit archive dumps via academictorrents.com
fastText Language ID: Comments filtered using lid.176.bin model

Available Attributes

The corpus includes the following searchable attributes:

Token Attributes

Attribute	Description	Example
`word`	Exact word form	[word="и"]
`lemma`	Base form (lemma)	[lemma="быть"]
`tag`	POS tag (Universal Dependencies)	[tag="NOUN"]
`head_n`	Head position (1-based, 0 for root)	[head_n="0"]
`head`	Head lemma	[head="сказать"]
`deprel`	Dependency relation (UD)	[deprel="nsubj"]
`ent_type`	Named entity type (PER, ORG, LOC)	[ent_type="PER"]
`ent_iob`	NER IOB tag (B=begin, I=inside, O=outside)	[ent_iob="B"]
`morph`	Morphological features (UD FEATS format)	[morph=".*Case=Nom.*"]

Note: All attributes are fully populated. The corpus was processed with spaCy’s ru_core_news_lg model with full dependency parsing and named entity recognition.
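
One way the token attributes above might be derived from a spaCy analysis is sketched below. The mapping is an assumption for illustration, not the corpus's documented build code: `tag` is taken from spaCy's coarse `pos_`, and `head_n` is computed relative to the start of the sentence. The helper names `token_attributes` and `annotate` are hypothetical.

```python
def token_attributes(token) -> dict:
    """Map one spaCy token to the corpus attribute names (assumed mapping)."""
    # In spaCy, the root of a sentence is its own head.
    is_root = token.head.i == token.i
    # Assumption: head positions are 1-based within the sentence, 0 for root.
    head_n = 0 if is_root else token.head.i - token.sent.start + 1
    return {
        "word": token.text,
        "lemma": token.lemma_,
        "tag": token.pos_,
        "head_n": str(head_n),
        "head": token.head.lemma_,
        "deprel": token.dep_,
        "ent_type": token.ent_type_,
        "ent_iob": token.ent_iob_,
        "morph": str(token.morph),
    }


def annotate(text: str, model: str = "ru_core_news_lg") -> list:
    """Run the full spaCy pipeline and return one attribute dict per token."""
    import spacy  # imported lazily; requires spaCy and the model to be installed

    nlp = spacy.load(model)
    return [token_attributes(token) for token in nlp(text)]
```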

Document Attributes

Attribute	Description	Example
`doc.id`	Document ID	Filter by specific post
`doc.subreddit`	Subreddit name	Filter: Pikabu, AskARussian
`doc.author`	Reddit username	Filter by author
`doc.date`	Publication date	Filter by date range
`doc.year`	Publication year	Filter by year
`doc.doc_type`	Type: comment or submission	Filter by type
`doc.lang`	Detected language code	Filter by language
`doc.lang_conf`	Language confidence	Filter by confidence

CQL Query Examples

Basic Word Search

Find all occurrences of a word in any inflected form, by lemma:

[lemma="быть"]

Find exact word form:

[word="и"]

Named Entity Recognition

The spaCy Russian models use these NER labels: PER, ORG, LOC

Find all person names:

[ent_type="PER"]

Find all organizations:

[ent_type="ORG"]

Find all locations:

[ent_type="LOC"]

Part-of-Speech Queries

Find all nouns (Universal Dependencies tagset):

[tag="NOUN"]

Find all verbs:

[tag="VERB"]

Find all adjectives:

[tag="ADJ"]

Morphological Features

Find nominative case:

[morph=".*Case=Nom.*"]

Find plural nouns:

[tag="NOUN" & morph=".*Number=Plur.*"]

Find perfective verbs:

[tag="VERB" & morph=".*Aspect=Perf.*"]
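
A `morph` value holds the whole UD FEATS string for a token, so a query for a single feature needs `.*` on both sides of it. The sketch below illustrates why, under the assumption that the query engine matches the regular expression against the entire attribute value (as common CQL implementations do). The helper `cql_value_matches` is hypothetical and only emulates that behaviour with Python's `re.fullmatch`.

```python
import re


def cql_value_matches(pattern: str, value: str) -> bool:
    """Emulate whole-value regex matching (an assumption about the engine)."""
    return re.fullmatch(pattern, value) is not None


# A typical FEATS string for a Russian noun.
feats = "Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing"

print(cql_value_matches(".*Case=Nom.*", feats))       # True
print(cql_value_matches("Case=Nom", feats))           # False: other features surround it
print(cql_value_matches(".*Case=Nom.*", "Case=Nom"))  # True: .* may match nothing
```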

Dependency Relations

The spaCy Russian model uses Universal Dependencies relation labels.

Find subjects:

[deprel="nsubj"]

Find direct objects:

[deprel="obj"]

Find indirect objects:

[deprel="iobj"]

Combined Queries

Find a verb followed by a noun with at most three tokens in between:

[tag="VERB"] []{0,3} [tag="NOUN"]
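
To make the meaning of the `[]{0,3}` gap concrete, the sketch below applies the same pattern to a plain list of POS tags. It is an illustration only: the function name `verb_gap_noun` is hypothetical, it ignores sentence boundaries, and a real query engine may report overlapping matches differently.

```python
from typing import List, Tuple


def verb_gap_noun(tags: List[str], max_gap: int = 3) -> List[Tuple[int, int]]:
    """Return (verb_index, noun_index) pairs with at most `max_gap` tokens between."""
    matches = []
    for i, tag in enumerate(tags):
        if tag != "VERB":
            continue
        # The noun may sit 1 to max_gap + 1 positions after the verb.
        for j in range(i + 1, min(i + max_gap + 2, len(tags))):
            if tags[j] == "NOUN":
                matches.append((i, j))
    return matches


tags = ["VERB", "ADV", "ADJ", "NOUN", "PUNCT", "NOUN"]
print(verb_gap_noun(tags))  # [(0, 3)]: the second noun is four tokens away
```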

Filter by Subreddit

Search within a specific subreddit:

Use the metadata filter to select a subreddit (e.g., Pikabu)
Run your query

Programmatic Access (API)

Query the corpus programmatically via the REST API:

Example Request (curl)

curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"быть\"]",
    "corpus_id": "ruRed",
    "limit": 50
  }'
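
The same request can be sketched in Python using only the standard library. The endpoint and the payload fields come from the curl example above. The function names `build_request` and `run_query` are hypothetical, and the response is assumed to be a JSON body, since its shape is not documented here.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/cql/run"  # endpoint taken from the curl example


def build_request(cql: str, corpus_id: str = "ruRed",
                  limit: int = 50) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {"cql": cql, "corpus_id": corpus_id, "limit": limit}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(cql: str, **kwargs):
    """Send the query; assumes the server answers with a JSON body."""
    with urllib.request.urlopen(build_request(cql, **kwargs)) as response:
        return json.load(response)
```

For example, `run_query('[lemma="быть"]', limit=10)` would send the lemma query with a smaller result limit, provided the server is running locally.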

Quick Reference

Goal	CQL Query
Exact word	[word="и"]
All forms (lemma)	[lemma="быть"]
Adjacent words	[lemma="я"] [lemma="думать"]
Person names	[ent_type="PER"]
Organizations	[ent_type="ORG"]
Locations	[ent_type="LOC"]
Nouns	[tag="NOUN"]
Verbs	[tag="VERB"]
Adjectives	[tag="ADJ"]
Subjects	[deprel="nsubj"]
Direct objects	[deprel="obj"]
Nominative case	[morph=".*Case=Nom.*"]
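
When queries are generated from code, the patterns in this table can be assembled with a small helper. The sketch below is illustrative only: `token` and `sequence` are hypothetical names, the output follows the `[attr="value" & attr="value"]` syntax used in the examples above, and values containing double quotes are not handled.

```python
def token(**attrs: str) -> str:
    """Build one CQL token expression; no arguments yields the wildcard []."""
    if not attrs:
        return "[]"
    parts = [f'{name}="{value}"' for name, value in attrs.items()]
    return "[" + " & ".join(parts) + "]"


def sequence(*tokens: str) -> str:
    """Join token expressions into a query for adjacent tokens."""
    return " ".join(tokens)


print(token(tag="NOUN", morph=".*Number=Plur.*"))
# [tag="NOUN" & morph=".*Number=Plur.*"]
print(sequence(token(lemma="я"), token(lemma="думать")))
# [lemma="я"] [lemma="думать"]
```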

References

Academic Torrents Reddit Archive: academictorrents.com
fastText Language ID: fasttext.cc
spaCy Russian Model: ru_core_news_lg
Universal Dependencies Tagset: universaldependencies.org