Russian Reddit Corpus

ruRed — Russian Reddit (2013-2024)

Note: Current Status

Russian Reddit corpus: 2 subreddits with 174M words, 8.2M documents. Full dependency parsing and NER are enabled.

Corpus Overview

ruRed (the Russian Reddit Corpus) contains Russian-language posts and comments from Reddit, giving access to contemporary informal written Russian from online communities.

The corpus is filtered with fastText language identification (the lid.176.bin model) so that only Russian content remains. Comments and submissions whose language-identification confidence is below the 0.5 threshold are excluded.
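The filtering step described above could look roughly like the sketch below. It assumes the fasttext Python package and a local copy of lid.176.bin. The helper names is_russian and filter_russian are illustrative only, since the corpus's actual filtering script is not part of this documentation.

```python
def is_russian(labels, probs, threshold=0.5):
    """Return True if the top fastText prediction is Russian with enough confidence."""
    if len(labels) == 0:
        return False
    return labels[0] == "__label__ru" and float(probs[0]) >= threshold


def filter_russian(texts, model_path="lid.176.bin", threshold=0.5):
    """Yield only the texts that fastText identifies as Russian (hypothetical helper)."""
    import fasttext  # imported lazily so is_russian works without the package

    model = fasttext.load_model(model_path)
    for text in texts:
        # fastText predicts one line at a time, so newlines are flattened first
        labels, probs = model.predict(text.replace("\n", " "), k=1)
        if is_russian(labels, probs, threshold):
            yield text
```

The confidence returned here presumably corresponds to the value stored in doc.lang_conf, and the label to doc.lang.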

The corpus is processed with spaCy's ru_core_news_lg (the large Russian model), running the full pipeline: tokenization, POS tagging, lemmatization, dependency parsing, and named entity recognition.

Key Statistics

Metric         Value
Total Words    174 million
Documents      ~8,200,000
Subreddits     2
Time Period    2013-2024
Language       Russian
spaCy Model    ru_core_news_lg

Subreddits Included

Subreddit      Documents     Tokens    Description
Pikabu         ~6,000,000    116M      Russian humor/entertainment
AskARussian    ~2,200,000    99M       Q&A about Russia and Russians

Data Source

The corpus is derived from:


Available Attributes

The corpus includes the following searchable attributes:

Token Attributes

Attribute  Description                                 Example
word       Exact word form                             [word="и"]
lemma      Base form (lemma)                           [lemma="быть"]
tag        POS tag (Universal Dependencies)            [tag="NOUN"]
head_n     Head position (1-based, 0 for root)         [head_n="0"]
head       Head lemma                                  [head="сказать"]
deprel     Dependency relation (UD)                    [deprel="nsubj"]
ent_type   Named entity type (PER, ORG, LOC, MISC)     [ent_type="PER"]
ent_iob    NER IOB tag (B=begin, I=inside, O=outside)  [ent_iob="B"]
morph      Morphological features (UD FEATS format)    [morph=".*Case=Nom.*"]

Note: All attributes are fully populated. The corpus was processed with spaCy’s ru_core_news_lg model with full dependency parsing and named entity recognition.
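To make the attribute table concrete, here is a minimal sketch of how the token attributes might be derived from a spaCy parse. The mapping is an assumption based on the table. In particular, head_n is computed as a 1-based position within the sentence, and tag is taken from spaCy's coarse pos_ field. The function names token_attributes and annotate are hypothetical, and the real indexing code may differ.

```python
def token_attributes(token, sent_start):
    """Map one spaCy-style token to the corpus attribute names.

    `sent_start` is the document index of the first token in the sentence.
    """
    is_root = token.head.i == token.i  # spaCy marks the root as its own head
    return {
        "word": token.text,
        "lemma": token.lemma_,
        "tag": token.pos_,  # UD coarse tag such as NOUN or VERB
        "head_n": 0 if is_root else token.head.i - sent_start + 1,
        "head": token.head.lemma_,
        "deprel": token.dep_,
        "ent_type": token.ent_type_,
        "ent_iob": token.ent_iob_,
        "morph": str(token.morph),  # FEATS string, e.g. Case=Nom|Number=Sing
    }


def annotate(text, model_name="ru_core_news_lg"):
    """Run spaCy and return one attribute dict per token, grouped by sentence."""
    import spacy  # lazy import: only needed when a real model is installed

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return [
        [token_attributes(tok, sent.start) for tok in sent]
        for sent in doc.sents
    ]
```

Calling annotate requires the ru_core_news_lg model to be installed. token_attributes only relies on attribute access, so it can be tried with any object exposing the same fields.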

Document Attributes

Attribute      Description                  Example
doc.id         Document ID                  Filter by specific post
doc.subreddit  Subreddit name               Filter: Pikabu, AskARussian
doc.author     Reddit username              Filter by author
doc.date       Publication date             Filter by date range
doc.year       Publication year             Filter by year
doc.doc_type   Type: comment or submission  Filter by type
doc.lang       Detected language code       Filter by language
doc.lang_conf  Language confidence          Filter by confidence

CQL Query Examples

Named Entity Recognition

Russian spaCy uses these NER labels: PER, ORG, LOC, MISC

Find all person names:

[ent_type="PER"]

Find all organizations:

[ent_type="ORG"]

Find all locations:

[ent_type="LOC"]
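Each bracket in a CQL query matches a single token, so a multi-word name produces one hit per token. The ent_iob attribute lets you reassemble full entity spans after retrieval. The sketch below shows one way to do that on the client side. The input layout, a list of (word, ent_type, ent_iob) triples, is hypothetical and not a format the API is known to return.

```python
def group_entities(tokens):
    """Merge (word, ent_type, ent_iob) triples into (entity_text, ent_type) spans."""
    spans = []
    current_words, current_type = [], None
    for word, ent_type, ent_iob in tokens:
        continues = ent_iob == "I" and current_words and ent_type == current_type
        if continues:
            current_words.append(word)
            continue
        # Any other tag closes the span that was open, if there was one
        if current_words:
            spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if ent_iob in ("B", "I"):  # a stray "I" is treated as a new span start
            current_words, current_type = [word], ent_type
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans
```

For example, the tokens Владимир (PER, B) and Путин (PER, I) would be merged into the single span "Владимир Путин".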

Part-of-Speech Queries

Find all nouns (Universal Dependencies tagset):

[tag="NOUN"]

Find all verbs:

[tag="VERB"]

Find all adjectives:

[tag="ADJ"]

Morphological Features

Find nominative case:

[morph=".*Case=Nom.*"]

Find plural nouns:

[tag="NOUN" & morph=".*Number=Plur.*"]

Find perfective verbs:

[tag="VERB" & morph=".*Aspect=Perf.*"]
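The morph attribute stores every feature of a token in one string, so a query for a single feature needs a wildcard on both sides, as in ".*Case=Nom.*". The sketch below illustrates why, under the assumption that the query engine matches a pattern against the whole attribute value, as CQL engines commonly do. Python's re.fullmatch stands in for that behaviour, and the sample FEATS string is made up for illustration.

```python
import re


def morph_matches(pattern, feats):
    """Return True if `pattern` matches the entire FEATS string."""
    return re.fullmatch(pattern, feats) is not None


# Illustrative FEATS value for a masculine singular nominative noun
feats = "Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing"

with_wildcards = morph_matches(".*Case=Nom.*", feats)  # feature found anywhere
without_wildcards = morph_matches("Case=Nom", feats)   # must equal the whole value
```

With whole-value matching, the bare pattern only succeeds when Case=Nom is the token's sole feature.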

Dependency Relations

Russian spaCy uses Universal Dependencies labels.

Find subjects:

[deprel="nsubj"]

Find direct objects:

[deprel="obj"]

Find indirect objects:

[deprel="iobj"]

Combined Queries

Find verbs followed by nouns within 3 words:

[tag="VERB"] []{0,3} [tag="NOUN"]
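When queries like the one above are generated from code, a small helper keeps the bracket and quote syntax consistent. This is a minimal sketch: the functions token and within are hypothetical conveniences, not part of any corpus API, and attribute values containing double quotes are not escaped.

```python
def token(**attrs):
    """Build one CQL token expression such as [tag="NOUN" & morph=".*Number=Plur.*"]."""
    if not attrs:
        return "[]"  # an empty bracket matches any token
    conditions = [f'{name}="{value}"' for name, value in attrs.items()]
    return "[" + " & ".join(conditions) + "]"


def within(first, second, max_gap):
    """Join two token expressions, allowing up to `max_gap` arbitrary tokens between them."""
    return f"{first} []{{0,{max_gap}}} {second}"


verb_then_noun = within(token(tag="VERB"), token(tag="NOUN"), 3)
```

Here verb_then_noun holds the same query string as the example above.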

Filter by Subreddit

Search within a specific subreddit:

  1. Use the metadata filter to select a subreddit (e.g., Pikabu)
  2. Run your query

Programmatic Access (API)

Query the corpus programmatically via the REST API:

Example Request (curl)

curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"быть\"]",
    "corpus_id": "ruRed",
    "limit": 50
  }'
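The same request can be sent from Python using only the standard library. The sketch below reuses the endpoint and JSON fields from the curl example. The function names build_request and run_query are illustrative, and the structure of the response is not documented here, so it is returned as parsed JSON without further interpretation.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/cql/run"  # endpoint taken from the curl example


def build_request(cql, corpus_id="ruRed", limit=50, url=API_URL):
    """Create the POST request with the same JSON body as the curl call."""
    payload = {"cql": cql, "corpus_id": corpus_id, "limit": limit}
    return urllib.request.Request(
        url,
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(cql, **kwargs):
    """Send the query and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(cql, **kwargs), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For instance, run_query('[lemma="быть"]', limit=10) would post the query to a locally running server. Building the request separately makes the payload easy to inspect without any network access.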

Quick Reference

Goal               CQL Query
Exact word         [word="и"]
All forms (lemma)  [lemma="быть"]
Adjacent words     [lemma="я"] [lemma="думать"]
Person names       [ent_type="PER"]
Organizations      [ent_type="ORG"]
Locations          [ent_type="LOC"]
Nouns              [tag="NOUN"]
Verbs              [tag="VERB"]
Adjectives         [tag="ADJ"]
Subjects           [deprel="nsubj"]
Direct objects     [deprel="obj"]
Nominative case    [morph=".*Case=Nom.*"]
