Russian Reddit Corpus
ruRed — Russian Reddit (2013-2024)
Russian Reddit corpus: 2 subreddits with 174M words, 8.2M documents. Full dependency parsing and NER are enabled.
Corpus Overview
The ruRed (Russian Reddit Corpus) contains Russian-language posts and comments from Reddit, providing access to contemporary informal written Russian discourse from online communities.
The corpus is filtered using fastText language identification (lid.176.bin model) to ensure Russian-only content. Comments and submissions below a confidence threshold of 0.5 are excluded.
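The filtering rule above can be sketched in Python as follows. This is a minimal illustration, not the pipeline actually used to build the corpus: the function name keep_document and the constants are hypothetical, and the fastText call appears only in a comment so the snippet runs without the model file.

```python
# Minimal sketch of the language filter described above (names are illustrative).
# With the real model, label and confidence would come from something like:
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   labels, probs = model.predict(text.replace("\n", " "))  # predict() rejects newlines
#   label, confidence = labels[0], float(probs[0])

RU_LABEL = "__label__ru"   # fastText prefixes predicted labels with "__label__"
MIN_CONFIDENCE = 0.5       # threshold stated in the corpus description


def keep_document(label, confidence, min_confidence=MIN_CONFIDENCE):
    """Return True if a post or comment should stay in the corpus.

    A document is kept only when the top predicted language is Russian
    and the prediction confidence is not below the threshold.
    """
    return label == RU_LABEL and confidence >= min_confidence
```

Under this reading, a document predicted as Russian with confidence exactly 0.5 is kept, since only documents below the threshold are excluded.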
The corpus is processed with spaCy's ru_core_news_lg (the large Russian model), running the full pipeline: tokenization, POS tagging, lemmatization, dependency parsing, and named entity recognition.
Key Statistics
| Metric | Value |
|---|---|
| Total Words | 174 million |
| Documents | ~8,200,000 |
| Subreddits | 2 |
| Time Period | 2013-2024 |
| Language | Russian |
| spaCy Model | ru_core_news_lg |
Subreddits Included
| Subreddit | Documents | Tokens | Description |
|---|---|---|---|
| Pikabu | ~6,000,000 | 116M | Russian humor/entertainment |
| AskARussian | ~2,200,000 | 99M | Q&A about Russia and Russians |
Data Source
The corpus is derived from:
- Academic Torrents: Reddit archive dumps via academictorrents.com
- fastText Language ID: Comments and submissions filtered using the lid.176.bin model
Available Attributes
The corpus includes the following searchable attributes:
Token Attributes
| Attribute | Description | Example |
|---|---|---|
| word | Exact word form | [word="и"] |
| lemma | Base form (lemma) | [lemma="быть"] |
| tag | POS tag (Universal Dependencies) | [tag="NOUN"] |
| head_n | Head position (1-based, 0 for root) | [head_n="0"] |
| head | Head lemma | [head="сказать"] |
| deprel | Dependency relation (UD) | [deprel="nsubj"] |
| ent_type | Named entity type (PER, ORG, LOC, MISC) | [ent_type="PER"] |
| ent_iob | NER IOB tag (B=begin, I=inside, O=outside) | [ent_iob="B"] |
| morph | Morphological features (UD FEATS format) | [morph=".*Case=Nom.*"] |
Note: All attributes are fully populated. The corpus was processed with spaCy’s ru_core_news_lg model with full dependency parsing and named entity recognition.
Document Attributes
| Attribute | Description | Example |
|---|---|---|
| doc.id | Document ID | Filter by specific post |
| doc.subreddit | Subreddit name | Filter: Pikabu, AskARussian |
| doc.author | Reddit username | Filter by author |
| doc.date | Publication date | Filter by date range |
| doc.year | Publication year | Filter by year |
| doc.doc_type | Type: comment or submission | Filter by type |
| doc.lang | Detected language code | Filter by language |
| doc.lang_conf | Language confidence | Filter by confidence |
CQL Query Examples
Basic Word Search
Find all occurrences of a word, in any inflected form: [lemma="быть"]
Find an exact word form: [word="и"]
Named Entity Recognition
Russian spaCy uses these NER labels: PER, ORG, LOC, MISC
Find all person names: [ent_type="PER"]
Find all organizations: [ent_type="ORG"]
Find all locations: [ent_type="LOC"]
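A query on ent_type alone matches tokens one at a time, so a two-word name yields two hits; the ent_iob attribute is what marks where each entity begins. The Python sketch below regroups tokens into whole entities. It assumes the tokens are available as (word, ent_type, ent_iob) tuples, which is a hypothetical export format and not something this page documents; group_entities is likewise an illustrative name.

```python
def group_entities(tokens):
    """Merge consecutive B/I-tagged tokens into (text, entity_type) spans.

    tokens: iterable of (word, ent_type, ent_iob) tuples, where ent_iob is
    "B" (entity begins), "I" (entity continues) or "O" (outside any entity).
    """
    spans = []
    words, current_type = [], None

    def flush():
        # Close the entity collected so far, if there is one.
        if words:
            spans.append((" ".join(words), current_type))

    for word, ent_type, ent_iob in tokens:
        if ent_iob == "B" or (ent_iob == "I" and ent_type != current_type):
            # A new entity starts (a stray "I" with a new type is treated as a start).
            flush()
            words, current_type = [word], ent_type
        elif ent_iob == "I":
            words.append(word)
        else:
            flush()
            words, current_type = [], None
    flush()
    return spans
```

For example, the tokens Владимир (PER, B), Путин (PER, I), посетил (O), Москву (LOC, B) are grouped into one PER span and one LOC span.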
Part-of-Speech Queries
Find all nouns (Universal Dependencies tagset): [tag="NOUN"]
Find all verbs: [tag="VERB"]
Find all adjectives: [tag="ADJ"]
Morphological Features
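Find nominative case: [morph=".*Case=Nom.*"]
Find plural nouns (UD feature Number=Plur on tokens tagged NOUN): [tag="NOUN" & morph=".*Number=Plur.*"]
Find perfective verbs (UD feature Aspect=Perf on tokens tagged VERB): [tag="VERB" & morph=".*Aspect=Perf.*"]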
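The morph attribute holds a UD FEATS string: a list of Feature=Value pairs separated by vertical bars, such as Case=Nom|Number=Sing. The Python sketch below shows why a morph pattern needs .* on both sides of the feature. It assumes the query engine matches the regular expression against the whole attribute value, which is common in CQL implementations but not confirmed for this corpus. The helpers parse_feats and morph_pattern are hypothetical names used for illustration.

```python
import re


def parse_feats(feats):
    """Split a UD FEATS string into a {feature: value} dict ("_" or "" means none)."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def morph_pattern(feature, value):
    """Build a regex that finds Feature=Value anywhere inside a FEATS string."""
    return f".*{re.escape(feature)}={re.escape(value)}.*"


feats = "Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur"

# With .* on both sides, the feature is found wherever it sits in the string.
assert re.fullmatch(morph_pattern("Case", "Nom"), feats)
assert re.fullmatch(morph_pattern("Number", "Plur"), feats)

# A bare "Case=Nom" would not match the full value under whole-value matching.
assert re.fullmatch("Case=Nom", feats) is None
```

Parsing the string with parse_feats is convenient when post-processing results, for instance to count hits by case or number.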
Dependency Relations
Russian spaCy uses Universal Dependencies labels.
Find subjects: [deprel="nsubj"]
Find direct objects: [deprel="obj"]
Find indirect objects: [deprel="iobj"]
Combined Queries
Find verbs followed by nouns within 3 words:
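Read as "a noun among the three tokens after the verb", that is, with up to two arbitrary tokens in between, this pattern would be written [tag="VERB"] []{0,2} [tag="NOUN"]. That reading of "within 3 words" is an assumption, as is support for the []{m,n} gap and the & operator, which are common in CQL dialects but not confirmed for this corpus. A small Python sketch for composing such queries follows; the helpers token and followed_within are hypothetical.

```python
def token(**attrs):
    """Build one CQL token constraint, e.g. token(tag="VERB") -> [tag="VERB"].

    Several attributes are joined with "&"; no attributes gives the wildcard [].
    Values are inserted as-is, so they must not contain double quotes.
    """
    if not attrs:
        return "[]"
    return "[" + " & ".join(f'{name}="{value}"' for name, value in attrs.items()) + "]"


def followed_within(first, second, max_gap):
    """Match `first`, then at most `max_gap` arbitrary tokens, then `second`."""
    return f"{first} []{{0,{max_gap}}} {second}"


# A verb with a noun among the next three tokens (up to two tokens in between).
verb_then_noun = followed_within(token(tag="VERB"), token(tag="NOUN"), 2)

# Two constraints on the same token: a noun carrying the plural feature.
plural_noun = token(tag="NOUN", morph=".*Number=Plur.*")
```

Building queries from small pieces like this keeps the gap size and attribute names in one place when many variants of a pattern are needed.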
Filter by Subreddit
Search within a specific subreddit:
- Use the metadata filter to select a subreddit (e.g., Pikabu)
- Run your query
Programmatic Access (API)
Query the corpus programmatically via the REST API:
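From Python, the endpoint could be called with the standard library alone. The sketch below mirrors the fields of the curl request in this section (cql, corpus_id, limit). The helper names build_request and run_query are hypothetical, and the response is assumed to be JSON, which this page does not document.

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust host and port for your deployment.
API_URL = "http://localhost:8000/cql/run"


def build_request(cql, corpus_id="ruRed", limit=50, url=API_URL):
    """Prepare (but do not send) a POST request carrying a CQL query as JSON."""
    payload = {"cql": cql, "corpus_id": corpus_id, "limit": limit}
    # ensure_ascii=False keeps Cyrillic readable; the body is sent as UTF-8 bytes.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(cql, **kwargs):
    """Send the query and decode the response, assuming the server returns JSON."""
    request = build_request(cql, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling run_query('[lemma="быть"]') sends the request; inspect the returned object to see which fields the server provides.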
Example Request (curl)
curl -X POST http://localhost:8000/cql/run \
-H "Content-Type: application/json" \
-d '{
"cql": "[lemma=\"быть\"]",
"corpus_id": "ruRed",
"limit": 50
}'
Quick Reference
| Goal | CQL Query |
|---|---|
| Exact word | [word="и"] |
| All forms (lemma) | [lemma="быть"] |
| Adjacent words | [lemma="я"] [lemma="думать"] |
| Person names | [ent_type="PER"] |
| Organizations | [ent_type="ORG"] |
| Locations | [ent_type="LOC"] |
| Nouns | [tag="NOUN"] |
| Verbs | [tag="VERB"] |
| Adjectives | [tag="ADJ"] |
| Subjects | [deprel="nsubj"] |
| Direct objects | [deprel="obj"] |
| Nominative case | [morph=".*Case=Nom.*"] |
References
- Academic Torrents Reddit Archive: academictorrents.com
- fastText Language ID: fasttext.cc
- spaCy Russian Model: ru_core_news_lg
- Universal Dependencies Tagset: universaldependencies.org