fRed Corpus
French Reddit Corpus (2010-2024)
French Reddit corpus: 3 subreddits with 236M words, 6.8M documents. Full dependency parsing and NER are enabled.
Corpus Overview
The fRed (French Reddit Corpus) contains French-language posts and comments from Reddit, providing access to contemporary informal written French discourse from online communities across different francophone regions.
The corpus includes data from Quebec (Canadian French), France (Metropolitan French), and Belgium (Belgian French), enabling regional variation studies.
Language detection uses fastText (lid.176.bin model) with a lang_group field for easy filtering:
- target: French content (corpus focus)
- en: English content
- related: Related languages (Dutch, Catalan, Occitan, Walloon, Breton)
- other: Other languages
- low_conf: Low confidence detections (<0.5)
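The grouping rules above can be sketched as a small function. This is an illustration only, not the corpus build code. The function name `lang_group`, the ISO codes assumed for the related languages, and the choice to apply the confidence check before anything else are all assumptions. Only the group names and the 0.5 threshold come from the list above.

```python
# Hypothetical mapping from a fastText prediction to a lang_group value.
# Assumed ISO 639-1 codes: nl=Dutch, ca=Catalan, oc=Occitan, wa=Walloon, br=Breton.
PREFIX = "__label__"
RELATED = {"nl", "ca", "oc", "wa", "br"}


def lang_group(label: str, confidence: float) -> str:
    """Map a fastText label such as '__label__fr' to a lang_group."""
    code = label[len(PREFIX):] if label.startswith(PREFIX) else label
    if confidence < 0.5:  # assumption: low confidence overrides the language
        return "low_conf"
    if code == "fr":
        return "target"
    if code == "en":
        return "en"
    if code in RELATED:
        return "related"
    return "other"


print(lang_group("__label__fr", 0.97))  # target
```

Checking the threshold first means a weak French prediction lands in `low_conf` rather than `target`; the actual pipeline may order these checks differently.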
The corpus is processed with spaCy fr_core_news_lg (large French model) with full pipeline including tokenization, POS tagging, lemmatization, dependency parsing, and named entity recognition.
Key Statistics
| Metric | Value |
|---|---|
| Total Words | 236 million |
| Documents | 6,807,097 |
| Subreddits | 3 |
| Time Period | 2010-2024 |
| Language | French |
| spaCy Model | fr_core_news_lg |
Subreddits Included
| Subreddit | Region | Documents | Description |
|---|---|---|---|
| Quebec | Canada | 4,847,733 | Canadian French, Quebec community |
| rance | France | 837,133 | French humor/memes, Metropolitan |
| Belgium2 | Belgium | 1,122,231 | Multilingual (French/Dutch/English) |
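As a quick consistency check, the per-subreddit document counts in the table can be summed and compared with the total under Key Statistics. The sketch below uses only the figures from the two tables; the variable names are illustrative.

```python
# Document counts copied from the Subreddits Included table.
docs = {"Quebec": 4_847_733, "rance": 837_133, "Belgium2": 1_122_231}

total = sum(docs.values())
print(total)  # 6807097, the Documents figure under Key Statistics

# Share of each subreddit in the whole corpus, by document count.
for name, count in docs.items():
    print(f"{name}: {count / total:.1%}")
```

The shares are by document count, not by word count; the source does not give per-subreddit word totals.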
Data Source
The corpus is derived from:
- Academic Torrents: Reddit archive dumps via academictorrents.com
- fastText Language ID: Comments filtered using lid.176.bin model
Available Attributes
The corpus includes the following searchable attributes:
Token Attributes
| Attribute | Description | Example |
|---|---|---|
| word | Exact word form | [word="et"] |
| lemma | Base form (lemma) | [lemma="avoir"] |
| tag | POS tag (Universal Dependencies) | [tag="NOUN"] |
| head_n | Head position (1-based, 0 for root) | Dependency parsing |
| head | Head lemma | Dependency parsing |
| deprel | Dependency relation (UD) | [deprel="nsubj"] |
| ent_type | Named entity type (PER, ORG, LOC, MISC) | [ent_type="PER"] |
| ent_iob | NER IOB tag (B=begin, I=inside, O=outside) | [ent_iob="B"] |
| morph | Morphological features (UD FEATS format) | [morph=".*Gender=Masc.*"] |
Note: All attributes are fully populated. The corpus was processed with spaCy’s fr_core_news_lg model with full dependency parsing and named entity recognition.
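For scripted use, token queries like those in the Example column can be assembled from attribute names and values. The helpers below are a hypothetical sketch (`token` and `sequence` are not part of any fRed tooling). Joining several constraints with `&` follows common CQP-style CQL and is an assumption about this deployment. Values are inserted as-is, without escaping.

```python
def token(**attrs: str) -> str:
    """Build one CQL token expression from attribute=value constraints."""
    if not attrs:
        return "[]"  # an unconstrained token
    parts = [f'{name}="{value}"' for name, value in attrs.items()]
    # Assumption: the engine accepts '&' between constraints on one token.
    return "[" + " & ".join(parts) + "]"


def sequence(*tokens: str) -> str:
    """Join token expressions into a query over adjacent tokens."""
    return " ".join(tokens)


print(token(lemma="avoir"))  # [lemma="avoir"]
print(sequence(token(lemma="je"), token(lemma="penser")))  # [lemma="je"] [lemma="penser"]
```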
Document Attributes
| Attribute | Description | Example |
|---|---|---|
| doc.id | Document ID | Filter by specific post |
| doc.subreddit | Subreddit name | Filter: Quebec, rance, Belgium2 |
| doc.author | Reddit username | Filter by author |
| doc.date | Publication date | Filter by date range |
| doc.year | Publication year | Filter by year |
| doc.doc_type | Type: comment or submission | Filter by type |
| doc.lang | Detected language code | Filter by language |
| doc.lang_group | Language group (target/en/etc.) | Filter French-only content |
CQL Query Examples
Basic Word Search
Find all occurrences of a word, in any inflected form, by lemma: [lemma="avoir"]
Find exact word form: [word="bonjour"]
Named Entity Recognition
French spaCy uses these NER labels: PER, ORG, LOC, MISC
Find all person names: [ent_type="PER"]
Find all organizations: [ent_type="ORG"]
Find all locations: [ent_type="LOC"]
Part-of-Speech Queries
Find all nouns (Universal Dependencies tagset): [tag="NOUN"]
Find all verbs: [tag="VERB"]
Find all adjectives: [tag="ADJ"]
Morphological Features
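Find masculine nouns: [morph=".*Gender=Masc.*"]
Find plural nouns: [morph=".*Number=Plur.*"]
Find past tense verbs: [morph=".*Tense=Past.*"]
Find feminine adjectives: [morph=".*Gender=Fem.*"]
These patterns match on the morphological feature alone, so they also return other parts of speech that carry the same feature.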
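A morph value in UD FEATS format is a `|`-separated list of `Feature=Value` pairs, so a pattern for a single feature needs a wildcard on each side. The sketch below shows this with Python's `re.fullmatch`. It assumes the query engine matches a pattern against the whole attribute value, as CQP-style engines usually do. `parse_feats` is a hypothetical helper for working with exported values.

```python
import re


def parse_feats(feats: str) -> dict:
    """Split a UD FEATS string like 'Gender=Masc|Number=Sing' into a dict."""
    if feats in ("", "_"):  # '_' is the UD convention for "no features"
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


value = "Gender=Masc|Number=Sing"

# With '.*' on both sides the feature is found anywhere in the value.
print(bool(re.fullmatch(r".*Gender=Masc.*", value)))  # True
# A single '.' on each side requires exactly one character there, so it fails.
print(bool(re.fullmatch(r".Gender=Masc.", value)))    # False

print(parse_feats(value)["Number"])  # Sing
```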
Dependency Relations
French spaCy uses Universal Dependencies labels.
Find subjects: [deprel="nsubj"]
Find direct objects: [deprel="obj"]
Find indirect objects: [deprel="iobj"]
Combined Queries
Find verbs followed by nouns within 3 words (up to two intervening tokens): [tag="VERB"] []{0,2} [tag="NOUN"]
Find adjective + noun combinations: [tag="ADJ"] [tag="NOUN"]
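To make the window semantics concrete, the sketch below emulates "verb followed by a noun within 3 words" on a hand-tagged token list. It reads the phrase as "the noun is one of the next three tokens", that is, at most two tokens in between. This reading is an assumption, and so are the function name and the sample sentence, which is tagged by hand rather than by spaCy.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, UD POS tag)


def verb_then_noun(tokens: List[Token], max_gap: int = 2) -> List[Tuple[str, str]]:
    """Return (verb, noun) pairs with at most max_gap tokens between them."""
    pairs = []
    for i, (word, tag) in enumerate(tokens):
        if tag != "VERB":
            continue
        # Inspect the max_gap + 1 tokens that follow the verb.
        for other, other_tag in tokens[i + 1 : i + 2 + max_gap]:
            if other_tag == "NOUN":
                pairs.append((word, other))
    return pairs


sentence = [
    ("je", "PRON"), ("mange", "VERB"), ("une", "DET"),
    ("bonne", "ADJ"), ("poutine", "NOUN"),
]
print(verb_then_noun(sentence))             # [('mange', 'poutine')]
print(verb_then_noun(sentence, max_gap=1))  # []
```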
Filter by Subreddit
Search within a specific subreddit:
- Use the metadata filter to select a subreddit (e.g., Quebec)
- Run your query
Filter by Language Group
To search only French content:
- Use the metadata filter to select lang_group = target
- Run your query
Regional Variation Studies
The corpus enables comparative studies across francophone regions:
| Region | Subreddit | Features |
|---|---|---|
| Quebec | Quebec | Canadian French, Quebec idioms |
| France | rance | Metropolitan French, slang |
| Belgium | Belgium2 | Belgian French, Dutch influences |
Example research questions:
- Lexical differences between Quebec and Metropolitan French
- Use of anglicisms across regions
- Regional discourse markers and expressions
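Because the three subreddits differ greatly in size, raw hit counts are not directly comparable across regions. A common approach is to normalize to hits per million words. The sketch below assumes you have a hit count and a word total for each subcorpus. The numbers are placeholders for illustration, not measurements from fRed, and `per_million` is a hypothetical helper.

```python
def per_million(hits: int, corpus_words: int) -> float:
    """Normalize a raw hit count to occurrences per million words."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return hits * 1_000_000 / corpus_words


# Placeholder (hits, words) pairs; NOT real fRed figures.
placeholder_counts = {
    "region_a": (500, 10_000_000),
    "region_b": (150, 2_000_000),
}

for region, (hits, words) in placeholder_counts.items():
    print(region, per_million(hits, words))
```

In this made-up case region_b has fewer raw hits but a higher normalized rate (75.0 against 50.0 per million), which is the kind of reversal normalization is meant to expose.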
Programmatic Access (API)
Query the corpus programmatically via the REST API:
Example Request (curl)
curl -X POST http://localhost:8000/cql/run \
-H "Content-Type: application/json" \
-d '{
"cql": "[lemma=\"avoir\"]",
"corpus_id": "fRed",
"limit": 50
}'
Quick Reference
| Goal | CQL Query |
|---|---|
| Exact word | [word="bonjour"] |
| All forms (lemma) | [lemma="avoir"] |
| Adjacent words | [lemma="je"] [lemma="penser"] |
| Person names | [ent_type="PER"] |
| Organizations | [ent_type="ORG"] |
| Locations | [ent_type="LOC"] |
| Nouns | [tag="NOUN"] |
| Verbs | [tag="VERB"] |
| Adjectives | [tag="ADJ"] |
| Subjects | [deprel="nsubj"] |
| Direct objects | [deprel="obj"] |
| Masculine nouns | [morph=".*Gender=Masc.*"] |
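Any query in this table can be sent through the REST API described under Programmatic Access. The sketch below mirrors the curl request using only the Python standard library. The endpoint URL and the `cql`, `corpus_id`, and `limit` fields come from that example. The helper names and the assumption that the server answers with a JSON body are not confirmed by the source.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/cql/run"  # endpoint from the curl example


def build_request(cql: str, corpus_id: str = "fRed", limit: int = 50) -> urllib.request.Request:
    """Prepare the POST request without sending it."""
    payload = json.dumps({"cql": cql, "corpus_id": corpus_id, "limit": limit})
    return urllib.request.Request(
        API_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(cql: str, limit: int = 50):
    """Send the query; assumes the response body is JSON (not confirmed)."""
    with urllib.request.urlopen(build_request(cql, limit=limit)) as response:
        return json.load(response)


# Build (but do not send) a request for one of the queries above.
request = build_request('[lemma="je"] [lemma="penser"]')
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Using `json.dumps` for the payload avoids the manual quote escaping that the curl version needs inside the `cql` string.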
References
- Academic Torrents Reddit Archive: academictorrents.com
- fastText Language ID: fasttext.cc
- spaCy French Model: fr_core_news_lg
- Universal Dependencies Tagset: universaldependencies.org