fRed Corpus

French Reddit Corpus (2010-2024)

Current Status

French Reddit corpus: 3 subreddits with 236M words, 6.8M documents. Full dependency parsing and NER are enabled.

Corpus Overview

fRed (French Reddit Corpus) contains French-language posts and comments from Reddit. It gives access to contemporary, informal written French as used in online communities across several francophone regions.

The corpus includes data from Quebec (Canadian French), France (Metropolitan French), and Belgium (Belgian French), enabling regional variation studies.

Language detection uses fastText (lid.176.bin model) with a lang_group field for easy filtering:

target: French content (corpus focus)
en: English content
related: Related languages (Dutch, Catalan, Occitan, Walloon, Breton)
other: Other languages
low_conf: Low confidence detections (<0.5)
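The grouping above can be sketched as a small mapping function. This is a minimal illustration, not the pipeline's actual code: the function name, the ISO codes chosen for the related languages, and the rule that low confidence takes precedence over the detected language are all assumptions.

```python
# Hypothetical helper: the documentation names the groups, not the code behind them.
# Assumed ISO codes for Dutch, Catalan, Occitan, Walloon, Breton.
RELATED = {"nl", "ca", "oc", "wa", "br"}
PREFIX = "__label__"  # fastText prefixes predicted labels this way


def lang_group(label: str, confidence: float, threshold: float = 0.5) -> str:
    """Map a fastText label such as '__label__fr' to a lang_group value."""
    code = label[len(PREFIX):] if label.startswith(PREFIX) else label
    if confidence < threshold:  # assumption: low confidence overrides the language
        return "low_conf"
    if code == "fr":
        return "target"
    if code == "en":
        return "en"
    if code in RELATED:
        return "related"
    return "other"


print(lang_group("__label__fr", 0.93))
print(lang_group("__label__nl", 0.81))
print(lang_group("__label__fr", 0.30))
```

Under these assumptions, a short or code-switched comment that fastText labels as French with a score below 0.5 lands in low_conf rather than target, so filtering on target alone trades some recall for cleaner French data.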

The corpus is processed with spaCy's fr_core_news_lg (the large French model), using the full pipeline: tokenization, POS tagging, lemmatization, dependency parsing, and named entity recognition.

Key Statistics

Metric	Value
Total Words	236 million
Documents	6,807,097
Subreddits	3
Time Period	2010-2024
Language	French
spaCy Model	fr_core_news_lg

Subreddits Included

Subreddit	Region	Documents	Description
Quebec	Canada	4,847,733	Canadian French, Quebec community
rance	France	837,133	French humor/memes, Metropolitan
Belgium2	Belgium	1,122,231	Multilingual (French/Dutch/English)
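The three subreddits differ considerably in size, which matters when comparing raw frequencies across regions. The sketch below checks that the per-subreddit document counts add up to the reported total and computes each subreddit's share; the variable names are illustrative.

```python
# Document counts copied from the table above.
docs = {"Quebec": 4_847_733, "rance": 837_133, "Belgium2": 1_122_231}

total = sum(docs.values())
assert total == 6_807_097  # matches the "Documents" figure in Key Statistics

# Share of the corpus contributed by each subreddit.
shares = {name: count / total for name, count in docs.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Quebec contributes about 71% of all documents, so hit counts from different subreddits are best compared as relative frequencies rather than raw totals.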

Data Source

The corpus is derived from:

Academic Torrents: Reddit archive dumps via academictorrents.com
fastText Language ID: Comments filtered using lid.176.bin model

Available Attributes

The corpus includes the following searchable attributes:

Token Attributes

Attribute	Description	Example
`word`	Exact word form	[word="et"]
`lemma`	Base form (lemma)	[lemma="avoir"]
`tag`	POS tag (Universal Dependencies)	[tag="NOUN"]
`head_n`	Head position (1-based, 0 for root)	Dependency parsing
`head`	Head lemma	Dependency parsing
`deprel`	Dependency relation (UD)	[deprel="nsubj"]
`ent_type`	Named entity type (PER, ORG, LOC, MISC)	[ent_type="PER"]
`ent_iob`	NER IOB tag (B=begin, I=inside, O=outside)	[ent_iob="B"]
`morph`	Morphological features (UD FEATS format)	[morph=".*Gender=Masc.*"]

Note: All attributes are fully populated. The corpus was processed with spaCy’s fr_core_news_lg model with full dependency parsing and named entity recognition.

Document Attributes

Attribute	Description	Example
`doc.id`	Document ID	Filter by specific post
`doc.subreddit`	Subreddit name	Filter: Quebec, rance, Belgium2
`doc.author`	Reddit username	Filter by author
`doc.date`	Publication date	Filter by date range
`doc.year`	Publication year	Filter by year
`doc.doc_type`	Type: comment or submission	Filter by type
`doc.lang`	Detected language code	Filter by language
`doc.lang_group`	Language group (target/en/etc.)	Filter French-only content

CQL Query Examples

Basic Word Search

Find all occurrences of a word:

[lemma="avoir"]

Find exact word form:

[word="bonjour"]

Named Entity Recognition

spaCy's French models use these NER labels: PER, ORG, LOC, MISC

Find all person names:

[ent_type="PER"]

Find all organizations:

[ent_type="ORG"]

Find all locations:

[ent_type="LOC"]
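A query on ent_type alone matches every token inside an entity, so a two-word name yields two hits. The ent_iob attribute marks where each entity starts. The following sketch shows how B/I/O tags group tokens into entity spans; the sample tokens are hand-written for illustration and the function name is hypothetical.

```python
def entity_spans(tokens):
    """Group (word, ent_iob, ent_type) triples into (text, type) entity spans."""
    spans, current, current_type = [], [], None
    for word, iob, ent_type in tokens:
        if iob == "B" or (iob == "I" and not current):
            # A new entity starts; close any entity that is still open.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [word], ent_type
        elif iob == "I":
            current.append(word)  # continue the open entity
        else:
            # "O": outside any entity, so close the open one if there is one.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


# Illustrative tokens, not taken from the corpus.
tokens = [
    ("Justin", "B", "PER"), ("Trudeau", "I", "PER"),
    ("visite", "O", ""), ("Montréal", "B", "LOC"),
]
print(entity_spans(tokens))
```

In CQL, the same idea suggests counting entities rather than tokens with [ent_type="PER" & ent_iob="B"], which matches only the first token of each person name.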

Part-of-Speech Queries

Find all nouns (Universal Dependencies tagset):

[tag="NOUN"]

Find all verbs:

[tag="VERB"]

Find all adjectives:

[tag="ADJ"]

Morphological Features

Find masculine nouns:

[tag="NOUN" & morph=".*Gender=Masc.*"]

Find plural nouns:

[tag="NOUN" & morph=".*Number=Plur.*"]

Find past tense verbs:

[tag="VERB" & morph=".*Tense=Past.*"]

Find feminine adjectives:

[tag="ADJ" & morph=".*Gender=Fem.*"]
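The morph attribute holds one string with all of a token's features in UD FEATS format, for example Gender=Masc|Number=Sing, so a single feature has to be matched with wildcards on both sides. The sketch below mimics this with Python's re.fullmatch, assuming the CQL backend matches a pattern against the whole attribute value, as CQL engines commonly do. The helper name and the sample value are illustrative.

```python
import re


def morph_matches(pattern: str, feats: str) -> bool:
    """Whole-value regex match (an assumption about how the backend matches)."""
    return re.fullmatch(pattern, feats) is not None


feats = "Gender=Masc|Number=Sing"  # illustrative FEATS value for a masculine singular noun

print(morph_matches(".*Gender=Masc.*", feats))   # wildcards on both sides
print(morph_matches("Gender=Masc", feats))       # bare feature, no wildcards
print(morph_matches(".*Number=Plur.*", feats))   # feature the token does not have
```

To require two features at once, combining two conditions, as in [morph=".*Gender=Fem.*" & morph=".*Number=Plur.*"], avoids depending on the order in which features appear in the string.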

Dependency Relations

spaCy's French models use Universal Dependencies labels.

Find subjects:

[deprel="nsubj"]

Find direct objects:

[deprel="obj"]

Find indirect objects:

[deprel="iobj"]
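Each token also carries head_n, the 1-based position of its syntactic head, with 0 marking the root. The sketch below resolves head_n to the head word for a hand-written example sentence. It assumes positions are counted within the sentence; the annotation values are illustrative, not corpus output.

```python
# Illustrative annotation of "Je mange une pomme"; values are hand-written.
sentence = [
    {"word": "Je", "head_n": 2, "deprel": "nsubj"},
    {"word": "mange", "head_n": 0, "deprel": "ROOT"},
    {"word": "une", "head_n": 4, "deprel": "det"},
    {"word": "pomme", "head_n": 2, "deprel": "obj"},
]


def head_word(sentence, index):
    """Return the head word of the token at 0-based `index`, or None for the root."""
    head_n = sentence[index]["head_n"]
    if head_n == 0:  # 0 marks the root, which has no head
        return None
    return sentence[head_n - 1]["word"]  # head_n is 1-based


for i, tok in enumerate(sentence):
    print(tok["word"], tok["deprel"], head_word(sentence, i))
```

Because the head attribute stores the head's lemma, a dependency relation can be combined with a specific governor directly in CQL, for example [deprel="obj" & head="manger"] for direct objects of manger.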

Combined Queries

Find a verb followed by a noun with at most three tokens in between:

[tag="VERB"] []{0,3} [tag="NOUN"]

Find adjective + noun combinations:

[tag="ADJ"] [tag="NOUN"]
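When many similar queries are needed, for instance in a script that loops over lemmas, the query strings can be assembled programmatically. The helpers below are a hypothetical sketch, not part of any corpus tooling; they do no escaping, so attribute values containing double quotes would need extra handling.

```python
def token(**attrs):
    """Build one CQL token constraint, e.g. token(tag="NOUN") gives [tag="NOUN"]."""
    if not attrs:
        return "[]"  # unconstrained token
    parts = [f'{name}="{value}"' for name, value in attrs.items()]
    return "[" + " & ".join(parts) + "]"


def gap(low, high):
    """Between `low` and `high` unconstrained tokens, e.g. gap(0, 3) gives []{0,3}."""
    return f"[]{{{low},{high}}}"


def sequence(*parts):
    """Join token constraints into one query."""
    return " ".join(parts)


print(sequence(token(tag="VERB"), gap(0, 3), token(tag="NOUN")))
print(sequence(token(tag="ADJ"), token(tag="NOUN")))
print(token(tag="NOUN", morph=".*Number=Plur.*"))
```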

Filter by Subreddit

Search within a specific subreddit:

Use the metadata filter to select a subreddit (e.g., Quebec)
Run your query

Filter by Language Group

To search only French content:

Use the metadata filter to select lang_group = target
Run your query

Regional Variation Studies

The corpus enables comparative studies across francophone regions:

Region	Subreddit	Features
Quebec	Quebec	Canadian French, Quebec idioms
France	rance	Metropolitan French, slang
Belgium	Belgium2	Belgian French, Dutch influences

Example research questions:

Lexical differences between Quebec and Metropolitan French
Use of anglicisms across regions
Regional discourse markers and expressions

Programmatic Access (API)

Query the corpus programmatically via the REST API:

Example Request (curl)

curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"avoir\"]",
    "corpus_id": "fRed",
    "limit": 50
  }'
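The same request can be built in Python with the standard library. This sketch mirrors the curl call field by field; the function name is hypothetical, and since the response format is not documented here, the code only constructs the request and leaves sending it as a commented-out step.

```python
import json
import urllib.request


def build_request(cql, corpus_id="fRed", limit=50, base_url="http://localhost:8000"):
    """Build a POST request equivalent to the curl example."""
    payload = {"cql": cql, "corpus_id": corpus_id, "limit": limit}
    return urllib.request.Request(
        f"{base_url}/cql/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request('[lemma="avoir"]')
print(request.get_method(), request.full_url)

# Sending requires the server from the curl example to be running:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)  # assumes a JSON body; the schema is not documented here
```

Building the payload with json.dumps takes care of escaping the double quotes inside the CQL string, which the curl version has to do by hand.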

Quick Reference

Goal	CQL Query
Exact word	[word="bonjour"]
All forms (lemma)	[lemma="avoir"]
Adjacent words	[lemma="je"] [lemma="penser"]
Person names	[ent_type="PER"]
Organizations	[ent_type="ORG"]
Locations	[ent_type="LOC"]
Nouns	[tag="NOUN"]
Verbs	[tag="VERB"]
Adjectives	[tag="ADJ"]
Subjects	[deprel="nsubj"]
Direct objects	[deprel="obj"]
Masculine nouns	[tag="NOUN" & morph=".*Gender=Masc.*"]

References

Academic Torrents Reddit Archive: academictorrents.com
fastText Language ID: fasttext.cc
spaCy French Model: fr_core_news_lg
Universal Dependencies Tagset: universaldependencies.org