fRed Corpus

French Reddit Corpus (2010-2024)

Note: Current Status

French Reddit corpus: 3 subreddits with 236M words, 6.8M documents. Full dependency parsing and NER are enabled.

Corpus Overview

The fRed (French Reddit Corpus) contains French-language posts and comments from Reddit, providing access to contemporary informal written French discourse from online communities across different francophone regions.

The corpus includes data from Quebec (Canadian French), France (Metropolitan French), and Belgium (Belgian French), enabling regional variation studies.

Language detection uses fastText (lid.176.bin model) with a lang_group field for easy filtering:

  • target: French content (corpus focus)
  • en: English content
  • related: Related languages (Dutch, Catalan, Occitan, Walloon, Breton)
  • other: Other languages
  • low_conf: Low confidence detections (<0.5)
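The grouping above can be sketched as a small mapping from a fastText prediction to a lang_group value. This is an illustration under assumptions, not the corpus's actual code: the function name, the ISO codes chosen for the "related" group, and the order of the checks are all hypothetical. Only the group names and the 0.5 threshold come from the list above.

```python
# Assumed ISO 639 codes for the "related" group (Dutch, Catalan, Occitan,
# Walloon, Breton); the list used when building the corpus may differ.
RELATED = {"nl", "ca", "oc", "wa", "br"}


def lang_group(label: str, confidence: float, threshold: float = 0.5) -> str:
    """Map one fastText prediction to a lang_group value."""
    # fastText language-ID labels look like "__label__fr".
    code = label.replace("__label__", "", 1)
    if confidence < threshold:
        return "low_conf"
    if code == "fr":
        return "target"
    if code == "en":
        return "en"
    if code in RELATED:
        return "related"
    return "other"


print(lang_group("__label__fr", 0.93))
print(lang_group("__label__nl", 0.80))
print(lang_group("__label__fr", 0.30))
```

In this sketch the confidence check comes first, so a low-confidence French prediction lands in low_conf rather than target.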

The corpus is processed with the full spaCy fr_core_news_lg (large French model) pipeline: tokenization, POS tagging, lemmatization, dependency parsing, and named entity recognition.

Key Statistics

Metric        Value
Total Words   236 million
Documents     6,807,097
Subreddits    3
Time Period   2010-2024
Language      French
spaCy Model   fr_core_news_lg

Subreddits Included

Subreddit   Region    Documents   Description
Quebec      Canada    4,847,733   Canadian French, Quebec community
rance       France      837,133   French humor/memes, Metropolitan
Belgium2    Belgium   1,122,231   Multilingual (French/Dutch/English)

Data Source

The corpus is derived from:

  • Academic Torrents: Reddit archive dumps via academictorrents.com
  • fastText Language ID: Comments filtered using lid.176.bin model


Available Attributes

The corpus includes the following searchable attributes:

Token Attributes

Attribute   Description                                  Example
word        Exact word form                              [word="et"]
lemma       Base form (lemma)                            [lemma="avoir"]
tag         POS tag (Universal Dependencies)             [tag="NOUN"]
head_n      Head position (1-based, 0 for root)          Dependency parsing
head        Head lemma                                   Dependency parsing
deprel      Dependency relation (UD)                     [deprel="nsubj"]
ent_type    Named entity type (PER, ORG, LOC, MISC)      [ent_type="PER"]
ent_iob     NER IOB tag (B=begin, I=inside, O=outside)   [ent_iob="B"]
morph       Morphological features (UD FEATS format)     [morph=".*Gender=Masc.*"]

Note: All attributes are fully populated. The corpus was processed with spaCy’s fr_core_news_lg model with full dependency parsing and named entity recognition.

Document Attributes

Attribute        Description                       Example
doc.id           Document ID                       Filter by specific post
doc.subreddit    Subreddit name                    Filter: Quebec, rance, Belgium2
doc.author       Reddit username                   Filter by author
doc.date         Publication date                  Filter by date range
doc.year         Publication year                  Filter by year
doc.doc_type     Type: comment or submission       Filter by type
doc.lang         Detected language code            Filter by language
doc.lang_group   Language group (target/en/etc.)   Filter French-only content
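As a rough illustration of where these document attributes could come from, the sketch below derives them from a single raw Reddit JSON record. Everything here is an assumption: the field names (id, subreddit, author, created_utc, body) follow the usual layout of Reddit archive dumps, the rule for telling comments from submissions is a guess, and the sample record is invented. The corpus's real conversion step is not documented here.

```python
from datetime import datetime, timezone


def doc_attributes(record: dict) -> dict:
    """Derive document attributes from one raw Reddit JSON record."""
    created = datetime.fromtimestamp(int(record["created_utc"]), tz=timezone.utc)
    # Assumption: comments carry a "body" field, submissions do not.
    doc_type = "comment" if "body" in record else "submission"
    return {
        "id": record["id"],
        "subreddit": record["subreddit"],
        "author": record["author"],
        "date": created.date().isoformat(),
        "year": created.year,
        "doc_type": doc_type,
    }


# Hypothetical record, for illustration only.
sample = {
    "id": "abc123",
    "subreddit": "Quebec",
    "author": "example_user",
    "created_utc": 1577836800,
    "body": "Bonne année!",
}
print(doc_attributes(sample))
```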

CQL Query Examples

Named Entity Recognition

The French spaCy models use four NER labels: PER, ORG, LOC, and MISC.

Find all person names:

[ent_type="PER"]

Find all organizations:

[ent_type="ORG"]

Find all locations:

[ent_type="LOC"]
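An [ent_type=...] query matches individual tokens, so a multi-word name produces one hit per token. The ent_iob attribute (B=begin, I=inside, O=outside) allows those tokens to be merged back into whole entities. The following is a minimal sketch; the function name and the sample tokens are hypothetical and do not come from the corpus.

```python
def group_entities(tokens):
    """Merge (word, ent_iob, ent_type) triples into (text, type) entity spans."""
    entities, current, current_type = [], [], None
    for word, iob, ent_type in tokens:
        # A "B" tag, or an "I" tag with a different type, opens a new entity.
        starts_new = iob == "B" or (iob == "I" and ent_type != current_type)
        if current and (iob == "O" or starts_new):
            entities.append((" ".join(current), current_type))
            current, current_type = [], None
        if iob in ("B", "I"):
            current.append(word)
            current_type = ent_type
    if current:
        entities.append((" ".join(current), current_type))
    return entities


# Illustrative tokens only.
tokens = [
    ("Justin", "B", "PER"),
    ("Trudeau", "I", "PER"),
    ("visite", "O", ""),
    ("Montréal", "B", "LOC"),
]
print(group_entities(tokens))
```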

Part-of-Speech Queries

Find all nouns (Universal Dependencies tagset):

[tag="NOUN"]

Find all verbs:

[tag="VERB"]

Find all adjectives:

[tag="ADJ"]

Morphological Features

Find masculine nouns:

[tag="NOUN" & morph=".*Gender=Masc.*"]

Find plural nouns:

[tag="NOUN" & morph=".*Number=Plur.*"]

Find past tense verbs:

[tag="VERB" & morph=".*Tense=Past.*"]

Find feminine adjectives:

[tag="ADJ" & morph=".*Gender=Fem.*"]
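How a morph pattern is compared with the stored value depends on the query engine. The sketch below assumes CQP-style behaviour, where the regular expression must match the entire FEATS string, and imitates it with Python's re.fullmatch. Under that assumption a bare feature name is not enough, because a token usually carries several features separated by "|". The FEATS value is illustrative, and parse_feats is a hypothetical helper.

```python
import re

feats = "Gender=Masc|Number=Sing"  # illustrative UD FEATS string

# Assumption: the engine requires the pattern to match the whole value.
print(bool(re.fullmatch(r"Gender=Masc", feats)))      # bare feature: no match
print(bool(re.fullmatch(r".*Gender=Masc.*", feats)))  # padded pattern: match


def parse_feats(value: str) -> dict:
    """Split a UD FEATS string such as 'Gender=Masc|Number=Sing' into a dict."""
    if value in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in value.split("|"))


print(parse_feats(feats))
```

For post-processing exported results, parsing the string into a dictionary avoids regular expressions altogether.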

Dependency Relations

The French spaCy models use Universal Dependencies labels.

Find subjects:

[deprel="nsubj"]

Find direct objects:

[deprel="obj"]

Find indirect objects:

[deprel="iobj"]
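A deprel query returns the dependent token. To find what it depends on, head_n (1-based, 0 for root) can be resolved against the rest of the sentence. The sketch below works on one sentence of exported rows. The sentence, its annotations, and the helper names are hypothetical, and it assumes head_n counts positions within the sentence.

```python
# One sentence as exported rows; values are illustrative, not corpus output.
sentence = [
    {"word": "Je", "lemma": "je", "tag": "PRON", "head_n": 2, "deprel": "nsubj"},
    {"word": "mange", "lemma": "manger", "tag": "VERB", "head_n": 0, "deprel": "ROOT"},
    {"word": "une", "lemma": "un", "tag": "DET", "head_n": 4, "deprel": "det"},
    {"word": "pomme", "lemma": "pomme", "tag": "NOUN", "head_n": 2, "deprel": "obj"},
]


def head_of(sentence, token):
    """Return the governing token, or None for the root (head_n == 0)."""
    n = token["head_n"]
    return None if n == 0 else sentence[n - 1]  # head_n is 1-based


def pairs(sentence, deprel):
    """List (dependent lemma, head lemma) pairs for one dependency relation."""
    return [
        (t["lemma"], head_of(sentence, t)["lemma"])
        for t in sentence
        if t["deprel"] == deprel and t["head_n"] != 0
    ]


print(pairs(sentence, "nsubj"))  # [('je', 'manger')]
print(pairs(sentence, "obj"))    # [('pomme', 'manger')]
```

The head attribute already stores the head lemma, so for a single hop `[deprel="nsubj" & head="manger"]`-style filtering may be simpler; resolving head_n is mainly useful when other properties of the head are needed.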

Combined Queries

Find verbs followed by a noun, with at most three tokens in between:

[tag="VERB"] []{0,3} [tag="NOUN"]

Find adjective + noun combinations:

[tag="ADJ"] [tag="NOUN"]
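To make the window in `[tag="VERB"] []{0,3} [tag="NOUN"]` concrete, the sketch below applies the same pattern to a plain list of POS tags. It is only a model of the pattern's meaning: the function name and the tag sequence are hypothetical, and a real engine may report overlapping matches differently.

```python
def verb_then_noun(tags, max_gap=3):
    """Return (verb_index, noun_index) pairs matching VERB []{0,max_gap} NOUN."""
    hits = []
    for i, tag in enumerate(tags):
        if tag != "VERB":
            continue
        # The noun may sit 1 to max_gap + 1 positions after the verb.
        for j in range(i + 1, min(i + max_gap + 2, len(tags))):
            if tags[j] == "NOUN":
                hits.append((i, j))
    return hits


# Illustrative tag sequence.
tags = ["PRON", "VERB", "DET", "ADJ", "NOUN", "ADP", "NOUN"]
print(verb_then_noun(tags))  # [(1, 4)]
```

The second noun is not matched because four tokens separate it from the verb, one more than `{0,3}` allows.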

Filter by Subreddit

Search within a specific subreddit:

  1. Use the metadata filter to select a subreddit (e.g., Quebec)
  2. Run your query

Filter by Language Group

To search only French content:

  1. Use the metadata filter to select lang_group = target
  2. Run your query

Regional Variation Studies

The corpus enables comparative studies across francophone regions:

Region    Subreddit   Features
Quebec    Quebec      Canadian French, Quebec idioms
France    rance       Metropolitan French, slang
Belgium   Belgium2    Belgian French, Dutch influences

Example research questions:

  • Lexical differences between Quebec and Metropolitan French
  • Use of anglicisms across regions
  • Regional discourse markers and expressions


Programmatic Access (API)

Query the corpus programmatically via the REST API:

Example Request (curl)

curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"avoir\"]",
    "corpus_id": "fRed",
    "limit": 50
  }'
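The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above. The endpoint, payload fields, and default values are taken from that example, while the helper names are hypothetical and the response is assumed to be JSON, which is not confirmed here.

```python
import json
import urllib.request


def build_request(cql, corpus_id="fRed", limit=50,
                  base_url="http://localhost:8000"):
    """Build the POST request for /cql/run without sending it."""
    payload = {"cql": cql, "corpus_id": corpus_id, "limit": limit}
    return urllib.request.Request(
        f"{base_url}/cql/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(cql, **kwargs):
    """Send the query and decode the response (assumed to be JSON)."""
    with urllib.request.urlopen(build_request(cql, **kwargs), timeout=30) as resp:
        return json.load(resp)


# Inspect the request locally; call run_query(...) once the server is running.
req = build_request('[lemma="avoir"]')
print(req.full_url)
print(req.data.decode("utf-8"))
```

Building the payload with json.dumps avoids the manual quote escaping that the shell version needs for CQL strings.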

Quick Reference

Goal                CQL Query
Exact word          [word="bonjour"]
All forms (lemma)   [lemma="avoir"]
Adjacent words      [lemma="je"] [lemma="penser"]
Person names        [ent_type="PER"]
Organizations       [ent_type="ORG"]
Locations           [ent_type="LOC"]
Nouns               [tag="NOUN"]
Verbs               [tag="VERB"]
Adjectives          [tag="ADJ"]
Subjects            [deprel="nsubj"]
Direct objects      [deprel="obj"]
Masculine nouns     [tag="NOUN" & morph=".*Gender=Masc.*"]

References