Stream Corpus

YouTube Streaming Transcripts (2017-2025)

Corpus Overview

The Stream corpus contains transcripts from YouTube streaming channels, providing access to contemporary spoken discourse from online content creators.

The corpus has been processed using spaCy’s large model (en_core_web_lg) for linguistic annotation, including Named Entity Recognition (NER) and morphological features.

Key Statistics

| Metric | Value |
| --- | --- |
| Total Words | ~14 million |
| Time Period | 2017-2025 |
| Channels | 3 (see below) |
| Language | Contemporary English (spoken) |

Channels

| Channel | Description |
| --- | --- |
| contrapoints | Philosophy, politics, culture |
| kelseydangerous | Gaming, commentary |
| moresimsie | Gaming, lifestyle |

Available Attributes

The corpus includes the following searchable attributes:

Token Attributes

| Attribute | Description | Example |
| --- | --- | --- |
| word | Exact word form | [word="the"] |
| lemma | Base form (lemma) | [lemma="run"] |
| tag | POS tag (Penn Treebank) | [tag="NN.*"] |
| head_n | Head position (1-based, 0 for root) | Dependency parsing |
| head | Head lemma | Dependency parsing |
| deprel | Dependency relation | [deprel="nsubj"] |
| ent_type | Named entity type (ORG, PERSON, GPE, etc.) | [ent_type="PERSON"] |
| ent_iob | NER IOB tag (B=begin, I=inside, O=outside) | [ent_iob="B"] |
| morph | Morphological features (UD FEATS format) | [morph=".*Tense=Past.*"] |
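
The ent_iob and ent_type attributes together mark where a multi-word entity starts and continues. The Python sketch below shows how B/I/O tags can be merged back into entity spans. It assumes tokens arrive as (word, ent_iob, ent_type) tuples, which is a hypothetical layout chosen for illustration and not a documented export format; the sample tokens are made up.

```python
from typing import Iterable, List, Tuple

# Hypothetical token layout: (word, ent_iob, ent_type)
Token = Tuple[str, str, str]


def group_entities(tokens: Iterable[Token]) -> List[Tuple[str, str]]:
    """Merge B/I-tagged tokens into (entity_text, ent_type) spans."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    ent_type = ""
    for word, iob, tok_type in tokens:
        if iob == "B" or (iob == "I" and not words):
            # "B" opens a new entity; a stray "I" is treated the same way.
            if words:
                entities.append((" ".join(words), ent_type))
            words, ent_type = [word], tok_type
        elif iob == "I":
            # "I" continues the entity that is currently open.
            words.append(word)
        else:
            # "O" closes any open entity.
            if words:
                entities.append((" ".join(words), ent_type))
            words, ent_type = [], ""
    if words:
        entities.append((" ".join(words), ent_type))
    return entities


# Made-up sample tokens, not taken from the corpus.
sample = [
    ("Ada", "B", "PERSON"),
    ("Lovelace", "I", "PERSON"),
    ("visited", "O", ""),
    ("London", "B", "GPE"),
]
print(group_entities(sample))
# [('Ada Lovelace', 'PERSON'), ('London', 'GPE')]
```

In query terms, this is why [ent_iob="B"] counts each entity once, while [ent_type="PERSON"] alone matches every token inside a name.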

Document Attributes

| Attribute | Description | Example |
| --- | --- | --- |
| doc.id | Document ID | Filter by specific video |
| doc.channel | Channel name | Filter: contrapoints, etc. |
| doc.date | Publication date | Filter by date range |
| doc.year | Publication year | Filter by year |
| doc.title | Video title | Reference metadata |
| doc.url | Video URL | Link to source |
| doc.topic | Topic tag (if available) | Topic filtering |

CQL Query Examples

Named Entity Recognition

Find all person names:

[ent_type="PERSON"]

Find all organizations:

[ent_type="ORG"]

Find all locations (geopolitical entities):

[ent_type="GPE"]

Morphological Features

Find past tense verbs:

[morph=".*Tense=Past.*"]

Find plural nouns:

[tag="NN.*" & morph=".*Number=Plur.*"]

Find past tense of a specific verb:

[lemma="go" & morph=".*Tense=Past.*"]
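
A morph value in UD FEATS format usually carries several features joined by "|", so a pattern for one feature needs .* on both sides. The Python sketch below illustrates this, assuming the query engine matches a regular expression against the whole attribute value (the usual CQL convention, but an assumption here). Python's re module stands in for the engine's regex dialect, and the FEATS string is an illustrative value.

```python
import re


def matches_attribute(pattern: str, value: str) -> bool:
    """Emulate whole-value regex matching (assumed CQL behaviour)."""
    return re.fullmatch(pattern, value) is not None


# Illustrative UD FEATS string for a finite past tense verb.
feats = "Tense=Past|VerbForm=Fin"

print(matches_attribute("Tense=Past", feats))      # False: other features follow
print(matches_attribute(".*Tense=Past.*", feats))  # True: feature found anywhere
print(matches_attribute(".*Number=Plur.*", "Number=Plur"))  # True: .* may be empty
```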

Dependency Relations

Find subjects of the verb "say" (tokens with the relation nsubj whose head lemma is "say"):

[deprel="nsubj" & head="say"]
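
The head_n attribute stores the position of each token's head (1-based, 0 for root). The Python sketch below shows how such positions can be resolved to head words. It assumes positions are counted within the sentence and that tokens are available as (word, head_n, deprel) tuples; both are assumptions made for illustration, and the sentence is made up.

```python
from typing import List, Optional, Tuple

# Hypothetical token layout: (word, head_n, deprel)
Token = Tuple[str, int, str]


def head_word(sentence: List[Token], index: int) -> Optional[str]:
    """Return the head word of the token at 0-based index, or None for the root."""
    head_n = sentence[index][1]
    if head_n == 0:
        return None  # 0 marks the root, which has no head
    return sentence[head_n - 1][0]  # head_n is 1-based


def subjects_of(sentence: List[Token], verb: str) -> List[str]:
    """Collect nsubj tokens whose head is the given word."""
    return [
        word
        for i, (word, _, deprel) in enumerate(sentence)
        if deprel == "nsubj" and head_word(sentence, i) == verb
    ]


# Made-up sentence: "She said nothing"
sentence = [("She", 2, "nsubj"), ("said", 0, "ROOT"), ("nothing", 2, "dobj")]
print([head_word(sentence, i) for i in range(len(sentence))])
# ['said', None, 'said']
print(subjects_of(sentence, "said"))
# ['She']
```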

Combined Queries

Find past tense verbs followed by a person name, with at most three tokens in between:

[morph=".*Tense=Past.*"] []{0,3} [ent_type="PERSON"]
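
Combined queries like this can also be assembled in code. The Python sketch below uses three hypothetical helpers (token, gap, sequence) that only build strings in the syntax shown on this page. They are not part of any corpus library, and they do not escape quotes inside values.

```python
def token(**attrs: str) -> str:
    """Build one token expression, e.g. [lemma="go" & tag="VBD"]."""
    if not attrs:
        return "[]"  # an empty token matches any word
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conditions}]"


def gap(min_tokens: int, max_tokens: int) -> str:
    """Build a wildcard gap of min_tokens to max_tokens arbitrary tokens."""
    return f"[]{{{min_tokens},{max_tokens}}}"


def sequence(*parts: str) -> str:
    """Join token expressions into a sequence query."""
    return " ".join(parts)


query = sequence(
    token(morph=".*Tense=Past.*"),
    gap(0, 3),
    token(ent_type="PERSON"),
)
print(query)
# [morph=".*Tense=Past.*"] []{0,3} [ent_type="PERSON"]
```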

Filter by Channel

Search within a specific channel:

  1. Use the metadata filter to select a channel (e.g., contrapoints).
  2. Run: [lemma="philosophy"]

Programmatic Access (API)

Query the corpus programmatically via the REST API:

Example Request (curl)

curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"stream\"]",
    "corpus_id": "stream",
    "limit": 50
  }'
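
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the endpoint and JSON fields of the curl example. The response format is not documented on this page, so run_query assumes the server replies with JSON and returns it undecoded into any particular structure.

```python
import json
import urllib.request

# Same endpoint as the curl example; adjust host and port for your deployment.
API_URL = "http://localhost:8000/cql/run"


def build_request(cql: str, corpus_id: str = "stream", limit: int = 50) -> urllib.request.Request:
    """Prepare a POST request carrying the query as a JSON body."""
    payload = {"cql": cql, "corpus_id": corpus_id, "limit": limit}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(cql: str, corpus_id: str = "stream", limit: int = 50):
    """Send the query and parse the reply (assumed, not confirmed, to be JSON)."""
    with urllib.request.urlopen(build_request(cql, corpus_id, limit)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Inspect the request without contacting the server.
    request = build_request('[lemma="stream"]')
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Building the body with json.dumps avoids the manual quote escaping needed in the shell version. Call run_query('[lemma="stream"]') once the API is running.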

Quick Reference

| Goal | CQL Query |
| --- | --- |
| Exact word | [word="stream"] |
| All forms (lemma) | [lemma="stream"] |
| Adjacent words | [lemma="live"] [lemma="stream"] |
| Person names | [ent_type="PERSON"] |
| Organizations | [ent_type="ORG"] |
| Past tense verbs | [morph=".*Tense=Past.*"] |
| Plural nouns | [tag="NN.*" & morph=".*Number=Plur.*"] |