Stream Corpus
YouTube Streaming Transcripts (2017-2025)
Corpus Overview
The Stream corpus contains transcripts from YouTube streaming channels, providing access to contemporary spoken discourse from online content creators.
The corpus has been processed using spaCy’s large model (en_core_web_lg) for linguistic annotation, including Named Entity Recognition (NER) and morphological features.
Key Statistics
| Metric | Value |
|---|---|
| Total Words | ~14 million |
| Time Period | 2017-2025 |
| Channels | 3 (see below) |
| Language | Contemporary English (spoken) |
Channels
| Channel | Description |
|---|---|
| contrapoints | Philosophy, politics, culture |
| kelseydangerous | Gaming, commentary |
| moresimsie | Gaming, lifestyle |
Available Attributes
The corpus includes the following searchable attributes:
Token Attributes
| Attribute | Description | Example |
|---|---|---|
| word | Exact word form | [word="the"] |
| lemma | Base form (lemma) | [lemma="run"] |
| tag | POS tag (Penn Treebank) | [tag="NN.*"] |
| head_n | Head position (1-based, 0 for root) | Dependency parsing |
| head | Head lemma | Dependency parsing |
| deprel | Dependency relation | [deprel="nsubj"] |
| ent_type | Named entity type (ORG, PERSON, GPE, etc.) | [ent_type="PERSON"] |
| ent_iob | NER IOB tag (B=begin, I=inside, O=outside) | [ent_iob="B"] |
| morph | Morphological features (UD FEATS format) | [morph=".*Tense=Past.*"] |
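The morph attribute packs all of a token's features into one string in UD FEATS format, so a query usually needs wildcards around the feature it targets. The Python sketch below illustrates why. It assumes that the query engine matches a pattern against the whole attribute value, as CWB-style CQL engines do; this corpus's engine may differ. The sample value and the helper names (parse_feats, matches_whole_value) are illustrative and not part of the corpus tooling.

```python
import re


def parse_feats(feats: str) -> dict[str, str]:
    """Split a UD FEATS string such as 'Tense=Past|VerbForm=Fin' into a dict."""
    if not feats or feats == "_":  # "_" is the UD convention for "no features"
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def matches_whole_value(pattern: str, value: str) -> bool:
    """Mimic an engine that anchors the regex to the entire value (assumption)."""
    return re.fullmatch(pattern, value) is not None


feats = "Tense=Past|VerbForm=Fin"  # illustrative morph value for a past-tense verb

print(parse_feats(feats))
print(matches_whole_value("Tense=Past", feats))      # bare feature: no match
print(matches_whole_value(".*Tense=Past.*", feats))  # wildcards on both sides: match
```

Under that assumption, a bare feature name only matches tokens whose morph value consists of that single feature, which is rarely what a search intends.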
Document Attributes
| Attribute | Description | Example |
|---|---|---|
| doc.id | Document ID | Filter by specific video |
| doc.channel | Channel name | Filter: contrapoints, etc. |
| doc.date | Publication date | Filter by date range |
| doc.year | Publication year | Filter by year |
| doc.title | Video title | Reference metadata |
| doc.url | Video URL | Link to source |
| doc.topic | Topic tag (if available) | Topic filtering |
CQL Query Examples
Basic Word Search
Find all occurrences of a word:
[word="stream"]
Named Entity Recognition
Find all person names:
[ent_type="PERSON"]
Find all organizations:
[ent_type="ORG"]
Find all locations (geopolitical entities):
[ent_type="GPE"]
Morphological Features
Find past tense verbs:
[morph=".*Tense=Past.*"]
Find plural nouns:
[tag="NN.*" & morph=".*Number=Plur.*"]
Find past tense of a specific verb (here, "run"):
[lemma="run" & morph=".*Tense=Past.*"]
Dependency Relations
Find subjects of verbs:
[deprel="nsubj"]
Combined Queries
Find past tense verbs followed by person names:
[morph=".*Tense=Past.*"] [ent_type="PERSON"]
Filter by Channel
Search within a specific channel:
- Use the metadata filter to select a channel (e.g., contrapoints)
- Run:
[lemma="philosophy"]
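Queries like those in this section follow one pattern: attribute="value" constraints joined by & inside square brackets, with adjacent tokens separated by a space. When queries are generated by a script, a small helper keeps that syntax consistent. The following is a minimal Python sketch; token and sequence are hypothetical helper names, not part of any corpus API, and values containing double quotes are rejected rather than escaped because the engine's escaping rules are not documented here.

```python
def token(**constraints: str) -> str:
    """Build one CQL token expression from attribute=value constraints."""
    parts = []
    for attr, value in constraints.items():
        if '"' in value:
            # Escaping rules are engine-specific, so refuse instead of guessing.
            raise ValueError(f"double quote not supported in value: {value!r}")
        parts.append(f'{attr}="{value}"')
    return "[" + " & ".join(parts) + "]"


def sequence(*tokens: str) -> str:
    """Join token expressions so that they must match adjacent tokens."""
    return " ".join(tokens)


print(token(lemma="philosophy"))
print(token(tag="NN.*", morph=".*Number=Plur.*"))
print(sequence(token(lemma="live"), token(lemma="stream")))
```

Keyword arguments keep their order, so the constraints appear in the query in the order they are written in the call.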
Programmatic Access (API)
Query the corpus programmatically via the REST API:
Example Request (curl)
curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"stream\"]",
    "corpus_id": "stream",
    "limit": 50
  }'
Quick Reference
| Goal | CQL Query |
|---|---|
| Exact word | [word="stream"] |
| All forms (lemma) | [lemma="stream"] |
| Adjacent words | [lemma="live"] [lemma="stream"] |
| Person names | [ent_type="PERSON"] |
| Organizations | [ent_type="ORG"] |
| Past tense verbs | [morph=".*Tense=Past.*"] |
| Plural nouns | [tag="NN.*" & morph=".*Number=Plur.*"] |
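For scripted use, the curl request shown under Programmatic Access can be mirrored in Python with the standard library alone. The sketch below reuses the endpoint and the three JSON fields (cql, corpus_id, limit) from that request. The function names are illustrative, and the response is assumed to be JSON; its schema is not documented here, so run_query returns the parsed body as-is.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/cql/run"  # endpoint taken from the curl example


def build_request(cql: str, corpus_id: str = "stream", limit: int = 50) -> urllib.request.Request:
    """Prepare the POST request without sending it."""
    payload = {"cql": cql, "corpus_id": corpus_id, "limit": limit}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(cql: str, corpus_id: str = "stream", limit: int = 50):
    """Send the query and return the parsed response (assumed to be JSON)."""
    request = build_request(cql, corpus_id, limit)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Inspect the request body that would be sent; no network access is needed here.
request = build_request('[lemma="stream"]')
print(request.data.decode("utf-8"))
```

With the API server running locally, calling run_query('[lemma="stream"]') would send the same request as the curl example. Building the JSON body with json.dumps also avoids the manual quote escaping that the shell version requires.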