UniPlans Corpus

UK University Strategic Plans (2020-2040)

Corpus Overview

The UniPlans corpus contains strategic planning documents from 99 UK universities, providing a window into how higher education institutions articulate their visions, goals, and strategies for the future.

Key Statistics

Metric	Value
Documents	99
Total Words	~334,000
Time Period	2020-2040 strategies
Language	English (British)

Geographic Distribution

Country	Universities
England	~82
Scotland	~14
Wales	~3

Available Metadata

Each document includes the following metadata for filtering and analysis:

university_name: Full name of the institution
country: England, Scotland, or Wales
city: Location of the main campus
strategy_year: Target year of the strategic plan (e.g., 2030)

Corpus Processing

Data Source

99 UK university strategic plan PDFs
Time period: 2020-2040 strategies
Collected from institutional websites

NLP Processing

Processed with spaCy en_core_web_lg large model:

Tokenization and sentence segmentation
Part-of-speech tagging (Penn Treebank)
Lemmatization
Dependency parsing (Universal Dependencies)
Named entity recognition
Morphological feature extraction

Compilation

Converted to NoSketch Engine vertical format:

One token per line with tab-separated attributes
Document structure marked with XML tags
Indexed with Manatee corpus compiler

Getting Started with NoSketch Engine

NoSketch Engine (NoSkE) provides a powerful interface for querying linguistic corpora using CQL.

Accessing the Corpus

Open NoSketch Engine at your local or production URL
From the Dashboard, click Concordance to begin querying
Select UniPlans from the corpus list

The CQL Query Interface

In the Concordance view, switch to the Advanced tab and select CQL as the query type:

Using the Visual CQL Builder

For an interactive query-building experience, click the CQL Builder button to open the visual interface:

The CQL Builder allows you to:

Add tokens with the + button
Configure attributes (lemma, word, tag, etc.)
Preview results as you build your query
See the generated CQL at the top of the screen

CQL Query Basics

Corpus Query Language (CQL) is a powerful query language for searching linguistic corpora. Each query searches for tokens (words) matching specified attributes. See the Sketch Engine CQL documentation for comprehensive reference.

Basic Query Structure

[attribute="value"]

Available Attributes

Attribute	Description	Example	Finds
`word`	Exact word form	[word="partnerships"]	partnerships only
`lemma`	Base form (lemma)	[lemma="partnership"]	partnership, partnerships, Partnership
`pos`	Part-of-speech tag	[pos="NN"]	All singular nouns
`deprel`	Dependency relation	[deprel="nsubj"]	Words that are grammatical subjects
`head`	Head word (lemma)	[head="car"]	Words governed by car like blue in blue car
`ent_type`	Named entity type	[ent_type="ORG"]	University of Edinburgh, NHS, REF
`morph`	Morphological features	[morph="Number=Plur"]	Plural nouns and verbs

Case Study: Analyzing partnership

The term partnership is central to university strategic discourse. Let’s explore how to analyze its usage.

Basic Word Search

Search for an exact word form:

[word="partnership"]

This finds only partnership (singular, lowercase). To find all case variants:

[word="partnership" | word="Partnership" | word="PARTNERSHIP"]

Lemma Search (All Forms)

To find all forms of a word (partnership, partnerships, Partnership, etc.), use lemma search:

[lemma="partnership"]

This returns ~968 hits across the corpus, capturing:

partnership (singular)
partnerships (plural)
Partnership (capitalized)
PARTNERSHIPS (uppercase)

Combining Attributes

Narrow your search by combining attributes with &:

[lemma="partnership" & pos="NNS"]

This finds only plural forms of partnership.

[lemma="partnership" & pos="NN"]

This finds only singular forms.

Note

The pos attribute uses Penn Treebank tags:

NN = singular noun
NNS = plural noun
NNP = proper noun (singular)
JJ = adjective
VB.* = verb forms (VB, VBD, VBG, VBN, VBP, VBZ)

Finding Collocations with Proximity

Search for words appearing together:

Adjacent words:

[lemma="strategic"] [lemma="partnership"]

Finds: strategic partnership, strategic partnerships

Within N words:

[lemma="international"] []{0,2} [lemma="partnership"]

Finds international partnership with 0-2 words in between.

Using Wildcards and Regex

Match patterns with regular expressions:

[word="partner.*"]

Finds: partner, partners, partnership, partnerships, partnering

[lemma="strateg.*"]

Finds: strategy, strategic, strategies, strategically

Tip

Enable the Regex checkbox in the query builder to use pattern matching.

Syntactic Queries (Dependency Relations)

The corpus includes dependency parsing annotations, allowing you to search by grammatical relationships. See the Universal Dependencies relation inventory for the full list.

Attribute	Description	Example	Finds
`word`	Exact word form	[word="partnerships"]	partnerships only
`lemma`	Base form (lemma)	[lemma="partnership"]	partnership, partnerships, Partnership
`pos`	Part-of-speech tag	[pos="NN"]	All singular nouns
`deprel`	Dependency relation	[deprel="nsubj"]	Words that are grammatical subjects
`head`	Head word (lemma)	[head="car"]	Words governed by car like blue in blue car
`ent_type`	Named entity type	[ent_type="ORG"]	University of Edinburgh, NHS, REF
`morph`	Morphological features	[morph="Number=Plur"]	Plural nouns and verbs

Available Syntactic Attributes

Attribute	Description	Example Values
`deprel`	Dependency relation	nsubj, dobj, pobj, amod
`head`	Head word (lemma)	develop, build, create
`head_n`	Head position	Token position number

Grammatical Role Distribution

The distribution of partnership across grammatical roles reveals how universities frame partnerships in their strategic discourse. Use the Frequency feature with the deprel attribute to see this distribution.

Partnership as Grammatical Object

Find partnership when it’s the object of a verb:

[lemma="partnership" & deprel="dobj"]

This finds sentences like: “We will develop partnerships…”

Note

The concordance reveals the action verbs universities use with partnerships: build (34), develop (45), grow, strengthen (13), enhance (9), form, explore. This framing positions partnerships as goals to be achieved through institutional agency, reflecting a proactive, strategic orientation toward external relationships.

[lemma="partnership" & deprel="pobj"]

Finds: …through partnerships, …in partnership with…

Tip

The high frequency of prepositional object usage (38.8%) reflects the prevalence of phrases like in partnership with, through partnerships, and working in partnership in strategic planning discourse.

Partnership as Subject

[lemma="partnership" & deprel="nsubj"]

Finds sentences where partnership is the subject: Partnerships enable us to…

Compound Modifiers

Find what modifies partnership:

[deprel="compound"] [lemma="partnership"]

This finds compound nouns like industry partnership, research partnership.

Morphological Queries

The corpus includes rich morphological annotations in Universal Dependencies format.

The `morph` Attribute

Morphological features are stored as feature-value pairs (e.g., Number=Plur).

Number (Singular vs Plural)

Query singular vs plural forms separately:

[lemma="partnership" & morph="Number=Plur"]

Finds all plural forms: partnerships, Partnerships

[lemma="partnership" & morph="Number=Sing"]

Finds all singular forms: partnership, Partnership

Use the Frequency feature to see the distribution of forms.

Tip

Run these queries and compare frequency counts to see whether universities prefer singular or plural forms.

Verb Tense

Find past tense verbs:

[pos="VB.*" & morph=".Tense=Past."]

Find present tense:

[pos="VB.*" & morph=".Tense=Pres."]

Adjective Degree

Find comparative adjectives:

[pos="JJ.*" & morph=".Degree=Cmp."]

Finds: stronger, better, greater

Find superlative adjectives:

[pos="JJ.*" & morph=".Degree=Sup."]

Finds: strongest, best, greatest

Note

Use .* for regex matching within the morph attribute, as features may contain multiple values separated by |.

Comparative Adjectives in Strategic Discourse

Running the comparative adjective query reveals the evaluative register of strategic planning language:

Note

The high frequency of comparative forms—more (602), higher (285), wider (265), better (168), greater (142)—reveals how universities construct an implicit gap between present and desired future states. Beyond performance metrics (higher, greater), value-laden comparatives like fairer, healthier, greener, and safer indicate the ethical and social dimensions of institutional aspirations.

Named Entity Recognition (NER)

The corpus includes named entity annotations identifying organizations, places, dates, and more. See the spaCy model labels for details on entity types.

Available Entity Types

Type	Description	Examples
`ORG`	Organizations	University of Edinburgh, NHS, REF
`GPE`	Geo-political entities	Scotland, London, UK
`DATE`	Dates and time	2030, next decade
`CARDINAL`	Numbers	50%, 10,000
`MONEY`	Monetary values	£1 million
`PERCENT`	Percentages	25%

Finding All Organizations

[ent_type="ORG"]

This finds all mentions of organizations in the corpus.

Finding Geographic References

[ent_type="GPE"]

Finds country, city, and region names. Use the Frequency feature with the ent_type="GPE" query to see the most frequently mentioned places.

Note

The frequency analysis reveals a strong domestic focus in UK university strategic plans. The top mentions are UK (381), London (157), Wales (78), and Scotland (71). International locations appear much lower: China (#23, 15 mentions), India (#25, 14), Dubai (#36, 12), and Singapore (#38, 12)—suggesting limited explicit international engagement in strategic planning discourse.

Combining NER with Other Attributes

Find when partnership is part of an organization name:

[word="partnership" & ent_type="ORG"]

Find organization names containing specific words:

[ent_type="ORG" & word=".University."]

Use Cases for NER

Identify key partners: Find which organizations are mentioned most frequently
Geographic analysis: Compare international vs domestic references
Temporal analysis: Identify target years and timelines in strategic plans

Collocations Analysis

NoSketch Engine provides a powerful Collocations feature that identifies words that co-occur with your search term more frequently than expected by chance.

Accessing Collocations

After running a query, click Collocations in the left sidebar to see statistically significant collocates.

Top Collocates of partnership

The screenshot above shows the top collocates ranked by logDice score, which measures association strength (0-14 scale, higher = stronger). Scores above 7 indicate significant collocations.

Collocations vs Manual Proximity Queries

While proximity queries ([lemma="X"] []{0,N} [lemma="Y"]) let you search for specific combinations, the Collocations feature automatically discovers patterns you might not have anticipated.

Frequency Analysis

After running a query, use the Frequency panel to understand distributions across different dimensions.

Distribution by Word Form

The frequency breakdown shows how your query results distribute across word forms, POS tags, and other attributes:

Distribution by Metadata

Analyze how terms are distributed across document metadata:

The distribution roughly corresponds to the number of universities in each nation.

Analyzing Distributions

Countries: Compare usage in English vs Scottish vs Welsh universities
Strategy years: See temporal patterns (2025 vs 2030 vs 2040 targets)
Universities: Identify which institutions use certain terms most

Export Options

Export your results for further analysis:

CSV: For spreadsheet analysis
XLSX: For Excel with formatting

Filtering by Metadata

Use metadata filters to create subcorpora for focused analysis.

Filter by Country

Compare “partnership” usage across the UK nations:

Select England to see only English universities
Select Scotland to see only Scottish universities
Select Wales to see only Welsh universities

Filter by Strategy Year

Analyze how discourse changes based on target dates:

Universities targeting 2025 vs 2030 vs 2040

Semantic Analysis

Semantic analysis clusters concordance lines by semantic similarity, helping you discover usage patterns within a query.

For a full feature overview, see Semantic Analysis.

This walkthrough demonstrates semantic analysis using the UniPlans corpus of UK university strategic plans.

Step 1: Set up the query

Select UniPlans (383K tokens) from the corpus dropdown.
In the query builder, set:
- Attribute: lemma
- Operator: ==
- Value: partnership
Click Run builder query.

Query setup with UniPlans corpus and partnership lemma search

The query returns roughly 900 matches for forms of “partnership” across university strategic plans.

Step 2: Configure and run analysis

Scroll to the Semantic analysis panel.
Select sample size (e.g., 200 hits for a manageable analysis).
Choose visualization method (t-SNE or UMAP).
Click Analyze.

Semantic analysis panel with sample size and method selection

The analysis embeds each concordance line, reduces dimensions, and clusters automatically.

Step 3: Interpret the scatter plot

After analysis completes, you’ll see:

A scatter plot where each point represents one concordance line
Points colored by cluster assignment
Cluster buttons showing the size of each cluster
Silhouette score indicating cluster quality (higher = better separation)

Scatter plot showing semantic clusters with typical examples

In this example, the analysis detected 3 clusters. The typical examples help interpret what distinguishes each cluster.

Step 4: Filter by cluster

Click a cluster button to filter the concordance table to only show lines from that cluster.

Cluster filter applied to highlight a single cluster

Filtered concordance table showing only Cluster 1 results

This makes it easy to:

Read through examples from a specific semantic grouping
Identify patterns in how “partnership” is used in different contexts
Export filtered results for further analysis

Interpreting results

Semantic clusters represent usage similarity, not predefined categories. When interpreting:

Look at the typical examples shown for each cluster
Read several concordances from each cluster to understand the pattern
Consider what linguistic or contextual features distinguish clusters
Use the cluster slider to explore different numbers of clusters

For “partnership” in UniPlans, the clusters separate common usage patterns such as:

Local/regional partnerships: community stakeholders, regional employers, local councils
Global/international partnerships: worldwide networks, international collaborations, alumni connections
Institutional partnerships: academic collaborations, internal initiatives, cross-campus programs

LLM Classification

LLM classification lets you label concordance samples with custom variables for targeted linguistic analysis.

For a full feature overview, see LLM Classification.

This walkthrough demonstrates LLM classification using the UniPlans corpus to analyze the grammatical subjects of “partnership” constructions.

Step 1: Run a query

First, run a CQL query to get concordance results:

Select UniPlans (383K tokens) from the corpus dropdown.
Query for [lemma=="partnership"].
Click Run builder query.

Step 2: Configure variables

Scroll to the LLM classification panel. You’ll see:

Sample size: how many concordances to classify (5-100)
Estimated cost: calculated based on variables and sample size
Preset variable buttons: Subject animacy, Transitivity, Topic, Illocution

LLM classification panel with preset variable buttons

Open the sample size dropdown to pick how many lines to classify:

Sample size options in the classification panel

Click + Subject animacy to add the preset. The variable card appears with:

Title: Subject animacy
Description: Classification criteria with examples
Type: Fixed set of options (enum)
Options: animate, inanimate

Variable configuration showing description and enum options

The description provides guidance to the LLM:

Classify the grammatical subject’s animacy. animate = humans, animals, or organizations acting as agents; inanimate = objects, abstract concepts, events, or natural phenomena.

Examples help calibrate the model:

“The child laughed loudly.” -> animate
“The committee approved the proposal.” -> animate
“The storm destroyed the docks.” -> inanimate

Step 3: Run classification

Select sample size (e.g., 5 lines for a quick test).
Review the estimated cost.
Click Classify sample.

The model classifies each concordance line and returns results.

Step 4: Review results

Results appear in a table with your variable columns:

Classification progress indicator while the job runs

Classification results showing concordances with Subject animacy annotations

Each row shows:

Left context: words before the keyword
KWIC: the keyword in context (partnerships)
Right context: words after the keyword
Subject animacy: the LLM’s classification

In this example, most “partnership” usages have animate subjects (organizations, institutions, “we”) because the corpus discusses what universities do with partnerships.

Step 5: Export results

Click Export CSV to download results for further analysis in Excel, R, or Python.

Programmatic Access (API)

For advanced users, the corpus can be queried programmatically via the REST API.

API Endpoint

POST /cql/run

Example Request (curl)

curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"partnership\"]",
    "corpus_id": "uniplan",
    "limit": 50
  }'

Example Request (Python)

import requests

response = requests.post(
    "http://localhost:8000/cql/run",
    json={
        "cql": '[lemma="partnership"]',
        "corpus_id": "uniplan",
        "limit": 50
    }
)

results = response.json()
for hit in results["hits"]:
    print(hit["kwic"])

Response Format

{
  "total": 968,
  "hits": [
    {
      "text_id": 1,
      "position": 245,
      "left": "building international",
      "kwic": "partnerships",
      "right": "with other universities",
      "metadata": {
        "university_name": "University of Aberdeen",
        "country": "Scotland"
      }
    }
  ]
}

When to Use the API

Batch processing: Analyze multiple queries programmatically
Data pipelines: Integrate corpus data into research workflows
Custom visualizations: Build specialized analysis tools

Quick Reference

Common Query Patterns

Goal	CQL Query
Exact word	[word="partnership"]
All forms (lemma)	[lemma="partnership"]
Plural nouns only	[lemma="X" & pos="NNS"]
Adjacent words	[lemma="X"] [lemma="Y"]
Words within 3	[lemma="X"] []{0,3} [lemma="Y"]
As direct object	[lemma="X" & deprel="dobj"]
As subject	[lemma="X" & deprel="nsubj"]
Organizations	[ent_type="ORG"]
Places	[ent_type="GPE"]
Plural forms	[lemma="X" & morph="Number=Plur"]
Regex match	[word="partner.*"]

Penn Treebank POS Tags

Tag	Description	Examples
NN	Singular noun	partnership, university
NNS	Plural noun	partnerships, universities
NNP	Proper noun	Edinburgh, Scotland
JJ	Adjective	strategic, international
VB	Verb (base)	develop, create
VBG	Verb (gerund)	developing, creating
VBD	Verb (past)	developed, created
RB	Adverb	significantly, globally

Dependency Relations

Relation	Description
nsubj	Nominal subject
dobj	Direct object
pobj	Prepositional object
amod	Adjectival modifier
compound	Compound word
conj	Conjunction

Corpus Overview

Key Statistics

Geographic Distribution

Available Metadata

Corpus Processing

Data Source

NLP Processing

Compilation

Getting Started with NoSketch Engine

Accessing the Corpus

The CQL Query Interface

Using the Visual CQL Builder

CQL Query Basics

Basic Query Structure

Available Attributes

Case Study: Analyzing partnership

Basic Word Search

Lemma Search (All Forms)

Combining Attributes

Finding Collocations with Proximity

Using Wildcards and Regex

Syntactic Queries (Dependency Relations)

Available Syntactic Attributes

Grammatical Role Distribution

Partnership as Grammatical Object

Partnership as Subject

Compound Modifiers

Morphological Queries

The morph Attribute

Number (Singular vs Plural)

Verb Tense

Adjective Degree

Comparative Adjectives in Strategic Discourse

Named Entity Recognition (NER)

Available Entity Types

Finding All Organizations

Finding Geographic References

Combining NER with Other Attributes

Use Cases for NER

Collocations Analysis

Accessing Collocations

Top Collocates of partnership

Collocations vs Manual Proximity Queries

Frequency Analysis

Distribution by Word Form

Distribution by Metadata

Analyzing Distributions

Export Options

Filtering by Metadata

Filter by Country

Filter by Strategy Year

Semantic Analysis

Step 1: Set up the query

Step 2: Configure and run analysis

Step 3: Interpret the scatter plot

Step 4: Filter by cluster

Interpreting results

LLM Classification

Step 1: Run a query

Step 2: Configure variables

Step 3: Run classification

Step 4: Review results

Step 5: Export results

Programmatic Access (API)

API Endpoint

Example Request (curl)

Example Request (Python)

Response Format

When to Use the API

Quick Reference

Common Query Patterns

Penn Treebank POS Tags

Dependency Relations

The `morph` Attribute