UniPlans Corpus

UK University Strategic Plans (2020-2040)

Corpus Overview

The UniPlans corpus contains strategic planning documents from 99 UK universities, providing a window into how higher education institutions articulate their visions, goals, and strategies for the future.

Key Statistics

Metric        Value
Documents     99
Total Words   ~334,000
Time Period   2020-2040 strategies
Language      English (British)

Geographic Distribution

Country    Universities
England    ~82
Scotland   ~14
Wales      ~3

Available Metadata

Each document includes the following metadata for filtering and analysis:

  • university_name: Full name of the institution
  • country: England, Scotland, or Wales
  • city: Location of the main campus
  • strategy_year: Target year of the strategic plan (e.g., 2030)

Corpus Processing

Data Source

  • 99 UK university strategic plan PDFs
  • Time period: 2020-2040 strategies
  • Collected from institutional websites

NLP Processing

Processed with the spaCy en_core_web_lg model (the large English pipeline):

  • Tokenization and sentence segmentation
  • Part-of-speech tagging (Penn Treebank)
  • Lemmatization
  • Dependency parsing (spaCy English dependency labels, e.g. nsubj, dobj, pobj)
  • Named entity recognition
  • Morphological feature extraction

Compilation

Converted to NoSketch Engine vertical format:

  • One token per line with tab-separated attributes
  • Document structure marked with XML tags
  • Indexed with Manatee corpus compiler
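The vertical format described above can be illustrated with a minimal sketch. The structure names (doc, s), the attribute order (word, lemma, pos, deprel), and the helper name to_vertical are assumptions made for illustration; the corpus's actual column layout is defined in its own configuration, and the tokens below are hand-annotated rather than real pipeline output.

```python
from xml.sax.saxutils import quoteattr


def to_vertical(doc_meta, sentences):
    """Render one document as vertical text.

    doc_meta:  dict of document-level metadata, written as <doc ...> attributes.
    sentences: list of sentences, each a list of
               (word, lemma, pos, deprel) tuples (assumed column order).
    """
    attrs = " ".join(f"{key}={quoteattr(str(value))}" for key, value in doc_meta.items())
    lines = [f"<doc {attrs}>"]
    for sentence in sentences:
        lines.append("<s>")
        for token in sentence:
            # One token per line, attributes separated by tabs
            lines.append("\t".join(token))
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines)


# Hand-annotated toy sentence, not actual corpus content
sample = to_vertical(
    {"university_name": "University of Aberdeen", "country": "Scotland"},
    [[("We", "we", "PRP", "nsubj"),
      ("develop", "develop", "VBP", "ROOT"),
      ("partnerships", "partnership", "NNS", "dobj")]],
)
print(sample)
```

Each tab-separated column becomes a queryable attribute after compilation, which is why a query such as [lemma="partnership"] can address the second column directly.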

Getting Started with NoSketch Engine

NoSketch Engine (NoSkE) provides a powerful interface for querying linguistic corpora using CQL.

Accessing the Corpus

  1. Open NoSketch Engine at your local or production URL
  2. From the Dashboard, click Concordance to begin querying
  3. Select UniPlans from the corpus list

NoSkE Dashboard

The CQL Query Interface

In the Concordance view, switch to the Advanced tab and select CQL as the query type:

CQL Interface

Using the Visual CQL Builder

For an interactive query-building experience, click the CQL Builder button to open the visual interface:

CQL Builder

The CQL Builder allows you to:

  • Add tokens with the + button
  • Configure attributes (lemma, word, tag, etc.)
  • Preview results as you build your query
  • See the generated CQL at the top of the screen

CQL Builder with results

CQL Query Basics

Corpus Query Language (CQL) is a powerful query language for searching linguistic corpora. Each query searches for tokens (words) matching the specified attributes. See the Sketch Engine CQL documentation for a comprehensive reference.

Basic Query Structure

[attribute="value"]

Available Attributes

Attribute   Description              Example                 Finds
word        Exact word form          [word="partnerships"]   partnerships only
lemma       Base form (lemma)        [lemma="partnership"]   partnership, partnerships, Partnership
pos         Part-of-speech tag       [pos="NN"]              All singular nouns
deprel      Dependency relation      [deprel="nsubj"]        Words that are grammatical subjects
head        Head word (lemma)        [head="car"]            Words governed by car, like blue in blue car
ent_type    Named entity type        [ent_type="ORG"]        University of Edinburgh, NHS, REF
morph       Morphological features   [morph="Number=Plur"]   Plural nouns and verbs

Case Study: Analyzing partnership

The term partnership is central to university strategic discourse. Let’s explore how to analyze its usage.

Lemma Search (All Forms)

To find all forms of a word (partnership, partnerships, Partnership, etc.), use lemma search:

[lemma="partnership"]

This returns ~968 hits across the corpus, capturing:

  • partnership (singular)
  • partnerships (plural)
  • Partnership (capitalized)
  • PARTNERSHIPS (uppercase)

Lemma search results

Combining Attributes

Narrow your search by combining attributes with &:

[lemma="partnership" & pos="NNS"]

This finds only plural forms of partnership.

[lemma="partnership" & pos="NN"]

This finds only singular forms.
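When queries are generated in a script rather than typed by hand, the attribute combinations above can be assembled with a small helper. This is only a sketch: cql_token is a hypothetical name, not part of NoSketch Engine, and it handles nothing beyond joining attribute tests with & and escaping double quotes.

```python
def cql_token(**attributes):
    """Build one CQL token expression from attribute=value pairs.

    cql_token(lemma="partnership", pos="NNS")
    returns '[lemma="partnership" & pos="NNS"]'.
    """
    if not attributes:
        return "[]"  # an empty pair of brackets matches any token
    parts = []
    for name, value in attributes.items():
        escaped = value.replace('"', '\\"')  # keep quotes inside values legal
        parts.append(f'{name}="{escaped}"')
    return "[" + " & ".join(parts) + "]"


plural_query = cql_token(lemma="partnership", pos="NNS")
singular_query = cql_token(lemma="partnership", pos="NN")
print(plural_query)
print(singular_query)
```

Running both queries and comparing the hit counts gives the singular/plural split for the lemma.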

Note

The pos attribute uses Penn Treebank tags:

  • NN = singular noun
  • NNS = plural noun
  • NNP = proper noun (singular)
  • JJ = adjective
  • VB.* = verb forms (VB, VBD, VBG, VBN, VBP, VBZ)

Finding Collocations with Proximity

Search for words appearing together:

Adjacent words:

[lemma="strategic"] [lemma="partnership"]

Finds: strategic partnership, strategic partnerships

Within N words:

[lemma="international"] []{0,2} [lemma="partnership"]

Finds international partnership with 0-2 words in between.
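To make the meaning of []{0,2} concrete, the following sketch emulates the proximity query on a plain list of lemmas. It is an illustration of the matching logic only, not how Manatee evaluates queries; the function name and the toy lemma list are made up for this example.

```python
def proximity_matches(lemmas, first, second, max_gap):
    """Return (start, end) index pairs where `first` is followed by `second`
    with at most `max_gap` tokens in between, mirroring
    [lemma=first] []{0,max_gap} [lemma=second].
    """
    matches = []
    for i, lemma in enumerate(lemmas):
        if lemma != first:
            continue
        # Try every allowed number of intervening tokens: 0, 1, ..., max_gap
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j < len(lemmas) and lemmas[j] == second:
                matches.append((i, j))
    return matches


# Toy sequence of lemmas, not corpus data
lemmas = ["international", "research", "partnership", "and", "strategic", "partnership"]
print(proximity_matches(lemmas, "international", "partnership", 2))
print(proximity_matches(lemmas, "strategic", "partnership", 0))
```

With a gap of 0 the query is equivalent to the adjacent-words pattern shown earlier.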

Using Wildcards and Regex

Match patterns with regular expressions:

[word="partner.*"]

Finds: partner, partners, partnership, partnerships, partnering

[lemma="strateg.*"]

Finds: strategy, strategic, strategies, strategically

Tip

Enable the Regex checkbox in the query builder to use pattern matching.
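A point that often surprises new users is that a CQL regular expression must match the whole attribute value, and that the word attribute is case-sensitive. The sketch below imitates this with Python's re.fullmatch; Python's regex dialect is assumed to be close enough to the engine's for simple patterns like these, and the word list is invented.

```python
import re


def cql_regex_matches(pattern, values):
    """Keep the values that a CQL-style pattern would match.

    CQL patterns are anchored at both ends, so re.fullmatch is used
    instead of re.search.
    """
    return [value for value in values if re.fullmatch(pattern, value)]


# Invented word forms for illustration
words = ["partner", "partners", "partnership", "Partner", "copartner"]
print(cql_regex_matches("partner.*", words))
```

Here Partner is skipped because of its capital letter and copartner because the pattern has no leading .*; a pattern such as ".*partner.*" would be needed to include the latter.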


Syntactic Queries (Dependency Relations)

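The corpus includes dependency parsing annotations, allowing you to search by grammatical relationships. The labels are those produced by the spaCy English parser (for example dobj and pobj), which differ in places from Universal Dependencies v2 (obj, obl). See the spaCy model labels for the full list, and the Universal Dependencies relation inventory for comparison.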

Available Syntactic Attributes

Attribute   Description           Example Values
deprel      Dependency relation   nsubj, dobj, pobj, amod
head        Head word (lemma)     develop, build, create
head_n      Head position         Token position number

Grammatical Role Distribution

The distribution of partnership across grammatical roles reveals how universities frame partnerships in their strategic discourse. Use the Frequency feature with the deprel attribute to see this distribution.

Partnership as Grammatical Object

Find partnership when it’s the object of a verb:

[lemma="partnership" & deprel="dobj"]

This finds sentences like: “We will develop partnerships…”

Direct object query results

Note

The concordance reveals the action verbs universities use with partnerships: develop (45), build (34), strengthen (13), and enhance (9), alongside grow, form, and explore. This framing positions partnerships as goals to be achieved through institutional agency, reflecting a proactive, strategic orientation toward external relationships.
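Counts like these come from tallying the head attribute of each direct-object hit. The sketch below shows the idea on a handful of hand-annotated tokens; the token dictionaries and the head_verbs helper are illustrative assumptions, and the numbers it prints have nothing to do with the corpus figures quoted above.

```python
from collections import Counter

# Hand-annotated toy tokens for "We will develop partnerships" and
# "We build partnerships" (not corpus output)
tokens = [
    {"word": "We", "lemma": "we", "deprel": "nsubj", "head": "develop"},
    {"word": "will", "lemma": "will", "deprel": "aux", "head": "develop"},
    {"word": "develop", "lemma": "develop", "deprel": "ROOT", "head": "develop"},
    {"word": "partnerships", "lemma": "partnership", "deprel": "dobj", "head": "develop"},
    {"word": "We", "lemma": "we", "deprel": "nsubj", "head": "build"},
    {"word": "build", "lemma": "build", "deprel": "ROOT", "head": "build"},
    {"word": "partnerships", "lemma": "partnership", "deprel": "dobj", "head": "build"},
]


def head_verbs(tokens, lemma, deprel):
    """Count the head lemmas of tokens matching [lemma=... & deprel=...]."""
    return Counter(
        token["head"]
        for token in tokens
        if token["lemma"] == lemma and token["deprel"] == deprel
    )


print(head_verbs(tokens, "partnership", "dobj"))
```

In the interface, the equivalent step is to run the dobj query and then use the Frequency feature on the head attribute.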

Find partnership when it is the object of a preposition:

[lemma="partnership" & deprel="pobj"]

Finds: …through partnerships, …in partnership with…

Tip

The high frequency of prepositional object usage (38.8%) reflects the prevalence of phrases like in partnership with, through partnerships, and working in partnership in strategic planning discourse.

Partnership as Subject

[lemma="partnership" & deprel="nsubj"]

Finds sentences where partnership is the subject: Partnerships enable us to…

Compound Modifiers

Find what modifies partnership:

[deprel="compound"] [lemma="partnership"]

This finds compound nouns like industry partnership, research partnership.


Morphological Queries

The corpus includes rich morphological annotations in Universal Dependencies format.

The morph Attribute

Morphological features are stored as feature-value pairs (e.g., Number=Plur).

Number (Singular vs Plural)

Query singular vs plural forms separately:

[lemma="partnership" & morph="Number=Plur"]

Finds all plural forms: partnerships, Partnerships

[lemma="partnership" & morph="Number=Sing"]

Finds all singular forms: partnership, Partnership

Use the Frequency feature to see the distribution of forms.

Tip

Run these queries and compare frequency counts to see whether universities prefer singular or plural forms.

Verb Tense

Find past tense verbs:

[pos="VB.*" & morph=".*Tense=Past.*"]

Find present tense:

[pos="VB.*" & morph=".*Tense=Pres.*"]

Adjective Degree

Find comparative adjectives:

[pos="JJ.*" & morph=".*Degree=Cmp.*"]

Finds: stronger, better, greater

Find superlative adjectives:

[pos="JJ.*" & morph=".*Degree=Sup.*"]

Finds: strongest, best, greatest

Note

Use .* for regex matching within the morph attribute, as features may contain multiple values separated by |.
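The reason for the surrounding .* becomes clear once a morph value is seen as a |-separated string. The sketch below parses such a string and compares an exact pattern with a wrapped one; the value "Tense=Past|VerbForm=Fin" is the kind of string spaCy assigns to a finite past-tense verb, and parse_morph is a hypothetical helper written for this example.

```python
import re


def parse_morph(morph):
    """Split a feature string such as 'Tense=Past|VerbForm=Fin' into a dict."""
    return dict(pair.split("=", 1) for pair in morph.split("|") if "=" in pair)


morph = "Tense=Past|VerbForm=Fin"
features = parse_morph(morph)
print(features)

# CQL patterns must match the whole value, so the bare feature fails
exact = bool(re.fullmatch("Tense=Past", morph))
wrapped = bool(re.fullmatch(".*Tense=Past.*", morph))
print(exact, wrapped)
```

An exact pattern such as morph="Number=Plur" still works for tokens whose only feature is number, which is typically the case for plural common nouns.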

Comparative Adjectives in Strategic Discourse

Running the comparative adjective query reveals the evaluative register of strategic planning language:

Comparative adjective frequency

Note

The high frequency of comparative forms—more (602), higher (285), wider (265), better (168), greater (142)—reveals how universities construct an implicit gap between present and desired future states. Beyond performance metrics (higher, greater), value-laden comparatives like fairer, healthier, greener, and safer indicate the ethical and social dimensions of institutional aspirations.


Named Entity Recognition (NER)

The corpus includes named entity annotations identifying organizations, places, dates, and more. See the spaCy model labels for details on entity types.

Available Entity Types

Type       Description                            Examples
ORG        Organizations                          University of Edinburgh, NHS, REF
GPE        Geo-political entities                 Scotland, London, UK
DATE       Dates and time periods                 2030, next decade
CARDINAL   Numerals not covered by another type   10,000
MONEY      Monetary values                        £1 million
PERCENT    Percentages                            25%

Finding All Organizations

[ent_type="ORG"]

This finds all mentions of organizations in the corpus.

Finding Geographic References

[ent_type="GPE"]

Finds country, city, and region names. Use the Frequency feature with the ent_type="GPE" query to see the most frequently mentioned places.

Geographic entities frequency

Note

The frequency analysis reveals a strong domestic focus in UK university strategic plans. The top mentions are UK (381), London (157), Wales (78), and Scotland (71). International locations rank much lower: China (#23, 15 mentions), India (#25, 14), Dubai (#36, 12), and Singapore (#38, 12). This suggests that explicit references to international locations are comparatively rare in strategic planning discourse.

Combining NER with Other Attributes

Find when partnership is part of an organization name:

[word="partnership" & ent_type="ORG"]

Find organization names containing specific words:

[ent_type="ORG" & word=".*University.*"]

Use Cases for NER

  1. Identify key partners: Find which organizations are mentioned most frequently
  2. Geographic analysis: Compare international vs domestic references
  3. Temporal analysis: Identify target years and timelines in strategic plans

Collocations Analysis

NoSketch Engine provides a powerful Collocations feature that identifies words that co-occur with your search term more frequently than expected by chance.

Accessing Collocations

After running a query, click Collocations in the left sidebar to see statistically significant collocates.

Collocations for partnership

Top Collocates of partnership

The screenshot above shows the top collocates ranked by logDice score, which measures association strength (theoretical maximum 14; higher = stronger). As a rule of thumb, scores above 7 indicate strong collocations; logDice is an association measure, not a significance test.
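For readers who want to see where the score comes from, logDice is commonly defined as 14 + log2(2 * f_xy / (f_x + f_y)), where f_xy is the co-occurrence count and f_x, f_y are the frequencies of the node and the collocate. The sketch below implements that standard definition; whether this installation computes it with exactly the same window and counts is an assumption, and the frequencies in the example are invented.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """Standard logDice: 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: number of co-occurrences of node and collocate
    f_x:  frequency of the node word
    f_y:  frequency of the collocate
    """
    if f_xy <= 0 or f_x < f_xy or f_y < f_xy:
        raise ValueError("need 0 < f_xy <= f_x and f_xy <= f_y")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Invented frequencies for illustration only
score = log_dice(f_xy=50, f_x=1000, f_y=600)
print(score)
```

Because the score depends only on the three frequencies and not on corpus size, values are comparable across corpora of different sizes.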

Collocations vs Manual Proximity Queries

While proximity queries ([lemma="X"] []{0,N} [lemma="Y"]) let you search for specific combinations, the Collocations feature automatically discovers patterns you might not have anticipated.


Frequency Analysis

After running a query, use the Frequency panel to understand distributions across different dimensions.

Distribution by Word Form

The frequency breakdown shows how your query results distribute across word forms, POS tags, and other attributes:

Frequency distribution

Distribution by Metadata

Analyze how terms are distributed across document metadata:

Frequency by country

The distribution roughly corresponds to the number of universities in each nation.

Analyzing Distributions

  • Countries: Compare usage in English vs Scottish vs Welsh universities
  • Strategy years: See temporal patterns (2025 vs 2030 vs 2040 targets)
  • Universities: Identify which institutions use certain terms most

Export Options

Export your results for further analysis:

  • CSV: For spreadsheet analysis
  • XLSX: For Excel with formatting

Filtering by Metadata

Use metadata filters to create subcorpora for focused analysis.

Filter by Country

Compare “partnership” usage across the UK nations:

  1. Select England to see only English universities
  2. Select Scotland to see only Scottish universities
  3. Select Wales to see only Welsh universities

Filter by Strategy Year

Analyze how discourse changes based on target dates:

  • Universities targeting 2025 vs 2030 vs 2040

Semantic Analysis

Semantic analysis clusters concordance lines by semantic similarity, helping you discover usage patterns within a query.

For a full feature overview, see Semantic Analysis.

This walkthrough demonstrates semantic analysis using the UniPlans corpus of UK university strategic plans.

Step 1: Set up the query

  1. Select UniPlans (383K tokens) from the corpus dropdown.
  2. In the query builder, set:
    • Attribute: lemma
    • Operator: ==
    • Value: partnership
  3. Click Run builder query.

Query setup with UniPlans corpus and partnership lemma search

The query returns roughly 900 matches for forms of “partnership” across university strategic plans.

Step 2: Configure and run analysis

  1. Scroll to the Semantic analysis panel.
  2. Select sample size (e.g., 200 hits for a manageable analysis).
  3. Choose visualization method (t-SNE or UMAP).
  4. Click Analyze.

Semantic analysis panel with sample size and method selection

The analysis embeds each concordance line, reduces dimensions, and clusters automatically.

Step 3: Interpret the scatter plot

After analysis completes, you’ll see:

  • A scatter plot where each point represents one concordance line
  • Points colored by cluster assignment
  • Cluster buttons showing the size of each cluster
  • Silhouette score indicating cluster quality (higher = better separation)
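The silhouette score mentioned in the last bullet can be sketched in a few lines. This follows the textbook definition (for each point, a is the mean distance to its own cluster, b the mean distance to the nearest other cluster, and the score is (b - a) / max(a, b)); how the tool itself computes or samples the score is not documented here, and the 2D points are invented stand-ins for reduced embeddings.

```python
import math
from collections import defaultdict


def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Invented 2D coordinates standing in for reduced concordance embeddings
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

Scores near 1 indicate compact, well-separated clusters, while scores near or below 0 suggest that the chosen number of clusters does not fit the data.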

Scatter plot showing semantic clusters with typical examples

In this example, the analysis detected 3 clusters. The typical examples help interpret what distinguishes each cluster.

Step 4: Filter by cluster

Click a cluster button to filter the concordance table to only show lines from that cluster.

Cluster filter applied to highlight a single cluster

Filtered concordance table showing only Cluster 1 results

This makes it easy to:

  • Read through examples from a specific semantic grouping
  • Identify patterns in how “partnership” is used in different contexts
  • Export filtered results for further analysis

Interpreting results

Semantic clusters represent usage similarity, not predefined categories. When interpreting:

  • Look at the typical examples shown for each cluster
  • Read several concordances from each cluster to understand the pattern
  • Consider what linguistic or contextual features distinguish clusters
  • Use the cluster slider to explore different numbers of clusters

For “partnership” in UniPlans, the clusters separate common usage patterns such as:

  1. Local/regional partnerships: community stakeholders, regional employers, local councils
  2. Global/international partnerships: worldwide networks, international collaborations, alumni connections
  3. Institutional partnerships: academic collaborations, internal initiatives, cross-campus programs

LLM Classification

LLM classification lets you label concordance samples with custom variables for targeted linguistic analysis.

For a full feature overview, see LLM Classification.

This walkthrough demonstrates LLM classification using the UniPlans corpus to analyze the grammatical subjects of “partnership” constructions.

Step 1: Run a query

First, run a CQL query to get concordance results:

  1. Select UniPlans (383K tokens) from the corpus dropdown.
  2. Query for [lemma=="partnership"].
  3. Click Run builder query.

Step 2: Configure variables

Scroll to the LLM classification panel. You’ll see:

  • Sample size: how many concordances to classify (5-100)
  • Estimated cost: calculated based on variables and sample size
  • Preset variable buttons: Subject animacy, Transitivity, Topic, Illocution

LLM classification panel with preset variable buttons

Open the sample size dropdown to pick how many lines to classify:

Sample size options in the classification panel

Click + Subject animacy to add the preset. The variable card appears with:

  • Title: Subject animacy
  • Description: Classification criteria with examples
  • Type: Fixed set of options (enum)
  • Options: animate, inanimate

Variable configuration showing description and enum options

The description provides guidance to the LLM:

Classify the grammatical subject’s animacy. animate = humans, animals, or organizations acting as agents; inanimate = objects, abstract concepts, events, or natural phenomena.

Examples help calibrate the model:

  • “The child laughed loudly.” -> animate
  • “The committee approved the proposal.” -> animate
  • “The storm destroyed the docks.” -> inanimate
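One way to think about a variable card is as a small data structure holding the title, description, allowed options, and calibration examples. The sketch below is purely hypothetical: the EnumVariable class, its method names, and the prompt layout are assumptions for illustration and do not describe the tool's internal prompt or API.

```python
from dataclasses import dataclass, field


@dataclass
class EnumVariable:
    """Hypothetical container for one classification variable."""
    title: str
    description: str
    options: list
    examples: list = field(default_factory=list)  # (sentence, label) pairs

    def build_prompt(self, concordance_line):
        """Assemble an illustrative prompt (not the tool's actual prompt)."""
        shots = "\n".join(f'"{text}" -> {label}' for text, label in self.examples)
        return (
            f"{self.title}: {self.description}\n"
            f"Allowed answers: {', '.join(self.options)}\n"
            f"Examples:\n{shots}\n"
            f'Classify: "{concordance_line}"'
        )

    def validate(self, answer):
        """Normalize a model answer and reject anything outside the enum."""
        label = answer.strip().lower()
        if label not in self.options:
            raise ValueError(f"unexpected label: {answer!r}")
        return label


subject_animacy = EnumVariable(
    title="Subject animacy",
    description="Classify the grammatical subject's animacy.",
    options=["animate", "inanimate"],
    examples=[
        ("The child laughed loudly.", "animate"),
        ("The committee approved the proposal.", "animate"),
        ("The storm destroyed the docks.", "inanimate"),
    ],
)
print(subject_animacy.build_prompt("We will develop partnerships"))
```

Restricting answers to a fixed set of options is what makes the resulting column easy to count and compare after export.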

Step 3: Run classification

  1. Select sample size (e.g., 5 lines for a quick test).
  2. Review the estimated cost.
  3. Click Classify sample.

The model classifies each concordance line and returns results.

Step 4: Review results

Results appear in a table with your variable columns:

Classification progress indicator while the job runs

Classification results showing concordances with Subject animacy annotations

Each row shows:

  • Left context: words before the keyword
  • KWIC: the keyword in context (partnerships)
  • Right context: words after the keyword
  • Subject animacy: the LLM’s classification

In this example, most “partnership” usages have animate subjects (organizations, institutions, “we”) because the corpus discusses what universities do with partnerships.

Step 5: Export results

Click Export CSV to download results for further analysis in Excel, R, or Python.


Programmatic Access (API)

For advanced users, the corpus can be queried programmatically via the REST API.

API Endpoint

POST /cql/run

Example Request (curl)

curl -X POST http://localhost:8000/cql/run \
  -H "Content-Type: application/json" \
  -d '{
    "cql": "[lemma=\"partnership\"]",
    "corpus_id": "uniplan",
    "limit": 50
  }'

Example Request (Python)

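import requests

# Send the CQL query to the API endpoint
response = requests.post(
    "http://localhost:8000/cql/run",
    json={
        "cql": '[lemma="partnership"]',
        "corpus_id": "uniplan",
        "limit": 50,
    },
    timeout=30,  # avoid hanging if the server is unreachable
)
response.raise_for_status()  # stop early on HTTP errors (4xx/5xx)

# Print the matched keyword of each concordance hit
results = response.json()
for hit in results["hits"]:
    print(hit["kwic"])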

Response Format

{
  "total": 968,
  "hits": [
    {
      "text_id": 1,
      "position": 245,
      "left": "building international",
      "kwic": "partnerships",
      "right": "with other universities",
      "metadata": {
        "university_name": "University of Aberdeen",
        "country": "Scotland"
      }
    }
  ]
}
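Given a response shaped like the one above, two small helpers cover most first steps: formatting a hit as a keyword-in-context line and counting hits per metadata field. The sketch assumes every hit carries the keys shown in the sample response; the helper names are made up, and fields other than university_name and country are not confirmed by the example.

```python
from collections import Counter

# The sample response from above, reproduced as a Python dict
payload = {
    "total": 968,
    "hits": [
        {
            "text_id": 1,
            "position": 245,
            "left": "building international",
            "kwic": "partnerships",
            "right": "with other universities",
            "metadata": {
                "university_name": "University of Aberdeen",
                "country": "Scotland",
            },
        }
    ],
}


def format_kwic(hit):
    """Render one hit as 'left [keyword] right'."""
    return f'{hit["left"]} [{hit["kwic"]}] {hit["right"]}'


def hits_per_field(payload, field_name):
    """Count returned hits by a metadata field, e.g. 'country'."""
    return Counter(
        hit["metadata"].get(field_name, "unknown") for hit in payload["hits"]
    )


print(format_kwic(payload["hits"][0]))
print(hits_per_field(payload, "country"))
```

Note that the counts cover only the hits actually returned, which is capped by limit, while total reports the number of matches in the whole corpus.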

When to Use the API

  • Batch processing: Analyze multiple queries programmatically
  • Data pipelines: Integrate corpus data into research workflows
  • Custom visualizations: Build specialized analysis tools
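As a sketch of the batch-processing case, the following builds one request body per query and defines a helper that posts it using only the standard library. The endpoint, corpus_id, and limit values are taken from the examples above; the function names and the choice of queries are illustrative, and run_query is defined but deliberately not called here because it needs a running server.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/cql/run"  # endpoint from the examples above


def build_payload(cql, corpus_id="uniplan", limit=50):
    """Create the JSON body for one query."""
    return {"cql": cql, "corpus_id": corpus_id, "limit": limit}


def run_query(payload, url=API_URL, timeout=30):
    """POST one payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.load(reply)


# A small batch comparing grammatical roles of the same lemma
queries = {
    "object": '[lemma="partnership" & deprel="dobj"]',
    "subject": '[lemma="partnership" & deprel="nsubj"]',
    "plural": '[lemma="partnership" & pos="NNS"]',
}
payloads = {name: build_payload(cql) for name, cql in queries.items()}
print(json.dumps(payloads["object"]))
```

With the server running, a loop such as `for name, body in payloads.items(): print(name, run_query(body)["total"])` would print one hit count per query.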

Quick Reference

Common Query Patterns

Goal                CQL Query
Exact word          [word="partnership"]
All forms (lemma)   [lemma="partnership"]
Plural nouns only   [lemma="X" & pos="NNS"]
Adjacent words      [lemma="X"] [lemma="Y"]
Words within 3      [lemma="X"] []{0,3} [lemma="Y"]
As direct object    [lemma="X" & deprel="dobj"]
As subject          [lemma="X" & deprel="nsubj"]
Organizations       [ent_type="ORG"]
Places              [ent_type="GPE"]
Plural forms        [lemma="X" & morph="Number=Plur"]
Regex match         [word="partner.*"]

Penn Treebank POS Tags

Tag   Description     Examples
NN    Singular noun   partnership, university
NNS   Plural noun     partnerships, universities
NNP   Proper noun     Edinburgh, Scotland
JJ    Adjective       strategic, international
VB    Verb (base)     develop, create
VBG   Verb (gerund)   developing, creating
VBD   Verb (past)     developed, created
RB    Adverb          significantly, globally

Dependency Relations

Relation   Description
nsubj      Nominal subject
dobj       Direct object
pobj       Prepositional object
amod       Adjectival modifier
compound   Compound word
conj       Conjunct (coordinated element)