Toronto summer interns’ work

Here Kaspar writes about the work of the project’s two summer interns:

Roman Polyanovsky and Tim Alberdingk Thijm, two Computer Science undergraduates who were working as summer interns on the Dilipad project, have created a video to showcase the project to a general audience. The video gives a good insight into how our digital parliamentary corpora are constructed (without getting into technical details) and shows some exciting preliminary research results.

You can watch the video by following the link below:

Roman’s work consisted of transforming the OCR-ed text of the Canadian proceedings from its raw form into a richly annotated XML dataset. This involved overcoming various challenges, such as optimising complex regular expressions to extract the crucial entities that appear in the proceedings, like speaker names and topic titles. Because of the noise caused by OCR errors, the regular expressions had to be fine-tuned so that they were neither too conservative (excluding every slight deviation caused by OCR errors) nor too general, which would equally lead to information loss. Roman has put a lot of energy into preserving the elaborate topic structure of the original proceedings, which changes over time and differs slightly from the UK Hansards. To accomplish these goals, he has built a general and flexible regex transformer that not only accurately converts parliamentary text to XML, but can also be applied to any other type of (political) text with only minor changes.
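
To give a flavour of the trade-off between conservative and general patterns, here is a minimal sketch in Python. It is not Roman's transformer: the pattern, the list of titles, the tolerated OCR confusions and the element names (speech, p) are all assumptions made up for illustration.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical speaker-line pattern: title, surname(s), separator, speech text.
# The optional [.,;] after the title and the [:;] separator tolerate a few
# plausible OCR confusions (e.g. "Mr," for "Mr."); this is an assumption
# about the noise, not the project's actual rule set.
SPEAKER = re.compile(
    r"^(?P<title>Mrs|Mr|Miss|Sir|Hon)[.,;]?\s+"
    r"(?P<name>[A-Z][A-Za-z'\-]+(?:\s+[A-Z][A-Za-z'\-]+)*)\s*[:;]\s*"
    r"(?P<text>.+)$"
)

def line_to_xml(line):
    """Wrap one line of OCR-ed text in a (hypothetical) XML element."""
    stripped = line.strip()
    match = SPEAKER.match(stripped)
    if match is None:
        # No speaker recognised: keep the text as a plain paragraph.
        return "<p>{}</p>".format(escape(stripped))
    speaker = "{} {}".format(match.group("title"), match.group("name"))
    return "<speech speaker={}>{}</speech>".format(
        quoteattr(speaker), escape(match.group("text"))
    )

print(line_to_xml("Mr. Smith: I rise to speak."))
print(line_to_xml("Mr, Smith; I rise to speak."))  # OCR-damaged variant
```

Making the punctuation classes wider catches more damaged lines but also raises the risk of treating ordinary sentences as speaker lines, which is the balance described above.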

Tim focused on both enhancing the database and performing prospective research on the discourse on migration. The database part of his internship consisted of improving the biographical information on MPs in the UK, which concretely boiled down to adding and correcting party information and solving multiple disambiguation problems. Because members were originally identified and linked with string matching techniques, those who shared the same name were often assigned the same id. Tim managed to reduce the number of ambiguous MPs, or those without correct biographical information, to almost zero, and has thus significantly improved the overall quality of the Dilipad metadata. Besides this, he made sure that almost all of the UK MPs were properly linked to external databases such as DBPedia and Wikipedia. In his research Tim examined the ideological differences in debates on immigration in the UK Parliament, and applied Structural Topic Models to extract Labour and Conservative frames on this topic.

Visualizing Parliamentary Discourse with Word2Vec and Gephi

This is a post by Kaspar Beelen. Kaspar is a post-doctoral researcher with the Toronto part of the project team.


For historians, the idea of “automated” content analysis is still contested and treated with a justified dose of suspicion. How can one interpret texts, or make claims about their meaning, on the basis of just quantitative output? Text-mining techniques, such as supervised classification, are directly transferred from exact sciences to the humanities, even though many of these methods were not intended to serve as an instrument of content analysis. They are nevertheless used to detect and analyze traces of – for example – ideology and gender in text. Despite the reliance on often advanced algorithms, many studies come to something of a dead end when forced to interpret their model in the form of wordlists or word clouds, both being nothing more than a collection of lexical units devoid of context. Although the text is processed rigorously, the substance of the interpretation frequently depends on a rather arbitrary reading of a small set of phrases, which, due to their lack of context, are highly ambiguous. Many studies traditionally show the top twenty most important features that characterize certain ideological or gender perspectives, to prove that their classifier or other instrument successfully registers what it is trained to do. But what about the next 100 elements? And, more importantly, what is the interrelationship between all these significant features? Humanities scholars are often as interested, if not more so, in the structure or the content of the model as in its predictive power. It is not my intention to reconcile close and distant reading in just one short blog post, but what I will elaborate upon is how recent advances in both natural language processing and network visualization have made it possible to represent data in a way that is both more fine-grained and more holistic.

Visualizing “Women’s Interest” in Postwar Westminster

Below I’ll demonstrate how tools such as Word2Vec – an unsupervised method for obtaining vector representations of words – in combination with dynamic graphs can shed more light on ongoing debates within Political History and Political Science, such as Women’s Substantive Representation (WSR). In very concise terms, the theory on WSR has hypothesized that increases in women legislators ensure that women’s interests, priorities, and perspectives will be better represented. To what extent can this theory be empirically corroborated by looking at the discursive practices of female MPs?

To track the issues and problems women have focused on after the Second World War, we’ve extracted nouns and adjectives – words most indicative of topics – and calculated to what extent these words characterize (when controlling for ideology) the discourse of female MPs.

Most “female” words for 1945-2015
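
The scoring step can be illustrated with a small sketch. The post does not name the feature selection method, so the smoothed log-ratio of relative frequencies used here is a stand-in chosen for simplicity, and it leaves out the control for ideology entirely. The function name, the smoothing constant alpha and the toy token lists are all assumptions.

```python
import math
from collections import Counter

def log_ratio(female_tokens, male_tokens, alpha=0.5):
    """Score each word by the log-ratio of its smoothed relative frequency
    in speeches by women versus speeches by men (positive = more "female").
    A simple stand-in, not the method used for the list in this post."""
    f, m = Counter(female_tokens), Counter(male_tokens)
    vocab = set(f) | set(m)
    # Additive smoothing keeps words unseen in one group from producing log(0).
    nf = sum(f.values()) + alpha * len(vocab)
    nm = sum(m.values()) + alpha * len(vocab)
    return {
        w: math.log(((f[w] + alpha) / nf) / ((m[w] + alpha) / nm))
        for w in vocab
    }

# Toy input: nouns and adjectives already extracted (made-up data).
female = ["child", "child", "mother", "tax"]
male = ["tax", "tax", "defence", "child"]
scores = log_ratio(female, male)
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, round(scores[word], 3))
```

Sorting the scores in descending order gives a ranked list of the kind shown above; a real analysis would additionally need to separate the effect of gender from that of party.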


This procedure generates a list of words that suggest, as expected, that women MPs speak more about “women” and that their social categorization somehow focuses on generational and family-related issues (“child”, “age”, “mother” etc.). However interesting, lists like these are not exactly fine-grained and don’t allow us to make holistic claims about the issues women MPs have traditionally prioritized after 1945. Just listing more features would put considerable strain on the researcher and reader alike, and would make the interpretation only more arbitrary. Moreover, it wouldn’t allow us to identify issues, their interrelationship, and their development over time. What we can do is cluster the most “female” features for each legislature between 1945 and 2015 based on the proximity of their vector representations as created with Word2Vec. These vector representations have proven to have many interesting properties. The most famous observation is that subtracting vector(“man”) from vector(“king”) and then adding vector(“woman”) yields a vector whose closest match is vector(“queen”).
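
The arithmetic behind this observation can be shown with a toy example. The two-dimensional vectors below are hand-made so that the analogy works by construction (one axis loosely stands for “royalty”, the other for “gender”); real Word2Vec vectors have hundreds of dimensions and are learned from text. The helper names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors (an assumption for illustration, not trained output).
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c),
    leaving the three query words out of the candidates."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", vectors))
```

Excluding the query words from the candidates is the usual convention for this test, since the nearest vector to the result is otherwise often one of the inputs.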

Besides being successful in solving analogy tasks, Word2Vec turns out to be useful for many other tasks, such as clustering. Creating a vector space model based on all speeches of female MPs enables us to construct a nearest neighbor network of all words which are indicative of women’s parliamentary language. Each word w1 thus becomes a node and is connected to another word/node w2 if the latter appears in the set of n-closest vectors to word w1. The result is a network that at first sight might not seem very illuminating.

Figure 1: Graph in Stage 1
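
The construction of such a nearest neighbor network can be sketched in a few lines, assuming the word vectors are available as a dictionary. The toy vectors, the use of cosine similarity as the proximity measure and the function names are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_neighbour_edges(vectors, n):
    """For every word w1, add a directed edge to each of its n closest words w2.
    Returns (w1, w2, similarity) triples."""
    edges = []
    for w1, v1 in vectors.items():
        others = [w2 for w2 in vectors if w2 != w1]
        # Rank all other words by similarity to w1, most similar first.
        ranked = sorted(others, key=lambda w2: cosine(v1, vectors[w2]), reverse=True)
        for w2 in ranked[:n]:
            edges.append((w1, w2, cosine(v1, vectors[w2])))
    return edges

# Hand-made toy vectors (not trained Word2Vec output).
toy = {
    "school": [1.0, 0.1],
    "teacher": [0.9, 0.2],
    "embryo": [0.1, 1.0],
    "gene": [0.2, 0.9],
}
for w1, w2, sim in nearest_neighbour_edges(toy, n=1):
    print(w1, "->", w2, round(sim, 3))
```

Note that the relation is not symmetric in general: w2 can be among the n closest words of w1 without the reverse being true, so the edges are directed unless one chooses to merge them.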

Luckily we can use Gephi, an excellent visualization tool, to transform this unordered hairball into a neatly structured graph with identifiable clusters. Gephi not only allows one to visualize the network using different layout algorithms, it also comes with many other methods for analyzing its structure. After rearranging the graph using a linear-attraction linear-repulsion model (“Force Atlas”), we apply the “Modularity” algorithm to detect communities in the network and color each cluster separately. The result looks as follows:

Figure 2: Graph after running layout algorithm and colored by cluster
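
Community detection of this kind looks for a partition of the nodes that scores well on a modularity measure Q, which compares the share of edges inside each community with the share expected if edges were placed at random. The sketch below only computes Q for a partition that is already given; it does not perform the search that Gephi runs, and it treats the graph as undirected and unweighted, which is a simplifying assumption. The toy graph and names are made up.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of a given partition of an undirected, unweighted graph.
    edges: list of (u, v) pairs; community: dict mapping node -> community id."""
    m = len(edges)
    intra = defaultdict(int)   # number of edges inside each community
    degree = defaultdict(int)  # summed node degree per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    # Q = sum over communities of (share of intra edges - expected share).
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single bridge (toy graph).
edges = [
    ("school", "teacher"), ("teacher", "pupil"), ("school", "pupil"),
    ("embryo", "gene"), ("gene", "cell"), ("embryo", "cell"),
    ("pupil", "cell"),
]
two_clusters = {"school": 0, "teacher": 0, "pupil": 0, "embryo": 1, "gene": 1, "cell": 1}
one_cluster = {node: 0 for node in two_clusters}
print(modularity(edges, two_clusters))
print(modularity(edges, one_cluster))
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is why a modularity-based routine would prefer the former partition.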

The motivation for constructing networks such as these is only partly aesthetic. Although it might not seem obvious at first glance, the graph provides a framework for studying the lexical choices of women legislators over time as well as across parties. Firstly, we can scrutinize the clusters separately and identify the issues they capture. The figures below zoom in on some of the communities the Modularity algorithm has detected.

Figure 3: Graph representing the “Education” cluster

Figure 4: Graph representing the “Fertility” cluster

Secondly, and even more interesting for historians, Gephi supports the construction of dynamic graphs, which can visualize change over time. These lexical networks capture and visualize multiple attributes, be it the frequency, the weight assigned to words by the feature selection algorithm (to what extent a word is indicative of a certain perspective), or party (i.e. if female Conservative MPs use a word more than female Labour MPs). As an example, the figures below show a cluster of words relating to “poverty and inequality” for two different periods, with the node color indicating party (blue meaning Conservative, red standing for Labour and yellow indicating that female members of both parties use a word more than their male party colleagues).

Figure 5: the “Equality” cluster for 1945-1951 (first Attlee Cabinet)

Figure 6: the “Equality” cluster for 1987-1992 (last Thatcher Cabinet)

The subgraph suggests that in the early postwar years, during the Attlee premiership, the “poverty” theme was mainly a priority of Labour women. However, during Thatcher’s last ministry both Conservative and Labour women equally prioritized this topic when speaking of the “poorest”, the “vulnerable” and “poverty”. Besides the overlap, they also employed a distinctly different jargon. Labour women talked more about “inequality” and the “[poverty] trap”, while their Conservative female colleagues concentrated on the “needy” and “poorer”.

The options for research are virtually endless and largely depend on the hypothesis or question being investigated.


Although these lexical networks provide a valuable framework for exploratory research on the historically changing discourse of gender and party, they might not always be the most convenient way of presenting the results. Gephi allows the user to export all the data back to a spreadsheet and analyze the network quantitatively in a more straightforward manner. Based on the network created in the preceding paragraphs, we can demarcate the periods during which certain clusters featured more prominently. The following table lists the lexical communities that appeared mainly during the fifties and the sixties, suggesting that women’s interventions in the House of Commons concentrated on the practical aspects of everyday life and consumption affairs.

Table 2: Clusters for 1945-1965

Cluster ID  Words
1 cooking, hot, wash, laundry, apparatus, lavatory, luxury, kitchen, catering, appliance, electric, room, cleaning, portable, refrigerator, cooker, fireplace, analgesia, bathroom, bath, washing
2 foodstuff, cabbage, tomato, vitamin, glut, production, cereal, import, fruit, exporter, pear, overseas, banana, lettuce, vegetable, potato, protein, strawberry, tinned, foreign, wholesaler, decontrol, importation, imported, importer, dried, apple, carrot, export
3 soap, jam, coffee, confectionery, powder, cocoa, coupon, glove, bean, cream, ice, tin, chocolate, sandwich, sweet, biscuit

In the decades after Blair’s landslide victory in 1997, a whole new range of topics appeared, such as transport, crime, violence and fertility.

Table 3: Clusters for 1997-2014

Cluster ID  Words
1 passenger, emission, plane, freight, commuter, network, rail, operator, concessionary, fare, ticket, carriage, aviation, infrastructure, airline, franchising, airport, train, bus, booking, season, transport, railway, railtrack, franchise
2 perpetrator, malnutrition, harassment, abus, graffiti, suffering, sexual, victim, antisocial, fly, pain, assault, gross, violence, harm, domestic, tipper, crime, violent, behaviour, distress, litter, rape, abuse
3 genetic, experimentation, tissue, reproductive, stem, therapeutic, cell, gene, sperm, insemination, artificial, technique, fertilisation, gamete, implant, embryo, embryology
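
The kind of period comparison behind these tables can be sketched as a simple aggregation over an exported spreadsheet. The column names (word, cluster, period, weight), the cluster labels and all numbers below are made up for illustration; a real export from Gephi will have its own column layout, and the weights here are not results from the project.

```python
import csv
import io
from collections import defaultdict

def cluster_prominence(csv_text):
    """Sum the word weights per (cluster, period) pair."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[(row["cluster"], row["period"])] += float(row["weight"])
    return dict(totals)

def peak_period(totals):
    """For each cluster, return the period in which its summed weight is highest."""
    best = {}
    for (cluster, period), total in totals.items():
        if cluster not in best or total > best[cluster][1]:
            best[cluster] = (period, total)
    return {cluster: period for cluster, (period, _) in best.items()}

# Hypothetical export; the numbers are invented for the example.
sample = """word,cluster,period,weight
kitchen,household,1945-1965,2.0
laundry,household,1945-1965,1.5
kitchen,household,1997-2014,0.5
rail,transport,1945-1965,0.25
rail,transport,1997-2014,2.0
airport,transport,1997-2014,1.0
"""
print(peak_period(cluster_prominence(sample)))
```

Summing weights is only one possible definition of prominence; counting the number of cluster words that pass a threshold in each period would be an equally defensible choice.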