Toronto summer interns’ work

Here Kaspar writes about the work of the project’s two summer interns:

Roman Polyanovsky and Tim Alberdingk Thijm, two Computer Science undergraduates who were working as summer interns on the Dilipad project, have created a video to showcase the project to a general audience. The video gives a good insight into how our digital parliamentary corpora are constructed (without getting into technical details) and shows some exciting preliminary research results.

You can watch the video by following the link below:

https://www.youtube.com/watch?v=0ZhadQoG3EE

Roman’s work consisted of transforming the OCR-ed text of the Canadian proceedings from its raw form all the way into a richly annotated XML dataset. This involved overcoming various challenges, such as optimising complex regular expressions to extract the crucial entities that appear in the proceedings, like speaker names and topic titles. Because of the noise caused by OCR errors, the regular expressions had to be fine-tuned so that they were neither too conservative – which would exclude all slight deviations due to OCR errors – nor too general, which would equally lead to information loss. Roman has put a lot of energy into preserving the elaborate topic structure of the original proceedings, which changes over time and differs slightly from the UK Hansards. To accomplish these goals, he has built a general and flexible regex transformer that not only accurately converts parliamentary text to XML, but can also be applied to any other type of (political) text with only minor changes.
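As a toy illustration of such noise-tolerant extraction (the pattern and the XML tag below are invented for the example, not the project’s actual schema):

```python
import re

# A deliberately tolerant honorific: "Mr." is often OCR-ed as "Mr," or "Mtr.",
# so we allow a little noise rather than matching the literal string.
SPEAKER = re.compile(
    r"^(?P<title>M\S{1,3})\s+"           # noisy "Mr."/"Mrs." variants
    r"(?P<name>[A-Z][A-Za-z'\-]+"        # surname ...
    r"(?:\s+[A-Z][A-Za-z'\-]+)*)\s*:")   # ... up to the colon

def tag_line(line):
    """Turn a recognised speaker line into a minimal XML fragment."""
    m = SPEAKER.match(line)
    return '<speaker name="{}"/>'.format(m.group("name")) if m else line

print(tag_line("Mr. MACKENZIE KING: I rise to address ..."))
# -> <speaker name="MACKENZIE KING"/>
```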

Tim focused both on enhancing the database and on prospective research on the discourse on migration. The database part of his internship comprised improving the biographical information on MPs in the UK, which concretely boiled down to adding and correcting party information and solving multiple disambiguation problems. Because members were originally identified and linked with string-matching techniques, those who shared the same name were often assigned the same ID. Tim managed to reduce the number of ambiguous MPs – or those without correct biographical information – to almost zero, and has thus significantly improved the overall quality of the Dilipad metadata. Besides this, he made sure that almost all of the UK MPs were properly linked to external databases such as DBpedia and Wikipedia. In his research Tim examined the ideological differences in debates on immigration in the UK Parliament, and applied Structural Topic Models to extract Labour and Conservative frames on this topic.

Visualizing Parliamentary Discourse with Word2Vec and Gephi

This is a post by Kaspar Beelen. Kaspar is a post-doctoral researcher with the Toronto part of the project team.

Introduction

For historians, the idea of “automated” content analysis is still contested and treated with a justified dose of suspicion. How can one interpret texts, and make claims about their meaning, on the basis of just quantitative output? Text-mining techniques, such as supervised classification, are directly transferred from the exact sciences to the humanities, even though many of these methods were not intended to serve as instruments of content analysis. They are nevertheless used to detect and analyze traces of – for example – ideology and gender in text. Despite the reliance on often advanced algorithms, many studies come to something of a dead end when forced to interpret their model in the form of wordlists or word clouds, both being nothing more than collections of lexical units devoid of context. Although the text is processed rigorously, the substance of the interpretation frequently depends on a rather arbitrary reading of a small set of phrases, which, due to their lack of context, are highly ambiguous. Many studies traditionally show the top twenty most important features that characterize certain ideological or gender perspectives, to prove that their classifier or other instrument successfully registers what it is trained to do. But what about the next 100 elements? And, more importantly, what is the interrelationship between all these significant features? Humanities scholars are often equally – if not more – interested in the structure and content of the model than in its predictive power. It is not my intention to reconcile close and distant reading in just one short blog post, but I will elaborate upon how recent advances in both natural language processing and network visualization have made it possible to represent data in a way that is more fine-grained and holistic at the same time.

Visualizing “Women’s Interest” in Postwar Westminster

Below I’ll demonstrate how tools such as Word2Vec – an unsupervised method for obtaining vector representations of words – in combination with dynamic graphs can shed more light on ongoing debates within Political History and Political Science, such as Women’s Substantive Representation (WSR). In very concise terms, the theory of WSR hypothesizes that increases in the number of women legislators ensure that women’s interests, priorities, and perspectives will be better represented. To what extent can this theory be empirically corroborated by looking at the discursive practices of female MPs?

To track the issues and problems women have focused on after the Second World War, we’ve extracted nouns and adjectives – words most indicative of topics – and calculated to what extent these words characterize (when controlling for ideology) the discourse of female MPs.
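The scoring itself is not spelled out in this post; purely as an illustration of the kind of calculation involved, a smoothed log-ratio of relative frequencies (leaving out the ideology control) could look like this:

```python
from collections import Counter
from math import log

def distinctive_words(female_tokens, male_tokens, min_count=50):
    """Rank words by the log-ratio of their relative frequencies in
    female vs. male speech (add-one smoothing); higher = more 'female'."""
    f, m = Counter(female_tokens), Counter(male_tokens)
    nf, nm = sum(f.values()), sum(m.values())
    score = {w: log(((f[w] + 1) / nf) / ((m[w] + 1) / nm))
             for w in f if f[w] >= min_count}
    return sorted(score, key=score.get, reverse=True)
```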

Table 1: Most “female” words for 1945-2015

child_NN
woman_NN
health_NN
age_NN
mother_NN
care_NN
family_NN
husband_NN
elderly_JJ
work_NN
help_NN
parent_NN
young_JJ
person_NN
girl_NN
baby_NN
women_NN
lady_NN
home_NN

This procedure generates a list of words suggesting, as expected, that women MPs speak more about “women” and that their social categorization somehow focuses on generational and family-related issues (“child”, “age”, “mother” etc.). However interesting, lists like these are not exactly fine-grained and don’t allow us to make holistic claims about the issues women MPs have traditionally prioritized after 1945. Just listing more features would put considerable strain on the researcher and reader alike, and would make the interpretation only more arbitrary. Moreover, it wouldn’t allow us to identify issues, their interrelationships, and their development over time. What we can do is cluster the most “female” features for each legislature between 1945 and 2015 based on the proximity of their vector representations as created with Word2Vec. These vector representations have proven to contain many interesting properties, the famous observation being that when subtracting the vector(“man”) from the vector(“king”) and subsequently adding the vector(“woman”), the closest match turns out to be the vector(“queen”).
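With gensim, training such a model and checking the famous analogy takes only a few lines (a sketch: “sentences” stands for the tokenised speeches, and the parameter names follow recent gensim releases):

```python
from gensim.models import Word2Vec

# sentences: tokenised speeches, e.g. [["child", "care", "benefit", ...], ...]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=20, workers=4)

# king - man + woman ~ queen, the observation cited above
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```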

Besides being successful in solving analogy tasks, Word2Vec turns out to be useful for many other tasks, such as clustering. Creating a vector space model based on all speeches of female MPs enables us to construct a nearest neighbor network of all words which are indicative of women’s parliamentary language. Each word w1 thus becomes a node and is connected to another word/node w2 if the latter appears in the set of n-closest vectors to word w1. The result is a network that at first sight might not seem very illuminating.
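Building the nearest-neighbor network itself is straightforward; here is a sketch with networkx, reusing the model from above (“indicative_words” is the list of “female”-indicative features, and n = 10 neighbors is an illustrative choice):

```python
import networkx as nx

G = nx.Graph()
for w1 in indicative_words:
    # connect w1 to the 10 words whose vectors lie closest to it
    for w2, sim in model.wv.most_similar(w1, topn=10):
        G.add_edge(w1, w2, weight=sim)

nx.write_gexf(G, "female_discourse.gexf")  # GEXF files open directly in Gephi
```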

Figure 1: Graph in Stage 1

Luckily we can use Gephi, an excellent visualization tool, to transform this unordered hairball into a neatly structured graph with identifiable clusters. Gephi not only allows one to visualize the network using different layout algorithms, it also comes with many other methods for analyzing its structure. After rearranging the graph using a linear-attraction linear-repulsion model (“Force Atlas”), we apply the “Modularity” algorithm to detect communities in the network and color each cluster separately. The result looks as follows:

Figure 2: Graph after running the layout algorithm, colored by cluster

The motivation for constructing networks such as these is only partly aesthetic. Although it might not seem obvious at first glance, the graph provides a framework for studying the lexical choices of women legislators over time as well as across parties. First, we can separately scrutinize the clusters and identify the issues they capture. The figures below zoom in on the different communities the Modularity algorithm has detected.

Figure 3: Graph representing the “Education” cluster

Figure 4: Graph representing the “Fertility” cluster

Secondly, and even more interestingly for historians, Gephi supports the construction of dynamic graphs, which can visualize change over time. These lexical networks capture and visualize multiple attributes, be it frequency, the weight assigned to words by the feature selection algorithm (to what extent a word is indicative of a certain perspective), or party (i.e. whether female Conservative MPs use a word more than female Labour MPs). As an example, the figure below shows a cluster of words relating to “poverty and inequality” for two different periods, with the node color indicating party (blue meaning Conservative, red standing for Labour, and yellow indicating that female members of both parties use a word more than their male party colleagues).

Figure 5: The “Equality” cluster for 1945-1951 (first Attlee Cabinet)

Figure 6: The “Equality” cluster for 1987-1992 (last Thatcher Cabinet)

The subgraph suggests that in the early postwar years, during the Attlee premiership, the “poverty” theme was mainly a priority of Labour women. However, during Thatcher’s last ministry both Conservative and Labour women equally prioritized this topic when speaking of the “poorest”, the “vulnerable” and “poverty”. Besides this overlap, the two parties also employed distinctly different jargon. Labour women talked more about “inequality” and the “[poverty] trap”, while their Conservative female colleagues concentrated on the “needy” and “poorer”.

The options for research are virtually endless and largely depend on the hypothesis or question being investigated.

Epilogue

Although these lexical networks provide a valuable framework for exploratory research on the historically changing discourse of gender and party, they might not always be the most convenient way of presenting results. Gephi allows the user to export all the data back to a spreadsheet and analyze the network quantitatively in a more straightforward manner. Based on the network created in the preceding paragraphs, we can demarcate the periods during which certain clusters featured more prominently. The following table lists the lexical communities that appeared mainly during the fifties and sixties, suggesting that women’s interventions in the House of Commons concentrated on the practical aspects of everyday life and consumption affairs.

Table 2: Clusters for 1945-1965

Cluster 1: cooking, hot, wash, laundry, apparatus, lavatory, luxury, kitchen, catering, appliance, electric, room, cleaning, portable, refrigerator, cooker, fireplace, analgesia, bathroom, bath, washing

Cluster 2: foodstuff, cabbage, tomato, vitamin, glut, production, cereal, import, fruit, exporter, pear, overseas, banana, lettuce, vegetable, potato, protein, strawberry, tinned, foreign, wholesaler, decontrol, importation, imported, importer, dried, apple, carrot, export

Cluster 3: soap, jam, coffee, confectionery, powder, cocoa, coupon, glove, bean, cream, ice, tin, chocolate, sandwich, sweet, biscuit

For the decades after Blair’s landslide victory in 1997, a whole new range of topics appeared, such as transport, crime, violence and fertility.

Table 3: Clusters for 1997-2014

Cluster 1: passenger, emission, plane, freight, commuter, network, rail, operator, concessionary, fare, ticket, carriage, aviation, infrastructure, airline, franchising, airport, train, bus, booking, season, transport, railway, railtrack, franchise

Cluster 2: perpetrator, malnutrition, harassment, abus, graffiti, suffering, sexual, victim, antisocial, fly, pain, assault, gross, violence, harm, domestic, tipper, crime, violent, behaviour, distress, litter, rape, abuse

Cluster 3: genetic, experimentation, tissue, reproductive, stem, therapeutic, cell, gene, sperm, insemination, artificial, technique, fertilisation, gamete, implant, embryo, embryology


Sources of Evidence for Automatic Indexing of Political Texts

Political texts on the Web, documenting laws and policies and the process leading to them, are of key importance to government, industry, and every individual citizen. Yet access to such texts is difficult due to the ever-increasing volume and complexity of the content, prompting the need to index or annotate them with a common controlled vocabulary or ontology.

We investigated the effectiveness of different sources of evidence, such as labeled training data, textual glosses of descriptor terms, and the thesaurus structure, for automatically indexing political texts.

The main findings are the following.

First, using a learning-to-rank approach integrating all features, we observe significantly better performance than previous systems.

Second, the analysis of feature weights reveals the relative importance of the various sources of evidence, also giving insight into the underlying classification problem. Interestingly, we found that the most important part of a political document is its title.
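The paper itself should be consulted for the exact model; purely as a generic illustration of how a learning-to-rank approach can integrate such evidence scores and yield interpretable feature weights, here is a sketch of a pairwise reduction to binary classification (the feature layout and the function are assumptions, not the authors’ code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: one row per (document, descriptor) candidate; columns are evidence
# scores (match against labeled training data, descriptor gloss, thesaurus
# neighbourhood, title overlap, ...). y: 1 if the descriptor was assigned.
def pairwise_rank_weights(X, y):
    """Learn ranking weights from differences of positive/negative pairs."""
    pos, neg = X[y == 1], X[y == 0]
    diffs = np.array([p - n for p in pos for n in neg])
    labels = np.ones(len(diffs))
    diffs = np.vstack([diffs, -diffs])            # mirrored pairs -> class 0
    labels = np.concatenate([labels, np.zeros(len(labels))])
    model = LogisticRegression(max_iter=1000).fit(diffs, labels)
    return model.coef_.ravel()                    # relative feature importance

# Candidate descriptors for a new document are then ranked by X_new @ weights.
```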

The research was done by researchers at the University of Amsterdam: Mostafa Dehghani, Hosein Azarbonyad, Maarten Marx, and Jaap Kamps. The results were presented as a poster at the 37th European Conference on Information Retrieval and won the best poster award. The original paper is available here.


M. Dehghani, H. Azarbonyad, M. Marx, and J. Kamps. Sources of evidence for automatic indexing of political texts. In A. Hanbury, G. Kazai, A. Rauber, and N. Fuhr, editors, Advances in Information Retrieval, volume 9022 of Lecture Notes in Computer Science, pages 568–573. Springer International Publishing, 2015. ISBN 978-3-319-16353-6. doi:10.1007/978-3-319-16354-3_63. URL http://dx.doi.org/10.1007/978-3-319-16354-3_63.

Accelerating and validating ex post facto hypothesis formulation

This is a guest post by Evelijn Martinius, highlighting findings from her internship on the project in January 2015:

Researchers often set out to test their hypotheses, but the question I would like to pose in this post is: how do they come to frame those hypotheses? Generally, researchers test which factors seem to be associated with certain conditions. The first problem with this type of “ex post facto” hypothesis formulation is the level of generalization: as these hypotheses are usually formulated with a specific event in mind, it might be difficult to generalize the findings. Secondly, observing the behavior of a pre-established dependent variable is more prone to lead you to correlations between variables rather than causal relations. Most importantly, it is also argued that the non-randomized sample that is selected can threaten the validity of the research.

The character of the Dilipad database allows us to address some of these disadvantages quite easily. The formulation of the hypothesis benefits from homogeneous data that still holds a lot of variance due to the large sample size. A simple method, such as counting keywords, can give a quick and easy insight into the data and thereby stimulate faster and easier hypothesis formulation.

Let’s take an example, using a simple method like counting keywords in the Dilipad database. It is often assumed that since depillarization in the Netherlands, politicians have become more prone to ‘personalization’ (Kriesi, 2012; McAllister, 2009). Personalization differs from individualization because the public function remains dominant over the person’s image. Personalization can highlight idiosyncrasies through different wording during debates, because the focus is on the person rather than the party (Bennett, 2012). Since the first cabinet of Drees was the first post-war cabinet during pillarization, we can expect that its politicians referred to their followers more than politicians in the first cabinet of Kok. We can test this expectation, for instance, by counting the usage of ‘I’ and ‘we’ in these two parliamentary terms.
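A minimal sketch of such a count (the Dutch forms ‘ik’ and ‘wij’/‘we’ are assumed here, and the tokenisation is deliberately crude):

```python
import re
from collections import Counter

def pronoun_rates(speeches, forms=("ik", "wij", "we")):
    """Occurrences of first-person pronouns per 1,000 tokens."""
    counts, total = Counter(), 0
    for text in speeches:
        tokens = re.findall(r"\w+", text.lower())
        total += len(tokens)
        counts.update(t for t in tokens if t in forms)
    return {f: 1000 * counts[f] / total for f in forms}

# compare e.g. pronoun_rates(speeches_1951) with pronoun_rates(speeches_1995)
```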

Counting ‘I’ and ‘we’ in 1951 and 1995, the first results seem to indicate little change in the usage of the word “I”. Current political theory would have a hard time explaining this, which might make it an interesting topic to examine in more depth.

While the Dilipad database allows us to speed up the process of hypothesis formulation, it has another promising quality. The focus on statistically significant outcomes might lead us to focus on events, as it is hard to explain the variance that we see over time theoretically or statistically. The ‘I’ and ‘we’ word-count example shows this difficulty immediately: we cannot explain the variance with current political theory, but perhaps there is an explanation from a sociolinguistic perspective. Larger databases like Dilipad, however, allow us to extend our timeline and see how the variance developed between 1951 and 1995, perhaps eventually discovering new insights from that pattern. We could also count more keywords and see what this does to the variance. Working with the Dilipad data and using the PoliticalMashup search engine makes hypothesis formulation a lot faster and easier, which leaves us with more time to focus on examining new topics.


Bennett, W.L. (2012). ‘The personalization of politics: political identity, social media and changing patterns of participation.’ The ANNALS of the American Academy of Political and Social Science, 644(1), 20-39.

Kriesi, H. (2012). ‘Personalization of national election campaigns.’ Party Politics, 18(6), 825-844.

McAllister, I. (2009). ‘The personalization of politics.’ In R.J. Dalton and H.-D. Klingemann (eds.), The Oxford Handbook of Political Behavior (pp. 571-588). Oxford: Oxford University Press.

Why argumentation?

Here is a blog post by project team member Nona Naderi.

We study argumentation to understand how arguments are related and interact with one another. This kind of analysis allows for a better interpretation of human reasoning. Here is an example from Araucaria, a corpus of manually annotated arguments provided by the Argumentation Research Group at the University of Dundee:

The times in which we live and work are changing dramatically. The workers of our parents’ generation typically had one job, one skill, one career often with one company that provided health care and a pension. And most of those workers were men. Today, workers change jobs, even careers, many times during their lives, and in one of the most dramatic shifts our society has seen, two-thirds of all moms also work outside the home. This changed world can be a time of great opportunity for all Americans to earn a better living, support your [sic] family, and have a rewarding career. And government must take your side. Many of our most fundamental systems— the tax code, health coverage, pension plans, worker training— were created for the world of yesterday, not tomorrow. We will transform these systems so that all citizens are equipped, prepared and thus truly free to make your [sic] own choices and pursue your [sic] own dreams.

Now, let’s see what the arguer tries to persuade the audience to accept or reject. First, the arguer claims that “the times in which we live and work are changing dramatically” and supports the claim by comparing the past and present work situations. The premises used to increase the confidence of the audience in the claim include: “The workers of our parents’ generation typically had one job, one skill, one career often with one company that provided health care and a pension. And most of those workers were men” and “Today, workers change jobs, even careers, many times during their lives, and in one of the most dramatic shifts our society has seen, two-thirds of all moms also work outside the home.” She then argues that since the changed world will benefit Americans, the government must take the citizens’ side (claim: “government must take your side.” and premise: “This changed world can be a time of great opportunity for all Americans to earn a better living, support your [sic] family, and have a rewarding career”). Another claim by the proponent is “we will transform these systems so that all citizens are equipped, prepared and thus truly free to make your [sic] own choices and pursue your [sic] own dreams”. This claim is supported by multiple premises that are either implicit or explicit. One explicit premise is that “the government must take your [sic] side”, but there are also premises that are not expressed explicitly, such as (1) “the system is not good for the world today” and (2) “The best means to have a system fitting the economic situation of today is to change it”. Premise (1) can be simultaneously considered as a claim that can be supported by “The times in which we live and work are changing dramatically”.

There are also other unstated propositions, e.g., the claim (3) “the changes to the systems will benefit citizens”. And the interpretations can be changed by considering different unstated premises and claims. For example, by considering the claim (4) “the government does not do enough to provide a better living for citizens and needs to be changed”, we have a different interpretation; however, some propositions (e.g., claim 4) seem less likely compared to the others.

If we take a look at each argument separately, we will not be able to derive the correct interpretations. Each argument individually provides some information, but we gain more information by discovering the connections and interactions between the presented arguments and reconstructed enthymemes (arguments with implicit premises or claims). But how can we represent these argument interpretations? Solving this interesting and challenging problem helps in better understanding the arguer’s underlying reasoning.

Wliat’s in a n^me? Post-correction of randomly misrecognized names in OCR data

This is a guest post by team member Kaspar Beelen.

Problem.

Notwithstanding the recent optimization of Optical Character Recognition (OCR) techniques, the conversion from image to machine-readable text remains, more often than not, a problematic endeavor. The results are rarely perfect. The reasons for the defects are multiple and range from errors in the original prints to more systemic issues such as the quality of the scan, the selected font, or typographic variation within the same document. When we converted the scans of the historical Canadian parliamentary proceedings, especially the latter cause turned out to be problematic. Typographically speaking, the parliamentary proceedings are richly adorned with transitions between different font types and styles. These switches are not simply due to the aesthetic preferences of the editors, but are intended to facilitate reading by indicating the structure of the text. Structural elements of the proceedings, such as topic titles, the names of the MPs taking the floor, audience reactions and other crucial items, are distinguished from common speech by the use of bold or italic type, small capitals, or even a combination.

Moreover, if the scans are not optimized for OCR conversion, the quality of the data decreases dramatically as a result of typographic variation. In the case of the Belgian parliamentary proceedings, a huge effort was undertaken to make the historical proceedings publicly available in PDF format. The scans were optimized for readability, but seemingly not for OCR processing, and unsurprisingly the conversion yielded flawed and unreliable output. Although one might complain about this, it is highly unlikely that, considering the costs of scanning more than 100,000 pages, the process will be redone in the near future, so we have no option but to work with the data that is available.

For this reason, names printed in bold (Belgium) or small capitals (Canada) ended up misrecognized in an almost random manner, i.e. there was no logic in the way the software converted the name. Although this showcases the inventiveness of the OCR system, it makes linking names to an external database almost impossible. Below you see a small selection of the various ways ABBYY, the software package we are currently working with, screwed up the name of the Belgian progressive liberal “Houzeau de Lehaie”:


Table 1: Different outputs for “Houzeau de Lehaie”

Houzeau de Lehnie. Ilonzenu dc Lehnlc. lionceau de Lehale.
Ilonseau de Lehaie. Ilonzenu 4e Lehaie. HouKemi de Lehnlc.
lionceau de Lehaie. Honaeaa 4e Lehaie. Hoaieau de Lehnle.
Ilonzenn de Lehaie. Heaieaa ée Lehaie. Homean de Lehaie.
Heazeaa «le Lehaie. Houzcait de Lekale. Houteau de Lehaie.
Hoiizcan de Lchnle. Henxean dc Lehaie. Houxcau de Lehaie.
Hensean die Lehaie. IleuzeAit «Je Lehnie. Houzeau de Jlehuie.
Ileaieaa «Je Lehaie. Honzean dc Lehaie Houzeau de Lehaic.
Hoiizcnu de Lehaie. Honzeau de Lehaie. Ilouzeati de Lehaie.
Houxean de Lehaie. Hanseau de Lehaie. Etc.


Although the quality of the scanned Canadian Hansards is significantly better, the same phenomenon occurs.


Table 2: Sample of errors spotted in the conversion of the Canadian Hansards (1919)

BALLANTYNE ARCHAMBAULT
BAILLANiTYNE ARCBAMBAULT
BALLAINTYNE ARCHAMBATJLT
BALLANT1NE AECBAMBAULT
BALLAiNTYNE ABCHAMBAULT
iBALiLANTYNE AROHASMBAULT
BAIiLANTYNE ARlQHAMBAULT
BALLANTYINE AECBAMBAULT


In many other cases even an expert would have a hard time figuring out to whom the name should refer.


Table 3: Misrecognition of names

,%nsaaeh-l»al*saai.
aandcrklndcrc.
fiillleaiix.
IYanoerklnaere.
I* nréeldcn*.
Ilellcpuitc.
Thlcapaat.


These observations are rather troubling, especially with respect to the construction of linked corpora: even if, let’s say, 99% of the text is correctly converted, the other 1% will contain many of the most crucial entities needed for marking up the structure and linking the proceedings to other sources of information. To correct this tiny but highly important 1%, I will focus in this blog post on how to automatically normalize speaker entities, those parts of the proceedings that indicate who is taking the floor. In order to retrieve context information about the MPs, such as party and constituency, we have to link the proceedings to our biographical databases. Linking will only be possible if the speaker entities in the proceedings match those in our external corpus.

On most occasions, speaker entities include a title and a name, followed by optional elements indicating the function and/or the constituency of the orator. The colon forms the border between the speaker entity and the actual speech. In a more formal notation, a speaker entity consists of the following pattern:


Mr. {Initials} Name{, Function} {(Constituency)}: Speech.

Using regular expressions, we can easily extract these entities. The result of this extraction is summarized by the figures below, which show the frequency with which the different speaker entities occur.
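A simplified version of such a pattern in Python (the project’s actual expressions are more elaborate and language-specific; this sketch covers only the English-style entities of the pattern above):

```python
import re

SPEAKER_ENTITY = re.compile(
    r"^(?P<title>Mr|Mrs|Miss|Hon)\.?\s+"
    r"(?P<initials>(?:[A-Z]\.\s*)*)"                  # optional initials
    r"(?P<name>[A-Z][\w'\-]+(?:\s+[A-Z][\w'\-]+)*)"   # the name itself
    r"(?:,\s*(?P<function>[^(:]+?))?\s*"              # optional function
    r"(?:\((?P<constituency>[^)]+)\))?\s*:")          # optional constituency

m = SPEAKER_ENTITY.match("Mr. W. F. MACLEAN (South York): I desire to ask ...")
print(m.group("name"), "|", m.group("constituency"))  # MACLEAN | South York
```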

Figure 1: Distribution of extracted speaker entities (Canada, 1919)

Figure 2: Distribution of extracted speaker entities (Belgium, 1893)

The figures lay bare the scope of the problem caused by these random OCR errors in more detail. Ideally, there shouldn’t be more speaker entities than there are MPs in the House, which is clearly not the case. As you can see, for the Belgian proceedings from the year 1893 the set of items occurring once or twice alone contains around 3,000 unique elements. The output for the Canadian Hansards from 1919 looks slightly better, but there are still around 1,000 almost unique items. Also, as is clear from the plots, the distribution of the speakers is strongly right-skewed, due to the large number of unique, wrongly recognized names in the original scans. We will try to reduce this right-skewness by replacing the almost unique elements with more common items.

Solution.

In a first step we set out to replace these names with similar items that occur more frequently. Replacement happens in two consecutive rounds: first by searching in the local context of the sitting, and second by looking for a likely candidate in the set of items extracted from all the sittings of a particular year. To measure whether two names resemble each other, we calculated cosine similarity based on n-grams of characters, with n running from one to four.
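With scikit-learn the similarity computation can be sketched as follows (the helper name best_match is mine; it is reused in the next sketch):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# character n-grams with n running from 1 to 4, as described above
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 4))

def best_match(name, pool):
    """Return the candidate in `pool` most similar to `name`, plus its score."""
    X = vectorizer.fit_transform([name] + list(pool))
    sims = cosine_similarity(X[0], X[1:]).ravel()
    i = int(np.argmax(sims))
    return pool[i], float(sims[i])

print(best_match("BAILLANiTYNE", ["BALLANTYNE", "ARCHAMBAULT"]))
# -> ('BALLANTYNE', ...)
```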

More formally, the correction starts with the following procedure.
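A sketch of this procedure, reusing best_match from above (the rarity cut-off of two occurrences and the similarity threshold are illustrative assumptions):

```python
def correction_rules(sittings, year_counts, rare=2, threshold=0.8):
    """Two-round replacement: map each rare name first to the most similar
    frequent name within the same sitting, then, failing that, to the most
    similar frequent name of the whole year. Returns candidate rules
    {rare name: replacement} for manual filtering."""
    frequent_year = [n for n, c in year_counts.items() if c > rare]
    rules = {}
    for sitting in sittings:                     # each sitting: list of names
        frequent_local = [n for n in set(sitting) if year_counts[n] > rare]
        for name in set(sitting):
            if year_counts[name] > rare or name in rules:
                continue                         # already common or handled
            for pool in (frequent_local, frequent_year):
                if not pool:
                    continue
                candidate, score = best_match(name, pool)
                if score >= threshold:           # e.g. ABCHAMBAULT -> ARCHAMBAULT
                    rules[name] = candidate
                    break
    return rules
```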


As shown in Table 4, running this loop yields many replacement rules. Not all of them are correct, so we need to manually filter out and discard any illegitimate rules that the procedure has generated.


Table 4: Selection of rules generated by the above procedure

Legitimate rules:
EOWELL -> ROWELL
McOOIG -> McCOIG
ROWELiL -> ROWELL
RUCHARBSON -> RICHARDSON
(MdMASTER -> McMASTER
ABCHAMBAULT -> ARCHAMBAULT
AROHASMBAULT -> ARCHAMBAULT
CQCKSHUTT -> COCKSHUTT

Illegitimate rules:
W.HIDDEN -> DENIS
SCOTT -> CAEVELL
THOMAS VIEN -> THOMAS WHITE
BRAKE -> SPEAKER
CLARKE -> CLARK

Just applying these corrected replacement rules would already increase the quality of the text material a lot. But, as stated before, similarity won’t suffice when the quality is awful, as is the case for the examples shown in Table 3. We need to go beyond similarity, but how?

The solution I propose is to use the replacement rules to train a classifier and then apply the classifier to instances that couldn’t be assigned a correction during the previous steps. OCR correction thus becomes a multiclass classification task, in which each generated rule is used as a training instance. The right-hand side of the rule represents the class or target variable; the left-hand side is converted to input variables or features. After training, the classifier will predict a correction, given a misrecognized name as input. For our experiment we used Multinomial Naïve Bayes, trained with n-grams of characters as features, with n again ranging from 1 to 4. This worked surprisingly well: 90% of the rules it created were correct. Only around 10% of the rules generated by the classifier were either wrong or didn’t allow us to make a decision. Table 5 shows a small fragment of the rules produced by the classifier.
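In scikit-learn terms, the set-up can be sketched like this (a toy-sized training sample drawn from the rules above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# left-hand sides of the (manually filtered) rules are the inputs,
# right-hand sides are the classes
garbled = ["EOWELL", "ROWELiL", "ABCHAMBAULT", "AROHASMBAULT", "McOOIG"]
correct = ["ROWELL", "ROWELL", "ARCHAMBAULT", "ARCHAMBAULT", "McCOIG"]

clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 4)),
    MultinomialNB())
clf.fit(garbled, correct)

print(clf.predict(["AECBAMBAULT"]))   # ideally -> ['ARCHAMBAULT']
```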


Table 5: Sample of classifier output given input name

Input name -> Classifier output
,%nsaaeh-l»al*saai. -> Anspach-Puissant.
aandcrklndcrc. -> Vanderkindere.
fiillleaiix. -> Gillieaux.
IYanoerklnaere. -> Vanderkindere.
I* nréeldcn*. -> le président.
Ilellcpuitc. -> Helleputte.
Thlcapaat. -> Thienpont.


Conclusion.

As you can see in Table 5, the predicted corrections aren’t necessarily very similar to the input name. If just a few elements are stable, the classifier can pick up the signal even when there is a lot of noise. Because OCR software mostly recognizes at least a handful of characters consistently, this method seems to perform well.

To summarize: what are the strong points of this system? First of all, it is fairly simple, reasonably time-efficient and works even when the quality of the original data is very bad. Manual filtering can be done quickly: for each year of data, it takes an hour or two to correct the rules generated by each of the two processes and replace the names. Secondly, once a classifier is trained, it can also predict corrections for the other years of the same parliamentary session. Lastly, as mentioned before, the classifier can correctly predict replacements on the basis of just a few shared characters.

Some weak points need to be addressed as well. The system still needs supervision, but this is nonetheless worth the effort, because it can enhance the quality of the data significantly, especially with respect to linking the speeches at a later stage. In some cases, however, it can be impossible to assess whether a replacement rule should be kept or not. Another crucial problem is that the manual supervision needs to be done by experts who are familiar both with the historical period of the text and with the OCR errors. That is, the expert has to know which names are legitimate and also has to be proficient in reading OCR errors.

At the moment, we are trying to improve and expand the method. So far, the model uses only the frequency of n-grams, and not their location in a token. By taking location into account, we expect that we could improve the results, but that would also increase dimensionality. Besides adding new features, we should also experiment with other algorithms, such as support-vector machines, which perform better in a high-dimensional space. We will also test whether we can expand the method to correct other structural elements of the parliamentary proceedings, such as topic titles.

The Historical Aspects of Dilipad: Challenges and Opportunities

This post is from one of the historians working on the Dilipad project, Luke Blaxill:

The Dilipad project is on one hand exciting because it will allow us to investigate ambitious research questions that our team of historians, social and political scientists, and computational linguists couldn’t address otherwise. But it’s also exciting precisely because it is such an interdisciplinary undertaking, which has the capacity to inspire methodological innovation. For me as a historian, it offers a unique opportunity not just to investigate new scholarly questions, but also to analyse historical texts in a new way.

We must remember that, in History, familiarity with corpus-driven content analysis and semantic approaches is minimal. Almost all historians of language use purely qualitative approaches (i.e. manual reading) and are unfamiliar even with basic word-counting and concordance techniques. Indeed, the very idea of ‘distant reading’ with computers, and of categorising ephemeral and context-sensitive political vocabulary and phrases into analytical groups, is massively controversial even for a single specific historical moment, let alone diachronically or transnationally over decades or even generations. The reasons for this situation in History are complex, but can reasonably be summarised as stemming from two major scholarly trends which have emerged in the last four decades. The first is the wide-scale abandonment of quantitative History after its perceived failures in the 1970s, and the migration of economic history away from the humanities. The second is the influence of post-structuralism from the mid-1980s, which encouraged historians of language to focus on close readings, and to shift from the macro to the micro, and from the top-down to the bottom-up. Political historians’ ambitions became centred on reconstructions of localised culture rather than ontologies, cliometrics, model making, and broad theories. Unsurprisingly, computerised quantitative text analysis found few, if any, champions in this environment.

In the last five years, the release of a plethora of machine-readable historical texts (among them Hansard) online, as well as the popularity of Google Ngram, have reopened the debate on how and how far text analysis techniques developed in linguistics and the social and political sciences can benefit historical research. The Dilipad project is thus a potentially timely intervention, and presents a genuine opportunity to push the methodological envelope in History.

We aim to publish outputs that will appeal to a mainstream audience of historians with little familiarity with our methodologies, rather than prioritising a narrower digital humanities audience. We will aim to make telling interventions in existing historical debates which could not be made using traditional research methods. With this in mind, we are pursuing a number of exciting topics using our roughly two centuries’ worth of Parliamentary data, including the language of gender, imperialism, and democracy. While future blog posts will expand upon all three areas in more detail, I offer a few thoughts below on the first.

The Parliamentary language of gender is a self-evidently interesting line of enquiry during a historic period in which the role of women in the political process in Great Britain, Canada, and the Netherlands was entirely transformed. There has been considerable recent historical interest in the impact of women on the language of politics, and in female rhetorical culture. The Dilipad project will examine differences in vocabulary between male and female speakers, such as the genre of topics raised, as well as discursive elements: hedging, modality, the use of personal pronouns and other discourse markers, especially those which convey assertiveness and emotion. Next to purely textual features, we will analyse how the position of women in parliament changed over time and between countries (the time they spoke, how frequently they were interrupted, the impact of their discourse on the rest of the debate, etc.).

A second area of great interest will be how women were presented and described in debate, both by men and by other women. This line of enquiry might present an opportunity to utilise sentiment analysis (which would in itself be methodologically significant) to shed light on positive or negative attitudes towards women in the respective political cultures of our three countries. We will analyze tone, and investigate which vocabulary and lexical formations tended to be most associated with women. In addition, we can also investigate whether the portrayal of women varied across political parties.

More broadly, this historical analysis could help shed light on the broader impact of women on Parliamentary rhetorical culture. Was there a discernible ‘feminized language of politics’, and if so, where did it appear, and when? Similarly, was there any difference in Parliamentary behaviour between the sexes, with women contributing disproportionately more to debates on certain topics, and less to others? Finally, can we associate the introduction of new Parliamentary topics or forms of argument with the appearance of women speakers?

Insights in these areas – made possible only by linked ‘big data’ textual analysis – will undoubtedly be of great interest to historians, and will (we hope) demonstrate the practical utility of text mining and semantic methodologies in this field.

The importance of “small” words: pronominal selection in parliamentary discourse

This post is by Kaspar Beelen. Kaspar is a post-doctoral researcher and a member of the Toronto branch of our project team.

Introduction

Democratization fundamentally changed the form and function of parliamentary representation. From an assembly dominated by a class of notables, parliament evolved into an arena where socio-economic antagonisms became more and more explicitly articulated by parties and their leaders. In “The Principles of Representative Government”, Bernard Manin described how deliberation in these representative institutions changed from an open discussion between independent MPs into a confrontation between more or less disciplined party formations. Put differently, the deictic center of parliamentary discourse shifted from an independent “I” to an exclusive “we”.

Inspired by Manin and others, I investigated to what extent parliamentary deliberation in the Belgian lower house (“Kamer van Volksvertegenwoordigers”) changed in times of democratization by looking at the linguistic behavior of MPs through the prism of pronominal selection. Instead of analyzing (theoretical) reflections on the changing character of political representation, I chose to study how, from a longitudinal perspective, shifts in the daily discursive practice of MPs relate to the transformation of representative government. In addition to scrutinizing chronological evolution, I analyzed how pronominal selection correlates with certain attributes of the speaker (ideology, power status, age).

Democratization and discursive change


To what extent did a shift from an “I”- to a “we”-centered political culture occur in the Belgian “Kamer van Volksvertegenwoordigers”? Manin’s observations were partly confirmed by my analysis. After 1893 and 1919, when the electoral system was reformed, the use of the first person plural increased significantly. Still, the frequency of “I”-statements remained more or less stable over the studied period (1844-1940) and even showed a slight upward tendency.

The “I” remained by far the most important discursive actor, and instead of dissolving into a “we”, the political “ego” became more expressive. What changed was the way in which representatives articulated their individuality. From a Goffmanian perspective, their speech acts consistently narrowed down their “negative face”, or freedom of action. During the nineteenth century the frequency of mental state verbs such as “croire” and “penser” systematically decreased. “Je pense” and “je crois” – both can be translated as “I think” – signify commitment to a proposition but leave room for deliberation by marking the claim as a personal point of view. In this respect they belong to the most “deliberative” class of first-person expressions, as they leave room for negotiation.

While these phrases were decreasing in frequency, “I”-references embedded in discursive processes (“je dis” (I say), “je demande” (I ask)) were on the rise. The same was true for cognitive processes that signified a higher degree of epistemological certainty (“je sais”) or stronger emotive commitment (“je veux”, “je tiens à”). All of these combinations left less room for negotiation, and metadiscursive parliamentary language came to resemble written discourse more and more. The political “I” moved from “negotiator” to “writer”, and instead of deliberation, the expression of a fixed opinion gained in importance. Although these findings partly confirm Manin’s conclusion, they warn against overemphasizing the influence of parties on processes of identity formation. Despite the increase in collective forms of identification, parliamentary debates remained principally a discussion between individuals. Also, the transformation of the parliamentary “ego” was part of a longitudinal process that started in the middle of the nineteenth century, well before the introduction of universal suffrage.

The pronouns of power/The power of pronouns

Previous research has shown how pronominal selection correlates with specific attributes of the speaker, such as gender, age or social status. In the context of parliamentary debates however, only the power status of the representative significantly correlated with certain pronominal patterns. MPs belonging to the parliamentary majority used the first person singular more frequently, especially in combination with mental state verbs (“je crois” and “je pense”). Closer reading indicated that MPs with more institutional power left more room for negotiation when giving their personal opinion, thereby displaying greater respect for the negative face of the speaker and the audience. In parliament, politeness and power seemed to correlate positively. An analysis of the textual context of mental state verbs also suggested that members belonging to the majority employed “je crois” to introduce positive opinions while opposition members mostly used this expression to signal a negative or critical stance.