Accelerating and validating ex post facto hypothesis formulation

This is a guest post by Evelijn Martinius, highlighting findings from her internship on the project in January 2015:

Researchers often set out to test their hypothesis, but the question I would like to pose in this post is: how do they come to frame their hypothesis? Generally, researchers test which factors seem to be associated with certain conditions. The first problem with this type of “ex post facto” hypothesis formulation is the level of generalization: because these hypotheses are usually formulated with a specific event in mind, it might be difficult to generalize the findings. Secondly, observing the behavior of a pre-established dependent variable is more likely to lead you to correlations between variables than to causal relations. Most importantly, it is also argued that the non-randomized sample that is selected can threaten the validity of the research.

The character of the Dilipad database allows us to address some of these disadvantages quite easily. Hypothesis formulation benefits from data that are homogeneous yet still hold a lot of variance due to the large sample size. A simple method, such as counting keywords, can give quick insight into the data and therefore speed up hypothesis formulation.

Let’s take an example of counting keywords in the Dilipad database. It is often assumed that after the depillarization in the Netherlands politicians are more prone to ‘personalization’ (Kriesi, 2012; McAllister, 2009). Personalization differs from individualization because the public function remains dominant over the person’s image. Personalization can highlight idiosyncrasies in wording during debates because the focus is on the person rather than the party (Bennett, 2012). Because the first cabinet of Drees was a post-war cabinet that governed during the pillarization, we can expect that its politicians referred to their followers more than politicians in the first cabinet of Kok did. We can test this expectation, for instance, by counting the usage of ‘I’ and ‘we’ in these two parliamentary terms.
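As a rough illustration of what such a count could look like, here is a minimal Python sketch. It does not use the Dilipad data or the PoliticalMashup search engine; the function name `pronoun_rates`, the Dutch pronoun forms and the toy sentences are all assumptions made for the example. It reports occurrences per 1,000 tokens rather than raw counts, so that terms with different amounts of speech remain comparable.

```python
import re
from collections import Counter

# Assumed pronoun forms: the debates are in Dutch, so 'I' is taken to be
# "ik" and 'we' to be "wij" or "we". Adjust to the forms you want to count.
PRONOUNS = {"I": {"ik"}, "we": {"wij", "we"}}


def pronoun_rates(speeches):
    """Return occurrences per 1,000 tokens for each pronoun group."""
    # Lowercase and split every speech into word tokens.
    tokens = [t for s in speeches for t in re.findall(r"\w+", s.lower())]
    counts = Counter(tokens)
    total = len(tokens)
    return {
        label: (1000 * sum(counts[f] for f in forms) / total if total else 0.0)
        for label, forms in PRONOUNS.items()
    }


# Toy sentences invented for illustration; these are NOT taken from Dilipad.
speeches_by_term = {
    "Drees I": ["Wij menen dat wij dit moeten steunen.", "Ik dank de Kamer."],
    "Kok I": ["Ik vind dat ik hier duidelijk over ben.", "We gaan door."],
}

for term, speeches in speeches_by_term.items():
    print(term, pronoun_rates(speeches))
```

With real data, the lists of toy sentences would be replaced by the speeches of each parliamentary term, however you choose to export them.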

Counting ‘I’ and ‘we’ in 1951 and 1995, the first results seem to indicate little change in the usage of the word ‘I’. Current political theory would have a hard time explaining this, which might make it an interesting topic to examine in more depth.

While the Dilipad database allows us to speed up the process of hypothesis formulation, it has another promising quality. The focus on statistically significant outcomes might lead us to concentrate on events, as it is hard to explain the variance that we see over time theoretically or statistically. The word count of ‘I’ and ‘we’ shows the difficulty immediately: we cannot explain this variance with current political theory, although there may be an explanation from a sociolinguistic perspective. Larger databases like Dilipad, however, allow us to extend our timeline and see how the variance developed between 1951 and 1995, perhaps eventually discovering new insights from that pattern. We could also count more keywords and see what this does to the variance. Working with the Dilipad data and using the search engine of PoliticalMashup makes hypothesis formulation a lot faster and easier, which leaves us with more time to focus on examining new topics.
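Extending the timeline could be sketched along the following lines. Again, this is only an illustration under assumptions: the helpers `rate_per_1000` and `yearly_series`, the choice of the Dutch form "ik" for ‘I’, and the toy sentences are hypothetical and not part of Dilipad or PoliticalMashup. The idea is to compute one keyword rate per year and then look at how much those rates vary.

```python
import re
from statistics import pvariance


def rate_per_1000(texts, keyword_forms):
    """Occurrences of any of the keyword forms per 1,000 tokens."""
    tokens = [t for s in texts for t in re.findall(r"\w+", s.lower())]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in keyword_forms)
    return 1000 * hits / len(tokens)


def yearly_series(speeches_by_year, keyword_forms):
    """Map each year (in chronological order) to its keyword rate."""
    return {
        year: rate_per_1000(texts, keyword_forms)
        for year, texts in sorted(speeches_by_year.items())
    }


# Toy sentences invented for illustration; these are NOT taken from Dilipad.
toy_speeches = {
    1951: ["Wij menen dat ik dit moet zeggen."],
    1973: ["Ik denk dat wij verder moeten."],
    1995: ["Ik vind dat ik hier duidelijk over ben."],
}

series = yearly_series(toy_speeches, {"ik"})
print(series)
# Population variance of the yearly rates, as a crude summary of change.
print(pvariance(list(series.values())))
```

Counting more keywords would then amount to calling `yearly_series` once per set of keyword forms and comparing the resulting series.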


Bennett, W.L. (2012). ‘The personalization of politics: political identity, social media and changing patterns of participation.’ The ANNALS of the American Academy of Political and Social Science, 644(1), 20-39.

Kriesi, H. (2012). ‘Personalization of national election campaigns.’ Party Politics, 18(6), 825-844.

McAllister, I. (2009). ‘The personalization of politics.’ In R.J. Dalton and H.-D. Klingemann (eds.), The Oxford Handbook of Political Behavior (pp. 571-588). Oxford: Oxford University Press.