Here Kaspar writes about the work of the project’s two summer interns:
Roman Polyanovsky and Tim Alberdingk Thijm, two Computer Science undergraduates who were working as summer interns on the Dilipad project, have created a video to showcase the project to a general audience. The video gives a good insight into how our digital parliamentary corpora are constructed (without getting into technical details) and shows some exciting preliminary research results.
You can watch the video by following the link below:
Roman’s work consisted of transforming the OCR-ed text of the Canadian proceedings from its raw form into a richly annotated XML dataset. This involved overcoming various challenges, such as optimising complex regular expressions to extract the crucial entities that appear in the proceedings, like speaker names and topic titles. Because of the noise caused by OCR errors, the regular expressions had to be fine-tuned so that they would be neither too conservative, excluding slight deviations caused by OCR errors, nor too general, which would equally lead to information loss. Roman has put a lot of energy into preserving the elaborate topic structure of the original proceedings, which changes over time and differs slightly from that of the UK Hansards. To accomplish these goals, he has built a general and flexible regex transformer that not only accurately converts parliamentary text to XML, but can also be applied to other types of (political) text with only minor changes.
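To give a rough idea of what "tolerant" regular expressions look like in practice, here is a minimal Python sketch. It is not Roman's transformer: the pattern, the element names (`proceedings`, `speech`, `p`, `note`) and the sample lines are all assumptions made up for illustration. The pattern accepts a comma where a period should follow the honorific and a semicolon where a colon should follow the name, two confusions that OCR commonly produces, while still rejecting ordinary sentences.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical speaker-line pattern (an assumption, not the project's regex):
# an honorific, a capitalised name, then a colon, followed by the speech text.
# [.,]? and [:;] tolerate punctuation that OCR tends to confuse.
SPEAKER = re.compile(
    r"^(?P<title>M[rn]s?|Hon|S[il1]r)[.,]?\s+"
    r"(?P<name>[A-Z][A-Za-z'\-]+(?:\s+[A-Z][A-Za-z'\-]+)*)\s*[:;]\s*"
    r"(?P<text>.*)$"
)


def to_xml(lines):
    """Group raw text lines into <speech> elements, one per detected speaker."""
    root = ET.Element("proceedings")
    speech = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = SPEAKER.match(line)
        if match:
            # A new speaker line opens a new speech.
            speech = ET.SubElement(root, "speech", speaker=match.group("name"))
            ET.SubElement(speech, "p").text = match.group("text")
        elif speech is not None:
            # Continuation of the current speech.
            ET.SubElement(speech, "p").text = line
        else:
            # Text before the first speaker, e.g. a procedural note.
            ET.SubElement(root, "note").text = line
    return root


sample = [
    "The House met at three o'clock.",
    "Mr. MACKENZIE: I rise to speak.",
    "Mr, Blake; The bill is sound.",  # OCR noise: ',' for '.' and ';' for ':'
    "It deserves support.",
]
root = to_xml(sample)
print(ET.tostring(root, encoding="unicode"))
```

Loosening the pattern further (for example, allowing any word before the colon) would start to swallow ordinary sentences, while tightening it to exact punctuation would lose the third sample line. That is the trade-off described above.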
Tim focused on both enhancing the database and performing prospective research on the discourse on migration. The database part of his internship consisted of improving the biographical information on UK MPs, which in practice meant adding and correcting party information and solving multiple disambiguation problems. Because members were originally identified and linked with string-matching techniques, those who shared the same name were often assigned the same ID. Tim managed to reduce the number of ambiguous MPs, or those without correct biographical information, to almost zero, and has thus significantly improved the overall quality of the Dilipad metadata. Besides this, he made sure that almost all of the UK MPs were properly linked to external databases such as DBpedia and Wikipedia. In his research, Tim examined the ideological differences in debates on immigration in the UK Parliament, and applied Structural Topic Models to extract Labour and Conservative frames on this topic.
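The name-collision problem is easy to reproduce. The Python sketch below uses two fictitious members with the same name and shows why matching on the name string alone is ambiguous. Adding the date of the sitting as a second key is one plausible way to resolve such cases; it is an assumption for illustration, not a description of the method Tim used, and the `Member` record and its fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Member:
    """Hypothetical biographical record; the field names are assumptions."""
    member_id: str
    name: str
    party: str
    start: date  # first day in office
    end: date    # last day in office


# Fictitious records: two different people sharing one name.
MEMBERS = [
    Member("m1", "John Smith", "Party A", date(1906, 1, 1), date(1922, 12, 31)),
    Member("m2", "John Smith", "Party B", date(1945, 1, 1), date(1959, 12, 31)),
]


def match_by_name(name, members):
    """Pure string matching: returns every member with this exact name."""
    return [m for m in members if m.name == name]


def match_by_name_and_date(name, sitting, members):
    """Also require that the member was in office on the day of the sitting."""
    return [m for m in members if m.name == name and m.start <= sitting <= m.end]


print(len(match_by_name("John Smith", MEMBERS)))
print(match_by_name_and_date("John Smith", date(1950, 3, 1), MEMBERS))
```

With the name alone, both records come back and a speech cannot be attributed to one person. With the sitting date added, only the member in office at that time remains. Namesakes who served at the same time would still need a further key, such as constituency or party.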