Description
This text archive focuses on German political speeches given by top officials, mostly from 1990 onwards, selected according to their political relevance. This is a work in progress; updated and extended versions will follow. The currently included texts come from the following sources:
- Official pages of the German Presidency, Chancellery, Bundestag, Ministry of Foreign Affairs
- Personal pages of the Helmut Kohl archive, Wolfgang Thierse and Norbert Lammert
Reference
If you use the texts, please cite at least one of these references:
- Barbaresi, Adrien (2018). "A corpus of German political speeches from the 21st century", Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), pp. 792–797.
  http://purl.org/corpus/german-speeches (BibTeX entry)
- Adrien Barbaresi. (2019). German Political Speeches Corpus (Version v4.2019) [Data set]. Zenodo.
  https://doi.org/10.5281/zenodo.3611246
Feel free to contact me if you have questions or if you would like to collaborate on this corpus.
Data
Work with the texts
The corpus can be queried online using a faceted full-text search featuring linguistic annotation:
- Online queries on the DWDS website and usage instructions (the text base may be newer than the downloadable archives)
Appropriate tooling:
- Python tutorial using the speeches: Natural Language Processing — Einsteigen und Loslegen!
- CorpusExplorer, corpus linguistics and text mining software featuring the speeches
- List of off-the-shelf NLP tools for German
Current version (4th release, 2019)
The corpus currently includes a total of 6,685 speeches by 71 speakers, covering the period from 1984 to 2017 and amounting to about 13 million words. The files below consist of texts with metadata encoded in XML format.
- Current archive (v4, 2019) – raw text and metadata (28 MB ZIP-file)
- Description (2018) and corresponding BibTeX entry
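As a starting point for working with the downloaded archive, the following minimal Python sketch walks through the XML files inside a ZIP file and collects the attributes and text of each record. The XML layout of the corpus is not described on this page, so the element name is passed in as a parameter, and the names used in the demo (`speech`, `speaker`, `date`, `demo.xml`) are invented for illustration only. Inspect one of the real files first and adapt the tag name accordingly.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def iter_records(archive, record_tag):
    """Yield (file_name, attributes, text) for every <record_tag> element
    found in the XML files of a ZIP archive (path or binary file object)."""
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if not name.lower().endswith(".xml"):
                continue  # skip readme files and other non-XML members
            with zf.open(name) as handle:
                root = ET.parse(handle).getroot()
            for elem in root.iter(record_tag):
                # Join all nested text and normalise whitespace.
                text = " ".join("".join(elem.itertext()).split())
                yield name, dict(elem.attrib), text


def build_demo_archive():
    """Build a tiny in-memory ZIP with an invented layout (assumption,
    not the actual corpus schema)."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<collection>\n'
        '  <speech speaker="A" date="2001-01-01">Guten  Tag.</speech>\n'
        '  <speech speaker="B" date="2002-02-02">Vielen\nDank.</speech>\n'
        '</collection>\n'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("demo.xml", xml)
        zf.writestr("README.txt", "not XML")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Replace build_demo_archive() with the path to the downloaded ZIP file
    # and "speech" with the element name actually used in the corpus files.
    for name, meta, text in iter_records(build_demo_archive(), "speech"):
        print(name, meta, len(text.split()))
```

Reading directly from the ZIP file avoids unpacking the archive to disk, and keeping the tag name as a parameter means the same function can be tried on the legacy archives if their layout differs.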
Legacy versions (outdated, for reproducibility only)
- LREC 2018 conference version (11 MTokens, 26 MB ZIP-file)
- Presidency Corpus (1984-2012, 1,442 texts, 2.4 MTokens): Raw text and metadata version (5 MB ZIP-file), Tokenized and tagged XML TEI version (20 MB ZIP-file)
- Chancellery Corpus (1998-2011, 1,831 texts, 3.9 MTokens): Raw text and metadata version (8 MB ZIP-file), Tokenized and tagged XML TEI version (35 MB ZIP-file)
Visualizations (beta version from 2018)
- Presidency Corpus Keyword Visualization
  List of the 2046 speeches
- Chancellery Corpus Keyword Visualization
  List of the 2662 speeches
- Ministry of Foreign Affairs Corpus Keyword Visualization
  List of the 1275 speeches
For maintenance reasons the pages are static: word lists of relevant queries, output as web pages (CSS/XHTML).
Mentions
The mentions below are updated on a regular basis.
Corpus and Computational Linguistics
- Barbaresi, A. (2015). Ad hoc and general-purpose corpus construction from web sources. PhD thesis, École Normale Supérieure de Lyon.
- Birch, A., Huck, M., Durrani, N. & Koehn, P. (2014). Edinburgh SLT and MT System Description for the IWSLT 2014 Evaluation. In Proceedings of the 10th International Workshop on Spoken Language Translation, pp. 40-48.
- Costa, A. (2019). Koder - A multi-register corpus for investigating register variation in contemporary German. Research in Corpus Linguistics, 7, 69-83. DOI:10.32714/ricl.07.04.
- Dang-Anh, M. and Rüdiger, J. O. (2015). From Frequency to Sequence: How Quantitative Methods Can Inform Qualitative Analysis of Digital Media Discourse. 10plus1: Living Linguistics, 1:57–73.
- Deyringer, V. (2015). "Text Classification with Support Vector Machines – Authorship Attribution of German Federal President Speeches", Masters thesis, CIS Munich.
- Esposito, F. (2017). Unsupervised Recognition of Motion Verbs Metaphoricity in Atypical Political Dialogues, PhD thesis, University of Naples Federico II.
- Faaß, G. & Heid, U. (2012) "Deutsche politische Kommunikation der Gegenwart als linguistisch annotiertes Korpus", Poster at the DGfS-CL Poster-Session, Frankfurt, 2012.
- Freitag, M., Wuebker, J., Peitz, S., Ney, H., Huck, M., Birch, A., Durrani, N., Koehn, P., Mediani, M., Slawik, I., et al. (2014). Combined spoken language translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).
- Graszewicz, M. (2014). Korpus wypowiedzi polskich polityków (KWPP)/Corpus of statements by Polish politicians, Dziennikarstwo i Media, 5, 205-214.
- Geyken, A., Barbaresi, A., Didakowski, J., Jurish, B., Wiegand, F., and Lemnitzer, L. (2017). Die Korpusplattform des "Digitalen Wörterbuchs der deutschen Sprache" (DWDS). Zeitschrift für germanistische Linguistik, 45(2):327–344
- Guillou, L. K. (2016). Incorporating pronoun function into statistical machine translation, PhD thesis, University of Edinburgh.
- Huck, M., & Birch, A. (2015). The Edinburgh Machine Translation Systems for IWSLT 2015
- Jehl, L., Simianer, P., Hitschler, J., and Riezler, S. (2015). The Heidelberg University English-German translation system for IWSLT 2015. Proceedings of IWSLT.
- Kilgour, K. et al. (2013) The 2013 KIT IWSLT Speech-to-Text Systems for German and English. International Workshop on Spoken Language Translation (IWSLT).
- Kilgour, K. et al. (2014) The 2014 KIT IWSLT Speech-to-Text Systems for English, German and Italian. International Workshop on Spoken Language Translation (IWSLT).
- Kilgour, K. (2015). Modularity and Neural Integration in Large-Vocabulary Continuous Speech Recognition, PhD thesis, Karlsruher Institut für Technologie.
- Kuhr, C. (2016). Automatic Speech Recognition and Natural Language Processing, Research project report, TH Köln.
- Kupietz, M. & Lüngen, H. (2014) Recent Developments in DeReKo, Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pp. 2378-2385 (a reference corpus for German)
- Lüngen, H. (2017). DeReKo – Das Deutsche Referenzkorpus. Zeitschrift für germanistische Linguistik, 45(1):161–170.
- Müller, M., Nguyen, T. S., Sperber, M., Kilgour, K., Stüker, S., & Waibel, A. The 2015 KIT IWSLT Speech-to-Text Systems for English and German
- Olifenko, I. & Borysova, N. (2017). Analysis of existing German Corpora. In Computational linguistics and intelligent systems (COLINS 2017): Proceedings of the 1st International conference, National Technical University «KhPI», Lviv Polytechnic National University, Kharkiv.
- Osenova, P. & Simov K. (2012) "The Political Speech Corpus of Bulgarian", Proceedings of LREC, pp. 1744-1747.
- Pace-Sigge, M. (2015). Applying the concepts of Lexical Priming to German polysemantic words, Corpus Linguistics 2015, Lancaster.
- Ruppenhofer, J., Strauß, J. M., Sonntag, J., & Gindl, S. (2014). IGGSA-STEPS: Shared Task on Source and Target Extraction from Political Speeches, JLCL 29(1), pp. 33-46.
- Tandashvili, M. (2014). Title unknown (PDF).
- Tiepmar, J. and Heyer, G. (2017). An Overview of Canonical Text Services. Linguistics and Literature Studies, 5(2):132–148.
- Tiepmar, J. (2018). Big Data and Digital Humanities, Archives of Data Science, Series A (Online First), 5(1), KIT Scientific Publishing.
- Wiedemann, G. and Niekler, A. (2016). Analyse qualitativer Daten mit dem "Leipzig Corpus Miner". In Text Mining in den Sozialwissenschaften, pages 63–88. Springer.
- Zhu, L., Kilgour, K., Stüker, S., & Waibel, A. (2015). Gaussian Free Cluster Tree Construction Using Deep Neural Network. In Sixteenth Annual Conference of the International Speech Communication Association.
- Чудинов, А. П., Будаев, Э. В., Дзюба, Е. В., Кошкарова, Н. Н., Кондратьева, О. Н., Никифорова, М. В., ... & Солопова, О. А. (2016). Теория и методика лингвистического анализа политического текста: монография. Екатеринбург: Уральский государственный педагогический университет.
History and Political Science
- Caccamo, D. (2012). Multipolarismo e American leadership nel discorso politico internazionale. Rivista di Studi Politici Internazionali, 79(4), pp. 549-566.
- Ditfurth, J. (2012). Zeit des Zorns: warum wir uns vom Kapitalismus befreien müssen. Westend Verlag.
- du Plessis, J. J., & Saenger, I. (2017). An overview of the corporate governance debate in Germany. In German corporate governance in international and European context (pp. 17-62). Springer, Berlin, Heidelberg.
- Engelhardt, F. (2014). Deutschland einig Einwanderungsland? Integrationspolitik im Spannungsfeld von Gleichheits- und Differenzvorstellungen, PhD thesis, University of Marburg.
- Górajek, A. (2015). Von der Mehrdimensionalität der Geschichte–Gerhard Schröder und seine Haltung gegenüber Polen. Zeitschrift des Verbandes Polnischer Germanisten, 4(4), 273-280.
- Helwig, N. (2014). The High Representative of the Union: The constrained agent of Europe’s foreign policy, PhD thesis, Universität zu Köln.
- James, J. (2012). Preservation and National Belonging in Eastern Germany: Heritage Fetishism and Redeeming Germanness, Palgrave Macmillan.
- Kohlmann, S. (2017). Frank-Walter Steinmeier: Eine politische Biographie, transcript Verlag, Bielefeld.
- Koszel, B. (2018). Rola Niemiec w procesach decyzyjnych Unii Europejskiej w XXI wieku, PhD thesis, UAM Poznań.
- Neumann, D. (2016). Das Ehrenamt nutzen: Zur Entstehung einer staatlichen Engagementpolitik in Deutschland, transcript Verlag, Bielefeld.
- Petroleka, M. and Damianova, K. (2017). Do Geopolitical Changes Challenge Turkey’s Role In EU’s Energy Security Structure?, EUCERS Newsletter 61, King's College London.
- Pühringer, S. (2015) Markets as "ultimate judges" of economic policies: Angela Merkel's discourse profile during the economic crisis and the European crisis policies, ICAE Working Paper Series No. 31, 2015. (also: Pühringer, S. (2015). On the Horizon, 23(3), 246-259.)
- Petersen, R., & Reinert, S. (2018). Mobilität für morgen. In Verkehrspolitik (pp. 467-489). Springer VS, Wiesbaden.
- Pühringer, S. (2015). Marktmetaphoriken in Krisennarrativen von Angela Merkel. Ötsch, Walter Otto, Hirte, Katrin, Pühringer, Stephan, Bräutigam, Lars (Hg.): Markt, 229-251.
- Schax, A. (2012). Tracing transformations, The development of Germany’s Strategic Culture during the last two decades, Masters thesis, Utrecht University.
- Seewald, F. (2013). Die deutsche Außen-und Sicherheitspolitik von 2001 bis 2012 im Lichte des Zivilmachtkonzepts. Master's thesis, University of Hagen.
- Simons, J. P. (2014). Discourse and the Shift in Social Democratic Ideology and Employment Policies: A Comparison of the PvdA and the SPD, Masters thesis, Leiden University.
- Steinbacher, K. (2019). Exporting the Energiewende: German Renewable Energy Leadership and Policy Transfer, Springer VS, Wiesbaden.
- Thonfeld, C. (2014). Cosmopolitan Normalisation? The Culture of Remembrance of World War II and the Holocaust in Unified Germany, 臺大歷史學報 53, pp. 181-227.
- Trzcielińska-Polus, A. (2017). 25 lat polityki zjednoczonych Niemiec wobec Polski. In Ciesielska-Klikowska J., Kuczyński E. (red.), 25 lat niemieckiego zjednoczenia. Bilans ćwierćwiecza, Wydawnictwo Uniwersytetu Łódzkiego, Łódź.
- Tsaruski, Y. (2015). The German Revolution of 1918–1919 in modern studies and in public perception, History magazine: researches, 2015-3, pp. 280-287.
- van de Rijt, L. C. M. (2015). Enabling and Constraining: A Study on Possibilities of Agents in the EU-Polity during the Turkish Accession Process from 1999 until 2013.
- 裴雷, 孙建军, & 周兆韬/Pei Lei, Sun Jianjun, Zhou Zhaotao. (2016). 政策文本计算: 一种新的政策文本解读方式/Policy Text Computing: A New Methodology of Policy Interpretation. 图书与情报, 160(06), 47-55.
Miscellaneous
- 20 Best German Language Datasets for Machine Learning, Alex Nguyen, Lionbridge AI.
- Awesome Public Datasets, GitHub
- Blog of the computational linguist Daniel Stein
- Course material (Neural Computation & Self Organization) at the IUPR (Kaiserslautern)
- Erik Gahner Larsen's Poldata R module
- German TeX user association, syllabification project for TeX
- Jäkel, R., Peukert, E., Nagel, W. E., & Rahm, E. (2018). ScaDS Dresden/Leipzig–A competence center for collaborative big data research. it-Information Technology, 60(5-6), 327-333.
- Natural Language Processing — Einsteigen und loslegen!, Thomas Timmermann, codecentric.de
- nlp-datasets, GitHub
- Non-English, Parallel & Multilingual Corpora on David Lee's Bookmarks for Corpus-based Linguists
- Non-English, Parallel & Multilingual Corpora (a selection)
- Texts and Corpora page of The Linguist List
- Tools for linguists on The Lousy Linguist blog
- Evaluation Campaign of the International Workshop on Spoken Language Translation (IWSLT) 2013, Data Permissible for MT model and ASR Language Model Training.
- University of British Columbia, Library, Non-English Language Corpora
- Zeit für Vermittlung
Changelog
2019-06-17 4th release: Augmented text base, deduplication and refined metadata.
2018-09-28 Refined speaker metadata and text base for the Chancellery.
2018-08-30 Refined text base and updated visualizations.
2018-05-09 3rd release, updated text archive.
2012-08-03 First part of the (now outdated) code released: https://github.com/adbar/gps-corpus-builder
2012-03-05 2nd version: POS-tags, lemmas, XML TEI, keywords.
2011-12-06 Readme and CC BY-SA license added.
2011-09-08 Better visualizations of the speeches and better formatting.
2011-08-16 Minor bugs corrected.
2011-07-25 First release.