DGD 2.0: A Web-based Navigation Platform for the Presentation and Retrieval of German Speech Corpora
| Author: | Gasch, Joachim |
| Abstract: | |
| The "Institut für Deutsche Sprache" (IDS) hosts a wide range of historical and contemporary German speech corpora. Many of the historical corpora, especially dialectological corpora such as the Zwirner Corpus, can be accessed online via the "Database for Spoken German" (DGD). We are currently developing a new, generic speech corpus management system. Its main objectives are the normalized, sustainable integration of historical and more recent speech corpora and the implementation of improved multi-user interfaces for corpus exploration and retrieval. XML schema-based standardization at the level of metadata documentation and transcript data allows the implementation of exact mapping mechanisms for the import and export of existing and future speech corpora. The system offers full-text search and, depending on the degree of structuring, additional structure-aware, context-sensitive retrieval functions. Speech corpus management systems process meta-information describing their media sources. However, the information structures of the data components used in different speech corpus projects may vary considerably, depending on the linguistic research questions investigated by their creators. Such differences between speech corpora can originate, for example, from the genres represented, the degree of content restriction, the physical data structure or the research field in focus. Therefore, the new Web-based speech corpus navigation platform focuses on an abstract standardization concept that fits large speech corpus collections rather than creating particular solutions for the data sets of single speech corpus projects. This cross-corpus perspective leads to the definition of a generic, system-wide data model that allows smooth integration of corpus data without information loss.
The components of this data model are hierarchically interlinked by system-wide unique identifiers:
(a) the structured XML documentation instances on corpus, event and speaker level;
(b) the unstructured, semi-structured or time-aligned transcripts;
(c) the media sources;
(d) in some cases, additional unstructured secondary documents.
The output quality of cross-corpus information retrieval may vary, as it is always directly determined by the corpus data component with the lowest degree of structure or data quality. |
|
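The identifier-based linking described in the abstract can be illustrated with a minimal Python sketch. It is not the DGD implementation: the class names, the identifier format ("C1.E1.T1") and the three-step ordering of structuring degrees are assumptions made purely for illustration. The sketch shows components linked to their parents by unique identifiers, and a cross-corpus query whose usable retrieval level is bounded by the least structured transcript involved.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List, Optional


class Structure(IntEnum):
    """Hypothetical ordering of structuring degrees, lowest first."""
    UNSTRUCTURED = 0
    SEMI_STRUCTURED = 1
    TIME_ALIGNED = 2


@dataclass
class Component:
    uid: str                      # system-wide unique identifier (format assumed)
    kind: str                     # e.g. "corpus", "event", "transcript", "media"
    parent: Optional[str] = None  # uid of the parent component, if any
    # Only meaningful for transcripts in this sketch.
    structure: Structure = Structure.UNSTRUCTURED


class Registry:
    """Hypothetical in-memory store keyed by unique identifier."""

    def __init__(self) -> None:
        self._by_uid: Dict[str, Component] = {}

    def add(self, component: Component) -> None:
        if component.uid in self._by_uid:
            raise ValueError(f"duplicate identifier: {component.uid}")
        if component.parent is not None and component.parent not in self._by_uid:
            raise KeyError(f"unknown parent: {component.parent}")
        self._by_uid[component.uid] = component

    def path(self, uid: str) -> List[str]:
        """Return the identifiers from the root down to the given component."""
        chain: List[str] = []
        current: Optional[str] = uid
        while current is not None:
            chain.append(current)
            current = self._by_uid[current].parent
        return list(reversed(chain))

    def retrieval_level(self, transcript_uids: List[str]) -> Structure:
        """The least structured transcript bounds what a joint query can use."""
        return min(self._by_uid[u].structure for u in transcript_uids)


# Two hypothetical corpora with differently structured transcripts.
reg = Registry()
reg.add(Component("C1", "corpus"))
reg.add(Component("C1.E1", "event", parent="C1"))
reg.add(Component("C1.E1.T1", "transcript", parent="C1.E1",
                  structure=Structure.UNSTRUCTURED))
reg.add(Component("C2", "corpus"))
reg.add(Component("C2.E1", "event", parent="C2"))
reg.add(Component("C2.E1.T1", "transcript", parent="C2.E1",
                  structure=Structure.TIME_ALIGNED))

print(reg.path("C1.E1.T1"))
print(reg.retrieval_level(["C1.E1.T1", "C2.E1.T1"]).name)
```

In this sketch, a query spanning both transcripts falls back to the unstructured level, so only full-text search would be available across them, while a query restricted to the second corpus could use time-aligned retrieval.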
