The Czech Broadcast Conversation Corpus

Kolář, Jáchym; Švec, Jan

Title:	The Czech Broadcast Conversation Corpus
Authors:	Kolář, Jáchym; Švec, Jan
Citation:	KOLÁŘ, Jáchym; ŠVEC, Jan. The Czech broadcast conversation corpus. In: Text, speech and dialogue. Berlin: Springer, 2009, p. 101-108. (Lecture Notes in Computer Science; 5729). ISBN 978-3-642-04207-2.
Issue Date:	2009
Publisher:	Springer
Document type:	article
URI:	http://www.kky.zcu.cz/cs/publications/JachymKolar_2009_TheCzechBroadcast; http://hdl.handle.net/11025/17175
ISBN:	978-3-642-04207-2
Keywords:	broadcast news; speech recognition; linguistic analysis
Abstract:	This paper presents the final version of the Czech Broadcast Conversation Corpus, which will shortly be released by the Linguistic Data Consortium (LDC). The corpus contains 72 recordings of a radio discussion program, yielding about 33 hours of transcribed conversational speech from 128 speakers. The release includes not only verbatim transcripts and speaker information but also structural metadata (MDE) annotation, which comprises labeling of sentence-like unit boundaries, marking of non-content words such as filled pauses and discourse markers, and annotation of speech disfluencies. The MDE annotation is based on the LDC's annotation standard for English, with changes to accommodate phenomena specific to Czech. In addition to its importance for research on speech recognition, speaker diarization, and structural metadata extraction, the corpus is also useful for linguistic analysis of conversational Czech.
Rights:	© Jáchym Kolář - Jan Švec
Appears in Collections:	Články / Articles (KKY)

Files in This Item:

File	Description	Size	Format
JachymKolar_2009_TheCzechBroadcast.pdf	Full text	179.85 kB	Adobe PDF

Please use this identifier to cite or link to this item: http://hdl.handle.net/11025/17175
