Automatická klasifikace vícejazyčných dokumentů

Hlom, Ladislav

Title:	Automatická klasifikace vícejazyčných dokumentů
Other Titles:	Automatic multilingual document classification
Authors:	Hlom, Ladislav
Advisor:	Král Pavel, Doc. Ing. Ph.D.
Referee:	Konopík Miloslav, Ing. Ph.D.
Issue Date:	2016
Publisher:	Západočeská univerzita v Plzni
Document type:	master's thesis
URI:	http://hdl.handle.net/11025/23665
Keywords:	classification;multi-label;svm;maximum entropy;naive bayes;lda;multilingual document classification;machine translation;smt
Abstract:	Automatic document classification is the task of assigning documents to categories according to their content (e.g., politics, sport). The thesis focuses primarily on multi-label classification, in which a document can belong to more than one category. The aim of the thesis was to explore the possibilities of multilingual document classification. The LDA method is compared with classification after machine translation into the target language. Maximum entropy and support vector machines are used as classifiers. The statistical machine translation systems Moses and Google Translate are used to translate the text. Three different collections were selected for testing: the first was provided by the Czech News Agency (Česká tisková kancelář), and the other two were found on the Internet. The experiments showed that the machine translation variant gives solid results, whereas classification using LDA achieved worse results and cannot be recommended for the task. The thesis also shows how translation quality affects the resulting classification.
Rights:	The full text of the thesis is accessible without restriction.
Appears in Collections:	Diplomové práce / Theses (KIV)

Files in This Item:

File	Description	Size	Format
dp.pdf	Full text of the thesis	679.58 kB	Adobe PDF
A14N0126Phodnoceni-ved.PDF	Supervisor's review	476.46 kB	Adobe PDF
A14N0126Pposudek-op.PDF	Referee's review	748.38 kB	Adobe PDF
A14N0126Pobhajoba.PDF	Record of the thesis defense	206.34 kB	Adobe PDF

Please use this identifier to cite or link to this item: http://hdl.handle.net/11025/23665