Title: Efficient Hash Function for Duplicate Elimination in Dictionaries
Authors: Skala, Václav; Hrádek, Jan
Citation: Algoritmy 2009: 18th Conference on Scientific Computing, p. 382-391.
Issue Date: 2009
Publisher: Slovenská technická univerzita v Bratislavě
Document type: preprint
URI: http://hdl.handle.net/11025/11785
ISBN: 978-80-227-3032-7
Keywords: hash function; hash table; data structure
Abstract: Fast elimination of duplicate data is needed in many areas, especially for textual data. A solution to this problem was recently found for geometrical data, using a hash function to speed up the process. Using a hash function is extremely efficient when incremental elimination is required, especially when processing large data sets. In this paper a new construction of the hash function is presented that yields short clusters with only a few collisions. The proposed hash function is not a perfect hash function; nevertheless, it has similar properties. It takes advantage of the relatively large amount of memory available on modern computers and works well with large data sets. Experiments have shown that different approaches should be used for different types of languages, because the structures of Slavonic and Anglo-Saxon languages differ. Therefore, tests were made with a Czech dictionary of 2.5 million words and an English dictionary of 130 thousand words. The algorithm was also tested on a few other languages. Experimental results are presented in this paper as well.
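Illustration: the abstract does not give the proposed hash function, so the following is only a minimal sketch of incremental duplicate elimination with a chained hash table, under stated assumptions. The polynomial hash with multiplier 31, the table size, and the names `DedupTable`, `add`, `max_cluster`, and `unique_words` are hypothetical choices for illustration and are not taken from the paper.

```python
class DedupTable:
    """Chained hash table that rejects duplicates as words arrive."""

    def __init__(self, size=1 << 16):
        # Assumption: a table larger than the word count keeps buckets
        # (clusters) short, trading memory for speed.
        self.size = size
        self.buckets = [[] for _ in range(size)]
        self.count = 0

    def _hash(self, word):
        # Generic polynomial hash over UTF-8 bytes.
        # This is NOT the hash function proposed in the paper.
        h = 0
        for b in word.encode("utf-8"):
            h = (h * 31 + b) % self.size
        return h

    def add(self, word):
        """Insert word; return True if it is new, False if a duplicate."""
        bucket = self.buckets[self._hash(word)]
        if word in bucket:
            return False
        bucket.append(word)
        self.count += 1
        return True

    def max_cluster(self):
        """Length of the longest bucket, a rough measure of clustering."""
        return max(len(b) for b in self.buckets)


def unique_words(words, size=1 << 16):
    """Return the words in first-seen order with duplicates removed."""
    table = DedupTable(size)
    return [w for w in words if table.add(w)]
```

Because each word is checked only against the entries in its own bucket, words can be added one at a time as they arrive, which is the incremental setting the abstract refers to. How short the buckets stay depends on the hash function and the table size.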
Rights: Full text is not accessible.
Appears in Collections: Preprinty / Preprints (KIV)

Files in This Item:
File: Skala_2009_HASH_Dictionary-Algoritmy.pdf
Description: Full text
Size: 311.07 kB
Format: Adobe PDF


