generate_pedict_huff − PAdict database converter |
generate_pedict_huff -i inputfile
-o outputfile [-full] |
generate_pedict_huff converts EDICT-like database files (EUC encoding) into files that can be used with PAdict (see padict.sf.net). You need a dictionary database such as "edict" itself; these can be found, e.g., on Jim Breen's nihongo FTP server: http://ftp.monash.edu.au/pub/nihongo/00INDEX.html

In "normal" mode, only words carrying the "public" marker (P) are added to the output file; you can also convert the full dictionary with the -full option. Originally, dictionaries could not contain more than 65536 records, in which case you should use the "old" edict, edict_prev.gz. The database generator has since been updated to handle the full databases; only when you use "-oldformat" are you still bound to 65536 records per file.

An alternative is to use a word frequency file to filter words by their frequency in spoken or written language. Such files can also be found on the nihongo FTP server; look for a file called wordfreq_ck.zip. It is not that accurate, because the lists are based on the Mainichi Shinbun newspaper and therefore reflect written rather than spoken language. Do not use wordfreq.zip, because it has the wrong format (see below).

The expected format of a word frequency file is:

word[TAB]frequency

The pos number relates to the POS category in the "wordtype" file. Everything is in EUC coding. |
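To check a frequency file before passing it to -frequency, the word[TAB]frequency layout described above can be read with a few lines of Python. This is only a sketch, not the converter's own parser: the function names parse_wordfreq and load_wordfreq are made up for illustration, the text before the last TAB is treated as an opaque key (how a pos number is attached to it is not specified here), and EUC-JP is assumed as the concrete EUC encoding.

```python
import io


def parse_wordfreq(stream):
    """Parse 'word<TAB>frequency' lines into a dict mapping word -> count."""
    freqs = {}
    for raw in stream:
        line = raw.rstrip("\r\n")
        if "\t" not in line:
            continue  # skip blank lines and lines without a TAB separator
        # Everything before the last TAB is the key, the rest is the count.
        word, _, count = line.rpartition("\t")
        try:
            freqs[word] = int(count)
        except ValueError:
            continue  # skip lines whose last field is not an integer
    return freqs


def load_wordfreq(path):
    # Assumption: "EUC coding" means EUC-JP for these Japanese word lists.
    with open(path, encoding="euc_jp", errors="replace") as fh:
        return parse_wordfreq(fh)


# Small in-memory sample instead of a real wordfreq_ck file.
sample = io.StringIO("日本\t5000\n猫\t1200\nbroken line\n犬\tmany\n")
freqs = parse_wordfreq(sample)
print(freqs)
```

Lines that do not fit the layout are skipped silently in this sketch; a file such as wordfreq.zip with a different format would therefore yield few or no entries, which is an easy way to spot it.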
-i inputfile |
name of the input file (EUC edict format) |
-o outputfile |
name of the output file (e.g. pedict.pdb) |
-full |
generate a full edict, not only public (P) words. Remember that when using -oldformat you cannot generate files containing more than 65535 records. PAdict versions older than 1.0.0 cannot read files generated without "-oldformat". |
-oldformat |
generate the pre-PAdict-1.0.0 format. You also need to set records_per_pdb to 101 (see example). |
-records_per_pdb recordcount |
records per pdb database record. For old format, this has to be 101. |
-frequency freqfile |
use a word frequency file and threshold to shrink the number of records |
-threshold threshold |
only needed when -frequency is used. Gives the threshold in absolute occurrences; words below this threshold are not added to the output file. |
-keeppublic |
optional if -frequency is used. This makes sure that all words that are marked as public, "(P)", are kept even if they don’t match the frequency threshold. |
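How -full, -frequency, -threshold and -keeppublic interact can be pictured with the following Python sketch. It is an interpretation of the option descriptions above, not the converter's actual code. Several points are assumptions: that the frequency filter is applied on top of the public/full selection, that a word missing from the frequency file counts as frequency 0, and that the lookup key is the bare EDICT headword (the first space-separated field of a line). The names headword, is_public and keep_entry are hypothetical.

```python
def headword(edict_line):
    # EDICT lines start with the headword: "WORD [reading] /gloss/.../"
    return edict_line.split(" ", 1)[0]


def is_public(edict_line):
    # The "public" marker is the literal string "(P)".
    return "(P)" in edict_line


def keep_entry(edict_line, full=False, freqs=None, threshold=0, keep_public=False):
    """Decide whether one EDICT line would go into the output file."""
    public = is_public(edict_line)
    if not full and not public:
        return False  # normal mode: only public words
    if freqs is None:
        return True  # no -frequency given: nothing more to filter
    if keep_public and public:
        return True  # -keeppublic: public words bypass the threshold
    # Assumption: words missing from the frequency file count as 0.
    return freqs.get(headword(edict_line), 0) >= threshold


lines = [
    "猫 [ねこ] /(n) cat/(P)/",
    "犬 [いぬ] /(n) dog/(P)/",
    "鵺 [ぬえ] /(n) chimera/",
]
freqs = {"猫": 5000, "犬": 300, "鵺": 2}

kept = [headword(l) for l in lines
        if keep_entry(l, full=True, freqs=freqs, threshold=1200)]
kept_public = [headword(l) for l in lines
               if keep_entry(l, full=True, freqs=freqs, threshold=1200,
                             keep_public=True)]
print(kept)         # only the word at or above the threshold
print(kept_public)  # additionally keeps the public word below it
```

With a threshold of 1200, only 猫 passes the frequency filter; adding keep_public also retains 犬 because it carries the (P) marker.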
generate a proper dictionary for PAdict 0.3.2:

generate_pedict_huff -i edict -o pedict.pdb -oldformat -records_per_pdb 101

generate a full dictionary for PAdict 0.3.2, filtered against a frequency file:

generate_pedict_huff -i edict -o pedict_resized.pdb -full -oldformat -records_per_pdb 101 -frequency wordfreq -threshold 1200 |
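When the converter is called from a script, the option dependencies described on this page can be checked before the command is run. The helper below is a hypothetical wrapper, not part of PAdict. It uses only the options documented here, and its validation rules are this sketch's reading of the text: -oldformat needs -records_per_pdb 101, and -threshold and -keeppublic are rejected without -frequency.

```python
import shlex


def build_command(infile, outfile, full=False, oldformat=False,
                  records_per_pdb=None, frequency=None, threshold=None,
                  keeppublic=False):
    """Build an argument list for generate_pedict_huff (hypothetical helper)."""
    if oldformat and records_per_pdb != 101:
        raise ValueError("-oldformat requires -records_per_pdb 101")
    if frequency is None and (threshold is not None or keeppublic):
        raise ValueError("-threshold and -keeppublic only apply with -frequency")

    argv = ["generate_pedict_huff", "-i", infile, "-o", outfile]
    if full:
        argv.append("-full")
    if oldformat:
        argv.append("-oldformat")
    if records_per_pdb is not None:
        argv += ["-records_per_pdb", str(records_per_pdb)]
    if frequency is not None:
        argv += ["-frequency", frequency]
        if threshold is not None:
            argv += ["-threshold", str(threshold)]
    if keeppublic:
        argv.append("-keeppublic")
    return argv


cmd = build_command("edict", "pedict_resized.pdb", full=True, oldformat=True,
                    records_per_pdb=101, frequency="wordfreq", threshold=1200)
command_line = shlex.join(cmd)
print(command_line)
# To actually run it: subprocess.run(cmd, check=True)
```

The call shown rebuilds the second example command; passing the list to subprocess.run avoids shell quoting problems with unusual file names.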
Lars Grunewaldt <largegreenwood at users.sourceforge.net> |