GENERATE_PEDICT_HUFF

NAME
SYNOPSIS
DESCRIPTION
OPTIONS
EXAMPLES
AUTHOR

NAME

generate_pedict_huff − PAdict database converter

SYNOPSIS

generate_pedict_huff -i inputfile -o outputfile [-full]
[-frequency freqfile -threshold threshold [-keeppublic] ]
[-oldformat] [-records_per_pdb recordcount]

DESCRIPTION

generate_pedict_huff converts EDICT-like database files (EUC encoding) into files that can be used with PAdict (see padict.sf.net). You need a dictionary database such as "edict" itself; these can be found, for example, on Jim Breen's nihongo FTP server:

http://ftp.monash.edu.au/pub/nihongo/00INDEX.html

In "normal" mode, only words carrying the "public" marker (P) are added to the output file; the -full option converts the full dictionary instead. When you use "-oldformat", you are limited to 65536 records per file, so you should use the "old" edict, called edict_prev.gz, for a full conversion in that format. Without "-oldformat", the database generator has been updated to handle the full databases.
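
Before choosing between "normal" mode and -full, it can help to know how many entries an input file contains and how many of them are public. The following is a minimal sketch, not part of generate_pedict_huff: the helper name count_entries and the sample entries are made up for illustration, and it assumes that public entries carry a "/(P)/" field, as in the usual EDICT line layout.

```python
# Hypothetical helper, not part of generate_pedict_huff.
RECORD_LIMIT = 65536  # limit quoted above for "-oldformat" output


def count_entries(lines):
    """Return (total, public) for an iterable of decoded EDICT lines."""
    total = 0
    public = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        total += 1
        # Assumption: public entries contain a "/(P)/" field.
        if "/(P)/" in line:
            public += 1
    return total, public


# Made-up sample entries in EDICT-like form.
sample = [
    "猫 [ねこ] /(n) cat/(P)/",
    "犬 [いぬ] /(n) dog/(P)/",
    "猫舌 [ねこじた] /(n) dislike of very hot food or drink/",
]
total, public = count_entries(sample)
print(total, public, total <= RECORD_LIMIT)
```

For a real file, the lines could be read with open("edict", encoding="euc_jp"); a header line at the top of the file, if present, would be counted as an entry by this sketch.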

An alternative is to use a word frequency file to filter words by their frequency in spoken or written language. Such files can also be found on the nihongo FTP server; look for a file called wordfreq_ck.zip. The filtering is not very accurate, because the lists are based on the Mainichi Shinbun newspaper and therefore reflect written rather than spoken language. Do not use wordfreq.zip, because it has the wrong format (see below).

The expected format of a word frequency file is:

word[TAB]frequency

The pos number relates to the POS category in the "wordtype" file. Everything is in EUC encoding.
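
To check whether a frequency file matches this layout before passing it to -frequency, a small reader along the following lines may be useful. It is only a sketch under stated assumptions: parse_wordfreq is a hypothetical name, the sample data is made up, and everything before the last TAB on a line is treated as the word, whatever it contains.

```python
def parse_wordfreq(raw):
    """Parse EUC-JP bytes in word[TAB]frequency form into a dict."""
    freqs = {}
    for line in raw.decode("euc_jp").splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        # Everything before the last TAB is taken as the word.
        word, _, count = line.rpartition("\t")
        count = count.strip()
        if count.isdecimal():
            freqs[word] = int(count)
    return freqs


# Made-up sample data, encoded as EUC-JP like the real input files.
raw = "猫\t1500\n犬\t800\n".encode("euc_jp")
print(parse_wordfreq(raw))
```

Lines without a TAB or with a non-numeric count are silently skipped here; a stricter check could report them instead.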

OPTIONS

-i inputfile

name of the input file (EUC edict format)

-o outputfile

name of the output file (e.g. pedict.pdb)

-full

generate a full edict, not only public (P) words. Remember that when using -oldformat you cannot generate files containing more than 65535 records. PAdict versions earlier than 1.0.0 cannot read files generated without "-oldformat".

-oldformat

generate the pre-PAdict-1.0.0 format. You also need to set -records_per_pdb to 101 (see EXAMPLES)

-records_per_pdb recordcount

number of records per pdb database record. For the old format, this has to be 101.

-frequency freqfile

use a word frequency file and threshold to shrink the number of records

-threshold threshold

only needed when -frequency is used. Gives the threshold in absolute occurrences; words below this threshold are not added to the output file.

-keeppublic

optional if -frequency is used. This makes sure that all words that are marked as public, "(P)", are kept even if they don’t match the frequency threshold.
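
The interplay of -threshold and -keeppublic described above can be sketched as a single decision per word. This is an illustration of the documented behaviour, not the tool's actual code; keep_word is a hypothetical name, and how the tool treats words that are missing from the frequency file is not stated here, so the caller has to supply a frequency.

```python
def keep_word(frequency, threshold, is_public=False, keeppublic=False):
    """Decide whether a word would be kept under frequency filtering."""
    # With -keeppublic, words marked (P) are kept regardless of frequency.
    if keeppublic and is_public:
        return True
    # Words below the threshold are not added to the output file.
    return frequency >= threshold


print(keep_word(1500, 1200))                                   # frequent word
print(keep_word(800, 1200, is_public=True))                    # public, no -keeppublic
print(keep_word(800, 1200, is_public=True, keeppublic=True))   # public, -keeppublic
```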

EXAMPLES

generate a proper dictionary for PAdict 0.3.2

generate_pedict_huff -i edict -o pedict.pdb -oldformat -records_per_pdb 101

generate a full dictionary for PAdict 0.3.2 filtered against a frequency file

generate_pedict_huff -i edict -o pedict_resized.pdb -full -oldformat -records_per_pdb 101 -frequency wordfreq -threshold 1200
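
When the converter is driven from a script, the option dependencies listed in OPTIONS can be checked before the command is run. The sketch below builds an argument list from the documented options only; build_command is a hypothetical helper, and the option order simply follows the examples above (whether the tool cares about the order is an assumption not confirmed here).

```python
def build_command(inputfile, outputfile, full=False, oldformat=False,
                  records_per_pdb=None, frequency=None, threshold=None,
                  keeppublic=False):
    """Return an argv list for generate_pedict_huff, checking option rules."""
    if oldformat and records_per_pdb != 101:
        raise ValueError("-oldformat requires -records_per_pdb 101")
    if (frequency is None) != (threshold is None):
        raise ValueError("-frequency and -threshold must be given together")
    if keeppublic and frequency is None:
        raise ValueError("-keeppublic only applies together with -frequency")

    argv = ["generate_pedict_huff", "-i", inputfile, "-o", outputfile]
    if full:
        argv.append("-full")
    if oldformat:
        argv.append("-oldformat")
    if records_per_pdb is not None:
        argv += ["-records_per_pdb", str(records_per_pdb)]
    if frequency is not None:
        argv += ["-frequency", frequency, "-threshold", str(threshold)]
        if keeppublic:
            argv.append("-keeppublic")
    return argv


# Rebuilds the second example above as an argument list.
print(" ".join(build_command("edict", "pedict_resized.pdb", full=True,
                             oldformat=True, records_per_pdb=101,
                             frequency="wordfreq", threshold=1200)))
```

The returned list can be passed to subprocess.run once generate_pedict_huff is available on the PATH.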

AUTHOR

Lars Grunewaldt <largegreenwood at users.sourceforge.net>