generate_pedict_huff − PAdict database converter


generate_pedict_huff -i inputfile -o outputfile [-full]
[-frequency freqfile -threshold threshold [-keeppublic] ]
[-oldformat] [-records_per_pdb recordcount]


generate_pedict_huff converts EDICT-like database files (EUC encoding) into files that can be used with PAdict. You need a dictionary database such as "edict" itself; these can be found, for example, on Jim Breen's nihongo FTP server.

In "normal" mode, only words carrying the "public" marker (P) are added to the output file; the -full option converts the full dictionary instead. With "-oldformat", a file is limited to 65536 records, so for a full conversion in that format you should use the "old" edict called edict_prev.gz. Without "-oldformat", the database generator has been updated to handle the full databases.
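Before choosing between the normal and -full modes, it can help to get a rough idea of how many entries an input file holds. The following is a minimal sketch, not part of generate_pedict_huff: the function name count_edict_entries is hypothetical, and it assumes one dictionary entry per line with the literal text "(P)" marking public words. Header or comment lines are counted as well, so treat the numbers as estimates only.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with generate_pedict_huff):
# estimate the number of entries in an EDICT-style file.
# Assumption: one entry per line, public words contain the text "(P)".
# LC_ALL=C lets grep treat the EUC-encoded file as raw bytes.
count_edict_entries() {
    file=$1
    # An empty pattern matches every line, so -c yields the line count.
    total=$(LC_ALL=C grep -c '' "$file")
    # -F searches for the fixed string "(P)" rather than a regular expression.
    public=$(LC_ALL=C grep -c -F '(P)' "$file")
    echo "total=$total public=$public"
}
```

For example, `count_edict_entries edict` prints both counts on one line; if the total is well above the record limit mentioned above, the old format is unlikely to hold the full dictionary.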

An alternative is to use word frequency files to filter words by their frequency in spoken or written language. Such files can also be found on the nihongo FTP server. They are not that accurate, because the lists are based on the Mainichi Shinbun newspaper and therefore reflect written rather than spoken language. Make sure the file you pick has the expected format (see below).

The expected format of a word frequency file is:


The pos number refers to the POS category in the "wordtype" file. Everything is in EUC encoding.


-i inputfile

name of the input file (EUC edict format)

-o outputfile

name of the output file (e.g. pedict.pdb)


-full

generate a full edict, not only public (P) words. Remember that when using -oldformat you cannot generate files containing more than 65535 records. PAdict versions older than 1.0.0 cannot read files generated without "-oldformat".


-oldformat

generate the pre-PAdict-1.0.0 format. You also need to set -records_per_pdb to 101 (see the examples).

-records_per_pdb recordcount

number of records per PDB database record. For the old format, this has to be 101.

-frequency freqfile

use a word frequency file and threshold to shrink the number of records

-threshold threshold

only needed when -frequency is used. Gives the threshold in absolute occurrences; words below this threshold are not added to the output file.


-keeppublic

optional if -frequency is used. This makes sure that all words marked as public, "(P)", are kept even if they do not reach the frequency threshold.


generate a proper dictionary for PAdict 0.3.2

generate_pedict_huff -i edict -o pedict.pdb -oldformat -records_per_pdb 101

generate a full dictionary for PAdict 0.3.2 filtered against a frequency file

generate_pedict_huff -i edict -o pedict_resized.pdb -full -oldformat -records_per_pdb 101 -frequency wordfreq -threshold 1200
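
The examples above do not cover -keeppublic or the newer file format. The following is a minimal sketch of such a call, wrapped in a hypothetical shell function (convert_filtered and the DRY_RUN variable are illustrative names, not part of the tool). It only uses options listed in the synopsis; combining -full with -keeppublic and omitting -oldformat (so the output targets PAdict 1.0.0 or later) is an assumption about how the options interact.

```shell
#!/bin/sh
# Hypothetical wrapper around generate_pedict_huff.
# Arguments: input file, output file, frequency file, threshold.
# With DRY_RUN=1 the command line is printed instead of executed.
convert_filtered() {
    # Rebuild the positional parameters as the full command line.
    set -- generate_pedict_huff -i "$1" -o "$2" -full \
        -frequency "$3" -threshold "$4" -keeppublic
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Setting DRY_RUN=1 and calling `convert_filtered edict pedict.pdb wordfreq 1200` prints the command so it can be checked before a real run.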


Lars Grunewaldt <largegreenwood at>