* Developers Notes and Reference * >v.0.2.5< * Preamble * OK, I'll try to describe the internal structures, ideas, methods and, last but not least, the database structures of PAdict. I'll do my best, which might not be much, because this stuff is not that easy, and there are some things in PAdict that should be used differently, or are half-implemented and never used, ... I hope I can make things a little bit clearer for anyone who's coming (hopefully) after me. I'll try to describe things following a pattern, but I'm not sure if I can do it all the time. * Databases * First, the central dogma: databases. This application lives of databases, and there is definitly more than one. I will not explain in detail why PalmOS Databases are different than others, just know: only about 7000 records per database, each record smaller than 64k. To handle many database records, we have to stuff many of them in one PDB record. But everything startet with... ** The EDICT (Palm Database Name: Edict) ** *** Description: This is the Edict maintained by some guys at the Monash University. You can find more infos here: http://www.csse.monash.edu.au/~jwb/edict.html. The EDICT is a japanese-english database file, available in XML and "plain text". There are several character codings available (EUC, ...). The file has to be converted into a PalmOS-conform file format, which is described below. *** Records: Each record in the edict database consists of - a japanese word in kanji, if available - the same word, in kana - english meanings - some flags *** PalmOS Representation: Well, this is tough. The Edict database contains several parts: index data, compression information, and record data. Were to find things is stored in the first record of the database: typedef struct { UInt16 id; // database type UInt16 compressed; // whether this is compressed, and in which record the decompression header resides UInt16 kana_indexed; // whether this has a kana index, and in wich record it begins UInt16 dictionary_data; // first "real" dictionary pdb UInt16 hash_record; // if this has an additional reading index, and where it begins UInt16 aligner_stuff_blabla; // 32-bit-alignment on generator. has to be fixed... UInt32 hash_size; // size of the hash UInt16 kanji_indexed; // if this has an additional kanji index, and where it begins } dict_info; This structure represents the format of the first pdb record of Edict. It shows the contents of the Edict, each part is mentioned here. Where to start. Indicies. *** Kana/Kanji Index: This index is used to search with kana search strings. It's a tree based index, a little bit like a combination of a digital tree and a ternary tree. Kind of. The idea is this: each level of the tree represents one character in the search string, like in a digital tree. Like for "watashi": Level 0: w Level 1: a Level 2: t ... Each node in the tree stores the character it connects to (like, 'w'), a database record if this treepath is connected to a dictioniary record (walking down the tree from node to node one level each step leads to the original word 'watashi', and the node storing 'i' also holds the record index of watashi in the database). To search, we walk down the tree, comparing our next character in search string with all the siblings of our actual node. Have a look at this, I hope it makes things clearer: [structures_tree.png] One additional advantage of such a tree is that it compresses the data. We don't need to save the string "watashi" twice for "watashi" and "watashitachi", because they simply share the nodes in the tree! And we can totally remove the strings, if we introduce a parent node pointer to the tree. We can save the position of the *last* treenode in the database record, instead of the string, and trace the string back by walking the tree backwards. This saves storage. Great! Such a tree would use much memory (which we don't have), so we had to make it "simpler". Each node in the search tree needs: a list of all nodes that are children to itself, a pointer "upwards" to rebuild the strings, the character itself, and a list of records connected to this character string (there are some words which read the same, so we will need more than one link to database records). We (or Ciaran Keating) changed the tree structure like this. Each node knows: it's parent, it's first child, and it's "followers". While the first child resides one level down in the tree (it's the real "sibling"), the followers are the nodes on the same level like this node that are logically connected to the same parent. Now, it's not longer nessesary to store a list of siblings, giving as constant node sizes (which is always a good thing). Furthermore, we can use 16 bit integers to store those sibling id's, because we simply store the index into the array of nodes in the next level. Of course, 16 bit limits us to max. 65535 nodes per level. To store multiple results, we simply use more nodes with the same character. This is much simpler compared to lists, and does not consume much search time and space. Those duplets always follow directly in the chain of nodes, and only the *first* of them might have children leading to further results. To make it easy to find the levels in the pdb records, the number of nodes and the position of the first one is stored in an array. The complete pdb structure is to be found below ;) typedef struct { UInt16 character; UInt16 parent; // parent in upper level UInt16 child; // child in lower level UInt16 sibling; // sibling in this level UInt16 data; // the record ID assigned to this node } treenodes; As you might see, the data pointer is a UInt16. This means, the max record index is 65535. That's why padict can only handle up to 65535 records for now (version 0.2.5). Sparing 2 additional bytes (two would be good because of the word alignment...) on this record ID is not practical for small databases (small is, like, 19000 records), especially because this data field is not used on every treenode. There will be a "large database" flag later, which supports databases that need more than 64k nodes per level (which is likely for databases with more the 64k words, I tried that...) and more than 64k records. *** English Meanings: Mainly the dictionary is Japanese->English, just because of the edict, which is - japanese->English. I just wanted to implement at least a *possibility* to find a word by it's english meaning. It seemed to be nearly impossible to index the english meanings like the kanji/kana, because the strings are much longer, and have things like "to [be]" and so on. So I decided a simple hashing algorithm would be best here. Of course, not the whole strings are hashed, but each word found in the meanings by itself. So "to think it's rainy outside" would lead to hashes for "to","think","it","rainy","outside". Those hashes are stored in a simple hash table and that's it. The world could be so easy... Of course searches for english meanings aren't that accurate, but on the other hand the hash table is relativly small (compared to the kana/kanji indicies), and it's absolutly worth it. *** Compression: Memory capacity, again. The english meanings are huffman compressed. This seemed to be the best idea. First, we only need a relativly small amount of different characters. Second, it's possible to search a huffman compressed string for a word without decompression the searched string (I don't use that feature, but it is possible. Might come later). The huffman tree is stored in a pdb record specified in the header. *** EDICT flags: those flags are simply handled like english meanings, even if they could be stored seperatly. But this would have complicated things much. If anyone volunteers... *** Data Storage: as mentioned before, there are three important points in palm programming. They are space, the pdb structure, space again. Maybe four, an additional space. So a compact data format was needed, and the possibility to store more than one edict record in one pdb record. And of course, it must be relativly easy to locate any record. First, Records are identified by their ID. The ID is simply incremented from "0" when the database is builded. So the first record in the database will get the ID "0", then "1","2",[...]. There are *no* gaps! Each pdb record now takes a constant number of edict records (in 0.2.5, 101. Don't ask me why 101 and not 100 or 128... it's a coincidence. Maybe changed later. Should be marked in the header!! Right now the exact record count is stored in each pdb record at offset[0], as a UInt16). The pdb looks like this: Header: [0] UInt16 - Number of records in this pdb [2..204] UInt16[101] - Offset of each record in this pdb followed by the records: [+0] UInt8 - Level where the kana reading node is found in the index tree [+1] UInt16 - NodeID in this level [+3] WChar[n] - Zero-Terminated kanji string [+3+n+1] Char[m] - first meaning (huffman-compressed, first byte is compressed length) [...] Char[] - second meaning (if available) [...] [...] Char[] - last meaning [...] Next Record beginning with UInt8 - Level... The length of each record is calculated by the next records' offset and the records offset; if the record is the last one, the PalmOS database handler is asked for the size of this pdb. You see, there is no byte spared... (I hope :) ** The kanjifont (PalmDB Name: KanjiFnt) ** This is a 16x16 bitmap font for kanji display which I found on Jim Breens japanese page. This is really easy; each fontpage is put into one pdb record (resulting in 84 records). The font file is consolidated, meaning there are no gaps where are no characters (look into the JIS X 0208-1990 character set to find out what that means). In short, it's easy to look up characters because the JIS code can be directly calculated into an offset into the font file. Logically, the dictionary records must consist of JIS coded characters! (That is, EUC. The only difference is add/substrace 0x08080). ** The elisafont (PalmDB Name: ElisaFnt) ** Same as above, a consolidated JIS font, this time 8x8. Thanks to the people of the elisa project for their great work, and the possibility to include this font in this package. ** Pocketkanji Databases (strokedict, kanjidefinition) ** Sorry, I can't tell you much here, because the databases are simply those from the PocketKanji Project. Check the PocketKanji Web Site! ** Runtime Generated Preferences (PocketKanjiPrefDB) ** The program preferences. Not many for now, but there are some. Additionally, PocketKanji uses some records as temporary storage heap. For now the structure in record 0 is: typedef struct { UInt16 PrefVersion; DmOpenRef pAppDb; Boolean ResultDisplaySmall; // use small/elisa result display /* Declare new preference fields here. */ // this might seem odd, but we need to store a handler to the database anyway; this is the easiest way to do so! } Pref;