MiddleProxy

MiddleProxy - a Chinese Character defining browsing tool

MiddleProxy is a web proxy server that rewrites web pages so that each Chinese character on the page is annotated with a link to its definitions, and each original link is replaced with a '*' pointing to the original link target. The definitions page contains definitions from CEDICT for the character by itself and for compounds containing the character. If you have JavaScript turned on, hovering over a character on the original page will display its Mandarin pronunciation and its definition in isolation.
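To make the annotation step concrete, here is a minimal sketch in Python 3 (not the Python 2.4 that the tool itself requires). It is not MiddleProxy's actual code: it works on plain text rather than parsed HTML, it skips the '*' link rewriting, and the names annotate, CJK and DEFINITION_URL are made up for illustration. The character range covers only the basic CJK Unified Ideographs block, which is an assumption about what counts as a Chinese character.

```python
import re

# Hypothetical link template; it follows the chars?char=<hex> form used by
# the definition server, pointed at a local install on port 8002.
DEFINITION_URL = "http://localhost:8002/chars?char=%x"

# Assumption: only the basic CJK Unified Ideographs block (U+4E00..U+9FFF).
CJK = re.compile("[\u4e00-\u9fff]")


def annotate(text):
    """Wrap every CJK character in text in a link to its definition page."""
    def link(match):
        ch = match.group(0)
        return '<a href="%s">%s</a>' % (DEFINITION_URL % ord(ch), ch)
    return CJK.sub(link, text)
```

A real rewriting proxy has to leave tags, attributes and scripts alone, which is why the tool runs an HTML parser first; the sketch only shows the per-character substitution.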

If you don't want to install on your local machine, or you want to test the code before doing so, you can configure your HTTP proxy server to be asl2.dyndns.org:8001. (If there's a firewall, you'll need to make sure it also permits connections to asl2.dyndns.org:8002.) You will also need to make sure the proxy is bypassed for asl2.dyndns.org (on my Mozilla, this is the "No proxy for" configuration line). Using the hosted proxy has several disadvantages: it'll be quite slow (asl2.dyndns.org is a Pentium classic at the other end of a DSL line); HTTP POST and connections to ports other than 80 are disabled for security reasons; and your browsing history is less secure, since every page you request passes through that server.

If you want to install locally, you'll need Python 2.4. You'll also need to install Twisted (I've only tested with Twisted 1.3). Once you have the prerequisites, download middleproxy. After downloading and extracting, change into the middleproxy directory and run either python middleproxy.py or <path to twistd> -noy middleproxy.tac to run without backgrounding, or <path to twistd> -oy middleproxy.tac to run in the background. It'll parse some files, print that it's ready, and then listen on localhost on ports 8001 and 8002. Port 8001 is the HTTP proxy server; port 8002 is a web server for character definitions, etc. To use middleproxy, you'll need to configure your browser to use localhost:8001 as a proxy, and to bypass proxying for localhost. These ports can be reconfigured by editing globalvars.py.
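If you'd rather check a local install from a script than from a browser, something like the following sketch may help. It is written for Python 3 using the standard urllib.request module, and assumes the default localhost:8001 proxy port described above; the helper names proxy_settings, make_opener and fetch_through_proxy are made up for this example.

```python
import urllib.request

PROXY_HOST = "localhost"  # assumes a local install with default settings
PROXY_PORT = 8001         # the HTTP proxy port; 8002 serves definitions


def proxy_settings(host=PROXY_HOST, port=PROXY_PORT):
    """Return the proxy mapping urllib expects: plain HTTP only."""
    return {"http": "http://%s:%d" % (host, port)}


def make_opener(host=PROXY_HOST, port=PROXY_PORT):
    """Build an opener that sends HTTP requests through the proxy."""
    handler = urllib.request.ProxyHandler(proxy_settings(host, port))
    return urllib.request.build_opener(handler)


def fetch_through_proxy(url, timeout=30):
    """Fetch url via the proxy and return the rewritten page as bytes.

    Only works while middleproxy is actually running.
    """
    with make_opener().open(url, timeout=timeout) as response:
        return response.read()
```

Calling fetch_through_proxy on a page containing Chinese text should return the annotated version of that page. Requests for the definition server on port 8002 should go direct rather than through this opener, mirroring the "bypass proxying for localhost" browser setting.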

Individual character links are of the form http://asl2.dyndns.org:8002/chars?char=<unicode hex value>, e.g. http://asl2.dyndns.org:8002/chars?char=83dc for cai4, dish/vegetable. The displayed pages have live links, so you can see definitions for other characters occurring in compounds and lists of characters with the same pronunciation. (The pinyin is displayed with tone marks instead of numbers, which looks somewhat awkward in my browser. If it also looks bad for you, let me know.)
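The hex value in these links is just the character's Unicode code point. A small Python 3 sketch of going from a character to its link and back, assuming the URL form shown above (the function names char_url and char_from_url are made up for illustration):

```python
from urllib.parse import urlparse, parse_qs

BASE = "http://asl2.dyndns.org:8002/chars"  # or http://localhost:8002/chars


def char_url(ch, base=BASE):
    """Build the definition link for a single character."""
    return "%s?char=%x" % (base, ord(ch))


def char_from_url(url):
    """Recover the character from a definition link."""
    hex_value = parse_qs(urlparse(url).query)["char"][0]
    return chr(int(hex_value, 16))
```

For example, char_url("菜") gives the 83dc link quoted above, and char_from_url on that link gives back 菜.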

There are also:

All of these pages are linked to from the individual character pages.

The treatment of traditional vs. simplified characters may be a little confusing. In most cases, there's a one-to-one mapping, and the dictionary lookup will display the traditional equivalent. According to data extracted from the Unicode Consortium's database, 83 simplified characters map to more than one traditional character, and 6 of those 83 are also traditional characters in their own right. In all cases, the dictionary displays all traditional characters to which a given simplified character maps. (This is incorrect in those 6 cases if the writer was using traditional characters and didn't intend them to be read as simplified versions of the other characters: you'll just have to figure this out from context.)
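The lookup rule can be pictured as a table from each simplified character to a list of traditional characters. The Python 3 sketch below uses three hand-picked entries for illustration; it is not the tool's data or code, and I haven't checked whether 后 is one of the 6 characters counted above, though it is a character of that kind (simplified form of 後, and a traditional character itself).

```python
# Illustrative entries only; the real table is extracted from Unicode data.
TRADITIONAL_VARIANTS = {
    "\u95e8": ["\u9580"],            # 门 -> 門 (the usual one-to-one case)
    "\u53d1": ["\u767c", "\u9aee"],  # 发 -> 發, 髮 (one-to-many)
    "\u540e": ["\u540e", "\u5f8c"],  # 后 -> 后, 後 (also traditional itself)
}


def traditional_forms(ch):
    """Return every traditional character that ch may stand for.

    Characters with no entry are assumed to be their own traditional form.
    """
    return TRADITIONAL_VARIANTS.get(ch, [ch])
```

A dictionary page for 发 would then show entries for both 發 and 髮, and the reader picks the right one from context.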

Compatibility characters (which map to other characters, and are only included to preserve distinctions made in other character sets, so that information won't be lost when transcoding from those character sets to Unicode and back) are looked up as the base characters.
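One way to see the effect is with Python 3's standard unicodedata module, which maps CJK compatibility ideographs to their base characters during normalization. This is a stand-in technique for illustration: the tool itself takes the mapping from the Unicode Consortium's data files, not necessarily from normalization, and base_character is a made-up name.

```python
import unicodedata


def base_character(ch):
    """Map a compatibility ideograph to its base character via NFC.

    CJK compatibility ideographs such as U+F900 carry a canonical
    decomposition to a unified ideograph, so NFC replaces them.
    Ordinary ideographs come back unchanged.
    """
    return unicodedata.normalize("NFC", ch)
```

For example, base_character("\uf900") returns U+8C48 (豈), the unified ideograph that U+F900 duplicates.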

All traditional/simplified/compatibility character information comes from a database maintained by the Unicode Consortium. That database contains a lot of other information, including an English definition, indices into a wide assortment of printed CJK dictionaries, stroke count information, etc., some of which I hope to incorporate into future releases of this tool.
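For readers who want to extract the same kind of data themselves, here is a rough Python 3 sketch. It assumes the input is in the tab-separated "U+XXXX, field name, value" layout of the Unihan data files and that the field of interest is kTraditionalVariant; both are assumptions about the input, not a description of how middleproxy reads its data. The function name parse_variant_line is made up.

```python
def parse_variant_line(line, field_name="kTraditionalVariant"):
    """Parse one data line into (character, [variant characters]).

    Returns None for blank lines, comment lines, and other fields.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    code, field, value = line.split("\t", 2)
    if field != field_name:
        return None

    def to_char(token):
        # Tokens look like "U+767C"; drop any "<source" suffix if present.
        return chr(int(token.split("<")[0][2:], 16))

    return to_char(code), [to_char(token) for token in value.split()]
```

Counting the parsed entries whose variant list has more than one element is one way to check a figure like the 83 quoted earlier against whatever version of the data you download; the count may differ between releases.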

If you're interested in the CHANGELOG, it's here.

If you have any problems, please email me the specific URL that is causing trouble, and what you think is wrong.

I'd like to thank the authors of CEDICT, the Mandarin-English dictionary I use, and Michael Foord, author of the HTML parser I use.

If you liked this application, you will probably love Zhongwen, a much more comprehensive dictionary and character utility maintained by Rick Harbaugh.

If you're interested in other rewriting proxies, Ka-Ping Yee maintains a list.

The Subversion repository is at <svn://asl2.dyndns.org/middleproxy>.

The code is distributed under the MIT license.

Aaron Lav (http://www.pobox.com/~asl2 / mailto:asl2@pobox.com)