September 8, 2022
I recently rediscovered on my hard drive an electronic edition of the Oxford English Dictionary (henceforth OED), which I had been carrying around for years but had completely forgotten about. It includes the second edition of the OED, together with the three extra volumes published in the nineties, as well as some draft entries added at the beginning of this century. The electronic version itself is numbered 4.0 and dates from 2009 (ISBN 978-0-19-956383-8). On Linux, under wine, its interface looks like this:
To my knowledge, this is the latest electronic edition that you can purchase and install on your computer. It is now discontinued, and the dictionary is only available online, through a (quite expensive) subscription, as also happened to the Grand Robert de la langue française. Fortunately, you can still buy a second-hand CD-ROM of the version I am talking about, or even download the data from some more-or-less respectable sources on the internet.
Now, I have been using for years a homemade tool that queries many dictionaries at the same time and supports some useful features, and I wondered whether I could add the OED to the dictionaries my tool already supports. This requires extracting the dictionary’s data in a textual format that is amenable to further processing. How can we do that?
I first tried to use the software itself. You can copy a whole entry to the clipboard and paste it elsewhere, and it is straightforward to automate this process by simulating keyboard events with a program (see the sketch below). Unfortunately, the copy/paste method does not preserve formatting. I thus tried the print function, which can be configured to produce PDF files, and passed its output to an OCR tool, namely tesseract. The results are quite decent, but much work would still be needed to obtain satisfying output.
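A minimal sketch of the clipboard automation, using for instance the pyautogui and pyperclip packages and assuming that the usual Ctrl+A/Ctrl+C shortcuts work in the dictionary’s window, could look like this:

import time
import pyautogui, pyperclip

def copy_current_entry():
    pyautogui.hotkey("ctrl", "a")   # select the whole entry pane
    pyautogui.hotkey("ctrl", "c")   # copy the selection to the clipboard
    time.sleep(0.2)                 # give the application time to react
    return pyperclip.paste()

# Moving to the next entry would be one more simulated keystroke,
# e.g. pyautogui.press("down"), depending on how the interface reacts.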
Ideally, we would like to figure out where the dictionary’s data is stored, and how to extract it. Let us have a look at the installation directory:
app.n 205K
assets 22
bakdates.t 674K
bmk.ini 6
cpyrt.html 2.3K
dat.t 377K
dblclk.dat 4.5K
ed.t 1.2M
flashplayer.bin 3.6M
flash.png 2.1K
fte.t 183M
fts.t 230M
gc.dll 80K
goabout.html 316
gohelp.html 288
haxe.png 1.8K
Help 358
hw.t 1.2M
ky.t 771K
la.t 1.8M
lemky.t 2.9M
lem.t 6.1M
licence.html 7.2K
neko.dll 92K
oed.ini 7
OED.swf 450K
oed.t 188M
os.ndll 5.0K
oup.png 1.5K
pos.t 91K
pr.t 2.7M
qd.t 15M
regexp.ndll 124K
srtdates.t 916K
std.ndll 68K
swhx.ndll 80K
systools.ndll 84K
ui.ndll 4.5K
version.txt 7
xtra.ndll 44K
zlib.ndll 60K
All the files with a .t
extension are clearly data
files. This extension itself is meaningless and might have been chosen
as a low-effort way to obfuscate the data.
We expect the dictionary’s data to be stored within one of the larger files in this directory. Three of them are much larger than the others and are thus potential candidates: fte.t, fts.t and oed.t. Just looking at the names, it seems very likely that oed.t is the file we are looking for. fts.t probably stands for “full-text search” and might contain some kind of word index; fte.t probably stands for “full-text elements/entities” and might be used to query the structure of each entry.
Now, we also find in this directory a file zlib.ndll, which is obviously a shared object of zlib, the now-ubiquitous compression library. This tells us that the dictionary’s data has been compressed with it. zlib, however, can produce three different formats:

- Raw deflate data, devoid of any kind of metadata like file names, permissions, etc. deflate stores data in a series of blocks, the last of which is tagged to indicate that it ends the stream. A deflate decompressor can thus tell where a stream ends and will not attempt to decompress the data that follows, if any.
- The zlib format, which consists of a deflate stream, preceded by a 2-byte header that describes how the data is compressed, and followed by a 32-bit checksum.
- The gzip format, which also encapsulates a deflate stream, but has longer, more informative headers and trailers. It is the most appropriate format for compressing files.

Besides this, it is important to know that none of these formats supports random access: to examine the last byte of a compressed stream, you necessarily have to decompress the whole stream. Given the size of the OED and the fact that random access is required to display a given entry, we can reasonably expect the dictionary to be stored as a sequence of compressed blocks which are themselves indexed in some way.
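To make the distinction concrete, here is a small snippet, unrelated to the OED files themselves, that compresses the same arbitrary payload in the three formats and prints the first bytes of each:

import zlib, gzip

payload = b"some arbitrary data"

z = zlib.compress(payload)          # zlib format: 2-byte header + deflate + Adler-32
g = gzip.compress(payload)          # gzip format: 10-byte header + deflate + CRC-32 + size
c = zlib.compressobj(wbits=-15)     # raw deflate: no header, no trailer
d = c.compress(payload) + c.flush()

print(z[:2].hex())   # typically '789c', a zlib header
print(g[:2].hex())   # '1f8b', the gzip magic number
print(d[:2].hex())   # deflate data straight away, no recognizable magic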
Our goal is thus to locate these compressed blocks in
oed.t
and to extract them. The easiest way to do so is to
let zlib
do the work: starting at the beginning of the
file, we repeatedly try to decompress the data at the current offset; if
there is indeed a compressed block at the current offset, we print it,
move past the end of the block, and repeat the process; otherwise, we
retry at the next offset.
But then, we do not know which of the three formats zlib supports was used for compressing these blocks. I would personally have chosen the zlib format, so I tried this option first. In fact, the zlib library can try both the zlib format and the gzip format at the same time, since their headers differ and can thus be used to tell the two formats apart, so I used this option.
Here is the Python code I wrote for extracting the dictionary’s data. The +47 argument below tells zlib to try both the zlib format and the gzip format, and to use the largest possible decompression window (47 = 32 + 15: adding 32 enables automatic header detection, and 15 is the base-two logarithm of the maximum window size).
import sys, zlib

data = sys.stdin.buffer.read()
while data:
    d = zlib.decompressobj(+47)
    try:
        t = d.decompress(data)
    except zlib.error:
        # No compressed block at the current offset: retry one byte further.
        data = data[1:]
        continue
    sys.stdout.buffer.write(t)
    sys.stdout.flush()
    # Resume scanning right after the block we just decompressed.
    data = d.unused_data
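The script reads the compressed file on its standard input, so, assuming it is saved as extract.py (the file names here are mine), it can be run as python3 extract.py < oed.t > out.txt.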
This process is rather time-consuming, thus I ran it at various
points in the oed.t
file, to make it more likely to find a
compressed block. In the meantime, I decompressed smaller files in the
installation directory and examined them.
The two most useful ones are hw.t and ky.t, whose names stand for “headword” and “key” respectively. hw.t basically contains the list of words displayed on the left side of the window in the dictionary’s interface. It looks like this (I inserted new lines for readability):
0898 number, <i>n.</i>^
[...]
woved^
woven, <i>ppl. a.</i>^
woves^
wow, <i>n.{1}</i>^
wow, <i>v.{2}</i>^
wow, <i>int.</i>^
[...]
zyxt^
ky.t
contains the bare headwords, probably for string
matching purposes:
#0898number
[...]
#woved
#woven
#woves
#wow
#wow
#wow
[...]
#zyxt
We could probably derive these files from the dictionary’s entries, but it is simpler to just reuse them.
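As an aside, assuming the '#' character only ever appears in ky.t as the headword prefix shown above, counting the entries takes a single line of Python (the file name is mine):

with open("ky.decompressed", "rb") as f:
    print(f.read().count(b"#"))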
Looking at these files, we can tell that this version of the OED contains 297,958 entries. Now, there are two compressed files in the installation directory that, when decompressed, weigh exactly 595,916 bytes, that is, exactly two bytes per entry, which is rather suspicious. They are pos.t and dat.t. I expected at least one of them to hold a list of 16-bit pointers to the compressed dictionary entries, but it does not seem like they do. However, there is clearly some bit packing going on: we find many powers of two, as well as numbers that look like sets of flags, such as 4097 = (1 << 0) | (1 << 12).
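One way to eyeball these values is to reinterpret the decompressed bytes as 16-bit integers, for instance along these lines (the little-endian byte order and the file name are guesses of mine):

import struct, collections

with open("pos.decompressed", "rb") as f:
    raw = f.read()

# Reinterpret the file as a flat sequence of 16-bit integers and look at
# which values occur most often; swap "<H" for ">H" to try big-endian.
values = [v for (v,) in struct.iter_unpack("<H", raw)]
print(collections.Counter(values).most_common(20))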
After about an hour of processing, my decompression code still had not found a single zlib or gzip compressed block in oed.t. I decided to run it again but, this time, instructed the zlib library to expect raw deflate data, by replacing the +47 of the code above with -15. Here is the updated code:
import sys, zlib

data = sys.stdin.buffer.read()
while data:
    d = zlib.decompressobj(-15)
    try:
        t = d.decompress(data)
    except zlib.error:
        data = data[1:]
        continue
    sys.stdout.buffer.write(t)
    sys.stdout.flush()
    data = d.unused_data
With the updated code, it did not take long to locate compressed blocks. At offset 2, ta-da! We extract a 600-KiB block of text like this:
#<e><upd>Draft entry March 2004</upd><br><br><kg><hg text="new"><bf><hw>0898 number</hw>, <ps>n.</ps></bf> <tf><i>Brit.</i></tf></hg></kg><br><br><lg><pg><i>Brit.</i> &fslash;<ph>&smm;&schwa;&shtu; e&shti;t &sm;n&revv;in #e&shti;t</ph> [... more text ...] <spg>Forms: 19&en; <b>101</b>, 19&en; <b>one-oh-one</b></spg> [... more text ...] <hg text="oed"><bf><hw>abere</hw></bf></hg></kg><br><br><sg><def br="no">obs. form of <a href='event:x289'><xr><x>abear</x> <ps>v.</ps></xr></a></def></sg></e>#
We are clearly dealing with some legacy SGML here. Most of the tag and entity names are non-standard or do not have the meaning we would expect in XML or HTML. This is the case, for instance, of the <b> tag, which marks bold text in HTML but is here used for tagging alternate spellings of a headword, while bold text is indicated with the <bf> tag. Likewise, we find many non-standard entities (about a thousand), such as &schwa;.
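Converting these entities will eventually require a hand-built mapping table. As a purely illustrative taste of what that might look like, here is a sketch that handles just two of them (the meaning of &fslash; is a guess based on its name):

import re

# Tiny, purely illustrative subset of a full entity table.
ENTITIES = {
    "schwa": "\u0259",   # ə
    "fslash": "/",       # presumably a forward slash, as in /phonetic/ notation
}

def expand_entities(text):
    return re.sub(r"&([A-Za-z]+);",
                  lambda m: ENTITIES.get(m.group(1), m.group(0)),
                  text)

print(expand_entities("&fslash;<ph>&schwa;</ph>&fslash;"))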
My decompression code managed to extract about 80 MiB of text, but then got stuck and started to output binary garbage. This is to be expected, since the deflate format, unlike the two other ones, does not include additional data geared towards error detection. To prevent such errors, I added a few extra tests to the decompression code: for a decompressed block of data to be considered valid, I required that it contain only ASCII characters, that it be at least two bytes long, and that it both start and end with a # character (see the SGML code above). Here is the updated version:
import sys, zlib

data = sys.stdin.buffer.read()
while data:
    d = zlib.decompressobj(-15)
    try:
        t = d.decompress(data).decode("ascii")
    except (zlib.error, UnicodeDecodeError):
        t = ""
    if len(t) < 2 or not t.startswith('#') \
            or not t.endswith('#'):
        data = data[1:]
        continue
    sys.stdout.write(t)
    sys.stdout.flush()
    data = d.unused_data
This new code managed to correctly decompress the whole dictionary. I made sure that the entry count matched the one obtained from the other files in the same directory, and I also checked the data between compressed blocks to determine whether something important had been lost in the decompression process. It turns out that each compressed block is preceded by a two-byte header and followed by an Adler-32 checksum, just as in the zlib format, except that the headers we have here are not valid zlib headers. This probably results from an attempt to obfuscate the data so that programs like file(1) reveal no useful information.
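Both observations are easy to check programmatically. Here is a sketch of the two tests, assuming the two bytes before a block and the four bytes after it have already been sliced out, and that the checksum is stored big-endian as in the zlib format:

import zlib

def looks_like_zlib_header(header):
    # A valid zlib header has compression method 8 (deflate) in the low
    # nibble of its first byte and, read as a big-endian 16-bit integer,
    # is a multiple of 31.
    cmf, flg = header
    return (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0

def trailer_matches(block, trailer):
    # The zlib format stores the Adler-32 of the uncompressed data big-endian.
    return zlib.adler32(block) == int.from_bytes(trailer, "big")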
Fully decompressed, the dictionary weighs about 600 MiB, which is really impressive given that it is just ASCII text with rather terse markup. The people who typed and corrected the text (about 200 in total, according to Wikipedia) deserve much credit for their work, and not only the authors.
Now that we have the full text of the dictionary, it remains to determine what the tag and entity names mean, so that we can extract the information we want and convert the text to a more modern format suitable for display, like HTML. I am not yet done with that, so I will write about it at a later time. Meanwhile, let me point out two blatant mistakes in the encoding of the dictionary. In the first one, an opening <dn> tag is missing, as is a closing one:
1, 1 / 2</dn>, <nu>1</nu><dn>4</dn>, <nu>1</nu><dn>8</dn>, <nu>1</nu><dn>16
In the second one, what looks like a cross-reference number is enclosed within an <xcn> tag that appears nowhere else in the dictionary:
<xr><xcn>00016130</xcn><x>Babism</x></xr>
This should probably be replaced with:
<a href='event:x16130'><xr><x>Babism</x></xr></a>
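Such oddities need not be spotted by eye: counting how often each tag occurs in the decompressed text immediately singles out tags like <xcn>. A possible sketch (the file name is mine):

import re, collections

with open("oed.decompressed", encoding="ascii") as f:
    # Collect every opening tag name, attributes included.
    tags = re.findall(r"<([a-z]+)[ >]", f.read())

# Print the ten rarest tags and their occurrence counts.
for tag, count in collections.Counter(tags).most_common()[-10:]:
    print(tag, count)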
See here for the follow-up post.