    Unraveling the Digital Edition of the Oxford English Dictionary (I)

    September 8, 2022

    I recently rediscovered on my hard drive an electronic edition of the Oxford English Dictionary (henceforth OED), which I had been carrying around for years but had completely forgotten about. It includes the second edition of the OED, together with the three extra volumes published in the nineties, as well as some draft entries added at the beginning of this century. The electronic version itself is numbered 4.0 and dates from 2009 (ISBN 978-0-19-956383-8). On Linux, under wine, its interface looks like this:

    Interface of the OED

    To my knowledge, this is the latest electronic edition that you can possibly purchase and install on your computer. It is now discontinued, and the dictionary is only available online, through a (quite expensive) subscription, as also happened to the Grand Robert de la langue française. Fortunately, you can still buy a second-hand CD-ROM of the version I am talking about, or even download the data from some more-or-less respectable sources on the internet.

    Now, I have been using for years a homemade tool that queries many dictionaries at the same time and supports some useful features, and I wondered whether I could add the OED to the dictionaries my tool already supports. This requires extracting the dictionary’s data in a textual format that is amenable to further processing. How can we do that?

    I first tried to use the software itself. Indeed, you can copy a whole entry to the clipboard and paste it elsewhere, and it is straightforward to automate this process by simulating keyboard events with a program. Unfortunately, the copy/paste method does not preserve formatting. I thus tried to use the print function, which can be configured to produce PDF files, and to pass its output to an OCR tool, namely tesseract. The results are quite decent, but much work would still be necessary to obtain a satisfying output.

    Ideally, we would like to figure out where the dictionary’s data is stored, and how to extract it. Let us have a look at the installation directory:

              app.n 205K
             assets 22
         bakdates.t 674K
            bmk.ini 6
         cpyrt.html 2.3K
              dat.t 377K
         dblclk.dat 4.5K
               ed.t 1.2M
    flashplayer.bin 3.6M
          flash.png 2.1K
              fte.t 183M
              fts.t 230M
             gc.dll 80K
       goabout.html 316
        gohelp.html 288
           haxe.png 1.8K
               Help 358
               hw.t 1.2M
               ky.t 771K
               la.t 1.8M
            lemky.t 2.9M
              lem.t 6.1M
       licence.html 7.2K
           neko.dll 92K
            oed.ini 7
            OED.swf 450K
              oed.t 188M
            os.ndll 5.0K
            oup.png 1.5K
              pos.t 91K
               pr.t 2.7M
               qd.t 15M
        regexp.ndll 124K
         srtdates.t 916K
           std.ndll 68K
          swhx.ndll 80K
      systools.ndll 84K
            ui.ndll 4.5K
        version.txt 7
          xtra.ndll 44K
          zlib.ndll 60K

    All the files with a .t extension are clearly data files. This extension itself is meaningless and might have been chosen as a low-effort way to obfuscate the data.

    We expect the dictionary’s data to be stored within one of the larger files in this directory. Three of them are much larger than the others and are thus potential candidates: fte.t, fts.t and oed.t. Just looking at the names, it seems very likely that oed.t is the file we are looking for. fts.t probably stands for “full-text search” and might contain some kind of word index; fte.t probably stands for “full-text elements/entities” and might be used to query the structure of each entry.

    Now, we also find in this directory a file zlib.ndll, which is obviously a shared object containing zlib, the now-ubiquitous compression library. This suggests that the dictionary’s data has been compressed with it. zlib, however, can produce three different formats:

    1. The most basic one is deflate. It is raw compressed data, devoid of any kind of metadata like file names, permissions, etc. deflate stores data in a series of blocks, the last one of which is tagged to indicate that it ends the stream. A deflate decompressor can thus tell where a stream ends and will not attempt to decompress the data that follows, if any.
    2. Then comes the so-called zlib format, which consists of a deflate stream, preceded by a 2-byte header that describes how the data is compressed, and followed by a 32-bit checksum.
    3. Finally, there is the gzip format, which also encapsulates a deflate stream, but has longer, more informative, headers and trailers. It is the most appropriate format for compressing files. The short sketch after this list illustrates the three formats.
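
    To make the distinction concrete, here is a minimal sketch that compresses the same payload in each of the three formats with Python’s zlib module (which we will also use below) and prints the first bytes of each result:

    import zlib

    data = b"Oxford English Dictionary"

    def compress(payload, wbits):
        c = zlib.compressobj(9, zlib.DEFLATED, wbits)
        return c.compress(payload) + c.flush()

    raw = compress(data, -15)  # 1. raw deflate: no header, no checksum
    zs  = compress(data, 15)   # 2. zlib: 2-byte header, Adler-32 trailer
    gz  = compress(data, 31)   # 3. gzip: 10-byte header, CRC-32 trailer

    print(raw[:2].hex())  # arbitrary compressed bits
    print(zs[:2].hex())   # 78da: a recognizable zlib header (level 9)
    print(gz[:2].hex())   # 1f8b: the gzip magic number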

    Besides this, it is important to know that none of the above formats support random access. To examine the last byte in a compressed stream, you necessarily have to decompress the whole stream. Given the size of the OED and the fact that random access is required to display a given entry, we can reasonably expect the dictionary to be stored as a sequence of compressed blocks which are themselves indexed in some way.

    Our goal is thus to locate these compressed blocks in oed.t and to extract them. The easiest way to do so is to let zlib do the work: starting at the beginning of the file, we repeatedly try to decompress the data at the current offset; if there is indeed a compressed block at the current offset, we print it, move past the end of the block, and repeat the process; otherwise, we retry at the next offset.

    But then, we do not know which of the three formats was used for compressing these blocks. Left to my own devices, I would have chosen the zlib format, so I tried this option first. Conveniently, the zlib library supports trying both the zlib format and the gzip format at the same time, since their headers differ and thus suffice to identify the format of the stream; this is the mode I used.

    Here is the Python code I wrote for extracting the dictionary’s data. The +47 argument below tells zlib to try both the zlib format and the gzip format, and to use the largest possible decompression window.

    import sys, zlib

    # Read the whole file, then scan it byte by byte: whenever zlib
    # accepts the data at the current offset, emit the decompressed
    # block and resume just past it; otherwise slide forward one byte.
    data = sys.stdin.buffer.read()
    while data:
        d = zlib.decompressobj(+47)  # auto-detect zlib/gzip headers
        try:
            t = d.decompress(data)
        except zlib.error:
            data = data[1:]  # no valid stream here; try the next offset
            continue
        sys.stdout.buffer.write(t)
        sys.stdout.flush()
        data = d.unused_data  # bytes left over after the stream ended
    
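    Assuming the script is saved as extract.py (a name of my choosing), it can be run over the data file like so:

    python3 extract.py < oed.t > oed.dump

    To start scanning from some other offset, one can feed it a suffix of the file instead, e.g. with tail -c +100000000 oed.t.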

    This process is rather time-consuming, so I ran it starting at various points in the oed.t file, to make it more likely to find a compressed block quickly. In the meantime, I decompressed smaller files in the installation directory and examined them.

    The two most useful ones are hw.t and ky.t, whose names stand respectively for “headword” and “key”. hw.t basically contains the list of words displayed on the left side of the window in the dictionary’s interface. It looks like this (I inserted new lines for readability):

    0898 number, <i>n.</i>^
    [...]
    woved^
    woven, <i>ppl. a.</i>^
    woves^
    wow, <i>n.{1}</i>^
    wow, <i>v.{2}</i>^
    wow, <i>int.</i>^
    [...]
    zyxt^

    ky.t contains the bare headwords, probably for string matching purposes:

    #0898number
    [...]
    #woved
    #woven
    #woves
    #wow
    #wow
    #wow
    [...]
    #zyxt

    We could probably derive these files from the dictionary’s entries, but it is simpler to just reuse them.
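
    As a quick sanity check, here is a minimal sketch that counts the entries in the decompressed hw.t, assuming, as the excerpt above suggests, that entries are separated by the ^ character (the file name is mine):

    # Count the headword entries in a decompressed hw.t.
    with open("hw.txt", encoding="ascii") as f:
        entries = [e for e in f.read().split("^") if e.strip()]
    print(len(entries))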

    Looking at these files, we can tell that this version of the OED contains 297,958 entries. Now, there are two compressed files in the installation directory that, when decompressed, weigh exactly 595,916 bytes, i.e. two bytes per entry, which is rather suspicious. They are pos.t and dat.t. I expected at least one of them to hold a list of 16-bit pointers to the compressed dictionary’s entries, but it does not seem like they do. However, there is clearly some bit packing going on: we find many powers of two, as well as numbers that look like sets of flags, such as 4097 = (1 << 0) | (1 << 12).
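
    A sketch of this kind of exploration, reading one of the decompressed files as a sequence of 16-bit integers; the little-endian byte order and the file name are guesses:

    import collections, struct

    with open("dat.bin", "rb") as f:
        raw = f.read()
    values = struct.unpack("<%dH" % (len(raw) // 2), raw)

    # Show the most frequent values, to spot powers of two and flag-like
    # combinations such as 4097 = (1 << 0) | (1 << 12).
    for value, count in collections.Counter(values).most_common(20):
        print(value, count)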

    After about an hour of processing, my decompression code had still not found a single zlib or gzip compressed block in oed.t. I decided to run it again but, this time, instructed the zlib library to expect raw deflate data, by replacing the +47 of the code above with -15. Here is the updated code:

    import sys, zlib

    data = sys.stdin.buffer.read()
    while data:
        d = zlib.decompressobj(-15)  # expect raw deflate, 32 KiB window
        try:
            t = d.decompress(data)
        except zlib.error:
            data = data[1:]
            continue
        sys.stdout.buffer.write(t)
        sys.stdout.flush()
        data = d.unused_data
    

    With the updated code, it did not take long to locate compressed blocks. At offset 2, ta-da! We extract a 600-KiB block of text like this:

    #<e><upd>Draft entry March 2004</upd><br><br><kg><hg text="new"><bf><hw>0898 number</hw>, <ps>n.</ps></bf> <tf><i>Brit.</i></tf></hg></kg><br><br><lg><pg><i>Brit.</i> &fslash;<ph>&smm;&schwa;&shtu; e&shti;t &sm;n&revv;in #e&shti;t</ph> [... more text ...] <spg>Forms: 19&en; <b>101</b>, 19&en; <b>one-oh-one</b></spg> [... more text ...] <hg text="oed"><bf><hw>abere</hw></bf></hg></kg><br><br><sg><def br="no">obs. form of <a href='event:x289'><xr><x>abear</x> <ps>v.</ps></xr></a></def></sg></e>#

    We are clearly dealing with some legacy SGML here. Most of the tag and entity names are non-standard or do not have the meaning we would expect in XML or HTML. This is the case, for instance, with the <b> tag, which marks bold text in HTML but is here used for tagging alternate spellings of a headword, while bold text is indicated with the <bf> tag. Likewise, we find many non-standard entities, about a thousand of them, such as &schwa;.
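
    Converting these entities will eventually require a full mapping table. As a taste of what that involves, here is a minimal sketch; the two mappings in it are my own guesses from the IPA-looking context above, not the official table:

    import re

    # Hypothetical entity table; the real dictionary uses about a
    # thousand non-standard entities.
    ENTITIES = {
        "schwa": "\u0259",  # ə (guessed)
        "revv": "\u028c",   # ʌ, “reversed v” (guessed)
    }

    def expand(text):
        return re.sub(r"&(\w+);",
                      lambda m: ENTITIES.get(m.group(1), m.group(0)),
                      text)

    print(expand("&smm;&schwa;"))  # unknown entities are left as-is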

    My decompression code managed to extract about 80 MiB of text, but then got stuck and started to output binary garbage. This is to be expected, since the deflate format, unlike the other two, does not include additional data geared towards error detection. To prevent such errors, I added a few extra tests to the decompression code: for a decompressed block of data to be valid, I assumed that it must only contain ASCII characters, that it must be at least 2 bytes long, and that it must both start and end with a # character (see the SGML code above). Here is the updated version:

    import sys, zlib

    data = sys.stdin.buffer.read()
    while data:
        d = zlib.decompressobj(-15)
        try:
            # Valid blocks must decode as pure ASCII...
            t = d.decompress(data).decode("ascii")
        except (zlib.error, UnicodeDecodeError):
            t = ""
        # ...and must start and end with a '#' character.
        if len(t) < 2 or not t.startswith('#') or not t.endswith('#'):
            data = data[1:]
            continue
        sys.stdout.write(t)
        sys.stdout.flush()
        data = d.unused_data
    

    This new code managed to correctly decompress the whole dictionary. I made sure that the entry count matched the one obtained from other files in the same directory, and I also checked the data between compressed blocks to determine whether something important was lost in the decompression process. It turns out that each compressed block is preceded by a two-byte header and followed by an Adler-32 checksum, just as in the zlib format, except that the headers we have here are not valid zlib headers. This probably results from an attempt to obfuscate the data so that programs like file(1) reveal no useful information.
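
    This layout can be checked mechanically. Here is a minimal sketch that decompresses one block and compares the four bytes that follow it with the Adler-32 checksum of the decompressed text; the big-endian byte order is an assumption, borrowed from the zlib format in line with the observation above:

    import zlib

    with open("oed.t", "rb") as f:
        data = f.read()

    pos = 2  # the first deflate block was found at offset 2
    d = zlib.decompressobj(-15)
    text = d.decompress(data[pos:])
    end = len(data) - len(d.unused_data)  # first byte past the stream
    stored = int.from_bytes(data[end:end + 4], "big")
    assert stored == zlib.adler32(text)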

    Fully decompressed, the dictionary weighs about 600 MiB, which is really impressive, given that it is just ASCII text with rather terse markup. The people who typed and corrected the text, about 200 in total according to Wikipedia, deserve much credit for their work, not only the authors.

    Now that we have the full text of the dictionary, it remains to determine what the tag and entity names mean, so that we can extract the information we want and convert the text to a more modern format suitable for display, like HTML. I am not yet done with that, so I will write about it at a later time. Meanwhile, let me point out two blatant mistakes in the encoding of the dictionary. In the first one, an opening <dn> tag is missing, as is a closing one:

    1,  1 / 2</dn>, <nu>1</nu><dn>4</dn>, <nu>1</nu><dn>8</dn>, <nu>1</nu><dn>16

    In the second one, what looks like a cross-reference number is enclosed within an <xcn> tag that appears nowhere else:

    <xr><xcn>00016130</xcn><x>Babism</x></xr>

    This should probably be replaced with:

    <a href='event:x16130'><xr><x>Babism</x></xr></a>
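
    Should one want to repair this mechanically, a one-off substitution along these lines would do; the helper name and the assumption that the event number is simply the <xcn> number with its leading zeros stripped are mine:

    import re

    def fix_xcn(sgml):
        # Rewrite <xr><xcn>0...N</xcn>...</xr> into the usual
        # <a href='event:xN'> cross-reference form.
        return re.sub(r"<xr><xcn>0*(\d+)</xcn>(.*?)</xr>",
                      r"<a href='event:x\1'><xr>\2</xr></a>",
                      sgml)

    print(fix_xcn("<xr><xcn>00016130</xcn><x>Babism</x></xr>"))
    # <a href='event:x16130'><xr><x>Babism</x></xr></a>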

    See here for the follow-up post.