    Unraveling the Digital Edition of the Oxford English Dictionary (II)

    October 24, 2022

    Since my previous post, I have mostly made sense of the OED encoding and written a script that converts it to an HTML representation. The code and auxiliary data (excluding the dictionary contents, of course) are available here. (See below for instructions.)

    The most laborious task by far was mapping the ad hoc SGML entities used throughout the dictionary to the appropriate symbols. I have not been able to find all correspondences, because I do not have a complete printed edition of the dictionary at my disposal, and only a few volumes of the OED are available online. I found the following so far:

    The digital version of the OED is unhelpful here, because it lacks mappings for many entities and uses the entity names themselves as placeholders instead.

    Whenever possible, I mapped entities to Unicode characters or sequences of characters. For symbols that cannot be represented as text, I used SVG images, which I borrowed from Wikipedia or created myself. I did not attempt to deal with math or chemical formulas.
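    As an illustration of this kind of mapping, here is a minimal Python sketch. The entity names, the table contents, and the function name are hypothetical examples; they are not taken from the OED data or from oed.py. Unknown entities are left untouched, so they remain visible as placeholders.

```python
import re

# Hypothetical entity table: these names are illustrative only and are
# not taken from the OED data.
ENTITY_MAP = {
    "aacute": "\u00e1",   # a single Unicode character
    "amacr": "a\u0304",   # a sequence: base letter + combining macron
    "dagger": "\u2020",
}

# Hypothetical symbols with no text equivalent, rendered as SVG images.
SVG_ENTITIES = {"sampleglyph": "svg/sampleglyph.svg"}

# Matches SGML-style entity references such as "&aacute;".
ENTITY_RE = re.compile(r"&([A-Za-z0-9.]+);")


def replace_entities(text):
    """Replace known SGML entities; leave unknown ones unchanged."""
    def substitute(match):
        name = match.group(1)
        if name in ENTITY_MAP:
            return ENTITY_MAP[name]
        if name in SVG_ENTITIES:
            return '<img class="glyph" src="%s" alt="%s"/>' % (
                SVG_ENTITIES[name], name)
        return match.group(0)  # unknown entity: keep the placeholder
    return ENTITY_RE.sub(substitute, text)
```

    Keeping unknown entities verbatim makes it easy to spot the ones that still lack a mapping when browsing the generated pages.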

    Building the Dictionary

    The script I wrote parses the OED data, converts it to HTML, and stores the result in a SQLite database. To run it, you need GNU make, python3, and the Beautiful Soup library. The last of these can be installed with:

    pip3 install beautifulsoup4

    Now, download the archive I prepared and unpack it. Move to the corresponding directory and run:

    make oed_dir=/path/to/oed

    Here, /path/to/oed is the installation directory of the OED on your computer. This directory should contain a file named oed.t, among others.

    Three other arguments can optionally be given to make:

    On my computer, the creation of the database takes about an hour to complete, so you might have to be patient.

    Consulting the Dictionary

    If everything went well, there should now be a new file named oed.sqlite in the current directory. To consult this database, you first need to run the following:

    make run

    This starts a barebones HTTP server, which I wrote for testing purposes. You can now visit the page localhost:8888 with your web browser. It includes a search form that supports glob-like patterns with the ? and * metacharacters, as well as links to the OED’s bibliography and to its tables of abbreviations. Here is a sample page of the dictionary:

    [Image: Sample Page of the Dictionary]

    The formatting of dictionary entries is very minimal. I tried to reproduce the typographic conventions of the printed edition, except for citations, which I separated into paragraphs, whereas the printed edition runs them together. You can change the appearance of the interface by editing the file oed.css. It defines various CSS classes, the meaning of which is documented in the parsing script, oed.py.

    Data Organization

    Let me finish with a few remarks that might be useful if you want to integrate the dictionary within another application.

    The database file contains three tables, as follows:

    create table dictionary(
        entry_id integer primary key,
        search_key text not null,
        headword text not null,
        contents text not null
    );
    create table lemmas(
        lemma text not null,
        entry_id integer not null
    );
    create table extra_pages(
        path text primary key,
        html text not null
    );

    The dictionary table holds dictionary entries; search_key is an ASCII normalization of the entry heading, in plain text; headword and contents are both encoded as HTML. The lemmas table is unused for now; it enumerates the compounds, idiomatic expressions, etc., given under dictionary entries. Finally, the extra_pages table holds the contents of help pages and of the bibliography.
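    To make this layout more concrete, the following Python sketch queries the dictionary table with the standard sqlite3 module. It relies on SQLite's built-in GLOB operator, which understands the ? and * metacharacters (and is case-sensitive); whether the bundled server implements its search this way is an assumption. The normalize_key function is likewise hypothetical: it only guesses at what "ASCII normalization" involves (stripping diacritics and lowercasing), and the rows are made up for the demonstration.

```python
import sqlite3
import unicodedata


def normalize_key(text):
    """Hypothetical ASCII normalization: strip diacritics, lowercase.

    The actual rules used to build search_key may differ.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


def search(conn, pattern, limit=50):
    """Return (entry_id, headword) rows whose search_key matches a glob."""
    query = (
        "select entry_id, headword from dictionary "
        "where search_key glob ? order by search_key limit ?"
    )
    return conn.execute(query, (normalize_key(pattern), limit)).fetchall()


def demo():
    # In-memory stand-in for oed.sqlite, filled with made-up rows.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table dictionary("
        "entry_id integer primary key, search_key text not null, "
        "headword text not null, contents text not null)"
    )
    rows = [
        (1, "cafe", "<b>café</b>", "<p>...</p>"),
        (2, "cat", "<b>cat</b>", "<p>...</p>"),
        (3, "dog", "<b>dog</b>", "<p>...</p>"),
    ]
    conn.executemany("insert into dictionary values (?, ?, ?, ?)", rows)
    return search(conn, "Ca*")
```

    To try this against the real database, replace the in-memory connection with sqlite3.connect("oed.sqlite") and call search directly. Note that GLOB also treats square brackets as character classes, which may or may not be desirable in a search form.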

    SVG images, which are all located in the svg directory, are not stored within the database. You will thus need to copy this directory to the directory you chose for hosting the OED on your server.