October 24, 2022
Since my previous post, I have mostly made sense of the OED encoding and written a script to convert it to an HTML representation. The code and auxiliary data, excluding the dictionary contents of course, are available here. (See below for instructions.)
The most laborious task by far was to map the ad hoc SGML entities used throughout the dictionary to the appropriate symbols. I have not been able to find all correspondences, because I do not have at my disposal a complete printed edition of the dictionary. Only a few volumes of the OED are available online. I found the following so far:
The digital version of the OED is unhelpful here: it lacks mappings for many entities and displays the entity names as placeholders instead.
Whenever possible, I mapped entities to Unicode characters or sequences of characters. For symbols that cannot be represented as text, I used SVG images, which I borrowed from Wikipedia or created myself. I did not attempt to deal with math or chemical formulas.
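To give an idea of what this mapping step involves, here is a minimal sketch, not the code from the actual script. The entity names in the table are hypothetical examples chosen for illustration, and the function name replace_entities is my own. Unknown entities are left in place and collected, so that they can be looked up later:

```python
import re

# Hypothetical sample of entity names (assumption). The real OED entity set
# is much larger and may use different names.
ENTITY_MAP = {
    "eacu": "\u00e9",    # e with acute accent
    "aemac": "\u01e3",   # ae ligature with macron
    "ygh": "\u021d",     # yogh
    "emac": "e\u0304",   # e followed by a combining macron (a sequence)
}

# Matches SGML-style entity references such as &eacu;
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9.]*);")


def replace_entities(text, mapping=ENTITY_MAP):
    """Replace known entities; keep unknown ones as-is and report their names."""
    unknown = set()

    def substitute(match):
        name = match.group(1)
        if name in mapping:
            return mapping[name]
        unknown.add(name)
        return match.group(0)  # leave the placeholder visible

    return ENTITY_RE.sub(substitute, text), unknown


converted, missing = replace_entities("caf&eacu; &zzz;")
print(converted)        # the known entity is replaced, &zzz; is kept
print(sorted(missing))  # names that still need a mapping
```

Collecting the unmapped names rather than failing on them makes it possible to convert the whole dictionary first and fill in the table gradually.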
The script I wrote basically parses the OED data, converts it to HTML and stores the result in a SQLite database. To run it, you need to install GNU make, python3, and finally the Beautiful Soup library. The latter can be installed with:
pip3 install beautifulsoup4
Now, download the archive I prepared and unpack it. Move to the corresponding directory and run:
make oed_dir=/path/to/oed
Where /path/to/oed is the location of the installation directory of the OED on your computer. This directory should contain a file named oed.t, among others.
Three other arguments can optionally be given to make:

oed_base (default: /). Set this to the absolute path you want the OED to be located at on your server. Hyperlinks and cross-references will be made to point to subdirectories of the chosen value.

headings_start (default: 2). Set this to the top-level heading number you want the generated HTML to use within help pages and within the bibliography. The purpose of this option is to make it possible to use higher-level headings in your own HTML code. Headings will be numbered starting from this number. With headings_start=3, for instance, the generated headings will be <h3>, <h4>, <h5>. Valid values are >= 1 and <= 4.

entry_headings_start (default: 2). Like headings_start, but for dictionary headings.

On my computer, the creation of the database takes about an hour to complete, so you might have to be patient.
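The arithmetic behind headings_start and entry_headings_start can be sketched as follows. This is an illustration only: the function name heading_tag is hypothetical, and I am assuming three nesting levels, which is what the h3/h4/h5 example and the upper bound of 4 suggest (4 + 2 gives <h6>, the last heading level HTML defines).

```python
def heading_tag(depth, headings_start=2):
    """Return the tag name for a heading `depth` levels below the top one.

    depth is 0 for the top-level heading. The limit of three levels is an
    assumption inferred from the documented range of headings_start.
    """
    if not 1 <= headings_start <= 4:
        raise ValueError("headings_start must be >= 1 and <= 4")
    if not 0 <= depth <= 2:
        raise ValueError("depth must be 0, 1 or 2")
    return f"h{headings_start + depth}"


# With headings_start=3, the three levels map to h3, h4 and h5.
print([heading_tag(depth, headings_start=3) for depth in range(3)])
```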
If everything went well, there should now be a new file named oed.sqlite in the current directory. To consult this database, you first need to run the following:
make run
This starts a barebones HTTP server, which I wrote for testing purposes. You can now visit the page localhost:8888 with your web browser. It includes a search form that supports glob-like patterns with the ? and * metacharacters, as well as links to the OED’s bibliography and to abbreviations tables. Here is a sample page of the dictionary:
The formatting of dictionary entries is deliberately minimal. I tried to reproduce the typographic conventions of the printed edition, except for citations: I put each one in its own paragraph, whereas they are run together in the printed edition. You can change the appearance of the interface by editing the file oed.css. It defines various CSS classes, the meaning of which is documented in the parsing script, oed.py.
Let me finish with a few remarks that might be useful if you want to integrate the dictionary within another application.
The database file contains three tables, as follows:
create table dictionary(
    entry_id integer primary key,
    search_key text not null,
    headword text not null,
    contents text not null
);
create table lemmas(
    lemma text not null,
    entry_id integer not null
);
create table extra_pages(
    path text primary key,
    html text not null
);
The dictionary table holds dictionary entries; search_key is an ASCII normalization of the entry heading, in plain text; headword and contents are both encoded as HTML. The lemmas table is unused for now; it enumerates the compounds, idiomatic expressions, etc., given under dictionary entries. Finally, the extra_pages table holds the contents of help pages and of the bibliography.
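If you want to query the dictionary table from your own code, one possible approach is SQLite's built-in GLOB operator, which understands the same ? and * metacharacters as the search form (it is case-sensitive and also accepts [...] character classes). The sketch below is not taken from my test server. The ascii_key function is only an assumed approximation of the normalization applied to search_key (in particular, the lowercasing is a guess), and the rows are made up so that the example runs against an in-memory database:

```python
import sqlite3
import unicodedata


def ascii_key(text):
    """Fold text to lowercase ASCII (assumed; oed.py may normalize differently)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


def search(conn, pattern, limit=50):
    """Return (entry_id, headword) pairs whose search_key matches a glob pattern."""
    rows = conn.execute(
        "select entry_id, headword from dictionary "
        "where search_key glob ? order by search_key limit ?",
        (ascii_key(pattern), limit),
    )
    return rows.fetchall()


# Demonstration with the dictionary table's schema and made-up entries.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table dictionary("
    "entry_id integer primary key, search_key text not null, "
    "headword text not null, contents text not null)"
)
conn.executemany(
    "insert into dictionary values (?, ?, ?, ?)",
    [
        (1, "cafe", "<b>caf\u00e9</b>", "<p>made-up entry</p>"),
        (2, "cage", "<b>cage</b>", "<p>made-up entry</p>"),
        (3, "cat", "<b>cat</b>", "<p>made-up entry</p>"),
    ],
)

print(search(conn, "ca?e"))      # matches the keys "cafe" and "cage"
print(search(conn, "Caf\u00e9"))  # the query is folded to "cafe" before matching
```

To use this with the real data, replace the in-memory connection with sqlite3.connect("oed.sqlite"). The pattern is passed as a bound parameter, so user input never ends up in the SQL text itself.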
SVG images, which are all located in the svg directory, are not stored within the database. You will thus need to copy this directory to the directory you chose for hosting the OED on your server.