March 10, 2022
A while ago, I wrote a small library and command-line tool for detecting the natural language of written texts. My main purpose was to be able to recognize transliterated Sanskrit texts, which isn’t possible with any language detection library out there.
You can download the source code here.
The library is available in source form, as an amalgamation. Compile the file sabir.c together with your source code, and use the interface described in sabir.h. A C11 compiler is required for compilation, which means either GCC or Clang on Unix. You'll also need to link the compiled code to utf8proc.
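For instance, assuming utf8proc is installed in a standard location and links with -lutf8proc, building a hypothetical program myprog.c against the amalgamation could look like this (exact flags and paths will depend on your setup):
$ cc -std=c11 -o myprog myprog.c sabir.c -lutf8proc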
Two command-line tools are also included: a classification program, sabir, and a script for creating and evaluating classification models, sabir-train. Finally, a model file model.sb is included. It recognizes English, French, German, Italian, and transliterated Sanskrit. I trained it on a few megabytes of Wikipedia dumps and on electronic texts from the Gretil project.
To compile and install all the above:
$ make && sudo make install
There is a practical usage example in the file example.c. Compile this file with make, and use the produced binary like so:
$ ./example <<< 'hello world!'
en
Full details are given in sabir.h. There is also a manual page for the command-line tool.
Provided you have enough data, it is possible (and even desirable, to achieve better accuracy) to train a specialized model. The script sabir-train can be used for that purpose. You should first create one file per language, named after the language. Then, to get an idea of the accuracy of your future classifier, invoke sabir-train eval at the command line with the names of your files as arguments. Here we use the testing corpus files distributed together with the source code:
$ sabir-train eval test/data/*
macro-precision: 99.078
macro-recall: 99.076
macro-F1: 99.077
If you’re satisfied with the results, you can then create a model file with the following:
$ sabir-train dump test/data/* > my_model
The resulting model can then be used with the C API, or with the command-line classifier, e.g.:
$ sabir --model=my_model <<< 'hello world!'
en
The approach I used is similar to that of Google's libcld2, though simpler. Conceptually, we first preprocess the text to remove non-alphabetic code points. Each remaining letter sequence is then padded on the left and on the right with 0xff bytes, which cannot appear in valid UTF-8 strings. If the resulting sequence is long enough, byte quadgrams are extracted from it and fed to a multinomial Naive Bayes classifier. The string “Ô, café!”, for instance, is turned into the following quadgrams, shown in hexadecimal notation:
ff c3 94 ff
ff 63 61 66
63 61 66 c3
61 66 c3 a9
66 c3 a9 ff
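To make the preprocessing concrete, here is a minimal sketch (not the library's actual code) of how quadgrams can be extracted from a single letter sequence once non-alphabetic code points have been removed. Running it on the two sequences of the example above, “Ô” and “café”, reproduces the five quadgrams listed:

/* Illustrative sketch only: extract byte quadgrams from one letter
 * sequence, padded on both sides with 0xff. */
#include <stdio.h>
#include <string.h>

static void print_quadgrams(const char *letters)
{
    unsigned char buf[256];
    size_t len = strlen(letters);

    if (len + 2 < 4 || len + 2 > sizeof buf)
        return;                       /* too short, or too long for this demo */

    buf[0] = 0xff;                    /* left padding: never valid in UTF-8 */
    memcpy(buf + 1, letters, len);
    buf[len + 1] = 0xff;              /* right padding */

    for (size_t i = 0; i + 4 <= len + 2; i++)
        printf("%02x %02x %02x %02x\n",
               buf[i], buf[i + 1], buf[i + 2], buf[i + 3]);
}

int main(void)
{
    print_quadgrams("\xc3\x94");      /* "Ô"    -> 1 quadgram  */
    print_quadgrams("caf\xc3\xa9");   /* "café" -> 4 quadgrams */
    return 0;
}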
I've made two simplifying assumptions concerning the Naive Bayes classifier: priors are treated as if they were uniform (which is of course unlikely to be the case in practice), and the length of a document is assumed to be constant. Refinements are certainly possible, but the resulting classifier is already good enough for my purpose.
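Under those two assumptions, classification reduces to summing, for each language, the log-probabilities of the quadgrams observed in the document, and picking the language with the largest sum. The sketch below illustrates that decision rule; the model layout and names used here are hypothetical and do not reflect the library's internals:

/* Decision rule under uniform priors and fixed document length:
 * score(language) = sum over quadgrams of count * log P(quadgram | language);
 * the predicted language is the argmax. Hypothetical model layout. */
#include <math.h>
#include <stddef.h>

struct toy_model {
    size_t nlangs;             /* number of languages                 */
    size_t nfeats;             /* number of known quadgrams           */
    const double *logprob;     /* nlangs * nfeats log-probabilities   */
};

/* counts[f] = occurrences of quadgram f in the document */
static size_t classify(const struct toy_model *m, const unsigned *counts)
{
    size_t best = 0;
    double best_score = -HUGE_VAL;

    for (size_t l = 0; l < m->nlangs; l++) {
        double score = 0.0;            /* uniform priors: no prior term */
        for (size_t f = 0; f < m->nfeats; f++)
            score += counts[f] * m->logprob[l * m->nfeats + f];
        if (score > best_score) {
            best_score = score;
            best = l;
        }
    }
    return best;                       /* index of the most likely language */
}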
I initially implemented the approach described in Cavnar and Trenkle, 1994, before writing this library. This method seems to be the most popular out there, probably because it is very simple. However, it requires creating a map of n-grams at classification time, which means that a lot of memory allocations and deallocations are necessary to classify every single document. By contrast, my current implementation only allocates a single chunk of memory at startup, for loading the model.
For a comparison of the existing approaches to language classification, see Baldwin and Lui, 2010.