March 10, 2022
A while ago, I wrote a small library and command-line tool for detecting the natural language of written texts. My main purpose was to be able to recognize transliterated Sanskrit texts, which isn’t possible with any language detection library out there.
You can download the source code here.
The library is available in source form, as an amalgamation. Compile the file sabir.c together with your source code, and use the interface described in sabir.h. A C11 compiler is required for compilation, which means either GCC or Clang on Unix. You'll also need to link the compiled code to utf8proc.
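For instance, assuming utf8proc is installed in a standard location and links with -lutf8proc, building a hypothetical program myprog.c against the amalgamation could look like this (exact flags and paths will depend on your setup):
$ cc -std=c11 -o myprog myprog.c sabir.c -lutf8proc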
Two command-line tools are also included: a classification program, sabir, and a script for creating and evaluating classification models, sabir-train. Finally, a model file model.sb is included. It recognizes English, French, German, Italian, and transliterated Sanskrit. I trained it on a few megabytes of Wikipedia dumps and on electronic texts from the Gretil project.
To compile and install all the above:
$ make && sudo make install
There is a practical usage example in the file example.c. Compile this file with make, and use the produced binary like so:
$ ./example <<< 'hello world!'
en
Full details are given in sabir.h. There is also a manual page for the command-line tool.
Provided you have enough data, it is possible (and even desirable, to achieve better accuracy) to train a specialized model. The script sabir-train can be used for that purpose. You should first create one file per language, named after the language. Then, to get an idea of the accuracy of your future classifier, invoke sabir-train eval at the command line with the names of your files as arguments. Here we use the testing corpus files distributed together with the source code:
$ sabir-train eval test/data/*
macro-precision: 99.078
macro-recall: 99.076
macro-F1: 99.077
If you’re satisfied with the results, you can then create a model file with the following:
$ sabir-train dump test/data/* > my_model
The resulting model can then be used with the C API, or with the command-line classifier, e.g.:
$ sabir --model=my_model <<< 'hello world!'
en
The approach I used is similar to that of Google's libcld2, though simpler. Conceptually, we first preprocess the text to remove non-alphabetic code points. Each remaining letter sequence is then padded on the left and on the right with 0xff bytes, which cannot appear in valid UTF-8 strings. If the resulting sequence is long enough, byte quadgrams are extracted from it and fed to a multinomial Naive Bayes classifier. The string “Ô, café!”, for instance, is turned into the following quadgrams, shown in hexadecimal notation:
ff c3 94 ff
ff 63 61 66
63 61 66 c3
61 66 c3 a9
66 c3 a9 ff
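To make the preprocessing concrete, here is a minimal sketch (not the library's actual code) of how quadgrams can be extracted from a single letter sequence once non-alphabetic code points have been removed. Running it on the two sequences of the example above, “Ô” and “café”, reproduces the five quadgrams listed:

/* Illustrative sketch only: extract byte quadgrams from one letter
 * sequence, padded on both sides with 0xff. */
#include <stdio.h>
#include <string.h>

static void print_quadgrams(const char *letters)
{
    unsigned char buf[256];
    size_t len = strlen(letters);

    if (len + 2 < 4 || len + 2 > sizeof buf)
        return;                       /* too short, or too long for this demo */

    buf[0] = 0xff;                    /* left padding: never valid in UTF-8 */
    memcpy(buf + 1, letters, len);
    buf[len + 1] = 0xff;              /* right padding */

    for (size_t i = 0; i + 4 <= len + 2; i++)
        printf("%02x %02x %02x %02x\n",
               buf[i], buf[i + 1], buf[i + 2], buf[i + 3]);
}

int main(void)
{
    print_quadgrams("\xc3\x94");      /* "Ô"    -> 1 quadgram  */
    print_quadgrams("caf\xc3\xa9");   /* "café" -> 4 quadgrams */
    return 0;
}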
I've made two simplifying assumptions concerning the Naive Bayes classifier: priors are treated as if they were uniform (which is of course unlikely to be the case in practice), and the length of a document is assumed to be constant. Refinements are certainly possible, but the resulting classifier is already good enough for my purpose.
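Under those two assumptions, classification reduces to summing, for each language, the log-probabilities of the quadgrams observed in the document, and picking the language with the largest sum. The sketch below illustrates that decision rule; the model layout and names used here are hypothetical and do not reflect the library's internals:

/* Decision rule under uniform priors and fixed document length:
 * score(language) = sum over quadgrams of count * log P(quadgram | language);
 * the predicted language is the argmax. Hypothetical model layout. */
#include <math.h>
#include <stddef.h>

struct toy_model {
    size_t nlangs;             /* number of languages                 */
    size_t nfeats;             /* number of known quadgrams           */
    const double *logprob;     /* nlangs * nfeats log-probabilities   */
};

/* counts[f] = occurrences of quadgram f in the document */
static size_t classify(const struct toy_model *m, const unsigned *counts)
{
    size_t best = 0;
    double best_score = -HUGE_VAL;

    for (size_t l = 0; l < m->nlangs; l++) {
        double score = 0.0;            /* uniform priors: no prior term */
        for (size_t f = 0; f < m->nfeats; f++)
            score += counts[f] * m->logprob[l * m->nfeats + f];
        if (score > best_score) {
            best_score = score;
            best = l;
        }
    }
    return best;                       /* index of the most likely language */
}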
I initially implemented the approach described in Cavnar and Trenkle, 1994, before writing this library. This method seems to be the most popular out there, probably because it is very simple. However, it requires creating a map of n-grams at classification time, which means that a lot of memory allocations and deallocations are necessary to classify every single document. By contrast, my current implementation only allocates a single chunk of memory at startup, for loading the model.
For a comparison of the existing approaches to language classification, see Baldwin and Lui, 2010.