Creating flexible highlight rules in Ace that change according to editor mode

I’ve only recently started using the excellent Ace editor in projects, but I’m really enjoying it so far. It has very flexible and well-designed custom highlighting rulesets that developers can extend to support different syntaxes. However, it does not currently support in-editor spell-checking (I believe this functionality is planned for a release soon, hopefully!).

One of the situations I’m using the editor in is a site which has input in many different languages and character sets. As part of this I wanted to ensure that the language a given article is claimed to be written in matches the character set used in the various content sections of the editor. For example, if the article is in Thai but you are using Latin script characters, it should highlight them as errors. A big issue we have is that there is a lot of Cyrillic content, and because a number of characters render the same or very similarly in Cyrillic and Latin (for example Р and P), some users input mostly Cyrillic with the occasional Latin character. This wouldn’t be a problem if we were just rendering the text, but we are also unidecoding it for searching: Latin P goes to p, but the Cyrillic character Р transliterates to r (as it is pronounced). This throws off the searching.
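To make the lookalike problem concrete, here is a minimal sketch (not code from the site itself). It shows that the two letters are different code points, and it flags basic Latin letters hiding in text that is expected to be Cyrillic. The helper names `codePointHex` and `findLatinInCyrillic` are made up for illustration, and the check is deliberately crude: it only looks for A-Z and a-z.

```javascript
// Latin P and Cyrillic Р look alike but are different code points.
const latinP = 'P';          // U+0050
const cyrillicEr = '\u0420'; // U+0420, renders as Р

// Format a character's code point as U+XXXX.
function codePointHex(ch) {
    return 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
}

// Hypothetical helper: report basic Latin letters found in text that is
// expected to be Cyrillic, with their positions.
function findLatinInCyrillic(text) {
    const hits = [];
    const latin = /[A-Za-z]/g;
    let m;
    while ((m = latin.exec(text)) !== null) {
        hits.push({ index: m.index, char: m[0], codePoint: codePointHex(m[0]) });
    }
    return hits;
}
```

Calling `findLatinInCyrillic` on a word typed with a Latin P in place of the Cyrillic Р returns a single hit at index 0, which is exactly the kind of character the editor should mark as an error.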

I wrote a script which parses the CLDR’s exemplarCharacters data to get the expected character set for a language, then adds a few characters by hand, as the CLDR is unfortunately not totally complete: it has gaps for languages that use extended Cyrillic sets such as Karakalpak, lacks Traditional Mongolian entirely, and is incomplete for some ideographic languages such as Chinese. It then adds some general punctuation and other characters and generates a JavaScript regexp that matches characters which should not appear. For example, for Armenian the regexp is /[^\u0020-\u0022\u0025-\u0029\u002b-\u003b\u003f\u005b-\u005d\u00ab\u00b4\u00b8\u00bb\u0531-\u0556\u055a-\u055f\u0561-\u0587\u058a\u2030]/.
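The generation step can be sketched roughly like this. It is not the actual script: the helper names `buildInvalidCharRegex` and `invalidChars` are hypothetical, the input is assumed to be a list of allowed code point ranges already extracted from the CLDR data, and only code points up to U+FFFF are handled.

```javascript
// Hypothetical helper: turn a list of allowed [from, to] code point ranges
// into a negated character class. Only handles code points <= 0xFFFF.
function buildInvalidCharRegex(ranges) {
    const esc = (cp) => '\\u' + cp.toString(16).padStart(4, '0');
    const body = ranges
        .map(([from, to]) => (from === to ? esc(from) : esc(from) + '-' + esc(to)))
        .join('');
    return new RegExp('[^' + body + ']');
}

// A subset of the Armenian ranges from the regexp quoted above.
const armenian = buildInvalidCharRegex([
    [0x0020, 0x0022],
    [0x0531, 0x0556],
    [0x0561, 0x0587],
]);

// Return every character in the text that the regex rejects.
function invalidChars(text, regex) {
    return Array.from(text).filter((ch) => regex.test(ch));
}
```

With this subset, `invalidChars` applied to Armenian letters followed by a space and a Latin "a" returns just the "a", since the space and the Armenian letters fall inside the allowed ranges.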

So far so good, but how to integrate this with the Ace editor? Usually you only have a mode for Ace which specifies the language (e.g. PHP, HTML), but I don’t want to create a new mode for each language/script that this app wants to support.

Initially I tried basing some code on this code which adds spell checking outside of Ace. However, there are a number of limitations with this approach. Namely, it doesn’t integrate with the existing highlighting system, so if you are meant to be editing, say, an HTML document containing only Thai characters, it doesn’t know which parts are HTML and which parts should be checked without redoing the whole highlighting run a second time. It also needs to reprocess the entire document on every change.

Digging around in the Ace source I found that you can actually pass an object into the setMode function, which lets you pass extra parameters such as the invalid-characters regexp for the current language. However, the highlight rules are usually defined statically, which makes them complex to update on the fly, especially once they have been normalized in something based on the Text Highlight Rules. My solution is as follows:

define('ace/mode/my_highlight_rules', ... {
    // Factory: returns a new highlight rules constructor bound to the given regex
    var MyHighlightRules = function(regex) {
        var rules = function () {
            this.$rules = {
                start: [ ... ],
                words: [ ... something with regex ... ],
                ...
            };
            this.normalizeRules();
        };

        rules.metaData = { ... };

        oop.inherits(rules, TextHighlightRules);
        return rules;
    };

    exports.MyHighlightRules = MyHighlightRules;
});

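define("ace/mode/my_mode", ... {
    // opts is the object that was passed to setMode
    var Mode = function (opts) {
        // Build the rules constructor for this particular regex
        this.HighlightRules = MyHighlightRules(opts.regex);
        ...
    };
});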

You can then just call .setMode({ path: 'ace/mode/my_mode', regex: /.../ }), changing the regex for each language or character set that you wish to validate.
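One way to organise the switching is a small lookup that builds the setMode argument per language. This is only a sketch: the table `INVALID_CHARS` and the function `modeOptionsFor` are hypothetical names, and the patterns are abbreviated for illustration rather than being the full generated regexps.

```javascript
// Hypothetical lookup table: language code -> regexp matching characters
// that should NOT appear. Patterns are shortened for illustration.
const INVALID_CHARS = {
    hy: /[^\u0020-\u0022\u0531-\u0556\u0561-\u0587]/, // Armenian (subset)
    th: /[^\u0020-\u0022\u0e01-\u0e3a\u0e40-\u0e5b]/, // Thai (subset)
};

// Build the object handed to setMode for a given language code.
function modeOptionsFor(lang) {
    const regex = INVALID_CHARS[lang];
    if (!regex) {
        throw new Error('No character set known for language: ' + lang);
    }
    return { path: 'ace/mode/my_mode', regex: regex };
}

// In the page, assuming `editor` is an existing Ace editor instance:
// editor.getSession().setMode(modeOptionsFor('th'));
```

Throwing on an unknown language is a design choice for the sketch; falling back to a plain mode without character checking would work just as well.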