GitHub - Stepets/utf8.lua: pure-lua 5.3 regex library
pure-lua 5.3 regex library for Lua 5.3, Lua 5.1, LuaJIT
This library provides simple way to add UTF-8 support into your application.
Example:
local utf8 = require('.utf8'):init() for k,v in pairs(utf8) do string[k] = v end local str = "пыщпыщ ололоо я водитель нло" print(str:find("(.л.+)н")) -- 8 26 ололоо я водитель print(str:gsub("ло+", "보라")) -- пыщпыщ о보라보라 я водитель н보라 3 print(str:match("^п[лопыщ ]*я")) -- пыщпыщ ололоо я
Usage:
This library can be used as drop-in replacement for vanilla string library. It exports all vanilla functions under raw sub-object.
local utf8 = require('.utf8'):init() local str = "пыщпыщ ололоо я водитель нло" utf8.gsub(str, "ло+", "보라") -- пыщпыщ о보라보라 я водитель н보라 3 utf8.raw.gsub(str, "ло+", "보라") -- пыщпыщ о보라보라о я водитель н보라 3
It also provides all functions from Lua 5.3 UTF-8 module except utf8.len (s [, i [, j]]). If you need to validate your strings use utf8.validate(str, byte_pos) or iterate over with utf8.validator.
Please note that library assumes regexes are valid UTF-8 strings, if you need to manipulate individual bytes use vanilla functions under utf8.raw.
Installation:
Download repository to your project folder. (no rockspecs yet)
Examples assume library placed under utf8 subfolder not utf8.lua.
As of Lua 5.3 default utf8 module has precedence over user-provided. In this case you can specify full module path (.utf8).
Configuration:
Library is highly modular. You can provide your implementation for almost any function used. Library already has several back-ends:
- Runtime character class processing using hardcoded codepoint ranges or using native functions through
ffi. - Basic functions for working with UTF-8 characters have specializations for
ffi-enabled runtime and for tarantool.
Probably most interesting customizations are utf8.config.loadstring and utf8.config.cache if you want to precompile your regexes.
local utf8 = require('.utf8') utf8.config = { cache = my_smart_cache, } utf8:init()
For lower and upper functions to work in environments where ffi cannot be used, you can specify substitution tables (data example)
local utf8 = require('.utf8') utf8.config = { conversion = { uc_lc = utf8_uc_lc, lc_uc = utf8_lc_uc }, } utf8:init()
Customization is done before initialization. If you want, you can change configuration after init, it might work for everything but modules. All of them should be reloaded.
Documentation:
Issue reporting:
Please provide example script that causes error together with environment description and debug output. Debug output can be obtained like:
local utf8 = require('.utf8') utf8.config = { debug = utf8:require("util").debug } utf8:init() -- your code
Default logger used is io.write and can be changed by specifying logger = my_logger in configuration