Configuring Rules
Actions
Rules are added with the .add method on the Tokenizer. On their own, patterns are not very interesting; an action can also be attached to each rule via the add method:
import { Tokenizer, Rule, TapeInterface as Tape, Token } from "tlex";
const tokenizer = new Tokenizer()
  .add(/\d+/, (rule: Rule, tape: Tape, token: Token, owner: any) => {
    console.log("Found Token: ", token);
    // Convert the matched digit string into a number.
    token.value = parseInt(token.value);
    return token;
  })
  .add(/\w+/, (rule: Rule, tape: Tape, token: Token, owner: any) => {
    console.log("Found a word: ", token);
    token.value = token.value.toUpperCase();
    return token;
  })
  .add(/\s+/, (rule: Rule, tape: Tape, token: Token, owner: any) => {
    console.log("Found a space: ", token);
    // Returning null drops the token from the output.
    return null;
  });
const tokens = tokenizer.tokenize("123  hello  world");
Running the tokenizer above executes the actions, printing:
Found Token: Token {
  tag: null,
  matchIndex: 0,
  start: 0,
  end: 3,
  id: 0,
  value: '123',
  groups: {},
  positions: {}
}
Found a space: Token {
  tag: null,
  matchIndex: 2,
  start: 3,
  end: 5,
  id: 1,
  value: '  ',
  groups: {},
  positions: {}
}
Found a word: Token {
  tag: null,
  matchIndex: 1,
  start: 5,
  end: 10,
  id: 2,
  value: 'hello',
  groups: {},
  positions: {}
}
Found a space: Token {
  tag: null,
  matchIndex: 2,
  start: 10,
  end: 12,
  id: 3,
  value: '  ',
  groups: {},
  positions: {}
}
Found a word: Token {
  tag: null,
  matchIndex: 1,
  start: 12,
  end: 17,
  id: 4,
  value: 'world',
  groups: {},
  positions: {}
}
In this example, each match handler is a RuleMatchHandler. These handlers can be used to:
- Modify token values
- Filter out unwanted tokens
- Change the semantics of tokens (e.g. convert an integer string into an integer value)
- Change tokenizer states
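Concretely, the handlers above suggest the following shape for a RuleMatchHandler (an approximation for illustration; consult the tlex typings for the authoritative signature):

type RuleMatchHandler = (rule: Rule, tape: Tape, token: Token, owner: any) => Token | null;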
Since the actions have already been applied during tokenization, we can print the resulting tokens:
console.log("All Tokens: ", tokens);
producing:
All Tokens: [
  Token {
    tag: null,
    matchIndex: 0,
    start: 0,
    end: 3,
    id: 0,
    value: 123,
    groups: {},
    positions: {}
  },
  Token {
    tag: null,
    matchIndex: 1,
    start: 5,
    end: 10,
    id: 2,
    value: 'HELLO',
    groups: {},
    positions: {}
  },
  Token {
    tag: null,
    matchIndex: 1,
    start: 12,
    end: 17,
    id: 4,
    value: 'WORLD',
    groups: {},
    positions: {}
  }
]
Skipping Tokens
A handler that explicitly returns null skips the matched token, as seen above. Alternatively, the "skip" parameter can be used:
import { Tokenizer, Rule, TapeInterface as Tape, Token } from "tlex";
const tokenizer = new Tokenizer()
  .add(/\d+/, (rule: Rule, tape: Tape, token: Token, owner: any) => {
    token.value = parseInt(token.value);
    return token;
  })
  .add(/\w+/, (rule: Rule, tape: Tape, token: Token, owner: any) => {
    token.value = token.value.toUpperCase();
    return token;
  })
  .add(/\s+/, { skip: true }); // Skip whitespace without writing a handler.
console.log(tokenizer.tokenize("123  hello  world"));
resulting in:
[
  Token {
    tag: null,
    matchIndex: 0,
    start: 0,
    end: 3,
    id: 0,
    value: 123,
    groups: {},
    positions: {}
  },
  Token {
    tag: null,
    matchIndex: 1,
    start: 5,
    end: 10,
    id: 2,
    value: 'HELLO',
    groups: {},
    positions: {}
  },
  Token {
    tag: null,
    matchIndex: 1,
    start: 12,
    end: 17,
    id: 4,
    value: 'WORLD',
    groups: {},
    positions: {}
  }
]
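In other words, the {skip: true} rule above behaves like the explicit whitespace handler from the first example, which returned null for every match:

// Equivalent in effect to {skip: true}:
.add(/\s+/, (rule: Rule, tape: Tape, token: Token, owner: any) => null);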
Tagging Tokens
Tokens can also be given custom tags so that they can be referred to by meaningful labels instead of by their rule indexes.
import { Tokenizer, Rule, TapeInterface as Tape, Token } from "tlex";
const tokenizer = new Tokenizer()
  .add(/\d+/, { tag: "NUMBER" }, (rule: Rule, tape: Tape, token: Token, owner: any) => {
    token.value = parseInt(token.value);
    return token;
  })
  .add(/\w+/, { tag: "WORD" }, (rule: Rule, tape: Tape, token: Token, owner: any) => {
    token.value = token.value.toUpperCase();
    return token;
  })
  .add(/\s+/, { skip: true });
const tokens = tokenizer.tokenize("123  hello  world");
console.log("Tokens: ", tokens);
Running the tokenizer above produces:
Tokens: [
  Token {
    tag: 'NUMBER',
    matchIndex: 0,
    start: 0,
    end: 3,
    id: 0,
    value: 123,
    groups: {},
    positions: {}
  },
  Token {
    tag: 'WORD',
    matchIndex: 1,
    start: 5,
    end: 10,
    id: 2,
    value: 'HELLO',
    groups: {},
    positions: {}
  },
  Token {
    tag: 'WORD',
    matchIndex: 1,
    start: 12,
    end: 17,
    id: 4,
    value: 'WORLD',
    groups: {},
    positions: {}
  }
]
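With tags in place, downstream code can branch on token.tag rather than on match indexes. A minimal sketch using only the fields shown above:

for (const token of tokenizer.tokenize("123  hello  world")) {
  switch (token.tag) {
    case "NUMBER":
      // token.value is already a number thanks to the handler above.
      console.log("number:", token.value);
      break;
    case "WORD":
      console.log("word:", token.value);
      break;
  }
}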
Rule ordering and priorities
By default, rules are matched in the order in which they were added to the tokenizer. For example, in the following configuration:
import * as TLEX from "tlex";
const tokenizer = new TLEX.Tokenizer()
  .add("hello")
  .add("world")
  .add(/\s+/)
  .add(/h.*o/);
matching:
console.log(tokenizer.tokenize("hello hello"));
would yield the following tokens:
[
  Token {
    tag: null,
    matchIndex: 0,
    start: 0,
    end: 5,
    id: 0,
    value: 'hello',
    groups: {},
    positions: {}
  },
  Token {
    tag: null,
    matchIndex: 2,
    start: 5,
    end: 6,
    id: 1,
    value: ' ',
    groups: {},
    positions: {}
  },
  Token {
    tag: null,
    matchIndex: 0,
    start: 6,
    end: 11,
    id: 2,
    value: 'hello',
    groups: {},
    positions: {}
  }
]
Notice the matchIndex, which denotes the rule that matched. Both (non-space) tokens have a matchIndex of 0 rather than 3, even though rule 3 (/h.*o/) could have matched a longer string.
Rules can be given priorities so that they are matched ahead of rules with lower priorities (regardless of ordering). Rules with equal priority are matched in the order of addition, e.g.:
import * as TLEX from "tlex";
const tokenizer = new TLEX.Tokenizer()
  .add("hello")
  .add(/\s+/)
  .add(/h.*o/, { priority: 20 }); // Higher priority than the default of 10.
console.log(tokenizer.tokenize("hello hello"));
Output:
[
  Token {
    tag: null,
    matchIndex: 2,
    start: 0,
    end: 11,
    id: 0,
    value: 'hello hello',
    groups: {},
    positions: {}
  }
]
By default, all rules have a priority of 10.
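Conversely, leaving /h.*o/ at the default priority makes all three rules equal, falling back to order-of-addition matching. A sketch (the explicit {priority: 10} is redundant and shown only for illustration):

import * as TLEX from "tlex";
const tokenizer = new TLEX.Tokenizer()
  .add("hello")
  .add(/\s+/)
  .add(/h.*o/, { priority: 10 }); // Same as the default, so addition order decides.
// "hello" (rule 0) wins again, yielding the same per-word tokens as the ordering example.
console.log(tokenizer.tokenize("hello hello"));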