Configuring Rules

Actions

Rules are added with the .add method on the Tokenizer. On their own, plain patterns are not very interesting. An action can also be attached to each rule via the add method; the action is invoked whenever the rule matches a token:

import { Tokenizer, Rule, TapeInterface as Tape, Token } from "tlex";
const tokenizer = new Tokenizer()
  .add(/\d+/, (rule: Rule, tape: Tape, token: Token, owner: any) => {
    console.log("Found Token: ", token);
    token.value = parseInt(token.value);
    return token;
  })
  .add(/\w+/, (rule: Rule, tape: Tape, token: Token, owner: any) => {
    console.log("Found a word: ", token);
    token.value = token.value.toUpperCase();
    return token;
  })
  .add(/\s+/, (rule: Rule, tape: Tape, token: Token, owner: any) => {
    console.log("Found a space: ", token);
    return null;
  });
const tokens = tokenizer.tokenize("123  hello  world");

Running the tokenizer above executes the actions, which print:

Found Token:  Token {
  tag: null,
  matchIndex: 0,
  start: 0,
  end: 3,
  id: 0,
  value: '123',
  groups: {},
  positions: {}
}
Found a space:  Token {
  tag: null,
  matchIndex: 2,
  start: 3,
  end: 5,
  id: 1,
  value: '  ',
  groups: {},
  positions: {}
}
Found a word:  Token {
  tag: null,
  matchIndex: 1,
  start: 5,
  end: 10,
  id: 2,
  value: 'hello',
  groups: {},
  positions: {}
}
Found a space:  Token {
  tag: null,
  matchIndex: 2,
  start: 10,
  end: 12,
  id: 3,
  value: '  ',
  groups: {},
  positions: {}
}
Found a word:  Token {
  tag: null,
  matchIndex: 1,
  start: 12,
  end: 17,
  id: 4,
  value: 'world',
  groups: {},
  positions: {}
}

In this example, each match handler is a RuleMatchHandler. These handlers can be used to:

  • Modify token values
  • Filter out unwanted tokens (see Skipping Tokens below)
  • Change the semantics of tokens (e.g. convert an integer string into an integer value)
  • Change tokenizer states

Once the actions have run, we can print the resulting tokens:

  console.log("All Tokens: ", tokens);

producing:

All Tokens:  [
  Token {
    tag: null,
    matchIndex: 0,
    start: 0,
    end: 3,
    id: 0,
    value: 123,
    groups: {},
    positions: {}
  },
  Token {
    tag: null,
    matchIndex: 1,
    start: 5,
    end: 10,
    id: 2,
    value: 'HELLO',
    groups: {},
    positions: {}
  },
  Token {
    tag: null,
    matchIndex: 1,
    start: 12,
    end: 17,
    id: 4,
    value: 'WORLD',
    groups: {},
    positions: {}
  }
]

Skipping Tokens

An explicit handler returning null can be used to skip particular tokens. Alternatively, the "skip" parameter can be used:

import { Tokenizer, Rule, TapeInterface as Tape, Token } from "tlex";
const tokenizer = new Tokenizer()
    .add(/\d+/, (rule: Rule, tape: Tape, token: Token, owner: any) => {
      token.value = parseInt(token.value);
      return token;
    })
    .add(/\w+/, (rule: Rule, tape: Tape, token: Token, owner: any) => {
      token.value = token.value.toUpperCase();
      return token;
    })
    .add(/\s+/, {skip: true});
console.log(tokenizer.tokenize("123  hello  world"));

resulting in:

[
  Token {
    tag: null,
    matchIndex: 0,
    start: 0,
    end: 3,
    id: 0,
    value: 123,
    groups: {},
    positions: {}
  },
  Token {
    tag: null,
    matchIndex: 1,
    start: 5,
    end: 10,
    id: 2,
    value: 'HELLO',
    groups: {},
    positions: {}
  },
  Token {
    tag: null,
    matchIndex: 1,
    start: 12,
    end: 17,
    id: 4,
    value: 'WORLD',
    groups: {},
    positions: {}
  }
]
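
The "skip" parameter unconditionally drops every match of its rule. When the decision depends on the matched text itself, an explicit handler can return null selectively. The following is a minimal illustrative sketch (not from the TLEX docs; the three-character cutoff is arbitrary):

import { Tokenizer, Rule, TapeInterface as Tape, Token } from "tlex";
const tokenizer = new Tokenizer()
    .add(/\w+/, (rule: Rule, tape: Tape, token: Token, owner: any) => {
      // Keep only words longer than three characters; returning null
      // drops the token just as {skip: true} would.
      return token.value.length > 3 ? token : null;
    })
    .add(/\s+/, {skip: true});
console.log(tokenizer.tokenize("hello to the world"));

Only the 'hello' and 'world' tokens should survive; 'to' and 'the' are dropped by the handler.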

Tagging Tokens

Tokens can also be given custom tags so that they can be referred to by meaningful labels instead of by their rule indexes.

import { Tokenizer, Rule, TapeInterface as Tape, Token } from "tlex";
const tokenizer = new Tokenizer()
    .add(/\d+/, {tag: "NUMBER"}, (rule: Rule, tape: Tape, token: Token, owner: any) => {
      token.value = parseInt(token.value);
      return token;
    })
    .add(/\w+/, {tag: "WORD"}, (rule: Rule, tape: Tape, token: Token, owner: any) => {
      token.value = token.value.toUpperCase();
      return token;
    })
    .add(/\s+/, {skip: true});
const tokens = tokenizer.tokenize("123  hello  world");
console.log("Tokens: ", tokens);

Running the tokenizer above prints:

Tokens:  [
  Token {
    tag: 'NUMBER',
    matchIndex: 0,
    start: 0,
    end: 3,
    id: 0,
    value: 123,
    groups: {},
    positions: {}
  },
  Token {
    tag: 'WORD',
    matchIndex: 1,
    start: 5,
    end: 10,
    id: 2,
    value: 'HELLO',
    groups: {},
    positions: {}
  },
  Token {
    tag: 'WORD',
    matchIndex: 1,
    start: 12,
    end: 17,
    id: 4,
    value: 'WORLD',
    groups: {},
    positions: {}
  }
]
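
Tags also make the token stream easier to consume. As a hypothetical downstream sketch (plain TypeScript over the Token fields shown above), dispatching on token.tag reads better than dispatching on positional rule indexes:

for (const token of tokens) {
  switch (token.tag) {
    case "NUMBER":
      console.log("number:", token.value);
      break;
    case "WORD":
      console.log("word:", token.value);
      break;
  }
}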

Rule ordering and priorities

By default, rules are matched in the order in which they are added to the tokenizer.

For example, in the following config:

  import * as TLEX from "tlex";
  const tokenizer = new TLEX.Tokenizer()
                  .add("hello")
                  .add("world")
                  .add(/\s+/)
                  .add(/h.*o/);

matching:

  console.log(tokenizer.tokenize("hello hello"));

would yield the following tokens:

[
  Token {
    tag: null,
    matchIndex: 0,
    start: 0,
    end: 5,
    id: 0,
    value: 'hello',
    groups: {},
    positions: {}
  },
  Token {
    tag: null,
    matchIndex: 2,
    start: 5,
    end: 6,
    id: 1,
    value: ' ',
    groups: {},
    positions: {}
  },
  Token {
    tag: null,
    matchIndex: 0,
    start: 6,
    end: 11,
    id: 2,
    value: 'hello',
    groups: {},
    positions: {}
  }
]

Notice the matchIndex, which denotes the rule that matched. Both (non-space) tokens have a matchIndex of 0 rather than 3, the index of the rule /h.*o/, which would have produced a longer match.

Rules can be given priorities so that they are matched ahead of rules with lower priorities (regardless of ordering). Two rules with equal priority are matched in the order of addition, e.g.:

  import * as TLEX from "tlex";
  const tokenizer = new TLEX.Tokenizer()
                  .add("hello")
                  .add(/\s+/)
                  .add(/h.*o/, {priority: 20});
  console.log(tokenizer.tokenize("hello hello"));

Output

[
  Token {
    tag: null,
    matchIndex: 2,
    start: 0,
    end: 11,
    id: 0,
    value: 'hello hello',
    groups: {},
    positions: {}
  }
]

By default all rules have a priority of 10.
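
For instance, giving /h.*o/ the same priority as "hello" should restore the order-of-addition behavior from the earlier example, since neither rule outranks the other. A minimal sketch of this assumed behavior:

  import * as TLEX from "tlex";
  const tokenizer = new TLEX.Tokenizer()
                  .add("hello", {priority: 10})
                  .add(/\s+/)
                  .add(/h.*o/, {priority: 10});
  // Both priorities equal the default of 10, so "hello" (added first)
  // should win again, yielding the same tokens as the ordering example.
  console.log(tokenizer.tokenize("hello hello"));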