novelai_api.Tokenizer

class SentencePiece

Bases: sentencepiece.SentencePieceProcessor

Wrapper around sentencepiece.SentencePieceProcessor that adds the encode and decode methods

__init__(model_path: str)
trans_table_ids: Dict[int, str]
trans_table_str: Dict[str, int]
trans_regex_str: re.Pattern
encode(s: str) → List[int]

Encode the provided text using the SentencePiece tokenizer. This workaround is needed because sentencepiece cannot handle some tokens on its own.

Parameters:

s – Text to encode

Returns:

List of tokens the provided text encodes into

decode(t: List[int])

Decode the provided tokens using the SentencePiece tokenizer. This workaround is needed because sentencepiece cannot handle some tokens on its own.

Parameters:

t – Tokens to decode

Returns:

Text the provided tokens decode into
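The attribute names trans_table_ids, trans_table_str and trans_regex_str suggest that the workaround translates certain token strings to and from fixed ids outside the base tokenizer. The following is only a minimal sketch of that general technique, not the library's implementation: SpecialTokenWrapper, toy_encode, toy_decode, the token strings and the ids are all hypothetical, and a one-id-per-character toy tokenizer stands in for sentencepiece.

```python
import re
from typing import Callable, Dict, List


class SpecialTokenWrapper:
    """Hypothetical wrapper that keeps special tokens away from the base tokenizer."""

    def __init__(
        self,
        special: Dict[str, int],
        base_encode: Callable[[str], List[int]],
        base_decode: Callable[[List[int]], str],
    ):
        # Assumed roles of the three documented attributes
        self.trans_table_str = dict(special)
        self.trans_table_ids = {i: s for s, i in special.items()}
        # The capturing group makes re.split keep the matched special tokens
        self.trans_regex_str = re.compile(
            "(" + "|".join(map(re.escape, special)) + ")"
        )
        self._base_encode = base_encode
        self._base_decode = base_decode

    def encode(self, s: str) -> List[int]:
        tokens: List[int] = []
        for part in self.trans_regex_str.split(s):
            if not part:
                continue
            if part in self.trans_table_str:
                # Special token: look the id up directly
                tokens.append(self.trans_table_str[part])
            else:
                # Ordinary text: delegate to the base tokenizer
                tokens.extend(self._base_encode(part))
        return tokens

    def decode(self, t: List[int]) -> str:
        out: List[str] = []
        run: List[int] = []
        for tok in t:
            if tok in self.trans_table_ids:
                # Flush ordinary tokens collected so far, then emit the special string
                if run:
                    out.append(self._base_decode(run))
                    run = []
                out.append(self.trans_table_ids[tok])
            else:
                run.append(tok)
        if run:
            out.append(self._base_decode(run))
        return "".join(out)


# Toy base tokenizer: one id per character, offset so ids 0 and 1 stay free
def toy_encode(text: str) -> List[int]:
    return [ord(c) + 10 for c in text]


def toy_decode(ids: List[int]) -> str:
    return "".join(chr(i - 10) for i in ids)


wrapper = SpecialTokenWrapper(
    {"<|endoftext|>": 0, "<|pad|>": 1}, toy_encode, toy_decode
)
ids = wrapper.encode("Hi<|endoftext|>!")
print(ids)                  # [82, 115, 0, 43]
print(wrapper.decode(ids))  # Hi<|endoftext|>!
```

Splitting on a compiled pattern means the base tokenizer never sees the special strings, so it cannot break them into pieces.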

class Tokenizer

Bases: object

Abstraction of the tokenizer behind each Model

classmethod get_tokenizer_name(model: novelai_api.Preset.Model) → str

Get the tokenizer name a model uses

Parameters:

model – Model to get the tokenizer name of
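One common way to implement a lookup like this is a class-level table keyed by the model enum. The sketch below illustrates that pattern only. ToyModel, ToyTokenizer, the table and the tokenizer names are hypothetical and do not reflect the actual members of novelai_api.Preset.Model or the names the library returns.

```python
from enum import Enum
from typing import Dict


class ToyModel(Enum):
    # Hypothetical models, not the members of novelai_api.Preset.Model
    SMALL = "toy-small"
    LARGE = "toy-large"


class ToyTokenizer:
    # Hypothetical model-to-tokenizer-name table
    _tokenizers_name: Dict[ToyModel, str] = {
        ToyModel.SMALL: "toy_bpe",
        ToyModel.LARGE: "toy_sentencepiece",
    }

    @classmethod
    def get_tokenizer_name(cls, model: ToyModel) -> str:
        # Fail loudly instead of guessing when a model has no registered tokenizer
        try:
            return cls._tokenizers_name[model]
        except KeyError:
            raise ValueError(f"No tokenizer registered for {model}") from None


name = ToyTokenizer.get_tokenizer_name(ToyModel.LARGE)
print(name)  # toy_sentencepiece
```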

classmethod decode(model: novelai_api.Preset.Model | novelai_api.ImagePreset.ImageModel, o: List[int]) → str

Decode the provided tokens using the chosen tokenizer

Parameters:
  • model – Model to use the tokenizer of

  • o – List of tokens to decode

Returns:

Text the provided tokens decode into

classmethod encode(model: novelai_api.Preset.Model | novelai_api.ImagePreset.ImageModel, o: str) → List[int]

Encode the provided text using the chosen tokenizer

Parameters:
  • model – Model to use the tokenizer of

  • o – Text to encode

Returns:

List of tokens the provided text encodes into
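Because encode and decode are classmethods that take the model as their first argument, a round trip needs no tokenizer instance. The helper below is a small sketch of that calling pattern. round_trip and StubTokenizer are hypothetical names, and the stub (a UTF-8 byte tokenizer) exists only so the example runs without the library installed.

```python
from typing import Any, List, Tuple


def round_trip(tokenizer_cls: Any, model: Any, text: str) -> Tuple[List[int], str]:
    """Encode text, then decode the resulting tokens with the same model."""
    tokens = tokenizer_cls.encode(model, text)
    return tokens, tokenizer_cls.decode(model, tokens)


class StubTokenizer:
    """Stand-in with the same classmethod shape as the documented Tokenizer."""

    @classmethod
    def encode(cls, model: Any, o: str) -> List[int]:
        # One token per UTF-8 byte; the model argument is ignored by this stub
        return list(o.encode("utf-8"))

    @classmethod
    def decode(cls, model: Any, o: List[int]) -> str:
        return bytes(o).decode("utf-8")


tokens, text = round_trip(StubTokenizer, "any-model", "Hello")
print(tokens)  # [72, 101, 108, 108, 111]
print(text)    # Hello
```

With the library installed, the same helper should accept the real class, for example round_trip(Tokenizer, model, "Hello") after importing Tokenizer from novelai_api.Tokenizer and choosing model from novelai_api.Preset.Model. Whether the decoded text matches the input exactly depends on the tokenizer, so treat an exact match as something to verify rather than assume.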