sander/rltk - Forgejo: Beyond coding. We forge.


Sander Hautvast f72da25396 added some more documentation		2022-05-09 22:39:12 +02:00
.github/workflows	Create rust.yml	2022-04-29 16:16:03 +02:00
src	added some more documentation	2022-05-09 22:39:12 +02:00
.gitignore	and gitignores	2022-04-28 16:22:12 +02:00
Cargo.lock	getting there	2022-04-29 12:27:22 +02:00
Cargo.toml	getting there	2022-04-29 12:27:22 +02:00
README.md	added some more documentation	2022-05-09 22:39:12 +02:00

RLTK

An attempt to manually port some of NLTK to Rust.

The NLTK language-model documentation says: "So as to avoid re-creating the text in memory, both train and vocab are lazy iterators. They are evaluated on demand at training time."

rltk has the same philosophy: everything is done using iterators (on iterators) on string slices.
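
As a rough illustration of the "iterators on iterators on string slices" idea, here is a minimal sketch of lazy padding and n-gram generation. The signatures are assumptions made for this example and may differ from the actual rltk API; in particular, the n-gram window is collected into a small `Vec<&str>` here, which rltk itself may or may not do.

```rust
use std::collections::VecDeque;
use std::iter;

/// Lazily pads a token stream with `n - 1` start and end symbols.
/// Hypothetical signature, not necessarily the one rltk exposes.
fn pad_both_ends<'a, I>(tokens: I, n: usize) -> impl Iterator<Item = &'a str>
where
    I: IntoIterator<Item = &'a str>,
{
    let pad = n.saturating_sub(1);
    let start: &'a str = "<s>";
    let end: &'a str = "</s>";
    // Nothing is copied or allocated here: the chain is evaluated on demand.
    iter::repeat(start)
        .take(pad)
        .chain(tokens)
        .chain(iter::repeat(end).take(pad))
}

/// Lazily yields every window of `n` consecutive tokens.
/// Only the current window (n string slices) is buffered.
fn ngrams<'a, I>(tokens: I, n: usize) -> impl Iterator<Item = Vec<&'a str>>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut window: VecDeque<&'a str> = VecDeque::with_capacity(n);
    tokens.into_iter().filter_map(move |token| {
        if n == 0 {
            return None;
        }
        if window.len() == n {
            window.pop_front();
        }
        window.push_back(token);
        if window.len() == n {
            Some(window.iter().copied().collect())
        } else {
            None
        }
    })
}
```

Because both functions accept any `IntoIterator` of `&str`, they compose: `ngrams(pad_both_ends(["a", "b", "c"], 2), 2)` produces padded bigrams without materialising the padded sequence first.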

Currently in its infancy (but growing):

rltk::lm::preprocessing::pad_both_ends(["a","b","c"], 2) -> ["<s>", "a", "b", "c", "</s>"]
rltk::util::pad_sequence == same as above with customisation
rltk::util::pad_sequence_left == same
rltk::util::pad_sequence_right == same
rltk::util::ngrams(["a","b","c"],2) -> [["a", "b"], ["b", "c"]]
rltk::util::bigrams(["a","b","c"]) == ngrams(..., 2)
rltk::util::trigrams(["a","b","c"]) == ngrams(..., 3)
rltk::util::everygrams(["a","b","c"],2) == [["a"], ["a", "b"], ["b"], ["b", "c"]]
rltk::util::flatten([["a"], ["a", "b"], ["b"], ["b", "c"]]) == ["a", "a", "b", "b", "b", "c"]
rltk::metrics::distance::edit_distance(): calculate the Levenshtein distance between two words (see doc)
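
For readers unfamiliar with the Levenshtein distance, the following is a minimal sketch of the classic dynamic-programming formulation, where each insertion, deletion, and substitution costs 1. It is not taken from rltk; the real `edit_distance` may accept additional parameters (NLTK's version, for example, has a configurable substitution cost and optional transpositions), so treat the signature below as an assumption.

```rust
/// Levenshtein distance between two words, counted in Unicode scalar values.
/// Illustrative only; the signature is assumed, not copied from rltk.
fn edit_distance(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();

    // prev[j] holds the distance between the prefix of `a` processed so far
    // and the first j characters of `b`. Initially that prefix is empty.
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();

    for (i, ca) in a.chars().enumerate() {
        let mut curr = Vec::with_capacity(b_chars.len() + 1);
        // Turning the first i + 1 characters of `a` into "" takes i + 1 deletions.
        curr.push(i + 1);

        for (j, &cb) in b_chars.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }

    prev[b_chars.len()]
}
```

Only two rows of the distance table are kept at a time, so memory use grows with the length of the second word rather than with the product of both lengths.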