The Next Generation Semantic Search

A friend runs a game on WhatsApp called Picardle, a play on the popular game Wordle, except he posts screenshots of Star Trek: The Next Generation and we try to guess the episode by describing the plot. Guessing the exact episode name is too difficult, so he judges whether we’re close and awards points. I thought it would be interesting to automate this process, so I built a semantic search across all ST:TNG episodes using a Python library called WordLlama. WordLlama is an NLP and word embedding model that recycles components from large language models (LLMs) to create efficient and compact word representations, in the same family as GloVe, Word2Vec or FastText. It begins by extracting the token embedding codebook from a state-of-the-art LLM (e.g., Llama 3 70B) and then trains a small context-less model within a general-purpose embedding framework. I scraped and cleaned the episode synopses to use as the search corpus, then deployed a small front end to a Hugging Face Space using Gradio. It works really well.
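To make the idea concrete, here is a minimal sketch of the ranking step: embed the guess, embed each synopsis, and sort episodes by cosine similarity. This is not the code behind the space. The three one-line summaries are hand-written for illustration (not the scraped synopses), the names `EPISODES`, `toy_embed`, and `rank_episodes` are hypothetical, and the bag-of-words `toy_embed` is only a standard-library stand-in that matches on shared words. In the real app, the vectors would come from WordLlama instead.

```python
import math
import re
from collections import Counter

# Hand-written one-line summaries for illustration only (assumption:
# the real corpus is the full set of scraped and cleaned synopses).
EPISODES = {
    "Darmok": "Picard is stranded on a planet with an alien captain who speaks only in metaphor",
    "The Inner Light": "A probe makes Picard live a whole lifetime on a dying planet where he learns the flute",
    "The Best of Both Worlds": "The Borg capture Picard and assimilate him into Locutus",
}


def toy_embed(text):
    """Stand-in embedding: a sparse bag-of-words count vector.

    This only captures word overlap. A real embedding model would map
    the text to a dense vector that also captures meaning.
    """
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_episodes(guess, episodes=EPISODES, embed=toy_embed):
    """Return (title, score) pairs sorted from best to worst match."""
    query_vec = embed(guess)
    scored = [
        (title, cosine(query_vec, embed(synopsis)))
        for title, synopsis in episodes.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_episodes("the one where Picard plays the flute on a dying planet")
for title, score in ranked:
    print(f"{score:.3f}  {title}")
```

Because `rank_episodes` takes the embedding function as a parameter, swapping the toy embedder for a real model should only mean passing a different `embed` callable (and a cosine function suited to dense vectors). The ranking logic stays the same.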

https://huggingface.co/spaces/m-butler/picardle