Notes on Data Processing for LLM Context

After building an interesting scraping algorithm, I need to clean up its output: remove identical or near-identical sentences repeated in close proximity, remove captured code elements, and strip other non-alphanumeric noise.
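As a rough baseline before reaching for embeddings, those three cleanup steps can be sketched with the standard library alone. This is a minimal sketch, not the scraper's actual code: the function names, the window of 5 sentences, the 0.9 similarity threshold, and the 20% symbol cutoff for "looks like code" are all assumptions picked for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical settings, chosen for illustration only.
WINDOW = 5        # how many recently kept sentences to compare against
THRESHOLD = 0.9   # similarity ratio at or above which a sentence is a repeat


def normalize(sentence):
    """Lowercase and collapse every run of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", sentence.lower()).strip()


def looks_like_code(sentence):
    """Crude heuristic (an assumption): many symbol characters suggests captured code."""
    if not sentence:
        return False
    symbols = sum(1 for ch in sentence if not ch.isalnum() and not ch.isspace())
    return symbols / len(sentence) > 0.2


def dedupe_nearby(sentences, window=WINDOW, threshold=THRESHOLD):
    """Keep a sentence unless it is noise, code-like, or close to a recently kept one."""
    kept = []
    kept_norm = []
    for sentence in sentences:
        norm = normalize(sentence)
        # Drop pure noise (nothing alphanumeric left) and code-like fragments.
        if not norm or looks_like_code(sentence):
            continue
        # Compare only against the last `window` kept sentences ("close proximity").
        recent = kept_norm[-window:]
        if any(SequenceMatcher(None, norm, prev).ratio() >= threshold for prev in recent):
            continue
        kept.append(sentence)
        kept_norm.append(norm)
    return kept


if __name__ == "__main__":
    sample = [
        "The cat sat on the mat.",
        "The cat sat on the mat!",
        "function(){return x;}",
        "Dogs bark loudly at night.",
        "***",
    ]
    print(dedupe_nearby(sample))
```

Comparing only against a short window keeps the cost low and avoids deleting a sentence that legitimately reappears much later in the text; character-level similarity will miss paraphrases, which is where embeddings come in.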

To that end, here are some notes on my research:

  1. https://towardsdatascience.com/deduplication-deduplication-1d1414ffb4d2 — word2vec based de-duplication
  2. https://www.youtube.com/watch?v=4b5d3muPQmA — K-Means clustering explained

After experimenting with some AI-assisted implementations of BERT embeddings and K-Means clustering, I am getting garbage output and need to research the actual mechanics of this approach further. I am now setting up a Jupyter notebook to test against raw scrape data generated by a previous method of text processing.
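To get a feel for the mechanics, here is a minimal pure-Python sketch of the K-Means loop (the standard assign-then-update iteration). It is not the BERT pipeline described above: the 2-D points stand in for sentence embeddings, and seeding the centroids with the first k points is an assumption made only to keep the run deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain K-Means: alternate between assigning points and moving centroids."""
    # Assumption for determinism: seed the centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    # Toy stand-ins for embeddings: two obvious groups.
    toy = [(0, 0), (0, 1), (10, 10), (10, 11)]
    labels, centroids = kmeans(toy, k=2)
    print(labels, centroids)
```

Even on this toy input the first pass groups the points badly because both starting centroids sit in the same group, and it takes a second pass to recover. Poor initialization, a badly chosen k, or embeddings that do not separate the sentences well are all plausible sources of garbage clusters worth checking in the notebook.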

The working implementation at this point involved spaCy.
