Notes on Data Processing for LLM Context
So, after building an interesting scraping algorithm, I need to clean its output: remove extremely similar or identical sentences repeated in close proximity, strip captured code elements, and filter out other non-alphanumeric noise.
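For the code-element and symbol-noise part, a minimal stdlib sketch of what such a cleaning pass could look like (the regex heuristics here are my illustrative assumptions, not the actual pipeline):

```python
import re

def strip_noise(text: str) -> str:
    """Heuristically remove scrape artifacts: captured code fragments
    and runs of non-alphanumeric noise. Patterns are illustrative."""
    # Drop fenced code blocks captured from the page
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)
    # Drop inline `code` spans
    text = re.sub(r"`[^`]+`", " ", text)
    # Collapse runs of 3+ symbol characters (e.g. "!!!", "-->")
    text = re.sub(r"[^\w\s]{3,}", " ", text)
    # Normalize whitespace
    return re.sub(r"\s+", " ", text).strip()
```

A real pass would need more patterns (HTML entities, markdown tables, etc.), but this is the general shape.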
To that end, here are some research notes:
- https://towardsdatascience.com/deduplication-deduplication-1d1414ffb4d2 — word2vec based de-duplication
- https://www.youtube.com/watch?v=4b5d3muPQmA — K-Means clustering explained
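Before reaching for embeddings at all, the "similar sentences repeated in close proximity" problem can be sketched with plain string similarity over a sliding window; the `window` and `threshold` values below are illustrative assumptions, not tuned settings:

```python
from difflib import SequenceMatcher

def dedupe_sentences(sentences, window=5, threshold=0.9):
    """Drop sentences that are near-identical to one of the last
    `window` kept sentences. Threshold/window are illustrative."""
    kept = []
    for sent in sentences:
        # Only compare against recent sentences ("close proximity")
        recent = kept[-window:]
        if any(SequenceMatcher(None, sent.lower(), prev.lower()).ratio() >= threshold
               for prev in recent):
            continue
        kept.append(sent)
    return kept
```

This scales poorly compared to embedding-based approaches, but it is a useful baseline to compare clustering output against.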
Okay, so after experimenting with some AI-assisted implementations of BERT embeddings and K-Means clustering, I am getting garbage output and need to research the actual mechanics of this approach further. To that end, I am building a Jupyter notebook and testing against raw scrape data generated by a previous text-processing method.
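To pin down the mechanics independently of any library, here is a bare-bones K-Means loop on 2-D points (toy data and parameters are my own, standing in for sentence embeddings): assign each point to its nearest centroid, then move each centroid to the mean of its assigned points, and repeat.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-Means on 2-D points to illustrate the mechanics."""
    rng = random.Random(seed)
    # Initialize centroids as k distinct random points
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                  + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each centroid to its cluster's mean
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters
```

Seeing the two steps spelled out makes it easier to debug garbage output: if the embeddings put unrelated sentences close together, K-Means will happily cluster them anyway, so the problem may be upstream of the clustering.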
The working implementation at this point involved spaCy.