BM25S — Efficacy Improvement of BM25 Algorithm in Document Retrieval | by Chien Vu | Aug, 2024
bm25s, a Python implementation of the BM25 algorithm, uses SciPy to speed up document retrieval
BM25, short for Best Matching 25, is a popular bag-of-words ranking function for document retrieval. It aims to deliver accurate and relevant search results by scoring each document against a query based on term frequencies and document length.
BM25 uses term frequency and inverse document frequency as part of its formula. These two quantities are also the core of TF-IDF.
First, let’s take a quick look at the TF-IDF formula.
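One common way to write it, assuming a length-normalized term frequency and an unsmoothed logarithmic IDF (other variants exist, and the symbols below are chosen here for illustration), is:

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{n_t}
$$

Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $n_t$ is the number of documents that contain $t$.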
In TF-IDF, the importance of a word increases in proportion to the number of times it appears in the document, but is offset by how frequently the word occurs across the corpus. The first part, Term Frequency (TF), indicates how often a term appears in a specific document. The more frequently a term appears within a document, the more likely it is to be significant. However, it is normalized by the total number…
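The idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the bm25s API: the function name `tf_idf`, the whitespace tokenizer, the toy corpus, and the choice to normalize TF by document length with an unsmoothed `log(N / df)` IDF are all assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: tf-idf weight} dict per document (illustrative variant)."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)

    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # TF is the term count divided by the document length;
            # IDF is log(N / df), which is 0 for terms found in every document.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus for illustration only.
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is a friend of humans",
    "a bird can fly",
]
scores = tf_idf(corpus)
```

In this toy corpus the word "a" occurs in every document, so its IDF and therefore its weight is zero, while a word such as "cat", which occurs in only one document, receives a positive weight.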