[BUILD-418] fix: AI search discarding results, split text in AI text search (PR #2133)

Related issues

  • Closes #

Proposed changes

Describe the big picture of your changes here to communicate to the reviewers why we should accept this pull request. If it fixes a bug or resolves a feature request, be sure to link to that issue.

How to reproduce

Explain like if someone who doesn’t know this project is reviewing your changes and how they can replicate this.

Screenshots and videos

Paste here any screenshots or videos related to your changes, if applicable.

Further comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc…

GitHub

:mag: Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

:page_facing_up: File: ankihub/ai/use_cases.py

Function Unhandled Issue
semantic_similarity_search_from_loaded_document IndexError: list index out of range /ai/{deck_id}…
Event Count: 1

Did you find this useful? React with a :+1: or :-1:

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 86.64%. Comparing base (fd0538d) to head (8efacf1).

Additional details and impacted files
@@             Coverage Diff             @@
##              dev    #2133       +/-   ##
===========================================
- Coverage   97.82%   86.64%   -11.18%     
===========================================
  Files         156      156               
  Lines       21820    21820               
===========================================
- Hits        21345    18907     -2438     
- Misses        475     2913     +2438     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

The title of this pull request changed from “fix: Combine results when using multiple embeddings for AI search” to "fix: Scores of AI search resutls when there are multiple chunks

The title of this pull request changed from “fix: Scores of AI search resutls when there are multiple chunks” to "fix: AI search discarding results, not using best scores

I changed this because if some note matches multiple embeddings, the note will be counted multiple times if we use .count().
Also, results_amount was previously not returning the amount of results which were actually returned, because they later got limited to PAGE_SIZE.
Now results_amount will bet the amount of actually returned results.
I also adjusted the other usages of _log_search to also use len(search_results) for consistency.

The title of this pull request changed from “fix: AI search discarding results, not using best scores” to "[BUILD-418] fix: AI search discarding results, not using best scores

This is needed because multiple embeddings can match a note. In that case the results will contain the note multiple times, so we need to filter out duplicates.

This is needed because multiple embeddings can match a note. In that case the results will contain the note multiple times, so we need to filter out duplicates.

:thinking: :question: How this can happen if we are performing an union of all querysets? The generated SQL will be a lot of concatenated WHERE … OR … OR … OR. How this SQL would return duplicate records from the note embeddings table?

When you use .union() it uses UNION in the SQL and not OR.
The django docs say this:

The UNION operator selects only distinct values by default.

However, it results duplicates in our case. I think it’s because the “duplicates” have different distance values.

I used UNION instead of OR, because when using OR, not all results have a distance annotation.

@heitorado @pedroven
We are currently not splitting the text when doing an AI text search. Shouldn’t we do that? I’m pretty sure we should. Some people are copying large amounts of text into the search text input when searching.
I started working on a PR for that: https://github.com/ankipalace/ankihub/pull/2135

I don’t know if we should make this change. The results amount to me refer to the amount of the NoteEmbedding generated by some query. @heitorado what do you think about this?

:recycle: This function has a possible error, we need to add a condition to avoid processing the querysets if the list of embeddings is empty. See Sign In | Sentry

:pick: since this function is only called internally we should rename to _keep_result_with_lowest_distance_for_each_note

@heitorado @pedroven We are currently not splitting the text when doing an AI text search. Shouldn’t we do that? I’m pretty sure we should. Some people are copying large amounts of text into the search text input when searching. I started working on a PR for that: #2135

Do you have some data to justify this change

The title of this pull request changed from “[BUILD-418] fix: AI search discarding results, not using best scores” to "[BUILD-418] fix: AI search discarding results, split text in AI text search

Renamed it now