LanceDB Destination #1375

Merged
merged 179 commits into from
Jun 27, 2024

Conversation

Pipboyguy
Collaborator

@Pipboyguy Pipboyguy commented May 16, 2024

Description

This PR adds support for using LanceDB as a destination.

  • Updated documentation
  • Unit tests

Related Issues

Additional Context

With this change, dlt users can easily load their transformed data into LanceDB for efficient vector similarity search, full-text search, and SQL querying. LanceDB's ability to store raw data alongside vector embeddings makes it a powerful destination for AI applications.

@Pipboyguy Pipboyguy requested a review from rudolfix May 16, 2024 20:56
@Pipboyguy Pipboyguy linked an issue May 16, 2024 that may be closed by this pull request
@Pipboyguy Pipboyguy requested a review from sh-rp May 16, 2024 20:56
@Pipboyguy Pipboyguy self-assigned this May 16, 2024

netlify bot commented May 16, 2024

Deploy Preview for dlt-hub-docs canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | db1e81d |
| 🔍 Latest deploy log | https://app.netlify.com/sites/dlt-hub-docs/deploys/667d4d36be836c0008cdcb5a |

@Pipboyguy Pipboyguy added the destination (Issue related to new destinations) and community (This issue came from slack community workspace) labels May 16, 2024
Pipboyguy added 22 commits May 16, 2024 23:19
Signed-off-by: Marcel Coetzee <[email protected]>
…ame for reserved fields

Storage options are only available in the asynchronous Python API. See https://lancedb.github.io/lancedb/guides/storage/

@akelad
Contributor

akelad commented Jun 19, 2024

Update: if I change the `lancedb_adapter` import to `from dlt.destinations.impl.lancedb.lancedb_adapter import lancedb_adapter`, the script runs. The docs say `from dlt.destinations.adapters import lancedb_adapter`, and that gave me the TypeError.

@Pipboyguy
Collaborator Author

Pipboyguy commented Jun 21, 2024

@akelad All configuration fields are optional now. Please see associated tests as well. Mind checking whether it works for you?

Provided you aren't computing embeddings, you should be able to run it as is after a `dlt init`.

@akelad
Contributor

akelad commented Jun 21, 2024

@Pipboyguy I'll try it out, thanks. Can you also take a look at my other comments? Both Anuun and I get `TypeError: 'module' object is not callable` when using `from dlt.destinations.adapters import lancedb_adapter`. That should be fixed.
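
The error message itself hints at what may have gone wrong: a name that was expected to be a function was bound to a module instead. The sketch below reproduces that situation in plain Python, without dlt. It only illustrates the error class, under the assumption that the documented import path resolved to the `lancedb_adapter` module rather than to the function inside it; the thread does not confirm the root cause. `adapter_module` and `_adapter` are stand-ins invented for the demonstration.

```python
import types

# Stand-in for a submodule that shares its name with the function it defines.
# (Hypothetical: this mimics the layout, it is not dlt's actual code.)
adapter_module = types.ModuleType("lancedb_adapter")


def _adapter(data, embed=None):
    """Toy adapter that just records which columns should be embedded."""
    return {"data": data, "embed": embed}


adapter_module.lancedb_adapter = _adapter

# Case 1: the imported name is bound to the module, so calling it fails.
lancedb_adapter = adapter_module
try:
    lancedb_adapter([{"name": "bulbasaur"}], embed=["name"])
except TypeError as exc:
    # The message contains "'module' object is not callable".
    error_message = str(exc)

# Case 2: the name is bound to the function inside the module, so the call works.
lancedb_adapter = adapter_module.lancedb_adapter
result = lancedb_adapter([{"name": "bulbasaur"}], embed=["name"])
```

If this is indeed the cause, importing the function from the module that defines it would sidestep the problem, which matches the workaround reported in this thread.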

@akelad
Contributor

akelad commented Jun 21, 2024

Can confirm it works without defining the API key or the embedding API key.

@Pipboyguy
Collaborator Author

@akelad

> I actually can't get the `lancedb_adapter` to work properly so far. I did a `dlt init rest_api lancedb`, which doesn't produce the code to use the `lancedb_adapter`; it just treats lancedb as a normal DB. Not sure how `dlt init` works and whether it's even feasible to adjust that.

Not all fields contain search-relevant information, and fields may require distinct embedding functions, especially for multi-modal data. Users must explicitly specify source fields for embeddings in their schema to ensure optimal performance and accuracy in vector search operations.

This is why the adapter has to be written by a user who understands what they are trying to achieve. If we want the adapter auto-generated for the example, we'd have to add it to the init command manually for each example, as you pointed out.

I'm not sure how to do this though. @rudolfix @sh-rp ?

@Pipboyguy
Collaborator Author

Pipboyguy commented Jun 21, 2024

> I understood you can only pass a resource if you also want to embed, and not a full source - is that correct?

Actually, it seems that you can. As you demonstrated above (I also ran it on my end), the adapter seems to work just fine with the pokemon source.

I speak under correction here, though. I actually wasn't aware that an adapter also works on sources.

The general flow, at least in my experience, is to use the adapter with resources.

@Pipboyguy
Collaborator Author

> Update: if I change the `lancedb_adapter` import to `from dlt.destinations.impl.lancedb.lancedb_adapter import lancedb_adapter`, the script runs. The docs say `from dlt.destinations.adapters import lancedb_adapter`, and that gave me the TypeError.

Thanks for this catch @akelad !

This should be fixed now. Kindly test again?

@akelad
Contributor

akelad commented Jun 24, 2024

The fixed import works now! About the `dlt init` command: that's fine with me. Maybe we should add some sort of note, either in the generated code or as a warning, that you have to add the `lancedb_adapter` for embeddings? Out of the box it will just act as a normal DB. @sh-rp what do you think?

> Actually it seems that you can. Like you demonstrated above (also ran it on my end), the adapter seems to work just fine with the pokemon source.

Actually, the example I showed there only runs on a resource: `pokemon_source.resources['pokemon']`. I don't think it works properly with a full source. When I tried it last week it did run, but there was some weird behaviour. I think we need to make it extra clear in the docs that this is intended to be used with resources and not at the source level. I can try to make those docs updates.

@Pipboyguy
Collaborator Author

Pipboyguy commented Jun 25, 2024

@akelad thank you so much for the detailed investigation.

Apologies, you are right. The following does indeed work:

    # Works: the adapter is applied to a single resource, then the whole source is run.
    lancedb_adapter(pokemon_source.resources['pokemon'], embed=["name"])
    load_info = pipeline.run(pokemon_source)

whereas the following doesn't raise a warning or exception, but nothing is embedded:

    # Runs without error, but embeds nothing: the adapter is applied to the whole source.
    lancedb_adapter(pokemon_source, embed=["name"])
    load_info = pipeline.run(pokemon_source)

I've updated the docs to make it clear that the adapter should only be used on resources, and that no fields will be embedded unless explicitly listed in the adapter.
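
One way to guard against the silent no-op described above is to route adapter calls through a small helper that only ever touches individual resources. The following is a minimal sketch and not part of dlt: `adapt_resources` and `embed_by_resource` are hypothetical names. It assumes that `source.resources` behaves like a mapping (as `pokemon_source.resources['pokemon']` suggests) and that the adapter modifies the resource in place, as in the working snippet. The adapter is passed in as an argument so the sketch runs without dlt installed.

```python
from typing import Any, Callable, Iterable, Mapping


def adapt_resources(
    source: Any,
    adapter: Callable[..., Any],
    embed_by_resource: Mapping[str, Iterable[str]],
) -> None:
    """Apply `adapter` to each named resource of `source`, never to the source itself.

    Hypothetical helper: `source.resources` is assumed to be mapping-like and
    `adapter` is assumed to accept `(resource, embed=[...])`.
    """
    for name, columns in embed_by_resource.items():
        # Fail loudly on a misspelled resource name instead of embedding nothing.
        if name not in source.resources:
            raise KeyError(f"source has no resource named {name!r}")
        adapter(source.resources[name], embed=list(columns))
```

With dlt installed, this would presumably be called as `adapt_resources(pokemon_source, lancedb_adapter, {"pokemon": ["name"]})` before `pipeline.run(pokemon_source)`.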

@rudolfix rudolfix added the sprint (Marks group of tasks with core team focus at this moment) label Jun 26, 2024
@sh-rp sh-rp merged commit 78cdb0b into devel Jun 27, 2024
52 checks passed
@sh-rp sh-rp deleted the 1370-lancedb-destination branch June 27, 2024 14:24
@rudolfix rudolfix removed the sprint (Marks group of tasks with core team focus at this moment) label Aug 21, 2024
Labels
community: This issue came from slack community workspace
destination: Issue related to new destinations
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

LanceDB Destination
4 participants