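Getting Started with Chroma

I. Installation and Environment Setup

1. Installation

Install the Chroma library with pip. Early Chroma releases did not install cleanly on Python 3.11; current releases do, so prefer a recent chromadb on a recent Python: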

pip install chromadb

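To connect to a remote Chroma server without installing the full library, the lightweight chromadb-client package (HTTP client only) can be used instead.

2. Version Compatibility

Chroma depends on SQLite 3.35+. If installation or import fails with an SQLite version error, try upgrading Python to 3.11+ or installing an older Chroma release.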
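The SQLite requirement above can be checked before installing anything. The following is a minimal sketch using only the standard library; the helper name sqlite_is_new_enough and the MIN_SQLITE constant are illustrative and not part of Chroma.

```python
import sqlite3
import sys

# Minimum SQLite version taken from the requirement stated above.
MIN_SQLITE = (3, 35, 0)


def sqlite_is_new_enough(version_info=None):
    """Return True if the given (or the linked) SQLite version is >= 3.35.0."""
    if version_info is None:
        # Version of the SQLite library this Python interpreter is linked against.
        version_info = sqlite3.sqlite_version_info
    return tuple(version_info) >= MIN_SQLITE


print("Python:", ".".join(str(part) for part in sys.version_info[:3]))
print("SQLite:", sqlite3.sqlite_version)
print("SQLite OK for Chroma:", sqlite_is_new_enough())
```

If the last line prints False, the interpreter is linked against an SQLite build that is too old, which is the situation the advice above addresses.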

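II. Clients and Connection Modes

1. Basic Client

import chromadb
client = chromadb.Client()  # in-memory mode; data is lost when the process exits

In-memory mode: nothing is written to disk, so all data is lost once the client process exits. It is suited to tests and quick experiments.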

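2. Persistent Client

client = chromadb.PersistentClient(path="/data/chroma")  # data is saved automatically under this local path

Persistent mode: data is saved automatically to the given path and loaded again the next time a client is opened on that path.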

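3. Server Mode

Server mode suits distributed environments and high-concurrency workloads, where several client processes share one database.

Start the server:

chroma run --path /db_path  # storage path for the database

Connect from a client: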

client = chromadb.HttpClient(host="localhost", port=8000)

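III. Collection Operations

1. Creating a Collection

collection = client.create_collection(
    name="my_collection",
    embedding_function=emb_fn  # optional custom embedding model (see section V.1)
)

Naming rules: 3 to 63 characters, starting and ending with a letter or digit (dots, dashes, and underscores are allowed in between, but not two consecutive dots), and not a valid IP address.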
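The naming rules can be checked on the client side before calling create_collection. This is a minimal sketch, not Chroma's own validation code: the function name is_valid_collection_name is hypothetical, and the set of characters allowed in the middle of the name (dots, dashes, underscores) is an assumption on top of the rules stated above.

```python
import ipaddress
import re

# 3-63 characters, alphanumeric first and last character.
# Assumption: dots, dashes, and underscores are allowed in between.
_NAME_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]{1,61}[A-Za-z0-9]$")


def is_valid_collection_name(name: str) -> bool:
    """Check a candidate collection name against the rules listed above."""
    if not _NAME_RE.fullmatch(name):
        return False
    if ".." in name:
        # Assumption: consecutive dots are rejected.
        return False
    try:
        ipaddress.IPv4Address(name)
    except ValueError:
        return True  # not an IP address, so the name is acceptable
    return False  # parses as an IPv4 address, which is not allowed


for candidate in ["my_collection", "ab", "192.168.0.1", "_bad_"]:
    print(candidate, is_valid_collection_name(candidate))
```

Validating early gives a clearer error message than waiting for the server to reject the name.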

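2. Adding Data

collection.add(
    documents=["text 1", "text 2"],
    ids=["id1", "id2"],
    metadatas=[{"category": "tech"}, {"category": "sports"}]
)

Documents are embedded automatically (the default model is all-MiniLM-L6-v2) unless precomputed vectors are passed through the embeddings argument.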

IV. Querying and Retrieval

1. Similarity Search

results = collection.query(
    query_texts=["query text"],
    n_results=3  # return the 3 most similar results
)
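The value returned by query is a dict of parallel lists with one inner list per query text, so results["ids"][0] holds the hits for the first query. The sketch below flattens that shape into one row per hit. The helper name first_query_rows is hypothetical, and the sample dict is hand-written to mimic the structure, not real query output.

```python
def first_query_rows(results):
    """Flatten the hits for the first query text into one dict per hit."""
    ids = results["ids"][0]
    count = len(ids)

    def column(key):
        # Some fields may be absent or None if they were not requested.
        values = results.get(key)
        return values[0] if values else [None] * count

    documents = column("documents")
    metadatas = column("metadatas")
    distances = column("distances")
    return [
        {
            "id": ids[i],
            "document": documents[i],
            "metadata": metadatas[i],
            "distance": distances[i],
        }
        for i in range(count)
    ]


# Hand-written sample in the same shape as a query result (illustrative values).
sample = {
    "ids": [["id1", "id2"]],
    "documents": [["text 1", "text 2"]],
    "metadatas": [[{"category": "tech"}, {"category": "sports"}]],
    "distances": [[0.12, 0.48]],
}
rows = first_query_rows(sample)
for row in rows:
    print(row["id"], row["distance"], row["document"])
```

With a real collection, the same helper would be called as first_query_rows(results) on the value returned by collection.query.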

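2. Filtered Queries

results = collection.query(
    query_texts=["query text"],  # query() still needs a query text or embedding
    n_results=3,
    where={"metadata_field": "value"},  # filter on metadata
    where_document={"$contains": "keyword"}  # filter on document content
)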

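3. Hybrid Retrieval

Vector similarity and metadata filtering can be combined in a single query: only records that pass the filters are ranked by distance. The result contains documents, IDs, distances, and metadata.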
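To make the idea of combining a metadata filter with similarity ranking concrete, here is a conceptual sketch in plain Python. It is not how Chroma is implemented: the function names, the toy two-dimensional vectors, and the choice of cosine distance are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def filtered_search(records, query_vec, where, n_results=3):
    """Keep records whose metadata matches `where`, then rank them by distance."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    scored = [(r["id"], cosine_distance(r["embedding"], query_vec)) for r in candidates]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:n_results]


# Toy data (illustrative only).
records = [
    {"id": "id1", "embedding": [1.0, 0.0], "metadata": {"category": "tech"}},
    {"id": "id2", "embedding": [0.0, 1.0], "metadata": {"category": "sports"}},
    {"id": "id3", "embedding": [0.8, 0.6], "metadata": {"category": "tech"}},
]
hits = filtered_search(records, query_vec=[1.0, 0.0], where={"category": "tech"})
print(hits)
```

The record tagged "sports" is excluded by the filter before any ranking happens, which is the behavior a filtered query provides in one call.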

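V. Advanced Features

1. Custom Embedding Models

Example using the BGE model from Hugging Face, run locally through sentence-transformers. (HuggingFaceEmbeddingFunction calls the hosted inference API and takes an api_key instead of a device, so the local variant is used here.)

from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
emb_fn = SentenceTransformerEmbeddingFunction(
    model_name="BAAI/bge-base-zh-v1.5",
    device="cuda"  # GPU acceleration; use "cpu" if no GPU is available
)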

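2. Multimodal Support

Models such as OpenCLIP can be integrated to handle images. The collection must be created with a multimodal embedding function (for example OpenCLIPEmbeddingFunction), and ids are still required:

collection.add(ids=["img1", "img2"], images=[image1, image2])  # image vectors are generated automatically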

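3. Persistence and Recovery

Since Chroma 0.4, PersistentClient writes to disk automatically, so no explicit persist() call is needed (persist() belonged to the LangChain wrapper). To reload after a restart, open a client on the same path:

client = chromadb.PersistentClient(path="/data/chroma")  # same path used when the data was written
collection = client.get_collection("my_collection")  # load the collection from disk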

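VI. Example Use Case

Building a knowledge base in LangChain (import paths depend on the LangChain version; older releases use langchain.vectorstores and langchain.embeddings):

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

vectordb = Chroma.from_documents(
    documents=texts,  # Document objects, already split into chunks
    embedding=HuggingFaceEmbeddings(),
    persist_directory="./chroma_db"
)  # embeds and stores each document; splitting must be done beforehand with a text splitter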

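Notes

1. Performance

When inserting large amounts of data, add it in batches (for example, 100 records per batch).
Server mode is the better fit for production and supports highly concurrent queries.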
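Batched insertion can be sketched with a small generator. The helper name in_batches is hypothetical, and the collection.add call is left as a comment because it needs a live collection; the batch size of 100 follows the suggestion above.

```python
from itertools import islice


def in_batches(items, batch_size=100):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Illustrative data: 250 (id, document) pairs.
ids = [f"id{i}" for i in range(250)]
documents = [f"text {i}" for i in range(250)]

batches = list(in_batches(zip(ids, documents), batch_size=100))
for batch in batches:
    batch_ids, batch_docs = zip(*batch)
    # With a real collection, each batch would be inserted like this:
    # collection.add(ids=list(batch_ids), documents=list(batch_docs))
    print(len(batch_ids), batch_ids[0], batch_ids[-1])
```

Pairing ids and documents before batching keeps the two lists aligned, so each add call receives matching entries.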

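2. Resource Limits

In-memory mode is suited to small datasets (roughly fewer than 100,000 records).
For large-scale data, consider databases built for distributed deployment, such as Milvus or Weaviate; Milvus also offers GPU-accelerated indexes.

These steps cover Chroma's core features. For hands-on practice, refer to the official documentation and example code.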