为什么要做去中心化向量搜索

由 Rorical 发布
源于初中某一次的契机,那时候去书店随便翻,找到了一本谈区块链的书。当时简直是如获至宝,因为从网上也听说过很多区块链的事,觉得很有意思。这本现在已经泛黄并且边缘破烂的书就是整个旅途的开始。
而到现在,经过好几年的积累,我终于可以有勇气和底气说,我也可以参与去中心化项目的开发了
一般这种文章好像开头都会先引用比特币啊什么的巴拉巴拉说上一堆,然后扯自己的项目怎么好然后等待新韭菜的上钩。我对于完全着眼于华而不实东西的项目说白了就是嗤之以鼻。践行“技术至上”原则的时候,那就摆出来谈技术,看看这个系统到底是怎么按照你的想法运行的。没有技术其它的一切都是瞎扯。
上面都是我在瞎扯请不要在意

去中心化可能已经是个老生常谈的词了。相信别的文章都阐述的十分全面,这里就不多赘述了。至于一项技术怎么使其发挥最大的效用,那就是做出实际有用的东西。标题来自于前一个月我刚刚萌生的想法,将索引查询一类的东西和去中心化系统结合一下,或许会行得通。这个月呢我就大概水了水技术文档,完善了下思路,然后终于能到拿出来和人谈的阶段了。

现在是名词解释时间=啥也没解释

  • 向量搜索:是一种寻找最邻近向量的过程,其向量可以是任意别的东西提取的特征值
  • 去中心化:这个词有点大所以自己谷歌吧
  • 区块链:是一种在去中心化系统中对过去记录以及当前状态达成共识的共识工具(自己理解)

传统的向量搜索多半是用在逆向搜索和推荐系统上。比如我去年夭折暂停的一个项目就是通过图片找源头,其中就用到了向量搜索。大概就是用神经网络把图片特征提取出来一个矩阵,然后矩阵展平成向量,在用向量搜索找到数据库里最邻近的几个图片特征就是结果。至于推荐系统就是根据用户的标签转换成特征向量然后找到最邻近的就是推荐的。在今天互联网巨头这么多的情况下向量搜索已经很常见了,比如抖音给你推荐的视频或者百度通过图片搜索出处。

然鹅在去中心化系统里鲜有一个能用的向量搜索系统。现在去中心化金融也发展的这么离谱的情况下,一堆DAPP随处可见,甚至都出现了去中心化诈骗。但是为什么这些DAPP没有发展出一个特别完善的生态呢,除了大部分人没有需求之外,很大程度上是因为他们还是处于应用层,而底层的基础生态仍然不够完善,DAPP仍然不能发展太多的功能。假如你想做一个去中心化的推特,就会先发现搜索推文的功能你都弄不了,因为没有这个平台。所以首先有一个去中心化的搜索平台供dapp们使用是个必要的事情。

其次,这个项目主要是考虑到我自己的学习曲线。由于现在最吃香的是发展一个通用平台,通用的区块链系统并且能支持各种dapp(比如以太坊就算是通用合约平台),自然去中心化的存储和计算也是功不可没的。做一个这样平台的基本前提,就是了解到运行在其之上的应用层到底需要什么样的平台。相比来说,去中心化搜索这层应该是在计算层之上的。所以这样一个项目可以为未来指明方向。

那么铺垫完了,接下来就可以正式引出这个项目的名字了:MuriData !

(目前项目在Github上,还是个只有技术档案的阶段)

ShugetsuSoft/MuriData: MuriData is a decentralized vector indexing and searching service. (github.com)

(区块链在写了在写了(咕咕咕))

名字哪里来的:muri是日文罗马音的"無理"

因为这个意思可能颇符合我要处理的数据(?),所以就有了这个名字

具体就是会把向量数据导入到一个去中心化的系统里存储并且执行搜索,然后通过区块链进行奖励和惩罚。

这个系统可以给其他dapp提供个向量搜索的平台(未来可能加入文本搜索),保证公开透明的数据存储,并且以一种完全自由市场的形式运行,通过这种设计可以支持数百万级别的向量数据在去中心化的情况下进行查询检索,按量付费。

最后,因为这个项目是我第一次尝试在去中心化领域做东西(网站都做不好还有脸做这个),大部分内容是偏向学习和增强自己实力的部分(这句话千万别让有可能的投资者看到(投机者就算了爱看不看)),所以说还没什么经验。
并且最重要的是,周围没几个人是这个方向的,要不然是做全栈开发要不然桌面的,特别是缺少可以请教的社区生态工作者。但是即使这样,这个项目我也祈求她一定能够被我们团队孵化出来。

关于团队:Shugetsu Soft 是在 Project Shugetsu 之下的独立开发团队(只有两个人的团队),在去年年末成立。当然现在没什么人,如果有加入的意愿,欢迎联系我: [email protected]

如果想看完整的技术文档,看下面

因为懒得再写一遍所以我把技术文档英文搬过来然后翻译一遍)

看中文的直接翻到最下面就行

MuriData Protocol

MuriData is a ==decentralized vector indexing and searching service==. With blockchain providing transaction and verification ability, It allows anyone to create vector databases and send searching tasks to it.

Technology Stack

  • BlockChain
  • BFT Consensus
  • Consistent hashing
  • Distributed Hash Table
  • Distributed Computing
  • Distributed Storage
  • Vector Indexing (Annoy Faiss NMSLib)
  • And Some Kawaii Characters (Very important!!!)

Glossary

  • MuriData : The general term of this system.
  • Magu : One vector database with index.
  • MurIndexer : Single node in a Magu who perform searching tasks.
  • MuriChain : The blockchain where muridata runs on.
  • MURI : The token of this whole ecosystem.

Introduction

MuriData is the name of the decentralized vector searching service. It includes a blockchain to organize nodes, give incentives and allocate tasks.

With MuriData, everyone could either publish a vector database or search on existing databases. Some specialized Indexing nodes will deal with queries and reach a consensus towards the best result.

Design

Layer 1 : Blockchain

The MuriData will run on the base of MuriChain.
MuriChain is designed to handle the transactions of the token MURI, which could be used as both a payment method of searching for data and a governance tool for the whole community.
There is no special requirement for the implementation of MuriChain. It could be a fork of either Ethereum or Polkadot etc. However, to allocate tasks, it must be able to track a list of Magu and perform incentives and punishments, which also requires a higher scalability and speed.

Layer 2 : Magus

Magu refers to the Japanese version of "mug". In MuriData this means a cluster of nodes who maintain a corresponding database. Because in order to search for a vector, nodes must store the vector data and related index data, which is impossible when the vector database is extremely large. To increase the scalability, support big data, and also prevent the single point of failure (SPOF), the searching task must be run on a number of nodes. With Magus, searching tasks could be finished efficiently and effectively.

Operations

Publishing a Magu (Vector Database)

The database is created when an account send a transaction indicating the creation of one database, with the staking of some MURI Token and related data published at the network. The whole network would orgnize a Magu for this database and randomly choose some nodes to become its members, MurIndexers. These nodes would form a small consensus group dealing with any tasks related to this database.

Becoming a MurIndexer

A node who wants to become a MurIndexer and provide indexing ability for the whole network will first stake some MURI with the promise of being online and store some data during a time interval. After staked some token, it would be on the list of candidates and allocated to a random Magu. It will start to store some vectors and perform tasks.

Searching for vectors

When a searching task is commited on the blockchain, the Magu would receive the task and start the searching process. Every MurIndexer would search for the nearst k vectors within its local data and send it to its Magu. Then all MurIndexers will reach a consensus on the top k best result among those commitments and finally commit it on the blockchain. The task would finish and receive a result.

Fishing at Muri Network

Fishers are the one who eventually being fished

Now Shrimp could eat the one who fish them.

Detalis

Vector Searching

A vector is a quantity or phenomenon that has two independent properties: magnitude and direction. In computer, it is stored as a series of numbers.
To compare the similarity of two vectors, we need a method to calculate the distance between two vectors. There are several ways:

  • Euclidean Distance
  • Normalized Distance
  • Hamming Distance

And also to speed up the process of searching a vector, there are many indexing methods:

  • Annoy Index
  • Hnsw Index

Distributed Computing

We could easily design a market of vector searching by simply just connect a vector searching server with its clients. However, there are several problems.

  1. The vector searching server is a centralized form without any tolerance to SPOF, so the request may not be responded.
  2. As a centralized server, it could not provide a searching service fast enough for use, due to the limited processing ability.
  3. The result is totally controlled by just one server. There is nothing to prove the correctness of its result.

Unlike any other form of computation whose answer is definite, tasks like vector searching could only produce the nearest vector as result. Sometimes the nearest may not be find out because the index method is inaccurate. Therefore it is hard to prove that the result is the nearest. However, we could still get the answer as accurately as possible, by putting many nodes into competing:

  1. Select many nodes in the network and form a smaller consensus group.
  2. When task is received, every node start to work and find out the result.
  3. All nodes in this group reach a consensus towards the answer by communication.

Distributed Storage

Just like the problems faced by distributed computing, the storage also has many difficulties to be deal with. It is inefficient to let every node store the same copy, while storing pieces of data in many node pulls the network into the risk of losing data. The way is to maintain several replications of one single piece of a data in a number of nodes, above a certain threshold. And when a node shutdown with its data, other node should rapidly get the copy of these data and store them. In MuriData system, we use Consistent hashing to achieve this.
The vector database will be broken into many pieces called 'segment', each segment contains a merkel tree to prove the existence of a vector, and has a unique hash to identify. Also, segments would be allocated to all nodes through consistent hashing.
All nodes joined to the small consensus group will be placed on a hash ring. Through calculation, the segment of vector database would get its location on the hash ring and then the host it will be stored on. One segment would have many replications on the hash ring. When one node quits, the data is also avaliable in other nodes. And to ensure the right number of copy exists, all node must frequently send proofs to the blockchain.

Distributed Hash Table

With the need of forming subgroups, nodes should be able to find and connect with each other. A DHT could be helpful.

Progress

sequenceDiagram

participant Baka
participant MuriChain
participant Magu
participant Node

Node ->> MuriChain : Stake to prove online.
Node ->> MuriChain : Node Info.
Baka ->> MuriChain : Register for a Magu.
MuriChain ->> MuriChain : Orgnize new Magu...
Baka ->> MuriChain : Add batch vectors..

MuriChain ->> Node : Allocate to a Magu..
activate Node
Node ->> Magu : Request for vector datas.
deactivate Node
Magu ->> Node : Vector datas.
Node ->> Node : Build Indexes.

loop Proof of Replicate
MuriChain ->> Node : Replicationg of vectors Challange.
activate Node
Node ->> MuriChain : Proof.
deactivate Node
MuriChain ->> Node: Reward for storing data.
end


Baka ->> MuriChain : Search for vector {...}
MuriChain ->> Magu : Task starts, input {...}

Magu ->> Node: Start searching task.
Node ->> Node: Find Nearst Vectors.
Node ->> Magu: Commit for best results.
loop Waiting for all nodes..
Magu ->> Magu : Ranking the results.
end

Magu ->> MuriChain: Commit the best K results.
Magu ->> MuriChain: Ranking of Nodes.
MuriChain ->> Node: Reward.

MuriChain ->> Baka: Final Result.

MuriData Protocol Detailed Design

This document includes a possible implementation of the MuriData Protocol.

BlockChain Details

All the consensus about Magus and token incentives will be handled by MuriChain. It should be able to track a list of Magus and their corresponding Vector Segments.

Magus List

One Magu represents One database. A new Magu with special id and name would be created in the blockchain.
Its current state could be represented as:

IdAccess ControllerNodesVector SegmentsVector DimensionsNode Quantity

Name is specified by the creater.
Access Controller refers to the account that can manage and add vectors to it.
Nodes is the list of node in it.
Vector Segments can be the list of all vector segments in it.

Nodes List

Every Magu must have a list of nodes for allocating tokens and tasks.

NodeIdNodeStatusAccountPublicKey

NodeId is generated from the public key of a node, which is also used in DHT for peer discovery.
NodeStatus indicates the status of this node, including its allowed disk spaces, computing ability and so on.
AccountPublicKey is the account that controls the node. An account could be added to a node and is responsible for the behavior of this node.

Vector Segments

Merkel RootVector Data Source

A vector segment is identified by its Merkel Root. It is also used to proof the existence of one vector.
Vector Data Source is the source of vector data, such as ipfs or other blockchain, usually being a hash pointer.

Code Example

Searching for vector

When MurIndexers in a Magu receive a query from MuriChain, they will start searching for the nearst results for this vector. They search them in the index, adding to a list of vectors, then give each of them a merkle proof and signature. After finishing this work, they will send the hash of this list to their Magu. When all node send their result hash, they will send their result. Any node who refuse to send the hash in a specified time interval or send the vector list will be punished, and removed from this Magu.
Their vector list looks like this:

Segment Merkle RootMerkle ProofVectorSignature

The node id will be also sent with this list.

When MurIndexer receive all the vector lists of their partners, they will start ranking the vectors in all of the lists. All node will get a ordered list of vectors. Then the first-finished node will propose its result in the Magu, starting a BFT consensus. If this proposal gets most of the signature in this group, it will be sent to MuriChain.

Node Details

MuriData Node is sperated into two: the blockchain node and index node.
Blockchain node deals with blockchain and index node is responsible for performing tasks.
Index node is connected with blockchain node. They communicate by rpc calls.


MuriData协议

MuriData是一个去中心化的矢量索引和搜索服务。通过区块链提供的交易和验证能力,它允许任何人创建矢量数据库并向其发送搜索任务。

MuriData: 这个系统的总称。
Magu:一个带索引的矢量数据库。
MurIndexer:Magu中执行搜索任务的单个节点。
MuriChain:运行MuriData的区块链。
MURI:整个生态系统的代币。

MuriData是一个去中心化的矢量搜索服务的名字。它包括一个区块链来组织节点,给予奖励和分配任务。
通过MuriData,每个人都可以发布一个矢量数据库或在现有的数据库上进行搜索。一些专门的索引节点将处理查询,并对最佳结果达成共识。

设计

第1层:区块链

MuriData将在MuriChain的基础上运行。MuriChain旨在处理代币MURI的交易,它既可以作为搜索数据的支付方式,也可以作为整个社区的治理工具。对MuriChain的实现没有特殊要求。它可以是以太坊或Polkadot等的一个分叉。然而,为了分配任务,它必须能够跟踪Magu的清单,并进行奖励和惩罚,这也需要更高的可扩展性和速度。

第二层: Magus
Magu指的是日语版的 "杯子"。在MuriData中,这意味着一个维护相应数据库的节点集群。因为为了搜索一个矢量,节点必须存储矢量数据和相关的索引数据,当矢量数据库极其庞大时,这是不可能的。为了提高可扩展性,支持大数据,同时防止单点故障(SPOF),搜索任务必须在多个节点上运行。有了Magus,搜索任务可以高效地完成。

发布一个MAGU(矢量数据库
当一个账户发送一个交易,表示创建一个数据库,用一些MURI代币做赌注,并在网络上发布相关数据,这个数据库就创建了。整个网络将为这个数据库组织一个Magu,并随机选择一些节点成为其成员,MurIndexers。这些节点将形成一个小型的共识小组,处理与这个数据库有关的任何任务。

成为一个MurIndexer
一个想成为MurIndexer并为整个网络提供索引能力的节点将首先押注一些MURI,并承诺在某个时间间隔内在线并存储一些数据。在押上一些代币后,它将被列入候选名单并被分配到一个随机的Magu。它将开始存储一些向量并执行任务。

搜索向量
当一个搜索任务被提交到区块链上时,Magu将收到任务并开始搜索过程。每个MurIndexer将在其本地数据中搜索最近的k个向量,并将其发送给其Magu。然后,所有MurIndexers将在这些承诺中就前k个最佳结果达成共识,并最终将其提交到区块链上。该任务将完成并收到一个结果。

分布式计算
我们可以很容易地设计一个矢量搜索市场,只需将一个矢量搜索服务器与它的客户相连。然而,这其中有几个问题。
矢量搜索服务器是一种集中式的形式,对单点故障没有任何容忍度,所以请求可能不会被响应。
作为一个集中式的服务器,由于处理能力有限,它不能提供足够快的搜索服务供使用。
结果是完全由一个服务器控制的。没有什么可以证明其结果的正确性。
不像任何其他形式的计算,其答案是确定的,像矢量搜索的任务只能产生最近的矢量作为结果。有时最近的可能找不到,因为索引方法不准确。因此,很难证明结果是最近的。然而,我们仍然可以尽可能准确地得到答案,通过把许多节点放到竞争中。
在网络中选择许多节点,形成一个较小的共识组。
当收到任务时,每个节点都开始工作并找出结果。
这个小组中的所有节点通过通信达成对答案的共识。

分布式存储
正如分布式计算所面临的问题一样,存储也有许多困难需要处理。让每个节点存储相同的副本是低效的,而在许多节点上存储数据片断会使网络陷入丢失数据的风险。方法是在一定数量的节点上对一个单一的数据进行多次复制,超过一定的阈值。当一个节点关闭其数据时,另一个节点应迅速获得这些数据的副本并将其存储。在MuriData系统中,我们使用一致性散列来实现这一目标。所有加入小型共识组的节点将被放在一个哈希环上。通过计算,矢量数据库的段会得到它在哈希环上的位置,然后是它将被存储的主机。一个段会在哈希环上有许多次的复制。当一个节点退出时,数据在其他节点也是可用的。而为了确保存在正确数量的副本,所有节点必须经常向区块链发送证明。矢量数据库将被分解成许多片断,称为 "片段",每个片段都包含一个默克尔树来证明一个矢量的存在,并且有一个独特的哈希值来识别。同时,片段将通过一致的散列方式分配给所有节点。

分布式哈希表
随着形成子群的需要,节点应该能够找到并相互连接。一个DHT可能是有帮助的。


仅有一条评论

  1. EL_File4138
    EL_File4138 · 2021-09-12 21:49

    每周固定状态简报
    在每天学c++了,目前看到了函数和类
    在每天背雅思单词了,目前背了170+个
    在每天学密码学了,目前看完了diffe-hellman密钥交换和一堆连带的理论(中国剩余定理啦算法复杂度啦balabala)
    目前没有任何进展,只能说是“学了点啥但又不知道自己会不会”的状态
    等待最近project shugetsu的状态(记得AES加密)

发表评论