This project has moved. For the latest updates, please go here.

Index, search and Matchcode

Jul 17, 2014 at 7:22 PM
Edited Jul 19, 2014 at 2:42 PM
Hello,

can you help me to understand the BPlusTree ?

I have seen, you search the sub., pred. and obj.
There is a uInt hash, for searching.

I the tree, we can only search exact values ?
Which role hast the hash here ?

In the 90' I have been build a index based MatchCode search, maybe this logic can be used here.
But for this I need an ascending numerical index. Must for a number (int) a hash formed ?

Or can we make such thing:
Actor has Property Name
Name has value "Toni"
Value "Toni" HasT

Query:
Give me all Name which HasT

Does the backup represent the Triples ? I think so.
Here is my Backup :
<http://www.brightstardb.com/.well-known/genid/cff0c597-a0cf-4d13-9635-7d8691f523c7> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://brightstardb.com/namespaces/default/Material> <http://www.brightstardb.com/.well-known/model/defaultgraph> .
<http://www.brightstardb.com/.well-known/genid/cff0c597-a0cf-4d13-9635-7d8691f523c7> <http://brightstardb.com/namespaces/default/nummer> "22777"^^<http://www.w3.org/2001/XMLSchema#string> <http://www.brightstardb.com/.well-known/model/defaultgraph> .
<http://www.brightstardb.com/.well-known/genid/cff0c597-a0cf-4d13-9635-7d8691f523c7> <http://brightstardb.com/namespaces/default/kurzbezeichnung> "Logic 22777"^^<http://www.w3.org/2001/XMLSchema#string> <http://www.brightstardb.com/.well-known/model/defaultgraph> .

<http://www.brightstardb.com/.well-known/genid/3d94c4db-697c-4ac0-8db2-dcaf07ca4528> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://brightstardb.com/namespaces/default/Material> <http://www.brightstardb.com/.well-known/model/defaultgraph> .
<http://www.brightstardb.com/.well-known/genid/3d94c4db-697c-4ac0-8db2-dcaf07ca4528> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://brightstardb.com/namespaces/default/SYS_Entity> <http://www.brightstardb.com/.well-known/model/defaultgraph> .
<http://www.brightstardb.com/.well-known/genid/3d94c4db-697c-4ac0-8db2-dcaf07ca4528> <http://brightstardb.com/namespaces/default/nummer> "22779"^^<http://www.w3.org/2001/XMLSchema#string> <http://www.brightstardb.com/.well-known/model/defaultgraph> .
<http://www.brightstardb.com/.well-known/genid/3d94c4db-697c-4ac0-8db2-dcaf07ca4528> <http://brightstardb.com/namespaces/default/kurzbezeichnung> "Logic 22779"^^<http://www.w3.org/2001/XMLSchema#string> <http://www.brightstardb.com/.well-known/model/defaultgraph> .

<http://www.brightstardb.com/.well-known/genid/c473a399-94d9-46be-9775-4b69dc4ca3da> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://brightstardb.com/namespaces/default/Material> <http://www.brightstardb.com/.well-known/model/defaultgraph> .
<http://www.brightstardb.com/.well-known/genid/c473a399-94d9-46be-9775-4b69dc4ca3da> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://brightstardb.com/namespaces/default/SYS_Entity> <http://www.brightstardb.com/.well-known/model/defaultgraph> .
<http://www.brightstardb.com/.well-known/genid/c473a399-94d9-46be-9775-4b69dc4ca3da> <http://brightstardb.com/namespaces/default/nummer> "22776"^^<http://www.w3.org/2001/XMLSchema#string> <http://www.brightstardb.com/.well-known/model/defaultgraph> .
<http://www.brightstardb.com/.well-known/genid/c473a399-94d9-46be-9775-4b69dc4ca3da> <http://brightstardb.com/namespaces/default/kurzbezeichnung> "Logic 22776"^^<http://www.w3.org/2001/XMLSchema#string> <http://www.brightstardb.com/.well-known/model/defaultgraph> .
Thank you.
Jul 20, 2014 at 3:58 PM
Edited Jul 21, 2014 at 4:49 AM
Hello,

you add sub., pred. and obj. to the tree as an resource and the you add a sub, / obj. relation between the resources !?
Each relation storage is a B+tree, but why in the same file ?

So the obj. is a own resource (in ResourceIndex.cs) and you can find it exact or sequential, because you have indexed a hash.

To search not all rows, the list should be limited. Best by an index.

For the MatchCode search, I need at lease 10 ascending integer indices. This allows the list to be dramatically reduced and the rest can be searched sequential.
Jul 21, 2014 at 1:05 PM
Edited Jul 21, 2014 at 1:07 PM
I have found a alternative solution.

The obj. in the ResourceIndex are in a BPlusTree, better may be a Ternary Search Tree.
So you can search with placeholder, a "start with" and exact ...

It's not better then the MatchCode but uses less memory ...

http://www.codeproject.com/Articles/5819/Ternary-Search-Tree-Dictionary-in-C-Faster-String
http://www.codeproject.com/Articles/14363/HTTree-Hacked-Ternary-Tree-Implementation
http://www.codeproject.com/Articles/84860/Ternary-Search-Tree-Algorithm-for-word-list-storag
http://igoro.com/archive/efficient-auto-complete-with-a-ternary-search-tree/

Regards.
Coordinator
Jul 29, 2014 at 2:15 PM
CyborgDE wrote:
I have found a alternative solution.

The obj. in the ResourceIndex are in a BPlusTree, better may be a Ternary Search Tree.
So you can search with placeholder, a "start with" and exact ...

It's not better then the MatchCode but uses less memory ...

http://www.codeproject.com/Articles/5819/Ternary-Search-Tree-Dictionary-in-C-Faster-String
http://www.codeproject.com/Articles/14363/HTTree-Hacked-Ternary-Tree-Implementation
http://www.codeproject.com/Articles/84860/Ternary-Search-Tree-Algorithm-for-word-list-storag
http://igoro.com/archive/efficient-auto-complete-with-a-ternary-search-tree/

Regards.
Interesting but I'm not sure where the memory saving would come from here ? It might be worth exploring some more. Having some sort of "proper" text index would help us improve performance for string matching (especially in FILTER statements) but would it really make it faster than a B-Tree for straight string matching ?
Jul 29, 2014 at 3:13 PM
Hello,

this is not for memory saving but for searching of objects.
With the hash, you can only search exact. With a e.g. Ternary-Search-Tree you can search for parts, contains, start with (maybe) ...

With the FILTER you can search nothing, I got a timeout with 5 000 000 Objects in the db.

Best regards.
Jul 29, 2014 at 6:50 PM
Edited Jul 30, 2014 at 4:44 AM
Hello,

in a WMS we have a list with 100000 materials. We a searching all materials where the material number or the description contains "140" and "210" ( a example from today). With a sequential search, this takes a long time.

Our stock booking list contains today 1900000 records, I want to get all the bookings in a date range, this is impossible without an index.

There are working 50 people with the system and currently I got the booking list in 200ms with 305 result records.

The customers has a "Roberto Brosch". With index support, a "starts wiith" gives me immediately a list of customers. With a index based match code a contains of "Brosch" gives me also immediately a list of customers. The index based search of "osch" should also be possible.
The index must not find the result completely, but should reduces the result list to find the result sequential.

Sequential searches let the server load increase enormously.

Best regards.
Coordinator
Jul 30, 2014 at 9:03 AM
These are definitely good use-cases for considering additional indexing of property values. Essentially beefing up the resource index from what it currently is (a simple BTree using string hashcode as a key). I would like to consider this in a future release, possibly making use of the work from the lucene.net project for this and maybe just adding it as an option for those projects where the requirement is for faster search at the cost of more disk-space/indexing time used.
Jul 30, 2014 at 10:34 AM
Edited Jul 30, 2014 at 10:46 AM
I decided against a full text search, as it maybe cannot fulfill some of the above examples.

Maybe the Number is "TRW1/140x210/LP1" or "TRW1/1402100/LP1"
We a searching all materials where the material number or the description contains "140" and "210"
The customers has a "Roberto Brosch"
The index based search of "osch" should also be possible
This can be done with a Ternary Search Tree (mybe) or my MatchCode search. I'm not sure with lucene, I have not tested it, but I think, It uses a Ternary Search Tree and can not search effective with leading wildcard.

I my head, I keep the MatchCode search. But I need for every indexed object at least 10 integer indexes.
Coordinator
Jul 30, 2014 at 12:31 PM
That is true, Lucene doesn't deal well with leading wildcards.

Of course what it comes down to is a tradeoff between functionality and the time taken to specify, implement and test that functionality. Plugging in Lucene.NET might be a quick way to get much more indexing functionality, whereas developing the indexing code from scratch would take longer but maybe allow for coverage of more use cases. IMO the key usecases would be driven by what is possible in SPARQL (which is quite a lot if you take regex filtering and the various string functions into account).

Plus there are usecases for range matching on other datatypes (numeric & date/time at least - maybe more) which would probably require different indexing strategies to be efficient.

I'm not saying I have made a decision about this or have a plan as this sort of indexing isn't really on my roadmap just yet.
Jul 30, 2014 at 1:13 PM