SQL Trees

Seryi · Post by **Seryi** » 23 Nov 2003 10:10

Хочу поговорить о довольно стандартной проблеме - хранении древовидных структур в реляционных базах данных.

Порывшись в теории нашел несколько методов:

1) Adjacency List, это самый простой, структура:
ItemID ParentID

2) Materialized Path

Code: Select all

Name   Path
Vasya  1.1
Petya   1.1
Kolya   1.1.1

http://www.dbazine.com/tropashko4.html

3) Nested Sets
http://www.intelligententerprise.com/001020/celko.shtml

4) Nested Intervals
http://www.dbazine.com/tropashko4.html

Больше всего интересуют последние 2, особенно номер 3. Кто-то ими пользовался? Особенно если в таблице более 10000 записей? Какие проблемы с ними могут возникнуть?

vovap · Post by **vovap** » 23 Nov 2003 14:38

Seryi wrote:Порывшись в теории нашел несколько методов:

1) Adjacency List, это самый простой, структура:
ItemID ParentID

2) Materialized Path
Code: Select all
Name   Path
Vasya  1.1
Petya   1.1
Kolya   1.1.1
http://www.dbazine.com/tropashko4.html

3) Nested Sets
http://www.intelligententerprise.com/001020/celko.shtml

4) Nested Intervals
http://www.dbazine.com/tropashko4.html

Больше всего интересуют последние 2, особенно номер 3. Кто-то ими пользовался? Особенно если в таблице более 10000 записей? Какие проблемы с ними могут возникнуть?

Ну я пользовася 3 методами - Adjacency List, Materialized Path и чем - то вроде Nested sets - когда записывается только номер уровня и номер по порядку.
Каждый имеет свои достоинства и недостатки для свиох целей.
Adjacency List без рекурсии юзать сложно, а рекурсия - не есть друг SQL.
Materialized Path сложноват при построении и приходистя применять трюки в работе со строками - очень желательны SQL функции (в MS SQL 7 их скажем нет)
Упрощенный Nested sets легче для SQL но держится на порядке - перестраивать его сложно, особенно если нет дырок в нумерации.

Seryi · Post by **Seryi** » 23 Nov 2003 15:08

vovap wrote:Ну я пользовася 3 методами - Adjacency List, Materialized Path и чем - то вроде Nested sets - когда записывается только номер уровня и номер по порядку.
Каждый имеет свои достоинства и недостатки для свиох целей.
Adjacency List без рекурсии юзать сложно, а рекурсия - не есть друг SQL.
Materialized Path сложноват при построении и приходистя применять трюки в работе со строками - очень желательны SQL функции (в MS SQL 7 их скажем нет)
Упрощенный Nested sets легче для SQL но держится на порядке - перестраивать его сложно, особенно если нет дырок в нумерации.

Я раньше пользовался Adjacency List и Materialized Path, оба мне не очень то нравились. Сейчас хочу попробовать Nested Sets, пока что пишу без дырок в нумерации, потом вероятно введу дырки.
Основной недостаток Nested Sets, который все приводят это тормозной INSERT и DELETE при большом размере дерева. У меня дерево будет около 10000 элементов. Как было у вас? Сталкивались ли вы с проблемами производительности?
СУБД - SQL Server 2000

vc · Post by vc » 23 Nov 2003 15:42

Seryi wrote:Хочу поговорить о довольно стандартной проблеме - хранении древовидных структур в реляционных базах данных.

Порывшись в теории нашел несколько методов:

1) Adjacency List, это самый простой, структура:
ItemID ParentID

2) Materialized Path
Code: Select all
Name   Path
Vasya  1.1
Petya   1.1
Kolya   1.1.1
http://www.dbazine.com/tropashko4.html

3) Nested Sets
http://www.intelligententerprise.com/001020/celko.shtml

4) Nested Intervals
http://www.dbazine.com/tropashko4.html

Больше всего интересуют последние 2, особенно номер 3. Кто-то ими пользовался? Особенно если в таблице более 10000 записей? Какие проблемы с ними могут возникнуть?

Regarding (3) and (4).

We're using dynamic trees with the number of nodes close to 400 K.

Method 3 was not suitable because it requires re-computing the labels every time the tree is changed.

Method (4) labeling implementation suffers from arithmetic overflow during label computation, even for relatively small trees. Here's an excerpt from an exchange between me and Vadim Tropashko:

Code: Select all

vadimtro@yahoo.com (Vadim Tropashko) wrote in message news:<22d2e427.0309171245.534732eb@posting.google.com>...
> "Robin Tucker" <r.tucker@thermoteknix.com> wrote in message news:<bk6os9$hrt$1$8302bc10@news.demon.co.uk>...
> > The error is "Arithmetic overflow error converting expression to data type
> > float".  This, I assume, is for the POWER function.  So even when I use
> > BIGINT or DECIMAL(12,0), the problem is still there.
> 
> I suggest posting the function and invocation context at microsoft
> forum, as there must be definitely a simple solution to your overflow
> problem.
> 
> Otherwise, I'm inclined to add the following paragraph into my next
> article
> <quote>
> An immediate property of the above labeling schema is density: we have
> all integer positive numbers utilized. Since some database
> implementations have integers of limited range only, density might
> become an attractive feature, because one would be able to store and
> manipulate bigger trees before hitting arithmetic overflow exception.
> Today, the problem of 16-bit overflow is ridiculous, of course, but
> that is nevertheless the practical limitation that database developer
> have to live with in some RDBMS implementations.
> </quote>  
> 
> I'm suspecting once again, however, that the blank statement at the
> very end that I had drawn from your experience is wrong. I would be
> happy to learn that the overflow is a non-issue and remove the last
> statement.


Hello,

I'll have to defend SQL Server.  COnsider this in Oracle:

select rownum, path, path_numer(path), path_denom(path) from (
  select '1' path from dual
   union all
  select '1.'||rownum from all_objects
)

Result:
      rownum    path
...............................................
       119       1.118           6.6461E+35      6.6461E+35
       120       1.119           1.3292E+36      1.3292E+36

ERROR:
ORA-06502: PL/SQL: numeric or value error: number precision too large

... only after 120 rows.

VC.

120 rows selected

... and Vadim's response:

..This is correct, the coding numbers are growing exponentially.
Therefore, any fixed precision ariphmetics would break at some point.
This is why the new "dense" integer coding schema.

And then he goes to describe a new encoding method that, he says, should be published in "dbazine" soon.

Practically, for MS SQL Server we implemented BFS in a stored procedure and Oracle has a recursive query which is trivial to use.

The new version of SQL Server is going to have a recursive query implementation too. It will be conformant SQL'99 and is the same as implemented in DB2 (the 'with' query).

Rgds.

vovap · Post by **vovap** » 23 Nov 2003 16:06

Seryi wrote:Основной недостаток Nested Sets, который все приводят это тормозной INSERT и DELETE при большом размере дерева.

Ну это для того, что было описано в статье. Там надо перерисовавать все дерево при любом его изменении. Тот упрощенный, что мы применяли этим в целом не страдает - если нумеровать с дырками - вставка не является проблеммой. В конце концов, конечно можен возникнуть ситуация, когда всталять некуда - тогда надо переорганизовывать все. Для пущей крутизны можно нумеровать каждую ветку отдельно с начала - тогда вообще можно не сильно бояться - хотя построение и усложнится.
Удаление там не проблемма по определению - от того, что исчезнет блок порядковых номеров дерево не изментся.

У меня дерево будет около 10000 элементов. Как было у вас? Сталкивались ли вы с проблемами производительности?
СУБД - SQL Server 2000

Ну, 10000 элементов - это не много. Это в принципе - чепуха, все в памяти может легко поместиться. У нас было, скажем того же порядка. Производительность у нас определялась в целом иными факторами но особых проблемм я не вижу.

vc · Post by vc » 23 Nov 2003 18:43

vovap wrote:
Seryi wrote:Основной недостаток Nested Sets, который все приводят это тормозной INSERT и DELETE при большом размере дерева.

Ну это для того, что было описано в статье. Там надо перерисовавать все дерево при любом его изменении. Тот упрощенный, что мы применяли этим в целом не страдает - если нумеровать с дырками - вставка не является проблеммой. В конце концов, конечно можен возникнуть ситуация, когда всталять некуда - тогда надо переорганизовывать все. Для пущей крутизны можно нумеровать каждую ветку отдельно с начала - тогда вообще можно не сильно бояться - хотя построение и усложнится.
Удаление там не проблемма по определению - от того, что исчезнет блок порядковых номеров дерево не изментся.
У меня дерево будет около 10000 элементов. Как было у вас? Сталкивались ли вы с проблемами производительности?
СУБД - SQL Server 2000

Ну, 10000 элементов - это не много. Это в принципе - чепуха, все в памяти может легко поместиться. У нас было, скажем того же порядка. Производительность у нас определялась в целом иными факторами но особых проблемм я не вижу.

1. Yes, for inserts and deletes, you are right. The problem, however, is not solved for updates. Consider the nested sets below:

[10, [20, [30,...,40], [50, ...,60], 70], [80, 200], 1000]

Assuming you want to move subtrees (subsets) from [20,70] to [80, 200]. You'd need to re-number all the highest level nodes plus all the inner subsets thereof, recursively:

[10, [20, 70], [80,[90,...,100], [110,..,120], 200], 1000].

With the adjacency list, you'd need to change only _one_ node. Not only is it more performant (which may not be a serious consideration for 10000 nodes), more importantly the adjacency model is much better for concurrent updates for obvious reasons.

2. The alternative labeling breaks completely if your hierarchical structure is a simple DAG (e.g. at least one node has two parents).

3. If re-numbering is required, then why not go all the way and use the transitive closure that can be incrementally updated. The TC is much more flexible with respect to ad hoc queries and works for DAGs as well as for trees.

Rgds.

vovap · Post by **vovap** » 23 Nov 2003 19:01

vc wrote:1. Yes, for inserts and deletes, you are right. The problem, however, is not solved for updates. Consider the nested sets below:

[10, [20, [30,...,40], [50, ...,60], 70], [80, 200], 1000]

Assuming you want to move subtrees (subsets) from [20,70] to [80, 200]. You'd need to re-number all the highest level nodes plus all the inner subsets thereof, recursively:

[10, [20, 70], [80,[90,...,100], [110,..,120], 200], 1000].

With the adjacency list, you'd need to change only _one_ node. Not only is it more performant (which may not be a serious consideration for 10000 nodes), more importantly the adjacency model is much better for concurrent updates for obvious reasons.

Да, перенос кусков дерева создает затруднения. Но если каждую ветку нумеровать отдельно - то небольшие.
В целом все зависит - для чего это дерево надо и что с ним будут делать.
Да, применявшаяся нами модель годилась только для строгого дерева, без сетевых связей

vc · Post by vc » 23 Nov 2003 19:19

vovap wrote:
vc wrote:1. Yes, for inserts and deletes, you are right. The problem, however, is not solved for updates. Consider the nested sets below:

[10, [20, [30,...,40], [50, ...,60], 70], [80, 200], 1000]

Assuming you want to move subtrees (subsets) from [20,70] to [80, 200]. You'd need to re-number all the highest level nodes plus all the inner subsets thereof, recursively:

[10, [20, 70], [80,[90,...,100], [110,..,120], 200], 1000].

With the adjacency list, you'd need to change only _one_ node. Not only is it more performant (which may not be a serious consideration for 10000 nodes), more importantly the adjacency model is much better for concurrent updates for obvious reasons.

Да, перенос кусков дерева создает затруднения. Но если каждую ветку нумеровать отдельно - то небольшие.
В целом все зависит - для чего это дерево надо и что с ним будут делать.

Could you please elaborate ? If am parsing your message correctly ('ветку' == branch), then it means that you must have invented a completely different labeling scheme since the nested sets labeling, as well as the materialized path one, define a global position within the hierarchy, and only the adjacency list can get way with local lableling. Eager to learn ...

Rgds.

vovap · Post by **vovap** » 23 Nov 2003 19:33

vc wrote:Could you please elaborate ? If am parsing your message correctly ('ветку' == branch), then it means that you must have invented a completely different labeling scheme since the nested sets labeling, as well as the materialized path one, define a global position within the hierarchy, and only the adjacency list can get way with local lableling. Eager to learn ...

Rgds.

Да нет, это я просто заврался с устатку

Не получится, конечно не иметь сплошной нумерации, Ваша правда.

vc · Post by vc » 23 Nov 2003 19:41

vovap wrote:
vc wrote:Could you please elaborate ? If am parsing your message correctly ('ветку' == branch), then it means that you must have invented a completely different labeling scheme since the nested sets labeling, as well as the materialized path one, define a global position within the hierarchy, and only the adjacency list can get way with local lableling. Eager to learn ...

Rgds.

Да нет, это я просто заврался с устатку :) Не получится, конечно не иметь сплошной нумерации, Ваша правда.

Thank you.

I was genuinely intrigued for a while ;)

Best rgds.

Siberian Cableman · Post by **Siberian Cableman** » 24 Nov 2003 13:50

Hi everybody,

I have a similar problem with tree in SQL Server that should be developed for the 10.000 – 100.000 records. Any data records should be assigned to some node in the tree. So there are no records outside of the tree. I have two questions about better implementation or recommendations:

1. Bulk Moving: When one of the parent nodes (with all children nodes, and all data records under it) is moved under another node:
Example:
How it was: Alex -> Piter -> Tom -> Anton -> Vadim,
Now we are moving the whole branch of Tom -> Anton -> Vadim under
Alex -> Oleg -> Konstantin -> Dennis to get them like this:
Alex -> Oleg -> Konstantin -> Dennis -> Tom -> Anton -> Vadim.

What are your recommendation regarding better algorithms for my purposes and number of records (10.000 – 100.000 ) in the context of this topic?

2. I called it Bulk Result Logging: When the whole branch of records depending on the value of parent’s record should change their values:
Example: You have a set of technical parameters of engine that are categorized and which should be checked against the car you want to buy. The parent of them is “Turn On Engine”. If result is False, you cannot check anything else on the engine that does not run, so all other children and grand children in the branch should change their value to “Not available for checking”.
The question is how to create structure of the table such way that allows doing it (changing values of the children) fast, and SQL Server will not be overloaded in such operation.

Thanks,
Alex

Siberian Cableman · Post by **Siberian Cableman** » 25 Nov 2003 02:12

Guys, Anybody can help?

Siberian Cableman · Post by **Siberian Cableman** » 26 Nov 2003 10:23

Another article about the same topic. Just without lots of details:

http://www.scenic-route.com/program/db/lists.htm
http://www.scenic-route.com/program/db/lists2.htm
http://www.scenic-route.com/program/db/lists3.htm
http://www.scenic-route.com/program/db/lists4.htm

Привет

SQL Trees

SQL Trees

Re: SQL Trees

Re: SQL Trees

Re: SQL Trees

Re: SQL Trees

Re: SQL Trees

Re: SQL Trees

Re: SQL Trees

Re: SQL Trees

Re: SQL Trees