
Third solution. Building the ideal database design
© Alexander Sychev, Sergei Sychev, “TRIZ-RI Group”
The name that can be named is not the eternal name. Lao Tzu
Now let's try to apply TRIZ: let's resolve the contradiction described in the previous section and create a truly ideal structure for storing any data we want. Adapting the classical TRIZ formulation of the ideal final result to our case, we write: "There is no object, but its data is stored."
But what is a record of an object? Strictly speaking, an informational system cannot contain the actual object by any means. Since we are dealing with information, it contains only descriptions of objects. And what is an "object's description"? It is a set of "keys" and "values." And what about the names of the objects? See the epigraph. For informational systems, the quote from Lao Tzu should be understood not as a metaphor but literally: there are no objects in the database, there are only keys and values.
Let's do so, shall we?
Entities (object’s keys)

Here:
Id - integer

Id: 1, 2, 3, 7, 11

Properties (of any entities)

Here:
_id - integer foreign key = the entity that has this property
Names - name of the property (string)
Values - value of the property (string)

_id: 1, 2, 2, 7, 11
This approach provides a "one to many" relationship. Any data can be obtained with a single query, hence good performance is achieved. That's a good point. However, if there are entities with the same properties, we end up with identical records in the table Properties. That's a problem. A significant problem.
Let us try, for example, to store the documents we produce when dealing with our counterparties. Let the document type (the "Document_Type" property) always be chosen from a fixed set of options (say, "account", "invoice", "agreement", "act", etc.), and let the vendors often be the same (the "Provider" property).
Then if we make, for example, three thousand deals, this results in 3000 new records, all of which have identical "Document_Type" and "Provider" properties: 6000 identical records in total, and that is assuming only two of the properties repeat. In general, if we add to the table Properties m entities with n properties each, k of which are identical (k <= n), we generate m*n new records (k*m of them identical). Such a database grows rapidly, although it works fast for its size.
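The first design can be sketched in a few lines. This is a minimal illustration in Python with SQLite; the table and column names follow the article, while the sample data and queries are our own assumptions:

```python
import sqlite3

# Sketch of the first design: one Entities table, one Properties table.
# "Values" must be quoted because VALUES is an SQL keyword.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Entities (Id INTEGER PRIMARY KEY);
    CREATE TABLE Properties (
        _id      INTEGER REFERENCES Entities(Id),  -- who has this property
        Names    TEXT,                             -- name of the property
        "Values" TEXT                              -- value of the property
    );
""")

con.execute("INSERT INTO Entities VALUES (1), (2), (3)")

# Three documents, each repeating the same Document_Type and Provider rows.
rows = []
for entity_id in (1, 2, 3):
    rows += [(entity_id, "Document_Type", "agreement"),
             (entity_id, "Provider", "TRIZ-RI Group")]
con.executemany("INSERT INTO Properties VALUES (?, ?, ?)", rows)

# Any entity's data comes back with a single query ...
props = con.execute(
    'SELECT Names, "Values" FROM Properties WHERE _id = ?', (2,)).fetchall()
print(props)  # [('Document_Type', 'agreement'), ('Provider', 'TRIZ-RI Group')]

# ... but identical name/value pairs are stored once per entity:
dup = con.execute(
    "SELECT COUNT(*) FROM Properties WHERE Names = ?",
    ("Provider",)).fetchone()[0]
print(dup)  # 3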
Let us improve the solution following the general logic of normalization, moving all values from the table Properties (except Id) into separate tables. Two new tables are created: the table Names and the table Values. They store only unique names and unique values of properties respectively, which are then "gathered into sets" of properties in the table Sets.
Keys in the table Sets are not unique. They link the id of a set of properties with the properties themselves.
Sets (of properties’ names and values)

Here:
Id - integer
Name_id - integer foreign key to table Names
Value_id - integer foreign key to table Values

Names (of properties)

Values (of properties)
Any data can still be obtained with a single query, so good performance is preserved, and no duplicate records are created. If we add m sets with n names/values of properties each, k of which already exist (k <= n), we generate n-k new records in the table Names/Values and m*n in the table Sets. (If k = n, no new records in the table Names/Values are created at all.)
Thus, space is saved when we add sets with identical properties. Such a database grows much more slowly, because when a duplicate name/value is added, only the table Sets increases in size.
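The normalized three-table variant can be sketched as follows. This is an illustrative SQLite implementation; the `add_property` helper and the sample data are our assumptions, not part of the article:

```python
import sqlite3

# Sketch of the normalized design: Sets links unique Names to unique Values.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Names    (Id INTEGER PRIMARY KEY, Name  TEXT UNIQUE);
    CREATE TABLE "Values" (Id INTEGER PRIMARY KEY, Value TEXT UNIQUE);
    CREATE TABLE Sets (
        Id       INTEGER,                           -- id of the property set
        Name_id  INTEGER REFERENCES Names(Id),
        Value_id INTEGER REFERENCES "Values"(Id)
    );
""")

def add_property(set_id, name, value):
    """Reuse an existing name/value row if one exists, else create it."""
    con.execute("INSERT OR IGNORE INTO Names (Name) VALUES (?)", (name,))
    con.execute('INSERT OR IGNORE INTO "Values" (Value) VALUES (?)', (value,))
    name_id = con.execute(
        "SELECT Id FROM Names WHERE Name = ?", (name,)).fetchone()[0]
    value_id = con.execute(
        'SELECT Id FROM "Values" WHERE Value = ?', (value,)).fetchone()[0]
    con.execute("INSERT INTO Sets VALUES (?, ?, ?)", (set_id, name_id, value_id))

# Three sets with identical properties.
for set_id in (1, 2, 3):
    add_property(set_id, "Document_Type", "agreement")
    add_property(set_id, "Provider", "TRIZ-RI Group")

# One query still returns a whole set of properties:
rows = con.execute("""
    SELECT n.Name, v.Value
    FROM Sets s
    JOIN Names    n ON n.Id = s.Name_id
    JOIN "Values" v ON v.Id = s.Value_id
    WHERE s.Id = ?""", (2,)).fetchall()
print(sorted(rows))  # [('Document_Type', 'agreement'), ('Provider', 'TRIZ-RI Group')]

# Duplicate names/values are stored once; only Sets grows:
print(con.execute("SELECT COUNT(*) FROM Names").fetchone()[0])  # 2
print(con.execute("SELECT COUNT(*) FROM Sets").fetchone()[0])   # 6
```

With m = 3 sets, n = 2 properties and k = 2 duplicates, Names and Values each hold n - k + 2 = 2 rows while Sets holds m*n = 6, matching the formula above.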
Here we must make a necessary methodological remark. There are no objects, there are no names of objects, and there is no evidence that they exist at all ("There is no spoon"). There are only properties: names of properties, values of properties, and sets of properties. You can gather these properties in any way you want. You can mark a gathered "set of properties" with a tag (give it an appropriate name), but you, as a developer, should not treat this "pack" of properties as an "object", because that entails the inertia of thinking. A spoon (fork, groundhog) does not have properties; rather, we (if we get such a weird idea) label this pack of properties "spoon", that one "fork", and that one "groundhog".
You have surely noticed that the initial contradiction - "Properties of different entities should be different, so that we can work with entities according to their features, and should not be different, so that the number of tables does not increase" - is now eliminated.
We can now gather any properties into a corresponding Set and treat them in whatever specific way we want. In addition, no new tables are required. So we have a universal interface between arbitrarily specific entities.
But let us continue curtailing:
The table "Names" contains two columns, Id and Names, both storing only unique values.
New names (of properties)
But the second column of this table serves only one purpose: it lets the programmer (not the user, and not even the program) understand what we are working with. In addition, the table Names stores a "unique to unique" relation, so a single field suffices. Since searching through numerical identifiers is faster, you can simply throw the column Names away, leaving only the column Id. (How to do that without the programmer losing comprehension is described further on.)
Then, since only the column Id remains, you can throw away the entire table Names and simply track the identifiers we put in the field Name_id of the table Sets.
Thus, only two tables remain in our database: the table Values (unchanged) and the table Sets, which now has a composite primary key (Set_id, Name_id) and a foreign key Value_id linking to the appropriate value in the table Values.
New sets (of properties’ names and values)

Here:
Set_id - integer
Name_id - integer
Value_id - integer foreign key to table Values, not null

New values (of properties)
The table Names has disappeared (again, remember Lao Tzu :)), or rather, has become ideal: it is gone, but its function is still completely performed. Again, remember TRIZ.
So now we gain significantly in performance because an entire table was thrown away. Before, it was necessary to scan this whole table to determine the numerical identifier corresponding to a string parameter. Now this operation, which required comparing a large number of strings, is eliminated.
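The curtailed two-table design can be sketched like this. The SQLite schema follows the column definitions above; the `get` helper and the sample rows are illustrative assumptions:

```python
import sqlite3

# Sketch of the final two-table design: Sets holds only integers,
# "Values" holds the unique strings.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE "Values" (Id INTEGER PRIMARY KEY, Value TEXT UNIQUE);
    CREATE TABLE Sets (
        Set_id   INTEGER,
        Name_id  INTEGER,   -- a bare numeric identifier: no Names table at all
        Value_id INTEGER NOT NULL REFERENCES "Values"(Id),
        PRIMARY KEY (Set_id, Name_id)
    );
""")
con.executemany('INSERT INTO "Values" VALUES (?, ?)',
                [(1, "Pán Myšlenek Company"),
                 (2, "TRIZ-RI Group"),
                 (3, "Alexander Sychev")])
con.executemany("INSERT INTO Sets VALUES (?, ?, ?)",
                [(1, 4, 3), (2, 2, 1)])

def get(set_id, name_id):
    """Fetch one property of one set; the search in Sets touches integers only."""
    row = con.execute("""
        SELECT v.Value
        FROM Sets s JOIN "Values" v ON v.Id = s.Value_id
        WHERE s.Set_id = ? AND s.Name_id = ?""",
        (set_id, name_id)).fetchone()
    return row[0] if row else None

print(get(1, 4))  # Alexander Sychev
print(get(2, 2))  # Pán Myšlenek Company
```

All string comparisons are now confined to the table Values; the lookup in Sets compares only integers.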
Let's return to our example and populate our universal and simultaneously specific database:
Final sets

Here:
Set_id - integer
Name_id - integer
Value_id - integer foreign key to table Values, not null

Set_id  Name_id  Value_id
1       4        3
1       3        1
2       2        1
2       1        ...
...     ...      ...
Final values

Id  Value
1   Pán Myšlenek Company
2   TRIZ-RI Group
3   Alexander Sychev
4   Certificate No. 8 ________
5   ...etc.
One could argue that the result has become less readable. For example, we see no human-readable information in the field Name_id of the first table, and we have no idea which entities we are dealing with: we only see their ids (4, 3, 2, 1). The machine understands everything and works fast, but for humans it is not very convenient.
Queries also end up barely readable, for example:

get(2, 1, None)

(Note: here and in the following examples we use PHP as our programming language; identical examples could be given in other languages.)
Contradiction
Once again we have a contradiction:
"The name should be called explicitly for the convenience of the programmer, and should not be called at all, so that the machine does not scan a whole extra table, which is, by the way, filled with string data." What should we do?
We resolve this contradiction by moving the list of names out of the "search area", since that is the zone of the contradiction. The correspondence between each id and its programmer-readable representation can be maintained outside the database, at the program level.
For that purpose, let us create a dictionary (for example, in a header file) containing a list of constants as follows:
document_name = 1
provider_name = 2
agent_name = 3
people_name = 4

After that our queries take a readable form (for example):

get_document(2, document_name, None)

So no extra table, and great performance.
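The program-level dictionary of names can be sketched as follows. The constant ids follow the article; the dict-based store and the sample data are illustrative stand-ins for the real table Sets:

```python
# The "dictionary" of name constants lives in code, not in the database.
document_name = 1
provider_name = 2
agent_name = 3
people_name = 4

# Stand-in for the two remaining tables: {(set_id, name_id): value}.
store = {
    (1, people_name): "Alexander Sychev",
    (2, provider_name): "Pán Myšlenek Company",
}

def get_document(set_id, name_id, default=None):
    """Readable at the call site, yet only integers reach the storage."""
    return store.get((set_id, name_id), default)

print(get_document(2, provider_name))      # Pán Myšlenek Company
print(get_document(2, agent_name, "n/a"))  # n/a
```

The programmer reads `get_document(2, document_name, None)`; the machine only ever compares integers.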
Note about the predecessors
Here we should properly refer to a fairly well-known model of data organization called EAV (Entity - Attribute - Value), which is quite similar to the structure described above but was created much earlier. We point that out with great respect.

At the same time, we want to point out the differences in our implementation. These differences eliminate some problems that are common not to EAV itself, but to the best-known (and not so well made) implementations of this progressive model.

The first difference is described in the previous paragraph: one should not store descriptions "made for humans" in the database. In the last "curtailed" variant we have not 3 but 2 tables: the table which in EAV is called Attribute, and in our case Names, is eliminated and replaced by a list of constants stored in a header file. That gives a significant improvement in performance without any inconvenience for the programmer.

There is another major difference in the described implementation: all values are stored in a separate table consisting of 2 columns, Id and Values. "Values" are thereby separated from "Sets" and not mixed with them, which, firstly, dramatically decreases the size of the database and, secondly, makes it work a lot faster. The performance of this database is discussed below in section 3.

We hope that, considering the facts described in this article (above and below in section 3), all statements about the poor performance of databases with the "key-value" ideology will turn into myths and legends. But we insist on using the exact implementation described in this publication. Let's call it "EAV as Owls".