|
|
|
#1 |
|
Messages: n/a
Hébergeur: |
Guys,
I have been always wondering about this...is it better to have a really wide table or a table with less columns but more rows? Thanks. |
|
|
|
#2 |
|
Messages: n/a
Hébergeur: |
On Mon, 7 Apr 2008 12:02:43 -0700 (PDT), quakey@gmail.com wrote:
> Guys, > > I have been always wondering about this...is it better to have a > really wide table or a table with less columns but more rows? As a vast oversimplification, more rows. If you have many attributes, for a thing, which are not all used, the usual correct way to do it is to have a table with the common attributes, and the additionals in either their own tables joined to the master table, or in a single extended attribute table with joined to the master table. A common extended attribute table might look like: attr_id INT AUTO_INCREMENT -- primary key mstr_id INT -- foreign key to master table attr_type VARCHAR(20) -- type wot attribute is attr_value VARCHAR(250) -- value for the attribute -- Because of the diverse conditions of humans, it happens that some acts are virtuous to some people, as appropriate and suitable to them, while the same acts are immoral for others, as inappropriate to them. -- Saint Thomas Aquinas |
|
|
|
#3 |
|
Messages: n/a
Hébergeur: |
On Mon, 07 Apr 2008 12:02:43 -0700, quakey@gmail.com wrote:
> Guys, > > I have been always wondering about this...is it better to have a really > wide table or a table with less columns but more rows? > > Thanks. I don't know the answer, but I'm pretty sure it depends on circumstances. What kind of queries do you fire often? Do you have a lot of NULL values? -- Remove caps to reply |
|
|
|
#4 |
|
Messages: n/a
Hébergeur: |
quakey@gmail.com wrote:
> Guys, > > I have been always wondering about this...is it better to have a > really wide table or a table with less columns but more rows? > > Thanks. > It's better to have a database which is properly normalized. Once you normalize the database, the question of rows/columns is generally no longer pertinent. -- ================== Remove the "x" from my email address Jerry Stuckle JDS Computer Training Corp. jstucklex@attglobal.net ================== |
|
|
|
#5 |
|
Messages: n/a
Hébergeur: |
I'm with Jerry. Please google "normalization" and understand it to at least
3rd normal form. Simple, but you get one fact in one place, and wide vs. long is meaningless. "Jerry Stuckle" <jstucklex@attglobal.net> wrote in message news:pZmdnTxLxP3xBGfanZ2dnUVZ_judnZ2d@comcast.com. .. > quakey@gmail.com wrote: > > Guys, > > > > I have been always wondering about this...is it better to have a > > really wide table or a table with less columns but more rows? > > > > Thanks. > > > > It's better to have a database which is properly normalized. > > Once you normalize the database, the question of rows/columns is > generally no longer pertinent. > > -- > ================== > Remove the "x" from my email address > Jerry Stuckle > JDS Computer Training Corp. > jstucklex@attglobal.net > ================== > |
|
|
|
#6 |
|
Messages: n/a
Hébergeur: |
I studied normalization in school but I also heard that they are not
practical in some real-world applications, especially when performance is a big factor. Let's say I have 10 millions users with 10 columns of attributes. I can normalize it to be 2 tables with 1 one table sharing common attributes. Since the 2nd table contains optional attributes, it has less than 10 millions rows. For the sake of example, I will say 1st table has 10 millions rows and 2nd table has 5 millions. In that case, is LFET JOIN a viable solution to query the data? Wouldn't it be "faster" and consume less memory if I only had one table with all the columns? Thanks. |
|
|
|
#7 |
|
Messages: n/a
Hébergeur: |
On Mon, 7 Apr 2008 17:42:50 -0700 (PDT), quakey@gmail.com wrote:
> I studied normalization in school but I also heard that they are not > practical in some real-world applications, especially when performance > is a big factor. > > Let's say I have 10 millions users with 10 columns of attributes. I > can normalize it to be 2 tables with 1 one table sharing common > attributes. Since the 2nd table contains optional attributes, it has > less than 10 millions rows. For the sake of example, I will say 1st > table has 10 millions rows and 2nd table has 5 millions. > > In that case, is LFET JOIN a viable solution to query the data? > Wouldn't it be "faster" and consume less memory if I only had one > table with all the columns? Test it. Report back. That's the only real answer for a specific case. For this example, it's probably faster/"better" to have it all in one table. But it's not possible to generalize from that into all cases. Here's a real-world example from my employer. Businesses have addresses. They all have a physcial address. Some of them have mailing addresses that are different from physical addresses, such as post office boxes. So, companies always have one address, but *may* have additional addresses stored in another table. We don't often mail to the business as a whole either, we mail to individuals at that company, and individuals often (but not always) have internal information to specify which building on a large campus or which office in a region they work from. Additionally, some folks like to get promotional offers at different addresses than invoices. (Maybe invoices stay at the office while promo offers go to the house, so they can heat their house with them. I don't know.) Anyway, that means that there's potentially thousands of things like "address" that can be associated with a company, and it makes a good deal of sense to roll all those extra ones (beyond the main physical address) into a long table. Actually, in this case, two idential tables, one which joins to company entities, and one that joins to people. The layouts are the same and they work the same, but the names are different, and obviously the constraints point to the respective company or person table. -- "Friendship is born at that moment when one person says to another, 'What! You too? I thought I was the only one!'" --C.S. Lewis |
|
|
|
#8 |
|
Messages: n/a
Hébergeur: |
quakey@gmail.com wrote:
> I studied normalization in school but I also heard that they are not > practical in some real-world applications, especially when performance > is a big factor. > Normalization is ALWAYS practical. > Let's say I have 10 millions users with 10 columns of attributes. I > can normalize it to be 2 tables with 1 one table sharing common > attributes. Since the 2nd table contains optional attributes, it has > less than 10 millions rows. For the sake of example, I will say 1st > table has 10 millions rows and 2nd table has 5 millions. > > In that case, is LFET JOIN a viable solution to query the data? > Wouldn't it be "faster" and consume less memory if I only had one > table with all the columns? > > Thanks. > Maybe. Will all 10 columns have values all the time? Will you *never* need an 11th column? But the whole point is meaningless. You are doing what's known as "premature optimization". You're trying to solve a problem you don't - and may never - have. Design the database properly. If you have performance problems find the source of the problem. -- ================== Remove the "x" from my email address Jerry Stuckle JDS Computer Training Corp. jstucklex@attglobal.net ================== |
|
|
|
#9 |
|
Messages: n/a
Hébergeur: |
Thank you all for the feedback so far. I really appreciate it.
Jerry, I do currently have a table like I described. Actually it has way more than 10 columns...try 130+ columns. It is a huge table with 10+ millions records and 130+ columns wide. It is a leads/marketing table that contains users and each column describes an "event". We run queries to extract a subset of users with specific "events" defined. For a simple hypothetical example, if we like to find out all the users who have downloaded a specific product of ours within last 30 days but haven't purchased, with one table, we only need to run one query without any JOIN to get the list. I know it is a terrible table structure but I don't know if in our case normalization is practical. And yes, we do need to add more columns if we need more "events". From the perspective of a programmer, this kind of table is easy to write queries against. However, as an admin myself as well, I find the size of the table (1.8gb) unmaintainable (since any operation performed on the table takes eons to finish). There are two other ways to do what I need: 1) As Peter suggested, split the columns into different tables. However, I can see that most of the time we would pretty much join every tables together so I am not sure if that buys us anything than doing it all in one table besides complicating the queries. 2) Make each "event" a row. However, with 10 millions+ users and all 130+ events, we could be talking about a table with 10 millions x 130 rows! I hope you guys understand what I am trying to convey and provide more input. Thanks! |
|
|
|
#10 |
|
Messages: n/a
Hébergeur: |
quakey@gmail.com wrote:
> I studied normalization in school but I also heard that they are not > practical in some real-world applications, especially when performance > is a big factor. > > Let's say I have 10 millions users with 10 columns of attributes. I > can normalize it to be 2 tables with 1 one table sharing common > attributes. Since the 2nd table contains optional attributes, it has > less than 10 millions rows. For the sake of example, I will say 1st > table has 10 millions rows and 2nd table has 5 millions. > > In that case, is LFET JOIN a viable solution to query the data? > Wouldn't it be "faster" and consume less memory if I only had one > table with all the columns? > > Thanks. I was taught to normalize first, and then to carefully denormalize for performance adding triggers and anything else needed to maintain integrity. Physical design such as adding appropriate covering? indexes and distributing data across spindles/servers can be important. -- Chad Hanna Systems Developer FamilyHistoryOnline www.familyhistoryonline.net FreeBSD Apache MySQL Perl mod_perl PHP |
|
|
|
#11 |
|
Messages: n/a
Hébergeur: |
quakey@gmail.com wrote:
> > I know it is a terrible table structure but I don't know if in our > case normalization is practical. And yes, we do need to add more > columns if we need more "events". > > From the perspective of a programmer, this kind of table is easy to > write queries against. However, as an admin myself as well, I find the > size of the table (1.8gb) unmaintainable (since any operation > performed on the table takes eons to finish). > I don't know how well MySQL handles tables at 1.8gb but I can tell you that in the Oracle world this is a very small table indeed. The reason that your queries take eons to run is that you almost certainly cannot use indexing effeciently and you will be processing the entire table for each query. I would probably set up a "Customer" table, an "Events" table and a "Customer/Event" table with one row per customer/event. With proper indexing your queries will fly. Regards |
|
|
|
#12 |
|
Messages: n/a
Hébergeur: |
On Apr 8, 10:56 am, "qua...@gmail.com" <qua...@gmail.com> wrote:
> Thank you all for the feedback so far. I really appreciate it. > > Jerry, I do currently have a table like I described. Actually it has > way more than 10 columns...try 130+ columns. It is a huge table with > 10+ millions records and 130+ columns wide. > > It is a leads/marketing table that contains users and each column > describes an "event". We run queries to extract a subset of users with > specific "events" defined. For a simple hypothetical example, if we > like to find out all the users who have downloaded a specific product > of ours within last 30 days but haven't purchased, with one table, we > only need to run one query without any JOIN to get the list. > [snip] Maybe it would be a good idea to investigate a datawarehouse solution for reporting purposes. I assume that the mentioned table is an operative table (changes every day). Best wishes John |
|
|
|
#13 |
|
Messages: n/a
Hébergeur: |
quakey@gmail.com wrote:
> Thank you all for the feedback so far. I really appreciate it. > > Jerry, I do currently have a table like I described. Actually it has > way more than 10 columns...try 130+ columns. It is a huge table with > 10+ millions records and 130+ columns wide. > > It is a leads/marketing table that contains users and each column > describes an "event". We run queries to extract a subset of users with > specific "events" defined. For a simple hypothetical example, if we > like to find out all the users who have downloaded a specific product > of ours within last 30 days but haven't purchased, with one table, we > only need to run one query without any JOIN to get the list. > > I know it is a terrible table structure but I don't know if in our > case normalization is practical. And yes, we do need to add more > columns if we need more "events". > > From the perspective of a programmer, this kind of table is easy to > write queries against. However, as an admin myself as well, I find the > size of the table (1.8gb) unmaintainable (since any operation > performed on the table takes eons to finish). > > There are two other ways to do what I need: > 1) As Peter suggested, split the columns into different tables. > However, I can see that most of the time we would pretty much join > every tables together so I am not sure if that buys us anything than > doing it all in one table besides complicating the queries. > 2) Make each "event" a row. However, with 10 millions+ users and all > 130+ events, we could be talking about a table with 10 millions x 130 > rows! > > I hope you guys understand what I am trying to convey and provide more > input. > > Thanks! > > > This is a perfect reason for normalization. You need to normalize the table. 1.8GB is not a huge table. And any time you have to change the structure of the table for different data, you have a problem. Make each event a row. That isn't as many rows as you think. Databases easily will handle that many. And it's not going to be as big as you think. I doubt all 10M rows currently use all 130+ events. -- ================== Remove the "x" from my email address Jerry Stuckle JDS Computer Training Corp. jstucklex@attglobal.net ================== |
|
|
|
#14 |
|
Messages: n/a
Hébergeur: |
Jerry Stuckle schrieb:
> quakey@gmail.com wrote: >> I studied normalization in school but I also heard that they are not >> practical in some real-world applications, especially when performance >> is a big factor. >> > > Normalization is ALWAYS practical. Id Name PostalCode City ---------------------------- 1 Thomas 77694 Kehl 2 Foo 12345 Bar This table is not in 3rd normal form as City depends on the non-key attribute PostalCode and not on Id. Theoretically, one should remove the City column and make PostalCode a foreign key which links to a second table containing PostalCode and City. But I don't think this is really necessary in a real-world application. Greetings, Thomas -- Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison! (Coluche) |
|
|
|
#15 |
|
Messages: n/a
Hébergeur: |
Thomas Mlynarczyk wrote:
> Jerry Stuckle schrieb: > >> quakey@gmail.com wrote: >>> I studied normalization in school but I also heard that they are not >>> practical in some real-world applications, especially when performance >>> is a big factor. >>> >> >> Normalization is ALWAYS practical. > > Id Name PostalCode City > ---------------------------- > 1 Thomas 77694 Kehl > 2 Foo 12345 Bar > > This table is not in 3rd normal form as City depends on the non-key > attribute PostalCode and not on Id. Theoretically, one should remove the > City column and make PostalCode a foreign key which links to a second > table containing PostalCode and City. But I don't think this is really > necessary in a real-world application. > > Greetings, > Thomas > It depends. If you have a lot of addresses in 77694 or 12345, it would make sense to put it in a different table. But another advantage of doing it that way - let's say Thomas was the only entry in 77694, and you delete his entry. You have now lost the link between 77694 and Kehl, and if you enter another row with the same PostalCode, you need to enter the city again. If you have them in a separate table, it's still there. In fact, it's not at all uncommon to have a table with PostalCodes and Cities for lookup purposes. -- ================== Remove the "x" from my email address Jerry Stuckle JDS Computer Training Corp. jstucklex@attglobal.net ================== |
|
|
|
#16 |
|
Messages: n/a
Hébergeur: |
Jerry Stuckle:
> Thomas Mlynarczyk wrote: > > This table is not in 3rd normal form as City depends on the non-key > > attribute PostalCode and not on Id. Theoretically, one should remove the > > City column and make PostalCode a foreign key which links to a second > > table containing PostalCode and City. But I don't think this is really > > necessary in a real-world application. > In fact, it's not at all uncommon to have a table with PostalCodes and > Cities for lookup purposes. ....and then hope that the postal code system isn't changed nationally! (In case you don't believe something like that happens, then watch countries merge or split up. See the Balkan. See Germany. Oh, and keep an eye on Belgium!) |
|
|
|
#17 |
|
Messages: n/a
Hébergeur: |
On Apr 8, 12:56 am, "qua...@gmail.com" <qua...@gmail.com> wrote:
> 2) Make each "event" a row. However, with 10 millions+ users and all > 130+ events, we could be talking about a table with 10 millions x 130 > rows! > I'll bet if you looked closely at the data, you'd find that more than 90% of the users have populated less than 10% of the fields. So it would be a LOT less than 10 million x 130. |
|
|
|
#18 |
|
Messages: n/a
Hébergeur: |
Erick T. Barkhuis schrieb:
>> In fact, it's not at all uncommon to have a table with PostalCodes and >> Cities for lookup purposes. > > ...and then hope that the postal code system isn't changed nationally! Table1: Id, Name, PostalCodeId Table2: PostalCodeId, PostalCode, City But then Table2 is not in 3NF... However, if the postal code system is changed, only Table2 needs to be updated. Greetings, Thomas -- Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison! (Coluche) |
|
|
|
#19 |
|
Messages: n/a
Hébergeur: |
Jerry Stuckle schrieb:
>> Id Name PostalCode City >> ---------------------------- >> 1 Thomas 77694 Kehl >> 2 Foo 12345 Bar [Second table to comply with 3NF?] > It depends. If you have a lot of addresses in 77694 or 12345, it would > make sense to put it in a different table. It would make the first table shorter (column-wise), but I would need an additional JOIN to retrieve the full address - is this a question of speed vs. space, the benefits of 3NF aside? Greetings, Thomas -- Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison! (Coluche) |
|
|
|
#20 |
|
Messages: n/a
Hébergeur: |
Erick T. Barkhuis wrote:
> Jerry Stuckle: >> Thomas Mlynarczyk wrote: > >>> This table is not in 3rd normal form as City depends on the non-key >>> attribute PostalCode and not on Id. Theoretically, one should remove the >>> City column and make PostalCode a foreign key which links to a second >>> table containing PostalCode and City. But I don't think this is really >>> necessary in a real-world application. > >> In fact, it's not at all uncommon to have a table with PostalCodes and >> Cities for lookup purposes. > > ...and then hope that the postal code system isn't changed nationally! > > (In case you don't believe something like that happens, then watch > countries merge or split up. See the Balkan. See Germany. Oh, and keep an > eye on Belgium!) > When it does, you, of course, need to update the postal codes. But you have to do that either way. -- ================== Remove the "x" from my email address Jerry Stuckle JDS Computer Training Corp. jstucklex@attglobal.net ================== |
|
|
|
#21 |
|
Messages: n/a
Hébergeur: |
Thomas Mlynarczyk wrote:
> Jerry Stuckle schrieb: > >>> Id Name PostalCode City >>> ---------------------------- >>> 1 Thomas 77694 Kehl >>> 2 Foo 12345 Bar > > [Second table to comply with 3NF?] > >> It depends. If you have a lot of addresses in 77694 or 12345, it >> would make sense to put it in a different table. > > It would make the first table shorter (column-wise), but I would need an > additional JOIN to retrieve the full address - is this a question of > speed vs. space, the benefits of 3NF aside? > > Greetings, > Thomas > Yes, it requires an extra join. But it also means less duplication of data and a smaller database. It also means the first rule of design - DON'T PREMATURELY OPTIMIZE. Design correctly. If you have a performance problem, fix the problem. That MAY require denormalizing a table or two. But more often than not it means fine tuning the RDBMS parameters and possibly adding indexes where necessary. -- ================== Remove the "x" from my email address Jerry Stuckle JDS Computer Training Corp. jstucklex@attglobal.net ================== |
|
|
|
#22 |
|
Messages: n/a
Hébergeur: |
>>> Normalization is ALWAYS practical.
>> >> Id Name PostalCode City >> ---------------------------- >> 1 Thomas 77694 Kehl >> 2 Foo 12345 Bar >> >> This table is not in 3rd normal form as City depends on the non-key >> attribute PostalCode and not on Id. Are you SURE of this? Do you know this to be true of all countries with postal codes, now and in the future? What prohibits a country from ruling that the postal code boundaries and the city boundaries may not coincide for a distance larger than one centimeter, except along national boundaries? (This might be done because city boundaries are in dispute, but GPS readings aren't, and it saves on bloodshed.) Thus, the relationship between City and Postal Code is many-to-many. I seem to recall an odd case where 3 "cities" (actually near-ghost-towns) shared a single (5-digit AND 9-digit) ZIP code and a single block of a single street in the USA (Texas or Arizona, I think). They had a combined population of about 5. Further, the boundary lines between the cities ran through the middle of the bed of two of the residents (someone merged two of the apartments into one), and the building they were in was in all 3 cities. The population may have died out by now. I also recall at one time a large political fight over a car-pooling database. People wanted their addresses to be stated on the list AS THEY WROTE THEM DOWN, not edited by computer. You had a lot of married couples, one living in Fort Worth, Texas 76126 and another living in Benbrook, Texas 76126, with identical street addresses. I think some of this had to do with the Post Office actually opening a branch in Benbrook, and there were some strong feelings about Benbrook's appearance on the postal map. The alternative to allowing both was to shut the whole thing down as there was so much fighting over it and so many people threatened to pull out, even a lot of people not in that zip code. Since the car-pool-matching algorithm used the 5-digit ZIP code in any case, it would still pair husband and wife into the same car-pool regardless of the printed city. Now, if you REALLY want political fights over addressing, consider some of the fights Microsoft got involved in over about four pixels of disputed territory between India and Pakistan on a map in one of its educational products, arguments over whether Taiwan is a separate country (hey, PRC, you can either claim that the USA accidentally sent nuclear fuses to the territory of one of its enemies, or you can claim that Taiwan is your territory, but not both at the same time), and the existence of a country called "Palestine". |
|
|
|
#23 |
|
Messages: n/a
Hébergeur: |
I should clarify...when I said it takes eons, it's the not the queries
themselves, but whenever any kind of operations that alter the table structure (adding new columns, indexes, etc). What you proposed is a many-to-many relationship and I think is nice. Jerry thinks I should put each event on its own row, so now I have two possible solutions and I wonder which one will be better...hmmm. On Apr 8, 3:24am, Gilbert <bugger....@no.spam.com> wrote: > I don't know how well MySQL handles tables at 1.8gb but I can tell you that > in the Oracle world this is a very small table indeed. The reason that your > queries take eons to run is that you almost certainly cannot use indexing > effeciently and you will be processing the entire table for each query. > > I would probably set up a "Customer" table, an "Events" table and > a "Customer/Event" table with one row per customer/event. With proper > indexing your queries will fly. > > Regards |
|
|
|
#24 |
|
Messages: n/a
Hébergeur: |
On Tue, 8 Apr 2008 18:30:46 -0700 (PDT), quakey@gmail.com wrote:
> I should clarify...when I said it takes eons, it's the not the queries > themselves, but whenever any kind of operations that alter the table > structure (adding new columns, indexes, etc). > > What you proposed is a many-to-many relationship and I think is nice. > > Jerry thinks I should put each event on its own row, so now I have two > possible solutions and I wonder which one will be better...hmmm. At the general case, they're actually the same solution. Jerry's solution just basically omits the event detail table, leaving you with just the person table and the corollation table with only the person and event identifiers, but no event detail. -- Windows is a pane in the ass.... |
|
|
|
#25 |
|
Messages: n/a
Hébergeur: |
Peter H. Coffin wrote:
> On Tue, 8 Apr 2008 18:30:46 -0700 (PDT), quakey@gmail.com wrote: >> I should clarify...when I said it takes eons, it's the not the queries >> themselves, but whenever any kind of operations that alter the table >> structure (adding new columns, indexes, etc). >> >> What you proposed is a many-to-many relationship and I think is nice. >> >> Jerry thinks I should put each event on its own row, so now I have two >> possible solutions and I wonder which one will be better...hmmm. > > At the general case, they're actually the same solution. Jerry's > solution just basically omits the event detail table, leaving you with > just the person table and the corollation table with only the person and > event identifiers, but no event detail. > Not necessarily - it just wasn't specified one way or the other. It looks like currently there is no event detail; rather just a column referencing the event. If all they need is the event id, then they don't need a third table. However, if they need the detail (highly recommended), then there should be a third table containing the event id and the description. The second table then just becomes a many-to-many link between the original (first) and detail tables. -- ================== Remove the "x" from my email address Jerry Stuckle JDS Computer Training Corp. jstucklex@attglobal.net ================== |
|
![]() |
| Outils de la discussion | |
|
|