Duplication Refactoring Threshold

Once And Only Once is often not practical. Sometimes the duplication is purely coincidental, for example, and factoring may just be a waste of time if the two things drift apart again. Thus in practice there is a repeat quantity limit that triggers one to seek duplication fixes. Every developer or shop has some level (formal or informal), it just varies widely.

For example, a Duplication Refactoring Threshold of 4 means that you wait until you see a concept duplicated 4 times before you bother to refactor it. Even the worst of shops will start to factor if they see large chunks of code repeated say 100 times. One would say they have a high refactoring threshold.

See Three Strikes And You Refactor.

Strikes me as exactly the same issue as repeated values in a relational table not necessarily violating normal forms. It's the functional dependencies that matter.

There is a long debate on the topic of table design and repetition somewhere around here. I'll link to it when I find it. Note that the term "repetition" is also subject to some degree of Laynes Law.

In other words, if the pieces of code can be changed to do something different with out of itself being incorrect, then it isn't duplication.

This doesn't by itself handle the cases like "this window here does the same thing for foos as this other one does for bars". But then again, that's what Do The Simplest Thing That Could Possibly Work is for: if it is possible to use the same window for both, then that is (at least a candidate for) the simplest thing.

Then again, perhaps 'functional dependency' would include that example as a case of 'doing the same thing', in the sense of 'the meaning of the one operation is tied to the meaning of the other'. Dunno.

-- William Underwood

Re: "Strikes me as exactly the same issue as repeated values in a relational table not necessarily violating normal forms. It's the functional dependencies that matter."

You mean like 500 people in a People table having the same favorite color of "blue"?

[kinda]

I don't see any way around that. They can all point to "blue" instead of store it, but that might save a few bytes at best. It is almost like saying all the "print" or "output" statements in a large program is duplication. Other than shortening the function or method name, it is hard to factor more compactly. Some "things" are simply popular and are used/referenced a lot. It is generally considered only a violation of Once And Only Once if it is larger or more complex.

That's why I said kinda and not yes or no. And we're not talking about compactness.

'Print' is a perfect example of what not to factor, because anything which depends on the fact that we're printing is factored into the print method. Likewise, if there's anything which is known to be common between people with the favourite colour of 'blue', we'd put that in a colour table, and not duplicate that knowledge for every person.

--cwillu

A "print" statement is kind of like a foreign key to the "function list". You cannot really factor out a foreign key. Factorable duplication in code usually comes from multiple sets with multiple items where most items of the sets stay the same between each occurance. (Damned tabbed-up colons drived me nuts, so I used bullets instead.)

I am not sure that is always a good idea. I have seen schemas where people over-group stuff that grows ungrouped over time due to policy changes, and then one is left with excessive joins. In short, be careful.

In other words, they have a too-low Duplication Refactoring Threshold. Once And Only Once is a beautiful thing when it's done properly, but you need to have a sufficiently high threshold to know how to do it properly in any given case.

Print is a perfect example of what not to factor, because anything which depends on the fact that we're printing is factored into the print method. (cwillu)

Why do the log kits (like log4j) exist then? I believe that they are refactoring the "print" logic. I feel like the choice where to refactor or not is very difficult. I think that is another sign that Programming Is Art. (feel free to refactor, move, delete, etc. But please keep the meaning, Aureliano Calvo).

Logging is a rather specialized form of printing. Unless your application is dedicated to be run as a one-shot command-line tool, you probably don't want your log output spewed to stdout or wherever print outputs to by default. You want to be able to choose what KIND of data you want logged, and WHERE you want it logged. In my own experience, when I started writing if(verboseLevel > 3) output(x) I realized it was time to get a smarter logging system. (ahigerd)

__________________

If I see two methods, each 200 lines long. Three strikes? No, I extract one method. :-)

The chances of both being identical in usage is small in my observation. Managing the differences can be no small chore. Related: Variations Tend Toward Cartesian Product. If multiple things are "kind of the same" but have nitty little differences, which is very common in my domain, perhaps it's time to split up the parts that are the same from the parts that are not. In other words, change the granularity of your refactoring. I often get better reuse when I refactor "in the small".

Category Refactoring

See original on c2.com