On Mon, 20 Oct 2008 18:10:49 -0700 (PDT), "Steven G. Johnson"
<stevenj@[EMAIL PROTECTED]
> wrote:
>On Oct 20, 5:21 pm, Eric Jacobsen <eric.jacob...@[EMAIL PROTECTED]
> wrote:
>> Why is it so im****tant to you to make FHTs look inferior to FFTs?
>
>Straw man.
I thought it was a legitimate question, as was the one regarding
whether you think there's never any possibility of an advantage with
an FHT. If you don't think so then I think there's no disagreement.
If you do think so, that's fine, too, we just disagree.
>I never said FHTs were inferior. I said that the two are nearly
>identical in every measurable respect, so far as is currently known---
>no one has demonstrated any significant intrinsic advantage one way or
>the other in terms of performance or memory utilization.
>
>You, on the other hand, claimed "substantial" benefits in
>"computational resources" for the FHT in some cir***stances for
>unexplained reasons, including "substantial" memory savings.
>
>> have the other as well. And either sin() or cos() can be folded so
>> that much less than a full wave needs to be stored, and you can do
>> that with cas() as well.
>
>No you can't do it with cas as well. That was my point: if you look
>at the trig table of sin and cos values, they have twice as much
>redundancy as a table of cas values (which has less symmetry), and
>because of that the number of trigonometric constants is essentially
>the same for an FHT or FFT.
cas(x) = sqrt(2)*cos(x-pi/4)
The cas() function has identical symmetry to cos() with an offset of
pi/4. For a LUT that has a power-of-two length that offset is fairly
trivial to work around as far as folding the LUT is concerned.
According to your argument (which I don't quite follow because it
doesn't work out), that suggests (I think) that less storage is
required for cas() compared to cos() and sin()? I don't agree,
despite the fact that cas() can be folded the same as sin() and cos().
I think the amount of storage is the same, but since FHT twiddles are
always real-valued the accesses to that memory can be managed
differently.
If one needs simultaneous access to cos() and sin() from a
non-dual-****t LUT for one's complex FFT twiddle pipeline, then it
seems to me that an FHT would have a memory advantage in that only a
single LUT would be required to feed a similar but real-valued
butterfly. Clearly there are ways to work around that from almost any
perspective, but it's what I had in mind when I said I thought there
could be a twiddle memory advantage with an FHT in certain
cir***stances.
>> Do you really think that there's never a difference?
>
>I think that no one has demonstrated any substantial difference; the
>only clear differences that have been demonstrated slightly favor FFTs
>(but only slightly....a few flops saved is irrelevant in practice).
>
>You were claiming "substantial" benefits in performance in memory,
What I said was, "In my experience the real advantage of the FHT was
the memory requirement, which can be substantially smaller than an FFT
depending on the constraints and what you're doing. Since memory is
seldom a constraint these days it's pretty moot."
This was just before I explained in my next post, "Some of my FHT
experience was on my Amiga in grad school when I was processing audio
sensor data (i.e., real-valued), and coded the whole thing in 68k
assembly with LUTs for the cas() function. I also had a hand-coded
68k-assembly FFT with LUT twiddles, and just the buffer handling
alone, with extra car paid to handling the input buffer and forming
the ouput magnitude vector provided a substantial advantage with the
FHT."
Note the operative use of past-tense, since the experience I was
describing was in about 1988, which was pretty contem****ary with the
early "Look how great the FHT is!" papers, as well as Sorensen's RFFT
work. I was reading pretty much everything that came out on the
topic at the time, and, as I also stated previously, some of the
claims then of the nearly 2x superiority of the FHT were clearly
incorrect as far as I could tell (and, as you've repeatedly pointed
out, I was right about that). So you don't need to tell me about that
as I've known it from first-hand experience for twenty years.
This afternoon I dug out my June, 1987 Trans. on ASSP and discovered
that I'd left a fair number of hand-written notes around Sorensen's
RFFT FORTRAN code from when I translated (and debugged) it to C and
68k assembly for my thesis.
In my experience, at that time, in my application, on my Amiga, with
Sorensen's translated RFFT and my hand-coded FHT, I was able to get my
algorithms working for my thesis using less memory (and about the same
computation load) with the FHT than with the RFFT. I don't have
access to that code now without digging the Amiga out of storage and
cranking it up again and seeing if everything still works well enough
to read the files. I wouldn't hold my breath that I could recover
any of it, but I'm also not motivated enough to go do it in the first
place.
So, that was my basis for my claim of "substantial" improvement with
the FHT. I used the past-tense because, in fact, it was a hell of a
long time ago. It's very possible that improvements have been made on
the RFFT that could have changed that outcome, but I don't know for
certain so I didn't try to claim it was still true. Sundararajan's
1997 paper (which, also, oddly has a post-it bookmark in my Trans on
SP and a bunch of notes, so I must have read it when it came out)
might have helped, but that was nearly ten years later so I didn't
compare. By that time we were already in the realm where memory and
processing power was cheap enough that the small differences were no
longer enough to matter, so I was just plugging in canned IP and
routines at that point.
>and
>I was pointing out that, over the last 20 years, despite strenuous
>efforts, there is zero evidence to sup****t your claim in any known
>cir***stances,
It worked for me twenty years ago. That was evidence enough for me
to talk about it in the past tense.
> and plenty of evidence that the intrinsic differences
>between the algorithms are negligible (and swamped by differences in
>skill of implementors).
The implementations have been what I've been talking about all along.
>
>> FWIW, I've found several (recent) references where people are
>> currently building generalized transform hardware based on FHT engines
>> (kind of like Mayer did with his software). Since a generalized
>> compression engine may need to do FFTs, FCTs, and FSTs, there are
>> evidently advantages to just building a single FHT to do all of those
>> tasks. Here's one reference:
>>
>>
http://ieeexplore.ieee.org/Xplore/defdeny.jsp?url=/iel5/4061132/40611...
>
>(You can do all of those functions using a real-data FFT, too.)
>
>That paper doesn't actually try doing a real-data FFT and comparing it
>to an FHT. In fact, the papers it cites as explanation for why they
>chose to do an FHT are actually papers explaining just the opposite,
>that real-data FFTs are just as efficient as FHTs and have similar
>structure.
Looking at it now it's not clear what he compared. In section two he
talks very clearly about how the structure of the RFFT is more
complicated than the FHT (for his purposes)* and so that influenced
his direction. Since the context of his comparisons are for
compression of real-valued sequences I took all of that to mean that
his comparisons were against RFFTs, but looking at it now it's not
clear. He's explicit about comparing against a 1024-pt FFT, but
doesn't say how long any of the other transforms were. Hmmm...I
agree, on second look there's not much concrete there, but I do still
offer it as recent reference of someone who, for their particular
problem, claims to have found an advantage with the FHT. There's
just not much info to go with it.
* "Structure" for developing a hardware architecture is often not
equal to signal flow "structure".
>DHTs certainly have conceptual attractions. And the fact that the
>same routine is its own inverse is sometimes convenient (the one
>concrete reason cited by the paper authors for their choice).
That's the most blatant and obvious case where an advantage could be
expected with an FHT. If you need to do a forward and inverse
transform in the same hardware, there's reason to believe some
substantial advantages could be realized using an FHT over an RFFT.
The OP's case wouldn't benefit from that, and the compression cases
I've been seeing where people are claiming that an FHT helps generally
don't have a compressor and decompressor co-located. So in those
applications there may not be such a distinct expectation for an
advantage. Given all that I hadn't mentioned it since it didn't seem
to fit the applications at hand, despite the cited conference paper's
author's note of it.
>If you had said, "I find the DHT conceptually nicer, and the fact that
>it is its own inverse leads to some savings of code/gates in some
>applications," I wouldn't have any objection.
Oh, so it's the details of how I framed the argument that mattered. I
see. ;)
>What bothered me is
>your claiming "substantial" benefits in performance and that the
>"memory requirement is substantially smaller", both of which claims
>are contradicted by 20 years of work.
Which I didn't have access to twenty years ago, unfortunately. So my
past-tense claims of what happened in the past when I couldn't have
taken advantage of what were then future developments will, sadly,
have to remain unaltered. This means you will have to remain
bothered, I'm afraid.
>> For the constraints and optimizations in this case (and several are
>> actually shown), the author found using an FHT as the basis
>> advantageous.
>
>None are shown; are we reading the same 2-page conference paper? The
>timing numbers they re****t for FFT etc are worse than those for FHT,
>but that's because they implemented the other transforms using an FHT
>(hence with additional overhead). They didn't actually try comparing
>to any intrinsic real-data FFT algorithm, so that paper really isn't
>evidence one way or the other.
I don't think that's the case. Look more closely at the results
table. If the FFT under comparison were based on the FHT under
comparison, it would be at least the same size as the FHT. It is
not. It is significantly smaller. This suggests that it was not an
FHT-based FFT under comparison. Whether or not it was an RFFT is not
at all clear.
There is also a combined FHT+FFT case, which *may* have been built
around an FHT core, but, again, it's not clear.
In any case, this particular author, for this particular solution,
using whatever his methodology might have been, found an advantage
with the FHT for metrics that mattered to him (compression ratio in
particular, which, again, isn't well defined). It can be seen quite
readily, though, that the latency and power consumption were
significantly less for the FHT shown than the FFT shown, FWTW.
I merely offered it as an existence proof of a recent published case
where somebody found an advantage of an FHT over (what I thought was a
real-input) FFT. It's not clear now whether it was a real-input FFT
or not so who knows.
>And what, precisely, do you see as the substantial "performance" and
>"memory" advantages of an FHT in this cir***stance?
The advantage is having two transforms with some differences in
internal operation that can be traded off in the presence of
particular constraints in implementation details. How that tradeoff
comes out depends entirely on the fine details. If it were fifteen
years ago when I was a lot more on top of this stuff I could probably
cite some combinations of memory architecture, algorithm (DIT, DFT,
whatever), arithmetic availability or whatever that made a difference.
Today it's been too long ago and those brain cells are reluctant to
put down their well-deserved beer and give it a second look. I don't
blame them. Sue me.
>> Can you begin to see that each of these cases is quite different and
>> optimized solutions for each might look quite different from each
>> other?
>
>Of course they are different, but you still have yet to (correctly)
>explain a clear concrete advantage for FHTs in performance or memory
>in *any* cir***stance.
My Amiga twenty years ago doesn't count? Self-inverse doesn't count?
Phooey.
>> If memory is tight and only chunks of
>> certain sizes are available (and one is limited in how much area can
>> be used for memory), then damn straight tootin' an FHT can have an
>> advantage.
>
>Why do you think that an FHT requires smaller chunks of data than a
>real-data FFT?
Not necessarily smaller chunks, differently organized chunks. If you
want to re-use the same memory block for the input buffer, in-place
processing, and output buffer, for an FHT that's not that big of a
deal. For an RFFT (as I understand them), the addressing will
require more effort at some point or other (perhaps only at the
output) and perhaps some additional latency due to having to make
certain accesses wait. Whether or not the memory is multi-****t may
affect that. If one has different buffers for input, processing, and
output, then they can be organized differently (e.g., Nx1 input,
(N/2)x2 output) to help alleviate those complications. Whether or not
memory is available at N/2 length, or whether or not multi-****t memory
is available, would probably matter. If one only has a single
length-N memory, especially single-****t, that may well favor an FHT
over an RFFT. I couldn't say for certain without going through the
exercise, and as much as I'm enjoying the dialogue I'm just not that
motivated to do any actual work about it.
>> address generation can matter. If pipelining is required for repeated
>> transforms and multipliers are at a premium, then an FHT may have an
>> advantage since the first two stages don't require any multiplication.
>
>Clearly false. Real-data FFTs have been proven to require no more
>multiplications than FHTs, and fewer additions (at least, for all
>currently known algorithms).
It's not the quantity of multiplication operations I was getting at,
it's the availability for pipelining, i.e., how many discrete
multipliers would be needed in the engine to keep all stages of an
n-stage transformer working for repeated transforms. Since the FHT
does not require multipliers in the first two stages, none would be
needed to be instantiated for those pipelines. That could be an
advantage if only a certain number of multipliers were available (as
is the case with many FPGAs). Andrew has since indicated that he's
used FFTs that did not require multiplies in the first two stages (he
didn't clarify whether that included RFFTs, so I don't know), so there
may not be an advantage there, anyway.
>> I can go on and on, but I hope you're starting to get the drift.
>
>Nope, sorry, I'd like to see a concrete advantage explained in terms
>of the differences between the transforms that you claim are so
>significant. Not vague handwaving that "well, the two transforms
>aren't EXACTLY identical, so there must be SOME cir***stance where
>there is a substantial advantage for one or the other."
I can only offer what I've mentioned consistent with my experience. I
don't really care whether that satisfies you or not, I merely offer it
for consideration as a view from someone not entirely ignorant of the
issues. Like all things on teh intertubes, all are free to take it as
they will (or not, whatever).
>> almost always the way to go. Evidently the difference between you
>> and me is that I recognize that there are still those odd corner cases
>> once in a while where the tradeoffs can go the other way.
>
>I merely think one should have clear evidence before blithely
>contradicting 20 years of study in a field. An awareness of past work
>is also good to have.
>
>Regards,
>Steven G. Johnson
I merely think one should look beyond academic papers and software
library comparisons before concluding that all implementers everywhere
under all constraints will be bound to a certain outcome. I have
difficulty imagining why one would think that was how things work.
And I have a good awareness of the past work as much of it was
contem****ary to when I was actively doing this exact stuff. I don't
have much awareness of more recent work (say, much past Sundararajan's
1997 paper, although I seem to recall having read a few tidbits here
and there since then). I've already said why many times.
Eric Jacobsen
Minister of Algorithms
Abineau Communications
http://www.ericjacobsen.org
Blog: http://www.dsprelated.com/blogs-1/hf/Eric_Jacobsen.php


|