On Oct 21, 3:32=A0am, Eric Jacobsen <eric.jacob...@[EMAIL PROTECTED]
> wrote:
> The cas() function has identical symmetry to cos() with an offset of
> pi/4. =A0For a LUT that has a power-of-two length that offset is fairly
> trivial to work around as far as folding the LUT is concerned.
Except that between the cos and sin tables, for power-of-two sizes,
there is a lot of addtitional redundancy because cos(x) =3D sin(pi/2 -
x).
So, for example, for n=3D16, if you look at the cos and sin tables,
there are three distinct constants (not counting trivial constants
like 1 and 0, and sign changes, all of which can be optimized out).
For the cas table, there are four distinct nontrivial constants.
In general, you can show that the number of nontrivial constants is
the same, differing by at most one in favor of FFTs, just as I
claimed.
(For odd lengths, the pi/4 ****ft breaks the symmetry once you
discretize, and the cas table has no reflection symmetry at all,
unlike cos and sin, although in that case the cos and sin tables must
be distinct.)
> required for cas() compared to cos() and sin()? =A0 I don't agree,
> despite the fact that cas() can be folded the same as sin() and cos().
> I think the amount of storage is the same, but since FHT twiddles are
> always real-valued the accesses to that memory can be managed
> differently.
I'm glad you agree that the storage is the same; you had earlier
claimed that there was a "memory saving is in the twidlle factors",
which on its face sounded like you were claiming lower storage.
Regarding the twiddles being "real valued" and hence yielding some
im****tant difference in the access pattern, you are mistaken. If you
actually look at the structure of FHT algorithms, they still use pairs
of constants at a time. In fact, if you look at Bracewell's original
FHT algorithm and most subsequent FHT algorithms, they don't use the
cas function at all -- the twiddle factors are actually pairs of (cos,
sin) values, exactly as for an FFT.
> Note the operative use of past-tense, since the experience I was
> describing was in about 1988, which was pretty contem****ary with the
> early "Look how great the FHT is!" papers, as well as Sorensen's RFFT
First, you have used present and future tense many times in this
thread when describing the FHTs supposed substantial advantages in
performance and memory requirements.
Second, the fact that you cite your own 20-year-old results in sup****t
of your claim, from a your admittedly incomplete understanding at the
time of the differences (or lack thereof) between FHTs and real-input
FFTs, does not make your claims any more correct. It just means that
you have a better excuse for being wrong.
> In my experience, at that time, in my application, on my Amiga, with
> Sorensen's translated RFFT and my hand-coded FHT, I was able to get my
> algorithms working for my thesis using less memory (and about the same
> computation load) with the FHT than with the RFFT. =A0
Then you implemented the algorithms imperfectly. Both can operate in-
place, and both require the same size twiddle tables. Unless you are
talking about instruction memory, not data memory, which is negligible
here unless your dataset is very small, but if code size is the major
issue then there are other things you can do; see below.
> I don't think that's the case. =A0 Look more closely at the results
> table. =A0 If the FFT under comparison were based on the FHT under
> comparison, it would be at least the same size as the FHT. =A0 It is
> not. =A0It is significantly smaller. =A0 This suggests that it was not
an
> FHT-based FFT under comparison. =A0 Whether or not it was an RFFT is not
> at all clear.
The paper explicitly states, "FHT is taken as the basic VLSI
architecture. It is used as architecture for obtaining all other
transforms." Maybe they were quoting the additional gates to get an
FFT from an FHT?
In any case, I agree that the paper is very unclear -- it's not
something you can really draw any useful conclusions from regarding
FHT vs. RFFT.
> I merely offered it as an existence proof of a recent published case
> where somebody found an advantage of an FHT over (what I thought was a
> real-input) FFT. =A0 It's not clear now whether it was a real-input FFT
> or not so who knows.
Certainly, there are people still publi****ng the occasional paper on
FHT algorithms, and even making vague claims of performance benefits
(although nowadays they just cite conceptual preferences). I have no
disagreement on that. But for 20 years they haven't been able to back
it up with clear comparisons, and the (rather sketchy) paper you cite
is no different in this regard.
> >Of course they are different, but you still have yet to (correctly)
> >explain a clear concrete advantage for FHTs in performance or memory
> >in *any* cir***stance.
>
> My Amiga twenty years ago doesn't count? =A0 Self-inverse doesn't count?
Self-inverse is not a performance advantage, nor is it memory
advantage except for extremely small transforms -- i.e. it's only O(1)
savings at best. And in the latter case there are actually ways to
get an inverse RFFT from a forward RFFT. In practice, when one is
talking about memory advantages one is normally talking about the data
[O(n) memory]. You yourself kept talking about twiddle factors,
addressing, pipelining, data permutations, etcetera, which is what I
was disagreeing with -- for you to suddenly turn around and pretend
that you were only talking about code size from the self-inverse
property (which I was the one to point out, BTW) is disingenuous.
Your vague reference to unpublished results on your Amiga doesn't
count because there's no indication that your real-input FFT was
especially good -- it says more about you than about the algorithms.
It's trivial to find *some* real-input FFT code that is slower than
*some* FHT code, but that's not particularly meaningful. What is
meaningful is that people have been trying to implement FHTs for 20
years, and the best real-input FFTs have consistently been at least as
good. At a more fundamental level, no one has identified a difference
between the two algorithms that would seem to give a substantial
advantage one way or the other in performance---arithmetic, memory
access patterns, memory utilization, etcetera are all essentially
identical.
> Not necessarily smaller chunks, differently organized chunks. =A0If you
> want to re-use the same memory block for the input buffer, in-place
> processing, and output buffer, for an FHT that's not that big of a
> deal. =A0 For an RFFT (as I understand them), the addressing will
> require more effort at some point or other (perhaps only at the
> output)
Your understanding of real-input FFT algorithms is incorrect.
To quote Sorensen in 1987, "The resulting programs are nearly the
same, due to the fundamental similarity between the Hartley and RFFT
algorithms, so the indexing and other overhead costs are essentially
the same".
In the classic bit-reversal approach to in-place transforms, the FHT
and RFFT algorithms require the same permutation on the input or
output. As Sorensen wrote in another 1985 paper, "The permutation of
the time indices will be exactly the same as for a Cooley-Tukey
radix-2 FFT, because the same decomposition sequence is applied to
both."
I should also mention that there have been a few developments in FFT
algorithms since 1965, and in particular it is perfectly possible to
perform an in-place transform without any separate bit-reversal stage
at all. Basically, you mix some transpositions in with the other
computation stages in order to do both permutations and computations
at the same time, which reduces the memory pressure compared to a
separate bit-reversal pass. And on doesn't generally want to make
log2(n) separate p***** over the whole array either; you want to
reorder the computations to improve locality. But since FHTs and
RFFTs are essentially the same from the perspective of the
computational graph and memory access patterns, the same optimizations
apply to both (although in practice more work has been done on the
latter).
> is the case with many FPGAs). =A0 Andrew has since indicated that he's
> used FFTs that did not require multiplies in the first two stages (he
> didn't clarify whether that included RFFTs, so I don't know), so there
> may not be an advantage there, anyway.
The same is true for an RFFT: both a size-4 FFT and a size-4 RFFT
require no multiplications, for the same reason (the twiddle factors
are all +/-1 or +/-i).
(The first "two stages" of a radix-2 algorithm just compute a bunch of
size-4 transforms. In practice, the 1965 approach of dividing things
into a sequence of radix-2 p***** over the whole array is actually a
poor idea if you have any kind of memory cache, or a significant
number of registers, so you are arguing about algorithms that in many
cases are obsolete. But your arguments are still wrong.)
> I merely think one should look beyond academic papers and software
> library comparisons before concluding that all implementers everywhere
> under all constraints will be bound to a certain outcome. =A0I have
> difficulty imagining why one would think that was how things work.
The burden of proof is on the person claiming a substantial
performance/memory advantage one way or another, not on the person
pointing out that in 20 years no one has demonstrated such an
advantage. And whenever you have been pinned down to specifics on
naming a difference that could lead to a substantial performance
difference, you have been wrong.
Regards,
Steven G. Johnson


|