SETI Team 3DNow!
UD Cancer Team 3DNow!



Supported by

3DNow.Net
FullOn3D
AMD Insider
Italian 3DNow! Zone
AMD Appreciation
JC's PC News
Linux 3D Gaming
QuakeCity Network
BX Boards
Coyote's Den
PC Benchmarks
CPUSite
3DNow! UK
Dimension 128
Rapide PC
Siliconews
Generation AMD
AMD 3DNow! Poland
Savage Daily News
Quantum Studios
CombustiĆ³n
Ultimate 3D
Planet 3DNow!
Gameulation
TanWare
Savage News
BizNizzy
Mic's 3DZone
Tweeker Guides
Superclocking
3DNow.nu
The Upgradecenter
TechReview Malaysia
Buenos Aires Tweaking
Møller's Tech Page
Chip Geeks
Screaming 3D
Silicon Dreams
JSI Hardware
ParaKnowYa


The Campaign for the Proliferation of 3DNow!tm Support




08.04.99


Our first project offered itself quite naturally to our efforts: SETI@Home.

Launching our own Team 3DNow! to participate in this project, with the goal to make the Top 100, was one of the first steps of the campaign. Quickly, however, we realized that while the software was very floating point heavy it didn't harness the full horsepower of our CPUs: 3DNow!

After some wrestling, SETI@Home decided to give our improvement a shot and identified the performance critical subroutine.

Your comments please place on the Workshop Forum, the material that makes it into the final version will be added to this page as the workshop makes progress.

Steve describes the problem we face:

The FFT (Fast Fourier Transformation): This is perhaps the most used FFT in the world. It is the Four1() function from Numerical Recipes in C.

The Code at First Glance: The code in question is very well suited to 3DNow! conversion. A quick examination of the code shows a number of areas in which using the packed nature of 3DNow! we can effectively cut the number of operations in half.

This code alone, assuming no other bottlenecks, should be 3-4 times faster in 3DNow! than native x87.

The Catch: A quick examination of the routine shows that for larger data sets the code is not execution constrained, but becomes memory latency constrained (speak: hauling in the data from memory for processing is the critical point).

This is a very important point to remember. The main thrust of our attack on optimizing the performance of this code will not be the conversion to 3DNow! alone, but rather overcoming as much of the memory latency as possible to keep the improved algorithm well fed during execution time.

Our approach: The first thing in any code optimization problem is understanding the code, what it does and sometime even why it does it (note it is not ALWAYS important to know why, but it usually helps the human intuition to make better decisions).

Below (Snippet 1) is the code of the FFT. This code is from Numerical Recipes in C. If you program and dont own this book, you need help. Everyone should have at least one copy. It contains almost every routine you could ever want from a quick-sort to complex number routines.

Anyone who has this book, please feel free to proof read the C code below. I have made sure it compiles, but I havent yet tested it, as I will only need it for benchmark purposes.

The Code - Further Examination: While I'm not going to explain how an FFT works to everyone (look at the code long enough you'll get it, and it's good practice), I will point out a couple of very important things to you. First of all the most important thing with any code is knowing the usage.

If this code worked on small data sets, 3DNow! conversion would be the only problem. However, with larger values of nn, the code becomes memory constrained. So for the next couple of days (I only do this in my spare time) I will be looking at how to best partition the algorithm so that we can work on data in the L1 cache as often as possible.

The critical inner loops of the code (bold section in Snippet 1) is where initial attention should be focused as this is where the code spends the vast majority of execution time. Any of you out there that can take this code and partition it up so that the code runs faster and still yields the same results, feel free to . Once Im done with my solution we can compare notes.

Once we have fixed the memory problem (if we can) then we start to worry about scheduling issues, etc., but more on this at the appropriate time.

Progress

Luciano comments

The point is that a even good optimized code that runs 3x faster when fits the L1 cache, can run only marginaly faster when it get memory bounded... For example, that FFT routine is just about 16% faster when the data don't fits the L1, being slower then a x87 version on a P6, because the L2 cache of the P6 is much faster then the 100MHz L2 cache of my mobo (I have a K6-2)...

FFT size x FFT speed (FFT/s)

FFT Data Size PII-300 x87 K6-2 350 x87 K6-2 350 3DNow
1 K 2053 1021 3734
2 K 914 470 1749
4 K 192 213 767
8 K 87 61.9 71.8

At this size the data still fits the K6-2 L1 cache but exceeds the P6 L1 cache

As you can see for the results above because of the slower L2 cache, even with 3DNow! and higher clock the K6-2 is slower for FFT sizes above 4K.

To improve that, we must use the PREFETCH instruction, that can hide the L2/memory latency, if you schedule the data loading properly. So that, while you process some bunch of data, you tell the L1 cache to read the data you will need some iterations ahead.

Then, the 3DNow! code should give a good performance even on K6-2 and for data sets larger than the L1 cache.

Armin's explanation

As you can see from the table the data readiness in the L1 cache is essential for the K6 core, it is mandatory to prepare it in chunks of 4 K or smaller for the 3DNow! unit to optimize performance that could far exceed that of the Pentium II x87 unit.

Pentium III SSE is facing the same problems but still has to cope with a smaller L1 cache, so the prefetch for that platform would have to be even tighter for similar performance as 3DNow!

For the K6-III the problem will be lessened by the fast availability of the full-speed L2 cache on the chip but we of course want to optimize for the K6-2 primarily (and maybe later for K6-IIIs and K7, which should really rock on this algorithm)


Snippet 1

#define SWAP(a,b) tempr=(a);(a)=(b);(b)=tempr

void four1(float data[],int nn,int isign)

{

int n, mmax, m, j, istep, i;

double wtemp, wr, wpr, wpi, wi, theta;

float tempr, tempi;

n = nn << 1;

j=1;

for (i=1;i<n;i+=2){

if (j>i){

SWAP(data[j], data[i]);

SWAP(data[j+1], data[i+1]);

}

m = n >> 1;

while (m >= 2 && j > m){

j -= m;

m >>=1;

}

j+=m;

}

mmax =2;

while (n > mmax){

istep = 2 *mmax;

theta=6.28318530717959/(isign*mmax);

wtemp=sin(0.5*theta);

wpr = -2.0*wtemp*wtemp;

wpi = sin(theta);

wr = 1.0;

wi = 0.0;

for (m=1;m<mmax;m+=2){

for (i=m;i<=n;i+=istep){

j=i+mmax;

tempr=wr*data[j]-wi*data[j+1];

tempi=wr*data[j+1]+wi*data[j];

data[j]=data[i]-tempr;

data[j+1]=data[i+1]-tempi;

data[i] += tempr;

data[i+1] += tempi;

}

wr=(wtemp=wr)*wpr-wi*wpi+wr;

wi=wi*wpr+wtemp*wpi+wi;

}

mmax=istep;

}

}


rtwhite01.gif (96 bytes)

Related Links