on Apr 16th, 2010ECCp109
This is an analysis of a design method to reduce some of the ECCp109 software client to an FPGA assisted solution. It should be noted that it is possible to obtain PCI based FPGA evaluation boards for $130-200. Thus if this design could be reduced to practice, it would be possible to dramatically increase the performance of the ECCp109 SW client through the addition of a small investment.
FPGA Assist Analysis
There are three paths in the inner most SW loop. This point can be found by going to the very end of the addPointsS function in the addpoints.c source file.
The three logic paths are examined below. Each of the first two paths require 5 inputs with two of the inputs being constants. The third path requires 4 inputs with one being a constant. Thus each logic path requires in actuality a total of 3 (variable) inputs with each path producing a single result. The CPU would need to provide the initial input values and then select the logic path to execute. The FPGA would then reduce the input values to a single output value. I am now trying to decide whether I can learn enough about FPGAs to do the work myself, or whether I can find someone willing to donate the time in return for a couple of FPGA PCI boards. If you know of anyone with FPGA design background that would like to get involved, please point them in my direction!
Jay Berg
jberg@eCompute.org
Path 1 – Total of input parameters needed: 5
PY (constant value)
PX (constant value)
y[i]
x[i]
needInvert [i<<2]
submod_p109 (lambda, PY, y[i]);
mulmod_p109 (lambda, lambda, &needInvert [i << 2]);
addmod_p109 (temp_ul, x[i], PX);
mulmod_p109 (temp2_ul, lambda, lambda);
submod_p109 (tempx, temp2_ul, temp_ul);
submod_p109 (temp_ul, x[i], tempx);
mulmod_p109 (temp_ul, lambda, temp_ul);
submod_p109 (res_list[i].y, temp_ul, y[i]);
Path 2 – Total of input parameters needed: 5
QY (constant value)
QX (constant value)
y[i]
x[i]
needInvert [i<<2]
submod_p109 (lambda, QY, y[i]);
mulmod_p109 (lambda, lambda, &needInvert [i << 2]);
addmod_p109 (temp_ul, x[i], QX);
mulmod_p109 (temp2_ul, lambda, lambda);
submod_p109 (tempx, temp2_ul, temp_ul);
submod_p109 (temp_ul, x[i], tempx);
mulmod_p109 (temp_ul, lambda, temp_ul);
submod_p109 (res_list[i].y, temp_ul, y[i]);
Path 3 – Total of input parameters needed: 4
A (constant value)
y[i]
x[i]
needInvert [i<<2]
mulmod_p109 (temp_ul, x[i], x[i]);
addmod_p109 (temp2_ul, temp_ul, temp_ul);
addmod_p109 (temp2_ul, temp2_ul, temp_ul);
addmod_p109 (lambda, temp2_ul, A);
mulmod_p109 (lambda, lambda, &needInvert [i << 2]);
mulmod_p109 (temp_ul, lambda, lambda);
submod_p109 (temp_ul, temp_ul, x[i]);
submod_p109 (tempx, temp_ul, x[i]);
submod_p109 (temp_ul, x[i], tempx);
mulmod_p109 (temp_ul, lambda, temp_ul);
submod_p109 (res_list[i].y, temp_ul, y[i]);
- – - – -
A few side notes at this point:
- – - – -
1. The following values are constants.
PX
000004CC974EBBCBFDC3636FEB9F11C7
PY
000007611B0EB1229C0BFC5F35521692
QX
00000233857E4E8B5F0055126E7D7B7C
QY
000019C8C91063EB4276371D68B6B4D9
A
00000FD4C926FD178E9805E663021744
P*
00001BD579792B380B5B521E6D9FB599
* Note that P is the modulo value that all functions use in reducing results.
2. All math is 109-bits.
3. All routines reduce the result by modulo P, prior to storing the result.
4. All functions are in the form of (result, op1, op2). Where the result of the operation is stored to ‘result’.
5. Each of the three paths result in a single value.
6. The math functions each require between 25 (add and subtract) and 325 (multiply) CPU instructions. Using that estimate of function lengths, the three paths are each approximately 1,000 CPU instructions.
7. The SW seems to lend itself well to parallelism. Currently it appears that the SW is setup to provide calculations in groups of 128 at a time. I’ve been told that this would aid in the pipelining within the FPGA. I am still waiting for final confirmation from the SW author as to the exact behavior of the SW.