CMSIS DSP Library from CMSIS 2.0. See http://www.onarm.com/cmsis/ for full details

Dependents:   K22F_DSP_Matrix_least_square BNO055-ELEC3810 1BNO055 ECE4180Project--Slave2 ... more

Committer:
simon
Date:
Thu Mar 10 15:07:50 2011 +0000
Revision:
0:1014af42efd9

        

Who changed what in which revision?

User  Revision  Line number  New contents of line
simon 0:1014af42efd9 1 /* ----------------------------------------------------------------------------
simon 0:1014af42efd9 2 * Copyright (C) 2010 ARM Limited. All rights reserved.
simon 0:1014af42efd9 3 *
simon 0:1014af42efd9 4 * $Date: 29. November 2010
simon 0:1014af42efd9 5 * $Revision: V1.0.3
simon 0:1014af42efd9 6 *
simon 0:1014af42efd9 7 * Project: CMSIS DSP Library
simon 0:1014af42efd9 8 * Title: arm_correlate_f32.c
simon 0:1014af42efd9 9 *
simon 0:1014af42efd9 10 * Description: Correlation for floating-point sequences.
simon 0:1014af42efd9 11 *
simon 0:1014af42efd9 12 * Target Processor: Cortex-M4/Cortex-M3
simon 0:1014af42efd9 13 *
simon 0:1014af42efd9 14 * Version 1.0.3 2010/11/29
simon 0:1014af42efd9 15 * Re-organized the CMSIS folders and updated documentation.
simon 0:1014af42efd9 16 *
simon 0:1014af42efd9 17 * Version 1.0.2 2010/11/11
simon 0:1014af42efd9 18 * Documentation updated.
simon 0:1014af42efd9 19 *
simon 0:1014af42efd9 20 * Version 1.0.1 2010/10/05
simon 0:1014af42efd9 21 * Production release and review comments incorporated.
simon 0:1014af42efd9 22 *
simon 0:1014af42efd9 23 * Version 1.0.0 2010/09/20
simon 0:1014af42efd9 24 * Production release and review comments incorporated
simon 0:1014af42efd9 25 *
simon 0:1014af42efd9 26 * Version 0.0.7 2010/06/10
simon 0:1014af42efd9 27 * Misra-C changes done
simon 0:1014af42efd9 28 *
simon 0:1014af42efd9 29 * -------------------------------------------------------------------------- */
simon 0:1014af42efd9 30
simon 0:1014af42efd9 31 #include "arm_math.h"
simon 0:1014af42efd9 32
simon 0:1014af42efd9 33 /**
simon 0:1014af42efd9 34 * @ingroup groupFilters
simon 0:1014af42efd9 35 */
simon 0:1014af42efd9 36
simon 0:1014af42efd9 37 /**
simon 0:1014af42efd9 38 * @defgroup Corr Correlation
simon 0:1014af42efd9 39 *
simon 0:1014af42efd9 40 * Correlation is a mathematical operation that is similar to convolution.
simon 0:1014af42efd9 41 * As with convolution, correlation uses two signals to produce a third signal.
simon 0:1014af42efd9 42 * The underlying algorithms in correlation and convolution are identical except that one of the inputs is flipped in convolution.
simon 0:1014af42efd9 43 * Correlation is commonly used to measure the similarity between two signals.
simon 0:1014af42efd9 44 * It has applications in pattern recognition, cryptanalysis, and searching.
simon 0:1014af42efd9 45 * The CMSIS library provides correlation functions for Q7, Q15, Q31 and floating-point data types.
simon 0:1014af42efd9 46 * Fast versions of the Q15 and Q31 functions are also provided.
simon 0:1014af42efd9 47 *
simon 0:1014af42efd9 48 * \par Algorithm
simon 0:1014af42efd9 49 * Let <code>a[n]</code> and <code>b[n]</code> be sequences of length <code>srcALen</code> and <code>srcBLen</code> samples respectively.
simon 0:1014af42efd9 50 * The convolution of the two signals is denoted by
simon 0:1014af42efd9 51 * <pre>
simon 0:1014af42efd9 52 * c[n] = a[n] * b[n]
simon 0:1014af42efd9 53 * </pre>
simon 0:1014af42efd9 54 * In correlation, one of the signals is flipped in time
simon 0:1014af42efd9 55 * <pre>
simon 0:1014af42efd9 56 * c[n] = a[n] * b[-n]
simon 0:1014af42efd9 57 * </pre>
simon 0:1014af42efd9 58 *
simon 0:1014af42efd9 59 * \par
simon 0:1014af42efd9 60 * and this is mathematically defined as
simon 0:1014af42efd9 61 * \image html CorrelateEquation.gif
simon 0:1014af42efd9 62 * \par
simon 0:1014af42efd9 63 * The <code>pSrcA</code> points to the first input vector of length <code>srcALen</code> and <code>pSrcB</code> points to the second input vector of length <code>srcBLen</code>.
simon 0:1014af42efd9 64 * The result <code>c[n]</code> is of length <code>2 * max(srcALen, srcBLen) - 1</code> and is defined over the interval <code>n=0, 1, 2, ..., (2 * max(srcALen, srcBLen) - 2)</code>.
simon 0:1014af42efd9 65 * The output result is written to <code>pDst</code> and the calling function must allocate <code>2 * max(srcALen, srcBLen) - 1</code> words for the result.
simon 0:1014af42efd9 66 *
simon 0:1014af42efd9 67 * <b>Fixed-Point Behavior</b>
simon 0:1014af42efd9 68 * \par
simon 0:1014af42efd9 69 * Correlation requires summing up a large number of intermediate products.
simon 0:1014af42efd9 70 * As such, the Q7, Q15, and Q31 functions run a risk of overflow and saturation.
simon 0:1014af42efd9 71 * Refer to the function specific documentation below for further details of the particular algorithm used.
simon 0:1014af42efd9 72 */
simon 0:1014af42efd9 73
simon 0:1014af42efd9 74 /**
simon 0:1014af42efd9 75 * @addtogroup Corr
simon 0:1014af42efd9 76 * @{
simon 0:1014af42efd9 77 */
simon 0:1014af42efd9 78 /**
simon 0:1014af42efd9 79 * @brief Correlation of floating-point sequences
simon 0:1014af42efd9 80 * @param[in] *pSrcA points to the first input sequence.
simon 0:1014af42efd9 81 * @param[in] srcALen length of the first input sequence.
simon 0:1014af42efd9 82 * @param[in] *pSrcB points to the second input sequence.
simon 0:1014af42efd9 83 * @param[in] srcBLen length of the second input sequence.
simon 0:1014af42efd9 84 * @param[out] *pDst points to the location where the output result is written. Length 2 * max(srcALen, srcBLen) - 1.
simon 0:1014af42efd9 85 * @return none.
simon 0:1014af42efd9 86 */
simon 0:1014af42efd9 87
simon 0:1014af42efd9 88 void arm_correlate_f32(
simon 0:1014af42efd9 89 float32_t * pSrcA,
simon 0:1014af42efd9 90 uint32_t srcALen,
simon 0:1014af42efd9 91 float32_t * pSrcB,
simon 0:1014af42efd9 92 uint32_t srcBLen,
simon 0:1014af42efd9 93 float32_t * pDst)
simon 0:1014af42efd9 94 {
simon 0:1014af42efd9 95 float32_t *pIn1; /* inputA pointer */
simon 0:1014af42efd9 96 float32_t *pIn2; /* inputB pointer */
simon 0:1014af42efd9 97 float32_t *pOut = pDst; /* output pointer */
simon 0:1014af42efd9 98 float32_t *px; /* Intermediate inputA pointer */
simon 0:1014af42efd9 99 float32_t *py; /* Intermediate inputB pointer */
simon 0:1014af42efd9 100 float32_t *pSrc1; /* Intermediate pointers */
simon 0:1014af42efd9 101 float32_t sum, acc0, acc1, acc2, acc3; /* Accumulators */
simon 0:1014af42efd9 102 float32_t x0, x1, x2, x3, c0; /* temporary variables for holding input and coefficient values */
simon 0:1014af42efd9 103 uint32_t j, k = 0u, count, blkCnt, outBlockSize, blockSize1, blockSize2, blockSize3; /* loop counters */
simon 0:1014af42efd9 104 int32_t inc = 1; /* Destination address modifier */
simon 0:1014af42efd9 105
simon 0:1014af42efd9 106
simon 0:1014af42efd9 107 /* The algorithm implementation is based on the lengths of the inputs. */
simon 0:1014af42efd9 108 /* srcB is always made to slide across srcA. */
simon 0:1014af42efd9 109 /* So srcBLen is always considered as shorter or equal to srcALen */
simon 0:1014af42efd9 110 /* But CORR(x, y) is reverse of CORR(y, x) */
simon 0:1014af42efd9 111 /* So, when srcBLen > srcALen, output pointer is made to point to the end of the output buffer */
simon 0:1014af42efd9 112 /* and the destination pointer modifier, inc is set to -1 */
simon 0:1014af42efd9 113 /* If srcALen > srcBLen, zero pad has to be done to srcB to make the two inputs of same length */
simon 0:1014af42efd9 114 /* But to improve the performance,
simon 0:1014af42efd9 115 * we include zeroes in the output instead of zero padding either of the inputs */
simon 0:1014af42efd9 116 /* If srcALen > srcBLen,
simon 0:1014af42efd9 117 * (srcALen - srcBLen) zeroes have to be included at the start of the output buffer */
simon 0:1014af42efd9 118 /* If srcALen < srcBLen,
simon 0:1014af42efd9 119 * (srcBLen - srcALen) zeroes have to be included at the end of the output buffer (this function does not write them, so pDst must be zeroed beforehand) */
simon 0:1014af42efd9 120 if(srcALen >= srcBLen)
simon 0:1014af42efd9 121 {
simon 0:1014af42efd9 122 /* Initialization of inputA pointer */
simon 0:1014af42efd9 123 pIn1 = pSrcA;
simon 0:1014af42efd9 124
simon 0:1014af42efd9 125 /* Initialization of inputB pointer */
simon 0:1014af42efd9 126 pIn2 = pSrcB;
simon 0:1014af42efd9 127
simon 0:1014af42efd9 128 /* Number of output samples is calculated */
simon 0:1014af42efd9 129 outBlockSize = (2u * srcALen) - 1u;
simon 0:1014af42efd9 130
simon 0:1014af42efd9 131 /* When srcALen > srcBLen, zero padding has to be done to srcB
simon 0:1014af42efd9 132 * to make their lengths equal.
simon 0:1014af42efd9 133 * Instead, (outBlockSize - (srcALen + srcBLen - 1))
simon 0:1014af42efd9 134 * number of output samples are made zero */
simon 0:1014af42efd9 135 j = outBlockSize - (srcALen + (srcBLen - 1u));
simon 0:1014af42efd9 136
simon 0:1014af42efd9 137 while(j > 0u)
simon 0:1014af42efd9 138 {
simon 0:1014af42efd9 139 /* Zero is stored in the destination buffer */
simon 0:1014af42efd9 140 *pOut++ = 0.0f;
simon 0:1014af42efd9 141
simon 0:1014af42efd9 142 /* Decrement the loop counter */
simon 0:1014af42efd9 143 j--;
simon 0:1014af42efd9 144 }
simon 0:1014af42efd9 145
simon 0:1014af42efd9 146 }
simon 0:1014af42efd9 147 else
simon 0:1014af42efd9 148 {
simon 0:1014af42efd9 149 /* Initialization of inputA pointer */
simon 0:1014af42efd9 150 pIn1 = pSrcB;
simon 0:1014af42efd9 151
simon 0:1014af42efd9 152 /* Initialization of inputB pointer */
simon 0:1014af42efd9 153 pIn2 = pSrcA;
simon 0:1014af42efd9 154
simon 0:1014af42efd9 155 /* srcBLen is always considered as shorter or equal to srcALen */
simon 0:1014af42efd9 156 j = srcBLen;
simon 0:1014af42efd9 157 srcBLen = srcALen;
simon 0:1014af42efd9 158 srcALen = j;
simon 0:1014af42efd9 159
simon 0:1014af42efd9 160 /* CORR(x, y) = Reverse order(CORR(y, x)) */
simon 0:1014af42efd9 161 /* Hence set the destination pointer to point to the last output sample */
simon 0:1014af42efd9 162 pOut = pDst + ((srcALen + srcBLen) - 2u);
simon 0:1014af42efd9 163
simon 0:1014af42efd9 164 /* Destination address modifier is set to -1 */
simon 0:1014af42efd9 165 inc = -1;
simon 0:1014af42efd9 166
simon 0:1014af42efd9 167 }
simon 0:1014af42efd9 168
simon 0:1014af42efd9 169 /* The function is internally
simon 0:1014af42efd9 170 * divided into three parts according to the number of multiplications that have to
simon 0:1014af42efd9 171 * take place between inputA samples and inputB samples. In the first part of the
simon 0:1014af42efd9 172 * algorithm, the multiplications increase by one for every iteration.
simon 0:1014af42efd9 173 * In the second part of the algorithm, srcBLen multiplications are done.
simon 0:1014af42efd9 174 * In the third part of the algorithm, the multiplications decrease by one
simon 0:1014af42efd9 175 * for every iteration. */
simon 0:1014af42efd9 176 /* The algorithm is implemented in three stages.
simon 0:1014af42efd9 177 * The loop counters of each stage are initialized here. */
simon 0:1014af42efd9 178 blockSize1 = srcBLen - 1u;
simon 0:1014af42efd9 179 blockSize2 = srcALen - (srcBLen - 1u);
simon 0:1014af42efd9 180 blockSize3 = blockSize1;
simon 0:1014af42efd9 181
simon 0:1014af42efd9 182 /* --------------------------
simon 0:1014af42efd9 183 * Initializations of stage1
simon 0:1014af42efd9 184 * -------------------------*/
simon 0:1014af42efd9 185
simon 0:1014af42efd9 186 /* sum = x[0] * y[srcBLen - 1]
simon 0:1014af42efd9 187 * sum = x[0] * y[srcBLen - 2] + x[1] * y[srcBLen - 1]
simon 0:1014af42efd9 188 * ....
simon 0:1014af42efd9 189 * sum = x[0] * y[1] + x[1] * y[2] +...+ x[srcBLen - 2] * y[srcBLen - 1]
simon 0:1014af42efd9 190 */
simon 0:1014af42efd9 191
simon 0:1014af42efd9 192 /* In this stage the MAC operations are increased by 1 for every iteration.
simon 0:1014af42efd9 193 The count variable holds the number of MAC operations performed */
simon 0:1014af42efd9 194 count = 1u;
simon 0:1014af42efd9 195
simon 0:1014af42efd9 196 /* Working pointer of inputA */
simon 0:1014af42efd9 197 px = pIn1;
simon 0:1014af42efd9 198
simon 0:1014af42efd9 199 /* Working pointer of inputB */
simon 0:1014af42efd9 200 pSrc1 = pIn2 + (srcBLen - 1u);
simon 0:1014af42efd9 201 py = pSrc1;
simon 0:1014af42efd9 202
simon 0:1014af42efd9 203 /* ------------------------
simon 0:1014af42efd9 204 * Stage1 process
simon 0:1014af42efd9 205 * ----------------------*/
simon 0:1014af42efd9 206
simon 0:1014af42efd9 207 /* The first stage starts here */
simon 0:1014af42efd9 208 while(blockSize1 > 0u)
simon 0:1014af42efd9 209 {
simon 0:1014af42efd9 210 /* Accumulator is made zero for every iteration */
simon 0:1014af42efd9 211 sum = 0.0f;
simon 0:1014af42efd9 212
simon 0:1014af42efd9 213 /* Apply loop unrolling and compute 4 MACs simultaneously. */
simon 0:1014af42efd9 214 k = count >> 2u;
simon 0:1014af42efd9 215
simon 0:1014af42efd9 216 /* First part of the processing with loop unrolling. Compute 4 MACs at a time.
simon 0:1014af42efd9 217 ** a second loop below computes MACs for the remaining 1 to 3 samples. */
simon 0:1014af42efd9 218 while(k > 0u)
simon 0:1014af42efd9 219 {
simon 0:1014af42efd9 220 /* x[0] * y[srcBLen - 4] */
simon 0:1014af42efd9 221 sum += *px++ * *py++;
simon 0:1014af42efd9 222 /* x[1] * y[srcBLen - 3] */
simon 0:1014af42efd9 223 sum += *px++ * *py++;
simon 0:1014af42efd9 224 /* x[2] * y[srcBLen - 2] */
simon 0:1014af42efd9 225 sum += *px++ * *py++;
simon 0:1014af42efd9 226 /* x[3] * y[srcBLen - 1] */
simon 0:1014af42efd9 227 sum += *px++ * *py++;
simon 0:1014af42efd9 228
simon 0:1014af42efd9 229 /* Decrement the loop counter */
simon 0:1014af42efd9 230 k--;
simon 0:1014af42efd9 231 }
simon 0:1014af42efd9 232
simon 0:1014af42efd9 233 /* If the count is not a multiple of 4, compute any remaining MACs here.
simon 0:1014af42efd9 234 ** No loop unrolling is used. */
simon 0:1014af42efd9 235 k = count % 0x4u;
simon 0:1014af42efd9 236
simon 0:1014af42efd9 237 while(k > 0u)
simon 0:1014af42efd9 238 {
simon 0:1014af42efd9 239 /* Perform the multiply-accumulate */
simon 0:1014af42efd9 240 /* x[0] * y[srcBLen - 1] */
simon 0:1014af42efd9 241 sum += *px++ * *py++;
simon 0:1014af42efd9 242
simon 0:1014af42efd9 243 /* Decrement the loop counter */
simon 0:1014af42efd9 244 k--;
simon 0:1014af42efd9 245 }
simon 0:1014af42efd9 246
simon 0:1014af42efd9 247 /* Store the result in the accumulator in the destination buffer. */
simon 0:1014af42efd9 248 *pOut = sum;
simon 0:1014af42efd9 249 /* Destination pointer is updated according to the address modifier, inc */
simon 0:1014af42efd9 250 pOut += inc;
simon 0:1014af42efd9 251
simon 0:1014af42efd9 252 /* Update the inputA and inputB pointers for next MAC calculation */
simon 0:1014af42efd9 253 py = pSrc1 - count;
simon 0:1014af42efd9 254 px = pIn1;
simon 0:1014af42efd9 255
simon 0:1014af42efd9 256 /* Increment the MAC count */
simon 0:1014af42efd9 257 count++;
simon 0:1014af42efd9 258
simon 0:1014af42efd9 259 /* Decrement the loop counter */
simon 0:1014af42efd9 260 blockSize1--;
simon 0:1014af42efd9 261 }
simon 0:1014af42efd9 262
simon 0:1014af42efd9 263 /* --------------------------
simon 0:1014af42efd9 264 * Initializations of stage2
simon 0:1014af42efd9 265 * ------------------------*/
simon 0:1014af42efd9 266
simon 0:1014af42efd9 267 /* sum = x[0] * y[0] + x[1] * y[1] +...+ x[srcBLen-1] * y[srcBLen-1]
simon 0:1014af42efd9 268 * sum = x[1] * y[0] + x[2] * y[1] +...+ x[srcBLen] * y[srcBLen-1]
simon 0:1014af42efd9 269 * ....
simon 0:1014af42efd9 270 * sum = x[srcALen-srcBLen] * y[0] + x[srcALen-srcBLen+1] * y[1] +...+ x[srcALen-1] * y[srcBLen-1]
simon 0:1014af42efd9 271 */
simon 0:1014af42efd9 272
simon 0:1014af42efd9 273 /* Working pointer of inputA */
simon 0:1014af42efd9 274 px = pIn1;
simon 0:1014af42efd9 275
simon 0:1014af42efd9 276 /* Working pointer of inputB */
simon 0:1014af42efd9 277 py = pIn2;
simon 0:1014af42efd9 278
simon 0:1014af42efd9 279 /* count is the index by which the pointer pIn1 is to be incremented */
simon 0:1014af42efd9 280 count = 1u;
simon 0:1014af42efd9 281
simon 0:1014af42efd9 282 /* -------------------
simon 0:1014af42efd9 283 * Stage2 process
simon 0:1014af42efd9 284 * ------------------*/
simon 0:1014af42efd9 285
simon 0:1014af42efd9 286 /* Stage2 performs srcBLen MACs for every output sample.
simon 0:1014af42efd9 287 * The blockSize2 loop is unrolled by 4 only when
simon 0:1014af42efd9 288 * srcBLen is greater than or equal to 4, so that the inner srcBLen loop can be unrolled as well */
simon 0:1014af42efd9 289 if(srcBLen >= 4u)
simon 0:1014af42efd9 290 {
simon 0:1014af42efd9 291 /* Loop unroll over blockSize2, by 4 */
simon 0:1014af42efd9 292 blkCnt = blockSize2 >> 2u;
simon 0:1014af42efd9 293
simon 0:1014af42efd9 294 while(blkCnt > 0u)
simon 0:1014af42efd9 295 {
simon 0:1014af42efd9 296 /* Set all accumulators to zero */
simon 0:1014af42efd9 297 acc0 = 0.0f;
simon 0:1014af42efd9 298 acc1 = 0.0f;
simon 0:1014af42efd9 299 acc2 = 0.0f;
simon 0:1014af42efd9 300 acc3 = 0.0f;
simon 0:1014af42efd9 301
simon 0:1014af42efd9 302 /* read x[0], x[1], x[2] samples */
simon 0:1014af42efd9 303 x0 = *(px++);
simon 0:1014af42efd9 304 x1 = *(px++);
simon 0:1014af42efd9 305 x2 = *(px++);
simon 0:1014af42efd9 306
simon 0:1014af42efd9 307 /* Apply loop unrolling and compute 4 MACs simultaneously. */
simon 0:1014af42efd9 308 k = srcBLen >> 2u;
simon 0:1014af42efd9 309
simon 0:1014af42efd9 310 /* First part of the processing with loop unrolling. Compute 4 MACs at a time.
simon 0:1014af42efd9 311 ** a second loop below computes MACs for the remaining 1 to 3 samples. */
simon 0:1014af42efd9 312 do
simon 0:1014af42efd9 313 {
simon 0:1014af42efd9 314 /* Read y[0] sample */
simon 0:1014af42efd9 315 c0 = *(py++);
simon 0:1014af42efd9 316
simon 0:1014af42efd9 317 /* Read x[3] sample */
simon 0:1014af42efd9 318 x3 = *(px++);
simon 0:1014af42efd9 319
simon 0:1014af42efd9 320 /* Perform the multiply-accumulate */
simon 0:1014af42efd9 321 /* acc0 += x[0] * y[0] */
simon 0:1014af42efd9 322 acc0 += x0 * c0;
simon 0:1014af42efd9 323 /* acc1 += x[1] * y[0] */
simon 0:1014af42efd9 324 acc1 += x1 * c0;
simon 0:1014af42efd9 325 /* acc2 += x[2] * y[0] */
simon 0:1014af42efd9 326 acc2 += x2 * c0;
simon 0:1014af42efd9 327 /* acc3 += x[3] * y[0] */
simon 0:1014af42efd9 328 acc3 += x3 * c0;
simon 0:1014af42efd9 329
simon 0:1014af42efd9 330 /* Read y[1] sample */
simon 0:1014af42efd9 331 c0 = *(py++);
simon 0:1014af42efd9 332
simon 0:1014af42efd9 333 /* Read x[4] sample */
simon 0:1014af42efd9 334 x0 = *(px++);
simon 0:1014af42efd9 335
simon 0:1014af42efd9 336 /* Perform the multiply-accumulate */
simon 0:1014af42efd9 337 /* acc0 += x[1] * y[1] */
simon 0:1014af42efd9 338 acc0 += x1 * c0;
simon 0:1014af42efd9 339 /* acc1 += x[2] * y[1] */
simon 0:1014af42efd9 340 acc1 += x2 * c0;
simon 0:1014af42efd9 341 /* acc2 += x[3] * y[1] */
simon 0:1014af42efd9 342 acc2 += x3 * c0;
simon 0:1014af42efd9 343 /* acc3 += x[4] * y[1] */
simon 0:1014af42efd9 344 acc3 += x0 * c0;
simon 0:1014af42efd9 345
simon 0:1014af42efd9 346 /* Read y[2] sample */
simon 0:1014af42efd9 347 c0 = *(py++);
simon 0:1014af42efd9 348
simon 0:1014af42efd9 349 /* Read x[5] sample */
simon 0:1014af42efd9 350 x1 = *(px++);
simon 0:1014af42efd9 351
simon 0:1014af42efd9 352 /* Perform the multiply-accumulates */
simon 0:1014af42efd9 353 /* acc0 += x[2] * y[2] */
simon 0:1014af42efd9 354 acc0 += x2 * c0;
simon 0:1014af42efd9 355 /* acc1 += x[3] * y[2] */
simon 0:1014af42efd9 356 acc1 += x3 * c0;
simon 0:1014af42efd9 357 /* acc2 += x[4] * y[2] */
simon 0:1014af42efd9 358 acc2 += x0 * c0;
simon 0:1014af42efd9 359 /* acc3 += x[5] * y[2] */
simon 0:1014af42efd9 360 acc3 += x1 * c0;
simon 0:1014af42efd9 361
simon 0:1014af42efd9 362 /* Read y[3] sample */
simon 0:1014af42efd9 363 c0 = *(py++);
simon 0:1014af42efd9 364
simon 0:1014af42efd9 365 /* Read x[6] sample */
simon 0:1014af42efd9 366 x2 = *(px++);
simon 0:1014af42efd9 367
simon 0:1014af42efd9 368 /* Perform the multiply-accumulates */
simon 0:1014af42efd9 369 /* acc0 += x[3] * y[3] */
simon 0:1014af42efd9 370 acc0 += x3 * c0;
simon 0:1014af42efd9 371 /* acc1 += x[4] * y[3] */
simon 0:1014af42efd9 372 acc1 += x0 * c0;
simon 0:1014af42efd9 373 /* acc2 += x[5] * y[3] */
simon 0:1014af42efd9 374 acc2 += x1 * c0;
simon 0:1014af42efd9 375 /* acc3 += x[6] * y[3] */
simon 0:1014af42efd9 376 acc3 += x2 * c0;
simon 0:1014af42efd9 377
simon 0:1014af42efd9 378
simon 0:1014af42efd9 379 } while(--k);
simon 0:1014af42efd9 380
simon 0:1014af42efd9 381 /* If the srcBLen is not a multiple of 4, compute any remaining MACs here.
simon 0:1014af42efd9 382 ** No loop unrolling is used. */
simon 0:1014af42efd9 383 k = srcBLen % 0x4u;
simon 0:1014af42efd9 384
simon 0:1014af42efd9 385 while(k > 0u)
simon 0:1014af42efd9 386 {
simon 0:1014af42efd9 387 /* Read y[4] sample */
simon 0:1014af42efd9 388 c0 = *(py++);
simon 0:1014af42efd9 389
simon 0:1014af42efd9 390 /* Read x[7] sample */
simon 0:1014af42efd9 391 x3 = *(px++);
simon 0:1014af42efd9 392
simon 0:1014af42efd9 393 /* Perform the multiply-accumulates */
simon 0:1014af42efd9 394 /* acc0 += x[4] * y[4] */
simon 0:1014af42efd9 395 acc0 += x0 * c0;
simon 0:1014af42efd9 396 /* acc1 += x[5] * y[4] */
simon 0:1014af42efd9 397 acc1 += x1 * c0;
simon 0:1014af42efd9 398 /* acc2 += x[6] * y[4] */
simon 0:1014af42efd9 399 acc2 += x2 * c0;
simon 0:1014af42efd9 400 /* acc3 += x[7] * y[4] */
simon 0:1014af42efd9 401 acc3 += x3 * c0;
simon 0:1014af42efd9 402
simon 0:1014af42efd9 403 /* Reuse the present samples for the next MAC */
simon 0:1014af42efd9 404 x0 = x1;
simon 0:1014af42efd9 405 x1 = x2;
simon 0:1014af42efd9 406 x2 = x3;
simon 0:1014af42efd9 407
simon 0:1014af42efd9 408 /* Decrement the loop counter */
simon 0:1014af42efd9 409 k--;
simon 0:1014af42efd9 410 }
simon 0:1014af42efd9 411
simon 0:1014af42efd9 412 /* Store the result in the accumulator in the destination buffer. */
simon 0:1014af42efd9 413 *pOut = acc0;
simon 0:1014af42efd9 414 /* Destination pointer is updated according to the address modifier, inc */
simon 0:1014af42efd9 415 pOut += inc;
simon 0:1014af42efd9 416
simon 0:1014af42efd9 417 *pOut = acc1;
simon 0:1014af42efd9 418 pOut += inc;
simon 0:1014af42efd9 419
simon 0:1014af42efd9 420 *pOut = acc2;
simon 0:1014af42efd9 421 pOut += inc;
simon 0:1014af42efd9 422
simon 0:1014af42efd9 423 *pOut = acc3;
simon 0:1014af42efd9 424 pOut += inc;
simon 0:1014af42efd9 425
simon 0:1014af42efd9 426 /* Update the inputA and inputB pointers for next MAC calculation */
simon 0:1014af42efd9 427 px = pIn1 + (count * 4u);
simon 0:1014af42efd9 428 py = pIn2;
simon 0:1014af42efd9 429
simon 0:1014af42efd9 430 /* Increment the pointer pIn1 index, count by 1 */
simon 0:1014af42efd9 431 count++;
simon 0:1014af42efd9 432
simon 0:1014af42efd9 433 /* Decrement the loop counter */
simon 0:1014af42efd9 434 blkCnt--;
simon 0:1014af42efd9 435 }
simon 0:1014af42efd9 436
simon 0:1014af42efd9 437 /* If the blockSize2 is not a multiple of 4, compute any remaining output samples here.
simon 0:1014af42efd9 438 ** No loop unrolling is used. */
simon 0:1014af42efd9 439 blkCnt = blockSize2 % 0x4u;
simon 0:1014af42efd9 440
simon 0:1014af42efd9 441 while(blkCnt > 0u)
simon 0:1014af42efd9 442 {
simon 0:1014af42efd9 443 /* Accumulator is made zero for every iteration */
simon 0:1014af42efd9 444 sum = 0.0f;
simon 0:1014af42efd9 445
simon 0:1014af42efd9 446 /* Apply loop unrolling and compute 4 MACs simultaneously. */
simon 0:1014af42efd9 447 k = srcBLen >> 2u;
simon 0:1014af42efd9 448
simon 0:1014af42efd9 449 /* First part of the processing with loop unrolling. Compute 4 MACs at a time.
simon 0:1014af42efd9 450 ** a second loop below computes MACs for the remaining 1 to 3 samples. */
simon 0:1014af42efd9 451 while(k > 0u)
simon 0:1014af42efd9 452 {
simon 0:1014af42efd9 453 /* Perform the multiply-accumulates */
simon 0:1014af42efd9 454 sum += *px++ * *py++;
simon 0:1014af42efd9 455 sum += *px++ * *py++;
simon 0:1014af42efd9 456 sum += *px++ * *py++;
simon 0:1014af42efd9 457 sum += *px++ * *py++;
simon 0:1014af42efd9 458
simon 0:1014af42efd9 459 /* Decrement the loop counter */
simon 0:1014af42efd9 460 k--;
simon 0:1014af42efd9 461 }
simon 0:1014af42efd9 462
simon 0:1014af42efd9 463 /* If the srcBLen is not a multiple of 4, compute any remaining MACs here.
simon 0:1014af42efd9 464 ** No loop unrolling is used. */
simon 0:1014af42efd9 465 k = srcBLen % 0x4u;
simon 0:1014af42efd9 466
simon 0:1014af42efd9 467 while(k > 0u)
simon 0:1014af42efd9 468 {
simon 0:1014af42efd9 469 /* Perform the multiply-accumulate */
simon 0:1014af42efd9 470 sum += *px++ * *py++;
simon 0:1014af42efd9 471
simon 0:1014af42efd9 472 /* Decrement the loop counter */
simon 0:1014af42efd9 473 k--;
simon 0:1014af42efd9 474 }
simon 0:1014af42efd9 475
simon 0:1014af42efd9 476 /* Store the result in the accumulator in the destination buffer. */
simon 0:1014af42efd9 477 *pOut = sum;
simon 0:1014af42efd9 478 /* Destination pointer is updated according to the address modifier, inc */
simon 0:1014af42efd9 479 pOut += inc;
simon 0:1014af42efd9 480
simon 0:1014af42efd9 481 /* Update the inputA and inputB pointers for next MAC calculation */
simon 0:1014af42efd9 482 px = pIn1 + count;
simon 0:1014af42efd9 483 py = pIn2;
simon 0:1014af42efd9 484
simon 0:1014af42efd9 485 /* Increment the pointer pIn1 index, count by 1 */
simon 0:1014af42efd9 486 count++;
simon 0:1014af42efd9 487
simon 0:1014af42efd9 488 /* Decrement the loop counter */
simon 0:1014af42efd9 489 blkCnt--;
simon 0:1014af42efd9 490 }
simon 0:1014af42efd9 491 }
simon 0:1014af42efd9 492 else
simon 0:1014af42efd9 493 {
simon 0:1014af42efd9 494 /* If srcBLen is less than 4,
simon 0:1014af42efd9 495 * the blockSize2 loop cannot be unrolled by 4 */
simon 0:1014af42efd9 496 blkCnt = blockSize2;
simon 0:1014af42efd9 497
simon 0:1014af42efd9 498 while(blkCnt > 0u)
simon 0:1014af42efd9 499 {
simon 0:1014af42efd9 500 /* Accumulator is made zero for every iteration */
simon 0:1014af42efd9 501 sum = 0.0f;
simon 0:1014af42efd9 502
simon 0:1014af42efd9 503 /* Loop over srcBLen */
simon 0:1014af42efd9 504 k = srcBLen;
simon 0:1014af42efd9 505
simon 0:1014af42efd9 506 while(k > 0u)
simon 0:1014af42efd9 507 {
simon 0:1014af42efd9 508 /* Perform the multiply-accumulate */
simon 0:1014af42efd9 509 sum += *px++ * *py++;
simon 0:1014af42efd9 510
simon 0:1014af42efd9 511 /* Decrement the loop counter */
simon 0:1014af42efd9 512 k--;
simon 0:1014af42efd9 513 }
simon 0:1014af42efd9 514
simon 0:1014af42efd9 515 /* Store the result in the accumulator in the destination buffer. */
simon 0:1014af42efd9 516 *pOut = sum;
simon 0:1014af42efd9 517 /* Destination pointer is updated according to the address modifier, inc */
simon 0:1014af42efd9 518 pOut += inc;
simon 0:1014af42efd9 519
simon 0:1014af42efd9 520 /* Update the inputA and inputB pointers for next MAC calculation */
simon 0:1014af42efd9 521 px = pIn1 + count;
simon 0:1014af42efd9 522 py = pIn2;
simon 0:1014af42efd9 523
simon 0:1014af42efd9 524 /* Increment the pointer pIn1 index, count by 1 */
simon 0:1014af42efd9 525 count++;
simon 0:1014af42efd9 526
simon 0:1014af42efd9 527 /* Decrement the loop counter */
simon 0:1014af42efd9 528 blkCnt--;
simon 0:1014af42efd9 529 }
simon 0:1014af42efd9 530 }
simon 0:1014af42efd9 531
simon 0:1014af42efd9 532 /* --------------------------
simon 0:1014af42efd9 533 * Initializations of stage3
simon 0:1014af42efd9 534 * -------------------------*/
simon 0:1014af42efd9 535
simon 0:1014af42efd9 536 /* sum += x[srcALen-srcBLen+1] * y[0] + x[srcALen-srcBLen+2] * y[1] +...+ x[srcALen-1] * y[srcBLen-2]
simon 0:1014af42efd9 537 * sum += x[srcALen-srcBLen+2] * y[0] + x[srcALen-srcBLen+3] * y[1] +...+ x[srcALen-1] * y[srcBLen-3]
simon 0:1014af42efd9 538 * ....
simon 0:1014af42efd9 539 * sum += x[srcALen-2] * y[0] + x[srcALen-1] * y[1]
simon 0:1014af42efd9 540 * sum += x[srcALen-1] * y[0]
simon 0:1014af42efd9 541 */
simon 0:1014af42efd9 542
simon 0:1014af42efd9 543 /* In this stage the MAC operations are decreased by 1 for every iteration.
simon 0:1014af42efd9 544 The count variable holds the number of MAC operations performed */
simon 0:1014af42efd9 545 count = srcBLen - 1u;
simon 0:1014af42efd9 546
simon 0:1014af42efd9 547 /* Working pointer of inputA */
simon 0:1014af42efd9 548 pSrc1 = pIn1 + (srcALen - (srcBLen - 1u));
simon 0:1014af42efd9 549 px = pSrc1;
simon 0:1014af42efd9 550
simon 0:1014af42efd9 551 /* Working pointer of inputB */
simon 0:1014af42efd9 552 py = pIn2;
simon 0:1014af42efd9 553
simon 0:1014af42efd9 554 /* -------------------
simon 0:1014af42efd9 555 * Stage3 process
simon 0:1014af42efd9 556 * ------------------*/
simon 0:1014af42efd9 557
simon 0:1014af42efd9 558 while(blockSize3 > 0u)
simon 0:1014af42efd9 559 {
simon 0:1014af42efd9 560 /* Accumulator is made zero for every iteration */
simon 0:1014af42efd9 561 sum = 0.0f;
simon 0:1014af42efd9 562
simon 0:1014af42efd9 563 /* Apply loop unrolling and compute 4 MACs simultaneously. */
simon 0:1014af42efd9 564 k = count >> 2u;
simon 0:1014af42efd9 565
simon 0:1014af42efd9 566 /* First part of the processing with loop unrolling. Compute 4 MACs at a time.
simon 0:1014af42efd9 567 ** a second loop below computes MACs for the remaining 1 to 3 samples. */
simon 0:1014af42efd9 568 while(k > 0u)
simon 0:1014af42efd9 569 {
simon 0:1014af42efd9 570 /* Perform the multiply-accumulates */
simon 0:1014af42efd9 571 /* sum += x[srcALen - srcBLen + 1] * y[0] */
simon 0:1014af42efd9 572 sum += *px++ * *py++;
simon 0:1014af42efd9 573 /* sum += x[srcALen - srcBLen + 2] * y[1] */
simon 0:1014af42efd9 574 sum += *px++ * *py++;
simon 0:1014af42efd9 575 /* sum += x[srcALen - srcBLen + 3] * y[2] */
simon 0:1014af42efd9 576 sum += *px++ * *py++;
simon 0:1014af42efd9 577 /* sum += x[srcALen - srcBLen + 4] * y[3] */
simon 0:1014af42efd9 578 sum += *px++ * *py++;
simon 0:1014af42efd9 579
simon 0:1014af42efd9 580 /* Decrement the loop counter */
simon 0:1014af42efd9 581 k--;
simon 0:1014af42efd9 582 }
simon 0:1014af42efd9 583
simon 0:1014af42efd9 584 /* If the count is not a multiple of 4, compute any remaining MACs here.
simon 0:1014af42efd9 585 ** No loop unrolling is used. */
simon 0:1014af42efd9 586 k = count % 0x4u;
simon 0:1014af42efd9 587
simon 0:1014af42efd9 588 while(k > 0u)
simon 0:1014af42efd9 589 {
simon 0:1014af42efd9 590 /* Perform the multiply-accumulates */
simon 0:1014af42efd9 591 sum += *px++ * *py++;
simon 0:1014af42efd9 592
simon 0:1014af42efd9 593 /* Decrement the loop counter */
simon 0:1014af42efd9 594 k--;
simon 0:1014af42efd9 595 }
simon 0:1014af42efd9 596
simon 0:1014af42efd9 597 /* Store the result in the accumulator in the destination buffer. */
simon 0:1014af42efd9 598 *pOut = sum;
simon 0:1014af42efd9 599 /* Destination pointer is updated according to the address modifier, inc */
simon 0:1014af42efd9 600 pOut += inc;
simon 0:1014af42efd9 601
simon 0:1014af42efd9 602 /* Update the inputA and inputB pointers for next MAC calculation */
simon 0:1014af42efd9 603 px = ++pSrc1;
simon 0:1014af42efd9 604 py = pIn2;
simon 0:1014af42efd9 605
simon 0:1014af42efd9 606 /* Decrement the MAC count */
simon 0:1014af42efd9 607 count--;
simon 0:1014af42efd9 608
simon 0:1014af42efd9 609 /* Decrement the loop counter */
simon 0:1014af42efd9 610 blockSize3--;
simon 0:1014af42efd9 611 }
simon 0:1014af42efd9 612
simon 0:1014af42efd9 613 }
simon 0:1014af42efd9 614
simon 0:1014af42efd9 615 /**
simon 0:1014af42efd9 616 * @} end of Corr group
simon 0:1014af42efd9 617 */
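
Reading the three stages of the listing above together, the output layout appears to reduce to one formula: with M = max(srcALen, srcBLen), c[n] = sum over k of a[k] * b[k - (n - (M - 1))] for n = 0, 1, ..., 2M - 2, where terms whose b index falls outside the sequence count as zero. The following is a minimal sketch of that direct form, intended only as a cross-check and not as the CMSIS implementation. The name ref_correlate_f32 is hypothetical, plain float stands in for float32_t, and both lengths are assumed to be at least 1. Unlike the listing, the sketch writes all 2M - 1 output samples, including the trailing zeros when srcALen < srcBLen.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference helper (not part of CMSIS): direct
 * O(aLen * bLen) correlation using the same output layout as the
 * listing above. dst must hold 2 * max(aLen, bLen) - 1 floats. */
static void ref_correlate_f32(const float *a, size_t aLen,
                              const float *b, size_t bLen,
                              float *dst)
{
    /* Assumption of this sketch: empty inputs are not supported. */
    assert(aLen > 0u && bLen > 0u);

    size_t maxLen = (aLen > bLen) ? aLen : bLen;
    size_t outLen = 2u * maxLen - 1u;

    for (size_t n = 0; n < outLen; n++) {
        /* The lag runs from -(maxLen - 1) to +(maxLen - 1). */
        long lag = (long)n - (long)(maxLen - 1u);
        float sum = 0.0f;

        for (size_t k = 0; k < aLen; k++) {
            /* Index into b; out-of-range terms contribute zero. */
            long j = (long)k - lag;
            if (j >= 0 && (size_t)j < bLen) {
                sum += a[k] * b[j];
            }
        }
        dst[n] = sum;
    }
}
```

For a = {1, 2, 3} and b = {4, 5} this sketch produces {0, 5, 14, 23, 12}. Swapping the two arguments produces the reversed sequence {12, 23, 14, 5, 0}, which matches the CORR(x, y) = Reverse order(CORR(y, x)) remark in the listing.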