C - 128-bit rotation using ARM NEON intrinsics
I'm trying to optimize some code using NEON intrinsics. I have a 24-bit rotation over a 128-bit array (8 elements, each a uint16_t).
Here is the C code:
    uint16_t rotated[8];
    uint16_t temp[8];
    uint16_t j;
    for (j = 0; j < 8; j++)
    {
        // rotation <<< 24 over 128 bits, built per word as (x << shift) | (x >> (16 - shift))
        rotated[j] = ((temp[(j+1) % 8] << 8) & 0xffff) | ((temp[(j+2) % 8] >> 8) & 0x00ff);
    }

I've checked the GCC documentation on NEON intrinsics, and it doesn't have an instruction for vector rotations. Moreover, I've tried using vshlq_n_u16(temp, 8), but all the bits shifted outside a uint16_t word are lost.
How can I achieve this using NEON intrinsics? By the way, is there better documentation on the GCC NEON intrinsics?
After reading the ARM Community blogs, I've found this:

VEXT: Extract. VEXT extracts a new vector of bytes from a pair of existing vectors. The bytes in the new vector are from the top of the first operand and the bottom of the second operand. This allows you to produce a new vector containing elements that straddle a pair of existing vectors. VEXT can be used to implement a moving window on data from two vectors, which is useful in FIR filters. For permutation, it can also be used to simulate a byte-wise rotate operation when the same vector is used for both input operands.
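As a rough illustration of the lane selection described above, here is a portable C model of VEXT on eight 16-bit lanes. The name vext_u16_model is made up for this sketch and is not a NEON intrinsic; it only mimics what vextq_u16(a, b, n) is documented to return.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of vextq_u16(a, b, n):
 * lanes n..7 of a, followed by lanes 0..n-1 of b.
 * out must not alias a or b. */
static void vext_u16_model(const uint16_t a[8], const uint16_t b[8],
                           int n, uint16_t out[8])
{
    /* The real intrinsic requires a compile-time constant in 0..7. */
    assert(n >= 0 && n < 8);

    for (int i = 0; i < 8; i++) {
        int k = i + n;                       /* index into the 16-lane pair a:b */
        out[i] = (k < 8) ? a[k] : b[k - 8];  /* top of a, then bottom of b */
    }
}
```

With a = {0, 1, ..., 7} and b = {8, 9, ..., 15}, n = 3 gives {3, 4, 5, 6, 7, 8, 9, 10}. Passing a as both operands with n = 1 gives {1, 2, 3, 4, 5, 6, 7, 0}, which is a rotation by one lane.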
The following NEON GCC intrinsic does the same thing as the assembly shown in the picture:
    uint16x8_t vextq_u16 (uint16x8_t, uint16x8_t, const int)

So the 24-bit rotation on the full 128-bit vector (not on each element) can be done as follows:
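    uint16x8_t input;
    uint16x8_t t0;
    uint16x8_t t1;
    uint16x8_t rotated;

    t0 = vextq_u16(input, input, 1);   // t0[j] = input[(j+1) % 8]
    t0 = vshlq_n_u16(t0, 8);           // low byte of each lane moves to the high byte
    t1 = vextq_u16(input, input, 2);   // t1[j] = input[(j+2) % 8]
    t1 = vshrq_n_u16(t1, 8);           // high byte of each lane moves to the low byte
    rotated = vorrq_u16(t0, t1);       // combine the two halves of every lane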
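To check the sequence above against the scalar loop from the question, it can be wrapped in a function that loads from and stores to plain arrays. This is only a sketch: the names rotl24_ref, rotl24_vec and rotl24_check are made up here, the load/store calls (vld1q_u16, vst1q_u16) are my addition, and the #else branch is a scalar stand-in for the same three steps so the file also builds on machines without NEON. As in the question, element 0 is treated as the most significant word.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Scalar reference: the loop from the question. */
static void rotl24_ref(const uint16_t in[8], uint16_t out[8])
{
    for (int j = 0; j < 8; j++) {
        out[j] = (uint16_t)(((in[(j + 1) % 8] << 8) & 0xffff) |
                            ((in[(j + 2) % 8] >> 8) & 0x00ff));
    }
}

/* Vector version: VEXT picks the neighbouring lanes, the shifts place the bytes. */
static void rotl24_vec(const uint16_t in[8], uint16_t out[8])
{
#if HAVE_NEON
    uint16x8_t input = vld1q_u16(in);
    uint16x8_t t0 = vshlq_n_u16(vextq_u16(input, input, 1), 8);
    uint16x8_t t1 = vshrq_n_u16(vextq_u16(input, input, 2), 8);
    vst1q_u16(out, vorrq_u16(t0, t1));
#else
    /* Assumed scalar stand-in for the three NEON steps (no NEON available). */
    uint16_t t0[8], t1[8];
    for (int j = 0; j < 8; j++) {
        t0[j] = (uint16_t)(in[(j + 1) % 8] << 8);  /* ext by 1, shift left 8  */
        t1[j] = (uint16_t)(in[(j + 2) % 8] >> 8);  /* ext by 2, shift right 8 */
    }
    for (int j = 0; j < 8; j++)
        out[j] = (uint16_t)(t0[j] | t1[j]);
#endif
}

/* Aborts via assert if the two versions disagree on the given input. */
static void rotl24_check(const uint16_t in[8])
{
    uint16_t expect[8], got[8];
    rotl24_ref(in, expect);
    rotl24_vec(in, got);
    for (int j = 0; j < 8; j++)
        assert(expect[j] == got[j]);
}
```

For the input {0x0011, 0x2233, 0x4455, 0x6677, 0x8899, 0xaabb, 0xccdd, 0xeeff}, the scalar loop yields {0x3344, 0x5566, 0x7788, 0x99aa, 0xbbcc, 0xddee, 0xff00, 0x1122}: every byte has moved three positions towards element 0, with the first three bytes wrapping around to the end.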