c - 128-bit rotation using ARM NEON intrinsics

- July 15, 2010

I'm trying to optimize some code using NEON intrinsics. I have a 24-bit rotation over a 128-bit array (8 elements, each a uint16_t).

Here is the C code:

uint16_t rotated[8];
uint16_t temp[8];
uint16_t j;

for (j = 0; j < 8; j++) {
    // rotation <<< 24 on 128 bits: (x << shift) | (x >> (16 - shift))
    rotated[j] = ((temp[(j+1) % 8] << 8) & 0xffff) | ((temp[(j+2) % 8] >> 8) & 0x00ff);
}

I've checked the GCC documentation on NEON intrinsics, and it doesn't list an instruction for vector rotations. Moreover, I've tried using vshlq_n_u16(temp, 8), but the bits shifted outside each uint16_t word are lost.
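To make the lost-bits problem concrete, here is a small portable sketch of what a lane-wise left shift does. It only models the behaviour in plain C. The helper name shl_u16x8 is made up for this illustration and is not part of arm_neon.h.

```c
#include <stdint.h>

/* Portable model of a lane-wise shift such as vshlq_n_u16(v, n).
 * shl_u16x8 is a hypothetical helper used only for this illustration.
 * Assumes n is in 0..15 and that out does not alias v. */
static void shl_u16x8(uint16_t out[8], const uint16_t v[8], unsigned n)
{
    for (unsigned i = 0; i < 8; i++) {
        /* Each lane is shifted on its own: bits pushed past bit 15 are
         * dropped, they are not carried into a neighbouring lane. */
        out[i] = (uint16_t)(v[i] << n);
    }
}
```

With a lane holding 0x1234 and n = 8, the result lane is 0x3400: the 0x12 byte is gone, which is why a shift alone cannot give a rotation across the whole 128 bits.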

How can I achieve this using NEON intrinsics? By the way, is there better documentation for the GCC NEON intrinsics?

After reading the ARM Community blogs, I found this:

[Image: NEON ARM bitwise rotation]

VEXT: Extract. VEXT extracts a new vector of bytes from a pair of existing vectors. The bytes in the new vector are from the top of the first operand and the bottom of the second operand. This allows you to produce a new vector containing elements that straddle a pair of existing vectors. VEXT can be used to implement a moving window on data from two vectors, which is useful in FIR filters. For permutation, it can also be used to simulate a byte-wise rotate operation, when using the same vector for both input operands.

The following NEON GCC intrinsic does the same as the assembly shown in the picture:
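As a rough illustration of the description above, this is a plain C model of the extract operation on eight 16-bit lanes. It is only a sketch of the semantics and not how the instruction is implemented. The name ext_u16x8 is hypothetical.

```c
#include <stdint.h>

/* Portable model of what vextq_u16(a, b, n) computes, for illustration.
 * ext_u16x8 is a hypothetical helper name, not part of arm_neon.h.
 * Result lane i is lane (i + n) of the 16-lane concatenation
 * a[0..7], b[0..7]. Assumes n is in 0..7 and out aliases neither input. */
static void ext_u16x8(uint16_t out[8], const uint16_t a[8],
                      const uint16_t b[8], unsigned n)
{
    for (unsigned i = 0; i < 8; i++) {
        unsigned k = i + n;                   /* index into a followed by b */
        out[i] = (k < 8) ? a[k] : b[k - 8];   /* top of a, then bottom of b */
    }
}
```

Passing the same array for both a and b turns this into a lane rotation: with lanes {0, 1, 2, 3, 4, 5, 6, 7} and n = 1 the result is {1, 2, 3, 4, 5, 6, 7, 0}.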

uint16x8_t vextq_u16 (uint16x8_t, uint16x8_t, const int)

So the 24-bit rotation on the full 128-bit vector (not on each element) can be done as follows:

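uint16x8_t input;
uint16x8_t t0;
uint16x8_t t1;
uint16x8_t rotated;

t0 = vextq_u16(input, input, 1);   // lane j = input[(j+1) % 8]
t0 = vshlq_n_u16(t0, 8);           // low byte moves to the high byte
t1 = vextq_u16(input, input, 2);   // lane j = input[(j+2) % 8]
t1 = vshrq_n_u16(t1, 8);           // high byte moves to the low byte
rotated = vorrq_u16(t0, t1);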
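For anyone who wants to check this against the original loop, here is a sketch that wraps both versions in functions. The function names rotl128_24_scalar and rotl128_24 are my own and not from any library. The NEON path is only compiled when the compiler defines __ARM_NEON, and I am assuming the input lives in a plain uint16_t[8] that is loaded with vld1q_u16 and stored with vst1q_u16. On other targets the wrapper falls back to the scalar loop.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Scalar reference: the loop from the question (hypothetical function name). */
static void rotl128_24_scalar(uint16_t out[8], const uint16_t in[8])
{
    for (unsigned j = 0; j < 8; j++) {
        out[j] = (uint16_t)(((in[(j + 1) % 8] << 8) & 0xffff) |
                            ((in[(j + 2) % 8] >> 8) & 0x00ff));
    }
}

/* Same rotation using the vext/shift/or sequence when NEON is available.
 * Assumes out and in do not overlap. */
static void rotl128_24(uint16_t out[8], const uint16_t in[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t input = vld1q_u16(in);
    /* lane j = in[(j+1) % 8], low byte moved to the high byte */
    uint16x8_t t0 = vshlq_n_u16(vextq_u16(input, input, 1), 8);
    /* lane j = in[(j+2) % 8], high byte moved to the low byte */
    uint16x8_t t1 = vshrq_n_u16(vextq_u16(input, input, 2), 8);
    vst1q_u16(out, vorrq_u16(t0, t1));
#else
    rotl128_24_scalar(out, in);   /* fallback so the sketch builds anywhere */
#endif
}
```

For the input {0x0011, 0x2233, 0x4455, 0x6677, 0x8899, 0xaabb, 0xccdd, 0xeeff}, both functions should produce {0x3344, 0x5566, 0x7788, 0x99aa, 0xbbcc, 0xddee, 0xff00, 0x1122}. Reading element 0 as the most significant word, that is the 128-bit value rotated left by 24 bits.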
