grin's various stupids of doom [brains needed]

Place to talk about all that new hardware and decaying software you have.

Moderator: General Mods

Post Reply
grinvader
ZSNES Shake Shake Prinny
Posts: 5632
Joined: Wed Jul 28, 2004 4:15 pm
Location: PAL50, dood !

grin's various stupids of doom [brains needed]

Post by grinvader »

It's about code, but not zsnes (or any other emu) code so tech talk should work.

I'd like other programmers' approach on how to do integer log10s with basic ops.

Right now, there's the classic

Code: Select all

uint8_t log10(uint64_t input)
{
  uint8_t out = 255; // since log10(MAXUINT64) == 19, log10(0) == 255 is fine enough

  for (; input; out++) { input /= 10; } // bleh, div sucks

  return (out);
}
but it doesn't exactly reach the level of badass I'd like.

Any ideas ?

(computing the log from the integral of 1/x is slower than this snippet, obviously)

[edit: aside from doing a binary search for the heaviest decimal digit to get O(log n) instead of O(n), I'm stumped]
皆黙って俺について来い!!

Code: Select all

<jmr> bsnes has the most accurate wiki page but it takes forever to load (or something)
Pantheon: Gideon Zhi | CaitSith2 | Nach | kode54
creaothceann
Seen it all
Posts: 2302
Joined: Mon Jan 03, 2005 5:04 pm
Location: Germany
Contact:

Post by creaothceann »

vSNES | Delphi 10 BPLs
bsnes launcher with recent files list
grinvader
ZSNES Shake Shake Prinny
Posts: 5632
Joined: Wed Jul 28, 2004 4:15 pm
Location: PAL50, dood !

Post by grinvader »

Code: Select all

IntegerLogBase2(v)
Instant fail. Next ?

(they use multiply AND a lookup instead of plain bit search, which is retarded for log2)
皆黙って俺について来い!!

Code: Select all

<jmr> bsnes has the most accurate wiki page but it takes forever to load (or something)
Pantheon: Gideon Zhi | CaitSith2 | Nach | kode54
creaothceann
Seen it all
Posts: 2302
Joined: Mon Jan 03, 2005 5:04 pm
Location: Germany
Contact:

Post by creaothceann »

Well, you can always do a benchmark. Maybe they are faster? :wink:
vSNES | Delphi 10 BPLs
bsnes launcher with recent files list
grinvader
ZSNES Shake Shake Prinny
Posts: 5632
Joined: Wed Jul 28, 2004 4:15 pm
Location: PAL50, dood !

Post by grinvader »

I'd take the second code (the ? nesting) over that one any day.

It's such a pain when branchless code is slower than branching code...
皆黙って俺について来い!!

Code: Select all

<jmr> bsnes has the most accurate wiki page but it takes forever to load (or something)
Pantheon: Gideon Zhi | CaitSith2 | Nach | kode54
funkyass
"God"
Posts: 1128
Joined: Tue Jul 27, 2004 11:24 pm

Post by funkyass »

If you tried that on a 386 you'd say differently.
Does [Kevin] Smith masturbate with steel wool too?

- Yes, but don’t change the subject.
byuu

Post by byuu »

grinvader wrote:It's such a pain when branchless code is slower than branching code...
I hate it, too. But all of our bit-manipulation tricks are often ending up even slower than conditionals due to stronger and stronger branch-prediction units.
Francis64
Lurker
Posts: 160
Joined: Mon Jun 02, 2008 9:31 am
Location: UK

Post by Francis64 »

Maybe this take a few nanoseconds off?

Code: Select all

out = 0;

while(input /= 10) {
   ++out;
}

return out;
spiller
JSNES Developer
JSNES Developer
Posts: 43
Joined: Sun Mar 15, 2009 11:09 pm
Location: Ireland

Post by spiller »

The bottleneck in that algorithm isn't the branching; it's the division. Since each loop iteration has a conditional anyway, scrap the division and do out each level explicitly:

Code: Select all

uint8_t log10(uint64_t input)
{
  
  if (input == 0) return 255;
  if (input < 10) return 0;
  if (input < 100) return 1;
  if (input < 1000) return 2;
  if (input < 10000) return 3;
  if (input < 100000) return 4;
  if (input < 1000000) return 5;
  if (input < 10000000) return 6;
  if (input < 100000000) return 7;
  if (input < 1000000000) return 8;
  if (input < 10000000000ULL) return 9;
  if (input < 100000000000ULL) return 10;
  if (input < 1000000000000ULL) return 11;
  if (input < 10000000000000ULL) return 12;
  if (input < 100000000000000ULL) return 13;
  if (input < 1000000000000000ULL) return 14;
  if (input < 10000000000000000ULL) return 15;
  if (input < 100000000000000000ULL) return 16;
  if (input < 1000000000000000000ULL) return 17;
  if (input < 10000000000000000000ULL) return 18;
  return 19;
  
}
I did a benchmark of this, if you'll forgive the Windows-specific timer code I pulled from another file:

Code: Select all


#include <iostream>
#include <windows.h>

using namespace std;

typedef unsigned long long uint64_t;
typedef unsigned char uint8_t;

static uint64_t TimerFreq;

uint64_t Timer() {
	if (TimerFreq) {
		uint64_t t;
		QueryPerformanceCounter((LARGE_INTEGER*)&t);
		return t / TimerFreq;
	} else {
		return GetTickCount();
	}
}


#if 1
uint8_t log10(uint64_t input)
{
  uint8_t out = 255; // since log10(MAXUINT64) == 19, log10(0) == 255 is fine enough

  for (; input; out++) { input /= 10; } // bleh, div sucks

  return (out);
}

#else
uint8_t log10(volatile uint64_t input)
{
  
  if (input == 0) return 255;
  if (input < 10) return 0;
  if (input < 100) return 1;
  if (input < 1000) return 2;
  if (input < 10000) return 3;
  if (input < 100000) return 4;
  if (input < 1000000) return 5;
  if (input < 10000000) return 6;
  if (input < 100000000) return 7;
  if (input < 1000000000) return 8;
  if (input < 10000000000ULL) return 9;
  if (input < 100000000000ULL) return 10;
  if (input < 1000000000000ULL) return 11;
  if (input < 10000000000000ULL) return 12;
  if (input < 100000000000000ULL) return 13;
  if (input < 1000000000000000ULL) return 14;
  if (input < 10000000000000000ULL) return 15;
  if (input < 100000000000000000ULL) return 16;
  if (input < 1000000000000000000ULL) return 17;
  if (input < 10000000000000000000ULL) return 18;
  return 19;
  
}
#endif



int main() {
	
	if (QueryPerformanceFrequency((LARGE_INTEGER*)&TimerFreq)) {
		TimerFreq /= 1000;
	} else {
		TimerFreq = 0;
	}
	
	
	uint64_t t1 = Timer();
	uint64_t checksum = 0;
	
	for (int i = 0; i < 100000000; i++) {
		checksum += log10(0);
		checksum += log10(1);
		checksum += log10(5);
		checksum += log10(10);
		checksum += log10(20);
		checksum += log10(50);
		checksum += log10(100);
		checksum += log10(1250);
		checksum += log10(5000);
		checksum += log10(100000);
		checksum += log10(932840234);
		checksum += log10(1209831902830ULL);
		checksum += log10(12093123898298ULL);
		checksum += log10(459485998398929ULL);
		checksum += log10(30424083249889324ULL);
		checksum += log10(3248204849028409023ULL);
		checksum += log10(18446744073709551615ULL);
	}
	
	
	t1 = Timer() - t1;
	cout << "Checksum: " << checksum << "\n";
	cout << "Done in " << t1 << " milliseconds\n";
		
	
	
}
The volatile declaration in the benchmark is necessary to stop it computing the results at compile time and finishing in zero milliseconds.

Change the #if 1 to switch between the two algorithms.

With the division algorithm it took about ~7300 milliseconds on my machine. (Division on the few really large numbers seemed to get exponentially slow, I guess because of whatever algorithm G++ had to use to approximate it for 32-bit.)

The division-free algorithm returned exactly the same checksum result, and a time of about ~500 milliseconds. It might not be faster on a system with proper 64-bit division though.
grinvader
ZSNES Shake Shake Prinny
Posts: 5632
Joined: Wed Jul 28, 2004 4:15 pm
Location: PAL50, dood !

Post by grinvader »

@Francis64: log(0) = 0 ? nice dood.
spiller wrote:The bottleneck in that algorithm isn't the branching; it's the division. Since each loop iteration has a conditional anyway, scrap the division and do out each level explicitly:

Code: Select all

uint8_t log10(uint64_t input)
{
  
  if (input == 0) return 255;
  if (input < 10) return 0;
  if (input < 100) return 1;
  if (input < 1000) return 2;
  if (input < 10000) return 3;
  if (input < 100000) return 4;
  if (input < 1000000) return 5;
  if (input < 10000000) return 6;
  if (input < 100000000) return 7;
  if (input < 1000000000) return 8;
  if (input < 10000000000ULL) return 9;
  if (input < 100000000000ULL) return 10;
  if (input < 1000000000000ULL) return 11;
  if (input < 10000000000000ULL) return 12;
  if (input < 100000000000000ULL) return 13;
  if (input < 1000000000000000ULL) return 14;
  if (input < 10000000000000000ULL) return 15;
  if (input < 100000000000000000ULL) return 16;
  if (input < 1000000000000000000ULL) return 17;
  if (input < 10000000000000000000ULL) return 18;
  return 19;
  
}
Which is exactly like the ? nested thing. That doesn't mean it's bad, btw.
The goal is tricky - badass code doesn't always mean fastest. It's kind of a code style that blows brains up.
Trading speed for entertainment value, some would say. >:3

Anyway, likelihood-ordered ? nesting, one-line div loop, bit O(log n) scan * constant + offset or just plain digit O(log n) scan, the badass gauge isn't going anywhere.
Conclusion: log sucks.

Thanks for the input everybody.
皆黙って俺について来い!!

Code: Select all

<jmr> bsnes has the most accurate wiki page but it takes forever to load (or something)
Pantheon: Gideon Zhi | CaitSith2 | Nach | kode54
funkyass
"God"
Posts: 1128
Joined: Tue Jul 27, 2004 11:24 pm

Post by funkyass »

Code: Select all

x = 782588
num = 10
power = 0

while( x > num):
    num = (num << 1) +(num << 3)
    power += 1

print power
Does [Kevin] Smith masturbate with steel wool too?

- Yes, but don’t change the subject.
Francis64
Lurker
Posts: 160
Joined: Mon Jun 02, 2008 9:31 am
Location: UK

Post by Francis64 »

funkyass wrote:

Code: Select all

x = 782588
num = 10
power = 0

while( x > num):
    num = (num << 1) +(num << 3)
    power += 1

print power
Problem: does it work for x > 2^63 ?
funkyass
"God"
Posts: 1128
Joined: Tue Jul 27, 2004 11:24 pm

Post by funkyass »

Yes.

Mind you, that's in python.
Does [Kevin] Smith masturbate with steel wool too?

- Yes, but don’t change the subject.
sinamas
Gambatte Developer
Gambatte Developer
Posts: 157
Joined: Fri Oct 21, 2005 4:03 pm
Location: Norway

Post by sinamas »

bah

Code: Select all

template<unsigned pow> struct Pow10    { enum { R = Pow10<pow-1>::R * 10ULL }; };
template<>             struct Pow10<0> { enum { R = 1ULL                    }; };

unsigned log10(const uint_fast64_t in) {
#define POW10M1_4(n) Pow10<n>::R - 1, Pow10<n+1>::R - 1, Pow10<n+2>::R - 1, Pow10<n+3>::R - 1
#define POW10M1_8(n) POW10M1_4(n), POW10M1_4(n+4)
	static const uint_fast64_t cmpt[] = { POW10M1_8(0), POW10M1_8(8), POW10M1_4(16), ~0ULL, ~0ULL, ~0ULL, ~0ULL, ~0ULL };
#undef POW10M1_8
#undef POW10M1_4
	
	unsigned i = 16;
	unsigned out = 0;
	
	do {
		if (in > cmpt[out + i])
			out = out + i;
	} while (i >>= 1);
	
	return in ? out : 0xFF;
}
tukuyomi
Rookie
Posts: 39
Joined: Mon Aug 02, 2004 5:14 am
Contact:

Post by tukuyomi »

I don't know if it is possible, nor optimized but here is my idea:
why don't you convert your number to a string and count the number of characters - 1?
I know nothing in C, but in basic (sorry, I only know this language :p), it gives:

Code: Select all

PRINT LEN(STR$(your_value))-1
creaothceann
Seen it all
Posts: 2302
Joined: Mon Jan 03, 2005 5:04 pm
Location: Germany
Contact:

Post by creaothceann »

Converting to a string already includes the loop and the division from grin's first post, so it wouldn't give any speed benefits.
vSNES | Delphi 10 BPLs
bsnes launcher with recent files list
sinamas
Gambatte Developer
Gambatte Developer
Posts: 157
Joined: Fri Oct 21, 2005 4:03 pm
Location: Norway

Post by sinamas »

slightly moar evil / less suck

Code: Select all

template<unsigned pow> struct Pow10    { enum { R = Pow10<pow-1>::R * 10ULL }; };
template<>             struct Pow10<0> { enum { R = 1ULL                    }; };

unsigned log10(const uint_fast64_t in) {
#define POW10M1_4(n) Pow10<n>::R - 1, Pow10<n+1>::R - 1, Pow10<n+2>::R - 1, Pow10<n+3>::R - 1
#define POW10M1_8(n) POW10M1_4(n), POW10M1_4(n+4)
	static const uint_fast64_t cmpt[] = { POW10M1_8(0), POW10M1_8(8), POW10M1_4(16), ~0ULL, ~0ULL, ~0ULL, ~0ULL };
#undef POW10M1_8
#undef POW10M1_4
	
	unsigned i = 16;
	unsigned out = 0U - 1U;
	
	do {
		if (in > cmpt[out + i])
			out = out + i;
	} while (i >>= 1);
	
	return out;
}
Rashidi
Trooper
Posts: 515
Joined: Fri Aug 18, 2006 2:45 pm

Post by Rashidi »

so LUT is the speediest implementation?
sinamas
Gambatte Developer
Gambatte Developer
Posts: 157
Joined: Fri Oct 21, 2005 4:03 pm
Location: Norway

Post by sinamas »

If you want speed, you should at least unroll it.

Code: Select all

unsigned log10(const uint_fast64_t in) {
#define POW10M1_4(n) Pow10<n>::R - 1, Pow10<n+1>::R - 1, Pow10<n+2>::R - 1, Pow10<n+3>::R - 1
#define POW10M1_8(n) POW10M1_4(n), POW10M1_4(n+4)
	static const uint_fast64_t cmpt[] = { POW10M1_8(0), POW10M1_8(8), POW10M1_4(16) };
#undef POW10M1_8
#undef POW10M1_4
	
	unsigned out = in > Pow10<7>::R - 1 ? 7 : 0U - 1U;
	
	if (in > cmpt[out + 5]) out = out + 5;
	if (in > cmpt[out + 4]) out = out + 4;
	if (in > cmpt[out + 2]) out = out + 2;
	if (in > cmpt[out + 1]) out = out + 1;
	
	return out;
}
This ugly blob may be pretty fast too:

Code: Select all

unsigned log10(const uint_fast64_t in) {
	if (in < Pow10<10>::R) {
		if (in < Pow10<5>::R) {
			if (in < Pow10<3>::R) {
				if (in < Pow10<1>::R)
					return in < Pow10<0>::R ? -1 : 0;
				
				return in < Pow10<2>::R ? 1 : 2;
			}
			
			return in < Pow10<4>::R ? 3 : 4;
		}
		
		if (in < Pow10<8>::R)
			return in < Pow10<7>::R ? (in < Pow10<6>::R ? 5 : 6) : 7;
		
		return in < Pow10<9>::R ? 8 : 9;
	}
	
	if (in < Pow10<15>::R) {
		if (in < Pow10<13>::R)
			return in < Pow10<12>::R ? (in < Pow10<11>::R ? 10 : 11) : 12;
		
		return in < Pow10<14>::R ? 13 : 14;
	}
	
	if (in < Pow10<18>::R)
		return in < Pow10<17>::R ? (in < Pow10<16>::R ? 15 : 16) : 17;
	
	return in < Pow10<19>::R ? 18 : 19;
}
Which solution is faster in the average case depends on the probability distribution of the numbers. These mainly aim for decent worst case performance.
AamirM
Regen Developer
Regen Developer
Posts: 533
Joined: Sun Feb 17, 2008 8:01 am
Contact:

Post by AamirM »

I guess I am a bit late :P . You want bad-ass, eh? Try the following, 64-bit only:

Code: Select all

uint8_t log10(uint64_t input)
{
	uint8_t out = 255;

	for (; input; out++)
	{
		input = (input * 0xCCCCCCCCCCCCCCCD) >> 67;
	}

	return (out);
}
For 32-bit (means max answer should be 9):

Code: Select all

uint8_t log10(uint64_t input)
{
	uint8_t out = 255;

	for (; input; out++)
	{
		input = (input * 0xCCCCCCCD) >> 35;
	}

	return (out);
}
Can someone compile and test the 64-bit one please? (I haven't checked that)

EDIT:
Ok, this should be fastest. I am 99% sure this can be extended to 64-bit using RAX, RCX etc..

Using MSVC calling convention, first param is in ecx. (Yes, I use MSVC :P)

Code: Select all

unsigned int __fastcall log10(unsigned int number)
{
	_asm
	{
		bsr ecx, ecx
		mov eax, 1233
		add ecx, 1
		mul ecx
		shr eax, 12
	}
}
Return is in eax (again using MSVC convention). People with brains will get how to adapt to other conventions. :P

Bad ass now grinvader? :D
Exophase
Hazed
Posts: 77
Joined: Mon Apr 28, 2008 10:54 pm

Post by Exophase »

AamirM wrote:Using MSVC calling convention, first param is in ecx. (Yes, I use MSVC :P)

Code: Select all

unsigned int __fastcall log10(unsigned int number)
{
	_asm
	{
		bsr ecx, ecx
		mov eax, 1233
		add ecx, 1
		mul ecx
		shr eax, 12
	}
}
Return is in eax (again using MSVC convention). People with brains will get how to adapt to other conventions. :P

Bad ass now grinvader? :D
Congratulations, it took this long to rewrite verbatim PART of the suggestion given in the very first reply. It seems to me to be the best approach in this thread, unfortunately you missed half of it in your code (meaning you wouldn't get a precise answer).

grinvader; You were incorrect to dismiss this function just because you saw IntegerLogBase2 then assumed that it must use a multiply/LUT just because that was the only implementation you read. If you would have looked further above you would have seen several other implementations, including the "plain bit search" you mentioned, which they called the obvious way. Of course, they wouldn't have given the multiply/LUT method you dismissed as "retarded" if it weren't faster in any situations.

Nonetheless, many platforms implement counting bits in hardware (either FPU and/or integer instruction) and by these days it's usually a constant time operation so of course you can put for IntegerLog2 something very fast.

By the way, GCC also has a builtin for this operation. Check out _builtin_clz - now you won't have to rely on x86 ASM. Use it instead of IntegerLogBase2 in the example given.

Also, you usually shouldn't use sub-int types for locals (and function parameters, and return values) unless you specifically need for the value to not exceed sub-int range. Actual end-result will depend on platform but it'll often be slower.

Finally, we can't really tell what kind of performance characteristics the algorithm will have when we don't know what kind of input you'll be giving it. If you're not mainly using large numbers then the if chain could win, although there are probably combinations/compromises you can use between the constant time and log10(N) time algorithms.
grinvader
ZSNES Shake Shake Prinny
Posts: 5632
Joined: Wed Jul 28, 2004 4:15 pm
Location: PAL50, dood !

Post by grinvader »

Exophase wrote:Congratulations, it took this long to rewrite verbatim PART of the suggestion given in the very first reply. It seems to me to be the best approach in this thread, unfortunately you missed half of it in your code (meaning you wouldn't get a precise answer).
We kinda covered that in IRC.
grinvader; You were incorrect (...)
This code isn't satisfying. It didn't fill my (very demanding, I admit) request, hence I was correct in dismissing it. 'Incorrect' is not the right qualifier for this situation. You want 'hard to please' or similar.
Finally, we can't really tell what kind of performance characteristics the algorithm will have when we don't know what kind of input you'll be giving it. If you're not mainly using large numbers then the if chain could win, although there are probably combinations/compromises you can use between the constant time and log10(N) time algorithms.
'Any'. As in, any repartition. From 0 to max, from 0 to less than max, from almost max to max, from A to B that are neither 0 nor max and far apart, close apart, with normal distribution, with average distribution, whatever.
Which is why there was no such indication to begin with.

You're welcome to spend more time on this, but I'm not exactly expecting anything.
皆黙って俺について来い!!

Code: Select all

<jmr> bsnes has the most accurate wiki page but it takes forever to load (or something)
Pantheon: Gideon Zhi | CaitSith2 | Nach | kode54
AamirM
Regen Developer
Regen Developer
Posts: 533
Joined: Sun Feb 17, 2008 8:01 am
Contact:

Post by AamirM »

We kinda covered that in IRC.
Yep, we (me and grinvader) were looking for some other (read: badass :P) way to "fix" it (it works in most cases). Unfortunately, I couldn't find any way so I left it where it was.
Exophase
Hazed
Posts: 77
Joined: Mon Apr 28, 2008 10:54 pm

Post by Exophase »

I guess in that case you can make it very slightly more badass with very slightly better x86 ASM:

Code: Select all

unsigned int __fastcall log10(unsigned int number)
{
   _asm
   {
      bsr ecx, ecx
      inc ecx
      imul eax, ecx, 1233
      shr eax, 12
   }
}
grinvader
ZSNES Shake Shake Prinny
Posts: 5632
Joined: Wed Jul 28, 2004 4:15 pm
Location: PAL50, dood !

Re: grin's various stupids of doom [brains needed]

Post by grinvader »

heya doods, i come to you with my latest random bullshit from hell


here's a little logic block that obviously demands a bitchslap

Code: Select all

      t3 | false | true
---------+-------+-------
t1 > t2  |   A   |   A
t1 < t2  |       | B,C,D
t1 == t2 |  B,A  | B,C,E
t1 and t2 are uint64, t3 is a bool.
A B C D and E are independent code statements unrelated to each other (i.e. no using one to have the other + 1 or anything silly like that).
In the same vein, t3 is unrelated to both t1 and t2, and the empty spot in the table [if t1 < t2 then there is no possible way for t3 to be false] only stems from maths being what they are.


Are you a bad enough dood to dropkick the fool ?


Edit: fixed the last entry, stupidly forgot that one

Deadit: lawl never mind doods, i pwned it with sheer awesome
皆黙って俺について来い!!

Code: Select all

<jmr> bsnes has the most accurate wiki page but it takes forever to load (or something)
Pantheon: Gideon Zhi | CaitSith2 | Nach | kode54
Post Reply