Discussion: [Vm-dev] String hash function
Clément Bera
2017-04-13 21:19:07 UTC
tim Rowledge
2017-04-13 21:49:04 UTC
This means that if the hash of a 1 MB String is computed it takes a hell of a long time.
I wonder why we don't compute the string hash from a small number of inner characters taken randomly but deterministically from the String. For example, for all strings longer than 5 characters, one could pick only 5 characters in the String to compute its hash.
Yay! We haven’t had a good hash function argument in *ages*!


tim
--
tim Rowledge; ***@rowledge.org; http://www.rowledge.org/tim
Klingon Code Warrior:- 5) "Specs are for the weak and timid!"
Levente Uzonyi
2017-04-13 22:28:56 UTC
Ken.Dickey
2017-04-14 16:09:15 UTC
So the number of different hash values went down from the optimal 10,000 to
100, i.e. 1%. What's even worse is that the problem persists if you take any
small consecutive range; only the ratio changes.
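To see the failure concretely, here is a minimal sketch (a hypothetical
5-point sampling hash written as a block; not any method actually in the
image): 1000 strings that differ only in positions the sample never reads,
all hashing to a single value.

    | sampledHash strings |
    sampledHash := [:s | | h |
        h := s size.
        1 to: s size by: ((s size // 5) max: 1) do: [:i |
            h := (h + (s at: i) asInteger) hashMultiply].
        h].
    "1000 strings of size 20 that vary only at positions 2..4; with step 4
     the sample reads positions 1, 5, 9, 13 and 17, so it never sees the
     difference."
    strings := (0 to: 999) collect: [:i |
        (String new: 20 withAll: $a)
            replaceFrom: 2 to: 4 with: (i printPaddedWith: $0 to: 3) startingAt: 1;
            yourself].
    (strings collect: sampledHash) asSet size  "=> 1: all 1000 strings collide"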

The speed tradeoff seems most acute for large chunks of text.

If the original strategy were used for strings shorter than (say) 50 characters and the "sampling" strategy for longer strings, with the string length included in the hash, then a large chunk of text with one character added would likely not collide.

A large-string app could tune the hash function, as has been suggested, if performance were poor; the tuning would be done by someone likely to be oriented to the problem, while performance would be good in the common case.
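A minimal sketch of that hybrid (hypothetical selector; the threshold and
sample count are the ones suggested above):

    hybridHash: aString
        "Original full scan below the size threshold; above it, sample ~5
         characters and mix in the length, so a large chunk of text with
         one character appended still gets a different hash."
        | size hash |
        size := aString size.
        size < 50 ifTrue: [^ aString hash].
        hash := size hashMultiply.
        1 to: size by: size // 5 do: [:i |
            hash := (hash + (aString at: i) asInteger) hashMultiply].
        ^ hash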
Max Leske
2017-04-14 04:05:50 UTC
Nicolas Cellier
2017-04-14 06:03:31 UTC
Nicolas Cellier
2017-04-14 06:04:47 UTC
Ralph Boland
2017-04-14 06:37:14 UTC
I had an application where I needed to store arrays into sets
and performance depended heavily on this operation.
Standard array classes were unacceptable because hashing was either
too poor (then) or too expensive (now).

So I created my own array class (say HashArray, I forget the name now).
HashArray stored a hash value rather than compute one.
The hash value of an array was the XOR of its elements' hash values
(which in my application were good).
This worked very well but came at a price: each time an element was
added, removed, or changed in a HashArray instance, the hash value had
to be updated (an O(1) cost).
For my application this wasn't a problem, so my solution worked well.
Notes:
1) In my case two arrays containing the same elements should have the
same hash value. Your situation might be different. If so, you could
also add to the hash value some value associated with the array itself.
2) While contained in a set the array couldn't be changed (that didn't
happen in my application), but this is true for most objects anyway,
since an object's hash value cannot change while it is in a set; just
something to be aware of.
3) In my application the elements of the array also stored their hash
values. The hash values were generated using a random number generator.
That could be done in my application but obviously can't always be done.
It gave very good hashing for my arrays.

If you are having performance problems with array classes because of
hashing, perhaps you should try a scheme similar to mine. Whether or
not such a class should be part of Squeak, I don't know. Perhaps there
should be a package of HashArray-like classes.
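A minimal sketch of such a class (hypothetical names, Squeak-style; the
point is that #at:put: maintains the XOR of the elements' hashes
incrementally):

    Object subclass: #HashArray
        instanceVariableNames: 'elements hash'
        classVariableNames: ''
        poolDictionaries: ''
        category: 'Example-HashArray'

    HashArray class >> new: size
        ^ self basicNew setSize: size

    HashArray >> setSize: size
        elements := Array new: size.
        hash := 0

    HashArray >> elements
        ^ elements

    HashArray >> at: index
        ^ elements at: index

    HashArray >> at: index put: anObject
        "The O(1) maintenance step: XOR the old element's hash out, the new one in."
        (elements at: index) ifNotNil: [:old | hash := hash bitXor: old hash].
        elements at: index put: anObject.
        hash := hash bitXor: anObject hash.
        ^ anObject

    HashArray >> hash
        ^ hash

    HashArray >> = other
        ^ self class == other class and: [elements = other elements]

Note that XOR makes the hash independent of element order (exactly note 1
above) and lets pairs of equal elements cancel out, which is one reason the
strong per-element hashes of note 3 matter.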
Martin McClure
2017-04-14 19:05:29 UTC
I wonder why we don't compute the string hash from a small number of
inner characters taken randomly but deterministically from the String.
For example, for all strings longer than 5 characters, one could pick
only 5 characters in the String to compute its hash. The hash function
is this way decent enough and the performance does not drop for large
strings. Instead of

    1 to: stringSize do: [:pos |

the loop would become

    1 to: stringSize by: stringSize // 5 do: [:pos |

in ByteString>>stringHash: aString initialHash: speciesHash.
Then we need to rehash the whole image.
What do you think? Is there a hash expert around who can confirm this
is a good idea?
This kind of thing has been tried before, and failed miserably. One of
the biggest breaking incompatibilities, IIRC, between Java 1 and Java 2
was that they had to change their string hash function to hash all
characters, because their scheme of hashing only a subset of characters
had a huge number of collisions in real-world strings like URLs, which
have a lot of characters in common.

I think it's worth the time to hash all characters.
Andres Valloud
2017-04-14 19:34:39 UTC
IME, being smart about which characters to hash quickly runs into data
sets being smarter about hiding variation in the unsampled characters.

Slower hash functions aren't necessarily bad. Practical experience:

1. Investment bank end-of-day report takes 30 minutes to run. The time
is spent mostly in linear search because of hash collisions. Introduced
a 5x slower hash function that had a collision rate of about 2.5 instead
of 20+. The report now takes 90 seconds, and hashing is nowhere to be
seen in the profiler output. No need to optimize hashing further.

2. VisualWorks string / symbol hashes were also "smart". Replaced with
a multiplicative hash along the lines of the Horner rule, which is
slower for large enough strings. But interning all symbols into a new
symbol table now takes 7x less time, and one of the k-nucleotide-related
Computer Language Shootout benchmarks now runs in half the time.
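For reference, a Horner-style multiplicative string hash looks roughly like
this (illustrative multiplier and mask, not the actual VisualWorks code):

    hornerHash: aString
        "Evaluate the string as a polynomial in the multiplier by Horner's
         rule: h = ((c1 * a + c2) * a + c3) * a + ..., truncated to 28 bits."
        | hash |
        hash := 0.
        aString do: [:char |
            hash := (hash * 31 + char asInteger) bitAnd: 16r0FFFFFFF].
        ^ hash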

Regarding huge strings, surely a 1 MB string has a special meaning that
could be encapsulated within an object, and it is *this* object that
could provide a hash function suitable for the purpose at hand. In
fact, especially in those cases, the returned hash value might not even
need to be based on the 1 MB string at all.
Hi all,
I am trying to investigate performance overhead in benchmarks to improve
the VM performance. In a benchmark, strings are used as dictionary keys,
which seems to be an OK pattern to me that may actually be used in real
code from time to time. Something struck me: _the hash function of a
string actually iterates over the whole string._
#(
'String size: 0, Hash function performance: 17,817,776 per second'
'String size: 1, Hash function performance: 15,395,388 per second'
'String size: 3, Hash function performance: 14,853,109 per second'
'String size: 4, Hash function performance: 15,151,954 per second'
'String size: 5, Hash function performance: 14,139,881 per second'
'String size: 10, Hash function performance: 11,545,749 per second'
'String size: 65, Hash function performance: 3,431,067 per second'
'String size: 130, Hash function performance: 1,879,047 per second'
'String size: 520, Hash function performance: 511,934 per second'
)
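(Figures in this shape can be gathered along these lines, assuming Pharo's
BlockClosure>>#bench, which answers a string such as the ones above; this is
a sketch, not necessarily the script that produced those numbers:)

    #(0 1 3 4 5 10 65 130 520) collect: [:size | | s |
        s := String new: size withAll: $a.
        'String size: ', size printString,
            ', Hash function performance: ', [s hash] bench]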
This means that if the hash of a 1 MB String is computed it takes a hell
of a long time.
I wonder why we don't compute the string hash from a small number of
inner characters taken randomly but deterministically from the String.
For example, for all strings longer than 5 characters, one could pick
only 5 characters in the String to compute its hash. The hash function
is this way decent enough and the performance does not drop for large
strings. Instead of

    1 to: stringSize do: [:pos |

the loop would become

    1 to: stringSize by: stringSize // 5 do: [:pos |

in ByteString>>stringHash: aString initialHash: speciesHash.
Then we need to rehash the whole image.
What do you think? Is there a hash expert around who can confirm this
is a good idea?
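For concreteness, a sketch of the sampled variant (hypothetical selector,
using the image's SmallInteger>>#hashMultiply; unlike the real method it
reads at most ~5 characters and folds in the size):

    sampledStringHash: aString initialHash: speciesHash
        "Hash about 5 evenly spaced characters plus the size, so the cost
         no longer grows with the length of the string."
        | size hash |
        size := aString size.
        hash := (speciesHash bitAnd: 16rFFFFFFF) + size.
        1 to: size by: ((size // 5) max: 1) do: [:pos |
            hash := (hash + (aString basicAt: pos)) hashMultiply].
        ^ hash

As the replies in this thread argue, the speed gain comes at the cost of
collisions on data that varies only in the unsampled positions.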
Ben Coman
2017-04-14 23:12:05 UTC
Andres Valloud
2017-04-14 23:38:04 UTC
If it were up to me, I'd leave that decision to the applications. No
need to optimize unknown future problems. Also keep in mind
applications likely should refine strings by composition and delegation
rather than by inheritance in this particular case. Otherwise,

String
    Base64EncodedMP3File
    PostgresqlDataBlob
    SmalltalkMethodSourceCode

... and it stops making sense quickly.
What are opinions on having a HashedString subclass of String to
optimise by calculating the hash only once?
a. In the standard Image, or leave it specific to applications?
b. Obviously an ivar to store the hash computed at instantiation.
c. Maybe have an ivar storing a custom hash-block? Equality?
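Done by composition, as suggested above, that might look like this minimal
sketch (hypothetical class; the hash is computed once, on a private copy so
it cannot go stale):

    Object subclass: #HashedString
        instanceVariableNames: 'string hash'
        classVariableNames: ''
        poolDictionaries: ''
        category: 'Example-HashedString'

    HashedString class >> on: aString
        ^ self basicNew setString: aString

    HashedString >> setString: aString
        string := aString copy.  "private copy, so the cached hash stays valid"
        hash := string hash

    HashedString >> string
        ^ string

    HashedString >> hash
        ^ hash

    HashedString >> = other
        ^ self class == other class and: [string = other string]

Wrapping rather than subclassing sidesteps the Base64EncodedMP3File problem
above, and the custom hash-block of point c could replace the 'string hash'
send in #setString:.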
tim Rowledge
2017-04-14 23:54:37 UTC
If it were up to me, I'd leave that decision to the applications. No need to optimize unknown future problems.
Exactly; remember the rules for optimising -

1) Don’t.
2) (for experts only) Don’t, *yet*.

tim
--
tim Rowledge; ***@rowledge.org; http://www.rowledge.org/tim
Useful Latin Phrases:- O! Plus! Perge! Aio! Hui! Hem! = Oh! More! Go on! Yes! Ooh!