It's interesting to me that he refers to association arrays as "newish", when th...

jerf · on July 17, 2024

Two things to keep in mind: Pre-internet, it was quite easy to be skilled and knowledgeable and yet never have been exposed to something another community somewhere would consider basic, or for it to take many many years to percolate across. I remember getting into programming even in the 1990s and I can tell you a lot of the C world would still consider associative data structures like a hash table something to be a bit exotic and only to be pulled out if you had no other choice. It was very easy to not be very up to date on these things.

Second, while the programming world fancies itself a very fast mover, it really isn't, and it especially wasn't back then. 20 years isn't necessarily as long as you would think in 2024. We're still trying to convince people in 2024 that you really don't need a language that will happily index outside of arrays, and how long has that been a staple in existing languages? (You may think that's silly, but we've had that fight on HN, yes, yea verily, this very year. Granted, this is winding down, but it's still a live argument.)

There's a lot of things that showed up in some language in the 1960s or 1970s and wasn't widely adopted for another few decades.

(Though I suspect one problem with Lisp is that back in the day, its performance was pretty awful relative to the other things of the day. It was not always the best advertisement for its features. A bit too ahead of its time sometimes.)

vram22 · on July 17, 2024

>I remember getting into programming even in the 1990s and I can tell you a lot of the C world would still consider associative data structures like a hash table something to be a bit exotic and only to be pulled out if you had no other choice. It was very easy to not be very up to date on these things.

That's a little surprising, considering that the classic first book about C, K&R was first published in 1978 and again in 1988, the first and second edition, respectively.

One of the editions, maybe the first, had a simple hash table implementation, that just summed up the ASCII values of the characters in the given string, modulo some value, and used that as the key for the string being hashed.

.

https://en.m.wikipedia.org/wiki/The_C_Programming_Language

xarope · on July 18, 2024

I remember being taught about associative caches in CPU architecture, and later on programming the same in C. So I'm not sure about the 90's and hash tables being "exotic".

My first real use of hashes was in perl4; with perl5 released in 1996 (https://dev.perl.org/perl5/news/), surely it meant people were also well aware of the usefulness of hash arrays even back then.

cancerhacker · on July 18, 2024

It was a common practical interview question in 1996 to ask a candidate to implement a hash table in C.

satiated_grue · on July 24, 2024

I recall implementing hash tables in COBOL on a Univac. Not even kidding a little.

kragen · on July 18, 2024

the picture you're painting here is absurdly far from reality

— kernighan definitely did not think hash tables were 'a bit exotic' in 01978 —

hash tables (or sometimes other associative retrieval structures) were standard fare in computer science degree programs and were a standard feature of compilers, assemblers, and linkers, of which kernighan had written several at that point; associative retrieval structures are also fundamental to databases and filesystems. in 01979, 7th edition unix (in which kernighan published awk) barely had date formatting but included an on-disk hash table library called 'dbm'

hash tables are the exclusive topic of pp. 506–549 of knuth's taocp volume 3, a standard textbook which came out in 01973. after recommending you some hash functions, on p. 512 (525/739) knuth recommends you check out a survey of hash functions published in cacm by lum, yuen, and dodd in 01971, and he cites papers on hashing in cacm as early as 01959 and in an ibm journal as early as 01957. in the 'history' section (pp. 540–542 (553–555/739)) he traces it back to an internal ibm report by luhn in 01953, and its first report in the open literature to a paper by dumey in 01956. most of the rest of the 739-page book is about how to implement associative arrays of various kinds, although some parts instead concern the string search problem, and the first half is about sorting, and a sorted table can be used for purposes other than associative retrieval

knuth also mentions a 'very influential survey' of hashing published in cacm in 01968 by robert morris, kernighan's coworker at bell labs, and one of the very first users of unix and a major contributor to it

but perhaps kernighan hadn't read cacm? on the contrary, he and his colleagues regularly published in it; he published in cacm once in 01972, once in 01973, and once in 01975, and his colleagues such as stephen johnson, mike lesk, al aho (his coauthor on awk), and, as i said above, robert morris, published there regularly as well

in the practice of programming, kernighan and pike claim that almost all of the time, the only data structures you need are arrays, linked lists, trees, and hash tables (though they wrote this in 01999)

what kernighan is talking about here is including them as a basic language construct, a design choice his 01978 tech report on awk likens to that of snobol4, which was a well-known language but not a mainstream one. in the context of the above cacm articles, you can perhaps see that the daring choice was to offer a single 'good enough' hash table implementation rather than one optimized for a given purpose—and to use it even in place of conventional arrays

— kernighan was not isolated like you were —

kernighan was at bell labs, which was probably the most connected place in the software world at the time

when awk was released in 01979 the internet was already ten years old, and at&t (the parent company of bell labs) had installed all its long-distance links

in 01976 they'd also developed their own 'internet' called uucp, which at the time of awk's release was interconnecting 80 different unix sites, interchanging software updates on a daily basis, over 300-baud and 1200-baud modems. they were also in contact with ken thompson's alma mater berkeley; thompson took a sabbatical year to teach there in 01975. berkeley got a darpa contract to add tcp/ip support to the bell labs operating system 'unix', which would become the majority of the internet during the 01980s; berkeley participated actively in the uucp network

bell labs was also working on the so-called 'advanced communication system', its own misconceived version of the internet, where data would be routed using phone numbers, because data communications had become the fastest-growing source of call traffic. (colin berkshire has written a fantastic overview of this forgotten history.)

jerf · on July 18, 2024

"Your comments are ridiculous! One of the smartest, most connected people in the world was smart and connected! And papers were being published all over the place!"

The computer world of the time was not clones of Kernighan a thousand times over, nor was everyone reading every single paper any more than people do today. You can't prove the normal world of computing was all aware of all these things by pointing at the very pinnacle of it. I'm talking about the real world, not the best of the best of the best.

abecedarius · on July 18, 2024

But, like, as a teen in the 80s by the time I learned C from Kernighan & Ritchie (not an exotic path back then), where a hashmap was one of the examples and the subject of a couple of the problems, it wasn't remotely new to me. The real world had plenty of others learning these things from Byte magazine and such even with no CS degree.

brianmurphy · on July 21, 2024

The big difference was in C you had to implement it yourself. Choosing a hash function and managing memory with malloc() was a hang up for a lot of people. It was quite common to say hashes and the various tree data structures weren't worth the complication if you didn't absolutely need to have a hash table or tree.

In later languages like perl, the syntax for hashes was built into the language.

That was a huge step forward.

Perl was slowly adopted for other reasons (illegible write once syntax). Other popular scripting languages like PHP adopted hashes from perl. This greatly helped make associative arrays common knowledge among the majority of programmers.

kragen · on July 19, 2024

also dr. dobbs. https://archive.org/details/dr_dobbs_journal_vol_03/page/n29... is a presentation of 'sam76', an 8080 reimplementation of mcilroy & morris's m6, probably built around an efficient associative array of defined macros. and assemblers built around symbol tables were commonplace at the time, too

kragen · on July 19, 2024

in the interview, kernighan said, 'The main idea in Awk was associative arrays, which were newish at the time, but which now show up in most languages either as library functions (hashmaps in Java or C++) or directly in the language (dictionaries in Perl and Python). Associative arrays are a very powerful construct, and can be used to simulate lots of other data structures.'

zambyte expressed surprise that kernighan referred to 'associative arrays' as 'newish' and said that lisp had had them 20 years before. what lisp had 20 years before (presumably they meant in https://www-formal.stanford.edu/jmc/recursive.pdf in 01959, not 01957, when mccarthy was trying to figure out how to add linked lists to fortran) was this line of code:

    assoc[x; y] = eq[caar[y]; x] → cadar[y]; T → assoc[x; cdr[y]]

which we can translate to python as

    assoc = lambda x, y: y[0][1] if y[0][0] == x else assoc(x, y[1:])

and which allows you to use a list of 2-tuples such as [('red', '#f00'), ('green', '#0f0'), ('white', '#fff')] as an associative array

you replied, saying, 'it was (...) easy to be (...) knowledgeable and yet never have been exposed (...) in the 1990s (...) I can tell you a lot of the C world would still consider associative data structures like a hash table something to be a bit exotic (...). It was very easy to not be very up to date on these things.'

that comment is only relevant to zambyte's comment—to which you posted it as a reply—if you had the implicit assumption that what kernighan meant by 'associative arrays' was 'associative data structures like a hash table'. but associative data structures like hash tables were not newish; hash tables in particular were 22 years old in 01979, and were already the most commonplace of all data structures except for strings and arrays. i listed 13 articles in cacm in 01975 and 01976 about associative data structures. cacm only published about 6 articles per issue (excluding the news and opinion departments), so that's roughly 10% of everything they published

you say 'you can't prove the normal world of computing was all aware of all these things', but the everyday world of computing was, if anything, even more preoccupied with associative data structures than the academic world. ibm's most lucrative product was ims, an online database, which is to say, a large on-disk associative data structure. filesystems (including dectapes), variable namespaces in interpreters and compilers, symbol tables (in assemblers, linkers, and compilers), and databases are associative arrays. even in 01979 it was already hard to find programmers who did not use any of filesystems, assemblers, interpreters, or databases (though they did exist)

so it is completely implausible that what kernighan was describing as 'newish' (in 01979) was associative data structures like a hash table. i'm not doubting you when you say you knew c programmers in the 90s who considered them 'exotic', perhaps the province of the wizards who wrote compilers and filesystems, but those c programmers were not 'the normal world of computing' or 'experienced and knowledgeable'; they were beginners

the 01978 edition of k&r presents binary search (using a sorted array as an efficient associative array) in chapter 3 on p. 54 (62/236), before presenting such advanced language features as the 'break' statement, '#define', structs, non-integer function return values, and pointers. in chapter 6 on p. 125 (133/235), they use such an associative array to count the number of times each c keyword occurs in the input. §6.5 (pp. 130–134 (138–142/235)) demonstrates a more scalable associative array using a binary search tree. §6.6 (pp. 134–136 (142–144/235)) gives another program using a hash table

so for k&r in 01978, hash tables were more basic than features such as unions, typedef, string formatting (%.10s, etc.), opening files, and listing directories, all of which are relegated to later in the book

please don't tell me 'a lot of the c world' in the 01990s hadn't heard of k&r. next you're going to be telling me a lot of the c world hadn't heard of c

getting back to the interview, what kernighan was evidently referring to was not associative array data structures in general, but having a general-purpose associative data structure library always at hand. lisp's assoc (or assq) isn't general-purpose because it's too slow for large associative data structures. lisp plists aren't general-purpose because they leak memory (you can't remove a property from all atoms, and each atom you put a property on continues to occupy memory even if you remove the property from it) and because they get too slow if you have more than a few properties. ims isn't general-purpose because you have to go talk to the database administrator to create a new file; you can't create a temporary one in ram. the example hash table in k&r isn't general-purpose because it uses a static array, a fixed size of 100 hash buckets, has no deletion, associates specifically a string value with each (string) key, and is quite inefficient; it's a reasonable starting point, but you have to tweak it for your application

java hashmaps, awk arrays, and python dicts are general-purpose as long as your data fits in ram. (you could gripe that awk arrays aren't first-class objects and can only have string and numeric keys, but that's just because awk doesn't have first-class objects at all in the lisp sense.)

precisely as a result of this, hashing (like sorting) has ceased to be a beginner-programmer topic; if you need an associative array or a sorted list, you can get quite far just by invoking the built-in facilities of your programming language. sooner or later you will probably want one optimized for some specific application, but it's no longer something you have to do all the time. but to any working programmer in the 01970s, it would have seemed absurd to propose linear searches or hash tables were 'newish'

Zambyte · on July 17, 2024

I totally get the first point, but this comment from Brian was made when the internet had been available for many, many years. It seems more like it felt newish at the time, and that is how he has seen it since then. I guess that's more why I find it to be an interesting comment.

Regarding the second point, when I hear the term "associative arrays" or "associative lists" I think of them in a pretty high level of abstraction. Like really: a C struct is an association abstraction. A simple array of structs is not often considered an "associative array", but it can easily be used for the same things.

Maybe it would be more accurate to say that the way they were using associative arrays way newish, rather than the construct itself.

chasil · on July 17, 2024

This is from the first edition free PDF online of The AWK Programming Language:

"Associative arrays were inspired by SNOBOL4 tables, although they are not as general. Awk was born on a slow machine with a small memory, and the properties of arrays were a result of that environment. Restricting subscripts to be strings is one manifestation, as is the restriction to a single dimension (even with syntactic sugar). A more general implementation would allow multi-dimensional arrays, or at least allow arrays to be array elements."

Stable release - SNOBOL4 / 1967

https://en.wikipedia.org/wiki/SNOBOL

shawn_w · on July 17, 2024

Common Lisp, of course, has built in hash tables. MacLisp doesn't seem to - I didn't see anything about them in the Pitmanual. ZetaLisp/LML and InterLisp have them. So it seems like built in lisp associative arrays (I'm not counting alists, or property lists, though the latter goes back to the early days and maybe should be) are roughly contemporary to the birth of awk. Maybe a bit newer.

kragen · on July 17, 2024

in a sense lisp alists are 'associative arrays', sure. but, although they were used in the metacircular interpreter that bootstrapped lisp and still inspires us today, they aren't a language feature, and the lisp ones aren't efficient enough for the things you use them for in awk

you can of course build an associative array in any language, especially if you're satisfied with the linear search used by lisp alists. you can hack one together in a few minutes almost without thinking:

            .intel_syntax noprefix
    lookup: push rbx
            push rbp
            mov rbx, rdi    # pointer to alist node pointer
            mov rbp, rsi    # key
        2:  mov rdi, [rbx]  # load pointer to alist node
            test rdi, rdi   # is it null?
            jnz 3f          # if not null, skip not-found case
            mov rax, rbx    # pointer to null pointer is return value
            xor rdx, rdx    # but associated value is null (sets zero flag)
            jmp 4f          # jump to shared epilogue
        3:  mov rdi, [rdi]  # load pointer from key field of alist node
            mov rsi, rbp    # reload key from callee-saved register
            call comkey     # sets zero flag if rsi and rdi are equal keys
            jne 1f          # skip following return-value case code if !=
            mov rax, rbx    # pointer to node pointer is return value
            mov rdx, [rbx]  # also let’s follow that pointer to the node
            mov rdx, [rdx + 16]    # and return its value field too in rdx
            test rax, rax   # clear zero flag (valid pointer is not null)
        4:  pop rbp
            pop rbx
            ret
        1:  mov rbx, [rbx]  # load pointer to alist node again
            add rbx, 8      # load pointer to next-node pointer field
            jmp 2b

(untested, let me know if you find bugs)

but that's very different from having them built in as a language feature, like snobol4, mumps, awk, perl, python, tcl, lua, and js do. the built-in language feature lisp had for structuring data 20 years earlier was cons, car, and cdr, which is not at all the same thing

other programs on unix that contained associative arrays prior to awk included the linker (for symbols), the c compiler, the assembler, the kernel (in the form of directories in the filesystem), and the shell, for shell variables. none of these had associative arrays as a programming language feature, though. the bourne shell came closest in that you could concatenate variable names and say things like

     eval myarray_$key=$val

here's the implementation of binary tree traversal for setting shell variables in the v7 bourne shell from 01979

    NAMPTR          lookup(nam)
            REG STRING      nam;
    {
            REG NAMPTR      nscan=namep;
            REG NAMPTR      *prev;
            INT             LR;

            IF !chkid(nam)
            THEN    failed(nam,notid);
            FI
            WHILE nscan
            DO      IF (LR=cf(nam,nscan->namid))==0
                    THEN    return(nscan);
                    ELIF LR<0
                    THEN    prev = &(nscan->namlft);
                    ELSE    prev = &(nscan->namrgt);
                    FI
                    nscan = *prev;
            OD

            /* add name node */
            nscan=alloc(sizeof *nscan);
            nscan->namlft=nscan->namrgt=NIL;
            nscan->namid=make(nam);
            nscan->namval=0; nscan->namflg=N_DEFAULT; nscan->namenv=0;
            return(*prev = nscan);
    }

if you're not sure, that's c (now you know where the ioccc came from)

foobarian · on July 17, 2024

I love that Bourne abused C macros to make sh source look like Algol. Lots of references to this elsewhere but this is my favorite:

https://www.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd/sh...

kazinator · on July 18, 2024

The "killer feature" of alists is that they are persistent data structures; when we extend a shorter alist to make a longer one, the shorter one is unaffected. Moreover, if the program has only a reference to the tail portion of a longer alist, the head portion can be garbage-collected.

With this, we can correctly implement lexical closures: we simply take the current environment (head of the alist) ssociate it with the lambda body and arguments, and we are done.

kragen · on July 18, 2024

yes! also when you only have two or three things in them, they are faster than hashing, especially on machines without a data cache. clojure uses hamts for this, thus getting this fp-persistence† as well as efficient lookup in large dictionaries, and chris wellons posted a concise, fast c hamt implementation last year at https://nullprogram.com/blog/2023/09/30/

______

† 'persistent' unfortunately also means 'stored in nonvolatile memory', so i feel the need to disambiguate

chuckadams · on July 18, 2024

I’d say plists have a stronger claim on being “real” associative arrays. Lisp never made much distinction between “built in” and “library” to begin with, you're using the same syntax to use the feature either way.

kragen · on July 19, 2024

i think it's true that, much of the time that you'd use a dict in python, you'd use a property in 01960s lisp, because it let you take advantage of the read-time implicit hashing of symbols; you probably only had one or two properties on most of your symbols, so getting from the interned symbol to the property was a matter of a handful of instructions---probably faster than a hash function. but conceptually i feel like they're pretty different ideas