Hacker News | annowiki's comments

How do you get around 403/401's from WSJ/Reuters/Axios? Because I've tried user agent manipulation and it seems like I'd have to use selenium and headless to deal with them.


I have noticed that sometimes you also need an "Accept: text/html" header.
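A minimal sketch of sending a browser-like User-Agent together with an Accept header, using only the Python standard library. The header values and the function names are assumptions for illustration, and nothing here is guaranteed to get past a 401/403. Header changes like this do not alter the TLS handshake, so they would not help if the block is based on TLS fingerprinting.

```python
import urllib.request

# Hypothetical header values for illustration; real sites may check more than this.
BROWSER_LIKE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
}


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request carrying browser-like headers (nothing is sent yet)."""
    return urllib.request.Request(url, headers=BROWSER_LIKE_HEADERS)


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Send the request; urlopen raises urllib.error.HTTPError on 401/403."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Calling fetch("https://example.com/") would perform the request; build_request alone is enough to inspect which headers would be sent.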


If curl-impersonate works, it's probably TLS fingerprinting.


> Another area is with software we’ve had to build (instead of buy). When we started out, we strongly preferred buying software over building it because a team of only a few engineers can’t afford the time cost of building everything. That was the right choice at the time even though the “buy” option generally gives you tools that don’t work. In cases where vendors can’t be convinced to fix showstopping bugs that are critical blockers for us, it does make sense to build more of our own tools and maintain in-house expertise in more areas, in contradiction to the standard advice that a company should only choose to “build” in its core competency. Much of that complexity is complexity that we don’t want to take on, but in some product categories, even after fairly extensive research we haven’t found any vendor that seems likely to provide a product that works for us. To be fair to our vendors, the problem they’d need to solve to deliver a working solution to us is much more complex than the problem we need to solve since our vendors are taking on the complexity of solving a problem for every customer, whereas we only need to solve the problem for one customer, ourselves.

This is more and more my philosophy. I've been working on a data science project with headline scraping (I want to do topic modeling on headlines during the course of the election) and kept preferring roll-your-own solutions to off-the-shelf ones.

For instance, instead of using Flask (as I did in a previous iteration of this project a few years ago) I went with Jinja2 and rolled my own static site generator. For scraping I used Scrapy on my last project; on this one I wrote my own queue and scraper class. It works fantastically.
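To give an idea of how small a hand-rolled queue and scraper class can be, here is a rough sketch. It is not the actual code from the project: the class name, the injected fetch and parse callables, and the error handling are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Set


class HeadlineScraper:
    """Tiny FIFO crawl queue; `fetch` and `parse` are injected (hypothetical) callables."""

    def __init__(self, fetch: Callable[[str], str], parse: Callable[[str], Iterable[str]]):
        self.fetch = fetch
        self.parse = parse
        self.queue: Deque[str] = deque()
        self.seen: Set[str] = set()

    def enqueue(self, url: str) -> None:
        # Skip URLs that were already queued so each page is fetched once.
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def run(self) -> Dict[str, List[str]]:
        """Drain the queue and return the parsed headlines per URL."""
        results: Dict[str, List[str]] = {}
        while self.queue:
            url = self.queue.popleft()
            try:
                results[url] = list(self.parse(self.fetch(url)))
            except Exception as exc:  # keep going if one source fails
                results[url] = []
                print(f"failed to scrape {url}: {exc}")
        return results
```

Passing the fetcher and parser in as arguments keeps the class testable without network access, which is one of the things a framework would otherwise decide for you.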


Actually it seems pretty accurate. Novelty-seeking is a well known phenomenon in curious individuals. https://en.wikipedia.org/wiki/Novelty_seeking

Literally getting dopamine rewards for seeing something new is what keeps people glued to TikTok feeds and Twitter.

I tend to get bored halfway through a book if it is predictable.


This is not really "sieve-ing" per the article, but what prevents me from running another process that periodically queries the data in a cache? For example, a Celery queue in Python that continually checks the cache for out-of-date information and updates it? Is there a word for this? Is this a common technique?


I think this is not as simple, because to achieve good metrics (latency, cache hit) you will need to be predicting the actual incoming query load, which is quite hard. Letting the query load itself set the values is the state of the art.

In some ways, caching can be seen as a prediction problem. And the cache hit rate is the error as we lag the previous history at time T. Blending load over time is effectively what these various cache algorithms do to avoid overfitting.


If you have an idea of what you need to cache or can fit everything into the cache it's extremely effective.

Though potentially, just refreshing out-of-date data in the cache could increase effectiveness, given that the general assumption of a cache is that what's in it will probably be used again.

I called it a periodically refreshing cache when I wrote one. Not sure if there is a more formal name.
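A rough sketch of what a periodically refreshing cache could look like, assuming a loader callable that fetches a fresh value for a key. The class name, the max_age parameter, and the injectable clock are made up for illustration; this is not the implementation mentioned above.

```python
import threading
import time
from typing import Any, Callable, Dict, Tuple


class PeriodicallyRefreshingCache:
    """Cache whose stale entries are reloaded by a background sweep, not by readers."""

    def __init__(self, loader: Callable[[Any], Any], max_age: float,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader
        self._max_age = max_age
        self._clock = clock
        self._lock = threading.Lock()
        self._entries: Dict[Any, Tuple[Any, float]] = {}  # key -> (value, loaded_at)

    def _load(self, key: Any) -> Any:
        value = self._loader(key)
        with self._lock:
            self._entries[key] = (value, self._clock())
        return value

    def get(self, key: Any) -> Any:
        # Readers always get the cached value if there is one, even if it is old.
        with self._lock:
            entry = self._entries.get(key)
        return entry[0] if entry is not None else self._load(key)

    def refresh_stale(self) -> int:
        """Reload every entry older than max_age; returns how many were reloaded."""
        now = self._clock()
        with self._lock:
            stale = [key for key, (_, loaded_at) in self._entries.items()
                     if now - loaded_at >= self._max_age]
        for key in stale:
            self._load(key)
        return len(stale)

    def start_background_refresh(self, interval: float,
                                 stop_event: threading.Event) -> threading.Thread:
        """Run refresh_stale() every `interval` seconds until stop_event is set."""
        def loop() -> None:
            while not stop_event.wait(interval):
                self.refresh_stale()

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread
```

The sweep only touches keys that are already cached, so it keeps hot data fresh but does nothing to predict which new keys will be requested.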


You might call that prefetching. That's what unbound calls it when it returns a near expired cached entry to a client and also contacts upstream to update its cache. I remember having a similar option in squid, but it might have been only in an employer's branch (there were a lot of nice extensions that unfortunately didn't make it upstream)
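The prefetching behaviour described here (serve the near-expired entry, and refresh it at the same time) could be sketched roughly as follows. This is not how unbound or squid implement it: the class, the prefetch_fraction parameter and its default are assumptions, and the refresh runs inline for simplicity where a real resolver would do it asynchronously.

```python
import time
from typing import Any, Callable, Dict, Tuple


class PrefetchingCache:
    """TTL cache that refreshes an entry when it is read close to its expiry."""

    def __init__(self, loader: Callable[[Any], Any], ttl: float,
                 prefetch_fraction: float = 0.1,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader
        self._ttl = ttl
        # Entries with less than this much lifetime left are refreshed on read.
        self._prefetch_window = ttl * prefetch_fraction
        self._clock = clock
        self._entries: Dict[Any, Tuple[Any, float]] = {}  # key -> (value, expires_at)

    def _store(self, key: Any) -> Any:
        value = self._loader(key)
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def get(self, key: Any) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is None or now >= entry[1]:
            return self._store(key)  # miss or expired: load and return fresh data
        value, expires_at = entry
        if expires_at - now <= self._prefetch_window:
            self._store(key)  # near expiry: refresh, but still serve the old value
        return value
```

Unlike a periodic sweep, this only refreshes keys that are actually being read, so entries nobody asks for are left to expire.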


You're describing cache maintenance (and cache eviction), a practice for which there are many algorithms (FIFO, LRU, LFU, etc.), including the algorithm the article describes (SIEVE)
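For anyone unfamiliar with these eviction policies, here is a minimal sketch of one of them, LRU, built on the standard library's OrderedDict. It is LRU only, not the SIEVE algorithm from the article, and the class and method names are chosen for illustration.

```python
from collections import OrderedDict
from typing import Any, Hashable


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key when full."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable, default: Any = None) -> Any:
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a read marks the key as most recently used
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # drop the least recently used key
```

Eviction decides what to throw away when space runs out; it says nothing about whether the values that remain are still up to date.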


I think this is orthogonal to cache maintenance and cache eviction. Instead this is having a background process periodically refreshing the data in the cache to keep it hot.


Refreshing the cache to keep it hot, and deciding how to do it, e.g. which parts do we do with the caching layer directly, which parts do we do with an external process, what to evict to make room, etc, are subtopics of cache management.

If I understand you correctly, you're asking if this is different because an external process is involved. I don't see a use in drawing a distinction, and as far as I know, there's no special term for that pattern.

Update: after looking into it, it looks like this cache/architecture pattern is called "refresh-ahead"


There's a piece of graffiti from the May 1968 Paris riots that goes "I have something to say but I don't know what." Reminds me of this. I can't find the original French but I think it's something like "J'ai quelque chose à dire mais je ne sais quoi."

Found it: https://www.dicocitations.com/citations/citation-67489.php

They use the "pas".


This is a great historical slogan, thanks for bringing it up.

There is a big difference between "... je ne sais quoi" and "... je ne sais pas quoi".

"... je ne sais pas quoi" means "I don't know what". The slogan is indeed "I have something to say but I don't know what".

"je ne sais quoi" is a weird French construction which is actually a noun. It takes a masculine determiner and means "a little something".

Il manque à ce plat un petit je ne sais quoi = this meal is missing a little something


Ah, nice! I'll compile a page of related projects on the website (with all the interesting links from this discussion), and that will go on it, too, of course! Thanks!


I started as a python programmer and was very used to package managers. I believed in them, I championed them. When I switched to C++ for work I was very disheartened that there wasn't a standard.

Conan obviously has promise, I haven't spent much time with it, most of my experience with C++ package managers is with nuget and vcpkg. However, my attitude toward package managers is changing.

I increasingly like _not_ using package managers because it makes me (and my company) way way way less likely to bloat our software with unnecessary third party dependencies.

I wrote this in another thread: I never believed you should write something yourself if you can find a package for it. My boss told me I should write it all myself, I could probably write it to be faster. I encountered a case where I needed to compare version numbers in python. For the heck of it I wrote the simplest, quickest, most naive solution I could come up with and then timed it against the most recommended version comparison package in python. I blew it away by 20x throughput.
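To give an idea of what the "simplest, most naive" approach can look like, here is a sketch that compares purely numeric dotted versions as tuples of integers. It is not the code from that benchmark; the function names are made up, and it deliberately ignores pre-release tags, build metadata, and anything non-numeric.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.10.0' into (1, 10, 0); raises ValueError on non-numeric parts."""
    return tuple(int(part) for part in version.split("."))


def compare_versions(a: str, b: str) -> int:
    """Return -1, 0 or 1 depending on whether a is older than, equal to or newer than b."""
    left, right = parse_version(a), parse_version(b)
    # Tuples compare element by element, so (1, 10, 0) > (1, 9, 3).
    return (left > right) - (left < right)
```

One consequence of the tuple comparison is that "1.0" sorts before "1.0.0", which a full-featured package would probably treat as equal. That kind of edge case is part of what the extra code in a library pays for.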

I don't believe in package managers anymore. Obviously I'll keep using pip and sqlalchemy in Python, but I'll happily spend the 20-30 minutes it takes adding something like nlohmann-json or md4c to my project over worrying about maintaining a package manager for c++ these days. Precisely because it makes me think twice about adding another dependency.


Sometimes you have dependencies with actual value add that you really don't want to replicate. No, I'm totally not writing a yaml parser, thank you very much. I can probably write a good yaml parser, possibly even better than some 3rd party stuff, but yaml parsing is simply not our business.

And yaml parsing is probably on the simpler side of things. We need to run torch models, we do need libtorch. We are not rewriting libtorch, that would be silly.


Yeah, this is something that bugs me about the rust ecosystem. Just to use a random number generator you need to pull in like 15 dependencies. In just a simple learning project that would have had about 1 dependency in c++, I ended up with like 75 for rust. I guess I'm old, but that seems like madness. Cargo being easy and simple is not all upside.


If you just want random numbers, the getrandom crate has only three direct dependencies, one of which is the libc library bindings. If you don’t need everything that’s in rand, you don’t have to use rand.


A Programmer's Introduction to Mathematics https://pimbook.org/

It introduces math from a mathematician's point of view (complete with proofs, etc.) rather than rote memorization and exercises, but it does so from the perspective of a programmer.


You have me thinking about kind of a cool board idea. 150 person twitter boards. Cap it at 150. People in that group can all vote on their own moderation, they can't interact with groups in other boards through quote tweeting or voting, though obviously they can copy paste.

You might get racist boards, but then it's easy to get rid of all of them at once.

150 being https://en.wikipedia.org/wiki/Dunbar%27s_number of course.

I have no way to distribute anything. I tried to do my own annotations board on literature but no one joined. I just think it sounds cool to be in a personable board like that.


Inventing some protocol around the Dunbar number is interesting.

There was something similar in Weatherford's book on Genghis Khan [1][2]. This system was described as very effective for communicating with and coordinating the huge military.

> In Genghis Khan's military system, a tumen was recursively built from units of 10 (aravt), 100 (zuut) and 1,000 (mingghan), each with a leader reporting to the next higher level.

Note: I am not aware of how good the Weatherford book is; it felt one-sided to me. So I am not sure how good the civic system that depended on the tumen was in the Mongol era.

[1] https://en.wikipedia.org/wiki/Mingghan

[2] https://en.wikipedia.org/wiki/Tumen_(unit)#Genghis_Khan's_or...


Pretty much every effective military in history has had this kind of hierarchical structure, by both imitation and convergent evolution. I'm sure there's a post on https://acoup.blog/ about it.


Makes sense that effective militaries are well-organized like this. Militaries have conducted big engineering projects for civic purposes all through history. I don't remember the exact anecdote, but I found a wiki page [1] with a quick Google search.

I will check out this blog; maybe it has some posts about non-military initiatives also.

[1] https://en.wikipedia.org/wiki/Category:United_States_Army_Co...


Dunbar’s number is discredited reactionary nonsense, see wengrow/graeber research


Is there some research on a law of 5? As in, 5 is the maximum group size for which a member can still work out all the connections and permutations within the group?

For example, in a group of 3 (me, Bill and Alice) I can model Bill's view of me and Alice alone, me and Alice together, me, Alice and Bill together, etc.

Beyond a certain number it's not really possible.
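One way to put numbers on this, under the assumption that "permutations" here means every sub-group a member might have to keep track of: a group of n people has 2^n - 1 non-empty sub-groups. The sketch below just enumerates them; it illustrates the growth and is not evidence for any particular limit. The function names are made up.

```python
from itertools import combinations
from typing import Iterable, List, Tuple


def subgroups(people: Iterable[str]) -> List[Tuple[str, ...]]:
    """Every sub-group containing at least one person."""
    members = list(people)
    return [combo
            for size in range(1, len(members) + 1)
            for combo in combinations(members, size)]


def subgroup_count(n: int) -> int:
    """Number of non-empty sub-groups in a group of n people."""
    return 2 ** n - 1
```

With three people that is 7 sub-groups, with five it is 31, and the count doubles (plus one) with every person added.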


no


As far as I can tell, the vast majority of the scientific community still considers it valid.

I also discovered in searching that you're talking about David Graeber. I recently read his "Bullshit Jobs" book because someone on here cited it. It was one of the worst books I've ever read. It was clearly a contrived political manifesto (I suppose for "anarchy") with the thinnest veneer of popular science wrapped around it. I think ancient aliens probably got more anthropology correct.

So if you're going to appeal to an authority instead of actually transmitting the argument yourself then David Graeber seems like probably one of the worst you could pick to cite.


oh so you didn’t read the research. googled one of the non graeber papers for ya since you decided you didn’t like the guy for his politics https://twitter.com/davidwengrow/status/1116786595351470080?... you can find the full contents on scihub. I trust you can overcome your appeal to the authority of popularly ingrained opinions


> You might get racist boards, but then its easy to get rid of all of them at once.

You don't have to shut them down, you know. The British Government did this all throughout the 1970s to 1990s, where pubs (and later online services) where Republican terrorists hung out were very much left alone. They could have swooped in and scooped the lot up, but they didn't.

Because if they ever did want to scoop them all up, they knew exactly where to look, and why would you disturb that?


I've always thought that was the right way to handle it -- allow people to self-express, however abhorrent they may be. For the worst offenders, dedicate resources to make sure no harm is done (for example, monitor these watering-wells for any activity indicative of planning a terrorist attack, etc).


It also turns into a game of whack-a-mole. You ban the boards and they just immediately come back under a different name.


But that is also a recipe for echo chambers.

Anyway, from a technical point of view, this is what Mastodon instances already can offer.


You can't fight echo chamber effect with technical measures, forget about it.


When I run ls I don't want to see all the configuration files. Just my files. I think that's the point of hidden files.

> One more example. Imagine if you have a project and want to edit an .env file. But as dotfiles are hidden in Linux you don't see this file and cannot open it.

How likely is it that someone is going to want to edit a .env file and not know how to view hidden files?
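For reference, the convention itself is tiny: a name counts as hidden if it starts with a dot, and ls shows such names when given -a or -A. A sketch of the same default in Python, with a made-up function name:

```python
import os
from typing import List


def list_dir(path: str = ".", show_hidden: bool = False) -> List[str]:
    """List directory entries, skipping dotfiles unless show_hidden is set.

    os.listdir never returns '.' or '..', so show_hidden=True behaves
    like `ls -A` rather than `ls -a`.
    """
    names = sorted(os.listdir(path))
    if show_hidden:
        return names
    return [name for name in names if not name.startswith(".")]
```

Calling list_dir(project_dir, show_hidden=True) on a project directory would include a .env file if one exists there; project_dir is a placeholder for whatever path is being inspected.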


> When I run ls I don't want to see all the configuration files. Just my files.

Well then configuration files can simply not be in the same directory as your files. Easy fix.


Your user doesn't have permissions to write to any directory that isn't yours. It doesn't mean configs have to be in your "documents" directory, but they do have to be in with your files!


The comment I was responding to was talking about the utility of files being hidden.

But they don't need to be hidden by being special, they can just be "hidden" by living in a subdirectory that you don't look inside.


It's usually not so much "a range of wheels optimized for specific roles" and more "one size fits all."

You want a wheel that fits your cart. You have a lathe. Why make do with a wheel someone else made, standardized, and sells to fit a wide range of carts when you can make the perfect wheel for _your_ cart?

Most of the functionality in libraries uses standard algorithms anyway. I doubt anyone thinks it's a good idea to write your own cryptography or markdown processor, but why do you need a library to left pad a string with zeroes?

if statements and flags take time to process. You only have one way you need to do something. Do you want to take that much more compute just checking flags that will never change so you can get the library function to do what you always want it to do?
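To make the left-pad example concrete, here is a sketch contrasting a function that does the one thing needed with a hypothetical general-purpose variant that re-checks its options on every call. Both functions are made up for illustration and are not taken from any real library; whether the extra checks cost anything noticeable would need measuring.

```python
def left_pad_zeroes(text: str, width: int) -> str:
    """Left-pad with '0' up to width; longer strings are returned unchanged."""
    missing = width - len(text)
    return "0" * missing + text if missing > 0 else text


def pad(text: str, width: int, fill: str = "0", side: str = "left",
        truncate: bool = False) -> str:
    """Hypothetical general-purpose variant: every call validates its flags."""
    if len(fill) != 1:
        raise ValueError("fill must be a single character")
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    if truncate and len(text) > width:
        return text[:width]
    padding = fill * max(width - len(text), 0)
    return padding + text if side == "left" else text + padding
```

In Python specifically, the built-in "42".rjust(5, "0") already covers the simple case, so neither a library nor a hand-written function is needed there.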

