
> Even if you subscribed to them all you'd still not have everything

And even if you could get all the video itself, it's not guaranteed you'd get the right video+audio+subtitles combination that you want, as everything seems to be negotiated separately.

So while one service could offer the right audio and the right video but not the subtitles you want, another service could have the right video and the right subtitles but instead be dubbed without original audio.

It became a whole mess, and eventually it was simpler for even the slightly technical consumers to just resort to piracy again.


Man, jellyfin is shockingly absolutely killer for subtitles. I don't remember if it's a plugin or built in, but there's a subtitle search option that cross-references your video's filename into some database that usually gives you a workable set of subs.

Plus it respects your options to default subs on or off, in a language you choose, in a style you like to see. I don't think any streaming services do it this well honestly


Plex also has this


Kodi has this as well. I wouldn’t be surprised if similar functionality was also in plex/emby too


"Some database", could it be opensubtitles.org ? Sigh, if I were them I'd be annoyed that my work is hidden away behind the words "some database".


The plugin does name them, but that's the nature of work so good it becomes invisible: you don't actually see it unless you already know (or care/need to look into it).


> The cost of creating new computers has got to be pretty high to the environment

But aren't those made regardless of whether the people with old computers upgrade to them or not? I guess over time they'll make fewer if people buy less, but the ones we'd purchase today have already been made, and might as well replace less energy-efficient devices rather than just being added to the global count.


I think you answered your own question here.


And I thought it was about a stick with flammable material on top. Probably the truth sits somewhere in the middle and this is a stick for communication.


Ok, chatgpt


LED walls are cool (and cheap via China) otherwise, and you can start small and expand later since they're relatively modular: just a bunch of square LED panels linked together. You would need a driver though, which you may or may not be able to hide behind the wall or somewhere else; it makes the setup kind of bulky compared to just a vertical TV :)


> Not that I understand much of what they say, but it appears there are a lot of correctness bugs in pytorch that are flying under the radar, probably having a measurable impact on the results of model quality.

Do you have any links to public discussion about this? If it were true, it could mean a lot of research is invalidated, so it would obviously make huge news.

Also, it feels like something that would be relatively easy to build reproducible test cases from, making it easy to prove whether it's true or not.

And finally if something is easy to validate, and would make huge news, I feel like someone would already have attempted to prove this, and if it was true, would have published something a long time ago.


Could this really invalidate research? Managing to produce a model that works (assuming you check all of the myriad modeling correctness checkboxes) is sufficient on its own. The fact that the modeling process itself was broken in some way — but not the assumptions made of the model inputs, or data leakage assumptions, or anything that fundamentally undermines any model produced — has no bearing on the outcome, which is the fact that you got a model that evidently did make accurate predictions.


> Could this really invalidate research? Managing to produce a model that works (assuming you check all of the myriad modeling correctness checkboxes) is sufficient on its own.

In the academic sense, a model that happens to work isn't research; the product of research should be a technique or insight that generalizes.

"Standard technique X doesn't work in domain Y, so we developed modified technique X' that does better" is the fundamental storyline of many machine learning papers, and that could be 'invalidated' if the poor performance of X was caused by a hidden correctness bug avoided by X'.


> a lot of research could be invalidated, so obviously would make huge news.

A lot of research is unreproducible crap. That’s not news to anyone. Plus, bugs usually make results worse, not better.


There are many more ways to degrade model performance than to enhance it, so I would expect the vast majority of bugs to lead to artificially reduced accuracy, not artificially increased accuracy.

So if PyTorch is full of numerical flaws, that would likely mean many models with mediocre/borderline performance were discarded (never published) because they just failed to meet the threshold where the authors felt it was worth their time to package it up for a mid-tier conference. A finding that many would-be mediocre papers are actually slightly less mediocre than believed would be an utterly unremarkable conclusion and I believe that's why we haven't seen a bombshell analysis of PyTorch flaws and reproducibility at NeurIPS.

A software error in, say, a stats routine or a data preprocessing routine would be a different story, because the degrees of freedom are fewer, leaving a greater probability of an error hitting a path that pushes a result to look artificially better as opposed to artificially worse.


Check their Twitter, I saw something either yesterday or earlier today iirc


> CL saves me round after round of bugs that in clojure aren't found until you run the code

It is true that a lot of things don't surface until you run the code, but the way you run code in Clojure is different from other languages (though similar to CL), so this isn't that big of a problem.

Usually you evaluate the very code you're working on, as an isolated unit, after every change, with just a key stroke. Again, no different from CL, but very different from the rest of the Algol-like languages, where you'd put the code you're editing under unit tests or, worse, manually run the full program after each change.


To be fair, you can easily click to hide those expanded sections. I found it a neat compromise between "link to a (usually) obtuse Wikipedia article", which isn't usually written for laypersons, and forcing me to read through stuff I already know: I just hid the sections I already understood but found value in the others.


Only slightly related, but how common are bugs in GPUs and/or CUDA? I'm currently on Day 5 of trying to debug why the GPT-OSS implementation I've made from scratch (not using PyTorch) isn't working correctly, and while I have it somewhat working with some naive and slow methods, I'm now doing an implementation using the tensor cores and have been stuck for 2-3 days because of some small numerical difference I can't understand the cause of.

Every day I'm getting closer to believing this is some sort of hardware bug in Blackwell or in CUDA itself, but as we know, the bug is (almost) never in the compiler or in the hardware. Until it is...


They exist, but they're not that common (give or take the "expected" numerical deviations based on the order of summation and whatnot, which can both be nontrivial and propagate error further).
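To illustrate the order-of-summation point, here's a small sketch in Python (assuming NumPy is available): the same float32 values summed in two different orders land close to, but usually not exactly on, the float64 result.

```python
import numpy as np

# Floating-point addition is not associative: summing identical float32
# values in a different order gives (slightly) different answers, which is
# one reason two "correct" kernels rarely match bit-for-bit.
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)

forward = np.float32(0.0)
for v in x:                      # accumulate front to back
    forward = np.float32(forward + v)

backward = np.float32(0.0)
for v in x[::-1]:                # accumulate back to front
    backward = np.float32(backward + v)

exact = float(x.astype(np.float64).sum())
print(forward, backward, exact)  # all close, usually not bit-identical
```

Both results are within float32 rounding of the true sum; the point is only that reordering alone moves the low bits, before any actual bug enters the picture.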

Something I recommend doing, the best time being the start of the project and the second best time being now, is adding numerical gradient checking tests to all operations. You will make mistakes in your kernels from time to time, and it's valuable to know at a glance where those mistakes are.

Mind you, it's possible to write both the forward pass and the backward pass in a way that's wrong but compatible. An additional layer of checks I like to add is a dead-simple implementation of all algorithms -- no vectorization, no fancy blocking or re-orderings, nothing. Compare results to the simple implementation.

It sounds like a lot of work, but writing an optimized kernel is much slower than the numerical gradient checking and the simple kernel, and given how in numerical code it's basically impossible to identify the source of a bug without doing the equivalent of all of those checks, it only takes one bug in the whole project for the effort to pay off.
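A minimal sketch of the numerical gradient check described above, in NumPy, with a softmax standing in for whatever op your kernel implements (`forward` and `analytic_grad` are hypothetical stand-ins for your optimized forward/backward pair):

```python
import numpy as np

def forward(x):
    # Stand-in for the optimized forward kernel: a softmax.
    e = np.exp(x - x.max())
    return e / e.sum()

def analytic_grad(x, upstream):
    # Stand-in for the hand-written backward kernel (softmax VJP).
    y = forward(x)
    return y * (upstream - np.dot(upstream, y))

def numeric_grad(f, x, upstream, eps=1e-6):
    # Central finite differences: perturb each input element and measure
    # the change in the upstream-weighted output.
    g = np.zeros_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        g[i] = np.dot(upstream, f(xp) - f(xm)) / (2 * eps)
    return g

x = np.random.default_rng(1).standard_normal(8)
up = np.random.default_rng(2).standard_normal(8)
assert np.allclose(analytic_grad(x, up), numeric_grad(forward, x, up), atol=1e-4)
```

The same harness works for any differentiable op: swap in your kernel for `forward`, your backward pass for `analytic_grad`, and run it on randomized inputs rather than hand-picked special cases.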


Thanks a lot for the pointers. I think I've done a similar approach to what you suggest: lots of tiny (relative) tests for each step in the process, plus sanity checking between the naive implementation I first wrote, which works and does inference correctly, and the new kernel, which is a lot more performant but currently incorrect and produces incoherent outputs.

I'll try to replace bits with simplified versions though; it probably could help at least get closer to knowing where the issue is.

If anyone has more debugging tips I'd greatly appreciate them! Nothing is too small or "obvious", as I'm about to lose my mind more or less.


Beyond that, the tips get less general-purpose. The two big over-arching ideas are:

1. Numerical code is the canonical example of "functional" code. If you prove all the pieces correct then the result is also correct. If you prove one wrong then you know why your overall code is wrong. As such, focusing more heavily than normal on proving each piece correct is prudent. Use automated techniques (like numerical gradient checking), and use randomized inputs. It's easier than you'd think for your favorite special cases to be correct in both right and wrong algorithms. Your eyes will deceive you, so use the computer to do your spot checks.

2. I lied in (1). Especially when you start involving GPUs, it's easy to have to start worrying about variable lifetimes, UAF, double-free, un-initialized memory, accidental clobberings, and other ways in which an innocent "functional" computation can stomp on something else you're doing. Still start with all the checks from (1), and if the parts are correct and the whole is broken then you're messing up global state somewhere. Tracking that down is more art than science, but one technique is adding a "poison" field, tracking deinit count, and otherwise exposing metrics regarding those failure modes. Panic/crash when you hit an invalid state, and once you figure out where the issue happens you can triage as normal (working backward from the broken state to figure out how you got there). With a solid memory management strategy up-front you'll not see this sort of thing, but if it's not something you've thought about then I wouldn't rule it out.

3. Not really another point, just an extension of (2), corruption can show up in subtle ways (like stack-copied pointers inside a paused async function closure which occasionally gets copied by your event loop). If global state is the issue, it's worth a full audit of the application.


You may be running into jensen (huang)’s inequality,

E(loss).cuda() <= E(loss.cuda())


Would make sense I suppose if I was using two different GPUs for the same thing and get two different outcomes. But instead I have two implementations (one naive, one tensor cores) running on the same GPU, but getting different outcomes, where they should be the same.

But then this joke might be flying above my head as well.


Tensor cores use lower precision, so small numerical differences should be expected.
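The scale of that effect is easy to see outside the GPU entirely. A sketch in NumPy comparing an fp16-accumulated dot product against fp32 and fp64 (tensor cores often multiply in fp16/bf16, sometimes accumulating in fp32, so this only approximates the real pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float32)
b = rng.standard_normal(4096).astype(np.float32)

exact = float(np.dot(a.astype(np.float64), b.astype(np.float64)))
fp32 = float(np.dot(a, b))

# Simulate a low-precision pipeline: multiply AND accumulate in float16.
acc16 = np.float16(0.0)
for x, y in zip(a.astype(np.float16), b.astype(np.float16)):
    acc16 = np.float16(acc16 + x * y)

# fp16 error is typically orders of magnitude larger than fp32 error.
print(abs(fp32 - exact), abs(float(acc16) - exact))
```

So a modest gap between a naive fp32 kernel and a tensor-core kernel is expected; coherent output turning into incoherent output is not, and points at a real bug rather than precision.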


Consumer-visible hardware bugs are extremely uncommon nowadays. There's approximately 10x as many people working in design verification as actual hardware design.

I say "consumer-visible" because the bugs still exist and people who can catch them early get promoted quickly and paid a lot. It's very exciting work if you can get it, since you really have to understand the full GPU to break it.

Good luck!!


How big is the numerical difference? If it's small it might be within the precision of the operation itself.


Orders of magnitude away (maybe "small numerical difference" was an understatement). My current hypothesis is that I'm doing scaling wrong somewhere, but I can't help sometimes sliding into "maybe there is something deeper wrong" territory in the evening after another day...


We all have our writing quirks, like how some people use shorthand for words where there is only a marginal difference (like "people" => "ppl"), or even people who capitalize the start of sentences, but not the start of their whole text.

Some thoughts maybe should remain internal :)


> Can bot writers overcome this if they know the credentials?

Yes, instead of doing just an HTTP request, do an HTTP request with authentication; trivial, really. Probably the reason they "can't" do that now is that they haven't come across "public content behind Basic Auth with known correct credentials", so the behavior hasn't been added. But it's literally loading http://username:password@example.com instead of http://example.com to use Basic Auth, couldn't be simpler :)
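For the curious, a sketch of what that looks like with Python's standard library (the URL and credentials are made up; Basic Auth is just a base64-encoded `Authorization` header):

```python
import base64
import urllib.request

# Hypothetical example: a page whose Basic Auth credentials
# ("username" / "password") are publicly posted on the site itself.
url = "https://example.com/real.html"
token = base64.b64encode(b"username:password").decode()

req = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
# with urllib.request.urlopen(req) as resp:   # uncomment to actually fetch
#     html = resp.read()
```

Using the `http://user:pass@host` URL form does the same thing; clients translate it into exactly this header.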


The technical side is straightforward but the legal implications of trying passwords to try to scrape content behind authentication could pose a barrier. Using credentials that aren't yours, even if they are publicly known, is (in many jurisdictions) a crime. Doing it at scale as part of a company would be quite risky.


The people in the mad dash to AGI are either driven by religious conviction, or pure nihilism. Nobody doing this seriously considers the law a valid impediment. They justify (earnestly or not) companies doing things like scraping independent artists' bread-and-butter work to create commercial services that tank their market with garbage knockoffs by claiming we're moving into a post-work society. Meanwhile, the US government is moving at a breakneck pace to dismantle the already insufficient safety nets we do have. None of them care. Ethical roadblocks seem to be a solved problem in tech, now.


The legal implications of torrenting giant ebook collections didn't seem to stop them, not sure why this would


The law doesn't directly stop anyone from doing anything; it acts much differently from a technical control. The law provides recourse to people hurt by violations and enables law enforcement action. I suspect Meta has since stopped their torrenting, and may lose the lawsuit they currently face. Anyone certainly could log in to any site with credentials that are not their own, but fear of legal action may deter them.


Not criminal law

There is independent enforcement that should apply


Going back to Napster, hasn't the gray area always been downloading versus uploading?

If anyone could show that LLM companies have been uploading torrents then they really would be in trouble. If they are only proven to have downloaded torrents they're walking the line.


> but the legal implications of trying passwords to try to scrape content behind authentication could pose a barrier

If you're doing something akin to cracking then yeah. But if the credentials are right there on the landing page, visible to the public, it's not really cracking anymore, since you already know the right password before you try it. The website that put up the basic auth is freely sharing the password, so you aren't really bypassing anything, just using the same access method as everyone else.

Again, if you're stumbling upon basic auth and you try to crack them, I agree it's at least borderline illegal, but this was not the context in the parent comment.


> freely sharing the password

It doesn't have to be so free. It can be shared with the stipulation that it's not used in a bot.

https://www.law.cornell.edu/uscode/text/17/1201

  (a) Violations Regarding Circumvention of Technological Measures.—
    (1)
      (A) No person shall circumvent a technological measure that effectively controls access to a work protected under this title.

This has been used by car manufacturers to deny diagnostic information even though the encryption key needed to decrypt the information is sitting on disk next to the encrypted data. That's since been exempted for vehicle repairs but only because they're vehicle repairs, not because the key was left in plain view.

If you are only authorized to access it under certain conditions, trying to access it outside those conditions is illegal (in the US, minimally). Gaining knowledge of a password does not grant permission to use it.


If I was assigned the task of arguing that in court (though it would be really stupid to assign me, a non-lawyer, that task), I'd probably argue that it's not circumventing a locked door when you use the actual key in the lock; "circumventing" refers to picking the lock. It could still be unauthorized access if you stole the key, but that's a different thing than circumventing, and this law forbids circumventing.

Likewise, if the encryption key is sitting on disk next to the encrypted data, it's not "circumventing" the encryption to use that key. And if you handed me the disk without telling me "Oh, you're only allowed to use certain files on the disk" then it's fair to assume that I'm allowed to use all the files that you put on the disk before handing it to me, therefore not unauthorized access.

That argument might fail depending on what's in the EULA for the car's diagnostic software (which I haven't seen), but I feel it would be worth trying. Especially if you think you can get a sympathetic jury.


Huh, that's interesting. I'm not too familiar with US law, so not surprising I didn't know that :) Time to look up whether it works similarly in my country today. Last time I was involved with anything even slightly related was almost two decades ago, and at that point we (as a company with legal counsel) made choices that assumed public info was OK to use, as it was public (paraphrased from memory), but it might look different today.

Thanks for adding the additional context!


To be fair, even ignoring the robots.txt is illegal in most western countries. I was a technical witness a while back for a case about a bot ignoring the robots.txt. I said it was akin to a peeping tom ignoring a "no trespassing" sign, creeping into someone's backyard, and looking through their window. Yes, they actually did bypass security controls, and therefore illegally "hacked" the site by ignoring it.


How is this different than skipping the password and leaving the same terms of use for the content itself?


Otoh if, as a human, you use a known (even leaked on the website) password to "bypass the security" in order to "gain access to content you're not authorized to see", I think you'd get in trouble. I'd like the same logic applied to bots: implement basic (albeit weak) security and only allow access to humans. This way bots have to _hack you_ to read the content.


> you use a known (even leaked on the website) password to "bypass the security" in order to "gain access to content you're not authorized to see", I think you'd get in trouble

I agree, but if someone has a website that says "This isn't the real page, go to /real.html and when authentication pops up, enter user:password", then I'd argue that is no longer "gaining access to content you're not authorized to see", the author of the page shared the credentials themselves, and acknowledged they aren't trying to hide anything, just providing a non-typical way of accessing the (for all intents and purposes, public) content.


Sure, it’s a crime for the bots, but it would also be a crime for the ordinary users that you want to access the website.

Or if you make it clear that they’re allowed, I’m not sure you can stop the bots then.


I don't think it'd be illegal for anyone.

The (theoretical) scenario is: There is a website (example.com) that publishes the correct credentials, and tells users to go to example.com/authenticate and put those there.

At no point is a user (or bot) bypassing anything that was meant to stop them, they're following what the website is telling them publicly.


I think this analysis is correct. The part you're missing from my comment is "at scale", which means trying to apply this scraping technique to other sites. As a contract security engineer I've found all kinds of accidentally leaked credentials; knowing if a set of credentials is accidentally leaked or are being intentionally disclosed to the public feels like a human-in-the-loop kind of thing. Getting it wrong, especially when automated at scale, is the context the bot writer needs to consider.


There’s hundreds of billions of dollars behind these guys. Not only that, but they also have institutional power backing them. The laws don’t really matter to the worst offenders.

Similar to OPs article, trying to find a technical solution here is very inefficient and just a bandaid. The people running our society are on the whole corrupt and evil. Much simpler (not easier) and more powerful to remove them.


Same goes for human users. The real way to avoid bots is actual login credentials.


The bot protection on low-traffic sites can be hilarious in how simple and effective it can be. Just click this checkbox. That's it. But it's not a checkbox matching a specific pattern provided by a well-known service, so until the bot writer inspects the site and adds the case, it'll work. A browser running OpenAI Operator or whatever it's called would immediately figure it out though.
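The server side of that trick can be as small as this (a hypothetical sketch; the field name is made up, and the point is only that it isn't one of the well-known widget patterns):

```python
# "Just a checkbox" bot protection: the form contains a checkbox with a
# site-specific name, and the server only serves content when that exact
# field comes back set. Generic scrapers that only recognize well-known
# CAPTCHA widgets never submit it.
def allow_request(form: dict) -> bool:
    # "not_a_bot_7f3a" is a made-up, per-site field name.
    return form.get("not_a_bot_7f3a") == "on"

print(allow_request({"not_a_bot_7f3a": "on"}))  # True: a human ticked the box
print(allow_request({}))                        # False: field never submitted
```

It's security by obscurity in the purest sense, which is exactly why it only holds up until a bot author looks at your site specifically.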


> A browser running openai operator or whatever it's called would immediately figure it out though.

But running that costs money, which is a disincentive. (How strong of a disincentive depends on how much it costs vs. the estimated value of a scraped page, but I think it would 100x the per-page cost at least.)

