> figure out a way where we can do things like randomizing questions but preserving the integrity of it to ensure a fair evaluation, etc
How does any of this have anything to do with copyright infringement in the context of a DMCA takedown? Do you even own the copyright to the allegedly leaked solution?
> The wiki listing of packages acts as the "registry" and populates the clib-search(1) results.
This seems to be an extremely bad idea. Since everyone can edit GitHub wiki pages, the registry is vulnerable to all kinds of malicious attacks.
> Importantly, you must do this for every open branch in your repo. It is not enough to do so for only your default branch since a malicious PR can target any of your open branches. That is, if you have an open branch that uses a vulnerable version of check-spelling then a malicious PR targeting that branch can leak a GITHUB_TOKEN which can then be used to impact any of your branches, including your default branch.
I think this is a big design flaw in GitHub Actions. Whenever there is a security patch, you have to make sure to apply it to every branch, including all the historical and stale branches that the repo owners forgot to delete.
Hard to follow this, because I'm mostly on the consuming end of CIs or only occasionally do some basic things. Although, having recently tried GHA, setting it up from scratch seems almost trivial even for complex setups. But the security of GHA seems more than shaky.
> I think this is a big design flaw in GitHub Actions. Whenever there is a security patch, you have to make sure to apply them in every branch.
On the other hand I think every action needs to be initialized once on the main branch.
If it's pulling the actions from git using a fixed commit, then a workaround could be to break history from before the vulnerability was introduced; then it wouldn't be possible to pull the vulnerable actions. GitHub GCs unreachable commits quite aggressively.
Compared with Rust’s Result monad, which allows developers to clearly see the effects of error handling, there are two other hidden fallible effects in Rust that are much harder to tackle:
* Panic unwinding. I am not sure how to ensure that a Rust function is panic safe. It is quite easy to cause soundness issues if some invariants no longer hold after a panic. I sometimes see guard types like `PanicGuard` used in the Rust standard library.
* Future cancellation. `tokio::select!` is one of the infamous examples, where it is quite easy to introduce bugs if a future cannot handle cancellation gracefully.
When trying to handle them properly, it feels more like writing traditional C code than Rust.
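To illustrate the panic-safety point: the usual defense is a drop guard whose `Drop` impl restores the invariant whether the function returns normally or unwinds. A minimal std-only sketch (function and type names are hypothetical, not the std-internal `PanicGuard`):

```rust
use std::cell::Cell;
use std::panic::{catch_unwind, AssertUnwindSafe};

// Maintains a recursion-depth invariant that must hold even if `f` panics.
fn with_depth<R>(depth: &Cell<usize>, f: impl FnOnce() -> R) -> R {
    struct Guard<'a>(&'a Cell<usize>);
    impl Drop for Guard<'_> {
        fn drop(&mut self) {
            // Runs on normal return *and* during panic unwinding.
            self.0.set(self.0.get() - 1);
        }
    }
    depth.set(depth.get() + 1);
    let _guard = Guard(depth);
    f() // if f() panics, Guard::drop still restores the invariant
}

fn main() {
    let depth = Cell::new(0);
    assert_eq!(with_depth(&depth, || 42), 42);
    assert_eq!(depth.get(), 0);

    // Without the guard, a panic here would leave depth stuck at 1.
    let result = catch_unwind(AssertUnwindSafe(|| {
        with_depth(&depth, || panic!("boom"))
    }));
    assert!(result.is_err());
    assert_eq!(depth.get(), 0); // invariant restored despite the panic
}
```

The awkward part is that *every* invariant touched across a potential panic point needs such a guard, which is exactly the C-like manual bookkeeping the parent describes.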
>I am not sure how to ensure your Rust function being panic safe.
You could use the linking trick in which your panic handler calls a non-existent extern fn, so linking fails if any panic path survives optimization. For example, this approach is used in the no-panic crate. Of course, it is nothing more than a clever hack with several significant limitations.
>Future cancellation
I would say it's a more general problem of Rust lacking linear types.
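To make the cancellation hazard concrete: a Rust future is cancelled simply by never polling it again (e.g. when it loses a `select!` race and is dropped), so any partial state it built up between polls is silently abandoned. A minimal std-only sketch, with a hand-rolled no-op waker instead of tokio (all names hypothetical):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A future that needs two polls to finish; `done` records completion.
struct TwoStep { polls: u32, done: bool }

impl Future for TwoStep {
    type Output = u32;
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<u32> {
        self.polls += 1;
        if self.polls >= 2 {
            self.done = true;
            Poll::Ready(self.polls)
        } else {
            Poll::Pending
        }
    }
}

// Minimal waker that does nothing when woken.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn main() {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);

    let mut fut = TwoStep { polls: 0, done: false };
    assert!(Pin::new(&mut fut).poll(&mut cx).is_pending());
    // Simply stopping here *is* cancellation: no one ever polls again.
    assert!(!fut.done);        // partial state abandoned mid-way
    assert_eq!(fut.polls, 1);

    // For contrast, a future that is polled to completion:
    let mut fut2 = TwoStep { polls: 0, done: false };
    assert!(Pin::new(&mut fut2).poll(&mut cx).is_pending());
    assert_eq!(Pin::new(&mut fut2).poll(&mut cx), Poll::Ready(2));
    assert!(fut2.done);
}
```

Nothing forces the caller to finish `fut`; linear types would make "this future must be driven to completion" expressible, which is the gap the parent is pointing at.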
IMHO both panics and async are a mistake. The latter should be a standard library with some macros and the former should be replaced by proper error handling everywhere.
Yes, out-of-memory errors should be handleable. You might not care when writing web-backend slop designed to run under orchestrators, but for systems code it is sometimes necessary, and Rust is a systems language.
It's possible to work around both shortcomings, but they contradict the language's mission and are warts. Async is a big "excessive cleverness" mistake.
I’d go further and say it’s not possible to fully implement async/await without compiler help.
I got really far with stateful, back in 2016 [1]. Stateful was an attempt to write a coroutine library in a proc macro, which generated state machines, as opposed to using OS primitives like green threading. This was back before the Rust community really started working in this space. I ended up extracting the type system from rustc to do much of the analysis, but it ultimately failed because of how difficult it was to output Rust code that respected the borrow-checker rules. I also didn't have anything like the pinning system, so I couldn't catch move issues either.
It was a much better idea to just implement this in the compiler.
Personally I just loathe async in general. Go has it right, but I understand why you can’t do that in Rust. Async is an ugly workaround for the inefficiency of OS threads, and I wish they would just fix that so we can stop all this madness.
May I ask what missing fields you are referring to? Why @online/@software/@dataset type in biblatex [1] cannot do the job?
That being said, I think GitHub should acknowledge that it is common for authors to want people to cite their paper (or multiple papers) rather than simply the source code, because that is what counts toward citations in academia. At the same time, there is no reason not to support bibtex/biblatex in addition to CFF.
@software is simply an alias for the fallback @misc, i.e., semantics are lost: no fields for different URLs for different software media (code, build artifacts, etc.), no software-identifier support, etc.
Also, you can have people cite your paper on GitHub by specifying it as a preferred citation in CFF, and GitHub will render that instead of the source-code citation.
Which is, btw, against the software citation principles [1], but caters to people who need time adapting and want traditional credit now.
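For reference, a minimal hypothetical `CITATION.cff` sketch using the `preferred-citation` key mentioned above (project and paper names are made up):

```yaml
cff-version: 1.2.0
message: "If you use this software, please cite the paper below."
title: "example-project"   # hypothetical repository name
authors:
  - family-names: "Doe"
    given-names: "Jane"
preferred-citation:
  type: article
  title: "An Example Paper About This Software"
  authors:
    - family-names: "Doe"
      given-names: "Jane"
  journal: "Journal of Examples"
  year: 2021
```

With this file in the repository root, GitHub's "Cite this repository" button offers the paper rather than the code.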
In addition to the attacks that have been widely discussed on HN, such as converting a legit image so it is detected as CSAM (false positive) or perturbing a real CSAM image to evade detection (false negative), I think this can also be used to mount a DoS attack or to censor arbitrary images.
It works like this. First, find your target images, which are either widely shared (like internet memes, for a DoS attack) or images you want to censor. Then, compute their NeuralHash. Next, use the hash-collision tool to perturb real CSAM images so they have the same NeuralHash as the target images. Finally, report these adversarial CSAM images to the authorities. The result is that the attacker has successfully added the targeted NeuralHash to the CSAM database, and people who store the legit images will then be flagged.
You can technically hide an adversarial collision inside a completely legit, normal image. It won't be visible to human eyes, but it will trigger a detection. You can also do the complete opposite: perturb a CSAM image so it outputs a completely different hash and circumvents detection. All of these vulnerabilities are well known for perceptual hashes.
So right there seems to be an issue to me. It seems like if you were trading in CSAM, you would run CLEANER -all on anything and everything, because you know someone has already written that as a proof of concept here.
> Neural hash generated here might be a few bits off from one generated on an iOS device. This is expected since different iOS devices generate slightly different hashes anyway. The reason is that neural networks are based on floating-point calculations. The accuracy is highly dependent on the hardware. For smaller networks it won't make any difference. But NeuralHash has 200+ layers, resulting in significant cumulative errors.
This is a little unexpected. I'm not sure whether this has any implications for CSAM detection as a whole. Wouldn't this require Apple to add multiple versions of the NeuralHash of the same image (one for each platform/hardware) to the database to counter this issue? If that is the case, doesn't this in turn weaken the detection threshold, since the same image may match multiple times on different devices?
This may explain why they (weirdly) only announced it for iOS and iPadOS; as far as I can tell, they didn't announce it for macOS.
My first thought was that they didn't want to make the model too easily accessible by putting it on macOS, in order to avoid adversarial attacks.
But knowing this now, Intel Macs are an issue (not, as I previously wrote, because they differ in floating-point implementation from ARM; thanks my123 for the correction) because they would have to run the network on a wide variety of GPUs (at the very least multiple AMD archs and Intel's iGPU), so maybe that also factored into their decision? They would have had to deploy multiple models and (I believe, unless they could make the models converge exactly?) multiple distinct databases server-side to check against.
To people knowledgeable on the topic, would having two versions of the models increase the attack surface ?
Edit: Also, I didn't realise that, because of how perceptual hashes work, they would need their own matching threshold, independent of the "30 pictures matched to launch a human review" one. Apple's communication push implied exact matches. I'm not sure they used the right tool here (putting aside for now the fact that this is running client-side).
Is it? I checked your link, and they clearly separate which features come to which OS. Here's how I read it:
- Communication safety in Messages
> "This feature is coming in an update later this year to accounts set up as families in iCloud for iOS 15, iPadOS 15, and macOS Monterey."
- CSAM detection
> "To help address this, new technology in iOS and iPadOS"
- Expanding guidance in Siri and Search
> "These updates to Siri and Search are coming later this year in an update to iOS 15, iPadOS 15, watchOS 8, and macOS Monterey."
So while the two other features are coming, the CSAM detection is singled out as not coming to macOS.
But! At the same time (and I saw this after the editing window closed), the GitHub repo clearly states that you can get the models from macOS builds 11.4 onwards:
> If you have a recent version of macOS (11.4+) or jailbroken iOS (14.7+) installed, simply grab these files from /System/Library/Frameworks/Vision.framework/Resources/ (on macOS) or /System/Library/Frameworks/Vision.framework/ (on iOS).
So my best guess is that they trialed it on macOS as they did on iOS (and put the model there, contrary to what I had assumed) but chose not to enable it yet, perhaps because of the rounding-error issue, or something else.
Edit: This repo by KhaosT refers to 11.3 for the API availability, but it's in the same ballpark: Apple is already shipping it as part of their Vision framework, under an obfuscated class name, and the code sample runs the model directly on macOS: https://github.com/KhaosT/nhcalc/blob/5f5260295ba584019cbad6...
Ah, good catch and write-up. I believe you're right, and it's likely a matter of time for the Mac. Hard to tell if this means it's shipping with macOS but just not enabled yet.
My bad, I edited the previous post, thanks for this. Assuming this runs on Intel's iGPU, they would still need the ability to run on AMD GPUs for the iMac Pro and Mac Pro, so that's at least two extra separate cases.
This basically invalidates any claims Apple made about accuracy, and brings up an interesting point about the hashing mechanism: it seems two visually similar images will also have similar hashes. This is interesting because humans quickly learn such patterns: for example, many here will know what dQw4w9WgXcQ is without thinking about it at all.
> it seems two visually similar images will also have similar hashes
This is by design: the whole idea of a perceptual hash is that the more similar two hashes are, the more similar the two images are, so I don't think it invalidates any claims.
Perceptual hashes are different from cryptographic hashes, where any change in the message completely changes the hash.
"Hash" is applied correctly here. A hash function is "any function that can be used to map data of arbitrary size to fixed-size values." The properties of being an (essentially) unique fingerprint, or of small changes in input causing large changes in output, are properties of cryptographic hashes. Perceptual hashes do not have those properties.
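To make the distinction concrete, here is a toy sketch of a perceptual hash: a simple "average hash" over a 16-pixel image (not Apple's NeuralHash, just the idea). Similar inputs land on nearby hashes, measured by Hamming distance; a cryptographic hash would scatter them:

```rust
// Toy average hash: one bit per pixel, set if the pixel is brighter
// than the image mean. Visually similar images keep the same pattern.
fn ahash(pixels: &[u8; 16]) -> u64 {
    let mean = pixels.iter().map(|&p| p as u32).sum::<u32>() / 16;
    pixels.iter().enumerate().fold(0u64, |h, (i, &p)| {
        if p as u32 > mean { h | (1u64 << i) } else { h }
    })
}

// Hamming distance: how many bits differ between two hashes.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

fn main() {
    let img: [u8; 16] = [10, 200, 10, 200, 10, 200, 10, 200,
                         10, 200, 10, 200, 10, 200, 10, 200];
    let brighter = img.map(|p| p + 5);   // "same" image, slightly brightened
    let inverted = img.map(|p| 210 - p); // a very different image

    assert_eq!(hamming(ahash(&img), ahash(&brighter)), 0);  // near-duplicate: identical hash
    assert_eq!(hamming(ahash(&img), ahash(&inverted)), 16); // different image: far apart
}
```

A small brightness change leaves the hash untouched, which is exactly the property a cryptographic hash is designed *not* to have.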
Good explanation, thanks. I only knew about cryptographic hashes, or those that are used for hash tables where you absolutely do not want to have collisions. Anyhow, I'm not really comfortable with this usage of the word "hash". It is completely opposite of the meaning I'm used to.
> The whole idea of a perceptual hash is that the more similar the two hashes are, the more similar the two images are
This has already been shown to be inaccurate: adversarial hashes and collisions are possible in the system. You don't have to be very skeptically minded to think this is intentional. Links to examples have already been posted in this thread.
You are banking on an ideal scenario of this technology not the reality.
> Wouldn't this require Apple to add multiple versions of NeuralHash of the same image (one for each platform/hardware) into the database to counter this issue?
Not if their processor architectures are all the same, or close enough that they can write (and have written) an emulation layer to get bit-identical behaviour.
I think it would just require generating the table of hashes once on each type of hardware in use (whether CPU or GPU), then doing the lookup only in the table that matches the hardware that generated it.
To re-do the hashes, you would need to run it on the original offending photo database, which -- as an unofficial party doing so -- could land you in trouble, wouldn't it?
And what if you re-do the hashes on a Mac with auto-backup to iCloud: next thing you know, the entire offending database has been synced into your iCloud account :-/
You're thinking of cryptographic hashes. There are many kinds of hash (geographic, perceptual, semantic, etc), many of which are designed to only be slightly different.
See https://hackertimes.com/item?id=28105849, which shows a POC to generate adversarial collisions for any neural network based perceptual hash scheme. The reason it works is because "(the network) is continuous(ly differentiable) and vulnerable to (gradient-)optimisation based attack".