If Clearview AI, which is based in New York, hadn’t granted his lawyer special access to a facial recognition database of 20 billion faces, Mr. Conlyn might have spent up to 15 years in prison because the police believed he had been the one driving the car.
Clearview allowed its facial recognition service to be used to identify a Good Samaritan who had pulled Mr. Conlyn from the passenger side of the vehicle, thereby providing evidence that he was not the driver.
Like most modern AI systems, Stable Diffusion is trained on a vast dataset that it mines for patterns and learns to replicate. In this case, the core of the training data is a huge package of 5 billion-plus pairs of images and text tags known as LAION-5B, all of which have been scraped from the public web. . . .
We know for certain that LAION-5B contains a lot of copyrighted content. An independent analysis of a 12 million-strong sample of the dataset found that nearly half the pictures contained were taken from just 100 domains. The most popular was Pinterest, constituting around 8.5 percent of the pictures sampled, while the next-biggest sources were sites known for hosting user-generated content (like Flickr, DeviantArt, and Tumblr) and stock photo sites like Getty Images and Shutterstock. In other words: sources that contain copyrighted content, whether from independent artists or professional photographers.
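The domain tally that analysis describes can be sketched in a few lines: given a sample of image URLs, count how many come from each source domain. The URLs below are hypothetical stand-ins, not real LAION-5B entries.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_share(urls):
    """Fraction of sampled images contributed by each source domain."""
    counts = Counter(urlparse(u).netloc for u in urls)
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.most_common()}

# Hypothetical sample, for illustration only:
sample = [
    "https://i.pinimg.com/originals/a.jpg",
    "https://i.pinimg.com/originals/b.jpg",
    "https://live.staticflickr.com/c.jpg",
    "https://image.shutterstock.com/d.jpg",
]
shares = domain_share(sample)
# shares["i.pinimg.com"] -> 0.5
```

Run over a 12-million-URL sample, the same tally is what lets an analyst say "nearly half the pictures came from just 100 domains."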
Ben Thompson at Stratechery points out that new deep learning models do not require access to curated data in a way that would advantage large companies:
If not just data but clean data was presumed to be a prerequisite, then it seemed obvious that massively centralized platforms with the resources to both harvest and clean data — Google, Facebook, etc. — would have a big advantage.
. . .
To the extent that large language models (and I should note that while I’m focusing on image generation, there are a whole host of companies working on text output as well) are dependent not on carefully curated data, but rather on the Internet itself, is the extent to which AI will be democratized, for better or worse.
Bruce Schneier, linking to an article in The Intercept about a court hearing in the Cambridge Analytica suit:
Facebook’s inability to comprehend its own functioning took the hearing up to the edge of the metaphysical. At one point, the court-appointed special master noted that the “Download Your Information” file provided to the suit’s plaintiffs must not have included everything the company had stored on those individuals because it appears to have no idea what it truly stores on anyone. Can it be that Facebook’s designated tool for comprehensively downloading your information might not actually download all your information? This, again, is outside the boundaries of knowledge.
“The solution to this is unfortunately exactly the work that was done to create the DYI file itself,” noted Zarashaw. “And the thing I struggle with here is in order to find gaps in what may not be in DYI file, you would by definition need to do even more work than was done to generate the DYI files in the first place.”
None of this is surprising to people familiar with modern data center services at scale. Twitter allegedly doesn’t know how to restart its services if they really go down:
The company also lacks sufficient redundancies and procedures to restart or recover from data center crashes, Zatko’s disclosure says, meaning that even minor outages of several data centers at the same time could knock the entire Twitter service offline, perhaps for good.
Microsoft trained an excellent 3D face reconstruction model using synthetic data.
Synthetic (i.e., computer-generated) data is helpful because human labeling is slow: it takes a long time for people to look at many faces and annotate all of their features. Synthetic data, by contrast, arrives already labeled, which allows for good and fast training:
Can we keep things simple by just using more landmarks?
In answer, we present the first method that accurately predicts ten times as many landmarks as usual, covering the whole head, including the eyes and teeth. This is accomplished using synthetic training data, which guarantees perfect landmark annotations.
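The labeling advantage can be shown with a toy sketch. This is not Microsoft's method; it is a minimal linear stand-in in which a hypothetical "renderer" maps face parameters to landmark coordinates, so every training sample comes with perfect labels by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy synthetic generator: maps 5 face parameters to 10 2D landmarks.
# Because we generate the data ourselves, the landmark labels are exact
# and free -- no human annotator in the loop.
n_params, n_landmarks = 5, 10
true_W = rng.normal(size=(n_params, 2 * n_landmarks))

def render(params):
    """Hypothetical renderer: parameters in, perfectly labeled landmarks out."""
    return params @ true_W

# Generate a labeled training set with zero annotation effort.
X = rng.normal(size=(1000, n_params))
Y = render(X)

# Fit a landmark predictor by least squares; with noiseless labels
# it recovers the generator almost exactly.
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
max_err = np.abs(W_hat - true_W).max()
```

The same logic scales: a graphics pipeline that emits ten times as many landmarks costs no more annotation time than one that emits a handful, which is the point the paper is making.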
Some cities and states that were early to ban law enforcement from using facial recognition software appear to be having second thoughts, which privacy advocates with the Electronic Frontier Foundation (EFF) and other organizations largely attribute to an uptick in certain types of urban crime.
Google has released the Patent Phrase Similarity dataset, intended to help AI models better understand the somewhat odd world of patent language:
The process of using traditional patent search methods (e.g., keyword searching) to search through the corpus of over one hundred million patent documents can be tedious and result in many missed results due to the broad and non-standard language used. For example, a “soccer ball” may be described as a “spherical recreation device”, “inflatable sportsball” or “ball for ball game”.
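The failure mode Google describes is easy to demonstrate: a keyword match scores paraphrased patent language as nearly or entirely unrelated. A simple Jaccard overlap (my illustration, not part of the dataset) makes the point with the post's own example:

```python
def jaccard(a, b):
    """Keyword (Jaccard) overlap between two phrases: |A & B| / |A | B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# The paraphrases share no keywords at all, so a keyword search
# for one phrasing will never surface the other:
jaccard("soccer ball", "spherical recreation device")  # 0.0

# "ball for ball game" matches only on the single word "ball":
jaccard("soccer ball", "ball for ball game")  # 0.25
```

A phrase-similarity dataset like Google's exists precisely to train models that score such pairs as close despite zero lexical overlap.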