AI that learns from the internet

Ben Thompson at Stratechery points out that the new deep learning models do not require access to curated data in a way that would advantage large companies:

If not just data but clean data was presumed to be a prerequisite, then it seemed obvious that massively centralized platforms with the resources to both harvest and clean data — Google, Facebook, etc. — would have a big advantage.

. . . . .

To the extent that large language models (and I should note that while I’m focusing on image generation, there are a whole host of companies working on text output as well) are dependent not on carefully curated data, but rather on the Internet itself, is the extent to which AI will be democratized, for better or worse.

The AI Unbundling

This means the new AI models are relatively cheap to build, and that they are a reflection of internet content itself, “for better or worse.”

4.2 gigabytes of pure knowledge

Andy Salerno created a digital painting with Stable Diffusion, a new open-source image synthesis model that lets anyone with a PC and a decent GPU create almost any image they can describe:

Andy Salerno’s masterpiece created “with literally dozens of minutes of experience”

Salerno’s step-by-step guide is straightforward and worth a read.

Perhaps most remarkable is that Stable Diffusion is small enough for almost anyone to use:

4.2 gigabytes.

That’s the size of the model that has made this recent explosion possible.

4.2 gigabytes of floating points that somehow encode so much of what we know.

4.2 Gigabytes, or: How to Draw Anything

Stable Diffusion was trained on images drawn from the 5-billion-image LAION-5B dataset, at a reported cost of about $600,000 in GPU time leased from AWS.
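Getting started locally now takes only a few lines of Python. Here is a minimal sketch using Hugging Face’s diffusers library; the model ID, prompt, and settings are illustrative assumptions rather than Salerno’s exact setup:

```python
# Minimal local text-to-image generation with Stable Diffusion via the
# diffusers library. The model ID, prompt, and parameters below are
# illustrative assumptions, not Salerno's exact configuration.
import torch
from diffusers import StableDiffusionPipeline

# The ~4 GB of model weights are downloaded once and cached locally.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,  # half precision fits consumer GPUs
).to("cuda")

image = pipe(
    "a digital painting of a lighthouse at dusk, highly detailed",
    num_inference_steps=50,  # denoising steps: more steps, more detail
    guidance_scale=7.5,      # how closely to follow the prompt
).images[0]
image.save("lighthouse.png")
```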

Learning from synthetic data

Microsoft trained an excellent 3D face reconstruction model using synthetic data.

Synthetic (i.e., computer-generated) data is helpful because hand-labeling the features of many faces takes humans a long time, while synthetic data arrives already labeled. That allows for fast, accurate training:

Can we keep things simple by just using more landmarks?

In answer, we present the first method that accurately predicts ten times as many landmarks as usual, covering the whole head, including the eyes and teeth. This is accomplished using synthetic training data, which guarantees perfect landmark annotations.

3D Face Reconstruction with Dense Landmarks
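The advantage is easy to see in code. Below is a toy sketch of the training loop that perfectly labeled synthetic data enables; the renderer and the model are hypothetical stand-ins, not Microsoft’s pipeline:

```python
# Toy sketch of training on synthetic data: every rendered face arrives
# with exact landmark coordinates "for free," so there is no human
# annotation step. render_faces() is a hypothetical stand-in for a
# graphics pipeline; none of this is Microsoft's actual code.
import torch
import torch.nn as nn

NUM_LANDMARKS = 700  # dense: roughly ten times the usual ~68

def render_faces(batch_size):
    """Hypothetical renderer: images plus perfect 2D landmark labels."""
    images = torch.rand(batch_size, 3, 128, 128)          # placeholder renders
    landmarks = torch.rand(batch_size, NUM_LANDMARKS, 2)  # exact by construction
    return images, landmarks

model = nn.Sequential(  # deliberately tiny regressor, for illustration only
    nn.Flatten(),
    nn.Linear(3 * 128 * 128, 256),
    nn.ReLU(),
    nn.Linear(256, NUM_LANDMARKS * 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

for step in range(100):
    images, landmarks = render_faces(batch_size=32)  # labels come with the data
    preds = model(images).view(-1, NUM_LANDMARKS, 2)
    loss = loss_fn(preds, landmarks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```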

More data for AI interpretation of patents

Google has released the Patent Phrase Similarity dataset, intended to help AI models better understand the somewhat odd world of patent language:

The process of using traditional patent search methods (e.g., keyword searching) to search through the corpus of over one hundred million patent documents can be tedious and result in many missed results due to the broad and non-standard language used. For example, a “soccer ball” may be described as a “spherical recreation device”, “inflatable sportsball” or “ball for ball game”.

Announcing the Patent Phrase Similarity Dataset

The dataset was used in the U.S. Patent Phrase to Phrase Matching Kaggle competition, where the best models achieved close-to-human results.
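To get a feel for the task, here is a rough sketch that scores the post’s “soccer ball” example with a general-purpose sentence-embedding model; the model choice is an assumption, and strong competition entries fine-tuned on the dataset itself:

```python
# Rough sketch of the phrase-similarity task using a general-purpose
# sentence-embedding model (via the sentence-transformers library).
# The model name is an assumption; strong Kaggle entries fine-tuned
# on the Patent Phrase Similarity data itself.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

anchor = "soccer ball"
candidates = [
    "spherical recreation device",
    "inflatable sportsball",
    "ball for ball game",
    "printed circuit board",  # an unrelated phrase, for contrast
]

emb_anchor = model.encode(anchor, convert_to_tensor=True)
emb_cands = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity stands in for the dataset's human-rated 0-1 scores.
scores = util.cos_sim(emb_anchor, emb_cands)[0]
for phrase, score in zip(candidates, scores):
    print(f"{score:.2f}  {phrase}")
```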

Commercial (legal) limitations of DALL-E 2

Louise Matsakis reporting for The Information:

At least one major brand has already tried incorporating Dall-e 2 into an advertising campaign, inadvertently demonstrating how legal snafus could arise. When Heinz’s marketing team fed Dall-e 2 “generic ketchup-related prompts,” the program almost exclusively produced images closely resembling the company’s trademarked condiment bottle. “We ultimately found that no matter how we were asking, we were still seeing results that looked like Heinz,” a company representative told AdWeek.

Can Creatives Survive the Future War Against Dall-e 2?

The image generation AIs are remarkable, but they still have significant technical limitations, particularly an inability to compose unusual scenes (“a cup on a spoon”).

Is ShotSpotter AI?

A federal lawsuit filed Thursday alleges Chicago police misused “unreliable” gunshot detection technology and failed to pursue other leads in investigating a grandfather from the city’s South Side who was charged with killing a neighbor.

. . . . .

ShotSpotter’s website says the company is “a leader in precision policing technology solutions” that help stop gun violence by using sensors, algorithms and artificial intelligence to classify 14 million sounds in its proprietary database as gunshots or something else.

Lawsuit: Chicago police misused ShotSpotter in murder case

Some commentators (e.g., link) have jumped on this story as an example of someone (allegedly) being wrongly imprisoned due to AI.

But maybe ShotSpotter is just bad software that is used improperly? Does it matter?

The definition of AI is so difficult that we may soon find ourselves regulating all software.

AI discoveries in chess

AlphaZero shocked the chess world in 2018.

Now an economics paper is trying to quantify the effect of this new chess knowledge:

[W]e are not aware of any previously documented evidence comparing human performance before and after the introduction of an AI system, showing that humans have learned from AI’s ideas, and that this has pushed the frontier of our understanding.

AlphaZero Ideas

The paper shows that the top-ranked chess player in the world, Magnus Carlsen, meaningfully altered his play and incorporated ideas from AlphaZero on openings, sacrifices, and the early advance of the h-pawn.

Carlsen himself acknowledged the influence:

Question: We are really curious about the influence of AlphaZero in your game.

Answer: Yes, I have been influenced by my hero AlphaZero recently. In essence, I have become a very different player in terms of style than I was a bit earlier and it’s been a great ride.

Id. at 25 (citing a June 14, 2019 interview in Norway Chess 2019).

Bias mitigations for the DALL-E 2 image generation AI

OpenAI has a post explaining the three main techniques it used to “prevent generated images from violating our content policy.”

First, they filtered out violent and sexual images from the training dataset:

[W]e prioritized filtering out all of the bad data over leaving in all of the good data. This is because we can always fine-tune our model with more data later to teach it new things, but it’s much harder to make the model forget something that it has already learned.
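In code, that priority amounts to an aggressive decision threshold. Here is a sketch, with a hypothetical stand-in for OpenAI’s internal classifiers:

```python
# The "remove all the bad data, even at the cost of good data" trade-off
# is a recall-oriented threshold choice. bad_content_score() is a
# hypothetical stand-in for OpenAI's internal policy classifiers.
import random

def bad_content_score(image) -> float:
    """Hypothetical classifier: probability the image violates policy."""
    return random.random()  # placeholder score in [0, 1]

# A balanced filter might drop images only when the score exceeds 0.5.
# Dropping at a much lower threshold accepts many false positives (good
# images discarded) to minimize false negatives (bad images kept).
THRESHOLD = 0.1

def filter_dataset(dataset):
    return [img for img in dataset if bad_content_score(img) < THRESHOLD]

print(len(filter_dataset(range(10_000))))  # keeps roughly 10% here
```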

Second, they found that the filtering can actually amplify bias because the smaller remaining datasets may be less diverse:

We hypothesize that this particular case of bias amplification comes from two places: first, even if women and men have roughly equal representation in the original dataset, the dataset may be biased toward presenting women in more sexualized contexts; and second, our classifiers themselves may be biased either due to implementation or class definition, despite our efforts to ensure that this was not the case during the data collection and validation phases. Due to both of these effects, our filter may remove more images of women than men, which changes the gender ratio that the model observes in training.

They fix this by re-weighting the filtered training data so that its category distribution matches that of the unfiltered dataset.
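In spirit, each remaining image is up-weighted by how much the filter thinned its category. Here is a simplified sketch; OpenAI estimates these categories with classifiers, so the explicit labels below are an assumption:

```python
# Simplified sketch of re-weighting after filtering: up-weight whichever
# categories the filter thinned most, so that sampling from the filtered
# dataset reproduces the original category distribution. OpenAI infers
# these categories with classifiers; the explicit labels are an assumption.
from collections import Counter

def reweight(original_labels, filtered_labels):
    """Per-category weight = original frequency / filtered frequency."""
    n_orig, n_filt = len(original_labels), len(filtered_labels)
    freq_orig = {c: n / n_orig for c, n in Counter(original_labels).items()}
    freq_filt = {c: n / n_filt for c, n in Counter(filtered_labels).items()}
    return {c: freq_orig[c] / freq_filt[c] for c in freq_filt}

# Toy example: the filter removed far more images of women than of men.
original = ["woman"] * 500 + ["man"] * 500
filtered = ["woman"] * 300 + ["man"] * 450

print(reweight(original, filtered))
# {'woman': 1.25, 'man': 0.833...}: sampling images of women 1.25x as
# often restores the original 50/50 balance during training.
```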

Third, they needed to prevent image regurgitation to avoid IP and privacy issues. They found that most regurgitated images (a) were simple vector graphics and (b) had many near-duplicates in the training set, which made them easy for the model to memorize. So they removed near-duplicates with a clustering algorithm.

To test the effect of deduplication on our models, we trained two models with identical hyperparameters: one on the full dataset, and one on the deduplicated version of the dataset. . . . Surprisingly, we found that human evaluators slightly preferred the model trained on deduplicated data, suggesting that the large amount of redundant images in the dataset was actually hurting performance.
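The clustering step is what makes the search tractable: comparing all pairs among hundreds of millions of images is infeasible, but comparing pairs within a cluster is not. Here is a rough sketch, with random vectors and an assumed similarity threshold standing in for the real pipeline:

```python
# Rough sketch of clustering-based near-duplicate removal: embed every
# image, cluster the embeddings, and only compare pairs within the same
# cluster, avoiding an infeasible all-pairs search. The random vectors,
# cluster count, and threshold are assumptions, not OpenAI's pipeline.
import numpy as np
from sklearn.cluster import KMeans

def deduplicate(embeddings, n_clusters=10, threshold=0.97):
    """Return indices to keep, dropping near-duplicates inside each cluster."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(unit)
    keep = []
    for c in range(n_clusters):
        kept = []
        for i in np.where(labels == c)[0]:
            # Keep i only if it is not nearly identical to anything kept.
            if all(unit[i] @ unit[j] < threshold for j in kept):
                kept.append(i)
        keep.extend(kept)
    return sorted(keep)

# Toy usage: random vectors stand in for image embeddings (e.g., CLIP).
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64))
emb[1] = emb[0] + 1e-3           # plant one near-duplicate pair
print(len(deduplicate(emb)))     # 999: one of the pair was removed
```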

Given DALL-E 2’s impressive results, this is an instructive set of techniques for mitigating bias in AI models.