Google’s Automated Image Captioning & the Key to Artificial “Vision”

It’s no secret that Google has become more active in research in recent years, especially since its significant reorganization in 2015. On 22nd September 2016 it announced the open-source release of a piece of software that can detect the objects and setting of an image and automatically generate a caption describing it. Of course, it doesn’t have the same level of creativity as human beings do in writing the prose of its captions, but the system, known as Show and Tell and built on the Inception V3 image encoder, should have captured attention for reasons that transcend the superficial “look at the captions it can make” motive. Software like this may, in fact, be a stepping stone on the road to more advanced artificial intelligence.

Eyes Can See, but Intelligence “Perceives”

Artificial sight has been with us for more than a century. Anything with a camera can see; that part is basic. But even a blind man can surpass the camera’s understanding of what it is looking at. Until very recently, computers could not easily and accurately name the objects found in pictures unless they were given very specific parameters. To truly say that a man-made object has “vision” would mean that it can at least specify what it is looking at, rather than simply looking at it without gathering any context. With that ability, the device could potentially react to its environment based on sight, just like we do. Perception is an absolute necessity. Without it, every sense we have is useless.

Perception Through Automatic Image Captioning

Although we generally believe that every picture is worth a thousand words, Google’s captioning software doesn’t necessarily share that opinion. It has very few things to say about what it sees, but it at least has a basic, concrete understanding of what is contained within the frame presented to it.

With this rudimentary information, we have taken a step towards software that can understand visual stimuli. Giving a robot this kind of power would allow it to react to such stimuli, bringing its intelligence to just under the level of the most basic aquatic animals. That may not sound like much, but if you take a look at how robots are doing right now (when tested outside their highly restrictive parameters), you’ll find that this would be quite a leap in intelligence compared to the amoebic way in which they currently perceive their surroundings.
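
To make the idea of a caption built from “what is contained within the frame” a little more concrete, here is a deliberately tiny Python sketch. It is not how Google’s software works: a real captioning model learns to generate its words, while this toy simply fills in a phrase from labels. The function name `caption_from_labels`, the labels, the confidence scores, and the 0.5 threshold are all made up for illustration.

```python
# Toy illustration only: the labels, scores, and threshold are invented.
# A real captioning model learns to generate words; it does not fill a template.

def caption_from_labels(object_scores, setting_scores, threshold=0.5):
    """Build a short caption from the most confident object and setting labels."""
    # Keep objects whose confidence clears the threshold, most confident first.
    ranked = sorted(object_scores.items(), key=lambda kv: kv[1], reverse=True)
    objects = [label for label, score in ranked if score >= threshold]
    if not objects:
        return "an image"

    # Mention at most two objects so the caption stays short.
    subject = " and ".join(objects[:2])

    # Append the best-scoring setting only if it is confident enough.
    if setting_scores:
        setting = max(setting_scores, key=setting_scores.get)
        if setting_scores[setting] >= threshold:
            return f"{subject} {setting}"
    return subject


if __name__ == "__main__":
    objects = {"a dog": 0.92, "a frisbee": 0.81, "a cat": 0.07}
    settings = {"on a beach": 0.88, "in a kitchen": 0.03}
    print(caption_from_labels(objects, settings))
```

Even this toy shows why the captions are terse: the software can only talk about the handful of things it is confident it has recognized.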

What This Means for AI (And Why It’s Far From Perfect)

The fact that we now have software that can caption images, built on an image encoder that reaches about 93 percent accuracy on the ImageNet image classification task, means that we have somewhat overcome the obstacle of getting computers to make sense of their environments. Of course, that doesn’t mean we’re anywhere near finished in that department. It’s also worth mentioning that the model was trained on captions written by humans and uses the information it “learned” to decipher other images. To have a true understanding of one’s environment, one must be able to achieve a more abstract level of perception. Is the person in the image angry? Are two people fighting? What is the woman on the bench crying about?

The above questions represent the kinds of things we ask ourselves when we encounter other human beings. It’s the kind of abstract inquiry that requires us to extrapolate more information than an image captioning doohickey can provide. Let’s not forget that icing on the cake we like to call an emotional (or “irrational”) reaction to what we see. It’s why we consider flowers beautiful, sewers disgusting, and french fries tasty. We are still wondering whether a machine will ever achieve this without having it hard-coded. The truth is that this kind of “human” phenomenon is likely impossible without restrictive programming. Of course, that doesn’t mean we’ll stop trying. We are, after all, human.

Do you think that our robot overlords will ever learn to appreciate the intricacy of a rose petal under a microscope? Tell us in a comment!

Miguel Leiva-Gomez