8 Key Lessons From Working On Secret Computer Vision Products

Renee Shah
5 min read · Oct 20, 2018

Last month, Google publicly released a stealth project called Shoppable Images. With this product, websites can grant Google permission to crawl images on their site via JavaScript, get visually similar product matches, and let users buy products in as few as three clicks. This is a play on distributed commerce: the idea that technology should let people buy anything, anytime, and anywhere. Advanced natural language processing should let you shop via a voice assistant integrated into your car, better clustering of unstructured data should provide better product recommendations as you browse on your phone, and, in my computer vision work, any ordinary image should serve as inspiration to buy. However, labeling data, training models, and ensuring both speed and accuracy are all extremely challenging. Here’s what I learned on this project:

  1. Real-time computer vision is still slow: Before Shoppable Images was even built, we tried to find a way to identify products in images in real-time. That way, if publishers changed the content on their page, all images would be shoppable immediately. However, we came to the conclusion that we needed to pre-crawl images at least a few hours in advance to avoid page latency. Even with the most advanced computing, real-time computer vision is still slow. It’s better to identify the products in images on the back-end ahead of time so that front-end load times remain unchanged.
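The pre-crawl idea above can be sketched in a few lines. This is a minimal illustration, not Google's actual pipeline: `run_vision_model` is a hypothetical stand-in for the expensive model call, and the cache is an in-memory dict standing in for a real datastore.

```python
# Sketch of the "pre-crawl" approach: run the slow vision model offline,
# cache results keyed by image URL, and serve only fast cache lookups.
# `run_vision_model` and the SKU value are hypothetical placeholders.

product_cache = {}

def run_vision_model(image_url):
    # Placeholder for an expensive model call (seconds, not milliseconds).
    return ["sku-example"]

def precrawl(image_urls):
    """Batch job run hours ahead of serving time."""
    for url in image_urls:
        product_cache[url] = run_vision_model(url)

def shoppable_products(image_url):
    """Fast front-end path: a cache hit or nothing; never a live model call."""
    return product_cache.get(image_url, [])
```

The key property is that the front-end path never blocks on the model, so page load times stay unchanged even when the model is slow.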
  2. Don’t overlook the infrastructure stack that will affect your product: One of the benefits of working at Google is that you don’t have to think about computing power or computing costs for your products. However, if you’re selling into other companies, their infrastructure stack can make a big difference. For example, even though Shoppable Images ran on our servers and assumed internally hosted images, we later realized that many publishers were using external image hosting services. (You can often tell a site is using external image hosting if its image URLs include query parameters like ?x=y.) If we had thought about the infrastructure of a typical user ahead of time, we could have saved a lot of time later.
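The URL heuristic in the parenthetical can be checked programmatically. A minimal sketch, assuming the presence of any query string is the signal (real detection would likely also check the hostname against known CDN domains):

```python
from urllib.parse import urlparse

def uses_external_image_host(image_url: str) -> bool:
    """Heuristic from the lesson above: externally hosted images often
    carry query parameters (e.g. ?x=y) added by the hosting service."""
    return bool(urlparse(image_url).query)

# Hypothetical example URLs:
print(uses_external_image_host("https://cdn.example.com/img.jpg?w=600&q=80"))  # True
print(uses_external_image_host("https://example.com/images/img.jpg"))          # False
```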
  3. Data labeling is expensive and hard to manage: While I can’t disclose the amount of money that Google spends on labeling data, the first time I heard this number I couldn’t get it out of my head. In other words, it’s a large sum of money. I’ve spent some time thinking about creative, low-cost ways to generate training data, with crowdsourcing being the obvious one. Amazon has tried to solve this problem through Mechanical Turk, and Google uses a number of third-party vendors. I’ve also spent time labeling and relabeling data myself, so it really is all hands on deck! Figure out a plan for creating training data in-house or externally, and think about how to handle costs and quality assurance ahead of time.
  4. APIs can fall short for vertically-specific businesses: APIs are appealing for their ease of use. However, plug-and-play APIs are not always the best fit for vertically-specific businesses. Even for us, we realized that, while our model could recognize apparel, it could not distinguish plus-size apparel from standard sizes. For most use cases (e.g. using computer vision to identify fashion trends), this distinction would have been irrelevant. But to enable all people to shop through images, we needed more precision than a basic API could provide.
  5. Choose your framework carefully: The list of neural network frameworks in 2018 is overwhelming, with different frameworks focusing on different use cases. Please do your own research to determine your best choice, but here are four open-source, highly usable frameworks to get you down the right path:
  • TensorFlow: Since I work at Google, TensorFlow was the obvious choice. The focus on parallelism and integration with Google’s CloudML makes TensorFlow networks highly scalable. TensorFlow is getting better at mobile performance and enabling high-performance, in-browser models with the new TensorFlow.js. TensorFlow also has a strong developer community and lots of documentation.
  • PyTorch: PyTorch is loved by researchers for the ability to create exotic, experimental neural networks. If you have a complex idea that you’re not sure how to implement with other frameworks, give PyTorch a look. It is great for Recurrent Neural Networks (RNNs), and many recent, successful research papers release their code in PyTorch. Don’t let all the research talk scare you though: PyTorch still has an accessible API, great performance, and documentation.
  • Caffe2: Caffe2 offers extremely high performance with a focus on large scale deployments and mobile models. Caffe2 also hopes to retain the dominance of Caffe in Convolutional Neural Network (CNN) applications, such as computer vision.
  • Keras: The Keras library is loved for its readability and ability to rapidly create neural networks. It works by using a framework in its backend (e.g. TensorFlow) to do all the neural network math and wraps different functionalities for simple usage. Keras gives you high level functions to do things like create layers, apply max pooling, and check accuracy in a single line of code.
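As a concrete illustration of the readability point, here is a minimal Keras sketch where creating a layer, applying max pooling, and checking accuracy are each a single line. The shapes and layer sizes are arbitrary example values, not a recommended architecture:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A tiny CNN sketch: each layer, the max pooling step, and the accuracy
# metric are each one line. Input shape and sizes are illustrative only.
model = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, 3, activation="relu"),    # create a conv layer
    layers.MaxPooling2D(),                      # apply max pooling
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),     # e.g. 10 product categories
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])             # check accuracy
```

Under the hood, Keras delegates the actual tensor math to its backend (TensorFlow here), which is exactly the wrapping the bullet describes.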

6. Type II errors are better than Type I errors: When designing computer vision products for consumers (or B2B2C in my case), a false negative is better than a false positive. For example, if our model didn’t recognize a product, that was a missed opportunity but ultimately did not inhibit product success. However, showing a poor match is obvious to the consumer and greatly diminishes the user experience. So, although there is always a trade-off between precision and recall, for our use case, missing some valid matches (lower recall) was better than surfacing bad ones (lower precision).
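In practice, one common way to favor false negatives over false positives is to raise the confidence threshold a match must clear before being shown. A minimal sketch; the scores, SKUs, and threshold value are all hypothetical:

```python
# Favor false negatives over false positives: only surface a product match
# when the model's confidence clears a deliberately high bar.
MATCH_THRESHOLD = 0.9  # hypothetical: better to miss a match than show a bad one

def visible_matches(candidates):
    """candidates: list of (product_id, confidence) pairs from the model."""
    return [pid for pid, score in candidates if score >= MATCH_THRESHOLD]

candidates = [("sku-123", 0.95), ("sku-456", 0.72), ("sku-789", 0.91)]
print(visible_matches(candidates))  # ['sku-123', 'sku-789']
```

The 0.72 candidate is suppressed: a possible missed sale, but never a visibly wrong match on the page.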

7. Product design still matters: Computer vision is a feature, not a product. Therefore, traditional product design still matters. For us, that involved designing a pleasant, non-invasive UI/UX for the consumer, and perhaps more importantly, building a flexible product. For example, we gave our enterprise customers the ability to change the placement of the “shop” tag to best match their site. While AI is the latest buzzword, old-school product design still matters much more.

8. And most importantly, solving real business problems still matters: Enterprises don’t buy deep learning products because they are cutting-edge. They buy them because they are the best solution for their day-to-day problems. In our case, publishers were looking for native ways to monetize their websites without traditional ads. Our goal was to solve this problem. Computer vision just happened to be the best solution, but we did explore simpler options.


Renee Shah

Partner at Amplify Partners focused on infra, dev tools, and security