Even though the International Conference on Computer Vision 2021 (ICCV 2021) was held virtually, it felt like we were in a real conference room.
The first time I participated in ICCV was in Osaka in 1990. I was a relatively new PhD student and gave an oral presentation on "A Model for the Detection of Motion over Time". My supervisor, Anandan, told me before the conference, "This is a big conference for you. After this talk, you will be famous." To be honest, I was scared.
My wife was with me, so I used the week before the conference to go sightseeing around Japan. All the while, I was sick to my stomach. I thought Japanese food didn't suit me, but it turned out to be stress! The moment the talk was over, my stomach settled completely. In the end, no one remembers that talk, and it didn't make me famous. Fame doesn't come from a single paper. If it comes at all, it comes from a reputation built over years of dedication.
Padmanabhan Anandan is an artificial intelligence researcher specializing in computer vision. He spent his research career at Yale University, Microsoft, and Adobe. His joint work with Michael Black, the author of this article, is famous in computer vision: the "Black and Anandan" method contributed to the popularization of robust statistics in the field.
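As a rough illustration of what "robust statistics" means here: robust formulations replace the quadratic penalty of least squares with a function that down-weights outliers, such as pixels that violate a motion model at object boundaries. The sketch below uses the Lorentzian penalty as one such robust function; the exact penalty and parameters in the original Black–Anandan work may differ, so treat this as a minimal sketch rather than the method itself.

```python
import math

def quadratic(r):
    """Standard least-squares penalty: grows without bound,
    so a single gross outlier can dominate the estimate."""
    return r * r

def lorentzian(r, sigma=1.0):
    """Lorentzian robust penalty: grows only logarithmically for large
    residuals, so outliers have limited influence on the estimate."""
    return math.log(1.0 + 0.5 * (r / sigma) ** 2)

# A small residual is penalized similarly by both functions, but a large
# outlier residual is penalized far less by the robust penalty.
small, large = 0.1, 100.0
print("quadratic ratio:", quadratic(large) / quadratic(small))
print("robust ratio:   ", lorentzian(large) / lorentzian(small))
```

The practical effect is that motion estimates are no longer dragged toward gross outliers, which is why robust penalties became popular for optical flow, where motion boundaries routinely violate smoothness assumptions.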
Looking at the ICCV proceedings from 1990, several points are striking.
Papers seem to vary from 4 to 12 pages; I don't remember why. They also had fewer authors than current papers, often only one. Back then, even senior professors like Sandy Pentland could write single-author papers. Many of the authors are still active today (Trevor Darrell, Luc Van Gool, Pietro Perona, Jitendra Malik, Bill Freeman, etc.). There were only a handful of women, and unfortunately not much progress has been made since. Also, because the field was still young and small, papers did not include as many references as they do today; getting to grips with the literature was much easier than it is now. Of the 126 papers, exactly one was on neural networks: "Object Recognition by Hopfield Neural Network".
Back then, there was no WWW, no GitHub, no GPUs, and little existing research to build on. I did my PhD starting from an empty Emacs buffer, typing Lisp and C. There were a few video sequences that people had digitized at great expense, and those few sequences were used by everyone. To understand the research environment at the time, imagine doing computer vision research without a way to get image or video data into a computer!
The biggest surprise for me in 1990 was that my supervisor paid for my flight to Japan! Having never traveled much before, I was immediately hooked on travel. If you keep writing papers, you can travel the world for free! Travel was a great motivator and helped me be productive during my PhD. Even now, I still feel I can't attend a conference without a paper.
The computer vision community was pretty small back then. ICCV 1990 was only the third ICCV, and its 419 participants made it seem quite large. It was my first computer vision conference, and I was impressed by how warmly everyone welcomed me. I met a young Andrew Zisserman there and walked with him from the hotel to the conference venue. Visual Reconstruction, which he co-authored with Andrew Blake, was like a bible to me. I was amazed at how smart and quick he was, and that he was teaching himself Japanese as we walked between the hotel and the venue.
What impressed me about ICCV then, and still does now, is how friendly and open the people are. Andrew was already a household name in 1990, but he took the time to talk to a new graduate student like me. Like him, people I met at those early conferences have become friends and colleagues throughout my career. I hope that today's young researchers feel as welcome as I did back then.
At conferences these days, a lot of young people come up to me and introduce themselves, and it's nice to meet them. They are the future of the field. I also go to see a lot of posters. Poster sessions are where I meet newcomers to the field and find out what they are thinking. Sometimes, in the middle of a fun discussion there, people read my badge and realize who I am. When that happens, they say, "Oh, you're Michael Black!" At such times, I try to reassure them that I am just another researcher who wants to understand their paper.
Of course, there are many major differences between ICCV 1990 and ICCV 2021. Besides the big growth in this field, the biggest change is in the tools we use. The problems we are working on are similar, but today our tools are based on neural networks.
After AlexNet won the ImageNet challenge, everyone of a certain age went through the five stages of grief. The first stage was shock and denial. It felt like the world had been turned upside down by AlexNet's results and nothing would ever go back to normal. But the results could not be denied.
Then, with anger, the bargaining begins. "Sure, these things [deep learning] are good at classification, but they are by no means suited to regression problems." I certainly told myself so. But then it turned out that deep nets are very good at solving regression problems.
Depression came next. Many older researchers felt this way: "My career is over. Nothing I've ever done is worth anything. Nothing I wrote in the last five years will ever be cited again. And I'm not interested in this new thing. I like calculus, manifolds, geometry, linear algebra. I like to think about problems in a certain way, and I'm good at it. But this new thing thinks about problems in a different way, and I'm not interested in thinking that way. What should I do now?"
Some people stopped there; at a certain age, it might have been a good time to retire. But many did not stop there. As they worked through their grief, one day a turning point came and acceptance began. And then hope was born. People who went through this came back to the field with a fresh start and hope: a whole new set of tools, a new perspective on the problems they had always been interested in, and a new lease on life.
That said, eight or nine years after this revolution, it has become clear that some problems aren't all that interesting, at least not to me. But I now see other problems that I didn't see before.
My career began with research on optical flow estimation presented at ICCV 1990. My group at the Max Planck Institute published a paper last year on adversarial attacks on optical flow networks. However, this may be my last paper on optical flow. I would never say the problem is solved, but given enough data, such as synthetic data or unlabeled video, it is solvable to whatever degree someone needs it solved.
In fact, problems like this, such as estimating stereo, albedo, and surface normals, are called low-level problems in the sense that we do the same thing for every pixel in the image and can measure the accuracy numerically. These problems are solvable with enough data, so they are no longer so interesting to me.
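To make "measure its accuracy numerically" concrete for optical flow: the standard per-pixel measure is the average endpoint error, the Euclidean distance between estimated and ground-truth flow vectors averaged over all pixels. A minimal sketch, with tiny hypothetical flow fields:

```python
import math

def average_endpoint_error(flow_est, flow_gt):
    """Average endpoint error between two flow fields.

    Each flow field is a list of rows; each pixel is a (u, v) displacement.
    """
    total, count = 0.0, 0
    for row_est, row_gt in zip(flow_est, flow_gt):
        for (u1, v1), (u2, v2) in zip(row_est, row_gt):
            total += math.hypot(u1 - u2, v1 - v2)
            count += 1
    return total / count

# Tiny 1x2 example: one pixel is exact, the other is off by (3, 4).
est = [[(1.0, 0.0), (0.0, 0.0)]]
gt  = [[(1.0, 0.0), (3.0, 4.0)]]
print(average_endpoint_error(est, gt))  # (0 + 5) / 2 = 2.5
```

Because the error is computed identically at every pixel, benchmarking such low-level problems is straightforward once labeled (or synthetic) data exists, which is exactly why abundant data makes them "solvable".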
On the other hand, what I find interesting at the moment is the class of mid-level problems. I believe 3D human pose and shape estimation falls into this class. These mid-level problems apply non-uniform processing to the image, but are still evaluated with accuracies computed from measurements. When estimating the 3D pose and shape of a person, we estimate something that is not completely bound to the pixels: something in the 3D world. However, the accuracy of the estimate can still be evaluated quantitatively. The advantage is that, given labeled data, it is relatively easy to train a neural network to solve the problem. Interestingly, such labeled data is still hard to come by. Progress in 3D human pose and shape estimation is fairly rapid, and self-supervised methods are improving, so even this problem may no longer be interesting in five years.
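A common way the quantitative evaluation mentioned here is done for 3D human pose is the mean per-joint position error (MPJPE): the average Euclidean distance between estimated and ground-truth 3D joint positions. The sketch below uses made-up joint coordinates purely for illustration:

```python
import math

def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error: average 3D distance across joints.

    Each argument is a list of (x, y, z) joint positions, typically in mm.
    """
    assert len(pred_joints) == len(gt_joints)
    total = 0.0
    for p, g in zip(pred_joints, gt_joints):
        total += math.dist(p, g)
    return total / len(pred_joints)

# Two hypothetical joints: one exact, one 30 mm off along a single axis.
pred = [(0.0, 0.0, 0.0), (100.0, 200.0, 50.0)]
gt   = [(0.0, 0.0, 0.0), (100.0, 230.0, 50.0)]
print(mpjpe(pred, gt))  # (0 + 30) / 2 = 15.0 mm
```

Note how the metric lives in the 3D world, not in the image: this is what distinguishes mid-level problems from per-pixel low-level ones, while still allowing fully quantitative benchmarking.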
But beyond these mid-level problems lies a wonderful world that we don't understand. I call these the "high-level problems". Computer vision has a broader mission: seeing what is not in the image. That is the real purpose of computer vision. What caused the image? What will happen next? In my case, I am interested in humans, their movements, and their actions. What are they doing? Why are they doing it? What are they feeling? What will they do next? These are not directly observable from the pixels. There are no measurable pixels in an image that can tell you what is going on in someone's head. Progress on these problems will be slow, because they cannot easily be evaluated with accuracies computed from measurements. These high-level problems will keep me busy at ICCV in the coming years.
Another big difference between 1990 and now (besides the rise of neural networks) is the participation of industry. Many papers from companies can be seen in ICCV. Back in 1990, not much of what we did was successful enough to be useful to the world. Fortunately, that is not the case now, and many ideas from ICCV2021 will be commercialized.
While it's exciting to see computer vision in the hands of real users, many people underestimate what it takes to go from a research paper at ICCV to a product customers use every day. By the time a product is actually in use, the DNA of the original research remains, but many people have contributed to its commercialization, and many new ideas were generated along the way. And what actually makes a good customer experience often has little to do with the underlying technology. Perhaps that is why researchers feel, "I contributed to the commercialization, but my contribution is only one piece of a larger puzzle."
Luckily, I have been able to participate in every ICCV since my first one, which has allowed me to see the world and make great friends. As for the future of ICCV, I look forward to returning to face-to-face meetings in Paris in 2023. Virtual meetings don't give you the feel of the real thing. I look forward to seeing you there!
(*Translation Note 2) The paper "Attacking Optical Flow", co-authored by a research team from the Max Planck Institute, where the author of this article, Michael Black, works, and the University of Tübingen, Germany, summarizes experimental results of attacks that impede the estimation of optical flow, the vectorized representation of object motion in video. For multiple AI models that estimate optical flow, when a noise patch covering less than 1% of the image area is inserted into the image, some models misidentify moving objects. This result is an important finding for ensuring the safety of self-driving cars, to which optical flow estimation models are applied. The image below shows the group of patches used as noise.
The images below show what various optical flow estimation models output when the noise-patch attack is applied. The optical flow is visualized by coloring pixels according to the estimated values. The "Difference" column on the right shows the difference in estimated optical flow between the no-attack and attack conditions; the wider the colored area, the more misestimation the attack caused.
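The mechanics of the attack described above can be sketched as pasting a small patch into an input frame, where the patch covers well under 1% of the image area. In the paper the patch contents are optimized to disrupt the flow estimate; in this sketch the patch is just placeholder values, and all names and sizes are illustrative.

```python
def paste_patch(image, patch, top, left):
    """Return a copy of `image` (a 2D list of pixel values) with `patch`
    pasted at position (top, left). The original image is not modified."""
    out = [row[:] for row in image]
    for i, patch_row in enumerate(patch):
        for j, value in enumerate(patch_row):
            out[top + i][left + j] = value
    return out

H, W = 100, 100
image = [[0] * W for _ in range(H)]
patch = [[255] * 9 for _ in range(9)]  # a hypothetical 9x9 adversarial patch

attacked = paste_patch(image, patch, 10, 10)
area_fraction = (9 * 9) / (H * W)
print(area_fraction)  # 0.0081, i.e. under 1% of the image area
```

The surprising finding reported in the paper is that a perturbation this small, occupying a fixed local region, can corrupt flow estimates far outside the patch itself for some models.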
In addition, the video below summarizes the contents of the above paper.
The paper "Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation", co-authored by the Max Planck Institute, NVIDIA, and MIT, proposes unsupervised learning methods for low-level computer vision problems such as single-view depth prediction, camera motion estimation, optical flow, and segmentation of video into static scenes and moving regions. Because these low-level problems are geometrically complementary, jointly training them in an unsupervised fashion achieves better accuracy than existing work.