Laparoscopic surgeries can take hours, and the video generated by the camera -- the laparoscope -- is often recorded. Those recordings contain a wealth of information that could be useful for training both medical providers and computer systems that would aid with surgery, but because reviewing them is so time-consuming, they mostly sit idle.
Researchers at MIT and Massachusetts General Hospital hope to change that, with a new system that can efficiently search through hundreds of hours of video for events and visual features that correspond to a few training examples.
In work they presented at the International Conference on Robotics and Automation this month, the researchers trained their system to recognize different stages of an operation, such as biopsy, tissue removal, stapling, and wound cleansing.
But the system could be applied to any analytical question that doctors deem worthwhile. It could, for instance, be trained to predict when particular medical instruments -- such as additional staple cartridges -- should be prepared for the surgeon's use, or it could sound an alert if a surgeon encounters rare, aberrant anatomy.
"Surgeons are thrilled by all the features that our work enables," says Daniela Rus, an Andrew and Erna Viterbi Professor of Electrical Engineering and Computer Science and senior author on the paper. "They are thrilled to have the surgical tapes automatically segmented and indexed, because now those tapes can be used for training. If we want to learn about phase two of a surgery, we know exactly where to go to look for that segment. We don't have to watch every minute before that. The other thing that is extraordinarily exciting to the surgeons is that in the future, we should be able to monitor the progression of the operation in real-time."
Joining Rus on the paper are first author Mikhail Volkov, who was a postdoc in Rus' group when the work was done and is now a quantitative analyst at SMBC Nikko Securities in Tokyo; Guy Rosman, another postdoc in Rus' group; and Daniel Hashimoto and Ozanan Meireles of Massachusetts General Hospital (MGH).
The new paper builds on previous work from Rus' group on "coresets," or subsets of much larger data sets that preserve their salient statistical characteristics. In the past, Rus' group has used coresets to perform tasks such as deducing the topics of Wikipedia articles or recording the routes traversed by GPS-connected cars.
In this case, the coreset consists of a couple hundred or so short segments of video -- just a few frames each. Each segment is selected because it offers a good approximation of the dozens or even hundreds of frames surrounding it. The coreset thus winnows a video file down to only about one-tenth its initial size, while still preserving most of its vital information.
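To make the idea of a representative segment concrete, here is a minimal sketch in Python. It is not the researchers' coreset algorithm, which the article does not spell out. It assumes each frame has already been reduced to a small feature vector, and it uses a hypothetical `tolerance` threshold and plain Euclidean distance: a frame is kept as a representative, and the following frames are folded into its segment until one drifts too far from it.

```python
from math import dist

def select_representatives(features, tolerance):
    """Greedily split a frame sequence into segments.

    features  -- list of per-frame feature vectors (assumed, not from the paper)
    tolerance -- hypothetical distance threshold for "close enough"

    Returns a list of (representative_index, last_covered_index) pairs.
    The first frame of each segment serves as its representative.
    """
    segments = []
    start = 0
    for i in range(1, len(features) + 1):
        # Close the current segment at the end of the video, or when the
        # next frame is too far from the segment's representative.
        if i == len(features) or dist(features[i], features[start]) > tolerance:
            segments.append((start, i - 1))
            start = i
    return segments

# Six toy frames: three similar, two similar, one on its own.
frames = [(0.0, 0.0), (0.1, 0.0), (0.05, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
segments = select_representatives(frames, tolerance=1.0)
```

In this toy run, six frames collapse to three representatives. A real coreset construction would also carry guarantees about how well the kept data approximates the whole, which this greedy pass does not provide.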
For this research, MGH surgeons identified seven distinct stages in a procedure for removing part of the stomach, and the researchers tagged the beginning of each stage in eight laparoscopic videos. Those videos were used to train a machine-learning system, which was in turn applied to the coresets of four laparoscopic videos it hadn't previously seen. The system assigned each short video snippet in those coresets to the correct stage of surgery with 93 percent accuracy.
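As a rough illustration of how tagged stage beginnings can become training labels, and how an accuracy figure like the one above is computed, consider the sketch below. The frame indices, function names, and predictions are all made up for the example; the article does not describe the researchers' actual labeling code or classifier.

```python
from bisect import bisect_right

def label_frames(stage_starts, num_frames):
    """Give every frame the index of the stage it falls in.

    stage_starts -- sorted frame indices at which each stage begins
                    (hypothetical values; the first stage starts at 0)
    """
    # The stage of frame f is the last stage whose start is <= f.
    return [bisect_right(stage_starts, f) - 1 for f in range(num_frames)]

def accuracy(predicted, actual):
    """Fraction of snippets assigned to the correct stage."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Three stages beginning at frames 0, 3, and 5 of a seven-frame clip.
truth = label_frames([0, 3, 5], 7)

# A made-up set of predictions that gets one frame wrong.
guesses = [0, 0, 1, 1, 1, 2, 2]
score = accuracy(guesses, truth)
```

Here `truth` is `[0, 0, 0, 1, 1, 2, 2]`, and the invented predictions match it on six of seven frames.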
"We wanted to see how this system works for relatively small training sets," Rosman explains. "If you're in a specific hospital, and you're interested in a specific surgery type, or even more important, a specific variant of a surgery -- all the surgeries where this or that happened -- you may not have a lot of examples."
The general procedure that the researchers used to extract the coresets is one they've previously described, but coreset selection always hinges on specific properties of the data it's being applied to. The data included in the coreset -- here, frames of video -- must approximate the data being left out, and the degree of approximation is measured differently for different types of data.
Machine learning, however, can itself be thought of as a problem of approximation. In this case, the system had to learn to identify similarities between frames of video in separate laparoscopic feeds that denoted the same phases of a surgical procedure. The similarity metric it arrived at also served to assess how closely the video frames included in the coreset resembled those that were omitted.
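One simple way to picture that assessment is to ask how far the worst-served omitted frame is from its nearest kept frame. The sketch below does this with Euclidean distance on toy feature vectors. Both the distance and the `coverage_error` function are stand-ins chosen for illustration, since the article says the researchers' metric was learned by the system and does not give its form.

```python
from math import dist

def coverage_error(features, kept_indices):
    """Largest distance from any omitted frame to its nearest kept frame.

    features     -- list of per-frame feature vectors (assumed representation)
    kept_indices -- indices of the frames retained in the summary
    """
    kept_set = set(kept_indices)
    kept = [features[i] for i in kept_indices]
    return max(
        (min(dist(f, k) for k in kept)
         for i, f in enumerate(features) if i not in kept_set),
        default=0.0,  # nothing was omitted, so there is no error
    )

frames = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
two_kept = coverage_error(frames, [0, 3])  # each omitted frame has a close neighbor
one_kept = coverage_error(frames, [0])     # the far frames are poorly represented
```

Keeping both end frames leaves every omitted frame within a distance of 1.0 of a kept one, while keeping only the first frame leaves the last frame 5.0 away. A smaller value means the kept frames stand in better for the ones left out.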