In recent years, computers have gotten remarkably good at recognizing speech and images: Think of the dictation software on most cellphones, or the algorithms that automatically identify people in photos posted to Facebook.
But recognition of natural sounds — such as crowds cheering or waves crashing — has lagged behind. That’s because most automated recognition systems, whether they process audio or visual information, are the result of machine learning, in which computers search for patterns in huge compendia of training data. Usually, the training data first has to be annotated by hand, which is prohibitively expensive for all but the highest-demand applications.
Sound recognition may be catching up, however, thanks to researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL). At the Neural Information Processing Systems conference next week, they will present a sound-recognition system that outperforms its predecessors but didn’t require hand-annotated data during training.
Instead, the researchers trained the system on video. First, existing computer vision systems that recognize scenes and objects categorized the images in the video. The new system then found correlations between those visual categories and natural sounds.
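The transfer idea can be sketched in miniature. In the toy example below, the feature vectors, labels, and nearest-centroid "student" are all invented for illustration — the actual system uses deep neural networks — but the training signal works the same way: labels produced by a vision system stand in for hand annotations.

```python
import math
from collections import Counter, defaultdict

# Toy paired data. Each video clip yields a (made-up) two-number audio
# feature and a visual label that a pretrained scene-recognition system
# would assign to the clip's frames.
clips = [
    ([0.9, 0.1], "beach"),    # wave-like audio
    ([0.8, 0.2], "beach"),
    ([0.1, 0.9], "stadium"),  # crowd-like audio
    ([0.2, 0.8], "stadium"),
]

def train_centroids(pairs):
    """Average the audio features under each vision-derived label."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = Counter()
    for feat, label in pairs:
        counts[label] += 1
        for i, x in enumerate(feat):
            sums[label][i] += x
    return {lab: [x / counts[lab] for x in vec] for lab, vec in sums.items()}

def classify(feat, centroids):
    """Label a new sound by its nearest audio centroid."""
    return min(centroids, key=lambda lab: math.dist(feat, centroids[lab]))

centroids = train_centroids(clips)
print(classify([0.85, 0.15], centroids))  # -> beach
```

No human ever labeled a sound here: the audio classifier inherits its categories entirely from the vision system's output.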
“Computer vision has gotten so good that we can transfer it to other domains,” says Carl Vondrick, an MIT graduate student in electrical engineering and computer science and one of the paper’s two first authors. “We’re capitalizing on the natural synchronization between vision and sound. We scale up with tons of unlabeled video to learn to understand sound.”
The researchers tested their system on two standard databases of annotated sound recordings, and it was between 13 and 15 percent more accurate than the best-performing previous system. On a data set with 10 different sound categories, it could categorize sounds with 92 percent accuracy, and on a data set with 50 categories it performed with 74 percent accuracy. On those same data sets, humans are 96 percent and 81 percent accurate, respectively.
“Even humans are ambiguous,” says Yusuf Aytar, the paper’s other first author and a postdoc in the lab of MIT professor of electrical engineering and computer science Antonio Torralba. Torralba is the final co-author on the paper.
“We did an experiment with Carl,” Aytar says. “Carl was looking at the computer monitor, and I couldn’t see it. He would play a recording and I would try to guess what it was. It turns out this is really, really hard. I could tell indoor from outdoor, basic guesses, but when it comes to the details — ‘Is it a restaurant?’ — those details are missing. Even for annotation purposes, the task is really hard.”
Daniela Rus loves Singapore. As the MIT professor sits down in her Frank Gehry-designed office in Cambridge, Massachusetts, to talk about her research in Singapore, her face relaxes into a big smile.
Her story with Singapore started in the summer of 2010, when she made her first visit to one of the most futuristic and forward-looking cities in the world. “It was love at first sight,” says the Andrew (1956) and Erna Viterbi Professor of Electrical Engineering and Computer Science and the director of MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL). That summer, she came to Singapore to join the Singapore-MIT Alliance for Research and Technology (SMART) as the first principal investigator in residence for the Future of Urban Mobility Research Program.
“In 2010, nobody was talking about autonomous driving. We were pioneers in developing and deploying the first mobility on demand for people with self-driving golf buggies,” says Rus. “And look where we stand today! Every single car maker is investing millions of dollars to advance autonomous driving. Singapore did not hesitate to provide us, at an early stage, with all the financial, logistical, and transportation resources to facilitate our work.”
Since her first visit, Rus has returned each year to follow up on the research, and has been involved in leading revolutionary projects for the future of urban mobility. “Our team worked tremendously hard on self-driving technologies, and we are now presenting a wide range of different devices that allow autonomous and secure mobility,” she says. “Our objective today is to make taking a driverless car for a spin as easy as programming a smartphone. A simple interaction between the human and machine will provide a transportation butler.”
The first mobility devices her team worked on were self-driving golf buggies. Two years ago, these buggies advanced to the point where the group decided to open them to the public in a weeklong trial at the Chinese Gardens, an idea facilitated by Singapore’s Land Transport Authority (LTA). Over the course of the week, more than 500 people booked rides from the comfort of their homes, and came to the Chinese Gardens at the designated time and spot to experience mobility-on-demand with robots.
The test was conducted around winding paths trafficked by pedestrians, bicyclists, and the occasional monitor lizard. The experiments also tested an online booking system that enabled visitors to schedule pickups and drop-offs around the garden, automatically routing and redeploying the vehicles to accommodate all the requests. The public’s response was joyful and positive, and this brought the team renewed enthusiasm to take the technology to the next level.
The butt of jokes as recently as 10 years ago, automatic speech recognition is now on the verge of becoming people’s chief means of interacting with their principal computing devices.
In anticipation of the age of voice-controlled electronics, MIT researchers have built a low-power chip specialized for automatic speech recognition. Whereas a cellphone running speech-recognition software might require about 1 watt of power, the new chip requires between 0.2 and 10 milliwatts, depending on the number of words it has to recognize.
In a real-world application, that probably translates to a power savings of 90 to 99 percent, which could make voice control practical for relatively simple electronic devices. That includes power-constrained devices that have to harvest energy from their environments or go months between battery charges. Such devices form the technological backbone of what’s called the “internet of things,” or IoT, which refers to the idea that vehicles, appliances, civil-engineering structures, manufacturing equipment, and even livestock will soon have sensors that report information directly to networked servers, aiding with maintenance and the coordination of tasks.
“Speech input will become a natural interface for many wearable applications and intelligent devices,” says Anantha Chandrakasan, the Vannevar Bush Professor of Electrical Engineering and Computer Science at MIT, whose group developed the new chip. “The miniaturization of these devices will require a different interface than touch or keyboard. It will be critical to embed the speech functionality locally to save system energy consumption compared to performing this operation in the cloud.”
“I don’t think that we really developed this technology for a particular application,” adds Michael Price, who led the design of the chip as an MIT graduate student in electrical engineering and computer science and now works for chipmaker Analog Devices. “We have tried to put the infrastructure in place to provide better trade-offs to a system designer than they would have had with previous technology, whether it was software or hardware acceleration.”
Price, Chandrakasan, and Jim Glass, a senior research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory, described the new chip in a paper Price presented last week at the International Solid-State Circuits Conference.
Jay W. Forrester SM ’45, professor emeritus in the MIT Sloan School of Management, founder of the field of system dynamics, and a pioneer of digital computing, died Nov. 16. He was 98.
Forrester’s time at MIT was rife with invention. He was a key figure in the development of digital computing, the national air defense system, and MIT’s Lincoln Laboratory. He developed servomechanisms (feedback-based controls for mechanical devices), radar controls, and flight-training computers for the U.S. Navy. He led Project Whirlwind, an early MIT digital computing project. It was his work on Whirlwind that led him to invent magnetic core memory, an early form of RAM for which he holds the patent, in 1949.
MIT Sloan Professor John Sterman, a student, friend, and colleague of Forrester’s since the 1970s, points to a 2003 photo of Forrester on a Segway as an illustration of his work’s lasting impact.
“He really is standing on top of the fruits of his many careers,” Sterman said. “He’s standing on a device that integrates servomechanisms, digital controllers, and a sophisticated feedback control system.”
“From the air traffic control system to 3-D printers, from the software companies use to manage their supply chains to the simulations nations use to understand climate change, the world in which we live today was made possible by Jay’s work,” he said.
System dynamics: A new view of management
It was after turning his attention to management in the mid-1950s that Forrester developed system dynamics — a model-based approach to analyzing complex organizations and systems — while studying a General Electric appliance factory. An MIT Technology Review article explores how he sought to combat the factory’s boom-and-bust cycle by examining its “weekly orders, inventory, production rate, and employees.” He then developed a computer simulation of the GE supply chain to show how management practices, not market forces, were causing the cycle.
Forrester’s “Industrial Dynamics” was published in 1961. The field expanded to chart the complexities of economies, supply chains, and organizations. Later, he applied the principles of system dynamics to global issues in “Urban Dynamics,” published in 1969, and “World Dynamics,” published in 1971. The latter was an integrated simulation model of population, resources, and economic growth. Forrester became a critic of growth, a position that earned him few friends.
Surviving breast cancer changed the course of Regina Barzilay’s research. The experience showed her, in stark relief, that oncologists and their patients lack tools for data-driven decision making. That includes what treatments to recommend, but also whether a patient’s sample even warrants a cancer diagnosis, she explained at the Nov. 10 Machine Intelligence Summit, organized by MIT and venture capital firm Pillar.
“We do more machine learning when we decide on Amazon which lipstick you would buy,” said Barzilay, the Delta Electronics Professor of Electrical Engineering and Computer Science at MIT. “But not if you were deciding whether you should get treated for cancer.”
Barzilay now studies how smarter computing can help patients. She wields the powerful predictive approach called machine learning, a technique that allows computers, given enough data and training, to pick out patterns on their own — sometimes even beyond what humans are capable of pinpointing.
Machine learning has long been vaunted in consumer contexts — Apple’s Siri can talk with us because machine learning enables her to understand natural human speech — yet the summit gave a glimpse of the approach’s much broader potential. Its reach could offer not only better Siris (e.g., Amazon’s “Alexa”), but improved health care and government policies.
Machine intelligence is “absolutely going to revolutionize our lives,” said Pillar co-founder Jamie Goldstein ’89. Goldstein and Anantha Chandrakasan, head of the MIT Department of Electrical Engineering and Computer Science (EECS) and the Vannevar Bush Professor of Electrical Engineering and Computer Science, organized the conference to bring together industry leaders, venture capitalists, students, and faculty from the Computer Science and Artificial Intelligence Laboratory (CSAIL), the Institute for Data, Systems, and Society (IDSS), and the Laboratory for Information and Decision Systems (LIDS) to discuss real-world problems and machine learning solutions.
Barzilay is already thinking along those lines. Her group’s work aims to help doctors and patients make more informed medical decisions with machine learning. She has a vision for the future patient in the oncologist’s office: “If you’re taking this treatment, [you’ll see] how your chances are going to be changed.”
One way to handle big data is to shrink it. If you can identify a small subset of your data set that preserves its salient mathematical relationships, you may be able to perform useful analyses on it that would be prohibitively time consuming on the full set.
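A toy sketch of the underlying principle — sampling points in proportion to their contribution and reweighting them — shows how a few hundred points can stand in for many thousands. This is a generic importance-sampling illustration, not the paper's actual coreset construction; the data set and sample size are invented.

```python
import random

random.seed(0)

# A large toy data set: many small values plus a few big outliers that
# dominate the total.
data = [1.0] * 10000 + [500.0] * 10
total = sum(data)

# Sample points with probability proportional to their contribution, and
# reweight each picked point by 1/(m * p_i) so the weighted sum stays unbiased.
probs = [x / total for x in data]

def coreset(data, probs, m):
    idx = random.choices(range(len(data)), weights=probs, k=m)
    return [(data[i], 1.0 / (m * probs[i])) for i in idx]

sample = coreset(data, probs, 200)   # 200 weighted points stand in for 10,010
approx = sum(x * w for x, w in sample)
# For the sum itself, each weighted term equals total/m, so the estimate
# matches the true total (up to floating-point error); for related
# quantities the same sampling scheme gives low-variance estimates.
print(total, approx)
```

Real coresets must preserve richer structure than a single sum — hence the application-specific constructions the paragraph above mentions — but the sample-and-reweight skeleton is the same.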
The methods for creating such “coresets” vary according to application, however. Last week, at the Annual Conference on Neural Information Processing Systems, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory and the University of Haifa in Israel presented a new coreset-generation technique that’s tailored to a whole family of data analysis tools with applications in natural-language processing, computer vision, signal processing, recommendation systems, weather prediction, finance, and neuroscience, among many others.
“These are all very general algorithms that are used in so many applications,” says Daniela Rus, the Andrew and Erna Viterbi Professor of Electrical Engineering and Computer Science at MIT and senior author on the new paper. “They’re fundamental to so many problems. By figuring out the coreset for a huge matrix for one of these tools, you can enable computations that at the moment are simply not possible.”
As an example, in their paper the researchers apply their technique to a matrix — that is, a table — that maps every article on the English version of Wikipedia against every word that appears on the site. That’s 1.4 million articles, or matrix rows, and 4.4 million words, or matrix columns.
That matrix would be much too large to analyze using low-rank approximation, an algorithm that can deduce the topics of free-form texts. But with their coreset, the researchers were able to use low-rank approximation to extract clusters of words that denote the 100 most common topics on Wikipedia. The cluster that contains “dress,” “brides,” “bridesmaids,” and “wedding,” for instance, appears to denote the topic of weddings; the cluster that contains “gun,” “fired,” “jammed,” “pistol,” and “shootings” appears to designate the topic of shootings.
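A drastically scaled-down version of that topic extraction can be run on a five-word, four-article matrix. The power iteration below finds the dominant singular direction of the term-document matrix, whose heaviest-weighted words form the most common "topic"; the vocabulary and counts are invented for illustration.

```python
# Tiny term-document matrix: rows are articles, columns are word counts.
words = ["dress", "bride", "wedding", "gun", "pistol"]
A = [
    [1, 1, 1, 0, 0],   # three articles about weddings
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],   # one article about shootings
]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def top_right_singular_vector(A, iters=50):
    """Power iteration on A^T A converges to the dominant word direction."""
    n = len(A[0])
    AT = [[A[r][c] for r in range(len(A))] for c in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        v = matvec(AT, matvec(A, v))
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v

v = top_right_singular_vector(A)
top = sorted(range(len(v)), key=lambda i: -abs(v[i]))[:3]
print([words[i] for i in top])  # the dominant topic's words: the wedding cluster
```

On the full Wikipedia matrix the same idea is computationally out of reach without a coreset, which is precisely the gap the researchers' technique fills.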
Joining Rus on the paper are Mikhail Volkov, an MIT postdoc in electrical engineering and computer science, and Dan Feldman, director of the University of Haifa’s Robotics and Big Data Lab and a former postdoc in Rus’s group.
MIT researchers and their colleagues have developed a new computational model of the human brain’s face-recognition mechanism that seems to capture aspects of human neurology that previous models have missed.
The researchers designed a machine-learning system that implemented their model, and they trained it to recognize particular faces by feeding it a battery of sample images. They found that the trained system included an intermediate processing step that represented a face’s degree of rotation — say, 45 degrees from center — but not the direction — left or right.
This property wasn’t built into the system; it emerged spontaneously from the training process. But it duplicates an experimentally observed feature of the primate face-processing mechanism. The researchers consider this an indication that their system and the brain are doing something similar.
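The flavor of such a representation can be illustrated with a one-line toy feature: an even function of pose angle encodes how far a face is turned without encoding to which side. This is only a mathematical cartoon of the property, not the trained system's actual representation.

```python
import math

def view_feature(angle_deg):
    """A toy face feature that is an even function of pose angle: it
    captures the magnitude of rotation but discards its direction."""
    theta = math.radians(angle_deg)
    return round(math.cos(theta), 6)

# Views at +45 and -45 degrees map to the same intermediate value,
# while a 90-degree view maps elsewhere.
print(view_feature(45) == view_feature(-45))  # True
print(view_feature(45) == view_feature(90))   # False
```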
“This is not a proof that we understand what’s going on,” says Tomaso Poggio, a professor of brain and cognitive sciences at MIT and director of the Center for Brains, Minds, and Machines (CBMM), a multi-institution research consortium funded by the National Science Foundation and headquartered at MIT. “Models are kind of cartoons of reality, especially in biology. So I would be surprised if things turn out to be this simple. But I think it’s strong evidence that we are on the right track.”
Indeed, the researchers’ new paper includes a mathematical proof that the particular type of machine-learning system they use, which was intended to offer what Poggio calls a “biologically plausible” model of the nervous system, will inevitably yield intermediary representations that are indifferent to angle of rotation.
Poggio, who is also a primary investigator at MIT’s McGovern Institute for Brain Research, is the senior author on a paper describing the new work, which appeared today in the journal Computational Biology. He’s joined on the paper by several other members of both the CBMM and the McGovern Institute: first author Joel Leibo, a researcher at Google DeepMind, who earned his PhD in brain and cognitive sciences from MIT with Poggio as his advisor; Qianli Liao, an MIT graduate student in electrical engineering and computer science; Fabio Anselmi, a postdoc in the IIT@MIT Laboratory for Computational and Statistical Learning, a joint venture of MIT and the Italian Institute of Technology; and Winrich Freiwald, an associate professor at the Rockefeller University.
This week the Association for Computing Machinery (ACM) announced its 2016 fellows, who include four principal investigators from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL): professors Erik Demaine, Frédo Durand, William Freeman, and Daniel Jackson. They were among the 1 percent of ACM members to receive the distinction.
“Erik, Frédo, Bill, and Daniel are wonderful colleagues and extraordinary computer scientists, and I am so happy to see their contributions recognized with the most prestigious member grade of the ACM,” says CSAIL Director Daniela Rus, who was herself named a fellow last year. “All of us at CSAIL are very proud of these researchers for receiving these esteemed honors.”
ACM’s 53 fellows for 2016 were named for their distinctive contributions spanning such computer science disciplines as computer vision, computer graphics, software design, machine learning, algorithms, and theoretical computer science.
“As nearly 100,000 computing professionals are members of our association, to be selected to join the top 1 percent is truly an honor,” says ACM President Vicki L. Hanson. “Fellows are chosen by their peers and hail from leading universities, corporations, and research labs throughout the world. Their inspiration, insights, and dedication bring immeasurable benefits that improve lives and help drive the global economy.”
Demaine was selected for contributions to geometric computing, data structures, and graph algorithms. His research interests include the geometry of understanding how proteins fold and the computational difficulty of playing games. He received the MacArthur Fellowship for his work in computational geometry. He and his father Martin Demaine have produced numerous curved-crease sculptures that explore the intersection of science and art — and that are currently in the Museum of Modern Art in New York.
A Department of Electrical Engineering and Computer Science (EECS) professor whose research spans video graphics and photo generation, Durand was selected for contributions to computational photography and computer graphics rendering. He also works to develop new algorithms that enable image enhancements and improved scene understanding. He received the ACM SIGGRAPH Computer Graphics Achievement Award in 2016.
Freeman is the Thomas and Gerd Perkins Professor of EECS at MIT. He was selected as a fellow for his contributions to computer vision, machine learning, and computer graphics. His research interests also include Bayesian models of visual perception and computational photography. He received “Outstanding Paper” awards at computer vision and machine learning conferences in 1997, 2006, 2009 and 2012, as well as ACM’s “Test of Time” awards for papers from 1990 and 1995.
When it comes to protecting data from cyberattacks, the information technology (IT) specialists who defend computer networks face attackers armed with some advantages. For one, while attackers need only find a single vulnerability in a system to gain network access and disrupt, corrupt, or steal data, IT personnel must constantly guard against and work to mitigate myriad and varied network intrusion attempts.
The homogeneity and uniformity of software applications have traditionally created another advantage for cyber attackers. “Attackers can develop a single exploit against a software application and use it to compromise millions of instances of that application because all instances look alike internally,” says Hamed Okhravi, a senior staff member in the Cyber Security and Information Sciences Division at MIT Lincoln Laboratory. To counter this problem, cybersecurity practitioners have implemented randomization techniques in operating systems. These techniques, notably address space layout randomization (ASLR), diversify the memory locations used by each instance of the application at the point at which the application is loaded into memory.
In response to randomization approaches like ASLR, attackers developed information leakage attacks, also called memory disclosure attacks. Through these software assaults, attackers can make the application disclose how its internals have been randomized while the application is running. Attackers then adjust their exploits to the application’s randomization and successfully hijack control of vulnerable programs. “The power of such attacks has ensured their prevalence in many modern exploit campaigns, including those network infiltrations in which an attacker remains undetected and continues to steal data in the network for a long time,” explains Okhravi, who adds that methods for bypassing ASLR, which is currently deployed in most modern operating systems, and similar defenses can be readily found on the Internet.
Okhravi and colleagues David Bigelow, Robert Rudd, James Landry, and William Streilein, and former staff member Thomas Hobson, have developed a unique randomization technique, timely address space randomization (TASR), to counter information leakage attacks that may thwart ASLR protections. “TASR is the first technology that mitigates an attacker’s ability to leverage information leakage against ASLR, irrespective of the mechanism used to leak information,” says Rudd.
To thwart information leakage attacks, TASR immediately rerandomizes the memory layout every time it observes an application processing an output and input pair. “Information may leak to the attacker on any given program output without anybody being able to detect it, but TASR ensures that the memory layout is rerandomized before the attacker has an opportunity to act on that stolen information, and hence denies them the opportunity to use it to bypass operating system defenses,” says Bigelow. Because TASR’s rerandomization is triggered by application activity rather than by a set timing (say, every so many minutes), an attacker cannot anticipate the interval during which the leaked information might be used to send an exploit to the application before randomization recurs.
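The leak-then-exploit race can be simulated in a few lines. The model below is a deliberate cartoon — the "layout" is a single random base address, with none of TASR's real mechanics — but it shows why rerandomizing on output invalidates leaked information before it can be used.

```python
import random

ADDRESS_SPACE = 2 ** 20  # toy 20-bit "layout": a single base address

class App:
    """A cartoon of a running program, not real memory management."""
    def __init__(self, rerandomize_on_io=False):
        self.rng = random.Random(42)
        self.rerandomize_on_io = rerandomize_on_io
        self.base = self.rng.randrange(ADDRESS_SPACE)  # ASLR at load time

    def leak(self):
        """A memory-disclosure bug reveals the current layout on output."""
        leaked = self.base
        if self.rerandomize_on_io:  # TASR: rerandomize after every output
            self.base = self.rng.randrange(ADDRESS_SPACE)
        return leaked

    def exploit(self, guessed_base):
        """The attack works only if the guess matches the live layout."""
        return guessed_base == self.base

aslr_only = App()
print(aslr_only.exploit(aslr_only.leak()))  # True: the leaked layout stays valid

tasr = App(rerandomize_on_io=True)
print(tasr.exploit(tasr.leak()))  # stale by the time it is used (almost surely False)
```

Against plain ASLR the leaked address remains correct indefinitely; under the TASR-style policy the attacker would have to guess a fresh layout, which succeeds only by blind luck.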
“When you’re part of a community, you want to leave it better than you found it,” says Keertan Kini, an MEng student in the Department of Electrical Engineering and Computer Science, or Course 6. That philosophy has guided Kini throughout his years at MIT as he has worked to improve policy both inside and outside the Institute.
As a member of the Undergraduate Student Advisory Group, former chair of the Course 6 Underground Guide Committee, member of the Internet Policy Research Initiative (IPRI), and of the Advanced Network Architecture group, Kini’s research focus has been in finding ways that technology and policy can work together. As Kini puts it, “there can be unintended consequences when you don’t have technology makers who are talking to policymakers and you don’t have policymakers talking to technologists.” His goal is to allow them to talk to each other.
At 14, Kini first became interested in politics. He volunteered for President Obama’s 2008 campaign, making calls and putting up posters. “That was the point I became civically engaged,” says Kini. He went on to campaign for a ballot initiative to raise more funding for his high school, and he hasn’t stopped being interested in public policy since.
High school was also where Kini became interested in computer science. He took a computer science class in high school on the recommendation of his sister, and in his senior year, he started watching computer science lectures on MIT OpenCourseWare (OCW) by Hal Abelson, a professor in MIT’s Department of Electrical Engineering and Computer Science.
“That lecture reframed what computer science was. I loved it,” Kini recalls. “The professor said ‘it’s not about computers, and it’s not about science.’ It might be an art or engineering, but it’s not science, because what we’re working with are idealized components, and ultimately the power of what we can achieve with them is not based so much on physical limitations as on the limitations of the mind.”
In part thanks to Abelson’s OCW lectures, Kini came to MIT to study electrical engineering and computer science. He is now pursuing an MEng in the same field, a fifth-year master’s program that follows his undergraduate degree.
Combining two disciplines
Kini set his policy interest to the side his freshman year, until he took 6.805J (Foundations of Information Policy), with Abelson, the same professor who inspired Kini to study computer science. After taking Abelson’s course, Kini joined him and Daniel Weitzner, a principal research scientist in the Computer Science and Artificial Intelligence Laboratory, in putting together a big data and privacy workshop for the White House in the wake of the Edward Snowden leak of classified information from the National Security Agency. Four years later, Kini is now a teaching assistant for 6.805J.
Speech recognition systems, such as those that convert speech to text on cellphones, are generally the result of machine learning. A computer pores through thousands or even millions of audio files and their transcriptions, and learns which acoustic features correspond to which typed words.
But transcribing recordings is costly, time-consuming work, which has limited speech recognition to a small subset of languages spoken in wealthy nations.
At the Neural Information Processing Systems conference this week, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) are presenting a new approach to training speech-recognition systems that doesn’t depend on transcription. Instead, their system analyzes correspondences between images and spoken descriptions of those images, as captured in a large collection of audio recordings. The system then learns which acoustic features of the recordings correlate with which image characteristics.
“The goal of this work is to try to get the machine to learn language more like the way humans do,” says Jim Glass, a senior research scientist at CSAIL and a co-author on the paper describing the new system. “The current methods that people use to train up speech recognizers are very supervised. You get an utterance, and you’re told what’s said. And you do this for a large body of data.
“Big advances have been made — Siri, Google — but it’s expensive to get those annotations, and people have thus focused on, really, the major languages of the world. There are 7,000 languages, and I think less than 2 percent have ASR [automatic speech recognition] capability, and probably nothing is going to be done to address the others. So if you’re trying to think about how technology can be beneficial for society at large, it’s interesting to think about what we need to do to change the current situation. And the approach we’ve been taking through the years is looking at what we can learn with less supervision.”
Joining Glass on the paper are first author David Harwath, a graduate student in electrical engineering and computer science (EECS) at MIT; and Antonio Torralba, an EECS professor.
The version of the system reported in the new paper doesn’t correlate recorded speech with written text; instead, it correlates speech with groups of thematically related images. But that correlation could serve as the basis for others.
If, for instance, an utterance is associated with a particular class of images, and the images have text terms associated with them, it should be possible to find a likely transcription of the utterance, all without human intervention. Similarly, a class of images with associated text terms in different languages could provide a way to do automatic translation.
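That chain of associations can be made concrete with two toy lookup tables. The clip names, image classes, and terms below are entirely hypothetical — a real system would learn soft, noisy associations rather than clean dictionaries — but the composition works the same way.

```python
# Hypothetical learned associations (invented for illustration):
# utterances cluster with image classes, and image classes carry text terms.
utterance_to_image_class = {
    "audio_clip_017": "lighthouse",
    "audio_clip_042": "beach",
}
image_class_to_terms = {
    "lighthouse": {"en": "lighthouse", "fr": "phare"},
    "beach": {"en": "beach", "fr": "plage"},
}

def transcribe(utterance, language="en"):
    """Chain the two associations to label speech without any transcripts."""
    image_class = utterance_to_image_class[utterance]
    return image_class_to_terms[image_class][language]

print(transcribe("audio_clip_042"))        # -> beach
print(transcribe("audio_clip_042", "fr"))  # -> plage: translation for free
```

No transcription was ever supplied for the utterance; the images act as the bridge between sound and text, and between languages.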
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory have developed a new computational model of a neural circuit in the brain, which could shed light on the biological role of inhibitory neurons — neurons that keep other neurons from firing.
The model describes a neural circuit consisting of an array of input neurons and an equivalent number of output neurons. The circuit performs what neuroscientists call a “winner-take-all” operation, in which signals from multiple input neurons induce a signal in just one output neuron.
Using the tools of theoretical computer science, the researchers prove that, within the context of their model, a certain configuration of inhibitory neurons provides the most efficient means of enacting a winner-take-all operation. Because the model makes empirical predictions about the behavior of inhibitory neurons in the brain, it offers a good example of the way in which computational analysis could aid neuroscience.
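In its simplest form, the operation looks like this. The sketch below compresses the circuit's dynamics into a single step — one simulated inhibitory neuron broadcasting the strongest input back to every output — whereas the researchers' model analyzes the full spiking dynamics and their efficiency.

```python
def winner_take_all(inputs, threshold=0.0):
    """One simulated inhibitory neuron broadcasts the maximum input back
    to every output neuron, suppressing all but the strongest signal."""
    inhibition = max(inputs)
    return [1 if x >= inhibition and x > threshold else 0 for x in inputs]

print(winner_take_all([0.2, 0.9, 0.4]))  # -> [0, 1, 0]
```

Ties would leave multiple winners firing in this one-shot version; part of what the researchers analyze is how many inhibitory neurons, and how much time, a circuit needs to break such ties reliably.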
The researchers will present their results this week at the conference on Innovations in Theoretical Computer Science. Nancy Lynch, the NEC Professor of Software Science and Engineering at MIT, is the senior author on the paper. She’s joined by Merav Parter, a postdoc in her group, and Cameron Musco, an MIT graduate student in electrical engineering and computer science.
For years, Lynch’s group has studied communication and resource allocation in ad hoc networks — networks whose members are continually leaving and rejoining. But recently, the team has begun using the tools of network analysis to investigate biological phenomena.
“There’s a close correspondence between the behavior of networks of computers or other devices like mobile phones and that of biological systems,” Lynch says. “We’re trying to find problems that can benefit from this distributed-computing perspective, focusing on algorithms for which we can prove mathematical properties.”
In recent years, artificial neural networks — computer models roughly based on the structure of the brain — have been responsible for some of the most rapid improvement in artificial-intelligence systems, from speech transcription to face recognition software.
An artificial neural network consists of “nodes” that, like individual neurons, have limited information-processing power but are densely interconnected. Data are fed into the first layer of nodes. If the data received by a given node meet some threshold criterion — for instance, if it exceeds a particular value — the node “fires,” or sends signals along all of its outgoing connections.
Each of those outgoing connections, however, has an associated “weight,” which can augment or diminish a signal. Each node in the next layer of the network receives weighted signals from multiple nodes in the first layer; it adds them together, and again, if their sum exceeds some threshold, it fires. Its outgoing signals pass to the next layer, and so on.
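The node-weight-threshold description above maps directly onto a few lines of code. The two-layer network below uses hand-picked weights (in practice, weights are learned from data) to compute exclusive-or, a classic function that no single threshold node can compute on its own.

```python
def step(x):
    """A node 'fires' (outputs 1) when its weighted input sum crosses zero."""
    return 1 if x > 0 else 0

def layer(inputs, weights, biases):
    """Each node sums its weighted inputs, then applies the threshold."""
    return [step(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def network(x1, x2):
    # Hidden layer: one node computes OR, the other AND (hand-picked weights).
    hidden = layer([x1, x2], weights=[[1, 1], [1, 1]], biases=[-0.5, -1.5])
    # Output layer: "OR but not AND" is exactly exclusive-or.
    (out,) = layer(hidden, weights=[[1, -1]], biases=[-0.5])
    return out

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, network(a, b))  # fires only when exactly one input is 1
```

Deep networks are this same pattern stacked many layers high, with smooth activation functions in place of the hard threshold so that the weights can be tuned by gradient descent.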