Artificial intelligence (AI) and machine learning (ML) offer all the same opportunities for vulnerabilities and misconfigurations as earlier technological advances, but they also have unique risks. As enterprises embark on major AI-powered digital transformations, those risks may become greater. “It’s not a good area to rush in,” says Edward Raff, chief scientist at Booz Allen Hamilton.
AI and ML require more data, and more complex data, than other technologies. The algorithms that underpin them, developed by mathematicians and data scientists, come out of research projects, where security has rarely been a primary design concern. “We’re only recently as a scientific community coming to understand that there are security issues with AI,” says Raff.
The volume and processing requirements mean that cloud platforms often handle the workloads, adding another layer of complexity and vulnerability. It’s no surprise that cybersecurity is the most worrisome risk for AI adopters. According to a Deloitte survey released in July 2020, 62% of adopters saw cybersecurity risks as a major or extreme concern, but only 39% said they were prepared to address those risks.
Compounding the problem is that cybersecurity is one of the top functions for which AI is being used. The more experienced organizations are with AI, the more concerned they are about cybersecurity risks, says Jeff Loucks, executive director of Deloitte’s Center for Technology, Media and Telecommunications.
In addition, enterprises, even the more experienced ones, are not following basic security practices, such as keeping a full inventory of all AI and ML projects or conducting audits and testing. “Companies aren’t doing a great job right now of implementing these,” says Loucks.
AI and ML data needs create risk
AI and ML systems require three sets of data (see the sketch after this list):
- Training data to build a predictive model
- Testing data to assess how well the model works
- Live transactional or operational data when the model is put to work
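A minimal sketch of those three stages, assuming scikit-learn and a tabular dataset; the file paths and column names here are hypothetical placeholders:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("historical_records.csv")  # hypothetical historical data
X, y = df.drop(columns=["label"]), df["label"]

# 1. Training data builds the predictive model;
# 2. held-out testing data assesses how well the model works.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# 3. Live operational data is scored once the model is put to work
#    (assumed to carry the same feature columns as the training set).
live_batch = pd.read_csv("todays_transactions.csv")
predictions = model.predict(live_batch)
```

Note that all three pools persist somewhere: the training and testing extracts written to disk here are exactly the data the next paragraph warns about.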
While live transactional or operational data is clearly a valuable corporate asset, it can be easy to overlook the pools of training and testing data that also contain sensitive information.
Many of the principles used to protect data in other systems can be applied to AI and ML projects, including anonymization, tokenization, and encryption. The first step is to ask if the data is needed. It’s tempting, when preparing for AI and ML projects, to collect all the data possible and then see what can be done with it.
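As a rough illustration of tokenization and anonymization applied to a training extract, here is a minimal sketch assuming pandas; the field names, file paths, and the environment-variable key source are hypothetical, and encryption at rest is left to the storage layer:

```python
import hashlib
import hmac
import os

import pandas as pd

# Keyed tokenization: the key lives outside the dataset (hypothetical env var),
# so tokens are stable for joins but not reversible without the key.
SECRET_KEY = os.environ["TOKENIZATION_KEY"].encode()

def tokenize(value: str) -> str:
    """Replace a raw identifier with a keyed, irreversible token."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

df = pd.read_csv("training_extract.csv")
df["customer_id"] = df["customer_id"].map(tokenize)  # tokenization
df = df.drop(columns=["name", "email"])              # anonymization: drop what isn't needed
df.to_csv("training_extract_tokenized.csv", index=False)
```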
Focusing on business outcomes can help enterprises limit the data they collect to just what’s needed. “Data science teams can be very data-hungry,” says John Abbatico, CTO at Othot, a company that analyzes student data for educational institutions. “We make it clear in dealing with student data that highly sensitive PII [personally identifiable information] is not required and should never be included in the data that is provided to our team.”
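In the spirit of that policy, one way to enforce it is an allowlist filter at the point of extraction, so fields the model does not need never reach the data science team. A brief sketch, with hypothetical column names chosen for the student-data example:

```python
import pandas as pd

# Driven by the business outcome, not by what happens to be available.
REQUIRED_COLUMNS = ["gpa", "credits_attempted", "enrollment_term"]

def extract_for_modeling(path: str) -> pd.DataFrame:
    """Return only the allowlisted columns from a source extract.

    Anything not explicitly allowed (names, SSNs, addresses) is
    excluded by construction rather than removed after the fact.
    """
    return pd.read_csv(path)[REQUIRED_COLUMNS]

modeling_data = extract_for_modeling("student_records.csv")
```

An allowlist is safer than a blocklist here: a new sensitive field added upstream is excluded by default instead of leaking through until someone notices.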