August 08, 2023

S2E52 | The ethical use of AI in open-source investigations

35:06

Summary

Everyone is talking about using AI for OSINT, but what are the ethical considerations before you fire up your first chat? Before you start feeding a machine learning model your data and relying on its results, our guest from Fivecast tells us what you should consider.

Key takeaways

ChatGPT has built-in biases (just like humans)
Understand the human toll behind machine learning
How to use AI responsibly

About Trent Lewis

Trent is part of the Fivecast Data Science team and leads efforts in natural language processing of OSINT data and the evaluation of the Fivecast Risk Detectors.  Prior to joining Fivecast in 2021, Trent was an academic at Flinders University (2005-2021) teaching and researching in databases, programming and machine learning.  With a background in cognitive science, Trent has a PhD in Audio-Visual Speech Processing that explored the use of multi-modal data sources to improve the classification of speech.

References from the show

Responsible use of AI
Analyzing online communities
Leveraging AI for counter terrorism industry brief
Fivecast

Trent Lewis
Don't be afraid to learn something about them. The media will beat them up and make them sound like they're super intelligent beings that are going to take over the world. But I think we're far from that, given they don't have real, true meaning. I think we should be more worried about bad actors using these tools against us, and I guess that will be a challenge.

Jeff Phillips
Welcome to Needlestack, the podcast for professional online research. I'm Jeff Phillips, your host.

Shannon Ragan
And I'm Shannon Ragan, producer and co-host on Needlestack. Today we're going to continue our discussion on generative AI in OSINT and the ethical considerations of using that technology.

Jeff Phillips
Perfect timing for this conversation. And joining us for the discussion is Dr. Trent Lewis, a senior data scientist at Fivecast. A reminder, we've had Abby from Fivecast on the show in the past. Fivecast deploys advanced data collection and AI to solve complex intelligence challenges for intelligence agencies, law enforcement, and even private enterprises. So, Trent, welcome to the show.

Trent Lewis
Thank you. Thanks for having me. Yeah, it's a pleasure to be here. I love talking about artificial intelligence and machine learning, so please feel free to cut me off at any point if I'm going on too much. Being an ex-academic lecturer, I can talk for hours, and when those students don't raise their hands, I just keep going. Yeah, hopefully I can shed some light on some of the things going on in this fairly hot topic at the moment. It's very exciting just to hear all about it and to see it being used in a real way. It's great.

Jeff Phillips
Well, maybe we set the stage for some people with some basics. It seems like, like I said, everyone's talking about AI right now, and there are lots of terms that get thrown around. I believe we even used one with generative AI. And then you used machine learning. There's large language models, there's deep learning, and I think some of us use these terms interchangeably, which probably is not the right thing to do. Can you kind of walk us through the distinctions between these different terms?

Trent Lewis
Yeah, it's a really good point. Especially in the media, they get conflated, and there are subtle and big differences between the different things. I guess starting at the top, there's AI, artificial intelligence, and machine learning. So there's probably a bit of a controversy around which way you define it. But personally, I'd see artificial intelligence as sort of the broader umbrella term for describing technology that tries to emulate human-level intelligence. And how it does that is not particularly constrained. And you sort of have two branches of that. You'd have, I guess, maybe good old-fashioned AI, where it's symbolic, rule-based and reasoning-based artificial intelligence. But the other side is what I term machine learning, where the machine learns from data and it changes its behavior based on the data it gets. So that's an important distinction. You can see artificial intelligence, and then within that you've got machine learning. And then beyond that, within machine learning from data, there's lots of different techniques that sit under that. One of the main ones that's driving all the discussion at the moment would be neural networks, where there's some very vague relationship between the cells in our brains and the technology.

Trent Lewis
It's very loose, but that's the catch. So then you have deep learning and deep neural networks. And that sort of technology really came to the fore probably around the 2010s, around that sort of era. And so neural networks have been around since the 1970s, but there were some big shifts in technology and data, so we could do these really big neural networks. And then that leads into, I guess, Transformers, which are another type of neural network, and then keep going down through there. Then you finally get to these ones that are generative, so GPT, so ChatGPT. We've all heard of ChatGPT. I think it was said that it's one of the fastest growing technology takeups that's ever been, except maybe Threads in the last week or so. ChatGPT uses all this same technology. So it's an AI, it's machine learning, it is a neural network, it's a deep neural network. And what it does, basically, to try and I guess demystify it a little bit: it doesn't think, it doesn't know what it's talking about. It has no meaning. But basically you feed in some text, and from that text it predicts the next word.

Trent Lewis
And then you take that new sentence and you feed that back through and it predicts the next word. So it doesn't have a full understanding of what that word is, what that word means. And so that's all it's doing. Makes it sound sad, but that's it. That's all it's doing.
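The predict-then-feed-back loop described above can be sketched in a few lines. This is only a toy illustration, not how ChatGPT is built: it assumes a simple word-pair frequency table in place of a deep neural network, and the names `train_bigrams`, `predict_next`, `generate` and the sample text are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = follows.get(word)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

def generate(follows, prompt, max_new_words=5):
    """Predict the next word, append it, and feed the longer text back in."""
    words = prompt.lower().split()
    for _ in range(max_new_words):
        nxt = predict_next(follows, words[-1])
        if nxt is None:
            break  # nothing in the training data follows this word
        words.append(nxt)
    return " ".join(words)

# Hypothetical training text; a real model is trained on vastly more data.
corpus = "the cat sat on the mat . the cat sat on the sofa ."
model = train_bigrams(corpus)
print(generate(model, "the", max_new_words=4))
```

The choice of each word is purely statistical: the sketch picks whatever most often followed the previous word in its training text, with no notion of what any word means, and it has nothing to say about words it never saw.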

Shannon Ragan
Poor bot.

Trent Lewis
Yeah, that's right.

Shannon Ragan
I am a good AI. So maybe that's a good place to pivot to. It seems like these are incredible tools and they truly are. Like, to use them is just fascinating. But especially within the sphere of OSINT or data science, are you seeing ways that it's being misapplied, misinterpreted or misused? What are the limitations of generative AI for OSINT?

Trent Lewis
Yeah, so there's definitely good ways to use it. And I think that's what most people are doing. I think there's several challenges to overcome. And I think one is that demystification of what it is, which we just talked about, but then understanding what these tools have been trained with and then what the implications of that are. So these are trained with massive amounts of data from the internet, from publicly available sources of information, and so they can only predict what they've been trained with. And a lot of the models have only been trained with data up until a certain point in time. So if you're asking it a question and you want some factual answers, then you need to know that it's only relevant up until the point at which it was trained. That's one thing. So that's a basic sort of model. I think they're incorporating search engines into them as well, so they can be up to date. But generally, that's the case if you're just using a GPT, which stands for Generative Pre-trained Transformer, and ChatGPT is being able to chat with it. I think, talking about using it well, getting it to help with a first pass at a draft when you're trying to write something up is probably not a bad idea.

Trent Lewis
But the challenge you have there is what they call hallucinations, which is essentially it getting it wrong, because all it's doing is predicting the next best token statistically. So not through meaning, not through fact, but what statistically is the next best word, and so it can get things wrong. So that's where you need to be careful about the sorts of factual information you're trying to get it to return. So if you were using it to quickly find out information about a person, it's a good first start, but you'd probably want to go and verify that information.

Jeff Phillips
You said it's loaded with information, right? And I get that, for example, ChatGPT only goes back to a certain time, or it's not as recent as yesterday. But there's all kinds of information on the internet, is what I wanted to say. It could be misinformation. And is it taking information off of Reddit, and if so, which subreddit, et cetera? So it's got the facts and it's got other information.

Trent Lewis
Yeah, so it's got the facts, it's got other information, it's got biases within it. If you have a look at ChatGPT's terms of service, they actually explicitly state in there that a human should be involved in the output, someone should review this, because it's not always right. And in fact, it's not only just not right, it can be very biased. I think it said something about that it will give more positivity towards some races and negativity towards others, and that's just because of its built-in biases. The thing that made ChatGPT, or one of the reasons it was a huge improvement over other models, is that it had human reinforcement learning built into it. And so that allowed people to sit there, review the responses and say whether they were human-like, and that was fed back into the model so it would produce more human-like responses. The key reason why I'm saying human-like is that it wasn't trained to be truthful, it was just to be.

Shannon Ragan
Human like or politically correct or any other thing that you throw out at the window on the internet.

Trent Lewis
Yeah, that's it. So you try to use these things and you try to get them to say bad things, and if you're not very clever about it, they won't say bad things. But there are ways, prompt hacking and different things like that, out there. I think a good way that these can be used is on your own data sources. You can use them as a summarization technique, and you can apply these models onto your own data, and that will restrict the sources of information that it's drawing from, if the data source that you're using is one you consider truthful. So if you have a large investigation with lots of information and you want to summarize it or find some information, some people find it easier to have a chat with the machine and get that information out. But having links back to the source information is very important, so you can verify it. Always having multiple avenues of verifying information is important, especially with anything dealing with the internet.

Shannon Ragan
Yeah, it seems just an extension of all OSINT, like don't trust one source, don't trust one data point. Always continue to corroborate and verify with anything.

Jeff Phillips
You were talking about asking questions. One of the things you were talking about earlier that you work on as a data scientist is natural language processing, which, for someone like me who's not as technical, let's say, is what I find intriguing about ChatGPT. But can you tell us a little bit about what natural language processing is and how that can apply in an OSINT world?

Trent Lewis
Yeah, for sure. So I guess there's lots of aspects to natural language processing, but at its core it's dealing with human generated text and doing something interesting with it, whether that's translating it to another language or building understanding within it or just pulling it apart. I could go on for a long time. So it's a really interesting area, and it's amazing what has happened in the last two, three, ten years. So I was working in this area in the early 2000s, and for natural language processing, you'd need to have a linguistics background and understand the syntax of language and pulling out parts of speech, nouns and verbs, and how all that hangs together. And then we started to see these new architectures of transformers, increases in compute power, and then just data, huge amounts of data. And that's where these transformers, which came out of Google, have really taken off. And you don't really need to understand the actual language. From these vast amounts of data that we can feed into them, they can start to build, I guess, a semantic understanding of the text. So we can feed in the text and it goes through a sequence of operations, and at the end you get some numbers.

Trent Lewis
And so the key thing here is that computers are really good at comparing numbers and understanding the difference between numbers. And those numbers represent a semantic understanding of that text. And that allows us to do all sorts of amazing things. So we can take two different bits of text and see if they have a similar meaning. So rather than having to have the exact text match, you could ask, are these two things sort of the same? Then we can do sentiment analysis and emotion analysis as well, so we can tune them into different directions. So not just a semantic understanding, but an emotion. You can train them to do named entity recognition, so it can pull out people's names, places and locations, so it can pull useful information out of a document. So if you've got court documents and you can't be bothered sitting there scanning through them, or you've got newspaper articles, so you've got large things that you can look through and you want to find who all the key people are, that's a named entity recognition task, which these models can do as well.
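The comparison step described above, turning two texts into numbers and measuring how close they are, is commonly done with cosine similarity. The sketch below is a simplification under a stated assumption: the hypothetical `embed` function uses plain word counts as a stand-in for the learned embedding a transformer would produce, so it only rewards shared words rather than shared meaning. The sample sentences are made up.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a learned embedding: a sparse word-count vector."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)

first = embed("the suspect bought fertilizer")
second = embed("the suspect bought chemicals")
unrelated = embed("banana bread recipe")

print(cosine_similarity(first, second))     # three of four words shared
print(cosine_similarity(first, unrelated))  # no words shared
```

With a real embedding model the vectors would be dense lists of floats and two sentences could score as similar without sharing a single word, but the comparison itself would look much the same.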

Shannon Ragan
You were saying that obviously natural language processing lends itself to text, but there's also a way to do this for image based media, is that correct?

Trent Lewis
Yeah, that's right. So it's amazing, once again, similar sort of technology that's been applied to images, and it's all about the amount of data that's there. So there's just so many images on the internet now that there have been these steps through the technology. Back in, again, the early 2000s, doing computer vision, you were handcrafting features. So okay, is this line this way? Does the edge hit here? And you were talking about how that works. But now it's, here's an image, let's just pass it through these really complex neural networks, and well, we get something out, and you can do the same sorts of things with them. So you can ask, are these two images similar? Is this the image I'm looking for? So you can do object recognition as well, and logo recognition, which can be quite useful if you're doing an investigation, especially with lots of things you can pull down from different social media. Then rather than having to look through it all, you can have things that detect in the images what's happening. And interestingly, nowadays the underlying technology is very similar in a lot of respects. So what that leads to is you can actually do sort of a multimodal approach as well, where you can have text and images all being worked together.

Trent Lewis
So you could say, find me all the images of pipe bombs, and you could do that in text, or find me all things about pipe bombs, and it'll find your images, it'll find your people talking about them, because there's sort of this intersection of technology coming together. Very interesting.

Jeff Phillips
Yeah. And we talk a lot on this podcast. For OSINT analysts, it's the amount of data, the publicly available information that's out there. And so that if AI can help us sift through that and identify and.

Shannon Ragan
More data scientists and more data scientists.

Jeff Phillips
Be it the analysts themselves or the tools the analysts use, right?

Trent Lewis
Yeah. So we were just working on an investigation that has a couple of million, I guess, actors in it from some publicly available sources. And just to sit and sift through that would be fairly painstaking. I mean, I never do anything with my own eyes. I always think, how can I get a computer to do this?

Shannon Ragan
For me, the heavy lifting.

Trent Lewis
Yeah. At Fivecast, we have an eclectic team. We have software engineers building the product. We have a very small number of data scientists trying to help with the data. But we also have tradecraft people on staff who are from industry, from working as analysts. And with the same data set that I was working on, an analyst has sat down and gone through and pulled out the interesting parts of it and used some of our tools to help filter it, using those text analysis tools to filter the data based on simple keywords, but also on finding meaning within there, as well as image recognition. So being able to pull out the bits so you don't have to look at all the millions of them, you can have some sort of filtered list of them. And so she could do something of interest with all that data by filtering it down, but we still didn't use all the super interesting stuff. So we're having another look at it and hopefully we can pull out some more information by being able to view that large amount of data. So as an individual, it's very difficult to look through, but we can harness data science and machine learning to do that pre-filtering and grouping and clustering, and put it all into a nice visualization to work with.

Trent Lewis
Yeah.

Jeff Phillips
Well, since you mentioned your team, I know the data science team at Fivecast created a really interesting white paper on the ethical use of AI in OSINT. Can you touch on that a little bit as far as how can researchers use AI ethically and what should you keep in mind whenever you're using it as an analyst?

Trent Lewis
Yeah, that's a really good question. And I guess it's become a fairly hot topic as well, along with things like ChatGPT. So I guess talking about ChatGPT and these large language models that use reinforcement learning with human feedback, there's some ethical considerations around just even the creation of these models and what it really takes to make them. The models themselves, you can't really build these at this scale as an individual or as a small company. You really need someone with millions and millions of dollars to do it, and the resources and the compute power, which translates into using lots of energy. So I guess even at the first, before you even start using it, one needs to ask oneself, should I be using this and encouraging people to make these? And one of those key parts, that reinforcement learning with human feedback, means that humans had to sit there and look at all the things that were generated so that we didn't have to. So we talked about the sources it pulls from. It pulls from Reddit, pulls from all sorts of places. They're filled with all sorts of things. So there was a real concern around what people had to go through just to create these models.

Trent Lewis
It's fairly well documented out there. Moving past that, so we're now using these models. Once again, I think it comes back to those biases and trusting the data or trusting the responses that come out of these. I think there was some publicity around someone suing one of these companies because if you typed in their name or asked some questions about them, it said they were part of some embezzlement or something, and that just never happened. So there's that sort of issue. So if you're doing a collection on someone, a great way maybe to get started would be to get one of these GPTs to generate a summary for you. But they're prone to hallucinations, the nice way to say they just get it wrong, so you need to, I guess, double check those sources. And I think that's key. So we talk in AI about explainability and transparency in models. So a lot of these models are very complex and it's very difficult to understand why they made a decision or why they gave you the information. So neural networks are a good example of that in that essentially they're a sequence of matrix multiplications.

Trent Lewis
So it's just numbers turning into more numbers into more numbers. And the more times you do that, the harder it is to understand why it made a decision. Why did it decide to put this person and classify them into this group? It's hard to know. So there's a lot of effort around trying to open up the black box and see inside. So if you're using any particular machine learning model, whether it's a GPT or a neural network, some sort of classifier, what a lot of people are pushing are things called model cards, which explain to you what it was trained on. So you can sort of try to understand the biases in there. Was this model trained on data from 20 to 24 year old university students? Because that's what most models are trained on. And who are those people? Maybe they're predominantly male, so it might perform worse on underrepresented populations. So it's being able to understand what it was trained on, and then what the limitations are and the use cases it should be used in. So sometimes it's easy to use these out of context, which can make them underperform as well. And so it's about understanding those biases and that training.
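To make the "sequence of matrix multiplications" concrete, here is a minimal sketch of a forward pass through a tiny two-layer network in plain Python. The weights and input are made-up numbers chosen for illustration, and the ReLU step between layers is an assumption about a typical design rather than something described in the episode.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(matrix):
    """Replace negative values with zero, a common step between layers."""
    return [[max(0.0, value) for value in row] for row in matrix]

def forward(inputs, layers):
    """Pass the input through each weight matrix in turn."""
    activations = inputs
    for index, weights in enumerate(layers):
        activations = matmul(activations, weights)
        if index < len(layers) - 1:  # no ReLU after the final layer
            activations = relu(activations)
    return activations

# Hypothetical numbers: one input row with two features, then a 2x2 and a 2x1 layer.
x = [[1.0, 2.0]]
w1 = [[1.0, -1.0], [0.5, 1.0]]
w2 = [[2.0], [-1.0]]

print(forward(x, [w1, w2]))  # [[3.0]]
```

Even at this size, the final score of 3.0 says nothing by itself about which input feature mattered. Production models stack many more layers with millions or billions of weights, which is where the black box problem comes from.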

Trent Lewis
And hopefully that is what we, I guess, strive to do at Fivecast: build tools that allow you to understand why decisions were made in that classification. So we try not to put in too many whiz-bang things that really take you away from the data, but let you come back to it. If we're using these text similarity models, which convert text into a bunch of numbers and do similarity on it, we'll say, well, we classified this as being of interest because it was close to these other bits of text. So you can sort of read and go back and forth between them. That's a key thing to look for in the tools that you're using. Can it explain to me what's happening in there, and can I get the reasoning and the paper trail of why things are happening?
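One simple way to get the kind of paper trail described above is to classify by nearest examples and return those examples alongside the decision. The sketch below is not Fivecast's implementation. It assumes word-count vectors in place of a learned text embedding, and the function names, labels and example texts are all hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a learned embedding: a sparse word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify_with_evidence(text, labelled_examples, top_k=2):
    """Label `text` by its closest examples and return them as the explanation."""
    query = embed(text)
    scored = sorted(
        ((cosine(query, embed(example)), example, label) for example, label in labelled_examples),
        key=lambda item: item[0],
        reverse=True,
    )
    nearest = scored[:top_k]
    votes = Counter(label for _, _, label in nearest)
    decision = votes.most_common(1)[0][0]
    return decision, nearest

# Hypothetical labelled examples an analyst might have reviewed already.
examples = [
    ("selling stolen credit card numbers", "of interest"),
    ("buy stolen card details cheap", "of interest"),
    ("my cat sat on the mat", "not of interest"),
    ("great recipe for banana bread", "not of interest"),
]

decision, evidence = classify_with_evidence("stolen credit card numbers for sale", examples)
print(decision)
for score, example, label in evidence:
    print(f"{score:.2f}  {label}: {example}")
```

Because the decision comes back with the specific texts it was closest to, an analyst can read those texts and judge whether the match is reasonable instead of trusting a bare label.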

Shannon Ragan
I think that's all great advice on ethics: a general distrust of what it spits out, or the need to verify and corroborate.

Trent Lewis
That's right.

Shannon Ragan
I'm wondering, is there anything else for a researcher that you can recommend for how to use AI as best practice or resources that they can use for getting trained up on how best to leverage this?

Trent Lewis
There's a great blog post and article written on Fivecast that you can read, and there's lots of information out there that you can find. Specifically, if you're really keen and interested, a lot of the big universities have free online courses to brush up on these skills, and a lot of them are through something like Coursera or an online platform where you can dip your toes in and find out about these, taking the covers off and seeing how they work. So do some sort of introductory course. I think everybody should have some sort of understanding of how these technologies work. If we can start teaching kids in school how to build these things, then they'll have a better understanding of the tools when they come out and understand those biases that exist. I think it goes alongside media literacy, understanding, being able to do your source checks and things like that. Same thing goes with machine learning when you're using data to train things. So it's not checking your sources, but checking the data sources that have been used to train these models. So you can enroll in a course, or there'd be lots of good videos and technologies out there.

Shannon Ragan
Yeah. We can also just ask ChatGPT: how can I learn about you? That's right, it's a trick question.

Trent Lewis
You could, yeah. Maybe pit them against each other. You could ask Bing Chat to tell you about ChatGPT. You set me up for that. I saw that.

Jeff Phillips
Trent, before we wrap up, is there anything else you want to add to our conversation today?

Trent Lewis
At Fivecast, our thing is providing a tool across that open source, social, publicly available information. And as we sort of touched on, that just generates huge volumes of data, which maybe from a traditional analyst position, if you're using different sorts of OSINT or different sorts of intelligence, is just a different magnitude of scale. And using these tools, these machine learning tools, and just data science in general, so you don't have to go all the way through to machine learning, but using summarization, descriptive statistics, getting a feel for the data, is really important to try and understand that. And I guess don't be afraid to learn something about them. The media will beat them up and make them sound like they're super intelligent beings that are going to take over the world. But I think we're far from that, given they don't have real, true meaning. I think we should be more worried about bad actors using these tools against us, and I guess that will be a challenge for people trying to do anything on the internet: people using these tools, I guess in nefarious ways, to easily generate variations in text and flood people's inboxes and walls with misinformation.

Trent Lewis
And I think it's even trickier when we talked about imagery. So we have these image generation tools as well, and we can do that with video and we can do that with voice as well. So you can generate a very excellent misinformation campaign in text, images, video and audio that can be very convincing. So it's a tricky time, and I guess that's what an analyst needs to be aware of when they're trying to piece together something. There are these tools out there now. I think I read something about CAAS, which is crime as a service, because with these tools, you can set them up and say, cool, could you please generate me a misinformation campaign? We've all seen that before, but it's much simpler now. You don't need a full bot farm or a bunch of humans to generate all this. So it's an interesting time, right? I think we can use them for both good and evil. We'll be doing the good, of course, being able to detect and sift through these large data sets that we get through data science. So have a friendly data scientist at your side.

Shannon Ragan
There you go.

Jeff Phillips
A friendly one.

Trent Lewis
Yes.

Jeff Phillips
Maybe you could generate this podcast, right?

Trent Lewis
That's right. Well, we've got I think we've got enough audio, good enough video.

Jeff Phillips
Off we go. Well, I want to say thank you to our guest, Dr. Trent Lewis, for joining us today. I think we'll have many more conversations in the future about AI's role in OSINT. If you liked what you heard, you can view transcripts and other episode info on our website authentic8.com/needlestack. That's authentic with the number eight, slash needlestack. And be sure to let us know your thoughts on Twitter @needlestackpod and to like and subscribe wherever you're listening today. We'll be back next week with more on how analysts can use these emerging technologies in their day to day jobs. We'll see you then.
