
Getting Closer to Data with Jason Corso

❝AI is everywhere.❞


❝… of AI projects will deliver wrong outcomes due to bias in data.❞ ~


❝More callbacks for interviews … versus resumes with uniquely African American names with the same qualifications.❞ ~ Brookings Institution

❝Better data leads to better models or better AI systems❞ ~ Jason Corso



In this Ethics4NextGen Summit talk, Jason Corso observes that AI is now everywhere, rampant across domains from finance to space travel to defense to banking to health care to education. He argues for taking time to pause and understand the limitations of AI, and what we need to think carefully about in order to achieve the technology's full potential without introducing further bias or other concerns. In his view, better data leads to better models and better AI systems: it is exceptionally important to build datasets that have been sifted through and tested, and whose intrinsic or implicit biases we understand. The panel also discusses how critical the analysis step is, i.e. the point where you really touch the data to understand what is happening. Jason then describes the work he and his team at Voxel51 did with the Baltimore City Police Department, exploring how video analytics can be integrated into getting the right data to the right people at the right time.



00:00 I'm going to introduce our next guest, Dr. Jason Corso. He is a professor of electrical engineering and computer science at the University of Michigan, and he is the co-founder and CEO of Voxel51, an AI software company creating developer tools for improving the performance of computer vision and machine learning systems. A veteran in the field of computer vision, Jason has dedicated over 20 years to academic research and has authored nearly 150 peer-reviewed papers and hundreds of thousands of lines of open-source code on computer vision, robotics, data science, and general computing. Prior to the University of Michigan, he was a member of the computer science and engineering faculty at SUNY Buffalo, and spent two years before that as a postdoctoral fellow at the University of California, Los Angeles.

01:08 He received his Ph.D. and MSc degrees from Johns Hopkins University, and his bachelor's degree with honors from Loyola University Maryland, all in computer science. Please welcome Dr. Jason Corso.

01:22 Thank you very much, it's a pleasure to be here, Shilpi. Before I start talking, could I ask: will I control the slides, or do you control them? We have enabled sharing, so if you want to do that, that's fine; I also have them, so do you want to do it? Oh, I'll share, let's see if I can do that effectively. I have your slides, so okay, either way. Are you able to see the slide? Yes. Wonderful, okay, great.

02:10 Let's put that down there. Yeah, it's a pleasure to be here; I'm truly honored to be talking at this data ethics summit. I think it's an exceptionally critical topic, and I think we should think of this as the beginning of an important conversation that will continue for some time. So, as you know, AI is everywhere, rampant across different domains from finance to space travel to defense to banking to health care to education.

02:44 It's almost daunting what we think AI can do for us, driven by data and learning and so on, and there is a critical need to take a pause and understand what its limitations are and what we need to think carefully about, in order to achieve such potential without extra bias or other concerns or limitations in the resulting systems.

03:13 So in our view, better data leads to better models, or better AI systems. It's exceptionally important to build datasets that you've sifted through, that you've tested, and whose intrinsic or implicit biases you understand, and that is exceptionally challenging. As a quantitative example (I'm a technical person myself), let's look at the graph on the right side of the slide here. On the horizontal axis we see the increasing size of a dataset, going from twenty thousand samples to fifty thousand samples; this is actually sampled from a common dataset called CIFAR-10.

03:54 On the vertical axis is the accuracy on a test set, here called the validation set. There are three curves plotted: the blue curve, which is the de facto baseline of not doing anything special with your data; the green curve, which is the best possible result if your data is perfect; and the red curve, which is what you get when you look at the data, get close to it, and fix the mistakes you find. Here the mistakes are artificial: I've added something like 10 erroneous labels to the classification dataset. But you can see that if you're able to find and fix those mistakes in your dataset, you can get far better performance from your ultimate model. That's the type of quantitative result we want to reach when we're building AI systems and machine learning models.
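
The label-noise experiment described above can be sketched in plain Python. This is a hypothetical illustration, not the speaker's actual code: `inject_label_noise` mimics the artificial corruption added to the CIFAR-10 labels, and `flag_suspect_labels` shows one common way to "find and fix" mistakes, by flagging samples where a trained model confidently disagrees with the stored label.

```python
import random

def inject_label_noise(labels, num_classes, fraction, seed=0):
    """Flip a given fraction of labels to a random wrong class,
    mimicking the artificial corruption described in the talk."""
    rng = random.Random(seed)
    noisy = list(labels)
    for i in rng.sample(range(len(labels)), int(len(labels) * fraction)):
        wrong = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(wrong)
    return noisy

def flag_suspect_labels(labels, predictions, confidences, threshold=0.9):
    """Flag samples where a trained model confidently disagrees with
    the stored label -- candidates for human review and fixing."""
    return [i for i, (y, p, c) in enumerate(zip(labels, predictions, confidences))
            if p != y and c >= threshold]
```

Flagged indices would then be reviewed by a human before the model is retrained, which is the "find and fix" step behind the red curve.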

04:40 And as you're aware, as we're building these systems there will be a loop. We'll start with some problem; we'll feed that into some data acquisition; we'll label the data; we'll understand privacy, data provenance, and so on; and we'll push that data forward to models. When I say models here, I mean training models, experimentation, tracking, and so on. Then ultimately we'll have a loop in which either we do some analysis on our output (how did the model we trained perform on some holdout set or some other dataset?), or we just do the loop without any analysis, which is somewhat common these days, unfortunately. The analysis part is super critical to this need: that's where you really touch the data and understand what's happening, which subsets of the data are limited, where you need to get more data, where your annotations are incorrect, and so on.
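
The loop Jason describes (problem, acquisition, labeling, models, analysis, and back around) can be sketched as a simple driver. All function names here are hypothetical placeholders for the stages named in the talk, not an actual framework:

```python
def development_loop(acquire, label, train, analyze, rounds=3):
    """Sketch of the AI-development loop from the talk: each round
    acquires and labels data, trains a model, and then ANALYZES the
    result so findings feed back into the next round's data work."""
    dataset, report, model = [], None, None
    for _ in range(rounds):
        dataset += label(acquire(report))  # analysis guides new data
        model = train(dataset)
        report = analyze(model, dataset)   # the step too often skipped
    return model, report
```

The point of the sketch is that `analyze` runs every round and its report feeds the next acquisition, rather than the loop running "without an analysis."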

05:30 One example that's relevant to this meeting is predictive policing. A few years ago we worked with the Baltimore City Police Department and the CityWatch program to understand how video analytics could be integrated into helping get the right data to the right people at the right time. This was all a pilot; nothing was deployed. It was really to understand the limitations and capabilities of contemporary video analytics, not so much for predictive policing but for responsive policing. So we were thinking about the technical problems and the real-world problems of processing video information rapidly to keep people safe. CityWatch in Baltimore has, I think, upwards of 800 cameras that need to be concurrently processed, but a limited compute budget to do that, so this was a good technical problem. There was a lot of streaming video, and they had a small team of humans to look at it, but of course they couldn't always have everyone looking at, you know, 50 videos at once. So the idea was: could we sort the video in a way that their officers could then go and process, to understand how to respond to events?

06:46 Clearly the question here is how we could avoid bias. If every officer watching the video had to monitor 50 feeds, and we were going to show them the top five of those, how could we avoid bias across that selection? We wanted to get the right data to the right people at the right time while avoiding bias. So in the video you're seeing here, if we're able to detect a fight that's going to break out, or may break out, or something of that kind, could we do it in a way that is not biased by, for example, the location of the camera, which I don't believe is a good indicator for this type of signal, or by the type of behaviors being seen, or the type of persons in the video?
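
The "top five of fifty feeds" idea can be sketched as a ranking step. This is an illustrative sketch, not the pilot's code: `score_fn` stands in for a hypothetical event detector, and the key design point from the talk is that the score is computed from frame content only, deliberately excluding camera location so the ranking is not biased toward particular neighborhoods.

```python
def rank_feeds(feeds, score_fn, k=5):
    """Return the k camera feeds with the highest event scores.
    Each feed is a dict; only its frame content is scored -- never
    its location -- to avoid location-based bias in what officers see."""
    return sorted(feeds, key=lambda feed: score_fn(feed["frames"]),
                  reverse=True)[:k]
```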


07:34 Here there was actually a fight that turned into an argument and ultimately led, as we learned later, to an arrest. This was all previously captured video, but it's a good example of why, when we think a fight is breaking out, we need to test our models on real data and understand how they might actually perform. So in my view, to train better ML and AI models it's super critical to visually inspect the data, so that we can understand the diversity and representativeness of the dataset. We also need a way to understand which subsets of data our model will perform well or poorly on, i.e. where we have high or low uncertainty. To do this we need adequate tooling to enable what we call rapid dataset experimentation.

08:31 So here are some tools (I think this is being recorded, great, so you'll have this capture): some open-source tools that flesh out this machine learning life cycle. The one I'll talk about a little more is in the bottom right: our tool from Voxel51, called FiftyOne, which really allows you to do performance visualization. It fleshes out this notion of analysis so you can achieve the types of things we're talking about here: rapid visual inspection to support rapid dataset experimentation. FiftyOne is a tool that helps data scientists choose their optimal samples, remove redundant images, find annotation mistakes, and bootstrap better training datasets. These are all things we do now, but they take laborious time because there's no common, easy, lightweight tool for them, and FiftyOne really steps up to do that. It has a visual interface, which you're seeing a screenshot of on the right side, as well as a tight bridge to an underlying Python library that you can use while doing model development work, Python coding, and so on.

09:37 The actual labels, or the ontology you're using to visualize and test your dataset, are fully flexible; it's up to how you represent your data. And it's really important to note that through these tools you can do things like identify failure modes in your dataset rather quickly through visual inspection. So you've loaded a dataset into Python, you've launched the app through a couple of lines of code, and then you're able to plot your predictions. Oh wait, there are too many predictions; what's the right confidence threshold I should be using? Or, in this case with ground truth in red and predictions in blue, how representative are those predictions of the human gold standard of the ground truth? It's very quick and natural to do that.
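
The "what confidence threshold should I use?" question can be answered by a quick sweep. A minimal sketch, assuming each prediction has already been matched to ground truth and reduced to a (confidence, is_true_positive) pair; this is not FiftyOne's API, just the underlying idea:

```python
def f1_at(preds, n_gt, threshold):
    """F1 score when keeping only predictions at or above threshold.
    preds: list of (confidence, is_true_positive); n_gt: number of
    ground-truth objects."""
    kept = [tp for conf, tp in preds if conf >= threshold]
    tp = sum(kept)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / n_gt if n_gt else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def best_threshold(preds, n_gt, thresholds):
    """Pick the threshold with the highest F1 on held-out data."""
    return max(thresholds, key=lambda t: f1_at(preds, n_gt, t))
```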

10:26 It lets you dig deep into data samples as well. If you click on one of those images, you'll bring up a zoomed-in version of it, and you'll see lots of relevant information, like the IoU scores and where the object detections are being labeled and detected. You can then turn different elements on or off for a really dynamic view; in my view it almost creates a kind of visual query language that lets you really manipulate and work with your data naturally, so that you can build better datasets. So there's a lot of flexibility: it supports classification problems, 2D object detection, instance segmentation, and semantic segmentation. It handles image data, and in a week or two it will be handling video data, so it's really interesting in terms of the future potential use of the tool.
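
The IoU (intersection-over-union) score mentioned here measures how well a predicted box overlaps a ground-truth box. A minimal reference implementation for axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).
    Returns 1.0 for identical boxes and 0.0 for disjoint ones."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union else 0.0
```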

11:24 One final note: it has a very intuitive and extensible API. As Shilpi may have mentioned, we will be available over the next few days at the hackathon associated with this meeting to provide support. This tool basically meets the need to visualize and render what's happening in your various systems, and the API itself is exceptionally well documented. It has a schema-free data model, so you don't have to laboriously go through SQL queries, meta-languages, and so on; it's very adaptable. It sits on top of MongoDB; it supports arbitrary user-defined fields that are also immediately rendered in the GUI; it has no dataset size limitations; and it doesn't copy your dataset into some central data store. It runs on your machine and connects right to your data, where your data is. And again, there's a really tight connection between the code and the UI.

12:26 So again, FiftyOne is a great dataset exploration tool that enables you (let's go right back here) to really tighten the speed of iteration on this AI system development. It is open source under an Apache license, so there is no cost associated with it, and it connects really well with things like TensorFlow and PyTorch. We use it in our own development, in connection with, say, MLflow as an experiment-tracking tool, and it supports on the order of a dozen different annotation formats. So it's really flexible, and if the mission is to get closer to your data so you understand the intrinsic biases or limitations of that data, well, this FiftyOne tool has been designed from the ground up to enable that, which we think of as really the responsibility of data scientists or AI scientists when we are building models that will be deployed in practice and impact society. Thanks for your attention; if there's a question, I'm happy to take it.

13:35 Thank you, Jason, this was really inspiring and interesting. I want to quickly add to what you already mentioned. It is in the agenda, it is on the community platform, it is on the main website, but I just want to add to what Jason just said: we do have a very exciting boot camp taught by Eric Hoffman tomorrow, from 3:30 to 4:30. He'll be doing a hands-on workshop on how FiftyOne works and how it can be used for visualization and analysis of input data, something you can use for the 48-hour hackathon coming up on Saturday and Sunday. So thank you, really excited about the boot camp and the workshop, Jason.

14:26 Great, we are too. Yeah, it's wonderful, thank you. And I don't see any... oh, there is a question: Jason, can you bring your own libraries, or just the preferred ones?

14:38 Could you elaborate on that question? So, about predefined libraries: you have scikit-learn, and I saw Spark on there, among other libraries. Can you bring your own in, or are you limited to just those? Yeah, I think there's no limitation at all on what libraries you could use. So say you have a dataset, and you're using scikit-learn, like an SVM or something like that, and you get your classifications on your images; it's natural, kind of one-two-three, to add a scalar field to the representation in the FiftyOne dataset and then render that in the GUI. It's built to be really lightweight and connect to your workflow. That said, the only constraint is that it's really for visual data right now, so it doesn't support, say, audio data or language data or tabular data, things like that; but for image and visual data it works well. I think it'll be really good to listen to the boot camp tomorrow with Eric; he'll walk you through lots of examples that do the types of things you're asking about.

15:55 Yeah, I think this is interesting, because predictive policing uses a lot of image recognition technology, which is one of the challenges for our hackathon, and that's all image recognition, right? So it's very useful for that. Yeah.

16:13 It looks like there's another question, about customers. Yeah, a great question. So FiftyOne is a developer tool, an open-core tool, so open source. We launched it in early August, so it's pretty young; it's been in development since just before COVID. It was actually built on top of our internal libraries from when we were doing the video analytics work I talked about with the Baltimore police, so it kind of was built from our own experience. Right now we have a Slack community with about 150 users in it, everyone from academics to major corporations; there are some members in the Slack community from LEGO, among other companies, for example. And being open-core tech, we're not selling licenses to it right now, so we don't track the users exceptionally well. In the future, things like connecting to your data in vast data lakes in the cloud, sharing across teams, and large-scale back-end compute are the types of features we're building that will be commercializable on top of this open-source library. But right now the open-source library runs with no constraints on the size of the data or how much compute you connect to it locally, if you're able to do that. So yeah, good question.

17:43 Great, thank you, Dr. Jason. Take care. Thank you, bye!



Dr. Jason Corso

CEO, Voxel51



Shilpi Agarwal

DataEthics4All Founder & CEO, Social Impact Leader, CEO of Social Strategi LLC, Member of the Data Ethics Advisory Council