Over the past year, we’ve been working to get our systems and processes in order so that we can better collect, store, manage, integrate and analyze data. Some of that work required writing code. In the spirit of the program and our mandate for more open and shared information, we’re releasing that code on our new Data Catalyst GitHub account.
For now, it’s mostly Python web scrapers, but we’ll be sharing more as we develop new tools and programs. Here’s a quick overview of what we are sharing right now:
- A Profit 500 downloader that uses the Beautiful Soup library. The downloader grabs the list and also downloads additional information from each detail page.
- An AngelList downloader that pulls information via the AngelList API, including market segmentation and the people involved with each company. The downloader also cleans the location data to make it more usable.
- A CrunchBase downloader that accesses the database and searches (across 120,000+ entries!) for any companies that have Canadian offices.
- A Startup Genome API script that gets basic information from the Startup Genome site.
- An Industrial Research Assistance Program (IRAP) web scraper that uses the Beautiful Soup and pandas libraries to put all the results from the National Research Council of Canada website into .csv files.
- A name-matching script that breaks down names in two lists and looks for overlap and similarity.
- A script that normalizes Statistics Canada Canadian Business Patterns data by doing an un-pivot and some other formatting.
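To give a feel for what the name-matching script in the list above does, here is a minimal sketch of the general idea: break each name into tokens, then score pairs by token overlap and string similarity. This is an illustration, not the code in the repository. The function names, the list of ignored suffixes and the 0.8 threshold are all assumptions made for the example, and it uses only Python’s standard library.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of words to ignore when comparing company names.
IGNORED = {"inc", "ltd", "corp", "co", "the"}


def normalize(name):
    """Lowercase a name and break it into alphanumeric tokens."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return [t for t in tokens if t not in IGNORED]


def similarity(a, b):
    """Score two names from 0.0 to 1.0 using overlap and string similarity."""
    ta, tb = normalize(a), normalize(b)
    if not ta or not tb:
        return 0.0
    # Overlap: shared tokens divided by all distinct tokens (Jaccard index).
    overlap = len(set(ta) & set(tb)) / len(set(ta) | set(tb))
    # Similarity: character-level ratio on the cleaned-up names.
    ratio = SequenceMatcher(None, " ".join(ta), " ".join(tb)).ratio()
    return max(overlap, ratio)


def match_names(list_a, list_b, threshold=0.8):
    """For each name in list_a, return its best match in list_b, if any."""
    matches = []
    for a in list_a:
        best = max(list_b, key=lambda b: similarity(a, b), default=None)
        if best is None:
            continue
        score = similarity(a, best)
        if score >= threshold:  # threshold is an assumed cut-off
            matches.append((a, best, score))
    return matches


if __name__ == "__main__":
    ours = ["Acme Widgets Inc.", "Northern Lights Software"]
    theirs = ["ACME Widgets Ltd", "Maple Robotics"]
    for a, b, score in match_names(ours, theirs):
        print(f"{a!r} matches {b!r} (score {score:.2f})")
```

Taking the higher of the two scores lets the sketch catch both reordered names (through token overlap) and small spelling differences (through the character ratio). A real matching job would likely need a tuned threshold and a manual review of borderline pairs.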
If you have some code to contribute to any of the above, please do. If you have a GitHub account with code that could be beneficial to people working with and around data and the innovation economy, let us know in the comments.