Distrito Dataminer
From zero to Brazil's most cited resource on startup data in less than two years
Distrito is an innovation hub in São Paulo, Brazil, that connects startups, investors and corporate players. One of its business arms, Dataminer, began as a spinoff from the investment arm, Distrito Ventures, which needed trustworthy research on the country's startup ecosystem before making investment decisions. When the company decided to turn Dataminer into a business arm of its own, I was the first employee hired, in November 2018.
At first, I did a bit of everything: manually collecting data, writing essays on the topics our reports covered, interviewing startup founders, developing a strategic vision for the business and more. That up-for-anything role never really went away, but I quickly realized that my biggest contribution came from automating every process I could. I structured and formalized our data pipelines, mostly in Python, so that we could move from building individual reports to maintaining a centralized database on the innovation ecosystem, one that soon eclipsed anything else available on the Brazilian or Latin American market. The team had grown by then, and I became Data Lead.
We very quickly made a name for ourselves as one of the deepest and most complete resources for information about the startup ecosystem in the country. In 2019 alone, our studies were cited in the national press over 160 times. Dataminer went from an area with no revenue to one that paid for itself several times over through a mix of selling content and contract work for clients who wanted to discover startups related to specific challenges they were facing.
On the data science and analysis side, one of my biggest achievements was successfully predicting 4 out of 5 new Brazilian unicorns in 2019. I also created a startup scoring system that takes into account several growth and success metrics and has predicted several funding rounds before their official announcement.
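As a rough illustration of the idea behind such a score, and not the actual Dataminer model, the sketch below rescales each metric to a 0-1 range and combines the results into a weighted sum on a 0-100 scale. The metric names, weights and sample startups are all hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the 0-1 range (all zeros if constant)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def score_startups(startups, weights):
    """Return a 0-100 score per startup as a weighted sum of scaled metrics.

    startups: {name: {metric: value}}; weights: {metric: weight}.
    Weights are normalized, so they do not need to sum to 1.
    """
    names = list(startups)
    total_weight = sum(weights.values())
    scores = {name: 0.0 for name in names}
    for metric, weight in weights.items():
        scaled = min_max_scale([startups[name][metric] for name in names])
        for name, value in zip(names, scaled):
            scores[name] += 100 * (weight / total_weight) * value
    return scores


# Hypothetical sample data, for illustration only.
startups = {
    "alpha": {"headcount_growth": 0.10, "funding_musd": 2.0, "press_mentions": 5},
    "beta": {"headcount_growth": 0.50, "funding_musd": 10.0, "press_mentions": 25},
    "gamma": {"headcount_growth": 0.30, "funding_musd": 6.0, "press_mentions": 15},
}
# Hypothetical weights; a real system would have to justify or learn these.
weights = {"headcount_growth": 0.5, "funding_musd": 0.3, "press_mentions": 0.2}

scores = score_startups(startups, weights)
ranking = sorted(scores, key=scores.get, reverse=True)
```

A weighted sum like this is easy to explain to clients, which is its main appeal. A production score would more likely need validation against real outcomes such as subsequent funding rounds.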
Building on the database and automations I developed during my time there, Distrito Dataminer is now constructing a platform where any client will be able to run their own searches and draw their own conclusions from the wealth of available information.
Main tools & technologies involved:
Python (pandas, scrapy, scikit-learn), AWS (S3, RDS, EC2), PostgreSQL, Tableau, Git
Some of the reports I personally led
Gallery of interesting images and excerpts
Excerpt from Unicorn Race 2020 showing founder data
Excerpt from HealthTech Report showing a startup map
Network analysis of Venture Capital funds showing how they cluster into distinct groups
Excerpt from EdTech Report showing the situation of women in EdTech
One of the dashboards which allows users to filter and search for startups
A profile for a startup which showcases its data and the score we generated for it