Tips for educators and self-trained data scientists

What I have learned from teaching Data Science at the University of Verona

How to learn in the age of commoditized knowledge and commercialized research

Alex Honchar
Towards Data Science
9 min read · Mar 1, 2021


Half a year ago I agreed to a small adventure. My university professor and thesis supervisor Luca di Persio invited me to join the teaching team at the brand-new Data Science faculty of the University of Verona, my second alma mater. I had been giving lectures and tech talks for years, but never as someone responsible for the long-term results of students, so I decided to accept the challenge.

By writing this article, I am targeting:

  • Educators who are looking for more result-oriented alternatives to academic teaching approaches
  • Self-paced students who aim to achieve results and be judged by the real world, not by their peers

My approach to data science as an academic discipline, industry profession, and business area is rather different from the one prevalent in most educational materials. I am a practice-driven entrepreneur who earned a degree in academic mathematics in order to be the best in my business, which strongly shapes my perspective and teaching approach. I think this perspective can be useful for educators and students who see that the current educational system in the data science field doesn’t meet the goals it is supposed to.

I invite you to a discussion on Clubhouse this Thursday to reflect on this topic:

The problems I knew before starting

An illustration from https://www.azquotes.com/quote/1447564 with a rather radical, yet very truthful quote by Peter Thiel on the topic of this article

I have earned degrees from two universities, in Ukraine and Italy, and walked myself through multiple European and American MOOCs, so I understand well enough that university is not a panacea and doesn’t prepare you for real life. However, I like the idea that a modern university:

  • attracts and builds a community of talented people (students, researchers, teachers) who are focused on a single field
  • gives you unique, profound knowledge that is too difficult to obtain from other educational materials or shorter courses

The reality is different, especially in data science, a new discipline at the intersection of science, engineering, and business:

  • data science programs attract a) academics, b) industry professionals, and c) engineers, who are all drawn by a similar trend but have almost orthogonal knowledge, skills, and mental models
  • most educational programs follow either a) “classic” statistical learning books or b) “cool and trendy” blogs, and both are commoditized, widely accessible, and already explained by MOOCs and online educators

In short, the system doesn’t deliver what it promises, and in this game everyone loses: governments (that lose the science race), businesses (that won’t be able to compete globally), universities (that lose human capital), and, of course, students, who lose, most importantly, years of their lives.

What I have decided to change

My course at the University of Verona was called “Programming” and was meant to prepare students for upcoming courses such as statistical learning, databases, etc. I set the goal of the one-semester course as follows:

Students understand the fundamentals of computer science and scientific computing and are ready to independently create simple yet useful data-driven solutions for end clients

To achieve this, I split the course into four logical blocks and evaluated the final projects against each of them:

Programming basics alignment

Even if skipping steps is tempting, it hurts a career rather than boosting it. An illustration from https://explainprogrammerhumor.com/post/184600929440/skipping-steps

The course included students from economics, physics, applied maths, CS, and other departments. I decided to make sure that within a couple of weeks every student could open and study the main properties of any kind of data: tables, images, texts, sounds, etc. (real datasets, of course). To do so, in every lab session I ran small live-coding exercises where I could check whether everyone could open and work with any kind of dataset within 15–20 minutes. I used basic datasets from Kaggle together with Numpy, Pandas, OpenCV, and Matplotlib.
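To give a flavor of these warm-ups, here is a minimal sketch (the file names are hypothetical placeholders, not the actual course datasets) of opening a table, an image, and a text file and inspecting their basic properties:

```python
import pandas as pd
import cv2
import matplotlib.pyplot as plt

# Tabular data: shape, column types, missing values
df = pd.read_csv("titanic.csv")          # any Kaggle CSV works here
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Image data: just a Numpy array of pixel intensities
img = cv2.imread("pizza.jpg")            # BGR array of shape (H, W, 3)
print(img.shape, img.dtype, img.min(), img.max())
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
plt.show()

# Text data: length and rough vocabulary size
with open("reviews.txt") as f:
    text = f.read()
print(len(text), len(set(text.split())))
```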

Computer science and scientific computing

Then, I wanted to give an intuitive understanding of how the data analysis from the previous block works behind the scenes:

  • First, we worked on data structures in the broad sense: from memory and how different objects (variables, Pandas data frames, tensors) are stored there, to lists, hash tables, and trees, alongside practical use cases and from-scratch implementations
  • Then, we went through classic algorithms (sorting and search) and machine learning algorithms implemented both with loops and with Numpy-powered vectorization, and compared their performance in different cases (see the sketch after this list)
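As an illustration of that comparison, here is a minimal sketch (my own example, not the exact exercise from the course) of the same mean-squared-error computation written with a plain Python loop and with Numpy vectorization, timed side by side:

```python
import time
import numpy as np

y_true = np.random.rand(1_000_000)
y_pred = np.random.rand(1_000_000)

def mse_loop(a, b):
    # Element-by-element accumulation in pure Python
    total = 0.0
    for i in range(len(a)):
        total += (a[i] - b[i]) ** 2
    return total / len(a)

def mse_vectorized(a, b):
    # The same computation expressed as Numpy array operations
    return np.mean((a - b) ** 2)

for fn in (mse_loop, mse_vectorized):
    start = time.perf_counter()
    result = fn(y_true, y_pred)
    print(f"{fn.__name__}: {result:.6f} in {time.perf_counter() - start:.4f}s")
```

On a typical laptop the vectorized version is orders of magnitude faster, which is exactly the intuition this block was meant to build.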

I recommend the following materials:

Human-first software development

Typical software development in academia, an illustration from https://blogs.egu.eu/divisions/gd/2018/09/19/reproducible-computational-science/

Code written in academia is a well-known nightmare, so I dedicated a whole block to introducing OOP and practicing it in several scenarios:

  • Refactoring existing code: taking from-scratch implementations of linear and logistic regression and building a class hierarchy around them (see the sketch after this list)
  • Planning a project structure from scratch: using IDEs instead of notebooks to build the projects, defining the library requirements, and creating UML diagrams
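A hedged sketch of what that refactoring exercise looks like (class and method names are my own illustration, not the exact code used in the course): a shared gradient-descent base class, with linear and logistic regression as subclasses that only change the link function.

```python
import numpy as np

class BaseRegression:
    """Shared gradient-descent training loop; subclasses define the link."""
    def __init__(self, lr=0.01, n_iter=1000):
        self.lr, self.n_iter = lr, n_iter

    def _activation(self, z):
        raise NotImplementedError

    def fit(self, X, y):
        self.w = np.zeros(X.shape[1])
        self.b = 0.0
        for _ in range(self.n_iter):
            pred = self._activation(X @ self.w + self.b)
            error = pred - y
            # Same gradient form for MSE (linear) and cross-entropy (logistic)
            self.w -= self.lr * X.T @ error / len(y)
            self.b -= self.lr * error.mean()
        return self

    def predict(self, X):
        return self._activation(X @ self.w + self.b)

class LinearRegression(BaseRegression):
    def _activation(self, z):
        return z                        # identity link

class LogisticRegression(BaseRegression):
    def _activation(self, z):
        return 1 / (1 + np.exp(-z))     # sigmoid link
```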

I mainly followed two great presentations, combined with my practical tips & tricks:

And, of course, I showed how the best open-source projects are organized.

Real-world solutions

Another comparison between academia and the real world, illustration from https://twitter.com/phdcomics/status/604978904558792704

Last but not least, I wanted to make sure that students could operationalize their solutions as something that makes sense to the end customer. To explain what a “useful data-driven product” is and how it is usually created, I briefly went through:

Regarding operationalization, I offered three choices: an analytical dashboard, an interactive GUI, or a REST API. For the first two kinds of apps I recommended Streamlit, for the latter, Flask. I also showed how to dockerize the solution and publish it on Heroku.
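For the REST API option, a minimal Flask sketch could look like the following (the endpoint name and model file are hypothetical; any scikit-learn-style estimator serialized with pickle would do):

```python
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load a previously trained model (hypothetical file name)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```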

What happened as a result

I was genuinely surprised (in a good way) by the final quality of the projects: students with a deeper CS background created their own games with algorithms that learn from the gameplay data, while maths and economics students created cool applications analyzing data from the real businesses of their friends and relatives. I want to mention a couple of projects by students who agreed to share their experience:

Deep learning-based pizza classifier

Illustration by the author

Jordy Dal Corso (LinkedIn, Email) dove the deepest into the algorithms and created an app that recognizes pizza ingredients from a photo using state-of-the-art deep learning models. Check out how well the source code and the usage of the algorithms are organized in the repository. There is no good course without a pizza-related project :)

Social media graph analysis for influencers

Illustration by the author

Marta Bonioli (LinkedIn, Email) focused on a very practical application: finding the right influencers for every company based on social graph analysis. She developed a tool where you enter your target company, and it analyzes the network of people around it and finds who can promote its products best!

Twitter sentiment analysis

Illustration by the author

Hunter Paul Youngquist (Email) also focused on Twitter, but from the sentiment analysis point of view. Following the PAIR guides, he identified the typical pains of black-box solutions that don’t give the user an opportunity to provide feedback or an explanation of the results, and he fixed that in his app. Check out his repository too!

Amazon sales forecasting

Illustration by the author

Martina Urbani (Email) used data from a real retail company that sells through Amazon and built an interactive dashboard that provides real-time segmented analytics, analyses commissions, and forecasts sales volumes. This will help business owners a lot in making the right strategic decisions!

Feel free to reach out to them if you think they can help you with your challenges ;)

What could I have done better?

Even though the overall course went pretty well, I feel there are two things I definitely could have worked through better:

Online and offline engagement

During in-person lectures I could at least sense empathetically whether students were engaged with the material; online this was very hard to do. In my next courses I want to try:

  • Interactive ML playgrounds that don’t require too much coding and give me instant feedback about progress and level of understanding.
  • Inviting guest lecturers or “clients” who can work with students on specific topics. This is something that, as we can see, works well on Clubhouse :)
  • Team projects: a very obvious move, but one I completely ignored during the course

Stressing the fundamentals

I had rather limited time to dive deeper into scientific computing and computer science fundamentals, which I find crucial. After explaining them, I think it’s great to check the understanding with automatically graded exercises, as Coursera does. Weekly auto-graded exercises on these profound topics definitely could have had a more positive impact on the understanding of the CS material.

Conclusions

I want to finish this article with my point of view on the future of academic education. With knowledge getting commoditized and research getting commercialized, universities need to find a new sweet spot where they can stay unique and, in some way, elite institutions. Also, a reminder about the Clubhouse chat coming soon ;)

First, I believe that universities should build and protect a monopoly on fundamental research, which requires forming independent thinkers focused on a very narrow frontier field of science yet having a broad perspective on important world problems. I emphasized the concept of independence in my course through the mental model of a data science consultant who is accountable for business results expressed as metrics. In science this is a bit different, and it is much better described in Paul Graham’s blog:

Second, I think that to compete with other institutions on both education and applied research, universities need to get skin in the game, become accountable for results, and depend less on government grants (which should go to fundamental research only). VC spinoffs based on university applied research, business and technology accelerators, and industry partnerships with clear KPIs can give academia the boost it needs to escape stagnation and refresh its blood.

Third, science has to become sexy again. If Einstein lived in our age, his Instagram would definitely be one of the most popular, but as we can see today, scientists don’t even make the top 1000. The Nobel prize doesn’t really do the job either, because if you ask a random person whether they know any winners, they will mention peace or literature geniuses, maybe economists, but physicists won’t come first. I personally like what Yuri Milner and his Breakthrough Initiative do; we need many more things like this:

P.S.
If you found this content useful and insightful, you can support me on Bitclout. I am open to discussions and collaborations in the technology education field; you can connect with me on Facebook or LinkedIn, where I regularly post AI-related articles or news opinions that are too short for Medium.
