Most data scientists have a mathematical, statistical, or other quantitative background. While they are well trained in building high-performing models, they may be less versed in using cloud technologies to engineer a production-grade model. Data scientists frequently ask:
Once I create a model, how can I make it usable for other websites, apps, and services?
A: There are many steps and considerations when transitioning from a trained model to production. These include determining where your model will be hosted, how it will be called from other applications, latency requirements, and many more. See our model building checklist for a full list of considerations.
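To make the idea of "how it will be called from other applications" concrete, here is a minimal sketch of a request handler that wraps a model behind a JSON interface. It is an illustration under stated assumptions, not a recommended architecture: ThresholdModel, its weights, and the handle_request function are hypothetical stand-ins for a real trained model and a real web framework.

```python
import json


class ThresholdModel:
    """Hypothetical stand-in for a trained model loaded from storage."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def predict(self, features):
        # Weighted sum of the inputs, thresholded at zero.
        score = sum(w * x for w, x in zip(self.weights, features)) + self.bias
        return 1 if score > 0 else 0


# Assumption: in a real system the model is loaded once at startup,
# not rebuilt on every request. The weights here are made up.
MODEL = ThresholdModel(weights=[0.5, -0.25], bias=0.1)


def handle_request(body: str) -> str:
    """Parse a JSON request body, run the model, and return a JSON response."""
    try:
        payload = json.loads(body)
        features = [float(x) for x in payload["features"]]
    except (ValueError, KeyError, TypeError):
        return json.dumps({"error": "expected a JSON object with a 'features' list"})
    if len(features) != len(MODEL.weights):
        return json.dumps({"error": "wrong number of features"})
    return json.dumps({"prediction": MODEL.predict(features)})


response = handle_request('{"features": [1.0, 1.0]}')
```

A web framework or cloud function would typically call something like handle_request for each incoming HTTP request. Where that code runs and how quickly it must respond are the hosting and latency questions mentioned above.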
How does machine learning differ from statistical analysis?
A: There are many differences (and similarities) between machine learning and statistical analysis. Machine learning primarily deals with making predictions: given a set of data inputs, can the model accurately predict some output? Statistical analysis primarily deals with correlations and relationships between variables, such as finding the variables most associated with some outcome. While the underlying theory and methods for machine learning and statistical analysis overlap considerably (logistic regression being one example), the objectives are usually different.
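As a rough illustration of the two objectives, the sketch below fits the same simple linear model and then asks two different questions of it. The data values are made up for this example, and ordinary least squares on one variable is used only because it is small enough to write by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: the model is fit on one set and evaluated on another.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]
test_x = [5.0, 6.0]
test_y = [11.5, 12.5]

slope, intercept = fit_line(train_x, train_y)

# Statistical-analysis question: how is the outcome related to the input?
# The fitted slope (change in y per unit of x) is the quantity of interest.
relationship = slope

# Machine-learning question: how well does the model predict unseen data?
# The held-out error is the quantity of interest.
predictions = [slope * x + intercept for x in test_x]
mae = sum(abs(p - y) for p, y in zip(predictions, test_y)) / len(test_y)
```

The fitting code is identical in both cases; what differs is whether you report the coefficient or the error on data the model has not seen.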
What are the tradeoffs between various programming languages for machine learning, such as R and Python?
A: Machine learning models can be built in many programming languages such as R, Python, SAS, Stata, and others. However, if your goal is to create a machine learning model in production, you need to consider the systems your model will need to integrate with. For this reason (and a few others), Python has become the go-to language for machine learning. Python makes it easier to integrate models with other software systems compared to the other languages. While you can certainly create a model in any of the above languages, if you are using the model in a cloud framework (AWS, GCP, etc.) or want to integrate it with a website or app, your life will be easier with Python.
What is the role of Big Data in Data Science and Machine Learning?
A: Big Data plays a vital (and sometimes overlooked) role in machine learning. The data you are working with is often so large that big data tools like Spark, Hadoop, and others are necessary to process it. Cloud frameworks like AWS and GCP also have tools that can help process large amounts of data.
What's the difference between a data scientist and a data analyst?
A: Data analysts are usually responsible for creating tables, charts, and visualizations from data. They usually have an undergraduate degree in a STEM (science, technology, engineering, math) field, but this is not required. Data analysts have practical programming skills and can combine, manipulate, and analyze data effectively. Data scientists have the skills of a data analyst, but usually have stronger programming and mathematical skills. They perform the same tasks as data analysts, but also create statistical and machine learning models when needed.
Software engineers typically have a computer science background and strong coding skills. They are comfortable with building production-grade applications but may need to learn the mathematical side of data science and machine learning in order to create a well-performing model. Software engineers typically ask:
How much math do I need to know in order to be successful in machine learning?
A: It depends on how involved you will be in the ML side of things (e.g., model training, refactoring model code to make it production ready, or creating the model architecture itself). Taking courses in linear algebra and advanced calculus is extremely helpful in machine learning. Knowing the mathematical details of modeling algorithms will help you make better modeling decisions and choose between trade-offs.
What are the methods most typically used by data scientists when analyzing data?
A: It varies and largely depends on the type of machine learning used by your company. DataScienceCentral provides a nice list of 40 common techniques used by Data Scientists. This list is a good starting point, but if you want to specialize in a particular area of machine learning (e.g. computer vision or natural language processing), you’ll want to be familiar with those algorithms too.
What software engineering skills are most useful in machine learning?
A: Software engineering skills are very helpful in machine learning. Having strong coding skills (along with ML knowledge) will allow you to build high-performing models. Also, engineering skills will make it easier to put models in production and build scalable, flexible systems around your model.
How do machine learning projects differ from software engineering projects?
A: The biggest difference between machine learning and software engineering projects is the ability to guarantee deliverables and performance. For example, engineering projects typically have specific requirements (such as building a website with features X, Y, and Z) and it’s easy to tell when those requirements are fulfilled. Machine learning projects can be more vague and open-ended, since model performance cannot usually be guaranteed. For example, you could build a recommendation engine but cannot guarantee how well it performs in production or how well it will perform in the future.
Product and project managers
As a product manager, how do I know whether or not to use ML for a particular problem?
A: Don’t use ML when your problem: (1) can be solved by simple rules, (2) does not need to adapt to new data, (3) requires 100% accuracy, or (4) requires full interpretability. Use ML when your problem has existing examples of actual answers and: (1) involves very complex logic, (2) needs to scale up quickly, (3) requires specialized personalization, or (4) must adapt in real time. If you are trying to create a complex formula involving too many variables, ML may be a better fit. For example, search products can have an unlimited number of possible inputs, so it is impossible to craft rules for every input.
Which areas of ML should product managers be involved in?
A: PMs should be heavily involved in defining: (1) what problem to solve, (2) what the measurable goal is, and (3) what you want to predict. Data selection also has high PM involvement: which datasets to use (public, internal, custom) and for what purposes (training and tuning, measuring success, replacing flawed or outdated data). Areas that have moderate PM involvement include: (1) data cleaning (e.g., removing or fixing missing data), (2) data sampling (e.g., choosing representative data and solving issues such as seasonality, trends, leakage, and biases), (3) checking for unintended bias in the data, and (4) data labeling (e.g., tagging or classifying data).
How can I measure the impact of machine learning on my product?
A: Methods such as A/B testing are typically used to measure how ML models impact products and what lift (financial and otherwise) they provide. See our Model Evaluation section for more details.
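As one possible way to quantify lift from an A/B test, the sketch below compares conversion rates between a control group (without the model) and a treatment group (with the model) using a two-proportion z-test. The function name ab_test and the visitor and conversion counts are assumptions made for illustration; a real evaluation would also need to consider sample size planning and which metric actually matters for the product.

```python
import math


def ab_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (a) and treatment (b).

    Returns (relative lift, z statistic, two-sided p-value) from a
    two-proportion z-test with a pooled standard error.
    """
    if n_a <= 0 or n_b <= 0 or conv_a <= 0:
        raise ValueError("need positive sample sizes and a non-zero control rate")
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled >= 1.0:
        raise ValueError("every visitor converted; the test is undefined")
    lift = (rate_b - rate_a) / rate_a
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return lift, z, p_value


# Made-up numbers: 200 of 10,000 control visitors converted,
# versus 260 of 10,000 visitors who saw ML-driven results.
lift, z, p_value = ab_test(200, 10_000, 260, 10_000)
```

Here lift is the relative improvement in conversion rate, and a small p_value suggests the difference is unlikely to be due to chance alone.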
How is managing an ML project different from managing a software project?
A: Many software development best practices can also be applied to ML projects. The key is to treat the model as one (but key) piece of the overall software system. For example, if the model will be invoked as an API, then decouple the development of that API from the rest of the system. If you are using Agile (Scrum) as your software development methodology, the API should deliver predictions at the end of each sprint, same as any other software component. The quality of the predictions will improve as the sprints go on, as the data science team enriches the model behind the API. See managing ML projects for best practices.
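The decoupling described above can be sketched as a fixed interface with interchangeable models behind it. All names here (Recommender, PopularityBaseline, HistoryAwareModel, render_homepage) are hypothetical, and the recommendation logic is deliberately simplistic; the point is only that the calling code stays the same while the model improves from sprint to sprint.

```python
from typing import Dict, List, Protocol


class Recommender(Protocol):
    """The contract the rest of the system depends on."""

    def recommend(self, user_id: str, k: int) -> List[str]: ...


class PopularityBaseline:
    """Early sprint: return the same popular items for every user."""

    def __init__(self, popular_items: List[str]):
        self.popular_items = list(popular_items)

    def recommend(self, user_id: str, k: int) -> List[str]:
        return self.popular_items[:k]


class HistoryAwareModel:
    """Later sprint: skip items the user has already seen."""

    def __init__(self, popular_items: List[str], seen: Dict[str, List[str]]):
        self.popular_items = list(popular_items)
        self.seen = seen

    def recommend(self, user_id: str, k: int) -> List[str]:
        already_seen = set(self.seen.get(user_id, []))
        fresh = [item for item in self.popular_items if item not in already_seen]
        return fresh[:k]


def render_homepage(recommender: Recommender, user_id: str) -> str:
    """Application code: depends only on the interface, not on the model."""
    return ", ".join(recommender.recommend(user_id, k=2))


items = ["book", "lamp", "mug", "pen"]
sprint_1 = render_homepage(PopularityBaseline(items), "u1")
sprint_5 = render_homepage(HistoryAwareModel(items, {"u1": ["book"]}), "u1")
```

Because render_homepage only relies on the recommend method, the application team can build and test against the baseline while the data science team improves the model behind the same interface.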