Introduction

PySpark is the Python interface for Apache Spark. It is a great tool for data scientists who work in Python and want to manipulate data, build machine learning pipelines, and deploy models in a distributed environment. Compared with single-machine Python libraries such as pandas and scikit-learn, PySpark scales better to very large data sets.

MySQL and PostgreSQL are two database management systems. MySQL is an open-source relational database management system (RDBMS), while PostgreSQL, also known as Postgres, is an object-relational database management system (ORDBMS) with an emphasis on extensibility and standards compliance. MySQL has been famous for its…


Introduction: Docker and Numerical Simulations

You may have heard of Docker before. Wikipedia defines Docker as “a set of platform as a service (PaaS) products that use OS-level virtualization to deliver software in packages called containers. Containers are isolated from one another and bundle their own software, libraries and configuration files; they can communicate with each other through well-defined channels. All containers are run by a single operating system kernel and therefore use fewer resources than virtual machines.” Docker is widely used in industry.

On the other hand, let me briefly talk about numerical simulations. In science, many physical and…


Copyright: Shutterstock: www.eatthis.com/starbucks-facts

This is a capstone project for the Udacity Data Science Nanodegree program.

The dataset contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one, get one free). Some users might not receive any offer during certain weeks. The dataset includes three JSON files:

  • portfolio.json — containing offer ids and metadata about each offer (duration, type, etc.)
  • profile.json — demographic…


Credit: Airbnb

Airbnb is a global online marketplace for vacation rentals that has arranged lodging, primarily homestays, and tourism experiences since 2008. Many of us like to use Airbnb when we travel around the world, since it is easy to book, usually less expensive than traditional hotels, and gives travelers the opportunity to connect with good local hosts. I had a great experience staying with an Airbnb host for 5 weeks in Seattle when I joined a data science program there last year.

From the customer perspective, it is important to better understand the Airbnb price. Do not just look at the…

Dong Zhang
