Data Science and the definition and role of a Data Scientist

Data scientists can be invaluable in generating insights, especially from “big data”, but their unique combination of technical and business skills, together with their heightened demand, makes them difficult to find or cultivate.

The research note, Emerging Role of the Data Scientist and the Art of Data Science, authored by Doug Laney and Lisa Kart, was released by Gartner in 2012. The authors stated that most of the dissenters they came in contact with seemed to believe that the title “data scientist” is nothing more than a pretentious moniker for a statistician or Business Intelligence (BI) analyst, so they decided to take a… scientific… approach to making that determination.

They thought it would be entirely fitting to perform text analysis of hundreds of job descriptions for “data scientist,” “statistician,” and “BI analyst” to learn what the commonalities and differences are, according to those actually hiring for the roles.

Those findings led them to define and distinguish the role of the data scientist more clearly, and without speculation, than anyone else had to date. Through the research they learned that data scientists are expected to work more in teams, to be comfortable and experienced with “big data” sets, and to be skilled at communication. Data scientist roles also more frequently require experience in machine learning, computing and algorithms, and require a PhD nearly twice as often as statistician roles. Even the technology requirements for each role differed, with data scientist job descriptions more frequently mentioning Hadoop, Pig, Python and Java, among others.
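The summary does not say how the text analysis was carried out. As a rough illustration of the general idea only, the sketch below computes, for each role, the share of postings that mention a given term. The postings, the term list, and the function name `mention_rates` are invented placeholders for this example and are not data or methods from the Gartner research.

```python
import re
from collections import defaultdict

# Hypothetical toy postings, invented for illustration only (not Gartner's data).
POSTINGS = [
    ("data scientist", "Work in teams on big data sets using Hadoop and Python."),
    ("data scientist", "Machine learning experience required; Java or Python."),
    ("statistician", "Design surveys and fit regression models in SAS."),
    ("bi analyst", "Build dashboards and SQL reports for business users."),
]

# Assumed list of terms to look for; a real study would choose its own.
TERMS = ["hadoop", "python", "java", "sql", "machine learning"]


def mention_rates(postings, terms):
    """Return {role: {term: fraction of that role's postings mentioning term}}."""
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for role, text in postings:
        totals[role] += 1
        lowered = text.lower()
        for term in terms:
            # Word-boundary match so "java" does not match "javascript".
            if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                hits[role][term] += 1
    return {
        role: {term: hits[role][term] / totals[role] for term in terms}
        for role in totals
    }


rates = mention_rates(POSTINGS, TERMS)
for role, by_term in rates.items():
    print(role, by_term)
```

Comparing these per-role rates side by side is one simple way to see which skills or tools are mentioned more often for one title than for another. A real analysis would need far more postings and more careful text normalization.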

The piece then goes on to define and describe the three core data science skills: data management, analytics modeling and business analysis. But beyond these, there’s an art to data science. The report authors detailed several soft skills that the research showed are also critical to success, i.e., communication, collaboration, leadership, creativity, discipline and passion (for information and truth).

With the need for data scientists growing at about three times the rate of that for statisticians and BI analysts, and an anticipated 100,000+ person analytic talent shortage through 2020, the report also included a listing of university programs around the world offering degrees in advanced analytics.

Source: Gartner

About Data Science

Data science is a discipline that incorporates varying degrees of Data Engineering, Scientific Method, Math, Statistics, Advanced Computing, Visualization, Hacker mindset, and Domain Expertise. A practitioner of Data Science is called a Data Scientist. Data Scientists solve complex data problems. An individual data scientist is most likely an expert in only one or two of these disciplines and proficient in another two or three. There is probably no living person who is expert in all these disciplines, and an extremely rare person would be proficient in all these disciplines. This means that data science must be practiced as a team, where across the membership of the team there is expertise and proficiency across all the disciplines.

Origins of Data Science

Data Science has existed for over a decade. An early claimant to the term Data Science is William S. Cleveland, who wrote Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics, published in Volume 69, No. 1, of the April 2001 edition of the International Statistical Review / Revue Internationale de Statistique. About a year later, in April 2002, the International Council for Science: Committee on Data for Science and Technology started publishing the CODATA Data Science Journal. Shortly thereafter, in January 2003, Columbia University began publishing The Journal of Data Science.

Popularization of Data Science

Mike Loukides, Vice President of Content Strategy for O’Reilly Media, helped to bring Data Science into the mainstream vernacular in 2010 with his article What is data science? In the last few years, data science has increasingly been associated with the analysis of Big Data. In the mid-2000s, DJ Patil at LinkedIn and Jeff Hammerbacher at Facebook created data science teams specifically to derive business value from the extremely large volumes of data being generated by their websites. There are now several ongoing conferences devoted to big data and data science, such as O’Reilly’s Strata Conferences and Greenplum’s Data Science Summits.

The job title “Data Scientist” has similarly become very popular. On one heavily used employment site, the number of job postings for “data scientist” increased 4,000 percent between 2010 and 2012.

Domain-specific interest of Data Science

Data Science is the practice of deriving valuable insights from data. Data Science is emerging to meet the challenges of processing very large data sets, i.e., “Big Data”, and the explosion of new data generated from smart devices, web, mobile and social media. Many practicing data scientists specialize in specific domains such as marketing, medicine, security, fraud and finance.

Security Data Science

Data Science has a long and rich history in security and fraud monitoring. Paul Braxton founder of coined the term Security Data Science and defined it as the application of advanced analytics to activity and access data to uncover unknown risks. Security Data Science is focused on advancing information security through practical applications of exploratory data analysis, statistics, machine learning and data visualization.

Although the tools and techniques are no different than those used in data science in any other data domain, this group has a micro-focus on reducing risk and identifying fraud or malicious insiders using data science. The information security and fraud prevention industries have been evolving Security Data Science in order to tackle the challenges of managing and gaining insights from huge streams of log data, discovering insider threats and preventing fraud. Security Data Science is “data driven”, meaning that new insights and value come directly from data.

Academic programs for Data Scientists

Several universities have begun graduate programs in data science, such as those at the Institute for Advanced Analytics at North Carolina State University and the McCormick School of Engineering at Northwestern University, as well as the six-week summer program at the University of Illinois.

Professional organizations and domain-specific organizations

A few professional organizations have sprung up recently. Data Science Central, Kaggle and ScraperWiki are examples.

Data scientists work in many industries across many data domains; however, specializations in some domains have emerged. One such domain is security. The Association of Security Data Scientist has formed and is promoting Security Data Science as a sub-discipline of Information Security.

From Wikipedia