Course Resources

Here are some resources you might find helpful for this course (and hopefully beyond). Recommendations are welcome.

Data Sources

KDnuggets Data Repositories List — Data repository list maintained by KDnuggets, a popular data mining website

UCI Datasets — The UC Irvine Machine Learning Repository, a popular public source of machine learning datasets

Wikipedia Database — Webpage for access to complete Wikipedia database dumps

IMDb Datasets — Webpage for access to IMDb datasets

Datasets — Webpage for access to datasets

US government source of data about the nation's people and economy

Source of machine-readable datasets generated by the US government

UK's Office for National Statistics — Source of datasets generated by the UK's Office for National Statistics

UK's Met Office Data — Climate station records from the UK's National Weather Service

CDC Data — Medical data from the Centers for Disease Control and Prevention

World Bank Catalog — World Bank data

RealClimate Data — Aggregator for selected sources of code and data related to climate science

Google Public Data Explorer — Google's public data portal to explore, visualize, and communicate large datasets

Dataverse Network — Repository for research datasets

Linked Data — Linkage site for distributed data

Datamob — Aggregator for public datasets

Quandl — Search engine for financial, economic, and social datasets

Data Market — Portal for shared business data

CKAN — Open-source data portal platform

Hilary Mason (bitly) Data Links — Hilary Mason's bookmarked research-quality datasets

Peter Skomoroch (LinkedIn) Data Links — Peter Skomoroch's bookmarked machine learning data resources

Jake Hofman Data Links — Jake Hofman's bookmarked computational social science data resources

Reddit Open Data — Forum on the social news site reddit for open APIs and datasets

Guardian DataBlog — Data journalism and data visualization from the Guardian

Free SVG Maps — Website for free geographic maps

StateMaster — Reference site for data on US states

Wolfram|Alpha — Computational knowledge engine or answer engine

Data Visualization Resources

Many Eyes — Web community that connects visualization experts, practitioners, academics, and enthusiasts

Visual Complexity — Resource space for anyone interested in the visualization of complex networks

Thumbs Up Viz — Collection of elegant, efficient, and (above all) effective data visualizations

WTF Visualizations — Visualizations that make no sense

Python Resources

Python — The official Python website

The Python Tutorial — The official tutorial from the Python documentation

Learn Python in X Minutes — Whirlwind tour of Python programming

Learn Python the Hard Way — Teaches Python by slowly building and establishing skills through practice and application

Learn Python (interactive) — Engaging Python tutorials

Google's Python Class — Teaches Python via written materials, lecture videos, and lots of code exercises

Python-related video index

yhat Data Science in Python Tutorial — Uses IPython to teach data science

Anaconda Python Distribution — Free Python distribution for large-scale data processing and predictive analytics

The Python Package Index — Repository of Python software

pip — Tool for installing and managing Python packages

NumPy — Python package for scientific computing

SciPy Library — Python package for mathematics, science, and engineering

Matplotlib — Python package for 2D plotting

pandas — Python package for high-performance, easy-to-use data structures and data analysis tools

IPython — Architecture for interactive computing with Python

scikit-learn — Python package for machine learning
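
To see which of the packages listed above are already available in your environment, something like the following minimal sketch may help. It uses only the standard library. The names `COURSE_PACKAGES` and `missing_packages` are hypothetical, chosen for illustration, and the list holds import names rather than project names (scikit-learn, for instance, is imported as `sklearn`).

```python
# Sketch: report which of the packages listed above cannot be imported.
# COURSE_PACKAGES and missing_packages are illustrative names, not part of any library.
from importlib.util import find_spec

# Import names for the listed packages; scikit-learn is imported as "sklearn".
COURSE_PACKAGES = ["numpy", "scipy", "matplotlib", "pandas", "IPython", "sklearn"]


def missing_packages(names):
    """Return the top-level module names that are not found on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(COURSE_PACKAGES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed packages are installed.")
```

Anything reported as missing can typically be installed with pip under its project name, for example `pip install scikit-learn`.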

Additional Recommended Books

Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition (Witten, Frank, & Hall, 2011)

“Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.” An ebook copy is available via the Hesburgh Libraries.

Programming Collective Intelligence (Segaran, 2007)

“This fascinating book demonstrates how you can build Web 2.0 applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you've found it. Programming Collective Intelligence takes you into the world of machine learning and statistics, and explains how to draw conclusions about user experience, marketing, personal tastes, and human behavior in general—all from information that you and others collect every day.”

The Elements of Statistical Learning, 2nd Edition (Hastie, Tibshirani, & Friedman, 2009)

“During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics.” Free PDF download available.

Learning from Data (Abu-Mostafa, Magdon-Ismail, & Lin, 2012)

“Machine learning allows computational systems to adaptively improve their performance with experience accumulated from the observed data. Its techniques are widely applied in engineering, science, finance, and commerce. This book is designed for a short course on machine learning. It is a short course, not a hurried course. From over a decade of teaching this material, we have distilled what we believe to be the core topics that every student of the subject should know. We chose the title ‘learning from data’ that faithfully describes what the subject is about, and made it a point to cover the topics in a story-like fashion. Our hope is that the reader can learn all the fundamentals of the subject by reading the book cover to cover.”

And Then There's…

Big Data Borat