Research Interests

My principal research interest is in large-scale multi-modal (typed/heterogeneous) information networks and, more generally, in data mining, databases, machine learning, statistics, and network science. Because of the size of these information networks and their corresponding databases, most of my work falls within the big-data paradigm and requires new, scalable technology.

Broadly, my goals include: (1) understanding the underlying mechanisms of information network formation; (2) mining interesting and useful patterns in such networks; (3) predicting the evolution of dynamic information networks; and (4) studying real-world applications and helping decision-makers design better information networks to enhance their usability.

Current Projects

The Notre Dame Information Network Analysis (NINA) Toolkit is an open-source network analysis framework that combines cutting-edge methods from text analysis, machine learning, and network science into a single toolkit. The toolkit is under active development and is currently released without any warranty whatsoever.
By viewing certain networks as hierarchies, we have been able to identify the types and roles of actors within a network. This is part of an ongoing effort to explore the formation and dynamics of multi-modal networks.
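As a generic illustration of the idea (not our exact method), one can describe each node by a few structural features and cluster the nodes hierarchically; nodes falling in the same subtree tend to play similar roles. A minimal sketch, assuming the networkx and scipy libraries and a placeholder feature set:

    # Hedged sketch: hierarchical role discovery from structural features.
    # Illustrative only; the feature set and cut point are assumptions.
    import networkx as nx
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    G = nx.karate_club_graph()  # stand-in for a real information network

    # Describe each node by simple structural features.
    degree = nx.degree_centrality(G)
    clustering = nx.clustering(G)
    pagerank = nx.pagerank(G)
    X = np.array([[degree[v], clustering[v], pagerank[v]] for v in G])

    # Build a hierarchy over the nodes and cut it into k candidate roles.
    Z = linkage(X, method="ward")
    roles = fcluster(Z, t=4, criterion="maxclust")
    for v, r in zip(G, roles):
        print(v, "-> role", r)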
Social news Web sites such as Slashdot, Reddit, and Fark are a new kind of Web democracy that is changing the way people organize and consume information. We are actively investigating the methods and implications of this new type of media platform.
We have been investigating how the network representation of the World Wide Web can be used to improve information-processing technologies that aid in better access, organization, and retrieval of information.
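PageRank is the canonical example of exploiting the Web's link structure for retrieval, and it illustrates the kind of network representation involved. A minimal sketch using networkx (the four-page Web graph is made up):

    # Hedged sketch: ranking pages by link structure with PageRank.
    # The tiny Web graph below is an invented example.
    import networkx as nx

    web = nx.DiGraph()
    web.add_edges_from([
        ("a.com", "b.com"), ("b.com", "c.com"),
        ("c.com", "a.com"), ("d.com", "c.com"),
    ])

    # alpha is the standard damping factor.
    ranks = nx.pagerank(web, alpha=0.85)
    for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.3f}")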

Past Projects

Timelapse/Evolution of a research paper - I submitted a paper to a conference in 2013. While writing the paper I took a snapshot every time I compiled the LaTeX file, resulting in 463 individual snapshots. I wrote some scripts to glue all the pages of each snapshot together into a single frame and then stitched the frames together into a video. This is the result.
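For the curious, a pipeline like this can be reproduced with standard tools; the sketch below is a rough reconstruction (the original scripts may have differed) using pdftoppm, ImageMagick's montage, and ffmpeg:

    # Hedged sketch of a snapshot-to-video pipeline; the actual scripts
    # may have differed. Requires pdftoppm, montage (ImageMagick), ffmpeg.
    import glob
    import os
    import subprocess

    os.makedirs("tmp", exist_ok=True)
    os.makedirs("frames", exist_ok=True)

    for i, pdf in enumerate(sorted(glob.glob("snapshots/*.pdf"))):
        # Render every page of this snapshot to PNG images.
        subprocess.run(["pdftoppm", "-png", pdf, f"tmp/snap{i:04d}"], check=True)
        # Glue all pages of one snapshot into a single frame.
        pages = sorted(glob.glob(f"tmp/snap{i:04d}*.png"))
        subprocess.run(["montage", *pages, "-tile", "8x",
                        "-geometry", "+2+2", f"frames/frame{i:04d}.png"], check=True)

    # Stitch the frames together into a timelapse video.
    subprocess.run(["ffmpeg", "-framerate", "10", "-i", "frames/frame%04d.png",
                    "-pix_fmt", "yuv420p", "timelapse.mp4"], check=True)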
The problem of extracting structured data from the Web has traditionally been approached by taking into account either the underlying markup structure of a Web page or its visual structure. We propose a new hybrid method that uses both visual and HTML cues to extract general lists and tables from the Web.
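To make the markup half of the problem concrete, here is a minimal sketch (not the proposed hybrid method itself) that pulls explicit table elements with BeautifulSoup; the visual half would additionally need rendered bounding boxes from a headless browser, which this sketch omits:

    # Hedged sketch: HTML-cue table extraction only; the hybrid method
    # also uses visual cues, which are not shown here.
    from bs4 import BeautifulSoup

    html = """<html><body>
      <table><tr><th>Year</th><th>Count</th></tr>
             <tr><td>2013</td><td>42</td></tr></table>
    </body></html>"""

    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):
        rows = [[cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
                for tr in table.find_all("tr")]
        print(rows)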
Typical Web pages contain a great deal of information, but much of the space on a Web page is filled with advertisements, navigational menus, copyright notices, and other items that should not generally be considered content. For many applications we need ways to extract only the content of a Web page and discard the non-content clutter.
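One simple and widely used baseline for this task (a sketch of a common heuristic, not necessarily the method in our work) is the text-to-tag ratio: lines of HTML whose visible text greatly outweighs their markup are likely to be content.

    # Hedged sketch: text-to-tag-ratio content extraction heuristic.
    # The threshold is an illustrative assumption.
    import re

    def text_to_tag_ratio(line: str) -> float:
        tags = len(re.findall(r"<[^>]*>", line))
        text = len(re.sub(r"<[^>]*>", "", line).strip())
        return text / max(tags, 1)

    def extract_content(html: str, threshold: float = 10.0) -> str:
        # Keep lines whose visible text outweighs their markup.
        kept = [re.sub(r"<[^>]*>", "", ln).strip()
                for ln in html.splitlines()
                if text_to_tag_ratio(ln) > threshold]
        return "\n".join(kept)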
I was fortunate to win a few PhD fellowships. I remember struggling to find examples of winning essays when I was formulating my proposal, so I have decided to publish my essays, as well as my theses, to help others in the same position.