Data World: Platform for Data Science Collaboration
While there are multiple excellent commercial data science platforms available (Dataiku, Databricks, DataRobot, etc.), they are expensive and not open to public collaboration. Only a few platforms are free or low-cost and align with the Open Data and Open Science movements. The examples that come to mind are Harvard Dataverse, the Open Science Framework (OSF), and Data World. This article discusses Data World and its many unique features.
Data World is a public benefit corporation located in Austin, Texas, that launched in 2016. It is an online platform where participants can find data or upload their own, create a project, and then invite collaborators. Projects and datasets can be made private or public. Thousands of open datasets covering a variety of topics are hosted on Data World. At the time of writing, 4425 publicly available health-related datasets were hosted.
Data World integrates with more than 40 external programs using REST APIs. The integrations can be grouped into these categories: spreadsheets; visualization software; storage sites; analytics tools; programming languages; databases; extract, transform, and load (ETL) tools; and miscellaneous integrations such as Canvas, Slack, and Jupyter Notebooks. The full list of integrations can be viewed on the Data World site. This means that your data can be processed in an external program without first downloading it and then uploading it there. Among the other unique features are excellent SQL and SPARQL tutorials.
Sample Project
Creating a new project and uploading the data is straightforward. The following figure shows a project created by the author that makes use of NHANES data. The Description section summarizes the project. The main menu includes Overview (description), Activity, Insights (your analyses), Discussion, and People (collaborators). For this project, the goal was to take 24 tables (CSV files) out of the 125 tables available for the 2011–2012 NHANES time period and combine them into one file for easier mining. As a result, the combined table includes demographics, social determinants of health, conditions/diseases, vital signs, and lab data. There are 49 attributes and 5206 adult patients in the dataset.
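To make the combining step concrete, here is a minimal sketch of joining several CSV tables on a shared participant identifier. It is not the exact procedure used for this project. It assumes every table carries the NHANES respondent sequence number (SEQN); the other column names and the sample rows are made up for illustration.

```python
import csv
import io


def read_table(csv_text, key="SEQN"):
    """Parse CSV text into {participant id: row dict without the key column}."""
    rows = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        participant_id = row.pop(key)
        rows[participant_id] = row
    return rows


def combine_tables(tables, key="SEQN"):
    """Left-join tables on the key, keeping the participants of the first table."""
    base, *others = tables
    combined = []
    for participant_id, base_row in base.items():
        merged = {key: participant_id, **base_row}
        for other in others:
            # Column names come from any row of the table being joined.
            columns = next(iter(other.values()), {}).keys()
            match = other.get(participant_id, {})
            for column in columns:
                # Participants absent from a table get an empty cell.
                merged[column] = match.get(column, "")
        combined.append(merged)
    return combined


# Hypothetical sample data; real NHANES files use different column names.
demographics_csv = "SEQN,age,sex\n1,54,F\n2,61,M\n3,47,F\n"
labs_csv = "SEQN,hba1c\n1,6.9\n3,5.4\n"

combined = combine_tables([read_table(demographics_csv), read_table(labs_csv)])
```

Treating the first table as the base keeps one row per participant, and a participant with no lab record simply gets empty lab cells rather than being dropped.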
Once the project description has been written, you are ready to go to your workspace (using the Launch Workspace button). The next figure displays the sections of the workspace, including the project directory, where the data dictionary, project files, SQL queries, and insights are stored.
In the center is a spreadsheet-like section where columns can be sorted and modified. When the hash symbol (#) to the left of a column header is selected, descriptive statistics are generated, as seen in the next figure.
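For readers who want to reproduce this kind of column summary outside the workspace, the following is a rough sketch using only the Python standard library. The set of statistics shown here is an assumption and may not match what Data World reports, and the sample values are invented.

```python
import statistics


def describe(values):
    """Summarize a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
        "min": min(present),
        "max": max(present),
    }


# Hypothetical hemoglobin A1c values with one missing entry.
hba1c = [5.0, 6.0, None, 7.0, 6.0]
summary = describe(hba1c)
```

Note that `statistics.stdev` needs at least two non-missing values, so a nearly empty column would have to be handled separately.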
In the upper right of the workspace are the options to download the file, integrate with external programs, and generate an SQL query. The following figure displays an SQL query to identify patients with a hemoglobin A1c equal to or greater than 6.5% and a fasting glucose equal to or greater than 126 mg/dL. This query will identify most diabetics in the cohort. The SQL query can be stored for future use or shared.
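A query of this shape can be tried locally before running it in the workspace. The sketch below uses Python's built-in sqlite3 module with an in-memory table. The table name, the column names, and the rows are all hypothetical, and SQLite's dialect may differ in details from the SQL that Data World supports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nhanes_combined ("
    "seqn INTEGER PRIMARY KEY, hba1c REAL, fasting_glucose REAL)"
)
# Invented rows: A1c in percent, fasting glucose in mg/dL.
conn.executemany(
    "INSERT INTO nhanes_combined VALUES (?, ?, ?)",
    [
        (1, 7.2, 140.0),   # both thresholds met
        (2, 5.4, 95.0),    # neither met
        (3, 6.5, 126.0),   # exactly at both thresholds
        (4, 6.8, 110.0),   # A1c only
        (5, None, 150.0),  # A1c missing, so the row is excluded
    ],
)

QUERY = """
SELECT seqn, hba1c, fasting_glucose
FROM nhanes_combined
WHERE hba1c >= 6.5
  AND fasting_glucose >= 126
ORDER BY seqn
"""

likely_diabetic = conn.execute(QUERY).fetchall()
```

Because comparisons against NULL are never true in SQL, patients with a missing lab value drop out of the result, which is worth keeping in mind when interpreting counts.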
Because the project is public, anyone accessing Data World who is looking for patient-related data can access the project, share the results, and upload data.
The Insights section is where analyses are posted. The following image shows an analysis of the percentage of adult patients who deny having diabetes even though their lab tests indicate otherwise. We also looked at the percentage of patients reporting pre-diabetes who meet the American Diabetes Association criteria for diabetes, based on lab work.
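The arithmetic behind this kind of insight is simple enough to sketch. The example below is not the posted analysis. It assumes a hypothetical self_report field with the values "no", "borderline", and "yes", and it applies the same two thresholds as the SQL query described earlier (A1c of 6.5 or higher and fasting glucose of 126 or higher). The patient records are invented.

```python
def meets_lab_criteria(hba1c, fasting_glucose):
    """True when both lab values reach the thresholds used in the query."""
    if hba1c is None or fasting_glucose is None:
        return False  # a missing lab value cannot confirm the criteria
    return hba1c >= 6.5 and fasting_glucose >= 126


def percent_meeting_criteria(patients, self_report):
    """Percent of patients with the given self-report whose labs meet criteria."""
    group = [p for p in patients if p["self_report"] == self_report]
    if not group:
        return None  # avoid dividing by zero for an empty group
    flagged = sum(
        meets_lab_criteria(p["hba1c"], p["fasting_glucose"]) for p in group
    )
    return 100.0 * flagged / len(group)


# Invented records for illustration only.
patients = [
    {"self_report": "no", "hba1c": 5.2, "fasting_glucose": 90.0},
    {"self_report": "no", "hba1c": 6.9, "fasting_glucose": 131.0},
    {"self_report": "no", "hba1c": 5.6, "fasting_glucose": 101.0},
    {"self_report": "no", "hba1c": 6.7, "fasting_glucose": None},
    {"self_report": "borderline", "hba1c": 6.6, "fasting_glucose": 128.0},
    {"self_report": "borderline", "hba1c": 6.0, "fasting_glucose": 112.0},
    {"self_report": "yes", "hba1c": 8.1, "fasting_glucose": 160.0},
]

pct_denying = percent_meeting_criteria(patients, "no")
pct_borderline = percent_meeting_criteria(patients, "borderline")
```

Here patients with a missing lab value stay in the denominator but are counted as not meeting the criteria. Excluding them instead is an equally defensible choice and would change the percentages.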
These examples are a small snapshot of what can be analyzed and posted on Data World. I have posted results using statistical packages, programming languages, machine learning, etc. I have also posted exercises for one of my textbooks, Introduction to Biomedical Data Science.
The free tier allows three projects and includes 1 GB of storage. The Professional option costs $12 monthly and includes 20 projects and 100 GB of storage. A discounted academic license is also available.
Data science is a team sport, so it is highly recommended that anyone participating in data science consider adopting an open collaboration platform like Data World.