Cancer Genetics Study Using Data Analysis and Regression Modeling - A Python Project

Credits to: Haohan Wang, Jinglin Jian from the University of Illinois Urbana-Champaign

Background & Objective:

Understanding the genetic basis of diseases can open doors to tailored treatments and personalized medicine. Furthermore, the interplay between genes, diseases, and external factors, such as demographics and other traits, can offer deeper insights into disease progression and outcomes.

This project focuses on deciphering the intricate relationships between specific genes, cancer development, and various conditions through a comprehensive analysis of genetic data. Utilizing the Xena dataset, which encompasses clinical and genetic information about 36 types of cancers, this project employs advanced data analysis and regression modeling techniques to explore gene-trait pairs in the context of cancer genetics.

The objective is to understand how certain genes contribute to cancer risk and progression under different demographic and environmental conditions. This project aims not only to deepen our understanding of cancer genetics but also to contribute to personalized medicine approaches in oncology.

Tools and Platforms:

Data Analysis: Python
Data Visualization: Matplotlib, Seaborn
Version Control: GitHub

Methodology:

To achieve the project objective, I explored 48 gene-trait pairs, each combined with a condition; each combination forms one research question. Each of the 48 research questions roughly follows the methodology below.

Selection of the Research Question: For each research question, a gene-trait pair will be selected along with a condition. This forms the core of the analysis and dictates the data sources and analysis techniques to be used.
Data Collection: We will mainly use clinical and genetic data from the Xena dataset, which contains information on 36 types of cancers. We will also use data from the GEO dataset, which caters to most of our research problems. Depending on the chosen research question, additional datasets may need to be downloaded, or supplementary preprocessing scripts may be required.
Data Analysis:
  Data Pre-processing: Cleaning, normalizing, and transforming the data as required.
  Implementation of the Regression Model: The model will be used to analyze the relationships between the selected gene-trait pairs and conditions.
Validation: The results of the regression model will be checked for accuracy and reliability.
Interpretation: The final step will involve understanding the results, making sense of the findings, and deducing conclusions.
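
The pre-processing, regression, and validation steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the README does not say which regression model or validation scheme was used, so ordinary least squares and a random hold-out split stand in for them here. The function names (`fit_ols`, `r_squared`, `analyze`) are hypothetical, and the data is synthetic rather than taken from Xena or GEO.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [intercept, coefs...]."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def r_squared(X, y, beta):
    """Coefficient of determination of the fitted model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    resid = y - A @ beta
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


def analyze(gene_expr, condition, trait, test_fraction=0.25, seed=0):
    """Regress a trait on gene expression and a condition (e.g. age)."""
    # Pre-processing: drop samples with any missing value.
    data = np.column_stack([gene_expr, condition, trait]).astype(float)
    data = data[~np.isnan(data).any(axis=1)]
    X, y = data[:, :2], data[:, 2]

    # Validation setup: hold out a random subset of samples (assumed scheme).
    idx = np.random.default_rng(seed).permutation(len(y))
    n_test = int(len(y) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]

    # Normalization: z-score predictors using training statistics only.
    mean, std = X[train].mean(axis=0), X[train].std(axis=0)
    X_train, X_test = (X[train] - mean) / std, (X[test] - mean) / std

    beta = fit_ols(X_train, y[train])
    return {
        "n_samples": len(y),
        "intercept": beta[0],
        "gene_coef": beta[1],
        "condition_coef": beta[2],
        "test_r2": r_squared(X_test, y[test], beta),
    }


# Synthetic stand-in data: NOT real measurements.
rng = np.random.default_rng(42)
n = 400
gene = rng.normal(size=n)                # hypothetical expression values
age = rng.uniform(20, 80, size=n)        # hypothetical condition
trait = 2.0 * gene + 0.05 * age + rng.normal(scale=0.5, size=n)
gene[:5] = np.nan                        # simulate missing entries

for name, value in analyze(gene, age, trait).items():
    print(f"{name}: {value:.3f}")
```

Because the predictors are standardized, the two coefficients are on a comparable scale, which helps with the interpretation step. A binary trait (for example, tumor versus normal) would call for a different model, such as logistic regression.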

Example Jupyter notebook addressing the question: "What's the relationship between the TP53 gene and Adrenocortical Cancer when considering the influence of age?"

[Screenshots of the example notebook (two images)]