Feature Engineering Techniques for Healthcare Data Analysis | Real-World Examples & Insights by Leo Anello
We’ll continue our focus on feature engineering — this remains the core objective of this project.
Upon completing all feature engineering tasks, I’ll save the results in a CSV file as the final deliverable, marking the project’s completion.
In the previous tutorial, we explored several techniques and stopped at this cell.
# 57. Value counts for 'admission_source_id' after recategorization
df['admission_source_id'].value_counts()
I’ll now continue working in the same notebook, picking up where we left off. In our dataset, we have three variables, diag_1, diag_2, and diag_3, each representing a medical diagnosis.
So, how should we handle these variables? I don’t have a background in medical diagnoses, nor am I a healthcare professional.
In cases like this, what do we do? Research. If needed, we consult experts, or we study reference materials.
Let’s start by taking a look at the data, shall we?
# 58. Viewing the data
df[['diag_1', 'diag_2', 'diag_3']].head()
I’ll filter the DataFrame to focus on diag_1, diag_2, and diag_3, each containing numerical ICD-9 codes that classify specific diseases (primary, secondary, and additional) for each patient.
Using these codes directly might make the analysis too granular, so instead, we’ll group them into four comorbidity-based categories—a healthcare concept that highlights when multiple health conditions coexist.
This step shifts our approach from raw disease codes to a more interpretable, high-level metric. The code involved is not complex; the real work lies in the interpretive decisions about how to group the codes.
If we keep the codes as-is, our analysis will remain focused on disease classifications alone. But by consolidating the data from diag_1, diag_2, and diag_3 into a new comorbidity variable, we gain richer insights. Effective feature engineering means converting available information into higher-value metrics.
To proceed, we’ll define this new variable based on a clear criterion — comorbidity. This way, our transformation is clinically relevant and adaptable for other analyses. Even if domain knowledge is limited, we can consult field experts to guide the feature design.
I’ll walk through creating this feature in Python, transforming the raw diagnoses into a feature that captures critical patient health patterns, underscoring the power of domain-driven feature engineering.
Applying Feature Engineering Strategies
We’re working here to uncover hidden insights within our dataset by transforming the variables.
This information exists, but it’s not immediately visible; we need feature engineering to reveal it. The visible details, like individual disease codes, are straightforward and valuable in their own right, but there’s often more depth in the hidden layers of data.
By extracting these invisible insights, we can analyze the data from a new angle or perspective — a shift that can greatly enhance daily data analysis. Personally, I see feature engineering as more of an art than a purely technical task.
The Python programming we’re doing isn’t particularly complex; the real skill is in reaching a level of abstraction where we can see insights that aren’t immediately obvious.
This ability to abstract develops with experience — working on diverse projects, learning from mistakes, and gradually noticing that almost every dataset holds hidden information that, when properly engineered, can enhance analysis. That’s precisely what we’re working on here together.
Based on our exploration, we’ve decided to create a new variable from these three diagnostic columns. We’ll apply comorbidity as our guiding criterion, which will allow us to group these variables based on whether the patient has multiple coexisting conditions.
To proceed, I’ll create a new DataFrame named diagnosis that will contain diag_1, diag_2, and diag_3. This setup allows us to focus exclusively on these columns as we implement the comorbidity-based transformation.
# 59. Concatenating 3 variables into a dataframe
diagnosis = df[['diag_1', 'diag_2', 'diag_3']]
Here, I have the values for you — they’re all disease codes.
# 60. Viewing the data
diagnosis.head(10)
Also, note that we have no missing values.
# 61. Checking for missing values
diagnosis.isnull().any()
To create a new variable based on comorbidity, our first step is to establish a clear criterion that defines it within our dataset. In practical terms, comorbidity simply means the presence of more than one disorder in a patient. For instance, if a patient has three diagnoses corresponding to three different conditions, it’s likely they have comorbidities.
Imagine a patient diagnosed with both depression and diabetes — these conditions may be interconnected. Our aim is to detect these overlaps and extract useful information. This process transforms raw data into actionable insights.
Feature engineering, in this sense, goes beyond the obvious. Many professionals focus only on visible data — analyzing it as it is, without uncovering deeper, interconnected patterns. However, invisible information can reveal more nuanced insights, and uncovering it requires experience and a refined sense of abstraction.
To determine the comorbidity of different conditions, we’ll need to use domain knowledge. Here’s where understanding patterns in the medical field helps us apply relevant criteria. For example:
- Mental Health and Chronic Conditions: Someone diagnosed with social anxiety and depression has comorbid mental health conditions. Similar patterns apply with other pairs, like diabetes and cardiovascular diseases or infectious diseases and dementia.
- Eating Disorders: Commonly overlap with anxiety disorders and substance abuse, forming a complex comorbid profile.
When identifying these connections, it’s often helpful to refer to a data dictionary or consult with the business or healthcare team, especially if we’re unfamiliar with the specific disorders. The goal isn’t just to look knowledgeable but to learn and leverage expert insights. Many times, insights from others reveal aspects of data that we might not have anticipated.
Our task now is to set up criteria for comorbidity within this dataset. This will involve:
- Creating a function to analyze the diagnoses.
- Assigning codes to identify specific disorders, which we’ll use to determine if a patient has multiple overlapping health issues.
Once the criteria are defined, we’ll translate them into Python code, generating a new variable that represents the comorbidity level for each patient. This new feature will allow us to explore how overlapping conditions impact health outcomes in a structured, data-driven way.
Let’s begin by setting up the Python function to implement this approach.
# 63. Function that calculates Comorbidity
import re
import numpy as np

def calculate_comorbidity(row):

    # 63.a Code 250 indicates diabetes
    diabetes_disease_codes = "^[2][5][0]"

    # Codes 39x (x = value between 0 and 9)
    # Codes 4zx (z = value between 0 and 6, and x = value between 0 and 9)
    # 63.b These codes indicate circulatory problems
    circulatory_disease_codes = "^[3][9][0-9]|^[4][0-6][0-9]"

    # 63.c Initialize return variable
    value = 0

    # Value 0 indicates that:
    # 63.d Neither diabetes nor circulatory problems were detected in the patient
    if (not bool(re.match(diabetes_disease_codes, str(np.array(row['diag_1'])))) and
        not bool(re.match(diabetes_disease_codes, str(np.array(row['diag_2'])))) and
        not bool(re.match(diabetes_disease_codes, str(np.array(row['diag_3'])))) and
        not bool(re.match(circulatory_disease_codes, str(np.array(row['diag_1'])))) and
        not bool(re.match(circulatory_disease_codes, str(np.array(row['diag_2'])))) and
        not bool(re.match(circulatory_disease_codes, str(np.array(row['diag_3']))))):
        value = 0

    # Value 1 indicates that:
    # 63.e At least one diagnosis of diabetes and NO circulatory problems were detected in the patient
    # (the OR group is parenthesized because `and` binds tighter than `or`)
    elif ((bool(re.match(diabetes_disease_codes, str(np.array(row['diag_1'])))) or
           bool(re.match(diabetes_disease_codes, str(np.array(row['diag_2'])))) or
           bool(re.match(diabetes_disease_codes, str(np.array(row['diag_3']))))) and
          not bool(re.match(circulatory_disease_codes, str(np.array(row['diag_1'])))) and
          not bool(re.match(circulatory_disease_codes, str(np.array(row['diag_2'])))) and
          not bool(re.match(circulatory_disease_codes, str(np.array(row['diag_3']))))):
        value = 1

    # Value 2 indicates that:
    # 63.f At least one diagnosis of circulatory problems and NO diabetes were detected in the patient
    elif (not bool(re.match(diabetes_disease_codes, str(np.array(row['diag_1'])))) and
          not bool(re.match(diabetes_disease_codes, str(np.array(row['diag_2'])))) and
          not bool(re.match(diabetes_disease_codes, str(np.array(row['diag_3'])))) and
          (bool(re.match(circulatory_disease_codes, str(np.array(row['diag_1'])))) or
           bool(re.match(circulatory_disease_codes, str(np.array(row['diag_2'])))) or
           bool(re.match(circulatory_disease_codes, str(np.array(row['diag_3'])))))):
        value = 2

    # Value 3 indicates that:
    # At least one diagnosis of diabetes and at least one diagnosis of circulatory problems
    # 63.g were detected simultaneously in the patient
    elif ((bool(re.match(diabetes_disease_codes, str(np.array(row['diag_1'])))) or
           bool(re.match(diabetes_disease_codes, str(np.array(row['diag_2'])))) or
           bool(re.match(diabetes_disease_codes, str(np.array(row['diag_3']))))) and
          (bool(re.match(circulatory_disease_codes, str(np.array(row['diag_1'])))) or
           bool(re.match(circulatory_disease_codes, str(np.array(row['diag_2'])))) or
           bool(re.match(circulatory_disease_codes, str(np.array(row['diag_3'])))))):
        value = 3

    return value
At first glance, I know this Python code might look intimidating, right? What’s this? This huge block of code? Don’t worry — it’s much simpler than it seems, okay? Follow the explanation with me here.
I have a function called calculate_comorbidity, which takes a row from my DataFrame as input, processes it, and outputs a result. I even call this function here, like so.
%%time
# 64. Applying the comorbidity function to the data
df['comorbidity'] = diagnosis.apply(calculate_comorbidity, axis=1)
Notice that I’m calling the diagnosis DataFrame, which contains the values for diag_1, diag_2, and diag_3. I’m applying the function and generating a new column. So, what does this function actually do?
First, when we enter the function, we create a Python variable called diabetes_disease_codes. I’m using diabetes as one of the health conditions here, as it’s a critical issue, right? What’s the code for diabetes? It’s 250.
Where did I get this information? I pulled it from the ICD table. If you visit this table, which includes classification codes for diseases, you’ll see that 250 corresponds to diabetes.
The patient with ID 2 was diagnosed with diabetes in the second diagnosis. So, I retrieved the diabetes code, which is 250.
However, I added the caret symbol (^). Why did I do this? Because I’m creating a string that will be used as a regular expression to search within my DataFrame, and the caret anchors the pattern to the beginning of the code.
In fact, I’m using it below, take a look:
# Value 0 indicates that:
# 63.d Neither diabetes nor circulatory problems were detected in the patient
if (not bool(re.match(diabetes_disease_codes, str(np.array(row['diag_1'])))) and
    not bool(re.match(diabetes_disease_codes, str(np.array(row['diag_2'])))) and
    not bool(re.match(diabetes_disease_codes, str(np.array(row['diag_3'])))) and
    not bool(re.match(circulatory_disease_codes, str(np.array(row['diag_1'])))) and
    not bool(re.match(circulatory_disease_codes, str(np.array(row['diag_2'])))) and
    not bool(re.match(circulatory_disease_codes, str(np.array(row['diag_3']))))):
    value = 0
re is Python’s standard-library module for regular expressions, used to search text for patterns that meet defined criteria.
Here, I’ll use it to search for diabetes_disease_codes in diag_1, diag_2, and diag_3. This is a way to check whether the code in each of these columns starts with 250.
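To make the matching behaviour concrete, here is a minimal sketch that runs the same pattern string against a few sample codes. The sample values are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
import re

# Same pattern string as in the function: any code that starts with "250".
diabetes_disease_codes = "^[2][5][0]"

# Hypothetical sample codes, used only to illustrate the matching behaviour.
samples = ["250", "250.83", "1250", "425", "V25"]

# re.match only looks at the start of the string, so the caret is redundant
# here; it would matter if re.search were used instead.
matches = {code: bool(re.match(diabetes_disease_codes, code)) for code in samples}

for code, found in matches.items():
    print(f"{code!r:10} -> {found}")
```

Only the codes beginning with 250 match, including subcodes such as 250.83, while 1250 is rejected because the match is anchored at the first character.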
In addition to diabetes, I’ll also use circulatory_disease_codes for circulatory conditions.
To identify circulatory issues, I’ll create a pattern based on the ICD-9 code system. Specifically:
- Code pattern “39x”: where x ranges from 0 to 9.
- Code pattern “4zx”: where z ranges from 0 to 6 and x from 0 to 9.
Using this knowledge, I created a regular expression to target these ranges:
- I start with the caret (^), which specifies the beginning of the string, followed by 39 to capture any codes that start with “39” and end with any digit (0–9).
- I use the pipe (|) operator, meaning “or”, to expand the pattern to include codes beginning with “4” and followed by a digit from 0 to 6 and then 0 to 9.
By combining these patterns, we can filter for general circulatory issues without being too specific. This regular expression enables a flexible but targeted approach for our analysis.
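The combined pattern can be checked the same way. This is a small sketch with hypothetical sample codes. One caveat I would flag as an assumption to verify: as far as I know, ICD-9 lists circulatory diseases under 390 to 459, so the 46x codes this pattern also accepts are worth double-checking against the ICD table.

```python
import re

# Same pattern string as in the function: 390-399 or 400-469.
circulatory_disease_codes = "^[3][9][0-9]|^[4][0-6][0-9]"

# Hypothetical sample codes, used only to illustrate which ranges match.
samples = ["390", "398", "401", "428", "459", "465", "470", "389", "250", "39"]

matches = {code: bool(re.match(circulatory_disease_codes, code)) for code in samples}

for code, found in matches.items():
    print(f"{code!r:6} -> {found}")
```

Codes from 390 through 469 match; 389, 470, 250, and the two-character string "39" do not.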
Creating the Filter
I’ll apply these patterns as filters on diag_1, diag_2, and diag_3. The outcome of these checks is stored in a variable named value (defined earlier in #63.c), which serves as our return variable.
The value variable is initialized as 0 and later adjusted based on specific criteria.
Classification Values
We’ll establish four distinct categories for comorbidity:
- Value 0: No comorbidities detected.
- Value 1: Diabetes detected, no circulatory issues.
- Value 2: Circulatory issues detected, no diabetes.
- Value 3: Both diabetes and circulatory issues detected.
This new variable will consolidate information from diag_1, diag_2, and diag_3 into a single categorical feature with four levels based on these conditions, streamlining our data and enhancing its usability for downstream analysis.
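Since the four categories are fully determined by two yes/no questions, they can be written as a tiny truth table. The sketch below is only an illustration of the mapping; comorbidity_category is a hypothetical helper, not part of the notebook.

```python
def comorbidity_category(has_diabetes: bool, has_circulatory: bool) -> int:
    """Map two flags to the four categories listed above (0, 1, 2, 3)."""
    # Diabetes contributes 1 and circulatory issues contribute 2,
    # so the sum is 0 (neither), 1 (diabetes only), 2 (circulatory only), 3 (both).
    return int(has_diabetes) + 2 * int(has_circulatory)


# Enumerate every combination of the two flags.
table = {
    (d, c): comorbidity_category(d, c)
    for d in (False, True)
    for c in (False, True)
}

for (d, c), category in table.items():
    print(f"diabetes={d!s:5} circulatory={c!s:5} -> {category}")
```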
# Value 0 indicates that:
# 63.d Neither diabetes nor circulatory problems were detected in the patient
if (not bool(re.match(diabetes_disease_codes, str(np.array(row['diag_1'])))) and
    not bool(re.match(diabetes_disease_codes, str(np.array(row['diag_2'])))) and
    not bool(re.match(diabetes_disease_codes, str(np.array(row['diag_3'])))) and
    not bool(re.match(circulatory_disease_codes, str(np.array(row['diag_1'])))) and
    not bool(re.match(circulatory_disease_codes, str(np.array(row['diag_2'])))) and
    not bool(re.match(circulatory_disease_codes, str(np.array(row['diag_3']))))):
    value = 0
Let’s break down what’s happening in the code:
I’m using re, Python’s regular expressions package, to match specific patterns in each diagnosis column (diag_1, diag_2, and diag_3). Specifically, I’m checking whether each diagnosis contains a diabetes code or a circulatory issue code.
Here’s the process:
- Convert each diagnosis into a string format suitable for regular expression searches.
- Check each column (diag_1, diag_2, diag_3) for diabetes or circulatory codes using re.match.
- Convert these checks into Boolean values (True if a match is found, False if not).
- Negate the results to identify when no matches for either diabetes or circulatory issues exist in any of the three diagnoses.
The outcome:
- If no diabetes or circulatory codes are present across all three columns (diag_1, diag_2, diag_3), the value is set to 0.
By negating the Boolean checks, we classify cases where both diabetes and circulatory issues are absent as 0, marking this category as the baseline for patients without these comorbidities.
If re.match finds the code, the check returns True. But that’s not the goal here; we want cases without diabetes or circulatory codes. That’s why we negate the result.
Note how I’m also using not for circulatory issues. If every negated check is True (meaning neither diabetes nor circulatory issues are present in diag_1, diag_2, or diag_3), we set the value to 0.
For value 1, we capture cases where at least one diagnosis has diabetes but no circulatory problem. Here, I’ve removed the not for diabetes, while keeping it for circulatory codes to isolate diabetes-only cases.
So, if it finds a diabetes diagnosis and finds no circulatory problem, it will assign the value 1.
Value 2 indicates that at least one diagnosis of circulatory problems was detected and no diabetes diagnosis was found.
Here, I kept the not condition specifically for diabetes and removed it for circulatory problems. Notice the detail here: we’re using both AND and OR logic, following the rules we defined to assign the value.
Finally, if there is at least one diabetes diagnosis and at least one circulatory problem diagnosis detected simultaneously, we assign value 3.
Notice here that the OR operator applies across the diagnoses (diag_1, diag_2, and diag_3) within each condition, so the diabetes group is True if any one diagnosis carries a diabetes code, and likewise for the circulatory group. The two groups are then joined with AND, so value 3 requires both.
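When AND and OR are mixed in one condition like this, one detail is worth keeping in mind: Python evaluates and before or, so an OR group has to be wrapped in parentheses to behave as a group. Here is a minimal sketch with hypothetical flags for a single patient (diabetes in diag_1, a circulatory code in diag_2).

```python
# Hypothetical flags for one patient, used only to illustrate precedence.
d1, d2, d3 = True, False, False   # diabetes code found in diag_1 / diag_2 / diag_3?
c1, c2, c3 = False, True, False   # circulatory code found in diag_1 / diag_2 / diag_3?

# Without grouping, Python reads this as:
#   d1 or d2 or (d3 and not c1 and not c2 and not c3)
ungrouped = d1 or d2 or d3 and not c1 and not c2 and not c3

# With grouping: "any diabetes diagnosis" AND "no circulatory diagnosis".
grouped = (d1 or d2 or d3) and not c1 and not c2 and not c3

print("ungrouped:", ungrouped)
print("grouped:  ", grouped)
```

This patient has both conditions, so a diabetes-only test should be False. The grouped version gets that right, while the ungrouped one returns True as soon as d1 is True.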
With this setup, the calculate_comorbidity function consolidates information from diag_1, diag_2, and diag_3 into a new variable that reflects comorbidity status, an example of domain-based feature engineering. This function will classify the comorbidity status into four categories based on the rules we established.
Here, we’re focusing specifically on diabetes and circulatory issues to streamline the example. This approach, however, can easily be adapted to create variables for other comorbid conditions if needed.
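As a sketch of how that adaptation might look, the version below takes the two patterns as parameters and checks them with a small helper. The names has_code and two_condition_category, and the toy rows, are hypothetical; this is one possible shape under the same four-category rules, not the notebook’s own function.

```python
import re

import pandas as pd

DIAG_COLS = ["diag_1", "diag_2", "diag_3"]


def has_code(row, pattern, cols=DIAG_COLS):
    """Return True if any diagnosis column in the row starts with a matching code."""
    return any(re.match(pattern, str(row[c])) for c in cols)


def two_condition_category(row, pattern_a, pattern_b):
    """0 = neither, 1 = condition A only, 2 = condition B only, 3 = both."""
    return int(has_code(row, pattern_a)) + 2 * int(has_code(row, pattern_b))


# Toy data (hypothetical codes), standing in for the diagnosis DataFrame.
toy = pd.DataFrame({
    "diag_1": ["250.01", "428", "250", "486"],
    "diag_2": ["486", "401", "414", "599"],
    "diag_3": ["V45", "276", "401", "276"],
})

diabetes = "^[2][5][0]"
circulatory = "^[3][9][0-9]|^[4][0-6][0-9]"

# Extra positional arguments are passed to the function through `args`.
toy["comorbidity"] = toy.apply(
    two_condition_category, axis=1, args=(diabetes, circulatory)
)
print(toy)
```

Swapping in two other patterns would produce the same kind of four-level variable for a different pair of conditions.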
Now, create the function and proceed with the next instruction to apply it.
%%time
# 64. Applying the comorbidity function to the data
df['comorbidity'] = diagnosis.apply(calculate_comorbidity, axis=1)
# -> CPU times: user 6.72 s, sys: 4.43 ms, total: 6.73 s
# Wall time: 6.78 s
It takes a bit of time, doesn’t it, to process the entire dataset? Notice that I’m using diagnosis, which contains precisely the three variables: diag_1, diag_2, and diag_3. In this run, the step took just under seven seconds.
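If the row-by-row apply ever becomes a bottleneck, a column-wise version is one option to consider. The sketch below uses pandas string matching on toy data that stands in for the diagnosis DataFrame. It is an alternative I am suggesting, not what the notebook does, and I have not timed it on the real dataset.

```python
import pandas as pd

diabetes_disease_codes = "^[2][5][0]"
circulatory_disease_codes = "^[3][9][0-9]|^[4][0-6][0-9]"

# Toy data (hypothetical codes), standing in for the diagnosis DataFrame.
toy = pd.DataFrame({
    "diag_1": ["250.01", "428", "250", "486"],
    "diag_2": ["486", "401", "414", "599"],
    "diag_3": ["V45", "276", "401", "276"],
})

# Work on string versions of the codes, one whole column at a time.
codes = toy.astype(str)

# Series.str.match anchors at the start of each string, like re.match.
has_diabetes = codes.apply(
    lambda col: col.str.match(diabetes_disease_codes)
).any(axis=1)
has_circulatory = codes.apply(
    lambda col: col.str.match(circulatory_disease_codes)
).any(axis=1)

# 0 = neither, 1 = diabetes only, 2 = circulatory only, 3 = both.
comorbidity = has_diabetes.astype(int) + 2 * has_circulatory.astype(int)
print(comorbidity.tolist())
```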
Let’s now check the shape of the dataset, and then take a look at the data itself.
# 65. Shape
df.shape
# (98052, 43)
# 66. Viewing the data
df.head()
Take a look at what we’ve accomplished here. The comorbidity variable is now added at the very end of our dataset.
Now, we have a new variable that identifies whether a patient has diabetes, circulatory issues, both, or neither among the three diagnoses.
This goes beyond technical work — it’s almost an art. We’ve uncovered hidden insights and created a valuable new variable.
This allows us to perform further analyses, which we’ll explore shortly. Let’s check the unique values in this variable.
# 67. Unique values in 'comorbidity'
df['comorbidity'].unique()
# > array([1, 3, 2, 0])
As you can see, we have exactly the four categories we defined in the function: 0, 1, 2, and 3.
Now, let’s check the count and frequency of each category.
# 68. Unique value counts in 'comorbidity'
df['comorbidity'].value_counts()
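To see relative frequencies next to the raw counts, value_counts accepts normalize=True. A minimal sketch on a toy Series (the values are made up for illustration and are not the dataset’s distribution):

```python
import pandas as pd

# Toy comorbidity values, hypothetical and only for illustration.
toy = pd.Series([2, 2, 1, 0, 2, 3, 1, 2], name="comorbidity")

# Absolute counts per category.
counts = toy.value_counts()

# Relative frequencies, expressed as percentages.
shares = toy.value_counts(normalize=True).mul(100).round(1)

# Side-by-side view; both Series are indexed by category.
summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```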
So, we observe that the highest frequency is for category 2, while the lowest is for category 3.
Let’s take a closer look at what category 2 represents.
# Value 2 indicates that:
# 63.f At least one diagnosis of circulatory problems and NO diabetes
# were detected in the patient
At least one circulatory problem diagnosis was detected, with no diabetes code in any of the three diagnosis columns. This is the most frequent category, indicating that many patients have a circulatory issue recorded without a diabetes code among their three listed diagnoses.
This raises some important questions:
- Do these patients require a different treatment approach?
- Does this condition influence their hospital readmission rates?
These findings open up numerous avenues for further analysis. Now, let’s identify the category with the fewest entries — Category 3.
# Value 3 indicates that:
# 63.g At least one diagnosis of diabetes and at least one diagnosis of
# circulatory problems were detected simultaneously in the patient
A simultaneous diagnosis of diabetes and circulatory issues is the least frequent category, while Category 2 is the most common.
This analysis goes beyond the obvious, unlocking deeper insights through feature engineering that others might overlook.
These comorbidity insights weren’t created — they were simply hidden within the data. By combining existing columns, we generated a variable that answers questions not yet asked. This process takes time and experience and can elevate your data analysis.
To wrap up, let’s create a chart. But first, let’s delete the original columns, diag_1, diag_2, and diag_3, as we’ve consolidated them into the comorbidity variable. While other diseases might be present, our focus here is strictly on diabetes and circulatory issues.
# 69. Dropping individual diagnosis variables
df.drop(['diag_1', 'diag_2', 'diag_3'], axis=1, inplace=True)
Delete those columns now, and then let’s proceed by creating a cross-tabulation between comorbidity and readmission status.
# 70. Calculating the percentage of comorbidity by type and target variable class
percent_com = pd.crosstab(df['comorbidity'], df['readmitted'], normalize='index') * 100
Remember this variable? Now, I’ll calculate the percentage and display it for you.
Zero (0) indicates no readmission, while one (1) indicates readmission. Because the table is normalized by row, each percentage is a share within a comorbidity category. Among patients with neither diabetes nor circulatory issues (category 0), about 44% were readmitted.
Category 2, circulatory issues without a diabetes code, shows the highest readmission rate at 48%. This points to an association rather than proof of causation: patients with a circulatory diagnosis are readmitted more often than patients with neither condition.
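To make the row normalization concrete, here is a small sketch of pd.crosstab with normalize='index' on toy data. The values are hypothetical and only show how each row sums to 100.

```python
import pandas as pd

# Toy data: four patients in category 0 and four in category 2 (hypothetical).
toy = pd.DataFrame({
    "comorbidity": [0, 0, 0, 0, 2, 2, 2, 2],
    "readmitted":  [0, 0, 0, 1, 0, 0, 1, 1],
})

# normalize='index' divides each row by its own total, so every percentage
# reads as "share of this comorbidity category".
percent = pd.crosstab(toy["comorbidity"], toy["readmitted"], normalize="index") * 100
print(percent)
```

In this toy table, 25% of category 0 and 50% of category 2 were readmitted. The numbers describe each category separately and say nothing about what share of all readmitted patients belongs to a category.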
These findings, uncovered through feature engineering, demonstrate how hidden information can guide operational strategies. Let’s proceed with visualizing these insights.
# 71. Plot
# Prepare the figure from the data (DataFrame.plot returns a matplotlib Axes)
ax = percent_com.plot(kind='bar',
                      figsize=(16, 8),
                      width=0.5,
                      edgecolor='g',
                      color=['b', 'r'])
# Label each bar with its height
for i in ax.patches:
    ax.text(i.get_x() + 0.00,
            i.get_height() + 0.3,
            str(round((i.get_height()), 2)),
            fontsize=15,
            color='black',
            rotation=0)
# Title and display
plt.title("Comorbidity vs Readmissions", fontsize=15)
plt.show()
I’ll create the plot using the comorbidity percentages we’ve calculated.
I’ll set up a bar chart with parameters and formatting, adding a title and value labels for clarity, and ensuring each group is distinct and easy to interpret.
The X-axis displays comorbidity levels (0, 1, 2, and 3).
Blue bars represent patients not readmitted, while red bars indicate those readmitted, allowing a clear visual comparison across each comorbidity level.
- The largest blue bar, corresponding to index 0 (patients with neither diabetes nor circulatory issues among their diagnoses), shows that about 55% of these patients were not readmitted, the lowest readmission rate of the four groups.
- The red bar at index 2 represents patients with a circulatory problem and no diabetes code. This group shows a notably higher readmission rate, consistent with the expectation that these patients are at greater risk of requiring further medical care.
This graph reflects more than a simple visualization; it encapsulates critical steps:
- Understanding the domain-specific problem.
- Defining criteria for comorbidity.
- Applying feature engineering to transform raw data into actionable insights.
- Using Python for automated data processing.
The underlying question, likely unconsidered without these steps, is: Is a patient’s diabetes and circulatory profile associated with readmission rates? The data suggests it is, with readmission ranging from about 44% in category 0 to 48% in category 2.
This insight enables healthcare providers to better support high-risk patients and potentially lower readmissions — a testament to how data analysis can turn hidden insights into concrete, actionable strategies, rooted in data-driven evidence rather than speculation.
Have we completed the feature engineering work? Not quite. There’s one more aspect of the data that I haven’t yet shown you.
# 72. Viewing the data
df.head()
Let’s take a look at the columns to see how the dataset is organized after our feature engineering efforts.
# 73. Viewing column names
df.columns
The dataset originally included 23 medication columns, each recording whether the drug was prescribed and whether its dosage changed during the patient’s hospitalization. This prompts the question: Does a medication change impact the likelihood of readmission?
Consider two scenarios:
- No change in medication, the patient recovers, and returns home.
- A significant dosage adjustment occurs, potentially causing side effects and leading to a return to the hospital.
To analyze this, rather than plotting all 23 variables (which may have similar behaviors), we’ll chart 4 selected medications to highlight specific trends.
# 74. Plot
fig = plt.figure(figsize=(20, 15))
# One subplot per medication; each bar chart is drawn on its own Axes
ax1 = fig.add_subplot(221)
df.groupby('miglitol').size().plot(kind='bar', color='green', ax=ax1)
ax1.set_xlabel('miglitol', fontsize=15)
ax1.set_ylabel('Count', fontsize=15)
ax2 = fig.add_subplot(222)
df.groupby('nateglinide').size().plot(kind='bar', color='magenta', ax=ax2)
ax2.set_xlabel('nateglinide', fontsize=15)
ax2.set_ylabel('Count', fontsize=15)
ax3 = fig.add_subplot(223)
df.groupby('acarbose').size().plot(kind='bar', color='black', ax=ax3)
ax3.set_xlabel('acarbose', fontsize=15)
ax3.set_ylabel('Count', fontsize=15)
ax4 = fig.add_subplot(224)
df.groupby('insulin').size().plot(kind='bar', color='cyan', ax=ax4)
ax4.set_xlabel('insulin', fontsize=15)
ax4.set_ylabel('Count', fontsize=15)
plt.show()
I created 4 plots for 4 variables, each representing a different medication. Below, you’ll find the results visualized across 4 distinct charts.
Consider the first medication in the chart. Do we know its specifics? No, and for our purposes, we don’t need to. All we need is to understand the four possible categories:
- No: the medication was not prescribed
- Down: reduction in dosage
- Steady: no modification in dosage
- Up: increase in dosage
This is sufficient for our analysis. Deep domain knowledge isn’t required here; the focus is on identifying these categories.
Now, let’s interpret the chart: for one medication, most entries are labeled No, meaning the medication was not prescribed. A thin pink line stands out, indicating cases with steady dosage.
In some cases, the medication remained steady, which could be notable, especially for certain patients.
However, for most patients, the drug was not prescribed, so there was no dosage modification.
Now, observe the light blue chart — the distribution here is more varied, indicating a broader range of dosage adjustments.
Some patients had a reduction in dosage, others were not prescribed the drug, some remained steady, and a few experienced an increase. This is our current view of medication variables.
Now, do we need feature engineering here? Instead of displaying all four categories, we could simplify by creating a binary variable: Did the medication change or not? This would streamline analysis by recoding categories into binary information.
This recoding allows us to look at these variables differently, extracting hidden insights. By counting total medication modifications per patient, we can create a new attribute that may reveal correlations with the frequency of changes.
Another attribute could track the total number of medications a patient consumed, which we can analyze against readmission rates.
Let’s implement this strategy.
# 75. List of medication variable names (3 variables were previously removed)
medications = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide',
'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol',
'troglitazone', 'tolazamide', 'insulin', 'glyburide-metformin', 'glipizide-metformin',
'glimepiride-pioglitazone', 'metformin-pioglitazone']
First, let’s create a Python list containing the column names that represent the medications. The original dataset had 23 medication variables, but three were removed in previous steps due to identified issues, so 20 remain in our analysis.
With the list created, let’s proceed to iterate over it in a loop to implement the next steps.
# 76. Loop to adjust the value of medication variables
for col in medications:
    if col in df.columns:
        colname = str(col) + 'temp'
        df[colname] = df[col].apply(lambda x: 0 if (x == 'No' or x == 'Steady') else 1)
For each column in the medications list, I’ll locate it in the DataFrame, create a new column whose name carries a temp suffix, and apply a lambda function:
- If x is “No” or “Steady”, return 0.
- Otherwise, return 1.
This recodes the variable from four categories to just two (0 or 1), simplifying our interpretation.
Check that the temp variables are now present, right at the end of the dataset.
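For a single column, the same recoding can also be written with isin. This is a hedged sketch on a toy Series, assuming the four values No, Steady, Up, and Down are the only ones present; the two versions differ for anything else (the lambda would return 1, isin would return 0).

```python
import pandas as pd

# Toy medication column, hypothetical values for illustration.
toy = pd.Series(["No", "Steady", "Up", "Down", "No"], name="insulin")

# The lambda from the loop: 0 for "No"/"Steady", 1 for everything else.
changed_lambda = toy.apply(lambda x: 0 if (x == "No" or x == "Steady") else 1)

# Equivalent for these four values: 1 only when the dosage went up or down.
changed_isin = toy.isin(["Up", "Down"]).astype(int)

print(changed_lambda.tolist())
print(changed_isin.tolist())
```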
Now, I’ll create a new variable to store the number of medication dosage changes.
# 78. Creating a variable to store the count per patient
df['num_med_dosage_changes'] = 0
I’ll create the variable and initialize it with 0. Then, I’ll run another loop to update it.
# 79. Counting medication dosage changes
for col in medications:
    if col in df.columns:
        colname = str(col) + 'temp'
        df['num_med_dosage_changes'] = df['num_med_dosage_changes'] + df[colname]
        del df[colname]
For each column in the medications list, I check that it exists in the DataFrame, rebuild the name of its temporary column (the one with the temp suffix), then:
- Add the value in df[colname] to df['num_med_dosage_changes'] to count dosage changes per patient.
- Delete the temporary column to keep the DataFrame clean.
Finally, using value_counts on df['num_med_dosage_changes'] reveals dosage adjustment frequency across patients, offering insight into treatment patterns.
# 80. Checking the total count of medication dosage changes
df.num_med_dosage_changes.value_counts()
The distribution of dosage changes is as follows:
- 0 changes: 71,309
- 1 change: 25,350
- 2 changes: 1,281
- 3 changes: 107
- 4 changes: 5
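The same per-patient count can also be sketched without temporary columns, by testing all medication columns at once and summing across each row. The toy frame and the shortened medication list below are hypothetical; this is an alternative formulation, not the notebook’s code.

```python
import pandas as pd

# Toy data with three medication columns (hypothetical values).
toy = pd.DataFrame({
    "metformin": ["No", "Up", "Steady"],
    "insulin":   ["Down", "Up", "No"],
    "glipizide": ["No", "Steady", "No"],
})

# Keep only the listed medications that actually exist in the frame,
# mirroring the `if col in df.columns` guard in the loop.
candidates = ["metformin", "insulin", "glipizide", "acarbose"]
meds = [c for c in candidates if c in toy.columns]

# True where the dosage went up or down; summing across columns counts them.
num_changes = toy[meds].isin(["Up", "Down"]).sum(axis=1)
print(num_changes.tolist())
```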
Now, let’s check the dataset head to confirm the new variable has been accurately incorporated.
# 81. Viewing the data
df.head()
Run the command, scroll to the end, and there it is — the new variable has been successfully added at the end of the dataset.
Now I know the exact count of medication dosage changes for each patient. For instance, the first patient had one change, the second had none, the third had one, and so on.
Next, we’ll adjust the medication columns to reflect whether each medication is being administered to a patient. This is an additional modification to simplify the dataset.
As you’ve observed, the feature engineering strategy here mainly involves using loops. We start with the first loop:
# 76. Loop to adjust the value of medication variables
for col in medications:
    if col in df.columns:
        colname = str(col) + 'temp'
        df[colname] = df[col].apply(lambda x: 0 if (x == 'No' or x == 'Steady') else 1)
Then the second loop:
# 79. Counting medication dosage changes
for col in medications:
    if col in df.columns:
        colname = str(col) + 'temp'
        df['num_med_dosage_changes'] = df['num_med_dosage_changes'] + df[colname]
        del df[colname]
The strategy here is technical, but the real challenge is abstracting the data: understanding what each variable represents and viewing it from a new angle.
This abstraction allows us to extract new features through feature engineering. It’s not a simple task — it requires experience to “see” invisible insights.
Once you grasp this concept, the programming becomes straightforward. Now, let’s move on to modify the medication columns.
# 82. Recoding medication columns
for col in medications:
    if col in df.columns:
        df[col] = df[col].replace('No', 0)
        df[col] = df[col].replace('Steady', 1)
        df[col] = df[col].replace('Up', 1)
        df[col] = df[col].replace('Down', 1)
I will loop through the medications list once again, iterating over each column. I’ll replace No with zero (indicating the medication was not administered), while Steady, Up, and Down become one, indicating that the medication was administered regardless of any dosage change. This converts each column into zero and one, effectively recoding the variable.
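The four replace calls can be pictured as a single lookup table. Here is a minimal sketch using a dictionary with Series.map on toy data; the name administered and the toy values are hypothetical. One difference to keep in mind: map turns any value missing from the dictionary into NaN, whereas replace leaves it unchanged.

```python
import pandas as pd

# Lookup table: 0 = not administered, 1 = administered (any dosage status).
administered = {"No": 0, "Steady": 1, "Up": 1, "Down": 1}

# Toy data with two medication columns (hypothetical values).
toy = pd.DataFrame({
    "metformin": ["No", "Steady", "Up"],
    "insulin":   ["Down", "No", "Up"],
})

# Apply the lookup to every column.
recoded = toy.apply(lambda col: col.map(administered))
print(recoded)
```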
After this, we’ll create a new column to reflect how many medications are being administered to each patient.
# 83. Variable to store the count of medications per patient
df['num_med'] = 0
And then, we populate the new variable.
# 84. Populating the new variable
for col in medications:
    if col in df.columns:
        df['num_med'] = df['num_med'] + df[col]
Let’s take a look at the value_counts.
# 85. Checking the total count of medications
df['num_med'].value_counts()
One medication was administered to most patients (45,447 cases), with 22,702 receiving none, 21,056 receiving two, and 7,485 receiving three.
Only five patients required six medications. After creating these new columns, the original medication columns are no longer needed, as they’ve served their purpose for insight generation. We can now discard them.
# 86. Removing the medication columns
df = df.drop(columns=medications)
Just like I did with the comorbidity variable, where I used the diag_1, diag_2, and diag_3 columns to create a new variable, I no longer need the original columns.
So, I simply dropped them. I’m doing the same thing here now. Take a look at the shape.
# 87. Shape
df.shape
# (98052, 22)
We now have 22 columns. Here is the head of the dataset.
# 88. Viewing the data
df.head()
Our dataset is getting better and better.
Simpler each time. More compact each time. That makes our analysis work easier.
Let’s take a look at the dtypes.
# 89. Variables and their data types
df.dtypes