Ethical Data Practices for Building Better Systems


Introduction

As engineers, our work entails more than just building reliable and efficient software systems. Every decision we make in designing data systems carries consequences for individuals, communities, and society as a whole. From predictive analytics to large-scale surveillance, technologies can amplify inequalities or create unintended harm if not carefully designed. This subchapter serves as a call to action for engineers to reflect on these impacts, emphasizing the ethical responsibility to “do the right thing.”


Predictive Analytics: Power and Pitfalls

Predictive analytics turns historical data into decisions about the future. While it offers transformative benefits, such as predicting the spread of diseases or assisting in disaster response, it also poses risks when applied to areas such as:

  1. Loan Decisions or Hiring: Algorithms trained on biased data may reinforce existing inequalities, disproportionately denying loans or jobs to certain groups.
  2. Algorithmic Exclusion: Mislabeling someone as “high risk” (e.g., for loans, travel, or insurance) can lead to systemic discrimination and exclusion. This phenomenon—referred to as the “algorithmic prison”—restricts opportunities without offering paths to challenge or appeal.

While data-driven insights hold the promise of objectivity, they risk amplifying the biases of the input data. As engineers, our job is to ensure models remain transparent, accountable, and free of discriminatory behavior.


Bias and Discrimination in Automated Systems

Algorithms are a reflection of the data they feed on. If that data embeds systemic biases:

  • Predictive models will learn and amplify these patterns.
  • Traits correlated with protected categories (e.g., race, gender) can indirectly perpetuate discrimination.

For instance:

  • Targeted Ads and Discrimination: Factors like a user’s postal code, often correlated with race or socioeconomic status, can lead to biased ad targeting—even unintentionally.
  • Machine Learning Myths: Some believe that algorithms are inherently fair, but their opacity often obscures discriminatory outputs, a problem humorously referred to as “machine learning as laundering for bias.”

Solution: Proactively audit both training data and model outputs for fairness and enforce anti-discrimination compliance in systems making impactful decisions.


Surveillance vs. Privacy

Modern technologies—social media platforms, IoT devices, and search engines—gather vast datasets. These are often repurposed to create behavioral models, marketing profiles, or even intrusive surveillance measures.

The ethical challenge is ensuring that data collection respects users’ privacy. For example:

  • Consent Issues: Users often lack clear knowledge about what data they provide and how it will be used, undermining the concept of informed consent.
  • Data as Exploitation: Behavioral data has been described not as a mere byproduct of an interaction but as the “core asset” of modern platforms. Many policies disguise this extraction within dense terms of service agreements, granting companies sweeping rights to exploit user data.

Feedback Loops: Harmful Paternalism

Even seemingly benign systems like recommendation engines warrant closer scrutiny:

  • Echo Chambers: Algorithms prioritizing user affinity create polarized environments where misinformation spreads unchecked.
  • Poverty Loops: For example, employer reliance on credit scores can create vicious cycles—an applicant with a low score is denied jobs, worsening their financial status and damaging their score further.

Addressing these issues demands “systems thinking,” a holistic understanding of how each component of a system impacts users and society.


Key Principles for Ethical Data Engineering

  1. Transparency and Accountability
    • Ensure that users understand how decisions are made about their data. Systems relying on opaque, hard-to-audit processes render recourse impractical.
  2. Minimizing Harmful Consequences
    • Build features that assume the possibility of failure and plan for recovery. Systems should gracefully adapt to errors rather than propagating them downstream.
  3. Secure Stewardship of Data
    • Treat data as a liability, not merely an asset. Engineers must adopt stronger safeguards against breaches, unauthorized access, and misuse.
  4. Fair Practices
    • Continuously audit algorithms and dataflows to root out entrenched biases. Adopt designs that enable inclusivity by default.

Conclusion

Software systems wield immense power to shape society—not always for the better. As stewards of these technologies, engineers must rise above technical challenges to consider the societal and human consequences of our work. By embedding fairness, transparency, and ethical considerations into every stage, we can focus on building systems that foster equality and respect dignity. The path forward for the information age shouldn’t mirror the exploitation of the industrial era—it must be built on a foundation of responsibility and humanity.

Series Designing Data-Intensive Applications Part 41 of 41
  1. Designing Reliable Data Systems
  2. What is Scalability in Data Systems?
  3. Building Maintainable Software Systems
  4. Relational Model Versus Document Model
  5. Speaking the Language of Data- A Guide to Query Languages
  6. Unraveling Connections- Exploring Graph-Like Data Models
  7. The Backbone of Databases- Data Structures that Power Storage
  8. Transaction Processing vs. Analytics Let's understand the divide
  9. Understanding Column-Oriented Storage- A Deep Dive into Analytics Optimization
  10. Formats for Encoding Data
  11. Modes of Dataflow in Distributed Systems
  12. Leaders and Followers - The Core of Replication
  13. Problems with Replication Lag - Challenges and Solutions
  14. Multi-Leader Replication in Distributed Databases
  15. Leaderless Replication Flexibility for Distributed Databases
  16. Partitioning and Replication in Scaling Distributed Databases
  17. Partitioning of Key-Value Data- Strategies and Challenges
  18. Partitioning and Secondary Indexes- Balancing Efficiency and Complexity
  19. Efficient Methods for Rebalancing Data in Distributed Systems
  20. Ensuring Accurate Request Routing in Distributed Databases
  21. The Slippery Concept of a Transaction
  22. Exploring Weak Isolation Levels in Databases
  23. Achieving Serializability in Transactions
  24. Faults and Partial Failures in Distributed Systems
  25. Navigating Unreliable Networks in Distributed Systems
  26. The Challenges of Unreliable Clocks in Distributed Systems
  27. Knowledge Truth and Lies in Distributed Systems
  28. Consistency Guarantees in Distributed Systems
  29. Linearizability in Distributed Systems
  30. Understanding Ordering Guarantees in Distributed Systems
  31. Achieving Reliability with Distributed Transactions and Consensus Mechanisms
  32. Leveraging Unix Tools for Efficient Batch Processing
  33. MapReduce and Distributed Filesystems- Foundations of Scalable Data Processing
  34. Advancing Beyond MapReduce- Modern Frameworks for Scalable Data Processing
  35. Enabling Reliable and Scalable Event Streams in Distributed Systems
  36. Synchronizing Databases with Real-Time Streams
  37. Unifying Batch and Stream Processing for Modern Pipelines
  38. Integrating Distributed Systems for Unified Data Pipelines
  39. Unbundling Monolithic Databases for Flexibility
  40. Building Correct Systems in Distributed Environments
  41. Ethical Data Practices for Building Better Systems

Want to get blog posts over email?

Enter your email address and get notified when there's a new post!